TrueNAS Core falsely listing HDD as FAULTED

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Hello,

I have TrueNAS-13.0-U6.1 with a couple of WDC WD60EZAZ-00SF3B0 drives (SMR).

Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
16 GB RAM
500 GB SSD as system drive
StarTech 2P6G-PCIE-SATA-CARD

And now it's the second time I get the same HDD listed as FAULTED.

The first time, I tried rebooting to clear the error cache and so forth; nothing worked.

I finally put the disk on a SATA-USB adapter on my normal machine to check SMART in CrystalDiskInfo, and CrystalDiskInfo shows a good health status.

I then put the disk back into the system and everything showed green again, but now it's listed as FAULTED again... I'm not a NAS pro, so what the hell?

I read something about SMR and write stalls being falsely interpreted as a faulted disk; can that be the problem?

It's so weird, thousands of hours fine and now this...

[attached screenshots]

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Tl;dr = SMR drives don't work well with ZFS.

One major problem with SMR disks is fragmentation. Even if the drive worked fine for years, after a while the data gets so fragmented that the drive may time out during a write. ZFS may then consider the write failed, and log it as such.

And yes, SMR drives DO get fragmented!
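
If you want a quick look at how fragmented a pool's free space already is, ZFS tracks that per pool. A minimal check from the shell, assuming a pool named "tank" (substitute your own pool name):

zpool list -o name,size,allocated,free,fragmentation,capacity tank

Note that the FRAGMENTATION column measures fragmentation of the remaining free space, not of the data already written, but a high value there is exactly the situation where an SMR drive starts to struggle.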
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's so weird, thousands of hours fine and now this...

It's really not weird. ZFS has an easy time laying down data contiguously when there's lots of free space. Assuming your filer is getting fuller as you use it, it becomes increasingly difficult to find contiguous free space, and ZFS instead has to use smaller blocks of free space (a result of fragmentation), which shingled drives are terrible at. Thus you get disk errors and life starts to suck.
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Oof, funny... that aligns perfectly, because it all happened shortly after I deleted a bunch of data (videos) and went from 70+% usage down to 24%... but how can I fix it?

Is there a defrag command or a working reset command?

Since replacing all drives with, uh, CMR isn't really feasible :x
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
That is one of the serious problems with drive-managed SMR (aka DM-SMR): there are very limited actions that an outside user (aka the OS or ZFS) can take to fix drive fragmentation.

Another problem with SMR disks is replacement. Even if the replacement disk is CMR, with the other (source) disks being SMR, there can be problems reading all that fragmented data. Then, potentially, a failed replacement even with a good disk.

Yes, I know replacing with CMR is not feasible for many people. We try to catch as many people as we can who spec out such SMR disks, and point them to CMR disks instead.


There may be ways to "de-fragment" a DM-SMR disk, but I don't know of one. And with only 1 disk of redundancy (your RAID-Z1 pool...), I am not sure I would risk it.

The only thing I can think of, and it REQUIRES you to remove the disk from the ZFS pool, is a secure erase. This is done via SATA commands, not externally. Once the disk is securely erased, it is possible that all the "shingles" on the SMR drive are cleared from the SMR shingle directory. Then you may be able to re-silver (aka re-sync) it back into your pool.

Might be an interesting experiment. But not without backups, a good test plan, and a huge pile of luck.
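
Purely as a rough sketch of what that experiment might look like (the pool name "tank" is a placeholder, and TrueNAS Core normally addresses pool members by gptid label, so substitute the real gptid for "gptid/xxxx"):

zpool offline tank gptid/xxxx   # take the member disk out of service first
# ... run the secure erase against the raw device here (see further down the thread) ...
zpool online tank gptid/xxxx    # or 'zpool replace tank gptid/xxxx adaXp2' if the erase wiped the ZFS labels
zpool status tank               # watch the resilver

And again, only with a working backup in hand.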
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Ok, thank you. And yes, I already noticed that if I want to replace the one SMR disk with CMR, I'd have to replace ALL disks from SMR to CMR, and that's costly, so annoying.

But since I wanted to expand my storage anyway (GoPro is so storage hungry) and already had to learn that it's not possible to just add another disk to the pool, I will copy everything that's left to an external drive, kill the array, wipe it, and create a new one... damn it, having a NAS is really much more difficult than I thought when I took the old PC and figured "hey, just throw some disks in it and use TrueNAS, it will be fun".

One last question: is this fragmentation issue specific to the ZFS filesystem and TrueNAS?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Ok, thank you. And yes, I already noticed that if I want to replace the one SMR disk with CMR, I'd have to replace ALL disks from SMR to CMR, and that's costly, so annoying.
...
Well, you can wait on replacing all the other SMR disks. It is just that with ZFS, you never want the replacement disk to be SMR. That is where even more problems can arise.

...
But since I wanted to expand my storage anyway (GoPro is so storage hungry) and already had to learn that it's not possible to just add another disk to the pool, I will copy everything that's left to an external drive, kill the array, wipe it, and create a new one... damn it, having a NAS is really much more difficult than I thought when I took the old PC and figured "hey, just throw some disks in it and use TrueNAS, it will be fun".
...
TrueNAS Core and SCALE are not the easiest SOHO NAS around. But, short of building a NAS yourself from scratch, it is a good, solid package that does have some hardware limitations.

...
One last question: is this fragmentation issue specific to the ZFS filesystem and TrueNAS?
Mostly it is a ZFS issue. And TrueNAS Core or SCALE only use ZFS.

The way ZFS does its writes is to bundle them up into transactions, like so:
  1. Write data to free space
  2. Write the directory entry update to free space
  3. Write the top level metadata to the oldest slot, thus activating the above
This COW (Copy On Write) behavior performs more writes than other file systems do. And in some cases, lots of little writes.
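
If you are curious what that write pattern looks like on your own pool, zpool iostat can show it live. A small example, assuming a pool named "tank":

zpool iostat -v tank 5   # per-disk read/write operations, refreshed every 5 seconds
zpool iostat -r tank 5   # request-size histograms (recent OpenZFS), showing how many of those writes are small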
 
Last edited:

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
I see, so I really only have 3 options there:

1. replace all SMR drives with CMR
2. live with the ZFS issue and periodically have to wipe the faulted drive (will give it a try and report back here whether it worked next week or so, when my 5th drive arrives and I rebuild the thing anyway, so I can try it out before building the new pool)
3. switch away from ZFS to a different solution

Thank you very much so far :)
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sorry we could not give a better answer... but that is the truth as far as we know it now.
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Nah, it was very helpful for understanding the issue. Just one last thing about the SATA wipe you mentioned: is that something else than choosing "wipe disk" from the GUI?

A specific CLI command I have to use in TrueNAS?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Yes, the Secure Erase is something totally different and has nothing to do with TrueNAS. And yes, it would be from the command line.

But, that is all I know. I don't even know how to run it. Or if it will temporarily solve the SMR disk fragmentation issue.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
A specific CLI command I have to use in TrueNAS?

I have no idea if an ATA SECURE ERASE on a WD Red SMR will simply reset the SMR translation table (which would be fast) or if the drive will actually respond with "I need to write actual zeroes to every single LBA" (which could take hours), so I wouldn't recommend running it from TrueNAS itself unless you're doing it in a tmux session that won't time out.

If you choose to do so, here be dragons:

camcontrol security adaX -s erase -e erase

This line should trigger a SECURE ERASE on the ATA device adaX, but note that, as mentioned earlier in this thread, SMR drives are in general bad for use with ZFS, and the WD Red SMR line specifically has a firmware-level issue that can cause sector IDNF (ID Not Found) errors.
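
For completeness, a detached session for that would look roughly like this (the session name "erase" is arbitrary, and adaX must be replaced with the correct device):

tmux new -s erase                            # start a session that survives a dropped SSH connection
camcontrol security adaX -s erase -e erase   # the command above; triple-check the device name first
# detach with Ctrl-b then d; re-attach later with: tmux attach -t erase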
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Hey, it's me again.

I sold all the SMR drives, now have five WD60EFPX-68C5ZN0 (WD Red CMR) drives in the system, and recreated the pool.
I created a weekly short SMART test and a bi-monthly long SMART test.

Today I logged into TrueNAS and had a faulted drive. I could not bring the HDD online again, so I shut down the system, just re-plugged the SATA connectors, and now it's all online again and resilvering.

Sadly I dismissed the alerts, but is it known for drives to be wrongly flagged as faulted, or is my SMART scheduling too aggressive?

Also, I just got the message below.

The device shows read 3, write 498, checksum 9 errors... I don't get it, the drives are not even a month old.

One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk WDC WD60EFPX-68C5ZN0 WD-WX32D633V2JL is FAULTED

 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
StarTech 2P6G-PCIE-SATA-CARD

What chipset is on this card? This is a terrible way to attach disks; see my resource


I think it's probably an ASMedia controller of some sort, probably 106x family, which usually works okay, but there are also knockoff versions out there. Us forum old-timers much prefer a good quality HBA rather than dodgy made-for-cheap-Windows adapters.

I created a weekly short SMART test and a bi-monthly long SMART test.
is my SMART scheduling too aggressive?

Nah, we do quad-hourly shorts and thrice-weekly longs:

LONG (SMART Long): 22 * 1,2,3,4,5,6,7,8,9,10,11,12 tue,thu,sun
SHORT (SMART Short): 03,07,11,15,19,23 * 1,2,3,4,5,6,7,8,9,10,11,12 mon,tue,wed,thu,fri,sat,sun

as has been standard company policy for many years.
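
If you ever want to run a test by hand instead of waiting for the schedule, smartctl can trigger one directly (adaX is a placeholder for the disk to test):

smartctl -t short /dev/adaX   # short self-test, typically a couple of minutes
smartctl -t long /dev/adaX    # long (extended) self-test, can take many hours on a 6 TB disk
smartctl -a /dev/adaX         # review the results and the drive's error log afterwards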
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Yes, I use a StarTech 2P6G-PCIE-SATA-CARD for 2 drives, and from the datasheet I see it uses the ASMedia ASM1061 chipset. Also, it's an ASRock board with a 6th-gen Intel i5 CPU.
 
Joined
Jun 15, 2022
Messages
674
StarTech and ZFS do not play well together under "heavy load," and what that means varies. The big issues I'm aware of are:
  1. Undersized heatsinks. ZFS can stress a controller more than a typical home NAS workload would, exposing the weakness of small heatsinks, more so in a "quiet" home NAS with low airflow.
  2. Not-great firmware. Lots of things can go wrong in a computer, and lots more when ZFS puts it under heavy load; if the firmware has any flaws and/or doesn't handle situations well when things go wrong, it's a recipe for catastrophic failure. TrueNAS is an industrial solution meant to be run on industrial-grade hardware for industrial-level reliability; underpinning your system with a weak foundation invites system collapse.
  3. Not-great hardware. If any transistor or memory cell on a drive controller flakes out, your game of DOOM is probably fine, but that's not a great strategy for data you want flawless retention of.
  4. Not-great cabling. People who install cheap drive controllers install cheap cabling, and cheap cabling typically has a.) much higher failure rates, and b.) much less shielding, which allows cross-talk to potentially corrupt data en route. If you use cheap cables, treat them like they're extremely fragile, and don't use long runs (don't exceed 2/3 of the specification's maximum cable length).
On a related note, if you're not using ECC memory, you might want to.

  • Ubuntu Server is great on low-end hardware and will do a super job of data retention for a home user. RAID-5 or RAID-6 with a USB-disk backup plan works smashingly well.
  • Ubuntu Server with ZFS will do a better job, though it should be used on better hardware.
  • TrueNAS does an incredible job when used on hardware that can handle how hard it's going to be hammered in the all-out battle TrueNAS wages to protect your data.
 
Joined
Jun 15, 2022
Messages
674
I see I entered a rabbit hole :eek:
Yes. Up-side: Rabbits.

There are a lot of resources to help things go smoothly; a bunch are listed in my signature (click it!).
 

ForMyDemons

Dabbler
Joined
Apr 29, 2022
Messages
21
Ya, I guess I will swap the StarTech adapter out for an LSI HBA later this year, and the old ASRock 6th-gen board with its CPU when I upgrade my own machine, so it will receive an MSI Mortar Z390M board with an i9 9900K... then I just need to undervolt the i9 for my 5 HDDs... hopefully everything runs fine after that.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
For the safety of your data, swap that StarTech card for a SAS HBA right now.
Why the "need" to undervolt? Anything which could affect stability, and any deviation from defaults, is a potential threat.
 