SMART Test Failed - How to see any details?

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
That is Hard Disk Sentinel, a Windows program. The drive failed the initial SMART test I made the thread about, then passed two back-to-back Long tests. It failed another Short test this morning, so I yanked and replaced the drive; the array is currently resilvering.

I put the drive in my NVR server, which runs Windows and has a few free bays, and got this information. It looks like the number increased between two Short tests, and that's why TrueNAS reported the test as failed.
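To see exactly which counter moved between two reports, you can diff the values by hand or with a few lines of Python. This is a minimal sketch, assuming you have copied the counter values from two reports into dictionaries; the counter names and numbers below are made up for illustration and are not this drive's data.

```python
from typing import Dict


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters whose value changed between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        old_value = before.get(name, 0)  # treat a counter missing from the first snapshot as 0
        if new_value != old_value:
            deltas[name] = new_value - old_value
    return deltas


# Hypothetical snapshots taken after two consecutive SMART Short tests.
first_short_test = {"write_errors_corrected": 5, "read_errors_corrected": 0, "grown_defects": 0}
second_short_test = {"write_errors_corrected": 8, "read_errors_corrected": 0, "grown_defects": 0}

print(counter_deltas(first_short_test, second_short_test))  # {'write_errors_corrected': 3}
```

Only counters that changed are reported, so an unchanged grown defect count drops out and the one moving counter stands out.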

Annoyingly, though, I have other drives in other systems with even more "write errors corrected", and no software other than TrueNAS reports a failure. Write errors in TrueNAS itself show zero, I suppose because these were corrected internally by the drive.

I'm now doing a full write + read test with Hard Disk Sentinel; so far all sectors are writing at around 200 MB/s with no errors. I really feel this drive should not have been flagged at all, but what do I know.

Maybe if TrueNAS actually gave its users information about the errors it's seeing, we could all make informed decisions.
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
The full output from TrueNAS is in the OP. It actually does show the 13 write errors, but doesn't point them out.

Maybe I'm wrong about it failing on that? The fact that I'm unsure highlights just how poorly TrueNAS has handled this "failing" drive.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Elements in grown defect list: 0
So, looking at the drive data, here is what I see: the number of elements in the grown defect list is still zero, meaning no sectors have been mapped out.
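If you want to check this one value yourself rather than reading the whole report, the line can be pulled out of the `smartctl -a` text for a SAS drive. A minimal sketch, assuming the report text has already been captured (for example, saved to a file); `grown_defect_count` is a hypothetical helper name and the sample text is illustrative, not this drive's report.

```python
import re
from typing import Optional

# smartctl prints this line for SCSI/SAS drives (SATA drives report remapped
# sectors through the Reallocated_Sector_Ct attribute instead).
GROWN_DEFECTS = re.compile(r"Elements in grown defect list:\s*(\d+)")


def grown_defect_count(report: str) -> Optional[int]:
    """Return the grown defect count from smartctl text, or None if the line is absent."""
    match = GROWN_DEFECTS.search(report)
    return int(match.group(1)) if match else None


# Illustrative excerpt only.
sample = """\
Current Drive Temperature:     34 C
Elements in grown defect list: 0
"""
print(grown_defect_count(sample))  # 0
```

Returning None instead of 0 when the line is missing keeps "no defects" distinct from "this report has no such field" (for example, a SATA drive).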

And I'm not a SAS drive expert, so I had to search the internet, and here is something I found on:

Delayed errors are errors that slow down other requests. ECC corrected errors aren't much of a worry on SCSI/SAS drives, we have drives deployed with hundreds of millions of them and they still run fine. Correction algorithm invocations is a bit more serious, those may require rereading/rewriting the disk, and retrying the ECC calculation.
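The terms in that quote (ECC corrected, delayed, correction algorithm invocations) match the columns of the "Error counter log" table that `smartctl -a` prints for SAS drives. As a sketch, assuming the seven-column layout shown in the sample below (the numbers are invented), the read/write/verify rows can be turned into named fields so the ECC-corrected counts and the correction algorithm invocations are easy to compare between reports. The field names are my own labels, not smartctl's.

```python
import re
from typing import Dict, Union

# Hypothetical labels for the seven numeric columns, in the order smartctl prints them.
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "correction_algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

# A counter row starts with "read:", "write:" or "verify:" followed by the numbers.
ROW = re.compile(r"^\s*(read|write|verify):\s+(.*)$")


def parse_error_counter_log(report: str) -> Dict[str, Dict[str, Union[int, float]]]:
    """Map 'read', 'write' and 'verify' to a dict of named counter values."""
    table: Dict[str, Dict[str, Union[int, float]]] = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout; skip rather than mis-assign columns
        row: Dict[str, Union[int, float]] = {}
        for name, raw in zip(FIELDS, values):
            # Gigabytes processed is a decimal; every other column is a count.
            row[name] = float(raw) if name == "gigabytes_processed" else int(raw)
        table[match.group(1)] = row
    return table


# Invented numbers in the smartctl SAS layout.
sample = """\
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0       1234.567           0
write:         0        0         4         4          4        567.890           0
verify:        0        0         0         0          0          0.000           0
"""

log = parse_error_counter_log(sample)
print(log["write"]["total_corrected"])  # 4
```

Rows that don't have exactly seven values are skipped, so a different smartctl layout yields an empty result instead of silently shifted columns.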

Unfortunately, this does not tell me what the ECC errors represent. If it's like a SATA drive, then it's generally the data cable or HBA that causes the ECC errors, and they are recorded in the drive for life.

With the high number of hours you have on the drive, it might actually be starting to fail, but it's up to you to decide whether you feel the drive is failing. It would be interesting to know whether any other drives in your system are showing these errors as well.
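One quick way to answer that is to note the corrected-error total for every drive and list the ones above some threshold, worst first. A minimal sketch, assuming you have already read each drive's count into a dictionary; the drive names, counts, and the `drives_with_corrected_errors` helper are hypothetical.

```python
from typing import Dict, List


def drives_with_corrected_errors(counts: Dict[str, int], threshold: int = 0) -> List[str]:
    """Return drive names whose corrected-error count exceeds `threshold`, highest first."""
    flagged = [name for name, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda name: counts[name], reverse=True)


# Hypothetical per-drive totals copied from each drive's report.
pool = {"da0": 0, "da1": 4, "da2": 0, "da3": 27}
print(drives_with_corrected_errors(pool))  # ['da3', 'da1']
```

If several drives in the same chassis show similar counts, that points more toward something shared (cabling, HBA) than toward one failing drive.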

So the data on your drive still doesn't indicate a hard failure. I'm curious whether the drive fixed itself while running the SMART Long test; it's odd for a drive to fail a Short test but pass a Long test. Are the Short tests passing now too?

As for how TrueNAS reports all this data, it might be doing so properly, since there no longer appears to be a failure. It told you the SMART Short test failed, but you were eventually able to get a Long test to pass. I understand you would prefer better tools, and all I can suggest is to submit a suggestion to the developers and see whether they make a change. It may be possible in SCALE to use some Debian/Linux tool in a jail or a plugin. I have no idea; I don't play with SCALE yet.
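Before reaching for a jail or plugin, it may be enough to capture the raw report yourself with `smartctl` from smartmontools, which should already be present on a TrueNAS system since it runs the SMART tests. A minimal sketch, assuming Python is available on the box; the device path is hypothetical (on SCALE, disks usually appear as /dev/sdX).

```python
import subprocess
from typing import Optional


def smartctl_report(device: str) -> Optional[str]:
    """Return the text printed by `smartctl -a <device>`, or None if smartctl is missing."""
    try:
        result = subprocess.run(
            ["smartctl", "-a", device],
            capture_output=True,
            text=True,
            # smartctl's exit status is a bit mask that is non-zero when it finds
            # problems, so a non-zero status does not mean there is no output.
            check=False,
        )
    except FileNotFoundError:
        return None  # smartmontools is not installed on this machine
    return result.stdout


if __name__ == "__main__":
    report = smartctl_report("/dev/sda")  # hypothetical device path
    if report is not None:
        print(report)
```

The captured text can then be searched for the grown defect list and the error counter log instead of relying on the pass/fail summary alone. Reading SMART data usually requires root.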
 