Logic behind putting a disk in the FAULTED state

coolnodje

Explorer
Joined
Jan 29, 2016
Messages
66
I have a disk that has just started to show some "Completed: read failure" errors, and apparently as a result TrueNAS has put it in a FAULTED state, which in turn has put my pool in a DEGRADED state.

The pool is RAIDZ2 and a replacement has been ordered; I'm just trying to understand exactly how this happened, in terms of the signal received and the automatic action taken by TrueNAS.

My SMART tests are scheduled as follows:
SHORT weekly
LONG monthly
(I realize now that OFFLINE tests can also be scheduled.)

When I look at the SMART test results of the disk in error, I see only `Short offline` and `Extended Offline`.
I suppose `Extended Offline` corresponds to the scheduled LONG tests, but the naming is a bit confusing. What would a scheduled OFFLINE test be displayed as?

In any case, for the disk in question, I have a first FAILED `Extended Offline` from 16 days ago, and a first FAILED `Short offline` from this morning (though I manually started an Offline test yesterday evening when I discovered the issue, and it should still be running now).

BUT the TrueNAS alert that triggered the FAULTED status on the disk dates from a day ago, i.e. before the first failed short test reported the error (and long after the failed extended test):

Pool subramanya state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
  • Disk WDC WD30EFRX-68EUZN0 WD-WCC4N7HZ39J7 is FAULTED


Hence my question: how did TrueNAS take the decision to put this disk in FAULTED? And what signal did it get that triggered the action?

Also, AFAIK, offline uncorrectable sectors, even though a very bad sign, do not directly lead to a runtime problem for a disk. An unreadable sector should be marked as such and the disk should continue to operate.

Shall I put the disk back ONLINE, or perform a full resilver as a last test before replacing it?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
how did TrueNAS take the decision to put this disk in FAULTED? And what signal did it get that triggered the action?
I have a disk that has just started to show some "Completed: read failure" errors, and apparently as a result TrueNAS has put it in a FAULTED state
When you have read errors that involve blocks that have pool data written to them, ZFS stops wanting to work with that disk (as it should). You may get away with disks where the errors are found by SMART long tests in areas of the disk not yet touched by ZFS data.
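To see ZFS's own view of those errors, separate from the SMART self-test log, the READ/WRITE/CKSUM counters in `zpool status` are the place to look. Below is a minimal Python sketch that picks the FAULTED rows out of that output. It is not how TrueNAS makes the decision, only a way to read the result. The sample text is hypothetical: the pool name is taken from the alert, but the device names and counts are invented, and the function name `faulted_devices` is just an illustration.

```python
import re

# Hypothetical excerpt of `zpool status` output. The pool name comes from the
# alert above; device names and error counts are invented for illustration.
SAMPLE = """\
  pool: subramanya
 state: DEGRADED
config:

        NAME          STATE     READ WRITE CKSUM
        subramanya    DEGRADED     0     0     0
          raidz2-0    DEGRADED     0     0     0
            sda       ONLINE       0     0     0
            sdb       FAULTED     23     0     0  too many errors
            sdc       ONLINE       0     0     0
"""

# One config row: name, state, then READ / WRITE / CKSUM as plain integers.
# Abbreviated counters are not handled by this sketch.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def faulted_devices(status_text):
    """Return (name, read, write, cksum) for every row whose state is FAULTED."""
    found = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and match.group(2) == "FAULTED":
            name, _state, read, write, cksum = match.groups()
            found.append((name, int(read), int(write), int(cksum)))
    return found


if __name__ == "__main__":
    for name, read, write, cksum in faulted_devices(SAMPLE):
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

On a real system you would feed it the text of `zpool status` for your pool instead of `SAMPLE`. A non-zero READ count on the faulted disk would fit the explanation above: ZFS hit the bad blocks itself while reading pool data, without waiting for a SMART test to report them.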

Shall I put the disk back ONLINE, or perform a full resilver as a last test before replacing it?
If you want to run badblocks on that disk and put it back into service elsewhere (with a filesystem other than ZFS) and sweat out the dying days of its usefulness hoping that it doesn't lose any important data for you, go right ahead.

I would suggest taking it far away from your ZFS pool if you like your data.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Also, AFAIK, offline uncorrectable sectors, even though a very bad sign, do not directly lead to a runtime problem for a disk. An unreadable sector should be marked as such and the disk should continue to operate.
I am aware of a number of people who enjoy wasting their time and have entertained the idea of using dd to write zeros to the identified bad block(s) which would force the disk to perform internal sparing... as far as I can tell, it either doesn't work or just delays the death of the disk slightly. Have at it if you want to waste your time.
 

coolnodje

Explorer
Joined
Jan 29, 2016
Messages
66
When you have read errors that involve blocks that have pool data written to them, ZFS stops wanting to work with that disk (as it should). You may get away with disks where the errors are found by SMART long tests in areas of the disk not yet touched by ZFS data.
Thanks a lot, that explains it fully.
I've had a few failed disks before, but never had FreeNAS mark them as FAULTED. It must have been because the errors were in untouched areas.
 