Self-Healed hard drive?

Joined
Jun 24, 2017
Messages
338
Hey gents,
I was wondering if someone could explain to me what a "self-healed" (error, I assume) is in the new FreeNAS GUI?

Basically, I've been having some oddities happen on my NAS, and my drives have thrown a DEGRADED state... but after a reboot, they either show "self-healed" portions or they show 0s across the board on errors... These drives have been running pretty problem-free for about 4 months, so I kind of think there isn't a problem specifically with the drives... (I have been having some issues with FreeNAS itself, but that's a whole other set of beans)...

Anyway, is there a simple explanation of what "self-healed" means beyond just "it healed itself"? More to the point, is it something to pay attention to if no errors are being reported after reboots? (Because all drives are showing green and good, with 0 errors, right now)...

Further details (no need to read these if you're just explaining the above)...

Drives ARE Seagate Barracudas (yeah, I know... but they're dirt cheap, I got a bunch of them, they're all still under warranty, and they contain no 'important' data)
4x8TB in RAIDZ1
Running FreeNAS 11.3-RC1
96GB ECC RAM
CPU: E5-2609 V2

Problems I've been having with FreeNAS have primarily been jail issues... but with the drive acting wonky, it seems to make the network slow down or something, as I get a lot more buffering when playing videos to Kodi if one of the drives is reporting errors that are being (or have been) "self-healed"... this remedies itself with a reboot... The other weird part is that it seems to be 2 different drives doing this, unless FreeNAS is changing their assigned device names (i.e. it WAS da3 4 days ago, last night it was da2)... I didn't note the serial number of the wonky drive(s) until starting today, so I can see whether it's just the same drive... FreeNAS has also taken one of these drives "offline" and seemed to resilver it after a reboot once...

Anyway, any help is appreciated...
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
A drive, vdev, or pool is declared degraded if ZFS detects problems with the data. If you reboot, the error count is reset. A resilver will heal the data errors if there is sufficient redundancy. ZFS will only spot data issues on read; that's why we have scrubs, a forced read of all the data to try to determine if there are any errors. So regularly scheduled scrubs are important. A scrub will not tell you why the data is corrupted; for that you have S.M.A.R.T. tests, and you need to schedule those as well, both long and short.

To get a handle on the situation as it is, you need to trigger a scrub and long SMART tests.
Post the results in the thread, in [code] tags.
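
For reference, a minimal sketch of the shell commands involved, assuming your pool is called Storage and your data disks are da0 through da3 (adjust to match your system):

Code:

# kick off a manual scrub of the pool
zpool scrub Storage

# watch scrub progress and per-disk read/write/checksum error counters
zpool status -v Storage

# start a long (extended) SMART self-test on each data disk
smartctl -t long /dev/da0
smartctl -t long /dev/da1
smartctl -t long /dev/da2
smartctl -t long /dev/da3

# once the tests finish, collect the full report to post
smartctl -a /dev/da0

The long test runs inside the drive itself and takes many hours on 8 TB disks, but the pool stays usable while it runs.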
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
with the drive acting wonky, it seems to make the network slow down or something, as I get a lot more buffering when playing videos
It could be that these are the dreaded SMR drives, which do not play well with ZFS. It would be helpful if you posted the model number.

Also, please review these:

 
Joined
Jun 24, 2017
Messages
338
It could be that these are the dreaded SMR drives, which do not play well with ZFS. It would be helpful if you posted the model number.

Also, please review these:


What is it exactly you would like me to see in the Forum Rules? I didn't curse, I wasn't disrespectful, I wasn't talking piracy, I wasn't soliciting for help, and I had searched... and frankly, I still haven't actually gotten an answer to my original question, but I took the advice I was given and rolled with it... (Mind you, I asked if someone could explain in more simplistic terms what a self-healing drive is... but in more depth than just saying it heals itself...)

As for the model:

ATA ST8000DM004-2CX1
 
Joined
Jun 24, 2017
Messages
338
It could be that these are the dreaded SMR drives, which do not play well with ZFS. It would be helpful if you posted the model number.

And yes, after a little research just now... it does look like they are SMR drives...

However, I was under the assumption that SMR doesn't really run into issues until you begin re-writes over old data... Thus far, I am at about 20% usage... would I be encountering rewrites already?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
What is it exactly you would like me to see in the Forum Rules
The forum rules don't say that you can't ask for help; that is what we are here for. It isn't something you did wrong, but the guidance in the rules also tells you what information it is good to include in your post about your system, so we know what we are working with and don't need to spend so much time asking questions.
I was under the assumption that SMR doesn't really run into issues until you begin re-writes over old data... Thus far, I am at about 20% usage... would I be encountering rewrites already?
There doesn't appear to be a lot of consistency in the kind of problems presented by SMR drives. The ones I had in my system performed consistently slower than the non-SMR drives in the same system, and I saw about a 30% improvement in overall system performance once I replaced them, and that was with only two SMR drives (which I got by mistake) in a pool of 12. I don't know the answer to why; I can only tell you what I observed in my own system and what I have heard from other members here in the forum.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
consistency
I realized I was being self-contradictory. There doesn't appear to be a good reason for why or when the SMR drives begin giving problems. I think it is because they rearrange data on the platter behind the back of the file system. The ones I had started giving me problems when I was initially resilvering them into the pool. A resilver that should have taken only a few hours (under six) took over 24 hours to complete. It only got worse when I added the second drive into the pool.
 
Joined
Jun 24, 2017
Messages
338
The forum rules don't say that you can't ask for help; that is what we are here for. It isn't something you did wrong, but the guidance in the rules also tells you what information it is good to include in your post about your system, so we know what we are working with and don't need to spend so much time asking questions.

I did include system specs... however, I hadn't thought to include the model of the drives specifically, as the original question wasn't really about the drives themselves but more about the "self-healed" part of FreeNAS's GUI report... and when asked for the model number of the drives, I "immediately" supplied it... (Again, I would have in my OP, but I didn't really think it pertinent... now that you've brought up issues with SMR drives, I understand why I should have included it if I was actually asking for help about the drives... but again, I included the issues I was having even though they weren't 'really' part of my original question...)

I guess, ultimately, I'm trying to wrap my head around what the self-healed portion of FreeNAS's reports is, and why that number would go from a fairly large one (IIRC it was something like 433,000) to 0 when I rebooted... As of now, I am running long SMART tests on the drives (the short ones have already concluded with no errors) and a full scrub (the last one was last Thursday)... but they take several hours, so I can't post the results quite yet...
 
Joined
Jun 24, 2017
Messages
338
SMART long tests are still running, but zpool shows:

Code:

root@freenas[~]# zpool status
  pool: Storage
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 06:03:20 with 0 errors on Mon Dec 30 16:23:17 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/62e076a3-dd66-11e9-951a-6805ca901b9c  ONLINE       0     0     0
            gptid/35d71b50-da80-11e9-bba5-6805ca901b9c  ONLINE       0     0     0
            gptid/cbe0a193-bf6e-11e9-8b7e-6805ca901b9c  ONLINE       0     0     0
            gptid/cf97bcae-bf6e-11e9-8b7e-6805ca901b9c  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:29 with 0 errors on Thu Dec 26 03:45:29 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors
root@freenas[~]#                                    
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
654
Is there a simple explanation of what "self-healed" means beyond just "it healed itself"? More to the point, is it something to pay attention to if no errors are being reported after reboots? (Because all drives are showing green and good, with 0 errors, right now)...

Self-heal can happen in two (three) cases in general:
- Data corruption was detected and was "self-healed" using redundant data. This of course requires a redundant environment (RAIDZ#, mirror, ...). If you don't have a redundant env (so single-disk vdev(s)), the data corruption cannot be repaired. The damage is permanent and you will have to restore from backups/snapshots (if you have any).
- Metadata corruption was detected and "self-healed". Metadata are stored in two places, so even in a non-redundant environment it is possible for ZFS to self-heal this kind of corruption (unless some major breakdown happened and both copies got corrupted). A metadata "heal" is usually quite fast, and zpool status usually shows "repaired in 0 sec". In such a case it was either metadata or a very tiny portion of data that was healed (maybe a near-bit flip?).
- The third one is basically a whole disk being kicked out of the pool for whatever reason and brought back in (either the same disk or a spare/replacement), where the "self-heal" is a full resilver process.

Either way, if this is a one-time situation I would keep calm and not bother much (considering you have periodic SCRUBs and SMART tests scheduled). If the corruption is repeating (but not growing), it could be that some block on the HDD has issues and a SCRUB (or just a plain read) is hitting that spot and fixing it. If it is growing, or the amount of corrupted data is huge, then you have an issue. To pinpoint "where", more investigation is necessary...

FreeNAS has also taken one of these drives "offline" and seemed to resilver it after reboot once...
In that case, post the long SMART test results for that device (once you have them finished). Also, I would recommend starting to save the SMART status somewhere in order to have historical values (to see growth, if any). You can use a small .sh script with something like smartctl -a /dev/da0 > /<somewhere>/`date +"%Y%m%d_%H%M"` and schedule it via cron after every SMART test. (Do not use the code snippet directly in cron, as the symbol "%" has a different meaning there and it would not work!)
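
Purely as a sketch of that idea (the /mnt/Storage/smart_logs path and the da0-da3 device list are placeholders, not anything FreeNAS sets up for you):

Code:

#!/bin/sh
# save a timestamped copy of each drive's SMART report for later comparison
OUTDIR=/mnt/Storage/smart_logs      # placeholder path, change to suit
STAMP=$(date +"%Y%m%d_%H%M")

mkdir -p "$OUTDIR"
for d in da0 da1 da2 da3; do        # placeholder device list
    smartctl -a /dev/$d > "$OUTDIR/${d}_${STAMP}.txt"
done

Calling a script like this from cron also sidesteps the "%" caveat above, since the date expansion happens inside the script rather than on the cron command line.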
 
Joined
Jun 24, 2017
Messages
338
Self-heal can happen in two (three) cases in general:
- Data corruption was detected and was "self-healed" using redundant data. This of course requires a redundant environment (RAIDZ#, mirror, ...). If you don't have a redundant env (so single-disk vdev(s)), the data corruption cannot be repaired. The damage is permanent and you will have to restore from backups/snapshots (if you have any).
- Metadata corruption was detected and "self-healed". Metadata are stored in two places, so even in a non-redundant environment it is possible for ZFS to self-heal this kind of corruption (unless some major breakdown happened and both copies got corrupted). A metadata "heal" is usually quite fast, and zpool status usually shows "repaired in 0 sec". In such a case it was either metadata or a very tiny portion of data that was healed (maybe a near-bit flip?).

So, in essence, self-healed usually deals only with the soft data and not the physical health of the drive itself? Or, do re-written bad sectors fall under the "self-healed" envelope as well?

And thanks for the response... this is sort of what I was looking for (kind of a road marker to let me know whether I should worry about the hardware a lot more, or just be patient, run tests, and see where it lands)... nothing on the drives is irreplaceable (as in, it either isn't 'important', like TV shows, or it is backed up elsewhere, like documents), and all of the data is backed up on a separate server...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Or, do re-written bad sectors fall under the "self-healed" envelope as well?
True bad sectors should be removed from use by the hard drive itself and mapped out of use; Pending Sector errors that go away are not considered self-healing. I really think you have the answer to your original question: the self-healing is really just data redundancy playing in your favor. If you are having a lot of these issues, then I would think you have a hardware problem, and maybe it is those SMR drives you have. I don't want to tell you it is the SMR drives for sure; it is just the first place I'd look. Did you do a proper burn-in test of your hardware to attempt to identify any potential hardware issues? Do you leave your system on all the time, or do you power it off often? You might also want to post the output of
Code:
smartctl -a /dev/xxx
for each drive. If you have any obvious problems, then someone should be able to pick up on that and let you know.
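
A quick way to gather that for all drives in one go (just a sketch; the da0-da3 list is an assumption, so substitute your actual device names):

Code:

# print the full SMART report for each data disk with a header line
for d in da0 da1 da2 da3; do
    echo "===== /dev/$d ====="
    smartctl -a /dev/$d
done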

Good Luck and Happy New Year!
 
Joined
Jun 24, 2017
Messages
338
True bad sectors should be removed from use by the hard drive itself and mapped out of use; Pending Sector errors that go away are not considered self-healing. I really think you have the answer to your original question: the self-healing is really just data redundancy playing in your favor. If you are having a lot of these issues, then I would think you have a hardware problem, and maybe it is those SMR drives you have. I don't want to tell you it is the SMR drives for sure; it is just the first place I'd look. Did you do a proper burn-in test of your hardware to attempt to identify any potential hardware issues? Do you leave your system on all the time, or do you power it off often? You might also want to post the output of
Code:
smartctl -a /dev/xxx
for each drive. If you have any obvious problems, then someone should be able to pick up on that and let you know.

Good Luck and Happy New Year!

As noted above, the long SMART scans are currently running... they will probably be done late today or early tomorrow, and I will post the results as soon as I have them. As for the zpool scrub, its results are posted above and show no errors...

My system runs pretty much constantly, unless I reboot it for a reason (usually an update, or if there are errors)... so my machine has been rebooted more in the last week than in the previous 3 months...

One thing that is a little bothersome is that, checking the SMART reporting, a lot of attributes are popping up as "Old_age"... is this normal on drives that are less than a year old? (They do have about 7k hours of run time on them.)
 

HolyK

Ninja Turtle
Moderator
Joined
May 26, 2011
Messages
654
So, in essence, self-healed usually deals only with the soft data and not the physical health of the drive itself? Or, do re-written bad sectors fall under the "self-healed" envelope as well?
Well, it depends. Self-heal is an FS-level *thing*, so it is ZFS doing us a huge favor. The question is what happens next when ZFS detects the "wrong data"... like, "Hey, there is a "1" but it should be a "0", but I've fixed that for you!" I mean, if it is able to "heal" the issue, I don't think it immediately reports the block as "bad" to the controller/disk FW. I don't know the details, but I would say there is some internal logic for the point at which a bad block is actually reported and moved to the "B list", which would lead to "uncorrectable/pending reallocation" and later on to the block being reallocated.

Bottom line... a block can be either "1" or "0". If, for whatever reason, it changes state when it shouldn't or does NOT change when it should, it is a "bad block". In general you can have five main issues (excluding other situations like a whole track or cluster being screwed up):
- The block is "0" and you're not able to set it to "1" -> This one gets reported after several failed attempts.
- The block is "1" and you're unable to set it to "0" -> Again, a bad block for sure. Both of these are easy to spot because the CRC of the write request will not match (e.g. I requested to write "0", then checking what is in there returns "1").
- The block is "0" and you set it to "1", BUT at the same time the next (or previous) block (or more) is unintentionally changed to "1" as well.
- The block is "1" and you set it to "0", but as in the previous point it messes up the neighboring block. Both of these are quite nasty issues, because recognizing such corruption is tricky: it might be the very last block written in your request, so you will not even know that one extra block was changed. And this is a nice example of where SCRUB comes in: it reads the data and checks/repairs the issues. The question is whether ZFS keeps some historical data about "what" was broken/fixed, so that if it hits the same spot multiple times it would (I guess?) report the block as BAD.
- And the last one is where the block's response is either very slow or the block is completely unresponsive/dead, so either the write or the read fails and you get neither a 1 nor a 0. This one for sure gets reported and reallocated sooner or later...

Note that this is the FS reporting the issues, but the disk FW has its own logic to detect bad blocks and other problems. A nice example is when the response time of a block is too long: there is a read request from the upper level and the data (1s and 0s) are returned, but slowly. ZFS gets proper data, so all is OK from that point of view, but the HDD FW can evaluate the block (or maybe the whole cluster) as being very slow and force all of its blocks to be reallocated elsewhere. I recall from the very old days that the threshold for forced reallocation was 5000 ms. Some low-level tools (like MHDD) had a function to force this value to a different one and then do a full HDD sweep while forcing reallocation of the slow-responding blocks.
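
If you want to keep an eye on whether the drive firmware is actually reallocating anything, the handful of SMART attributes worth watching can be pulled out like this (just a sketch; the attribute names are as smartctl reports them for these drives):

Code:

smartctl -A /dev/da1 | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count"

Rising raw values on any of those point at problems on the drive (or cable) side rather than just at ZFS.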

And thanks for the response... this is sort of what I was looking for (kind of a road marker to let me know whether I should worry about the hardware a lot more, or just be patient, run tests, and see where it lands)... nothing on the drives is irreplaceable (as in, it either isn't 'important', like TV shows, or it is backed up elsewhere, like documents), and all of the data is backed up on a separate server...
In general you should (need to?) have periodic SCRUBs and SMART tests scheduled. If some critical event occurs, FreeNAS will inform you about it, so make sure you have working email notifications. If something does happen, then what to do depends on what it is.
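
On FreeNAS the GUI's scheduled scrub and S.M.A.R.T. test tasks are the usual way to set these up, but just to illustrate what the underlying jobs amount to, an /etc/crontab-style sketch might look like this (pool name and device are assumptions carried over from earlier in the thread):

Code:

# scrub the pool on the 1st and 15th of each month at 03:00
0  3  1,15  *  *  root  zpool scrub Storage

# long SMART self-test on a data disk every Sunday at 02:00
0  2  *  *  0   root  smartctl -t long /dev/da0

On an actual FreeNAS box you would configure the equivalents through the GUI rather than editing cron by hand.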
 
Joined
Jun 24, 2017
Messages
338
Great... to add more fun into the mix:


CRITICAL
Device: /dev/da1 [SAT], Read SMART Error Log Failed.
Tue, 31 Dec 2019 06:51:31 (America/Los_Angeles)


but the SMART test doesn't show any failures, and for some reason it stopped the long test and is now doing short tests...
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Post the output of smartctl -a /dev/da1 in code tags.
 
Joined
Jun 24, 2017
Messages
338
It should be noted that I restarted the test when the alert came up in FreeNAS.

Code:

root@freenas[~]# smartctl -a /dev/da1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Compute
Device Model:     ST8000DM004-2CX188
Serial Number:    ZCT09TD3
LU WWN Device Id: 5 000c50 0b279d58a
Firmware Version: 0001
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Dec 31 09:20:57 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 995) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   084   064   006    Pre-fail  Always       -       242385968
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   089   060   045    Pre-fail  Always       -       784182026
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7279 (181 39 0)
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       93
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       13
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   059   040    Old_age   Always       -       23 (Min/Max 17/26)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       310
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       390
194 Temperature_Celsius     0x0022   023   041   000    Old_age   Always       -       23 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   084   064   000    Old_age   Always       -       242385968
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       7229 (145 115 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       92126409357
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       71360100897

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%      7279         -
# 2  Short offline       Completed without error       00%      7278         -
# 3  Short offline       Interrupted (host reset)      00%      7277         -
# 4  Extended offline    Interrupted (host reset)      00%      7277         -
# 5  Short offline       Completed without error       00%      7256         -
# 6  Short offline       Completed without error       00%      7255         -
# 7  Short offline       Completed without error       00%      7254         -
# 8  Short offline       Completed without error       00%      7253         -
# 9  Short offline       Completed without error       00%      7252         -
#10  Short offline       Completed without error       00%      7251         -
#11  Short offline       Completed without error       00%      7250         -
#12  Short offline       Completed without error       00%      7249         -
#13  Short offline       Completed without error       00%      7248         -
#14  Short offline       Completed without error       00%      7247         -
#15  Short offline       Completed without error       00%      7246         -
#16  Short offline       Completed without error       00%      7245         -
#17  Short offline       Completed without error       00%      7244         -
#18  Short offline       Completed without error       00%      7243         -
#19  Short offline       Completed without error       00%      7242         -
#20  Short offline       Completed without error       00%      7241         -
#21  Short offline       Completed without error       00%      7240         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@freenas[~]#  

 