Is the disk truly faulted?

schoolpost

Dabbler
Joined
Feb 14, 2018
Messages
20
I was recently notified of a degraded volume (a RAIDZ3) due to a faulted disk. Nothing out of the ordinary: disks wear over time, and this disk (the whole server, in fact) has been online basically 24/7 for about a year now.

It's a 4TB Seagate IronWolf NAS drive. I didn't expect to see it go so soon, but again, I get that these failures occur from time to time.

What's strange, and it's something I've observed ever since I started using FreeNAS years ago...

After I rebooted the system and let it do a scrub/resilver, the state of the volume went from degraded back to fully healthy! (I did not swap the disk out.)

I expected this to happen, because I've seen this "smoke and mirrors" solution work for me for quite some time now (going back to some much older versions of FreeNAS too).

Basically...

When I encounter a faulted disk, I just reboot, let it resilver and scrub, and everything is back to normal... but how? What happened to the fault? Is it still a faulty drive?


I understand this may just be an early sign of a more severe failure in the future, but for now, if FreeNAS reports everything is healthy and normal, why should I bother? A dead and clicking drive, on the other hand, obviously needs immediate replacement.

I purchase IronWolf NAS and RED drives for the extended warranty (among other reasons). Is a fault like the one I describe above covered under this warranty? Unlike more fatal faults, such as instances where the drive clicks, I'm not sure Seagate would recognize this fault and replace the drive. Or do they replace any drive you send in, regardless? Is the fault something they would be able to reproduce in their warranty evaluation?

I will send the drive back without hesitation if I know the drive:
  1. Is truly in a bad state and bound to have issues in the near future.
  2. Will be covered under the manufacturer warranty.

Looking for others' experience/advice on the matter.
 

Attachments

  • diskfault1.PNG (27.8 KB)
  • diskfault3.PNG (477.2 KB)

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Hi,

Please give some more info about the hardware.

I would take a close look at the SMART data of the disk and compare it to another of the disks. Are you running regular long tests on the drives, and what are the results?
 

schoolpost

Dabbler
Joined
Feb 14, 2018
Messages
20
c77dk said:
Hi,

Please give some more info about the hardware.

I would take a close look at the SMART data of the disk and compare it to another of the disks. Are you running regular long tests on the drives, and what are the results?

I updated my signature to show my system specs (similar to yours).

I can check the SMART data on the drives, but as I've stated above, this isn't an isolated incident; it's something I've seen across various iterations of my hardware and earlier versions of FreeNAS too.

I'm basically asking: is it common/normal to be able to just reboot your system and have FreeNAS resilver/scrub the pool as if there was no fault in the first place?

As of typing this, the pool is healthy. The fault was first found almost two weeks ago, and I rebooted last night.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
schoolpost said:
What happened to the fault? Is it still a faulty drive?

Typically yes, the drive is faulty. At times the drive will try to read some data, and if it takes too long to return valid data, the drive gets dropped. I'd recommend that you run a SMART Long test on all drives, or at a minimum on the suspect drive, and examine the results, or post the results here for comment. I have a little troubleshooting guide in my signature that can help you.

schoolpost said:
but as I've stated above this isn't an isolated incident, It's something I've seen across various iterations of my hardware and earlier versions of FreeNAS too.

You need to pay attention to your hardware configuration and where each part is. Your HBA, data cables, or drives could be faulty. Track the failures by drive serial number, not by a device name such as da3, because device names can move around; typically they don't, but sometimes they do. Look for consistencies and common parts. For example, if the drives connected to the data cable on HBA1 port 1 are the only ones failing, swap the data cables between port 1 and port 2 and see whether the problem stays with the same drives or moves. You can deduce a lot from slow, intentional changes to your system before spending any money. It could be a faulty cable, a faulty HBA, or maybe just one drive. Take your time and write down your observations, and you will figure it out.
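
To make the "write it down and look for the common part" idea concrete, here is a minimal Python sketch of a fault log keyed by drive serial number and physical port. Everything in it is an assumption for illustration: the `FaultEvent` record, the `likely_culprit` helper, and the serial and port labels are made up, and the rule it applies is only the rough heuristic described above, not a real diagnosis.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    """One observed fault. The values used below are made-up examples."""
    date: str    # when the pool reported the fault
    serial: str  # drive serial number (stable, unlike a name such as da3)
    port: str    # physical HBA port / cable the drive was on at the time


def likely_culprit(events):
    """Guess whether the faults follow a drive or a port/cable.

    Returns ("drive", serial), ("port", port), or ("unclear", None).
    """
    if not events:
        return ("unclear", None)
    serials = Counter(e.serial for e in events)
    ports = Counter(e.port for e in events)
    # Same serial faulting on different ports: the drive is the common part.
    if len(serials) == 1 and len(ports) > 1:
        return ("drive", next(iter(serials)))
    # Different serials faulting on the same port: cable or HBA port is suspect.
    if len(ports) == 1 and len(serials) > 1:
        return ("port", next(iter(ports)))
    return ("unclear", None)


# Hypothetical log: the same drive faulted before and after a cable swap.
log = [
    FaultEvent("2020-09-25", "SERIAL-A", "HBA1-port1"),
    FaultEvent("2020-10-20", "SERIAL-A", "HBA1-port2"),
]
print(likely_culprit(log))  # ('drive', 'SERIAL-A')
```

With only one recorded fault the helper returns "unclear", which matches the point above: you need at least one deliberate change and a second occurrence before you can tell the drive from the cable.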

Good Luck!
 

schoolpost

Dabbler
Joined
Feb 14, 2018
Messages
20
Here are the SMART results after running a long test on the culprit disk.

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
LU WWN Device Id: 5 000c50 0b452e0a3
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct  9 14:29:29 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  591) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 670) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       187776953
  3 Spin_Up_Time            0x0003   095   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   045    Pre-fail  Always       -       308665987
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       15124 (70 62 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       34
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   064   040    Old_age   Always       -       31 (Min/Max 30/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       67
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1942
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       15119 (159 161 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       6006428196
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       30568727242

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15116         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I am sorry to say that there is no smoking gun in the drive test. I'd recommend moving your SATA data cables around, then tracking the problem to see whether it stays with the SATA cable location or with the drive. Hopefully it's not an HBA problem. The trouble is that it could take a while to isolate. Write down exactly what you do, because if it takes a week or longer for the failure to occur again, you are likely going to forget what you did.
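
As a rough illustration of what "no smoking gun" means for the report above, the Python sketch below scans smartctl attribute rows and reports any watched attribute whose raw value is not zero. The watched IDs (5, 187, 188, 197, 198, 199) are my assumption about which counters people commonly check first, and `nonzero_watched` is a hypothetical helper, not part of smartmontools. The sample rows are copied from the output posted earlier in this thread.

```python
# IDs assumed to be worth checking first; adjust to taste.
WATCHED_IDS = {5, 187, 188, 197, 198, 199}


def nonzero_watched(smart_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes that are non-zero."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) not in watched:
            continue
        raw = fields[9]  # first token of the RAW_VALUE column
        if raw.isdigit() and int(raw) != 0:
            flagged[fields[1]] = int(raw)
    return flagged


# Rows taken from the SMART report posted above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watched(sample))  # {}
```

An empty result only says these particular counters are clean. It does not rule out a cable, backplane, or HBA issue, which is why the cable-swapping test is still worth doing.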

I wish you luck.
 

fahadshery

Contributor
Joined
Sep 29, 2017
Messages
179
joeschmuck said:
I am sorry to say that there is no smoking gun in the drive test. I'd recommend moving your SATA data cables around, then tracking the problem to see whether it stays with the SATA cable location or with the drive. Hopefully it's not an HBA problem. The trouble is that it could take a while to isolate. Write down exactly what you do, because if it takes a week or longer for the failure to occur again, you are likely going to forget what you did.

I wish you luck.

I know it's an old post, but I just wanted to let you know that your answer saved my day today!
All of a sudden, two of my drives were showing "Faulted" after I changed the location of one of my LSI controllers.
After reading your answer, I shut down TrueNAS, took the drives out, and double-checked the cabling, making sure everything was tight and secure.
I powered on my TrueNAS box while the drives were still out, and it showed my pool as not available. I disconnected my pool, then re-inserted all the drives...
I rebooted the server again, and camcontrol devlist started showing my drives again.
I re-imported my pool and bam, I'm online again!
I will redo my SMART tests and check the health of the drives, but luckily I have backup drives on hand in case I need to replace any!
Thank you for your answer.
 