Infant Mortality - New Drive?

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
I have a RAIDZ2 pool with six 4TB drives. I ordered some 10TB drives with the goal of replacing them one at a time to increase the pool size. I replaced one and it took 6-8 hours to resilver. I used the replace feature, so the old drive stayed online until the new one was ready. I checked it the next day and all seemed fine. I got busy and didn't get the other drives replaced. About a week later, I checked and had a degraded pool notice: that brand new drive was faulted and offline. I just finished replacing the faulted drive with another 10TB drive and no longer have a degraded pool.

I am confused by the SMART information. What is this telling me?

Code:
Last login: Sat May 1 11:19:01 on pts/0
FreeBSD 11.2-STABLE (FreeNAS.amd64) #0 r325575+4710c8b6420(HEAD): Fri Feb 14 13:59:19 UTC 2020

FreeNAS (c) 2009-2019, The FreeNAS Development Team
All rights reserved.
FreeNAS is released under the modified BSD license.

For more information, documentation, help or support, go here:
http://freenas.org
Welcome to FreeNAS

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

root@freenas:~ # smartctl -a /dev/da6
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    VCH9T8UN
LU WWN Device Id: 5 000cca 0b0d28a01
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 1 11:21:21 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   87) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1120) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   068   068   016    Pre-fail  Always       -       362676230
  2 Throughput_Performance  0x0004   123   123   054    Old_age   Offline      -       124
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       212
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       97
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       97
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 16/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 229 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 229 occurred at disk power-on lifetime: 69 hours (2 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 e0 0d 09 40 00   2d+21:19:12.714  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+21:19:12.714  READ LOG EXT
  60 00 20 a0 0f 09 40 00   2d+21:19:02.779  READ FPDMA QUEUED
  60 c0 18 e0 0e 09 40 00   2d+21:19:02.779  READ FPDMA QUEUED
  60 00 08 e0 0b 09 40 00   2d+21:19:02.779  READ FPDMA QUEUED

Error 228 occurred at disk power-on lifetime: 69 hours (2 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 68 e0 0c 09 40 00   2d+21:19:02.584  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+21:19:02.584  READ LOG EXT
  61 08 a0 60 fe 3f 40 00   2d+21:18:52.503  WRITE FPDMA QUEUED
  61 08 98 60 fc 3f 40 00   2d+21:18:52.503  WRITE FPDMA QUEUED
  61 08 90 60 04 40 40 00   2d+21:18:52.503  WRITE FPDMA QUEUED

Error 227 occurred at disk power-on lifetime: 69 hours (2 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 98 f2 08 40 00   2d+21:18:39.463  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+21:18:39.463  READ LOG EXT
  60 40 40 18 f6 08 40 00   2d+21:18:32.546  READ FPDMA QUEUED
  60 80 38 98 f5 08 40 00   2d+21:18:32.546  READ FPDMA QUEUED
  60 00 30 98 f4 08 40 00   2d+21:18:32.546  READ FPDMA QUEUED

Error 226 occurred at disk power-on lifetime: 69 hours (2 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 98 ef 08 40 00   2d+21:18:32.345  READ FPDMA QUEUED
  60 18 50 e0 4a c3 40 00   2d+21:18:25.417  READ FPDMA QUEUED
  60 18 48 c0 c6 bd 40 00   2d+21:18:25.417  READ FPDMA QUEUED
  60 40 40 18 f6 08 40 00   2d+21:18:25.417  READ FPDMA QUEUED
  60 80 38 98 f5 08 40 00   2d+21:18:25.417  READ FPDMA QUEUED

Error 225 occurred at disk power-on lifetime: 69 hours (2 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 78 d0 b3 08 40 00   2d+21:17:54.855  READ FPDMA QUEUED
  60 40 90 90 b7 08 40 00   2d+21:17:47.886  READ FPDMA QUEUED
  60 40 88 d0 b5 08 40 00   2d+21:17:47.886  READ FPDMA QUEUED
  60 00 80 d0 b4 08 40 00   2d+21:17:47.886  READ FPDMA QUEUED
  60 00 70 d0 b2 08 40 00   2d+21:17:47.886  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@freenas:~ #
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
It seems indeed that you are facing some issues with your new drive.

From the SMART data you can tell:
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 88
This one is concerning: it means you have 88 pending sectors... This shouldn't happen on a new drive...

And a bit further the same error repeats itself:
40 41 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0
It means the drive encountered an uncorrectable error at LBA 0x00000000 (at the beginning of the disk as it seems?!).

That's not very good indeed for a new drive.

Before replacing old drives with new ones (or putting any drive into service, for that matter), you should run some burn-in tests to make sure they are good.
It is not too late. I would advise you to run a SMART long test, followed by some passes of badblocks, and then another SMART long test. I don't have the link here, but if I recall correctly there are some scripts for that in the resource section of the forum.
It could take several days (or even weeks...) on 10TB drives...

For this particular drive, I suppose WD has some diagnostic tools as well. You could run those and, depending on the outcome, you might be able to RMA it. I'm not sure whether "Current pending sector" alone is valid grounds for an RMA... "Offline uncorrectable" would be for sure.
But maybe some other forum members have more experience here...
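
If you want to check those counters without reading the whole dump by eye, here is a minimal Python sketch that pulls them out of saved smartctl -a output. The watch list (attributes 5, 197 and 198) and the names WATCHED and flag_attributes are my own choices for illustration, not anything official from smartmontools, and the regular expression only assumes the column layout shown in this thread.

```python
import re

# Attributes whose raw value should normally stay at 0 on a healthy drive.
# This watch list is an assumption for illustration, not an official list.
WATCHED = {5, 197, 198}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        attr_id, name, raw = int(match.group(1)), match.group(2), int(match.group(3))
        if attr_id in WATCHED and raw > 0:
            flagged[name] = raw
    return flagged


# Three rows copied from the first drive's output in this thread.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""
print(flag_attributes(sample))  # {'Current_Pending_Sector': 88}
```

An empty result only means these three counters are zero. It says nothing about the ATA error log further down, which has to be read separately.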
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
Thanks for the reply. I got it from NewEgg. I am going to try to RMA it unless someone tells me that this is correctable. I just will never trust this drive again after this.

Thanks
 

GBillR

Contributor
Joined
Jun 12, 2016
Messages
189
Definitely looks like a bad drive to me. I do not see any SMART self-tests in the log, though. I would run a short and a long test just to satisfy my own curiosity.

WD will RMA a drive under these conditions, but I never liked those RMA replacements they send back to me... if Newegg will RMA and provide you a new replacement, that's a better solution.
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
Are the replacements refurbs? I already filled out the RMA request and UPS will pick this drive up next week to go back.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
That's a bad drive. Unfortunately, it happens.

You really want to do a burn-in test on new drives. It doesn't have to be much. Spin it up in a USB enclosure for a couple of hours and run an extended SMART test at a minimum. Many of us here will run I/O against the drive for a day or two, followed by a SMART test. For more serious use, run badblocks to completion, followed by a SMART test, possibly add a cold soak (power off and allow to cool to ambient), followed by some more I/O exercise. I find I lose 1 drive in 10 from retail supply chains, but the above steps catch 90+% of those failures.
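
To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: the 1-in-10 and 90% figures are the anecdotal ones above, not measured statistics, the six-drive count comes from the pool in the first post, and it assumes drives fail independently of each other, which is a simplification (two drives from the same shipment may well share a problem).

```python
# Anecdotal figures from the post above, not measured failure statistics.
P_BAD = 0.10        # roughly 1 new drive in 10 is bad
CATCH_RATE = 0.90   # burn-in catches about 90% of the bad ones


def p_bad_after_burn_in(p_bad, catch_rate):
    """Chance that a drive which PASSED burn-in is bad anyway."""
    slipped = p_bad * (1 - catch_rate)   # bad and not caught
    passed = 1 - p_bad * catch_rate      # everything that was not rejected
    return slipped / passed


def p_any_bad(p_bad_per_drive, n_drives):
    """Chance that at least one of n drives is bad, assuming independence."""
    return 1 - (1 - p_bad_per_drive) ** n_drives


residual = p_bad_after_burn_in(P_BAD, CATCH_RATE)
print(f"per-drive risk after passing burn-in: {residual:.1%}")
print(f"6 drives, no burn-in:   {p_any_bad(P_BAD, 6):.1%}")
print(f"6 drives, with burn-in: {p_any_bad(residual, 6):.1%}")
```

Under these assumptions, skipping burn-in on a six-drive upgrade gives close to even odds of installing at least one bad drive, while burn-in cuts that to a few percent.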
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
So live and learn. I have this fresh new drive in now as part of my pool. What should I do with that one? I don't have any extras here yet; I've been ordering them as I can afford them. I have one en route that will be here next week. Should I run all those tests on it, then replace the untested but working drive with the fully tested one? Then test the untested drive and, after a successful test, start replacing the other drives that need replacing?

I have a Windows 10 PC with a hot swap bay on the front. Would be easy enough to run all the tests there.

How long does it take to run a LONG SMART test on a 10TB 5400 RPM drive?
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
I would definitely run some burn-in tests on the remaining drive! Better to know right now... :-O

Long SMART tests are read-only, so they are not destructive for the data on the drive; you can run them anytime (and you should schedule regular SMART short and long tests on all your drives if you haven't already).
But on new drives (by "new" I mean new to your system), I would definitely run more intensive burn-in tests, with at least badblocks (in write mode as well, which is destructive for the data on the drive).

You can start a SMART test from the command line:
smartctl -t long /dev/ad0
Substitute your own device name (the drive in this thread is /dev/da6). Replace long with short for... the short test (which takes a couple of minutes but is of course very limited).

badblocks can be run similarly from a terminal, but you will want to take the drive out of the pool first, otherwise TrueNAS is not going to be happy!
Look at this resource for more detail on the burn-in tests and a script.
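
For reference, the sequence described above can be written down as an ordered plan. The Python sketch below only builds and prints the commands; it does not run anything. The function name burn_in_plan and the default device are hypothetical, the 4096-byte block size is my assumption (as far as I know, badblocks needs a larger block size on drives this big to keep its block count in range), and the flags used are smartctl -t long, smartctl -a and badblocks -b/-w/-s.

```python
def burn_in_plan(device, block_size=4096):
    """Return the burn-in commands, in order, as argv lists. Nothing is executed."""
    return [
        # Starts the extended self-test and returns immediately; the test runs
        # inside the drive, so wait for it to finish before the next step.
        ["smartctl", "-t", "long", device],
        # DESTRUCTIVE write-mode test (-w) with progress output (-s).
        # Only run this on a drive that is NOT part of a pool.
        ["badblocks", "-b", str(block_size), "-ws", device],
        # Second extended self-test after the surface has been exercised.
        ["smartctl", "-t", "long", device],
        # Final report: check attributes 5, 197, 198 and the error log.
        ["smartctl", "-a", device],
    ]


# /dev/da6 is the device name from this thread; substitute your own.
for argv in burn_in_plan("/dev/da6"):
    print(" ".join(argv))
```

Keeping the plan as data makes it easy to review each command before typing it, which matters here because the badblocks step wipes the drive.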
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
How long does it take to run a LONG SMART test on a 10TB 5400 RPM drive?
smartctl tells you how long a long test should take; have a look at the line:

Extended self-test routine recommended polling time: (1120) minutes.

Yep... that's long...
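
In hours, that works out as follows. This is a small Python sketch that reads the polling time out of smartctl -a text; the name extended_test_duration is made up for the example, and the pattern only assumes the wording shown in this thread (including the case where smartctl wraps the label over two lines).

```python
import re

# Matches both the one-line and the wrapped two-line form of the label.
POLLING = re.compile(
    r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def extended_test_duration(smart_text):
    """Return (hours, minutes) for the extended self-test, or None if not found."""
    match = POLLING.search(smart_text)
    if match is None:
        return None
    return divmod(int(match.group(1)), 60)


line = "Extended self-test routine recommended polling time: (1120) minutes."
hours, minutes = extended_test_duration(line)
print(f"{hours} h {minutes:02d} min")  # 18 h 40 min
```

Note that this is the drive's own estimate for an otherwise idle disk; a test on a drive that is busy serving a pool can take longer.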

And badblocks takes even longer: in write mode, a single pass uses 4 patterns. It writes the whole disk with pattern 1, reads the whole disk back (i.e. verifies that pattern 1 was written), writes the whole disk with pattern 2, reads it back... and so on!
So if you want to do more than one pass... let's say 4 passes, you're probably in for two weeks! :-D
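
A rough way to estimate it for your own drive, as a Python sketch: the capacity is the User Capacity value from the smartctl output in this thread, but the 200 MB/s average throughput is purely an assumed figure (real drives slow down toward the inner tracks), and the function name badblocks_days is hypothetical.

```python
CAPACITY_BYTES = 10_000_831_348_736   # "User Capacity" from the smartctl output
ASSUMED_MB_PER_S = 200                # hypothetical average throughput, not measured


def badblocks_days(capacity_bytes, mb_per_s, patterns=4, runs=1):
    """Estimate days for badblocks in write mode.

    Each pattern costs one full write plus one full read of the disk,
    so one run with 4 patterns is 8 full traversals.
    """
    traversals = 2 * patterns * runs
    seconds = traversals * capacity_bytes / (mb_per_s * 1_000_000)
    return seconds / 86_400


one_run = badblocks_days(CAPACITY_BYTES, ASSUMED_MB_PER_S)
four_runs = badblocks_days(CAPACITY_BYTES, ASSUMED_MB_PER_S, runs=4)
print(f"one run: {one_run:.1f} days, four runs: {four_runs:.1f} days")
```

With the assumed throughput this comes to about 4.6 days for one run and about 18.5 days for four, so "weeks" is the right order of magnitude. Plug in your own measured speed for a better figure.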
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

GBillR

Contributor
Joined
Jun 12, 2016
Messages
189
Are the replacements refurbs? I already filled out the RMA request and UPS will pick this drive up next week to go back.
My experience with WD is that the replacements are white label drives that appear refurbished. Not 100% sure what the warranty on those is either, since I've not used one of those long enough to RMA... I personally don't trust them in my primary machine, but have a couple in my backup machine. I've never processed an RMA for a drive with a seller before. Given this is NewEgg, and assuming you bought the drive from them directly and not a 3rd-party, I would expect a new replacement if within their return window... but who knows what you'll get.
 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
What wattage is your power supply?
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
What wattage is your power supply?
I looked it up.
FSP 400W Micro ATX12V / SFX12V 80 PLUS BRONZE Certified Active PFC Power Supply with Intel Haswell Ready (FSP400-60GHS(85)-R1)

 

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319
Yep, that's your problem. You're in the same situation I was in a few months ago. After two RMAs I finally came to the realization that I was pulling too much power with my two new additional drives and an HBA card. I swapped out my puny 450W power supply for a bigger one I had lying around. That instantly solved my issue. Give that a try before you go down any more RMA roads.
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
Thanks for the recommendation. I didn't think of that. Bigger drives need more power. I also found out that WD recently admitted to lying about these 10TB Red Plus drives, which were supposed to be 5400 RPM. They are actually all 7200 RPM, at least the ones I ordered. The Western Digital WD101EFAX drives have been relabeled as WD101EFBX. Same drive.
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
So, my 2nd replaced drive also failed. That one came from NewEgg the same day as the other. Here is the SMART info:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    VCH9SVKN
LU WWN Device Id: 5 000cca 0b0d28866
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 23 14:29:43 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   87) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 993) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1296
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       339
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       339
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       38 (Min/Max 25/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 1211 hours (50 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 78 ff 43 40 00  44d+15:07:46.877  READ FPDMA QUEUED
  60 90 38 78 01 44 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 30 78 00 44 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 20 78 fe 43 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 18 78 fd 43 40 00  44d+15:07:42.306  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 887 hours (36 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 40 00 40 c1 d3 40 00  31d+03:13:01.948  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  31d+03:13:01.948  READ LOG EXT
  60 40 20 80 c1 d3 40 00  31d+03:13:00.146  READ FPDMA QUEUED
  60 40 18 80 08 f4 40 00  31d+03:13:00.146  READ FPDMA QUEUED
  60 40 10 c0 be d3 40 00  31d+03:13:00.146  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 550 hours (22 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 88 68 2e 5f 40 00  17d+01:22:37.545  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  17d+01:22:37.545  READ LOG EXT
  60 40 08 e8 1b db 40 00  17d+01:22:35.746  READ FPDMA QUEUED
  60 00 00 e8 1a db 40 00  17d+01:22:35.746  READ FPDMA QUEUED
  60 00 90 68 2f 5f 40 00  17d+01:22:35.745  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
So, my 2nd replaced drive also failed. That one came from NewEgg the same day as the other. Here is the SMART info:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    VCH9SVKN
LU WWN Device Id: 5 000cca 0b0d28866
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 23 14:29:43 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (   87) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 993) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1296
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       339
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       339
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       38 (Min/Max 25/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 1211 hours (50 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 78 ff 43 40 00  44d+15:07:46.877  READ FPDMA QUEUED
  60 90 38 78 01 44 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 30 78 00 44 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 20 78 fe 43 40 00  44d+15:07:42.306  READ FPDMA QUEUED
  60 00 18 78 fd 43 40 00  44d+15:07:42.306  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 887 hours (36 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 40 00 40 c1 d3 40 00  31d+03:13:01.948  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  31d+03:13:01.948  READ LOG EXT
  60 40 20 80 c1 d3 40 00  31d+03:13:00.146  READ FPDMA QUEUED
  60 40 18 80 08 f4 40 00  31d+03:13:00.146  READ FPDMA QUEUED
  60 40 10 c0 be d3 40 00  31d+03:13:00.146  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 550 hours (22 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 41 00 00 00 00 00  Error: IDNF at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 88 68 2e 5f 40 00  17d+01:22:37.545  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  17d+01:22:37.545  READ LOG EXT
  60 40 08 e8 1b db 40 00  17d+01:22:35.746  READ FPDMA QUEUED
  60 00 00 e8 1a db 40 00  17d+01:22:35.746  READ FPDMA QUEUED
  60 00 90 68 2f 5f 40 00  17d+01:22:35.745  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
(I edited your post, adding CODE tags to make it easier to read.)

This drive doesn't show any of the usual leading indicators of failure. These key SMART values are all zero:
Code:
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
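
If you want to repeat that check across several drives, here is a minimal Python sketch. It assumes the input is the attribute table as `smartctl -A` prints it, with the raw value in the tenth column. The function name `nonzero_key_attributes` and the choice of IDs 5, 196, 197 and 198 are just illustration based on the rows quoted above, not an official health test.

```python
# Sketch: report any of the four "leading indicator" attributes whose raw value is non-zero.
# Assumes the text is the attribute table in the layout smartctl prints.
KEY_IDS = {5, 196, 197, 198}

def nonzero_key_attributes(table_text):
    """Return {attribute_name: raw_value} for key attributes with a non-zero raw value."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in KEY_IDS and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged

sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""
print(nonzero_key_attributes(sample))  # prints {} for this drive
```

An empty result only means none of these four counters has moved; it does not prove the drive is healthy.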

The ATA errors you're getting can be caused by controller hiccups, bad cables, etc. I have drives with similar ATA errors that have been running for over 30,000 hours and are still going strong.
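
For reference, all three entries in the error log show `ER = 10` and `ST = 41`. The sketch below decodes those two registers bit by bit. The bit names are my understanding of the ATA register layout, obsolete bits are left out, and the `decode` helper is only for illustration. smartctl already prints the same `IDNF` label next to each error.

```python
# Sketch: decode the ER (error) and ST (status) register values from the SMART error log.
# Bit names are assumed from the ATA command set; obsolete bits are omitted.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

def decode(value_hex, bit_names):
    """Return the names of all bits set in a two-digit hex register value."""
    value = int(value_hex, 16)
    return [name for mask, name in bit_names.items() if value & mask]

print(decode("10", ERROR_BITS))   # ['IDNF']
print(decode("41", STATUS_BITS))  # ['DRDY', 'ERR']
```

So each logged error is an "ID not found" on a queued read, reported with the drive ready and the error bit set.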

This disk has never had any tests run on it... I recommend that you schedule regular SMART tests on your disks; this helps immensely in proactively identifying and replacing failing disks.
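
As a rough sketch of what running a test by hand looks like, the Python below builds the two smartctl commands involved: `-t long` starts an extended self-test and `-l selftest` prints the self-test log once it has finished. The device name `/dev/da6` is taken from the first post. The helper names are made up for illustration, and nothing is executed unless you call `run` yourself as root.

```python
import subprocess

def selftest_command(device, kind="long"):
    """Build the smartctl command that starts a 'short' or 'long' self-test."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]

def selftest_log_command(device):
    """Build the smartctl command that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]

def run(cmd):
    """Run a command and return its text output (not called in this sketch)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

print(" ".join(selftest_command("/dev/da6")))      # smartctl -t long /dev/da6
print(" ".join(selftest_log_command("/dev/da6")))  # smartctl -l selftest /dev/da6
```

The test runs inside the drive in the background, so the log only shows a result after it completes. For a recurring schedule it is probably simpler to use the S.M.A.R.T. test scheduler in the FreeNAS web UI than a script like this.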
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
I have six of these drives here that I don't trust, because two of them have been kicked from the pool.

I have parts for a Ryzen Photoshop workstation that hasn't yet been built. I could put that together, throw FreeNAS on that and test all of these drives on a separate computer before starting to swap them into my production machine to expand the pool.

I have a new power supply for my FreeNAS production server that I haven't put in yet. Should I order all new SATA cables and swap them?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'll second the cables... Standard SATA cables are only rated for maybe 50 insert/remove events. I've had drives get kicked out of the pool with bad cables, and I have had drives with single LBA errors that ran for nearly 50k hours. I'd run an extended SMART test, use new cables and see what happens. Pay attention to cable stress, get 90 deg plugs if needed. The 90 deg locking cables made a world of difference for my setup.
 

kirkdickinson

Contributor
Joined
Jun 29, 2015
Messages
174
I can swap them out, but these cables have probably had only one insert event, when the case was built. There are hot-swap cages on the front of the case.
 