New build, new HDD, hardware errors on 3/8 drives (so far)

Dimand

Cadet
Joined
Feb 10, 2023
Messages
5
Hi,

I have recently repurposed an older PC as a NAS; however, I have been getting a lot of errors.
There is no important data on this system. It is still in the testing phase.

Hardware:
CPU: i7-3820
Memory: 64GB, DDR3-1600 CL10
MOBO: Asus Sabertooth X79
Controller: H310 RAID Controller Card
PSU: Cooler Master Silent Pro Hybrid 850 W 80+ Gold
Storage: 8x ST16000NM001G-2KK103
Boot Storage: ST320LT007-9ZV142

Everything has direct airflow and is running at a good temperature. I initially installed TrueNAS CORE but migrated to TrueNAS SCALE.
The system has been running for about a week with minimal load, and so far three drives have started having issues.

HDD: ZL2AM7B3
Code:
root@truenas[~]# smartctl -a /dev/da4
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X16
Device Model:     ST16000NM001G-2KK103
Serial Number:    ZL2AM7B3
LU WWN Device Id: 5 000c50 0c7bfca19
Firmware Version: SN04
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb  7 00:33:08 2023 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1435) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   066   064   044    Pre-fail  Always       -       4297554
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   045    Pre-fail  Always       -       437840
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       169
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   052   040    Old_age   Always       -       30 (Min/Max 28/48)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2020
194 Temperature_Celsius     0x0022   030   048   000    Old_age   Always       -       30 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       32
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       32
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28h+44m+11.203s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4292756
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4798

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       134         4178936
# 2  Extended offline    Completed: read failure       90%       134         4178936
# 3  Conveyance offline  Completed without error       00%       134         -
# 4  Extended offline    Completed: read failure       90%       133         4178936
# 5  Short offline       Completed without error       00%       133         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


HDD: ZL26AYKJ
Code:
root@truenas[/dev]# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X16
Device Model:     ST16000NM001G-2KK103
Serial Number:    ZL26AYKJ
LU WWN Device Id: 5 000c50 0c63a8746
Firmware Version: SN04
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb  8 21:57:14 2023 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1441) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   070   064   044    Pre-fail  Always       -       9288806
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   045    Pre-fail  Always       -       9299644
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       214
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   060   040    Old_age   Always       -       34 (Min/Max 29/35)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2237
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53h+04m+19.441s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       7992572
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1296234

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       214         138458528
# 2  Conveyance offline  Completed without error       00%       214         -
# 3  Short offline       Completed without error       00%       214         -
# 4  Extended offline    Completed without error       00%       155         -
# 5  Short offline       Completed without error       00%       133         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


HDD: ZL235TEN
Code:
root@truenas[~]# smartctl -a /dev/sdf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X16
Device Model:     ST16000NM001G-2KK103
Serial Number:    ZL235TEN
LU WWN Device Id: 5 000c50 0c47df593
Firmware Version: SN04
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Feb 11 08:46:51 2023 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1459) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   057   057   044    Pre-fail  Always       -       57438431
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2368
  7 Seek_Error_Rate         0x000f   073   060   045    Pre-fail  Always       -       19789590
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       273
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   016   016   000    Old_age   Always       -       84
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       65537
190 Airflow_Temperature_Cel 0x0022   064   051   040    Old_age   Always       -       36 (Min/Max 32/42)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2268
194 Temperature_Celsius     0x0022   036   049   000    Old_age   Always       -       36 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   092   092   000    Old_age   Always       -       4632
198 Offline_Uncorrectable   0x0010   092   092   000    Old_age   Offline      -       4632
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       116h+14m+38.644s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       52778755
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4659676

SMART Error Log Version: 1
ATA Error Count: 84 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 84 occurred at disk power-on lifetime: 273 hours (11 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 a0 81 88 0d  Error: WP at LBA = 0x0d8881a0 = 227049888

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 28 18 da 10 43 00   2d+09:46:47.693  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:47.311  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:47.310  WRITE FPDMA QUEUED
  61 00 10 90 02 40 40 00   2d+09:46:47.309  WRITE FPDMA QUEUED
  60 00 10 ff ff ff 4f 00   2d+09:46:47.309  READ FPDMA QUEUED

Error 83 occurred at disk power-on lifetime: 273 hours (11 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 81 88 0d  Error: UNC at LBA = 0x0d888178 = 227049848

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 c8 81 88 4d 00   2d+09:46:44.582  READ FPDMA QUEUED
  60 00 28 98 81 88 4d 00   2d+09:46:44.489  READ FPDMA QUEUED
  60 00 30 68 81 88 4d 00   2d+09:46:44.489  READ FPDMA QUEUED
  60 00 20 90 7c 88 4d 00   2d+09:46:44.487  READ FPDMA QUEUED
  60 00 30 60 7c 88 4d 00   2d+09:46:44.487  READ FPDMA QUEUED

Error 82 occurred at disk power-on lifetime: 273 hours (11 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 e8 74 88 0d  Error: WP at LBA = 0x0d8874e8 = 227046632

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 10 50 da 10 43 00   2d+09:46:41.992  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:39.671  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:39.671  WRITE FPDMA QUEUED
  61 00 10 90 02 40 40 00   2d+09:46:39.670  WRITE FPDMA QUEUED
  60 00 10 ff ff ff 4f 00   2d+09:46:39.670  READ FPDMA QUEUED

Error 81 occurred at disk power-on lifetime: 273 hours (11 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d0 74 88 0d  Error: WP at LBA = 0x0d8874d0 = 227046608

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 10 ff ff ff 4f 00   2d+09:46:36.830  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:36.829  WRITE FPDMA QUEUED
  61 00 10 90 02 40 40 00   2d+09:46:36.828  WRITE FPDMA QUEUED
  60 00 10 ff ff ff 4f 00   2d+09:46:36.827  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00   2d+09:46:36.827  READ FPDMA QUEUED

Error 80 occurred at disk power-on lifetime: 273 hours (11 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 a8 74 88 0d  Error: UNC at LBA = 0x0d8874a8 = 227046568

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 30 88 74 88 4d 00   2d+09:46:34.075  READ FPDMA QUEUED
  60 00 28 60 74 88 4d 00   2d+09:46:34.062  READ FPDMA QUEUED
  ea 00 00 00 00 00 00 00   2d+09:46:34.024  FLUSH CACHE EXT
  61 00 10 40 da 10 43 00   2d+09:46:34.006  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00   2d+09:46:34.006  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       273         218987208
# 2  Short offline       Completed without error       00%       273         -
# 3  Extended offline    Completed without error       00%       238         -
# 4  Extended offline    Interrupted (host reset)      00%       215         -
# 5  Extended offline    Completed without error       00%       156         -
# 6  Short offline       Completed without error       00%       133         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



I have left the last drive in the system while a camera records to the pool, and ZFS is reporting errors.
Of note, the 2nd and 3rd drives passed extended tests before they failed.
This failure rate seems very high. Perhaps it is a bad batch of drives, but perhaps I am also missing something.
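
For anyone skimming the SMART dumps above, here is a minimal Python sketch that pulls out the attributes that differ between the failing drives. The set of watched IDs (5, 187, 197, 198) is my own assumption about which counters matter here, not an official list, and `flag_attributes` is just a hypothetical helper name. The sample rows are copied from the ZL235TEN output.

```python
# Sketch: flag SMART attributes whose raw value is non-zero.
# WATCHED_IDS is an assumption about which attributes are worth watching,
# not an official list.
WATCHED_IDS = {5, 187, 197, 198}


def flag_attributes(smart_table: str) -> dict[str, int]:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the ZL235TEN smartctl output above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2368
187 Reported_Uncorrect      0x0032   016   016   000    Old_age   Always       -       84
197 Current_Pending_Sector  0x0012   092   092   000    Old_age   Always       -       4632
198 Offline_Uncorrectable   0x0010   092   092   000    Old_age   Offline      -       4632
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

flagged = flag_attributes(sample)
for name, raw in flagged.items():
    print(f"{name}: {raw}")
```

The same function could be fed the attribute table from each of the eight drives to compare them side by side.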

Is there a recommended burn-in test I can put the rest of these drives through? At this point my plan is to wipe the whole thing and start fresh with replacement drives. However, I want to be sure it's not some other issue, and I would also like to increase my confidence that the remaining 5 drives are not going to fail in the next month.

Thanks for any help.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't know what is going on, but I am highly suspicious of the Load_Cycle_Count on those 3 drives. All are around 2,200 while the actual Power_On_Hours is around 200. This seems to indicate that the drives are auto-parking on idle. On a NAS, that can be a problem.

You might check your other 5 drives and see what they have for Load_Cycle_Count.
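
To put rough numbers on that, here is a small Python sketch that divides Load_Cycle_Count by Power_On_Hours for the three drives posted so far. The raw values are copied from the SMART output above; `load_cycles_per_hour` is just a hypothetical helper name, and the result is only a lifetime average, not a measurement of the actual parking interval.

```python
# (serial, Load_Cycle_Count raw, Power_On_Hours raw), copied from the
# smartctl output posted above.
drives = [
    ("ZL2AM7B3", 2020, 169),
    ("ZL26AYKJ", 2237, 214),
    ("ZL235TEN", 2268, 273),
]


def load_cycles_per_hour(load_cycles: int, power_on_hours: int) -> float:
    """Average number of head load cycles per power-on hour."""
    return load_cycles / power_on_hours


rates = {serial: load_cycles_per_hour(lcc, poh) for serial, lcc, poh in drives}
for serial, rate in rates.items():
    print(f"{serial}: {rate:.1f} load cycles per power-on hour")
```

With these values the average works out to roughly 8 to 12 load cycles per power-on hour on each drive.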


Please post the output of zpool status in code tags. That should show us the exact errors (Read, Write, Checksum).
 

Dimand

Cadet
Joined
Feb 10, 2023
Messages
5
Output from one of the other drives.
Code:
root@truenas[~]# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X16
Device Model:     ST16000NM001G-2KK103
Serial Number:    ZL20WP1P
LU WWN Device Id: 5 000c50 0c30fe000
Firmware Version: SN04
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Feb 11 14:48:26 2023 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1499) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       64228026
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   045    Pre-fail  Always       -       20252801
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       279
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   057   040    Old_age   Always       -       33 (Min/Max 31/38)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2209
194 Temperature_Celsius     0x0022   033   043   000    Old_age   Always       -       33 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       117h+59m+12.334s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       53908314
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       10319720

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       239         -
# 2  Extended offline    Interrupted (host reset)      00%       215         -
# 3  Extended offline    Completed without error       00%       156         -
# 4  Short offline       Completed without error       00%       133         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


zpool status
Code:
root@truenas[~]# zpool status
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:23 with 0 errors on Tue Feb  7 03:45:23 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdg2      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 108K in 00:03:05 with 26 errors on Sat Feb 11 08:42:55 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        tank                                      DEGRADED     0     0     0
          raidz2-0                                DEGRADED   370    29     0
            213b022e-a099-11ed-bc18-5404a63dc9c8  ONLINE       0     0     0
            21040ca9-a099-11ed-bc18-5404a63dc9c8  ONLINE       0     0     0
            211f7e3d-a099-11ed-bc18-5404a63dc9c8  ONLINE       0     0     0
            15269526673564211540                  UNAVAIL      0     0     0  was /dev/disk/by-partuuid/21288b25-a099-11ed-bc18-5404a63dc9c8
            5509422462999954482                   UNAVAIL      0     0     0  was /dev/gptid/21172af5-a099-11ed-bc18-5404a63dc9c8
            21307d02-a099-11ed-bc18-5404a63dc9c8  ONLINE       0     0     0
            20efd339-a099-11ed-bc18-5404a63dc9c8  ONLINE       0     0     0
            210e1d1a-a099-11ed-bc18-5404a63dc9c8  DEGRADED   464    30     0  too many errors

errors: 15 data errors, use '-v' for a list


-v simply lists the files that have errors. They are all camera recordings, as that's most of what has been written.
I removed the unavailable disks to organise an RMA, but I am holding off on that until I sort this out.
Load_Cycle_Count is also in the 2k range, even for the drives with no errors. Everything is set to default for both the drives and the OS, but apparently this is an issue.
Good catch @Arwen.
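To compare one attribute across drives without reading every full report, a small filter like the one below might help. It is only a sketch: raw_attr is a name made up for illustration, and it assumes the ten-column attribute table that smartctl -A prints, with the raw value in the tenth column. The sample line is copied from the smartctl output above.

```shell
#!/bin/sh
# raw_attr NAME: read `smartctl -A` output on stdin and print the raw value
# (10th whitespace-separated column) of the attribute called NAME.
# Hypothetical helper, not part of smartmontools.
raw_attr() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Demo on a line taken from the output earlier in this post.
printf '%s\n' '241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       53908314' \
    | raw_attr Total_LBAs_Written
# prints 53908314

# Intended use on a live system (device names will differ):
#   smartctl -A /dev/sda | raw_attr Load_Cycle_Count
```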

Possible solution: first, scan for drives with openSeaChest_Info --scan
Code:
root@truenas[~]# openSeaChest_Info --scan
==========================================================================================
 openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2022 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_Info Version: 2.0.0-2_2_1 X86_64
 Build Date: Dec 13 2022
 Today: Sat Feb 11 15:29:57 2023        User: root
==========================================================================================
Vendor   Handle       Model Number            Serial Number          FwRev
AHCI     /dev/sg0     SGPIO Enclosure                                2.00
ATA      /dev/sg1     ST16000NM001G-2KK103    ZL20WP1P               SN04
ATA      /dev/sg2     ST16000NM001G-2KK103    ZL2NNC3W               SN04
ATA      /dev/sg3     ST16000NM001G-2KK103    ZL2EZA32               SN04
ATA      /dev/sg4     ST16000NM001G-2KK103    ZL2165P9               SN04
ATA      /dev/sg5     ST16000NM001G-2KK103    ZL2LAGAE               SN04
ATA      /dev/sg6     ST16000NM001G-2KK103    ZL235TEN               SN04
ATA      /dev/sg7     ST320LT007-9ZV142       W0Q8D0NE               0005DEM1
ATA      /dev/sg8     MARVELL VIRTUALL                               1.09

root@truenas[~]#

Next, check the default power settings with openSeaChest_PowerControl -d /dev/sg1 --showEPCSettings
Code:
root@truenas[~]# openSeaChest_PowerControl -d /dev/sg1 --showEPCSettings
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2022 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.0.1-2_2_1 X86_64
 Build Date: Dec 13 2022
 Today: Sat Feb 11 15:34:14 2023        User: root
==========================================================================================

/dev/sg1 - ST16000NM001G-2KK103 - ZL20WP1P - ATA
.

===EPC Settings===
        * = timer is enabled
        C column = Changeable
        S column = Savable
        All times are in 100 milliseconds

Name       Current Timer Default Timer Saved Timer   Recovery Time C S
Idle A     *1            *1            *1            1             Y Y
Idle B     *1200         *1200         *1200         4             Y Y
Idle C      0             6000          6000         20            Y Y
Standby Z   0             9000          9000         110           Y Y

Idle B is likely the culprit, as it unloads the heads.
Let's fix this with:
openSeaChest_PowerControl -d /dev/sg1 --powerBalanceFeature disable
openSeaChest_PowerControl -d /dev/sg1 --EPCfeature disable
Repeat for all drives as needed.
Hopefully that's helpful to someone.
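To save retyping, the two commands could be wrapped in a loop over the drive handles. This is a sketch rather than a tested procedure: the function names and the DRY_RUN switch are made up for illustration, and it passes a single disable to --EPCfeature, which I believe is what that option expects. By default it only prints the commands so they can be checked first.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN is 1 (the default),
# otherwise execute it. Illustrative helper, not part of openSeaChest.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# disable_idle_timers HANDLE...: apply both power-setting changes
# from this post to every handle given.
disable_idle_timers() {
    for dev in "$@"; do
        run openSeaChest_PowerControl -d "$dev" --powerBalanceFeature disable
        run openSeaChest_PowerControl -d "$dev" --EPCfeature disable
    done
}

# Handles of the six Exos drives listed in the --scan output above.
disable_idle_timers /dev/sg1 /dev/sg2 /dev/sg3 /dev/sg4 /dev/sg5 /dev/sg6
```

Setting DRY_RUN=0 in the environment makes it run the commands for real. The sg numbering can change between boots, so it is worth re-running the scan before applying anything.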


I'm not sure this Load_Cycle_Count explains the three dead drives, however. From what I can find online, drives should be able to survive hundreds of thousands of load cycles. I wonder if there are any other problematic default configs that need fixing.
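As a rough sanity check on that doubt, the rate can be extrapolated. The sketch below assumes roughly 2,000 load cycles accumulated over roughly 239 power-on hours (the lifetime figure from the self-test log above), and uses 100,000 cycles as a deliberately pessimistic limit. The drive's real rating should come from Seagate's datasheet, and the function name is made up for illustration.

```shell
#!/bin/sh
# hours_to_limit CYCLES HOURS LIMIT
# Linear extrapolation: total power-on hours needed to reach LIMIT load
# cycles if they keep accumulating at CYCLES per HOURS. Integer math only.
hours_to_limit() {
    echo $(( $3 * $2 / $1 ))
}

# Approximate inputs: ~2000 cycles in ~239 hours, assumed limit of 100000.
total=$(hours_to_limit 2000 239 100000)
echo "about $total power-on hours, or $(( total / 24 )) days"
# prints: about 11950 power-on hours, or 497 days
```

At that rate even the pessimistic limit is more than a year away, which fits the suspicion that load cycling alone does not explain failures this early.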
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Irrespective of the high load cycle count, "pending" and "offline_uncorrectable" are genuine hardware failures, so you have five drives to RMA. Bad batch indeed.

The "H310 RAID card" is a further potential issue for ZFS if not flashed to IT mode.
 

Dimand

Cadet
Joined
Feb 10, 2023
Messages
5
Irrespective of the high load cycle count, "pending" and "offline_uncorrectable" are genuine hardware failures, so you have five drives to RMA. Bad batch indeed.

The "H310 RAID card" is a further potential issue for ZFS if not flashed to IT mode.
Can you show where there are 5 drives with errors? I could only see issues in the SMART outputs on three of them.
 

Dimand

Cadet
Joined
Feb 10, 2023
Messages
5
Also just to clarify about IT mode. It appears everything is set up correctly?


Code:
root@truenas[~]# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:02:00:00
        SAS Address                    : 5d4ae52-0-acaa-a105
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.11.10.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : 6Gbps SAS HBA
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Correct.
As for the drives, I counted the two 'UNAVAILABLE' in zpool status and the three from the first post with bad sectors as a total of five faulty drives.
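For tallying this kind of thing, a filter along these lines could list every row of zpool status that is not ONLINE. It is only a sketch: not_online is a made-up name, the list of states is the set I would expect to matter here, and the pool and vdev rows show up alongside the individual disks, so they have to be discounted by eye.

```shell
#!/bin/sh
# not_online: read `zpool status` output on stdin and print "NAME STATE"
# for each config row whose state is not ONLINE. Requiring at least five
# fields (NAME STATE READ WRITE CKSUM) skips the "state: DEGRADED" header.
# Hypothetical helper, not part of ZFS.
not_online() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1, $2 }'
}

# Intended use on a live system:
#   zpool status tank | not_online
```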
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I have also had a similar ratio of defective drives from the same model. The RMA was always smooth, but I would make a good backup and a high level of redundancy a priority, even more than usual.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Is there a recommended burn in test I can put the rest of these drives through?
Yes:

There are other ways to do it, and it isn't really using badblocks for its intended purpose, but it's simple and does the job, both of which are qualities I like.
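One common shape for a badblocks-based burn-in is sketched below; it is not necessarily identical to the procedure referred to above. The function name and /dev/sdX are placeholders. The 4096-byte block size is an assumption chosen because, as far as I know, badblocks cannot address a 16 TB disk with its default block size. The function only prints the commands, since each SMART test has to finish before the next step starts and badblocks -w wipes the disk.

```shell
#!/bin/sh
# burn_in_plan DEVICE: print a burn-in command sequence for one drive.
# Nothing is executed; the commands are echoed for review.
#   1. short SMART self-test (wait for it to complete)
#   2. badblocks destructive write/read test with progress display
#      WARNING: -w overwrites every block on the device
#   3. extended SMART self-test (wait for it to complete)
#   4. full SMART report to check for pending or reallocated sectors
burn_in_plan() {
    cat <<EOF
smartctl -t short $1
badblocks -b 4096 -ws $1
smartctl -t long $1
smartctl -a $1
EOF
}

burn_in_plan /dev/sdX
```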

You need to set up a regular schedule of SMART self-tests--the history you've posted indicates you're only running them on an ad hoc basis.
 

Dimand

Cadet
Joined
Feb 10, 2023
Messages
5
Correct.
As for the drives, I counted the two 'UNAVAILABLE' in zpool status and the three from the first post with bad sectors as a total of five faulty drives.
Ah, just for clarity the two 'UNAVAILABLE' are two of the original drives from the first post. Thanks for checking my sas2flash details.
I have also had a similar ratio of defective drives from the same model. The RMA was always smooth, but I would make a good backup and a high level of redundancy a priority, even more than usual.
Glad to know I'm not alone. I guess at least I can say they were cheap.
Yes:

There are other ways to do it, and it isn't really using badblocks for its intended purpose, but it's simple and does the job, both of which are qualities I like.

You need to set up a regular schedule of SMART self-tests--the history you've posted indicates you're only running them on an ad hoc basis.
Thanks, I am running badblocks now on the 5 error-free drives and 1 with errors, for curiosity's sake. Looks like it will take ~5 days. I'll make up a cron job for self-tests once I put the server into production.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I'll make up a cron job for self tests once I put the server into production.
No need for a cron job; these can (and should) be scheduled through the web GUI.
 