Pool keeps degrading

Neo2199 · Nov 12, 2022

I am having an issue where my freshly built array keeps degrading. Most of the time I get write errors, but sometimes read errors as well.
This is my first interaction with TrueNAS, but I have done some research and can't figure this one out.

The most similar problem is described in this thread, yet it is different.

Setup:
OS: TrueNAS-SCALE-22.02.4
CPU: Intel(R) Pentium(R) G4500 @3.5 GHz
MB: Gigabyte Z170N-Gaming 5
RAM: 8GB (2x) Corsair 4GB DDR4 2400MHz
HDDS:
- 3x 6TB WD Blue WD60EZRZ (from 2016)
- 1x 6TB WD Red WD60EFAX brand new
- 1x Kingston 120GB SSD SA400S37/120G
ZFS: RAIDZ1 with the above drives
PSU: PicoPSU-80-WI-32V

History:
This is the NAS setup I have used since 2016, just with Windows and motherboard RAID. It might not be the perfect solution, but it worked for my needs. Let's put it that way ;)
My UPS failed (of course my write cache was set not to wait for write completion), 1 of 4 WD drives failed, my array fell apart, and I spent the past month or so with data recovery software. I managed to recover everything, but I came very close to never seeing my data again.
I have since replaced the batteries, got a new drive, and ran SMART and surface tests on the remaining old drives.

What Happens now:
Fresh install of TrueNAS SCALE. Then I create a default pool and import my 13TB of data from a backup drive. After a few hours or a day I come back and the array is degraded. Most often the degraded drive is the brand new WD Red.

What have I tried:
I have replaced the reported faulty drive with a spare 5TB drive.
I have replaced all SATA cables.
I have replaced the whole computer.

What appears to be helping:
Now I have TrueNAS CORE installed and so far so good. It has only been 12 hours, though...
Interestingly, my speeds are half of what I got on TrueNAS SCALE.

I am also thinking that some power saving feature might be messing things up here, but I have all of the power saving features disabled. That also would not explain why the same setup with just a different TrueNAS version would help.

My logs:

Code:

root@NAS[~]# smartctl -x /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (SMR)
Device Model:     WDC WD60EFAX-68JH4N1
Serial Number:    WD-WXB2DA1R5TLE
LU WWN Device Id: 5 0014ee 2bf94f8ca
Firmware Version: 83.00A83
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 11 07:33:57 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   225   224   021    -    3741
  4 Start_Stop_Count        -O--CK   100   100   000    -    17
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    170
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    15
192 Power-Off_Retract_Count -O--CK   200   200   000    -    6
193 Load_Cycle_Count        -O--CK   200   200   000    -    35
194 Temperature_Celsius     -O---K   117   109   000    -    33
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0


SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 103 (device log contains only the most recent 24 errors)

Error 103 [6] occurred at disk power-on lifetime: 163 hours (6 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 40 04 70 40 00  Error: IDNF at LBA = 0x00400470 = 4195440

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 70 00 00 fd 41 c9 d0 40 08     04:56:25.132  READ FPDMA QUEUED
  61 00 08 00 68 00 02 ba a0 f4 70 40 08     04:56:25.017  WRITE FPDMA QUEUED
  61 00 08 00 60 00 02 ba a0 f2 70 40 08     04:56:25.016  WRITE FPDMA QUEUED
  61 00 08 00 58 00 00 00 40 04 70 40 08     04:56:25.010  WRITE FPDMA QUEUED
  61 00 08 00 50 00 00 00 40 02 70 40 08     04:56:25.010  WRITE FPDMA QUEUED

Error 102 [5] occurred at disk power-on lifetime: 163 hours (6 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 fd 7c 8e 30 40 00  Error: IDNF at LBA = 0xfd7c8e30 = 4252798512

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 50 00 00 03 c0 2a 38 40 08     04:56:09.354  READ FPDMA QUEUED
  60 00 08 00 48 00 01 1e e6 ae 48 40 08     04:56:08.671  READ FPDMA QUEUED
  60 00 08 00 40 00 01 1d f9 2a 08 40 08     04:56:08.669  READ FPDMA QUEUED
  60 00 08 00 38 00 00 fd 41 cb e0 40 08     04:56:08.669  READ FPDMA QUEUED
  61 07 c0 00 30 00 00 fd 7c 95 e8 40 08     04:56:08.669  WRITE FPDMA QUEUED

Error 101 [4] occurred at disk power-on lifetime: 163 hours (6 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 fd 7c 85 68 40 00  Error: IDNF at LBA = 0xfd7c8568 = 4252796264

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 10 00 00 00 02 ba a0 f2 90 40 08     04:56:00.465  READ FPDMA QUEUED
  60 00 10 00 f8 00 02 ba a0 f0 90 40 08     04:56:00.465  READ FPDMA QUEUED
  60 00 10 00 b8 00 00 00 40 02 90 40 08     04:56:00.465  READ FPDMA QUEUED
  61 07 c0 00 b0 00 00 fd 7c 86 70 40 08     04:56:00.465  WRITE FPDMA QUEUED
  61 07 b8 00 90 00 00 fd 7c 8e 30 40 08     04:56:00.465  WRITE FPDMA QUEUED

Error 100 [3] occurred at disk power-on lifetime: 163 hours (6 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 fd 7c 82 50 40 00  Error: IDNF at LBA = 0xfd7c8250 = 4252795472

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 08 00 00 fd 41 ca 90 40 08     04:55:54.512  READ FPDMA QUEUED
  60 00 08 00 00 00 01 18 5f 3a c0 40 08     04:55:54.052  READ FPDMA QUEUED
  60 00 18 00 f8 00 00 fd 41 c2 68 40 08     04:55:54.040  READ FPDMA QUEUED
  61 00 80 00 a0 00 00 fd 7c 85 f0 40 08     04:55:52.265  WRITE FPDMA QUEUED
  61 00 80 00 98 00 00 fd 7c 85 68 40 08     04:55:52.263  WRITE FPDMA QUEUED

Error 99 [2] occurred at disk power-on lifetime: 152 hours (6 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 00 00 84 40 00  Error: IDNF at LBA = 0x00000084 = 132

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 01 00 18 00 00 00 00 00 86 40 08     04:03:39.619  WRITE FPDMA QUEUED
  61 00 01 00 10 00 00 00 00 00 85 40 08     04:03:39.619  WRITE FPDMA QUEUED
  61 00 01 00 08 00 00 00 00 00 84 40 08     04:03:39.619  WRITE FPDMA QUEUED
  61 00 01 00 00 00 00 00 00 00 83 40 08     04:03:39.619  WRITE FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     04:03:39.614  SET FEATURES [Enable SATA feature]

Error 98 [1] occurred at disk power-on lifetime: 152 hours (6 days + 8 hours)
  10 -- 51 00 00 00 00 00 00 00 86 40 00  Error: IDNF at LBA = 0x00000086 = 134


Error 97 [0] occurred at disk power-on lifetime: 152 hours (6 days + 8 hours)
  10 -- 51 00 00 00 00 00 00 00 83 40 00  Error: IDNF at LBA = 0x00000083 = 131


Error 96 [23] occurred at disk power-on lifetime: 152 hours (6 days + 8 hours)
  10 -- 51 00 00 00 00 00 00 00 87 40 00  Error: IDNF at LBA = 0x00000087 = 135


SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       160         -
# 2  Extended offline    Completed without error       00%        98         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        DST executing in background (3)
Current Temperature:                    33 Celsius
Power Cycle Min/Max Temperature:     32/34 Celsius
Lifetime    Min/Max Temperature:     26/41 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              15  ---  Lifetime Power-On Resets
0x01  0x010  4             170  ---  Power-on Hours
0x01  0x018  6     26397621528  ---  Logical Sectors Written
0x01  0x020  6        97102774  ---  Number of Write Commands
0x01  0x028  6      1103767073  ---  Logical Sectors Read
0x01  0x030  6         2038117  ---  Number of Read Commands
0x01  0x038  6       612000000  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4             170  ---  Spindle Motor Power-on Hours
0x03  0x010  4             116  ---  Head Flying Hours
0x03  0x018  4              42  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4               6  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4             103  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              33  ---  Current Temperature
0x05  0x010  1              34  ---  Average Short Term Temperature
0x05  0x018  1               -  ---  Average Long Term Temperature
0x05  0x020  1              41  ---  Highest Temperature
0x05  0x028  1              28  ---  Lowest Temperature
0x05  0x030  1              36  ---  Highest Average Short Term Temperature
0x05  0x038  1              32  ---  Lowest Average Short Term Temperature
0x05  0x040  1               -  ---  Highest Average Long Term Temperature
0x05  0x048  1               -  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4              55  ---  Number of Hardware Resets
0x06  0x010  4              28  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            8  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            9  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        43309  Vendor specific

root@NAS[~]#

Code:

root@NAS[~]# smartctl -x /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD60EZRZ-00GZ5B1
Serial Number:    WD-WXJ1H26LX894
LU WWN Device Id: 5 0014ee 20dd74b72
Firmware Version: 80.00A80
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 11 07:35:36 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   186   186   021    -    9666
  4 Start_Stop_Count        -O--CK   084   084   000    -    16207
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23538
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    228
192 Power-Off_Retract_Count -O--CK   200   200   000    -    123
193 Load_Cycle_Count        -O--CK   120   120   000    -    240509
194 Temperature_Celsius     -O---K   112   095   000    -    40
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23528         -
# 2  Extended offline    Completed without error       00%     23465         -

Code:

root@NAS[~]# smartctl -x /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD60EZRZ-00GZ5B1
Serial Number:    WD-WXJ1H26SMCYJ
LU WWN Device Id: 5 0014ee 20dd767bd
Firmware Version: 80.00A80
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 11 07:36:09 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   197   196   021    -    9125
  4 Start_Stop_Count        -O--CK   084   084   000    -    16189
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23545
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    222
192 Power-Off_Retract_Count -O--CK   200   200   000    -    119
193 Load_Cycle_Count        -O--CK   122   122   000    -    235715
194 Temperature_Celsius     -O---K   115   101   000    -    37
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23535         -
# 2  Extended offline    Completed without error       00%     23472         -
# 3  Extended offline    Aborted by host               90%         0         -

Code:

root@NAS[~]# smartctl -x /dev/sde
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD60EZRZ-00RWYB1
Serial Number:    WD-WX21DB5096DA
LU WWN Device Id: 5 0014ee 2631e3d7f
Firmware Version: 80.00A80
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 11 07:36:50 2022 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   199   051    -    1
  3 Spin_Up_Time            POS--K   249   199   021    -    6508
  4 Start_Stop_Count        -O--CK   084   084   000    -    16193
  5 Reallocated_Sector_Ct   PO--CK   195   195   140    -    170
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   066   066   000    -    25245
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    231
192 Power-Off_Retract_Count -O--CK   200   200   000    -    124
193 Load_Cycle_Count        -O--CK   118   118   000    -    247562
194 Temperature_Celsius     -O---K   110   098   000    -    42
196 Reallocated_Event_Count -O--CK   196   058   000    -    4
197 Current_Pending_Sector  -O--CK   200   200   000    -    1
198 Offline_Uncorrectable   ----CK   200   200   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    1
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    20

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 133 (device log contains only the most recent 24 errors)

Error 133 [12] occurred at disk power-on lifetime: 25222 hours (1050 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 02 e5 f7 d8 40 00  Error: UNC at LBA = 0x02e5f7d8 = 48625624

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e8 00 b8 00 00 02 e5 fe d8 40 08  1d+00:36:56.472  READ FPDMA QUEUED
  60 07 e8 00 b0 00 00 02 e5 f6 f0 40 08  1d+00:36:56.464  READ FPDMA QUEUED
  60 07 e8 00 a8 00 00 02 e5 ef 08 40 08  1d+00:36:56.447  READ FPDMA QUEUED
  60 07 e8 00 a0 00 00 02 e5 e7 20 40 08  1d+00:36:56.442  READ FPDMA QUEUED
  60 07 e8 00 98 00 00 02 e5 df 38 40 08  1d+00:36:56.425  READ FPDMA QUEUED

Error 132 [11] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 ba a0 f4 af 40 00  Error: UNC at LBA = 0x2baa0f4af = 11721045167

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 b8 00 02 ba a0 f4 af 40 00  1d+23:26:53.428  READ FPDMA QUEUED
  60 00 01 00 b0 00 02 ba a0 f4 ae 40 00  1d+23:26:53.427  READ FPDMA QUEUED
  60 00 01 00 a8 00 02 ba a0 f4 ad 40 00  1d+23:26:53.399  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 68 00  1d+23:26:53.377  READ LOG EXT
  60 00 01 00 a0 00 02 ba a0 f4 ac 40 00  1d+23:26:53.178  READ FPDMA QUEUED

Error 131 [10] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 02 ba a0 f4 ac 40 00  Error: UNC at LBA = 0x2baa0f4ac = 11721045164

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 01 00 a0 00 02 ba a0 f4 ac 40 00  1d+23:26:53.178  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 68 00  1d+23:26:53.169  READ LOG EXT
  60 00 01 00 98 00 02 ba a0 f4 ab 40 00  1d+23:26:52.961  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 68 00  1d+23:26:52.950  READ LOG EXT
  60 00 01 00 90 00 02 ba a0 f4 aa 40 00  1d+23:26:52.745  READ FPDMA QUEUED

Error 130 [9] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  40 -- 51 00 00 00 02 ba a0 f4 ab 40 00  Error: UNC at LBA = 0x2baa0f4ab = 11721045163

Error 129 [8] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  40 -- 51 00 00 00 02 ba a0 f4 aa 40 00  Error: UNC at LBA = 0x2baa0f4aa = 11721045162

Error 128 [7] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  40 -- 51 00 00 00 02 ba a0 f4 a9 40 00  Error: UNC at LBA = 0x2baa0f4a9 = 11721045161

Error 127 [6] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  40 -- 51 00 00 00 02 ba a0 f4 a8 40 00  Error: UNC at LBA = 0x2baa0f4a8 = 11721045160

Error 126 [5] occurred at disk power-on lifetime: 24144 hours (1006 days + 0 hours)
  40 -- 51 00 01 00 02 ba a0 f4 a8 40 00  Error: UNC at LBA = 0x2baa0f4a8 = 11721045160

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25235         -
# 2  Extended offline    Completed without error       00%     25171         -


                                                                                                                                                                              
root@NAS[~]#

Code:

Nov 11 00:53:12 NAS kernel: ipmi_si: Unable to find any System Interface(s)
Nov 11 00:53:12 NAS kernel: md: resync of RAID array md127
Nov 11 00:53:12 NAS kernel: Adding 2097084k swap on /dev/mapper/md127.  Priority:-2 extents:1 across:2097084k FS
Nov 11 00:53:44 NAS kernel: md: md127: resync done.
Nov 11 00:54:19 NAS kernel: ata3.00: configured for UDMA/133
Nov 11 00:54:19 NAS kernel: sd 2:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=7s
Nov 11 00:54:19 NAS kernel: sd 2:0:0:0: [sdb] tag#18 Sense Key : Illegal Request [current]
Nov 11 00:54:19 NAS kernel: sd 2:0:0:0: [sdb] tag#18 Add. Sense: Logical block address out of range
Nov 11 00:54:19 NAS kernel: sd 2:0:0:0: [sdb] tag#18 CDB: Write(16) 8a 00 00 00 00 01 8e c4 8f 80 00 00 01 00 00 00
Nov 11 00:54:19 NAS kernel: zio pool=Pool vdev=/dev/disk/by-partuuid/d470d17d-88af-40b8-9f48-cfda45d58857 error=5 type=2 offset=3423241895936 size=131072 flags=40080caa
Nov 11 00:54:19 NAS kernel: ata3: EH complete
Nov 11 00:54:27 NAS kernel: ata3.00: configured for UDMA/133
Nov 11 00:54:27 NAS kernel: sd 2:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=15s
Nov 11 00:54:27 NAS kernel: sd 2:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
Nov 11 00:54:27 NAS kernel: sd 2:0:0:0: [sdb] tag#14 Add. Sense: Logical block address out of range
Nov 11 00:54:27 NAS kernel: sd 2:0:0:0: [sdb] tag#14 CDB: Write(16) 8a 00 00 00 00 01 8e c4 90 88 00 00 01 00 00 00
Nov 11 00:54:27 NAS kernel: zio pool=Pool vdev=/dev/disk/by-partuuid/d470d17d-88af-40b8-9f48-cfda45d58857 error=5 type=2 offset=3423242031104 size=131072 flags=40080caa
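
The two failed writes in the kernel log can be cross-checked by decoding their Write(16) CDBs. This is a minimal Python sketch, not a diagnosis: it assumes 512-byte logical sectors (as smartctl reports for the drives listed above), and decode_write16 is a made-up helper name.

```python
SECTOR = 512  # assumption: logical sector size in bytes


def decode_write16(cdb_hex):
    """Return (lba, block_count) from a SCSI WRITE(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks


# CDBs copied from the two "CDB: Write(16)" kernel log lines
first = decode_write16("8a 00 00 00 00 01 8e c4 8f 80 00 00 01 00 00 00")
second = decode_write16("8a 00 00 00 00 01 8e c4 90 88 00 00 01 00 00 00")

for lba, blocks in (first, second):
    print(f"LBA {lba} ({lba * SECTOR / 1e12:.2f} TB into the disk), "
          f"{blocks} sectors = {blocks * SECTOR} bytes")

print("gap between the two writes in bytes:", (second[0] - first[0]) * SECTOR)
```

Both transfers decode to 256 sectors, i.e. 131072 bytes, which matches size=131072 in the zio lines, and the gap between the two LBAs (135168 bytes) equals the gap between the two zio offsets. The target addresses are roughly 3.4 TB into the disk, so if sdb is one of the 5TB or 6TB pool members, the writes were well inside the disk and "Logical block address out of range" was not caused by ZFS writing past the end of it.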

Ericloewe · Nov 12, 2022

The SMART error log for /dev/sda is full (for a drive with less than 200 hours on it...) of ominous errors, but I can't tell you for sure what they're about. Also, it's SMR, so performance is going to be miserable.
Additionally, /dev/sde is starting to fail, as evidenced by the uncorrectable errors and various SMART parameters (multizone error rate, reallocated sectors, and raw read error rate are not looking great). Very suggestive of a drive on its way out.
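
For reference, the first byte of each "registers were" row in the smartctl output is the ATA error register, and the IDNF/UNC labels come from its bits. A minimal Python sketch of that decoding follows; the bit table is deliberately limited to a few well-known flags, and decode_error_register is a made-up helper name.

```python
# Subset of ATA error-register bits (other bits are not decoded here).
ATA_ERROR_BITS = {
    0x80: "ICRC (interface CRC error)",
    0x40: "UNC (uncorrectable data)",
    0x10: "IDNF (ID/address not found)",
    0x04: "ABRT (command aborted)",
}


def decode_error_register(row):
    """Decode the ER byte and the 48-bit LBA from a smartctl -x register row."""
    fields = row.split()
    # Column layout: ER -- ST COUNT(2 bytes) LBA_48(6 bytes) DV DC
    er = int(fields[0], 16)
    lba = int("".join(fields[5:11]), 16)
    flags = [name for bit, name in ATA_ERROR_BITS.items() if er & bit]
    return flags, lba


# One row from /dev/sda and one from /dev/sde, copied from the logs above
print(decode_error_register("10 -- 51 00 00 00 00 00 40 04 70 40 00"))
print(decode_error_register("40 -- 51 00 00 00 02 ba a0 f4 af 40 00"))
```

Every error logged on sda decodes as IDNF (the drive could not find the requested sector address), while the ones on sde are UNC (data could not be read back). That only labels the two failure signatures; it does not explain why the drives raised them.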

Also worrying is the very high load cycle count. WD Greens/new Blues and Reds used to be specced at something like 300,000 or 600,000 load cycles. This was a big deal back in 2014/2015, when people would often try to save a couple of bucks by going with Greens (now Blues) instead of Reds, but it has since faded as a relevant phenomenon.

Your Blues are constantly trying to park their heads, to save energy. But the OS and ZFS are also constantly trying to do stuff. The result is that the disk has to unpark the heads again. These cycles wear out the arm mechanism on the disk - I guess as parts wear down, the disk has a harder time precisely positioning the head for I/O operations.
Back in the day, WD had a utility that allowed you to configure the period after which heads would be parked, but there was some talk of it no longer working at about the time Greens were rebranded as Blues.
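
To put the load cycle numbers in perspective, here is a quick back-of-the-envelope calculation in Python using the Load_Cycle_Count and Power_On_Hours raw values from the smartctl output above. The 300,000 rating is an assumption taken from the lower of the two figures mentioned, not a checked datasheet value, and load_cycle_stats is a made-up helper name.

```python
RATED_CYCLES = 300_000  # assumption: lower of the two ratings mentioned above

# (device, Load_Cycle_Count, Power_On_Hours) raw values from the posted smartctl output
drives = [
    ("sdc", 240509, 23538),
    ("sdd", 235715, 23545),
    ("sde", 247562, 25245),
]


def load_cycle_stats(cycles, hours, rated=RATED_CYCLES):
    """Return (load cycles per power-on hour, fraction of the assumed rating used)."""
    return cycles / hours, cycles / rated


for name, cycles, hours in drives:
    per_hour, used = load_cycle_stats(cycles, hours)
    print(f"{name}: {per_hour:.1f} load cycles per power-on hour, "
          f"{used:.0%} of {RATED_CYCLES:,}")
```

All three Blues come out at roughly ten head parks per power-on hour and around 80% of the assumed rating, which is consistent with an aggressive idle-park timer rather than with occasional spin-downs.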

Neo2199 · Nov 12, 2022

Yes, I do have a warranty claim open for sda. I might replace it with a CMR drive. I read somewhere that it should be CMR, but well.
sde is on its last legs, and the other drives might follow in a year or two. I am now looking for a replacement for sde. I am considering a transitional upgrade to a 3x 12TB setup. My home NAS does not really need performance; anything that saturates gigabit will do. Cost and power consumption are what I care about. Maybe shuck some WD 12TB drives?

My NAS was set up to spin down at idle. Maybe this explains the load cycle increase? This happened about 3x a day for the past 6 years or so. I am not sure yet what my plan is for the TrueNAS build.

Ericloewe · Nov 12, 2022

Well, 16k start/stop cycles is a whole lot, too. To be honest, I'm not sure what the design target is for that, but I doubt 16k is a good place to be at. But that's yet another thing on top of the load cycle count (the start/stop count is an order of magnitude lower, so the drives are spinning down, but not at an insane rate).

Neo2199 · Nov 12, 2022

Oh yeah, one more thing. My original NAS was configured with aggressive LPM for the SATA controller. I think that's on me from 6 years ago.

Jailer · Nov 12, 2022

Ericloewe said:
The SMART error log for /dev/sda is full (for a drive with less than 200 hours on it...) of ominous errors, but I can't tell you for sure what they're about

Probably that 80W pico PSU that is powering this system.

Ericloewe · Nov 12, 2022

I missed that one. I assume the 80W is for the +5V and +3.3V, making it only "miserably underpowered". Of course, the +12V PSU powering it still needs to deal with the load imposed by the disks plus the rest of the system.
Four disks means a 120W PSU just for the disks... And since they're spinning up all the time, the whole thing is just begging for problems of all sorts.
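
The arithmetic behind that estimate can be sketched in a few lines of Python. The 30 W per disk is an assumption: it is simply the rule of thumb implied by "four disks means 120W", not a measured or datasheet figure, and spinup_budget is a made-up helper name.

```python
PSU_WATTS = 80               # PicoPSU-80 rating from the original post
SPINUP_WATTS_PER_DISK = 30   # assumption: rule-of-thumb budget per spinning disk


def spinup_budget(disks, watts_per_disk=SPINUP_WATTS_PER_DISK):
    """Worst-case power budget if all disks spin up at the same time."""
    return disks * watts_per_disk


budget = spinup_budget(4)
headroom = PSU_WATTS - budget
print(f"disk budget: {budget} W, headroom against {PSU_WATTS} W: {headroom} W")
```

The comparison against 80 W is only indicative, since the +12V load is carried by the external supply feeding the PicoPSU, but it shows that the disks alone can exceed the nominal rating before the CPU and motherboard are counted.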

Neo2199 · Jan 7, 2023

So I have some updates.

The short version: I have replaced all the old drives with new WD Red Pro 14TB models. All is good so far.

The long version: I seriously doubted that all 4 old drives would turn out faulty at roughly the same time. But since those were shucked WD Blue 6TB drives, they all have 8s idle parking, which resulted in a really high head parking count. I suspect this behavior caused all the drives to fail at a similar time and in a similar fashion. I managed to address the issue early enough that only one drive had failed, and I recovered all the data. But during the TrueNAS setup I noticed errors from drives that were dying while I was testing the fixed NAS. The chances of that happening are pretty low, so I suspected some other issue. But once I replaced all the drives with new ones, the system has been working fine. If I don't post another update here, all is well. Thanks all for the help.

artlessknave · Jan 7, 2023

Neo2199 said:
PSU: PicoPSU-80-WI-32V

WATT...

how is this system even running.
