pool degraded, SMART test passes but logs errors

fmiz · Jan 27, 2024

Hello, I understand that what I'm asking is similar to what others have reported, but I need some advice to understand what is failing in my setup.
I have a raidz2 pool with 6 SATA disks. Probably after a scrub, the error "CAM status: ATA Status Error" started showing up in the logs and the pool became degraded. I searched for advice online and then read the troubleshooting guide. I have not strictly followed the standard procedure described there; this is what I've done so far after seeing the error logged:
- ran a SMART long test on the drive
- checked SMART attributes 5, 197, and 198: they are all 0, but the drive reported Device Error Count: 46
- swapped the SATA data cable with the one from another drive in the array (I've discovered that I have no spare SATA3 cables left at home...)
- the drive has now changed name, from ada2 to ada4, meaning that I'm now testing both a different cable and a different SATA port at the same time
- ran a scrub
- the error showed up again and the pool is degraded again
- SMART attributes 5, 197, and 198 are still 0, but now it shows Device Error Count: 59
SMART is logging something, though, for example:

Code:

Error 59 [10] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b cf 40 00  Error: UNC at LBA = 0x155c94bcf = 5734222799

Others on this forum have already reported something similar to what I'm seeing, but the LBA in my case seems to be a real value, not something like 0000 or 0xffff (which would instead suggest a drive communication failure). Details follow.
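
As a cross-check on that reading, the LBA that smartctl prints can be rebuilt by hand from the register row of the error entry. This is only a minimal sketch: it assumes the six bytes after the two COUNT bytes (00 01 55 c9 4b cf) are the 48-bit LBA from most to least significant byte, as the column header in the excerpt suggests. The function names are made up for illustration, and the "placeholder" test is just the rule of thumb mentioned above, not an official diagnostic.

```python
def lba_from_registers(row: str) -> int:
    """Rebuild the 48-bit LBA from a smartctl 'registers were' row."""
    # Drop the "--" placeholder so only hex bytes remain, in this order:
    # ER, ST, COUNT (2 bytes), LBA (6 bytes, high to low), DV, DC.
    fields = [f for f in row.split() if f != "--"]
    lba_bytes = fields[4:10]
    return int("".join(lba_bytes), 16)


def looks_like_placeholder(lba: int) -> bool:
    """Rule of thumb only: all-zero or all-ones values look like filler."""
    return lba in (0, 0xFFFF, 0xFFFFFFFF, 0xFFFFFFFFFFFF)


row = "40 -- 51 00 00 00 01 55 c9 4b cf 40 00"
lba = lba_from_registers(row)
print(hex(lba), lba, looks_like_placeholder(lba))
# -> 0x155c94bcf 5734222799 False
```

The result matches the "UNC at LBA" value printed by smartctl, which is consistent with the drive reporting a specific unreadable sector rather than a filler address.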

System hardware:
- Supermicro X10SLH-F
- Intel i3-4160, 16GB (DDR3)
- 6 WDC WD30EFRX-68EUZN0 (3TB SATA WD RED), connected to the onboard SATA controller (Intel PCH?)
OS is TrueNAS-13.0-U6.1 (CORE)

This is a sample of /var/log/messages:

Code:

Jan 26 06:33:46 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 41 9e 40 55 01 00 08 00 00
Jan 26 06:33:46 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:33:46 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:33:46 freenas (ada4:ahcich4:0:0:0): RES: 41 40 c8 49 9e 40 55 01 00 00 00
Jan 26 06:33:46 freenas (ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
Jan 26 06:33:53 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 41 9e 40 55 01 00 08 00 00
Jan 26 06:33:53 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:33:53 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:33:53 freenas (ada4:ahcich4:0:0:0): RES: 41 40 c8 49 9e 40 55 01 00 00 00
Jan 26 06:33:53 freenas (ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
Jan 26 06:34:08 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 18 61 9e 40 55 01 00 07 00 00
Jan 26 06:34:08 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:08 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:08 freenas (ada4:ahcich4:0:0:0): RES: 41 40 88 61 9e 40 55 01 00 00 00
Jan 26 06:34:08 freenas (ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
Jan 26 06:34:15 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 18 61 9e 40 55 01 00 07 00 00
Jan 26 06:34:15 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:15 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:15 freenas (ada4:ahcich4:0:0:0): RES: 41 40 8f 61 9e 40 55 01 00 00 00
Jan 26 06:34:15 freenas (ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
Jan 26 06:34:22 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 18 61 9e 40 55 01 00 07 00 00
Jan 26 06:34:22 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:22 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:22 freenas (ada4:ahcich4:0:0:0): RES: 41 40 88 61 9e 40 55 01 00 00 00
Jan 26 06:34:22 freenas (ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
Jan 26 06:34:29 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 18 61 9e 40 55 01 00 07 00 00
Jan 26 06:34:29 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:29 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:29 freenas (ada4:ahcich4:0:0:0): RES: 41 40 8f 61 9e 40 55 01 00 00 00
Jan 26 06:34:29 freenas (ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
Jan 26 06:34:36 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 d8 18 61 9e 40 55 01 00 07 00 00
Jan 26 06:34:36 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:36 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:36 freenas (ada4:ahcich4:0:0:0): RES: 41 40 88 61 9e 40 55 01 00 00 00
Jan 26 06:34:36 freenas (ada4:ahcich4:0:0:0): Error 5, Retries exhausted
Jan 26 06:34:48 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c8 a9 9e 40 55 01 00 01 00 00
Jan 26 06:34:48 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 06:34:48 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 06:34:48 freenas (ada4:ahcich4:0:0:0): RES: 41 40 1f aa 9e 40 55 01 00 00 00
Jan 26 06:34:48 freenas (ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
Jan 26 07:22:58 freenas 1 2024-01-26T07:22:58.092445+01:00 freenas.local collectd 3263 - - nut plugin: nut_connect: upscli_connect (localhost, 3493) failed: Connection failure: Connection refused
Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 49 c9 40 55 01 00 08 00 00
Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): RES: 41 40 cf 4b c9 40 55 01 00 00 00
Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): Retrying command, 3 more tries remain
Jan 26 09:15:58 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 49 c9 40 55 01 00 08 00 00
Jan 26 09:15:58 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 09:15:58 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 09:15:58 freenas (ada4:ahcich4:0:0:0): RES: 41 40 d8 4b c9 40 55 01 00 00 00
Jan 26 09:15:58 freenas (ada4:ahcich4:0:0:0): Retrying command, 2 more tries remain
Jan 26 09:16:05 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 49 c9 40 55 01 00 08 00 00
Jan 26 09:16:05 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 09:16:05 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 09:16:05 freenas (ada4:ahcich4:0:0:0): RES: 41 40 d8 4b c9 40 55 01 00 00 00
Jan 26 09:16:05 freenas (ada4:ahcich4:0:0:0): Retrying command, 1 more tries remain
Jan 26 09:16:12 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 49 c9 40 55 01 00 08 00 00
Jan 26 09:16:12 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 09:16:12 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 09:16:12 freenas (ada4:ahcich4:0:0:0): RES: 41 40 cf 4b c9 40 55 01 00 00 00
Jan 26 09:16:12 freenas (ada4:ahcich4:0:0:0): Retrying command, 0 more tries remain
Jan 26 09:16:19 freenas (ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 49 c9 40 55 01 00 08 00 00
Jan 26 09:16:19 freenas (ada4:ahcich4:0:0:0): CAM status: ATA Status Error
Jan 26 09:16:19 freenas (ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 26 09:16:19 freenas (ada4:ahcich4:0:0:0): RES: 41 40 cf 4b c9 40 55 01 00 00 00
Jan 26 09:16:19 freenas (ada4:ahcich4:0:0:0): Error 5, Retries exhausted
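
The RES lines in this log can be decoded to see which sector each failed read stopped at. The sketch below is an assumption-laden helper, not an official tool. It assumes the 11 RES bytes are ordered as status, error, LBA low/mid/high, device, LBA low/mid/high (upper 24 bits), then two sector-count bytes. That ordering is inferred from the fact that it reproduces the LBAs in the SMART error log. The bit tables are partial, and `decode_res` is a hypothetical name.

```python
import re

# Assumed RES layout (11 bytes): status, error, LBA low/mid/high, device,
# LBA low/mid/high (upper 24 bits), sector count (low, high).
RES_RE = re.compile(r"\((\w+):\w+:[\d:]+\): RES: ((?:[0-9a-f]{2} ?){11})")

# Partial tables of ATA status and error register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flag_names(value, table):
    """Return the names of the bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(line):
    """Decode one 'RES:' log line; returns None for any other line."""
    m = RES_RE.search(line)
    if m is None:
        return None
    b = [int(x, 16) for x in m.group(2).split()]
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    return {
        "device": m.group(1),
        "status": flag_names(b[0], STATUS_BITS),
        "error": flag_names(b[1], ERROR_BITS),
        "lba": lba,
    }


sample = ("Jan 26 09:15:51 freenas (ada4:ahcich4:0:0:0): "
          "RES: 41 40 cf 4b c9 40 55 01 00 00 00")
print(decode_res(sample))
```

For the sample line this gives LBA 5734222799 with status DRDY+ERR and error UNC, the same sector that smartctl reports for Error 59.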

This is what ZFS shows:

Code:

root@freenas[/]# zpool status heaven
  pool: heaven
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 1.94M in 05:38:45 with 0 errors on Fri Jan 26 09:36:14 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        heaven                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/73149100-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/73ec5508-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/74b74322-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/75808f9b-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/7648fa64-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/771697bb-c06f-11e4-9fa4-0cc47a09c018  FAULTED     64     0     2  too many errors

errors: No known data errors
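
For anyone who wants to track these counters between scrubs, the config table can be reduced to just the rows that are not ONLINE. This is a rough sketch that assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM, optional note) and plain integer counters. zpool may abbreviate large counts, which this regex would not match. `unhealthy_vdevs` is a hypothetical helper name.

```python
import re

# A config row is: NAME STATE READ WRITE CKSUM [optional note].
ROW_RE = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def unhealthy_vdevs(status_text):
    """Return (name, state, read, write, cksum, note) for non-ONLINE rows."""
    rows = []
    for line in status_text.splitlines():
        m = ROW_RE.match(line)
        if m and m.group(2) != "ONLINE":
            name, state, rd, wr, ck, note = m.groups()
            rows.append((name, state, int(rd), int(wr), int(ck), note or ""))
    return rows


# Shortened copy of the output above.
sample = """
        NAME                                            STATE     READ WRITE CKSUM
        heaven                                          DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/7648fa64-c06f-11e4-9fa4-0cc47a09c018  ONLINE       0     0     0
            gptid/771697bb-c06f-11e4-9fa4-0cc47a09c018  FAULTED     64     0     2  too many errors
"""
for row in unhealthy_vdevs(sample):
    print(row)
```

On this sample it lists the pool, the raidz2 vdev, and the single FAULTED member with 64 read and 2 checksum errors.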

This is the SMART information of the suspect drive:

Code:

root@freenas[/]# smartctl -x /dev/ada4
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7RE3DC6
LU WWN Device Id: 5 0014ee 2b54c166e
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 28 01:16:35 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39840) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 399) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    590
  3 Spin_Up_Time            POS--K   174   171   021    -    6291
  4 Start_Stop_Count        -O--CK   100   100   000    -    103
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   076   076   000    -    17724
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    103
192 Power-Off_Retract_Count -O--CK   200   200   000    -    21
193 Load_Cycle_Count        -O--CK   200   200   000    -    203
194 Temperature_Celsius     -O---K   122   100   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   172   165   000    -    11466
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 59 (device log contains only the most recent 24 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 59 [10] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b cf 40 00  Error: UNC at LBA = 0x155c94bcf = 5734222799

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e0 00 20 00 01 55 c9 59 f8 40 08     08:33:44.688  READ FPDMA QUEUED
  60 08 00 00 18 00 01 55 c9 51 f8 40 08     08:33:44.688  READ FPDMA QUEUED
  60 08 00 00 10 00 01 55 c9 49 f8 40 08     08:33:44.688  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     08:33:44.656  READ LOG EXT
  60 07 e0 00 00 00 01 55 c9 59 f8 40 08     08:33:37.662  READ FPDMA QUEUED

Error 58 [9] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b cf 40 00  Error: UNC at LBA = 0x155c94bcf = 5734222799

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e0 00 00 00 01 55 c9 59 f8 40 08     08:33:37.662  READ FPDMA QUEUED
  60 08 00 00 f8 00 01 55 c9 51 f8 40 08     08:33:37.662  READ FPDMA QUEUED
  60 08 00 00 f0 00 01 55 c9 49 f8 40 08     08:33:37.662  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     08:33:37.630  READ LOG EXT
  60 07 e0 00 e0 00 01 55 c9 59 f8 40 08     08:33:30.635  READ FPDMA QUEUED

Error 57 [8] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b d8 40 00  Error: UNC at LBA = 0x155c94bd8 = 5734222808

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e0 00 e0 00 01 55 c9 59 f8 40 08     08:33:30.635  READ FPDMA QUEUED
  60 08 00 00 d8 00 01 55 c9 51 f8 40 08     08:33:30.635  READ FPDMA QUEUED
  60 08 00 00 d0 00 01 55 c9 49 f8 40 08     08:33:30.635  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     08:33:30.604  READ LOG EXT
  60 07 e0 00 c0 00 01 55 c9 59 f8 40 08     08:33:23.609  READ FPDMA QUEUED

Error 56 [7] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b d8 40 00  Error: UNC at LBA = 0x155c94bd8 = 5734222808

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e0 00 c0 00 01 55 c9 59 f8 40 08     08:33:23.609  READ FPDMA QUEUED
  60 08 00 00 b8 00 01 55 c9 51 f8 40 08     08:33:23.609  READ FPDMA QUEUED
  60 08 00 00 b0 00 01 55 c9 49 f8 40 08     08:33:23.609  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     08:33:23.578  READ LOG EXT
  60 07 e0 00 a0 00 01 55 c9 59 f8 40 08     08:33:16.583  READ FPDMA QUEUED

Error 55 [6] occurred at disk power-on lifetime: 17685 hours (736 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 c9 4b cf 40 00  Error: UNC at LBA = 0x155c94bcf = 5734222799

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 07 e0 00 a0 00 01 55 c9 59 f8 40 08     08:33:16.583  READ FPDMA QUEUED
  60 08 00 00 98 00 01 55 c9 51 f8 40 08     08:33:16.577  READ FPDMA QUEUED
  60 08 00 00 90 00 01 55 c9 49 f8 40 08     08:33:16.577  READ FPDMA QUEUED
  60 00 d0 00 88 00 01 55 c9 48 38 40 08     08:33:16.564  READ FPDMA QUEUED
  60 07 a0 00 80 00 01 55 c9 37 70 40 08     08:33:16.551  READ FPDMA QUEUED

Error 54 [5] occurred at disk power-on lifetime: 17682 hours (736 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9e aa 1f 40 00  Error: UNC at LBA = 0x1559eaa1f = 5731428895

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 30 00 d0 00 01 55 9e ab 18 40 08     05:52:19.598  READ FPDMA QUEUED
  60 01 00 00 c8 00 01 55 9e a9 c8 40 08     05:52:19.598  READ FPDMA QUEUED
  60 00 40 00 c0 00 01 55 9e a9 40 40 08     05:52:18.152  READ FPDMA QUEUED
  60 00 40 00 b8 00 01 55 9e a8 58 40 08     05:52:18.148  READ FPDMA QUEUED
  60 00 c0 00 b0 00 01 55 9e a7 30 40 08     05:52:18.138  READ FPDMA QUEUED

Error 53 [4] occurred at disk power-on lifetime: 17682 hours (736 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9e 61 88 40 00  Error: UNC at LBA = 0x1559e6188 = 5731410312

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 03 80 00 e8 00 01 55 9e 70 b8 40 08     05:52:10.807  READ FPDMA QUEUED
  60 07 c8 00 e0 00 01 55 9e 68 f0 40 08     05:52:10.807  READ FPDMA QUEUED
  60 07 d8 00 d8 00 01 55 9e 61 18 40 08     05:52:10.807  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:52:10.775  READ LOG EXT
  60 03 80 00 c8 00 01 55 9e 70 b8 40 08     05:52:03.780  READ FPDMA QUEUED

Error 52 [3] occurred at disk power-on lifetime: 17682 hours (736 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9e 61 8f 40 00  Error: UNC at LBA = 0x1559e618f = 5731410319

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 03 80 00 c8 00 01 55 9e 70 b8 40 08     05:52:03.780  READ FPDMA QUEUED
  60 07 c8 00 c0 00 01 55 9e 68 f0 40 08     05:52:03.780  READ FPDMA QUEUED
  60 07 d8 00 b8 00 01 55 9e 61 18 40 08     05:52:03.780  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:52:03.748  READ LOG EXT
  60 03 80 00 a8 00 01 55 9e 70 b8 40 08     05:51:56.754  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17723         -
# 2  Extended offline    Completed without error       00%     17664         -
# 3  Short offline       Completed without error       00%     17626         -
# 4  Extended offline    Completed without error       00%     17502         -
# 5  Short offline       Completed without error       00%     17480         -
# 6  Short offline       Completed without error       00%     17325         -
# 7  Short offline       Completed without error       00%     17157         -
# 8  Extended offline    Completed without error       00%     17069         -
# 9  Short offline       Completed without error       00%     16989         -
#10  Short offline       Completed without error       00%     16804         -
#11  Extended offline    Completed without error       00%     16788         -
#12  Short offline       Completed without error       00%     16541         -
#13  Short offline       Completed without error       00%     16374         -
#14  Short offline       Completed without error       00%     16206         -
#15  Short offline       Completed without error       00%     16038         -
#16  Extended offline    Completed without error       00%     15949         -
#17  Short offline       Completed without error       00%     15870         -
#18  Short offline       Completed without error       00%     15702         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     19/31 Celsius
Lifetime    Min/Max Temperature:      2/50 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (28)

Index    Estimated Time   Temperature Celsius
  29    2024-01-27 17:19    28  *********
 ...    ..(476 skipped).    ..  *********
  28    2024-01-28 01:16    28  *********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4       174708  Vendor specific
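
To see whether the logged errors are scattered or concentrated, the LBAs from Errors 52-59 above can be grouped and mapped to 4K physical sectors. This is a sketch under two assumptions: eight 512-byte logical sectors per physical sector (taken from the "Sector Sizes" line), and an arbitrary 2048-sector gap as the cut-off for starting a new group. The `cluster` helper is hypothetical.

```python
LOGICAL_PER_PHYSICAL = 4096 // 512  # from "Sector Sizes" in the smartctl output

# LBAs copied from Errors 59 down to 52 in the log above.
error_lbas = [5734222799, 5734222799, 5734222808, 5734222808,
              5734222799, 5731428895, 5731410312, 5731410319]


def cluster(lbas, max_gap=2048):
    """Group sorted unique LBAs; max_gap (in sectors) is an arbitrary choice."""
    groups = []
    for lba in sorted(set(lbas)):
        if groups and lba - groups[-1][-1] <= max_gap:
            groups[-1].append(lba)
        else:
            groups.append([lba])
    return groups


for group in cluster(error_lbas):
    # Several logical LBAs can fall inside the same 4K physical sector.
    physical = sorted({lba // LOGICAL_PER_PHYSICAL for lba in group})
    print(f"LBAs {group} -> 4K physical sectors {physical}")
```

With these inputs the eight entries reduce to five distinct LBAs in three nearby groups, and LBAs 5731410312 and 5731410319 fall in the same physical sector.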

This is another drive from the same array:

Code:

root@freenas[/]# smartctl -x /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N6LTT28P
LU WWN Device Id: 5 0014ee 2605b09c5
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 28 01:17:44 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40500) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 406) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   179   177   021    -    6008
  4 Start_Stop_Count        -O--CK   100   100   000    -    104
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   076   076   000    -    17725
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    104
192 Power-Off_Retract_Count -O--CK   200   200   000    -    21
193 Load_Cycle_Count        -O--CK   200   200   000    -    259
194 Temperature_Celsius     -O---K   122   101   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17724         -
# 2  Short offline       Completed without error       00%     17626         -
# 3  Extended offline    Completed without error       00%     17499         -
# 4  Short offline       Completed without error       00%     17480         -
# 5  Short offline       Completed without error       00%     17325         -
# 6  Short offline       Completed without error       00%     17158         -
# 7  Extended offline    Completed without error       00%     17069         -
# 8  Short offline       Completed without error       00%     16990         -
# 9  Short offline       Completed without error       00%     16805         -
#10  Extended offline    Completed without error       00%     16788         -
#11  Short offline       Completed without error       00%     16542         -
#12  Short offline       Completed without error       00%     16374         -
#13  Short offline       Completed without error       00%     16206         -
#14  Short offline       Completed without error       00%     16038         -
#15  Extended offline    Completed without error       00%     15950         -
#16  Short offline       Completed without error       00%     15870         -
#17  Short offline       Completed without error       00%     15703         -
#18  Short offline       Completed without error       00%     15535         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     18/29 Celsius
Lifetime    Min/Max Temperature:      2/49 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (54)

Index    Estimated Time   Temperature Celsius
  55    2024-01-27 17:20    28  *********
 ...    ..(476 skipped).    ..  *********
  54    2024-01-28 01:17    28  *********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4       174776  Vendor specific

At this point I think I have one more test to do: swap the SATA cable again and see what happens. When I swapped it the first time, I may have somehow connected the same cable and the same drive to a different port on the motherboard (I took the opportunity to tidy up the cable management while the system was open). What do you think? Am I just wasting time or, worse, stressing the array for no reason? Why are the relevant SMART attributes stuck at 0?
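
One way to judge whether another cable swap is worth the effort is to look at the link-level counters that smartctl already prints. The sketch below is a minimal, hypothetical helper (the name `link_error_counters` and the embedded sample are illustrative, not part of smartmontools) that parses a "SATA Phy Event Counters" table and reports any CRC or R_ERR counter that is nonzero. As a rule of thumb, cabling and port problems tend to show up there and in attribute 199, whereas a UNC error with a plausible LBA is reported by the drive itself after it failed to read a sector. These counters may not persist across power cycles, so an all-zero table is supporting evidence rather than proof.

```python
import re

# Sample rows copied from a "SATA Phy Event Counters" section of smartctl -x output.
SAMPLE = """\
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x8000  4       174776  Vendor specific
"""

# Columns: ID, size in bytes, value, description.
ROW = re.compile(r"^(0x[0-9a-f]{4})\s+(\d+)\s+(\d+)\s+(.+)$")


def link_error_counters(text):
    """Return {description: value} for every nonzero CRC or R_ERR counter."""
    hits = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        _counter_id, _size, value, description = match.groups()
        if int(value) and ("CRC" in description or "R_ERR" in description):
            hits[description] = int(value)
    return hits


suspects = link_error_counters(SAMPLE)
print(suspects or "no CRC/R_ERR events recorded on the link")
```

With the sample rows above the helper finds nothing to report, which is consistent with a media problem on the drive rather than a cabling problem.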

fmiz · Jan 27, 2024

Bonus: scrolling up in the terminal, I found an older dump of SMART info from the suspect drive:

Code:

root@freenas[/]# smartctl -x /dev/ada2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7RE3DC6
LU WWN Device Id: 5 0014ee 2b54c166e
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan 23 03:42:01 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39840) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 399) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    412
  3 Spin_Up_Time            POS--K   173   171   021    -    6308
  4 Start_Stop_Count        -O--CK   100   100   000    -    102
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   076   076   000    -    17676
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    102
192 Power-Off_Retract_Count -O--CK   200   200   000    -    21
193 Load_Cycle_Count        -O--CK   200   200   000    -    200
194 Temperature_Celsius     -O---K   122   100   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   172   165   000    -    11466
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 46 (device log contains only the most recent 24 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 46 [21] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9f 4c e8 40 00  Error: UNC at LBA = 0x1559f4ce8 = 5731470568

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 02 80 00 28 00 01 55 9f 4c 08 40 08     05:11:10.110  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:11:10.079  READ LOG EXT
  60 02 80 00 18 00 01 55 9f 4c 08 40 08     05:11:03.084  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:11:03.052  READ LOG EXT
  60 00 40 00 08 00 01 55 9f 66 28 40 08     05:10:56.012  READ FPDMA QUEUED

Error 45 [20] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9f 4c df 40 00  Error: UNC at LBA = 0x1559f4cdf = 5731470559

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 02 80 00 18 00 01 55 9f 4c 08 40 08     05:11:03.084  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:11:03.052  READ LOG EXT
  60 00 40 00 08 00 01 55 9f 66 28 40 08     05:10:56.012  READ FPDMA QUEUED
  60 02 80 00 00 00 01 55 9f 4c 08 40 08     05:10:56.012  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:10:55.980  READ LOG EXT

Error 44 [19] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9f 4c df 40 00  Error: UNC at LBA = 0x1559f4cdf = 5731470559

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 08 00 01 55 9f 66 28 40 08     05:10:56.012  READ FPDMA QUEUED
  60 02 80 00 00 00 01 55 9f 4c 08 40 08     05:10:56.012  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:10:55.980  READ LOG EXT
  60 00 40 00 f0 00 01 55 9f 66 28 40 08     05:10:48.961  READ FPDMA QUEUED
  60 01 40 00 e8 00 01 55 9f 57 08 40 08     05:10:48.961  READ FPDMA QUEUED

Error 43 [18] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9f 4c e8 40 00  Error: UNC at LBA = 0x1559f4ce8 = 5731470568

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 f0 00 01 55 9f 66 28 40 08     05:10:48.961  READ FPDMA QUEUED
  60 01 40 00 e8 00 01 55 9f 57 08 40 08     05:10:48.961  READ FPDMA QUEUED
  60 02 80 00 e0 00 01 55 9f 4c 08 40 08     05:10:48.961  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     05:10:48.929  READ LOG EXT
  60 01 40 00 c8 00 01 55 9f 57 08 40 08     05:10:41.936  READ FPDMA QUEUED

Error 42 [17] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 02 c0 00 01 55 9f 4c df 40 00  Error: UNC at LBA = 0x1559f4cdf = 5731470559

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 40 00 c8 00 01 55 9f 57 08 40 08     05:10:41.936  READ FPDMA QUEUED
  60 02 80 00 c0 00 01 55 9f 4c 08 40 08     05:10:41.922  READ FPDMA QUEUED
  60 00 60 00 b8 00 01 55 9f 49 e0 40 08     05:10:41.919  READ FPDMA QUEUED
  60 02 60 00 b0 00 01 55 9f 47 38 40 08     05:10:40.906  READ FPDMA QUEUED
  60 01 80 00 a8 00 01 55 9f 43 40 40 08     05:10:40.899  READ FPDMA QUEUED

Error 41 [16] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 02 c0 00 01 55 9e c5 07 40 00  Error: UNC 704 sectors at LBA = 0x1559ec507 = 5731435783

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:32.870  READ DMA EXT
  b0 00 da 00 00 00 00 00 c2 4f 00 40 08     05:10:32.799  SMART RETURN STATUS
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:25.773  READ DMA EXT
  b0 00 d1 00 01 00 00 00 c2 4f 01 40 08     05:10:25.769  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:18.743  READ DMA EXT

Error 40 [15] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 02 c0 00 01 55 9e c5 07 40 00  Error: UNC 704 sectors at LBA = 0x1559ec507 = 5731435783

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:25.773  READ DMA EXT
  b0 00 d1 00 01 00 00 00 c2 4f 01 40 08     05:10:25.769  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:18.743  READ DMA EXT
  b0 00 d0 00 01 00 00 00 c2 4f 00 40 08     05:10:18.739  SMART READ DATA
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:11.714  READ DMA EXT

Error 39 [14] occurred at disk power-on lifetime: 17673 hours (736 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 02 c0 00 01 55 9e c4 f7 40 00  Error: UNC 704 sectors at LBA = 0x1559ec4f7 = 5731435767

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:18.743  READ DMA EXT
  b0 00 d0 00 01 00 00 00 c2 4f 00 40 08     05:10:18.739  SMART READ DATA
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:11.714  READ DMA EXT
  ec 00 00 00 01 00 00 00 00 00 00 40 08     05:10:11.714  IDENTIFY DEVICE
  25 00 00 02 c0 00 01 55 9e c3 30 40 08     05:10:04.688  READ DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17664         -
# 2  Short offline       Completed without error       00%     17626         -
# 3  Extended offline    Completed without error       00%     17502         -
# 4  Short offline       Completed without error       00%     17480         -
# 5  Short offline       Completed without error       00%     17325         -
# 6  Short offline       Completed without error       00%     17157         -
# 7  Extended offline    Completed without error       00%     17069         -
# 8  Short offline       Completed without error       00%     16989         -
# 9  Short offline       Completed without error       00%     16804         -
#10  Extended offline    Completed without error       00%     16788         -
#11  Short offline       Completed without error       00%     16541         -
#12  Short offline       Completed without error       00%     16374         -
#13  Short offline       Completed without error       00%     16206         -
#14  Short offline       Completed without error       00%     16038         -
#15  Extended offline    Completed without error       00%     15949         -
#16  Short offline       Completed without error       00%     15870         -
#17  Short offline       Completed without error       00%     15702         -
#18  Short offline       Completed without error       00%     15534         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     24/34 Celsius
Lifetime    Min/Max Temperature:      2/50 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (461)

Index    Estimated Time   Temperature Celsius
 462    2024-01-22 19:45    34  ***************
 ...    ..(124 skipped).    ..  ***************
 109    2024-01-22 21:50    34  ***************
 110    2024-01-22 21:51    33  **************
 ...    ..( 25 skipped).    ..  **************
 136    2024-01-22 22:17    33  **************
 137    2024-01-22 22:18    32  *************
 ...    ..( 17 skipped).    ..  *************
 155    2024-01-22 22:36    32  *************
 156    2024-01-22 22:37    31  ************
 ...    ..(  5 skipped).    ..  ************
 162    2024-01-22 22:43    31  ************
 163    2024-01-22 22:44    30  ***********
 ...    ..(  9 skipped).    ..  ***********
 173    2024-01-22 22:54    30  ***********
 174    2024-01-22 22:55    29  **********
 ...    ..( 47 skipped).    ..  **********
 222    2024-01-22 23:43    29  **********
 223    2024-01-22 23:44    28  *********
 ...    ..( 81 skipped).    ..  *********
 305    2024-01-23 01:06    28  *********
 306    2024-01-23 01:07    31  ************
 ...    ..(  8 skipped).    ..  ************
 315    2024-01-23 01:16    31  ************
 316    2024-01-23 01:17    32  *************
 ...    ..( 12 skipped).    ..  *************
 329    2024-01-23 01:30    32  *************
 330    2024-01-23 01:31    33  **************
 ...    ..( 39 skipped).    ..  **************
 370    2024-01-23 02:11    33  **************
 371    2024-01-23 02:12    34  ***************
 ...    ..( 89 skipped).    ..  ***************
 461    2024-01-23 03:42    34  ***************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        30798  Vendor specific
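
The error log entries in the dump above can also be summarised by address. The following is a minimal sketch (the helper name `unc_lbas` is hypothetical, and the sample holds only four of the logged errors) that extracts the decimal LBA from each "Error: UNC" line and reports how far apart the failing sectors are. A tight cluster of addresses would be more suggestive of a damaged region on the platter than of random link noise, although this is a heuristic and not a diagnosis.

```python
import re

# "Error: UNC" lines copied from the SMART Extended Comprehensive Error Log above.
SAMPLE = """\
  40 -- 51 00 00 00 01 55 9f 4c e8 40 00  Error: UNC at LBA = 0x1559f4ce8 = 5731470568
  40 -- 51 00 00 00 01 55 9f 4c df 40 00  Error: UNC at LBA = 0x1559f4cdf = 5731470559
  40 -- 51 02 c0 00 01 55 9e c5 07 40 00  Error: UNC 704 sectors at LBA = 0x1559ec507 = 5731435783
  40 -- 51 02 c0 00 01 55 9e c4 f7 40 00  Error: UNC 704 sectors at LBA = 0x1559ec4f7 = 5731435767
"""

# Matches both "UNC at LBA = ..." and "UNC 704 sectors at LBA = ...".
UNC = re.compile(r"Error: UNC(?: \d+ sectors)? at LBA = 0x[0-9a-f]+ = (\d+)")


def unc_lbas(text):
    """Return the sorted, de-duplicated LBAs of all UNC errors in the text."""
    return sorted({int(m.group(1)) for m in UNC.finditer(text)})


lbas = unc_lbas(SAMPLE)
span_sectors = lbas[-1] - lbas[0]
print(f"{len(lbas)} distinct LBAs between {lbas[0]} and {lbas[-1]}")
print(f"span: {span_sectors} sectors ({span_sectors * 512 / 2**20:.1f} MiB at 512 bytes per sector)")
```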

Jailer · Jan 27, 2024

Even though that drive is passing short and long SMART tests, I would replace it just for peace of mind. You're getting a lot of internal CRC errors, and that would worry me if that drive were part of a pool that contained important data.

joeschmuck · Jan 27, 2024

Your drive WD-WCC4N7RE3DC6 is failing. Replace it. Do not gamble that a passing SMART Long test means all is good.
Specifically, ID 1 and ID 200 are your indicators. ID 1 is valid because this is a WD drive, not a Seagate. ID 200 is valid as well: MultiZone Error is actually a write error. Don't ask me why it's called MultiZone, I'm sure there is some reason for it.

Of course you have other error messages from TrueNAS which got you here.

Check your warranty, you have only 2 years of runtime on the drive. This kind of failure would qualify for RMA.
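
The attribute check described above (ID 1 and ID 200, alongside the usual 5, 197, 198 and 199) can be sketched in a few lines. This is a minimal, hypothetical helper: the name `nonzero_watched` and the `WATCHED` set are assumptions made for illustration, and raw values are vendor specific, so a nonzero value is a prompt to look closer rather than a verdict.

```python
import re

# A few rows copied from the attribute table in the older smartctl -x dump above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    412
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   172   165   000    -    11466
"""

# Assumption: the attribute IDs worth flagging on this WD drive.
WATCHED = {1, 5, 197, 198, 199, 200}

# Columns: ID, name, flags, value, worst, threshold, fail, raw value.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+(\d+)")


def nonzero_watched(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        attr_id, name, raw = int(match.group(1)), match.group(2), int(match.group(6))
        if attr_id in watched and raw != 0:
            flagged[name] = raw
    return flagged


print(nonzero_watched(SAMPLE))
```

On the sample rows this flags only Raw_Read_Error_Rate and Multi_Zone_Error_Rate, the two attributes singled out above, while 5, 197, 198 and 199 stay at zero.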

fmiz · Jan 28, 2024

joeschmuck said:
Check your warranty, you have only 2 years of runtime on the drive. This kind of failure would qualify for RMA.

The drive is much older though; I think I bought it in 2015. I managed to find an old WD spec sheet: MTBF is 1,000,000 hours, but the warranty lasts 3 years. Will WDC accept an RMA request for a "lightly" used, almost 9-year-old drive?

fmiz · Mar 10, 2024

So I'm here with a minor update; I hope it is useful to others who want to compare TrueNAS SCALE and CORE error messages. I have backed up all my stuff and kept testing until the error showed up again: I had to do a scrub, a SMART long self-test, then another scrub. I think this shows how subtle the problem has been.
This is what can be seen from dmesg on TrueNAS SCALE:

Code:

[84292.766458] ata5.00: exception Emask 0x0 SAct 0x2028000 SErr 0x0 action 0x0
[84292.766803] ata5.00: irq_stat 0x40000008
[84292.767134] ata5.00: failed command: READ FPDMA QUEUED
[84292.767447] ata5.00: cmd 60/00:c8:d8:41:9e/08:00:55:01:00/40 tag 25 ncq dma 1048576 in
                        res 41/40:00:c8:49:9e/00:00:55:01:00/40 Emask 0x409 (media error) <F>
[84292.768106] ata5.00: status: { DRDY ERR }
[84292.768422] ata5.00: error: { UNC }
[84292.770541] ata5.00: configured for UDMA/133
[84292.772422] sd 4:0:0:0: [sde] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
[84292.773738] sd 4:0:0:0: [sde] tag#25 Sense Key : Medium Error [current]
[84292.775012] sd 4:0:0:0: [sde] tag#25 Add. Sense: Unrecovered read error - auto reallocate failed
[84292.776821] sd 4:0:0:0: [sde] tag#25 CDB: Read(16) 88 00 00 00 00 01 55 9e 41 d8 00 00 08 00 00 00
[84292.778824] I/O error, dev sde, sector 5731402200 op 0x0:(READ) flags 0x0 phys_seg 34 prio class 2
[84292.781020] zio pool=heaven vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7RE3DC6-part2 error=5 type=1 offset=2932330377216 size=1048576 flags=1074267312
[84292.784101] ata5: EH complete

The webUI shows less than CORE, unfortunately, but it still shows something:

Code:

Mar 10 15:43:47 freenas kernel: ata5.00: configured for UDMA/133
Mar 10 15:43:47 freenas kernel: sd 4:0:0:0: [sde] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
Mar 10 15:43:47 freenas kernel: sd 4:0:0:0: [sde] tag#25 Sense Key : Medium Error [current] 
Mar 10 15:43:47 freenas kernel: sd 4:0:0:0: [sde] tag#25 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 10 15:43:47 freenas kernel: sd 4:0:0:0: [sde] tag#25 CDB: Read(16) 88 00 00 00 00 01 55 9e 41 d8 00 00 08 00 00 00
Mar 10 15:43:47 freenas kernel: zio pool=heaven vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N7RE3DC6-part2 error=5 type=1 offset=2932330377216 size=1048576 flags=1074267312
Mar 10 15:43:47 freenas kernel: ata5: EH complete
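
The hex fields in the kernel messages above can be decoded by hand to confirm that they all describe the same read. The sketch below is a minimal illustration (both function names are hypothetical). It assumes the standard READ(16) layout, with an 8-byte big-endian LBA at bytes 2 to 9 and a 4-byte block count at bytes 10 to 13, and it assumes the libata `res` string is laid out as status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device.

```python
def read16_lba_and_blocks(cdb_hex):
    """Decode the starting LBA and block count from a SCSI READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    blocks = int.from_bytes(cdb[10:14], "big")
    return lba, blocks


def libata_taskfile_lba(taskfile):
    """Decode the 48-bit LBA from a libata 'cmd' or 'res' taskfile string."""
    parts = taskfile.split("/")
    low = parts[1].split(":")    # error-or-feature, nsect, lbal, lbam, lbah
    high = parts[2].split(":")   # hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah
    # Most significant byte first: hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal.
    lba_bytes = [high[4], high[3], high[2], low[4], low[3], low[2]]
    return int("".join(lba_bytes), 16)


# Values copied from the dmesg output above.
start_lba, blocks = read16_lba_and_blocks("88 00 00 00 00 01 55 9e 41 d8 00 00 08 00 00 00")
failed_lba = libata_taskfile_lba("41/40:00:c8:49:9e/00:00:55:01:00/40")

print(f"request: LBA {start_lba}, {blocks} blocks ({blocks * 512} bytes)")
print(f"drive reported the failure at LBA {failed_lba} (0x{failed_lba:x})")
print("inside the request:", start_lba <= failed_lba < start_lba + blocks)
```

The decoded start LBA matches the "sector 5731402200" in the I/O error line, the length matches the 1048576-byte zio size, and the LBA reported by the drive falls inside that request.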

This is the current SMART status. Error 73 was triggered on SCALE; everything before that happened under TrueNAS CORE.

Code:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7RE3DC6
LU WWN Device Id: 5 0014ee 2b54c166e
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar 10 19:07:59 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (39840) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 399) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    899
  3 Spin_Up_Time            POS--K   172   171   021    -    6375
  4 Start_Stop_Count        -O--CK   100   100   000    -    108
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   076   076   000    -    17831
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    108
192 Power-Off_Retract_Count -O--CK   200   200   000    -    23
193 Load_Cycle_Count        -O--CK   200   200   000    -    232
194 Temperature_Celsius     -O---K   122   100   000    -    28
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   180   165   000    -    8017
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 73 (device log contains only the most recent 24 errors)
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 73 [0] occurred at disk power-on lifetime: 17828 hours (742 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 9e 49 c8 40 00  Error: UNC at LBA = 0x1559e49c8 = 5731404232

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 03 40 00 78 00 01 55 9e 49 d8 40 08  1d+14:45:27.605  READ FPDMA QUEUED
  60 08 00 00 c8 00 01 55 9e 41 d8 40 08  1d+14:45:27.605  READ FPDMA QUEUED
  60 02 00 00 80 00 01 55 9e 3f 78 40 08  1d+14:45:27.590  READ FPDMA QUEUED
  60 01 00 00 70 00 01 55 9e 3c f0 40 08  1d+14:45:27.589  READ FPDMA QUEUED
  60 06 88 00 c0 00 01 55 9e 36 08 40 08  1d+14:45:27.577  READ FPDMA QUEUED

Error 72 [23] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 d0 4f 40 00  Error: UNC at LBA = 0x155a0d04f = 5731569743

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 06 c0 00 f0 00 01 55 a0 cc a8 40 08     16:15:15.274  READ FPDMA QUEUED
  60 00 10 00 e8 00 01 5d 50 9e 90 40 08     16:15:15.274  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     16:15:15.243  READ LOG EXT
  60 06 c0 00 d8 00 01 55 a0 cc a8 40 08     16:15:08.248  READ FPDMA QUEUED
  60 00 10 00 d0 00 01 5d 50 9e 90 40 08     16:15:08.248  READ FPDMA QUEUED

Error 71 [22] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 d0 58 40 00  Error: UNC at LBA = 0x155a0d058 = 5731569752

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 06 c0 00 d8 00 01 55 a0 cc a8 40 08     16:15:08.248  READ FPDMA QUEUED
  60 00 10 00 d0 00 01 5d 50 9e 90 40 08     16:15:08.248  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     16:15:08.216  READ LOG EXT
  60 00 40 00 c0 00 00 de 0d b1 20 40 08     16:15:01.190  READ FPDMA QUEUED
  60 00 40 00 b8 00 00 de 0d b0 e0 40 08     16:15:01.190  READ FPDMA QUEUED

Error 70 [21] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 d0 4f 40 00  Error: UNC at LBA = 0x155a0d04f = 5731569743

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 c0 00 00 de 0d b1 20 40 08     16:15:01.190  READ FPDMA QUEUED
  60 00 40 00 b8 00 00 de 0d b0 e0 40 08     16:15:01.190  READ FPDMA QUEUED
  60 05 00 00 b0 00 00 de 0d ab e0 40 08     16:15:01.190  READ FPDMA QUEUED
  60 06 c0 00 a8 00 01 55 a0 cc a8 40 08     16:15:01.190  READ FPDMA QUEUED
  60 00 10 00 a0 00 01 5d 50 9e 90 40 08     16:15:01.161  READ FPDMA QUEUED

Error 69 [20] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 d0 4f 40 00  Error: UNC at LBA = 0x155a0d04f = 5731569743

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 05 00 00 80 00 00 de 0d ab e0 40 08     16:14:54.144  READ FPDMA QUEUED
  60 00 40 00 78 00 00 de 0d ab a0 40 08     16:14:54.124  READ FPDMA QUEUED
  60 07 c0 00 70 00 00 de 0d a3 e0 40 08     16:14:54.114  READ FPDMA QUEUED
  60 08 00 00 68 00 00 de 0d 9b e0 40 08     16:14:54.104  READ FPDMA QUEUED
  60 07 c0 00 60 00 00 de 0d 93 b8 40 08     16:14:54.096  READ FPDMA QUEUED

Error 68 [19] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 d0 4f 40 00  Error: UNC at LBA = 0x155a0d04f = 5731569743

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 e0 00 00 de 0d 63 78 40 08     16:14:46.972  READ FPDMA QUEUED
  60 08 00 00 d8 00 00 de 0d 5b 78 40 08     16:14:46.971  READ FPDMA QUEUED
  60 03 c0 00 d0 00 00 de 0d 57 b8 40 08     16:14:46.958  READ FPDMA QUEUED
  60 04 c0 00 c8 00 00 de 0d 52 f8 40 08     16:14:46.948  READ FPDMA QUEUED
  60 08 00 00 c0 00 00 de 0d 4a f8 40 08     16:14:46.940  READ FPDMA QUEUED

Error 67 [18] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 bf 80 40 00  Error: UNC at LBA = 0x155a0bf80 = 5731565440

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 05 c0 00 40 00 01 55 a0 c6 48 40 08     16:14:39.736  READ FPDMA QUEUED
  60 04 00 00 38 00 01 55 a0 c1 b0 40 08     16:14:39.736  READ FPDMA QUEUED
  60 08 00 00 30 00 01 55 a0 b9 b0 40 08     16:14:39.736  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     16:14:39.704  READ LOG EXT
  60 05 c0 00 20 00 01 55 a0 c6 48 40 08     16:14:32.709  READ FPDMA QUEUED

Error 66 [17] occurred at disk power-on lifetime: 17748 hours (739 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 55 a0 bb d7 40 00  Error: UNC at LBA = 0x155a0bbd7 = 5731564503

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 05 c0 00 20 00 01 55 a0 c6 48 40 08     16:14:32.709  READ FPDMA QUEUED
  60 04 00 00 18 00 01 55 a0 c1 b0 40 08     16:14:32.709  READ FPDMA QUEUED
  60 08 00 00 10 00 01 55 a0 b9 b0 40 08     16:14:32.709  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08     16:14:32.677  READ LOG EXT
  60 00 40 00 00 00 00 de 0d 1b f8 40 08     16:14:25.659  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17819         -
# 2  Short offline       Completed without error       00%     17723         -
# 3  Extended offline    Completed without error       00%     17664         -
# 4  Short offline       Completed without error       00%     17626         -
# 5  Extended offline    Completed without error       00%     17502         -
# 6  Short offline       Completed without error       00%     17480         -
# 7  Short offline       Completed without error       00%     17325         -
# 8  Short offline       Completed without error       00%     17157         -
# 9  Extended offline    Completed without error       00%     17069         -
#10  Short offline       Completed without error       00%     16989         -
#11  Short offline       Completed without error       00%     16804         -
#12  Extended offline    Completed without error       00%     16788         -
#13  Short offline       Completed without error       00%     16541         -
#14  Short offline       Completed without error       00%     16374         -
#15  Short offline       Completed without error       00%     16206         -
#16  Short offline       Completed without error       00%     16038         -
#17  Extended offline    Completed without error       00%     15949         -
#18  Short offline       Completed without error       00%     15870         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    28 Celsius
Power Cycle Min/Max Temperature:     27/33 Celsius
Lifetime    Min/Max Temperature:      2/50 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (214)

Index    Estimated Time   Temperature Celsius
 215    2024-03-10 11:10    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 220    2024-03-10 11:15    30  ***********
 221    2024-03-10 11:16    31  ************
 ...    ..(118 skipped).    ..  ************
 340    2024-03-10 13:15    31  ************
 341    2024-03-10 13:16    30  ***********
 ...    ..(  4 skipped).    ..  ***********
 346    2024-03-10 13:21    30  ***********
 347    2024-03-10 13:22    29  **********
 ...    ..( 38 skipped).    ..  **********
 386    2024-03-10 14:01    29  **********
 387    2024-03-10 14:02    28  *********
 ...    ..(148 skipped).    ..  *********
  58    2024-03-10 16:31    28  *********
  59    2024-03-10 16:32    27  ********
 ...    ..(129 skipped).    ..  ********
 189    2024-03-10 18:42    27  ********
 190    2024-03-10 18:43    28  *********
 ...    ..(  8 skipped).    ..  *********
 199    2024-03-10 18:52    28  *********
 200    2024-03-10 18:53    29  **********
 ...    ..(  6 skipped).    ..  **********
 207    2024-03-10 19:00    29  **********
 208    2024-03-10 19:01    30  ***********
 ...    ..(  5 skipped).    ..  ***********
 214    2024-03-10 19:07    30  ***********

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            6  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4       151774  Vendor specific
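
The "Error 73" entry above can be cross-checked against the kernel log. The sketch below is only an illustration with made-up helper names (`parse_error_lba`, `parse_ncq_read`). It relies on the column layout smartctl prints and on the NCQ convention that, for READ FPDMA QUEUED, the FEATR field carries the sector count and bits 7:3 of COUNT carry the queue tag. The rows are copied verbatim from the output.

```python
# Rows copied from the "Error 73" entry in the smartctl output above.
ERROR_ROW = "40 -- 51 00 00 00 01 55 9e 49 c8 40 00"
COMMAND_ROWS = [
    "60 03 40 00 78 00 01 55 9e 49 d8 40 08",
    "60 08 00 00 c8 00 01 55 9e 41 d8 40 08",
    "60 02 00 00 80 00 01 55 9e 3f 78 40 08",
    "60 01 00 00 70 00 01 55 9e 3c f0 40 08",
    "60 06 88 00 c0 00 01 55 9e 36 08 40 08",
]


def parse_error_lba(row):
    """LBA from an error register row (LBA_48, LH, LM, LL are tokens 5-10)."""
    return int("".join(row.split()[5:11]), 16)


def parse_ncq_read(row):
    """Decode one READ FPDMA QUEUED (0x60) row into tag, LBA and sector count."""
    b = [int(tok, 16) for tok in row.split()]
    if len(b) != 13 or b[0] != 0x60:
        raise ValueError("not a READ FPDMA QUEUED row")
    return {
        "sectors": (b[1] << 8) | b[2],  # FEATR holds the sector count for NCQ
        "tag": b[4] >> 3,               # NCQ tag sits in bits 7:3 of COUNT
        "lba": int.from_bytes(bytes(b[5:11]), "big"),
    }


error_lba = parse_error_lba(ERROR_ROW)
commands = [parse_ncq_read(row) for row in COMMAND_ROWS]
# Keep only the commands whose sector range contains the failing LBA.
hits = [c for c in commands if c["lba"] <= error_lba < c["lba"] + c["sectors"]]

for c in hits:
    print(f"tag {c['tag']}: LBA {c['lba']} + {c['sectors']} sectors covers {error_lba}")
```

Only one of the five queued reads covers the failing LBA: the 2048-sector read at LBA 5731402200 with tag 25, the same tag (`tag#25`) and starting LBA as the failed Read(16) in the kernel log.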

joeschmuck · Mar 16, 2024

Are you planning to replace the drive or just wait until it completely craps out on you? I'm asking because if you are just waiting, I will unwatch this thread and consider the problem fixed.

fmiz · Mar 16, 2024

joeschmuck said:
Are you planning to replace the drive or just wait until it completely craps out on you? I'm asking because if you are just waiting, I will unwatch this thread and consider the problem fixed.

I replaced all the drives and migrated to SCALE, so this is solved. Now I just have to work out whether everything is fine with ACLs/permissions (I do not need them, but I've discovered that chmod always fails, and the docs confuse me). Thanks for your advice on my issue, joeschmuck. If I ever manage to find enough time, I'd like to help you update the hard drive troubleshooting guide.

joeschmuck · Mar 16, 2024

fmiz said:
I'd like to help you update the hard drive troubleshooting guide

I will take that help. I am slowly rewriting it; like you, I find it difficult to find time. I'm updating it and including NVMe, but that comes after Multi-Report v3.0 comes out, which I hope to finalize before the weekend is up. I have a few loose ends to tie up first, then I will send it out to a few people to test, folks with some small and huge systems, so maybe next weekend it will be public. If you use Multi-Report, let me know if you would like to test version 3.0. The GUI configuration program I wanted will not be completed by then. It is well over halfway done, but it took me months to get that far because I'm not a Python programmer. I will get there eventually.

I hope the ACLs are playing nice with you. I hate those damn things. Fortunately on my personal system I just open it up. My server is not exposed to the internet. Simple for me but I do need to know how to spell ACL.

fmiz · Mar 16, 2024

joeschmuck said:
If you use Multi-Report, let me know if you would like to test version 3.0

Before posting here about the issue I had also read about it, but I've never used it. I think TrueNAS should keep track of how the SMART values change over time. With all the users on this forum running different setups, that data could be used to build something like Backblaze's reports...
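
As a rough illustration of that idea, the sketch below compares successive SMART readings and reports which fields moved. It is not a TrueNAS or Multi-Report feature; the snapshot layout, labels, and the `changed_fields` helper are hypothetical. The numbers are the ones quoted in this thread (Device Error Count 46, 59 and 73, with attributes 5, 197 and 198 at 0 each time).

```python
# Readings quoted in this thread; the labels are informal descriptions.
SNAPSHOTS = [
    {"label": "first check", "device_error_count": 46,
     "attr_5": 0, "attr_197": 0, "attr_198": 0},
    {"label": "after cable swap and scrub", "device_error_count": 59,
     "attr_5": 0, "attr_197": 0, "attr_198": 0},
    {"label": "after migrating to SCALE", "device_error_count": 73,
     "attr_5": 0, "attr_197": 0, "attr_198": 0},
]


def changed_fields(previous, current):
    """Map each field that differs to its (old, new) pair, ignoring the label."""
    return {
        key: (previous[key], current[key])
        for key in current
        if key != "label" and previous[key] != current[key]
    }


# Compare each snapshot with the one taken just before it.
for previous, current in zip(SNAPSHOTS, SNAPSHOTS[1:]):
    print(current["label"], changed_fields(previous, current))
```

With these readings only `device_error_count` ever changes, which is the pattern in the thread title: the usual failure attributes stay at zero while the drive keeps logging errors.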

joeschmuck · Mar 16, 2024

People use Multi-Report for various reasons. My reason for creating the script (shortly after I joined the forums) was to ensure my drives were being tested as I expected. Let's face it, FreeNAS 8.0x back then had some issues, and TrueNAS has a few issues as well, mostly caused by user action, though not by the user doing something wrong. The TrueNAS software could be updated to notice the change and send an alert, but it doesn't, not yet at least.

Say you have your TrueNAS all set up and SMART tests are scheduled as you desire. Months later you need to replace a drive for some reason. It does not need to be a failure; it could be that you want to stick in a larger drive, or replace an HDD with an SSD. Regardless, you replace the drive, several days or weeks go by, and you notice SMART tests are not being conducted on this new drive. You would not be happy.

Why would this happen? Because TrueNAS uses the serial number to identify which drives were set up for SMART testing. If you do not go back into the GUI and set up SMART testing again for this new drive, TrueNAS will run normally, with no alerts, nothing.
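
The check described here boils down to a set difference on serial numbers. The sketch below only illustrates that idea and does not show how TrueNAS or Multi-Report implement it. The function name and the placeholder serials are hypothetical.

```python
def drives_without_tests(installed_serials, scheduled_serials):
    """Serials present in the system but missing from the SMART test schedule."""
    return sorted(set(installed_serials) - set(scheduled_serials))


# Hypothetical example: one drive was swapped, the schedule was not updated.
scheduled = ["SERIAL-A", "SERIAL-B", "SERIAL-OLD"]
installed = ["SERIAL-A", "SERIAL-B", "SERIAL-NEW"]

missing = drives_without_tests(installed, scheduled)
print(missing)
```

Any serial in the result belongs to a drive that is installed but will never be tested until its schedule is recreated.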

So this script was born to toss the user a quick email stating all is good or not so good. Below is a screen capture of the chart section. Below this is a text section to look at should the script tell you there is an issue. This script has evolved a lot: now someone can run the script and email me directly (using the '-dump email' switch) with a good amount of data (nothing personal except your real email address, which I never share), and I can diagnose the problem and provide a solution.
