Unhealthy pool - Replace HDD or not (yet)?

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi everyone,

A couple of days ago, one of my HDDs caused trouble during a scrub of my main pool. In the pool status window it showed multiple read and write errors, and it degraded my pool. The HDD also seemed to become unreachable: I could no longer run a SMART test on it (smartctl couldn't find the device, or something like that).

After power-cycling my server, the HDD became available again and TrueNAS automatically started resilvering my pool. For a short while the pool was Online (Healthy) again, but after a while it became Online (Unhealthy) (I don't remember the exact moment it turned unhealthy, or what triggered it).

I ran a long SMART test on it, which completed this morning. The pool status window in TrueNAS also seems to have reset: it no longer shows the read and write errors, but it does now show 3 CKSUM errors.

I'm now wondering whether I should replace the HDD or give it another chance. The pool is RAIDZ2, so I should be fine even if it dies on me again. But I also already have a spare HDD ready, in case replacing it is the better option...

Below is the output I gathered while researching this...

Pool status:
Pool Status.png


SMART output after completing a long SMART test:
data# smartctl -a /dev/da6
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: HGST Ultrastar He10
Device Model: HGST HUH721010ALN600
Serial Number: 2YHUHLHD
LU WWN Device Id: 5 000cca 273d9af54
Firmware Version: LHGAT38Q
User Capacity: 10,000,831,348,736 bytes [10.0 TB]
Sector Size: 4096 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed May 31 12:05:50 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.

Total time to complete Offline
data collection: ( 93) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: (1086) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 134 134 054 Pre-fail Offline - 96
3 Spin_Up_Time 0x0007 146 146 024 Pre-fail Always - 451 (Average 447)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 248
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 128 128 020 Pre-fail Offline - 18
9 Power_On_Hours 0x0012 096 096 000 Old_age Always - 28856

10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 248
22 Helium_Level 0x0023 100 100 025 Pre-fail Always - 100
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 1802
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 1802
194 Temperature_Celsius 0x0002 157 157 000 Old_age Always - 38 (Min/Max 18/65)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 28856 -
# 2 Short offline Completed without error 00% 28835 -
# 3 Short offline Completed without error 00% 28726 -
# 4 Short offline Completed without error 00% 28702 -
# 5 Short offline Completed without error 00% 28678 -
# 6 Short offline Completed without error 00% 28654 -
# 7 Short offline Completed without error 00% 28630 -
# 8 Short offline Completed without error 00% 28606 -
# 9 Short offline Completed without error 00% 28582 -
#10 Short offline Completed without error 00% 28558 -
#11 Short offline Completed without error 00% 28534 -
#12 Short offline Completed without error 00% 28510 -
#13 Short offline Completed without error 00% 28486 -
#14 Short offline Completed without error 00% 28462 -
#15 Short offline Completed without error 00% 28438 -
#16 Short offline Completed without error 00% 28414 -
#17 Short offline Completed without error 00% 28390 -
#18 Short offline Completed without error 00% 28366 -
#19 Short offline Completed without error 00% 28342 -
#20 Short offline Completed without error 00% 28318 -
#21 Short offline Completed without error 00% 28294 -

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
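
The attribute table above is the part worth watching. As a sketch (my addition, not from the thread), this is one way to pull the four failure-predicting attributes (5, 197, 198, 199) out of `smartctl -A`-style output — the heredoc is a trimmed sample matching the table above; on a live system you would pipe in `smartctl -A /dev/da6` instead:

```shell
# Sketch: extract the attributes most predictive of disk failure from
# smartctl attribute output. The sample below mirrors the post's values.
smart_sample() {
cat <<'EOF'
  5 Reallocated_Sector_Ct   0x0033 100 100 005 Pre-fail Always  - 0
197 Current_Pending_Sector  0x0022 100 100 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0008 100 100 000 Old_age  Offline - 0
199 UDMA_CRC_Error_Count    0x000a 200 200 000 Old_age  Always  - 0
EOF
}
# Print "ID NAME RAW" for each watch-list attribute.
smart_sample | awk '$1 ~ /^(5|197|198|199)$/ {print $1, $2, $NF}'
```

All four raw values are 0 here, which is why the SMART report alone gives no reason to condemn the disk.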

dmesg output from before I power-cycled the server. Below I've copied the parts that I think are relevant (please let me know if you need the full output):
...
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1394 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 526 loginfo 31110d00
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 09 89 1d 27 00 00 02 00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1048 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 2102 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1554 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 513 loginfo 31110d00
mps0: Controller reported scsi ioc terminated tgt 14 SMID 568 loginfo 31110d00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f7 00 00 05 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f1 00 00 11 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f8 02 00 00 16 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 76 8f c0 74 00 00 05 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 76 8f c0 6e 00 00 06 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 e0 00 00 06 00
(da6:mps0:0:14:0): CAM status: CCB request completed with an error
(da6:mps0:0:14:0): Retrying command, 3 more tries remain
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 e0 00 00 06 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f7 00 00 05 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f1 00 00 11 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f8 02 00 00 16 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
...
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 e0 00 00 06 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=5629082468352, length=24576)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f7 00 00 05 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=5629082562560, length=20480)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f7 f1 00 00 11 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=5629082537984, length=69632)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 52 29 f8 02 00 00 16 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=5629082607616, length=90112)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 76 8f c0 6e 00 00 06 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=8130305908736, length=24576)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 76 8f c0 74 00 00 05 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=8130305933312, length=20480)]
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 09 89 1d 27 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_write_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[WRITE(offset=638101123072, length=8192)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
...
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=270336, length=8192)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=9983650177024, length=8192)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=9983650439168, length=8192)]
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 07 6d 9e ec 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 07 6d 9e ec 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 07 6d 9e ec 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): WRITE(10). CDB: 2a 00 07 6d 9e ec 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_write_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[WRITE(offset=493282050048, length=4096)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 c2 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=270336, length=8192)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 42 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=9983650177024, length=8192)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff 82 00 00 02 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
...
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=32768, length=4096)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff f9 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff f9 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff f9 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff f9 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 91 87 ff f9 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=9983650926592, length=4096)]
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 80 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 80 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 80 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 80 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Retrying command (per sense data)
(da6:mps0:0:14:0): READ(10). CDB: 28 00 00 40 00 80 00 00 01 00
(da6:mps0:0:14:0): CAM status: SCSI Status Error
(da6:mps0:0:14:0): SCSI status: Check Condition
(da6:mps0:0:14:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da6:mps0:0:14:0): Error 5, Retries exhausted
GEOM_ELI: g_eli_read_done() failed (error=5) gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli[READ(offset=0, length=4096)]
mps0: Controller reported scsi ioc terminated tgt 14 SMID 563 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 950 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 878 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 2084 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 207 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 176 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1810 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1205 loginfo 31111000
mps0: Controller reported scsi ioc terminated tgt 14 SMID 1503 loginfo 31111000
...
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Personally I would (in order):
1. Make absolutely sure the cables are seated properly - see if the problem comes back
2. Replace the SATA cable - see if the problem comes back
3. Remove and reinsert the HDD, making sure the cables are seated properly
4. Run badblocks on the drive, either on the TN machine or elsewhere - see what happens

To me this looks like cabling
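
Step 4 above could be sketched as a guarded shell function (my addition; the device argument is deliberately a placeholder, because `badblocks -w` destroys everything on the disk):

```shell
# Sketch: a destructive badblocks surface test for a suspect drive.
# WARNING: -w overwrites every sector -- only run this on a disk whose
# data you no longer need (e.g. after it has been replaced in the pool).
surface_test() {
  dev="$1"                      # e.g. /dev/da6 once the disk is pulled
  # -b 4096: this He10 reports 4096-byte sectors; -w: write and verify
  # test patterns; -s: show progress; -v: verbose
  badblocks -b 4096 -wsv "$dev"
}
```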
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Thanks for the response NugentS!

Can you perhaps explain why you suspect cabling?

At first I thought cabling was unlikely, as I hadn't opened the case in more than a year. But then I remembered that I did open the case a few days before the issue. I didn't touch any of the cabling, but perhaps the panel that I opened made slight contact with a cable...

Anyway, to make the story complete:
  • On 25 May I got a warning email that one of my fans had stopped working. That same day I opened the case and confirmed that a fan had indeed died. This fan was cooling 4 of the 8 HDDs, but even without it the HDD temperatures didn't rise to a problematic level (40-45°C instead of 35-40°C). I did not replace the fan at that point.
  • While checking the fan on 25 May, I also noticed a strange noise coming from my server. After investigating where it was coming from, I found that one of my HDDs was vibrating more than normal. As I didn't see any errors in TrueNAS yet, I didn't take any action on this either.
  • On 27 May at 01:28 I got the following mail:
    New alerts:
    * Pool hgstpool state is DEGRADED: One or more devices are faulted in response
    to persistent errors. Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
    The following devices are not healthy:
    * Disk ATA HGST HUH721010AL 2YHUHLHD is FAULTED
  • On 27 May at 01:53 I got these mails:
    New alerts:
    * Device: /dev/da6 [SAT], failed to read SMART Attribute Data.
    New alerts:
    * Device: /dev/da6 [SAT], Read SMART Error Log Failed.
    New alerts:
    * Device: /dev/da6 [SAT], not capable of SMART self-check.
    New alerts:
    * Device: /dev/da6 [SAT], Read SMART Self-Test Log Failed.
  • I only discovered these emails on 28 May, when my monthly scrub was running (at first I thought the issues were triggered by the stress of the monthly scrub, but I checked: the scrub only starts on the 28th of the month at 4am, after the errors had already appeared).
    After discovering the emails, I first tried to collect as much info as possible while the system was still running (like the dmesg above; I also tried running a short SMART test, which didn't work).
    After that I powered off the server, replaced the broken 140mm fan with a 120mm fan I still had spare, visually checked the HDD for issues (without touching the cables) and powered the server on again.
  • After powering on the server, I noticed that a short SMART test worked again, so I then started a long SMART test.
  • Once that test completed, I posted my issue here...
So in short:
  • The vibrating HDD makes me suspect the HDD itself, but I didn't spot any issues in the SMART test results (can anyone confirm this?)
  • The cables could perhaps also be the cause, but that would mean they are EXTREMELY fragile (the case panel hardly even touches those cables, if at all). Do you think that is really possible?
For now, I'll power off the server once more, unplug/replug the cables of that specific HDD and then wait to see whether the problem re-occurs before taking any further action (this is what you were suggesting, right?)
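
The "reseat and watch" plan could be sketched as the following follow-up commands (my addition, not from the thread; `hgstpool` is the pool name from the alert mail above, and these would be run after reseating the cables):

```shell
# Sketch: after reseating the cables, reset the error counters, force a
# full re-read, and then watch whether new errors accumulate.
reseat_followup() {
  zpool clear hgstpool          # reset the READ/WRITE/CKSUM counters
  zpool scrub hgstpool          # re-read all data to surface fresh errors
  zpool status -v hgstpool      # watch whether the counters climb again
}
```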

Thanks!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
A vibrating HDD - that's new info.
Remove it and test it outside the existing server. Really test it and make sure it's not vibrating out of the ordinary. Then consider putting it back - making sure it doesn't vibrate. Maybe put another drive in its slot and check for vibration.

Vibration != good
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
You are absolutely right on that one! Can't explain why I didn't realize that myself sooner :confused:
Turned off my NAS until I replace the HDD with the spare I have. Will test it from my desktop later...
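
For the out-of-server test, a sketch of the smartctl side (my addition, not from the thread; per the 1086-minute estimate above, the extended test on this 10TB drive takes roughly 18 hours):

```shell
# Sketch: re-test the pulled disk from another machine.
offline_check() {
  dev="$1"                      # whatever the desktop enumerates it as
  smartctl -t long "$dev"       # start an extended self-test (~18 h here)
  # ...after the test has finished:
  smartctl -l selftest "$dev"   # read back the self-test log
  smartctl -A "$dev"            # and the attribute table
}
```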

Any insight into what you read from the logs I posted? And why did you suspect cables at first? (Which could, of course, still also be the cause.)
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I always suspect cables when it's checksum errors. An HBA error will normally show up across multiple drives.
Just on principle - you might check the HBA for excessive heat and make sure it's properly cooled.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Excessive heat on the HBA should not be the issue, as I went totally berserk tackling that potential problem:
1) I replaced the HBA heatsink with an old CPU heatsink
2) It is actively cooled by a 120mm fan, and also by my Noctua CPU cooler
3) I stress-tested my setup by simulating a 45°C room temperature

20230601_103000.jpg
20201230_175857.jpg
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Wow, that's a messed-up system...

Cooling should usually flow from front to back... what you have there is some kind of heat tornado.

CPU fan blows hot air down toward the HBA and the fan on that "stand" pushes it back up toward the CPU cooler?

Is there no way that you can orient the CPU cooler correctly and have the exhaust from that go straight out the back of the chassis (as I suppose was the proper design)?

What fan do you have on the front of the chassis to cover the disks on the right of the picture? Can that be spun up to push more air through that bottom part of the chassis (probably enough)?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I agree with @NugentS that this looks like a cable (data or power) issue.
It would be useful to have the data table from the SMART output. Vibration is no good, but it can also come from loose mountings; I would check those too.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi everyone,

In the meantime I have replaced the HDD with a spare and successfully resilvered my pool. I haven't started stress testing the potentially damaged HDD yet...

Wow, that's a messed-up system...

Cooling should usually flow from front to back... what you have there is some kind of heat tornado.

CPU fan blows hot air down toward the HBA and the fan on that "stand" pushes it back up toward the CPU cooler?

Is there no way that you can orient the CPU cooler correctly and have the exhaust from that go straight out the back of the chassis (as I suppose was the proper design)?

What fan do you have on the front of the chassis to cover the disks on the right of the picture? Can that be spun up to push more air through that bottom part of the chassis (probably enough)?
I know that my CPU cooler isn't ideally placed, but this was the only way it could be mounted on my motherboard / in my case, I'm afraid. It was either the stock AMD CPU cooler or this leftover Noctua CPU cooler, and I think the Noctua would beat the stock cooler even without a fan :D (the Noctua also blows the air upward, not downward as you thought)
Also, this server isn't really your typical server with high-rpm fans where orientation and airflow are critical for effective cooling. My server is built for silence and I'm only using large sub-1000rpm fans, whose airflow doesn't reach very far at all. That is also why I'm using that extra fan pointed directly at the HBA.
Anyway, as explained, cooling for this server has been stress tested to the extreme and is certainly overkill for all components, even during extreme heatwaves.
I agree with @NugentS that this looks like a cable (data or power) issue.
Would be useful to have the data table from the smart output. Vibration is no good, but it can depend on loose mountings, I would check those too.
Thanks for your feedback!
Can you please explain how I can retrieve this "data table" from the smart output?
Regarding the HDD mounting:
Each HDD is mounted on a tray, as shown below. This tray is screwed rock solid into the case. The HDD itself is screwed onto the tray through rubber grommets (so there is about 1mm of rubber between the tray and the HDD itself). When replacing the HDD I was able to confirm that everything was properly mounted, so that certainly can't be the reason for the vibration.
7-HDD-Tray-BK-3-scaled.jpg
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Can you please explain how I can retrieve this "data table" from the smart output?
With the drive in the system, you would use the command smartctl -a /dev/ad? where ? = the drive letter. As @NugentS suggested, running badblocks is a very good test, and maybe you should see what happens there. It is a destructive test, so when you run it, make double sure you enter the correct drive ID. Badblocks could take a few days to complete; it's very intense.

If badblocks comes back okay, then also run a SMART long test (smartctl -t long /dev/da?). This will likely take many hours to complete since it's a 10TB drive, possibly over 24 hours. If that comes back okay, then the drive should be good.

I would also recommend that if you have an available motherboard SATA port, you use it. Your drive ID may change from "da?" to "ada?", which will make it very easy to identify the suspect drive. Of course, always use the serial number, which will be shown by the first smartctl command above.
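A sketch of that test sequence is below. The device node and serial number are the ones from the first post and are assumptions that must be re-checked on your own system; since badblocks -w is DESTRUCTIVE, the script only prints the plan for review rather than executing anything:

```shell
#!/bin/sh
# Sketch only: prints the suggested test sequence for a suspect drive.
# DEVICE and EXPECTED_SERIAL are taken from the first post -- verify both
# against your own `smartctl -i` output before running any of these by hand.
DEVICE="/dev/da6"
EXPECTED_SERIAL="2YHUHLHD"

PLAN=$(cat <<EOF
smartctl -i $DEVICE | grep $EXPECTED_SERIAL  # confirm the node matches the suspect serial
badblocks -ws -b 4096 $DEVICE                # destructive write test (-b 4096: this is a 4Kn drive); can take days
smartctl -t long $DEVICE                     # then a SMART long self-test (likely 12-24h on 10TB)
smartctl -a $DEVICE                          # review attributes and the self-test log afterwards
EOF
)
printf '%s\n' "$PLAN"
```

The `-b 4096` choice matches the 4096-byte sector size reported in the smartctl output of the first post; on a 512-byte-sector drive it would differ.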

As for vibration... bearings do go bad. How much vibration is the drive emitting? Is it just a light hum, with only a very slight vibration when you touch the drive, or is it loud and clearly too much? It's all subjective since you are doing the checks: if you feel the drive is vibrating too much, then it's vibrating too much. Checking this kind of thing is something learned over time, but as the folks before me have said, vibrations are not good. Compare it to the other drives in the system: is it really different or worse? That could be your gauge. And if the drive is still under warranty, maybe it's time to get an exchange.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Additionally, some (most?) drives have G sensors able to measure vibrations. I can't remember whether that data is handled in SMART; you might need your manufacturer's help.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Additionally, some (most?) drives have G sensors able to measure vibrations. I can't remember whether that data is handled in SMART; you might need your manufacturer's help.
Sometimes I will see G-sensor data in the SMART data, but odds are you would need to use smartctl -x /dev/da? to have that data spit out. However, I was always under the impression this was a drop/shock sensor, not a vibration sensor. The drop sensor would be used to validate a warranty claim, but if the sensor thought the drive was dropped, the warranty is void. I do not recall ever seeing vibration sensor readings in the SMART data, but then again, I have never looked for it either. Some drives (WD) may report vibration. See the below link for a listing of what SMART may report on your drives.
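For what it's worth, on drives that do report it, the counter typically shows up in the attribute table as G-Sense_Error_Rate (attribute 191). A sketch of fishing out the raw value, using a fabricated sample line in place of real `smartctl -x` output:

```shell
# Fabricated excerpt standing in for `smartctl -x /dev/da6` output; the raw
# value in the last column is what would hint at shock/vibration events.
SMART_SAMPLE='191 G-Sense_Error_Rate      -O-R--   100   100   000    -    12
194 Temperature_Celsius     -O---K   185   185   000    -    35'

# Grab the last field of the G-Sense line, if the drive reports one at all.
GSENSE=$(printf '%s\n' "$SMART_SAMPLE" | awk '/G-Sense/ {print $NF}')
echo "G-sense raw value: ${GSENSE:-not reported}"
```

If the attribute is absent from the table, the drive simply doesn't expose a G-sense counter through SMART.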
 
Last edited:

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
With the drive in the system, you would use the command smartctl -a /dev/ad? where ? = the drive letter. As @NugentS suggested, running badblocks is a very good test, and maybe you should see what happens there. It is a destructive test, so when you run it, make double sure you enter the correct drive ID. Badblocks could take a few days to complete; it's very intense.

If badblocks comes back okay, then also run a SMART long test (smartctl -t long /dev/da?). This will likely take many hours to complete since it's a 10TB drive, possibly over 24 hours. If that comes back okay, then the drive should be good.

I would also recommend that if you have an available motherboard SATA port, you use it. Your drive ID may change from "da?" to "ada?", which will make it very easy to identify the suspect drive. Of course, always use the serial number, which will be shown by the first smartctl command above.

As for vibration... bearings do go bad. How much vibration is the drive emitting? Is it just a light hum, with only a very slight vibration when you touch the drive, or is it loud and clearly too much? It's all subjective since you are doing the checks: if you feel the drive is vibrating too much, then it's vibrating too much. Checking this kind of thing is something learned over time, but as the folks before me have said, vibrations are not good. Compare it to the other drives in the system: is it really different or worse? That could be your gauge. And if the drive is still under warranty, maybe it's time to get an exchange.
Thanks for the info!

smartctl -a /dev/ad? is exactly what I've posted in my start post. I guess @Davvo must have overlooked this... I ran this command AFTER running a SMART long test (as you described). At first sight, I couldn't spot any issues in the SMART output, but perhaps I missed something? (please let me know)
I'll try running badblocks later when thoroughly testing the HDD...

Regarding the vibrations, they did seem more than normal, as otherwise I wouldn't have noticed them... Also, as it is RAIDZ2, I would expect all drives to be used similarly, yet when I noticed it, only that drive was clearly giving A LOT more vibration than the others. As explained above, all my HDDs are separated from the case by about 1mm of rubber. The other HDDs I could hardly feel vibrating, while this one clearly stood out. The vibration was also constant for at least the couple of minutes that I was observing it.
Now the weirdest part... After the HDD "broke down" and degraded my pool, it came back to life after powering the system off and on. It then started resilvering automatically using the "broken" HDD. During this resilvering, I did not notice the excessive vibration anymore :?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Odd results. If it was vibrating before, I'm sure it will vibrate again.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
smartctl -a /dev/ad? is exactly what I've posted in my start post. I guess @Davvo must have looked over this...
Yup, my bad. You posted it in [QUOTE][/QUOTE] instead of [CODE][/CODE] and I overlooked the compacted data.
 