Short SMART test failing (sometimes)

Constantin.FF (Dabbler; joined Apr 6, 2022; 13 messages)

Hi all,

I have a strange issue with one of my SSD drives: the short SMART test fails roughly once every three runs.

The error message is "Interrupted (host reset)":
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)  90%        812              -
# 2  Short offline       Completed without error   00%        788              -
# 3  Short offline       Completed without error   00%        768              -
# 4  Short offline       Completed without error   00%        604              -
# 5  Short offline       Interrupted (host reset)  90%        601              -
# 6  Short offline       Completed without error   00%        450              -
# 7  Short offline       Completed without error   00%        450              -
# 8  Short offline       Interrupted (host reset)  90%        450              -
# 9  Short offline       Completed without error   00%        449              -
#10  Short offline       Completed without error   00%        449              -
#11  Short offline       Completed without error   00%        449              -
#12  Short offline       Interrupted (host reset)  90%        449              -
#13  Short offline       Completed without error   00%        448              -
#14  Extended offline    Completed without error   00%        369              -
#15  Short offline       Completed without error   00%        281              -
#16  Short offline       Completed without error   00%        225              -
#17  Short offline       Completed without error   00%        225              -
#18  Short offline       Interrupted (host reset)  90%        225              -
#19  Extended offline    Interrupted (host reset)  20%        224              -
#20  Short offline       Interrupted (host reset)  50%        223              -
#21  Extended offline    Interrupted (host reset)  50%        222              -
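
The log above can be tallied mechanically to check the "once every three runs" estimate. Below is a minimal Python sketch, not part of smartctl: the regular expression and the `tally` helper are assumptions written for this output layout, and only the first five log entries are included as sample data.

```python
import re
from collections import Counter

# First five entries copied from the self-test log above (sample data only).
SAMPLE_LOG = """\
# 1 Short offline Interrupted (host reset) 90% 812 -
# 2 Short offline Completed without error 00% 788 -
# 3 Short offline Completed without error 00% 768 -
# 4 Short offline Completed without error 00% 604 -
# 5 Short offline Interrupted (host reset) 90% 601 -
"""

# One self-test log row: number, test type, status text, remaining %, hours, LBA.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<kind>Short|Extended) offline\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def tally(log_text):
    """Count log rows per (test type, status); lines that do not match are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counts[(match["kind"], match["status"])] += 1
    return counts


counts = tally(SAMPLE_LOG)
for (kind, status), n in sorted(counts.items()):
    print(f"{kind:8} {status:26} {n}")
```

Counting all 21 entries by hand gives the same picture: 6 of the 18 short tests were interrupted, which is one in three.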

I believe the attribute behind the failures is this one, with a raw value of 6: 218 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 6
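
To compare readings over time, an attribute row like the one quoted above can be split into named fields. This is a minimal Python sketch that assumes the ten-column layout smartctl prints for this drive. `parse_attribute_row` and `LINK_ATTRIBUTE_IDS` are hypothetical names, and treating IDs 168, 199 and 218 as link-related is an assumption based only on their names in this drive's output.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    id: int
    name: str
    flag: str
    value: int
    worst: int
    thresh: int
    type: str
    updated: str
    when_failed: str
    raw: str


# Assumption: these IDs are the interface/link counters on this drive.
LINK_ATTRIBUTE_IDS = {168, 199, 218}


def parse_attribute_row(line):
    """Split one attribute row into fields; the raw value keeps any trailing text."""
    fields = line.split(None, 9)
    if len(fields) != 10:
        raise ValueError(f"not an attribute row: {line!r}")
    return Attribute(
        int(fields[0]), fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]),
        fields[6], fields[7], fields[8], fields[9].strip(),
    )


# Rows copied from this drive's attribute table.
rows = [
    "168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 6",
    "199 SATA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0",
    "218 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 6",
]

nonzero_link_attrs = [
    attr for attr in map(parse_attribute_row, rows)
    if attr.id in LINK_ATTRIBUTE_IDS and attr.raw != "0"
]
for attr in nonzero_link_attrs:
    print(f"{attr.id} {attr.name}: raw={attr.raw}")
```

Saving the parsed raw values after each test makes it easier to see whether a counter actually increases when a test is interrupted, or whether the 6 is a leftover from before the cable change.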

As I understand it, that points to a connection issue: the SATA cable, the controller, or the connector.
So I replaced the disk's cable and moved it to a different SATA socket, but I still get the same error.
I have three more SSDs and one HDD connected without any issues. The faulty drive is part of a mirror that I use for ix-applications.
Previously, after a failed SMART test, I ran a manual test, which succeeded, and then ran zpool clear ssd-pool-1.
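
The manual re-test described above can be scripted so that every run is started and read back the same way. Below is a minimal Python sketch, assuming smartctl is on the PATH and the script runs as root. `run_short_test` is a hypothetical helper, and the 120-second default wait is taken from the 2-minute recommended polling time this drive reports. The `runner` and `sleep` parameters exist only so the function can be exercised without touching a real disk.

```python
import subprocess
import time


def run_short_test(device, wait_seconds=120, runner=subprocess.run, sleep=time.sleep):
    """Start a short self-test, wait, then return the newest self-test log row.

    Returns the line starting with "# 1" from `smartctl -l selftest`,
    or None if no such line is found.
    """
    # smartctl returns right away; the drive runs the test in the background.
    # Its exit status is a bit mask, so a non-zero value is not treated as fatal here.
    runner(["smartctl", "-t", "short", device],
           capture_output=True, text=True, check=False)

    sleep(wait_seconds)

    result = runner(["smartctl", "-l", "selftest", device],
                    capture_output=True, text=True, check=False)
    for line in result.stdout.splitlines():
        if line.startswith("# 1"):
            return line
    return None
```

Calling `run_short_test("/dev/sdc")` several times in a row and recording each returned row would show whether the interruptions also occur when nothing else is scheduled against the disk.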

What do you think might be the issue?

root@truenas[~]# smartctl --all sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: sdc failed: No such device
root@truenas[~]# smartctl --all /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Phison Driven SSDs
Device Model: KINGSTON SA400S37120G
Serial Number: 50026B7381A308C7
LU WWN Device Id: 5 0026b7 381a308c7
Firmware Version: S3E00100
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 12 10:13:09 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  41) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 ( 120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always   -           100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           820
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           20
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline  -           0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline  -           0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline  -           0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always   -           6
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline  -           0
170 Bad_Blk_Ct_Lat/Erl      0x0000   100   100   010    Old_age   Offline  -           0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always   -           0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline  -           0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always   -           0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline  -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always   -           9
194 Temperature_Celsius     0x0022   021   045   000    Old_age   Always   -           21 (Min/Max 16/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always   -           0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always   -           6
231 SSD_Life_Left           0x0000   095   095   000    Old_age   Offline  -           95
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always   -           2999
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always   -           1001
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always   -           44
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline  -           51
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline  -           86
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline  -           51195
SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1
ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 da 00 00 4f c2 00 08 00:00:00.000 SMART RETURN STATUS
b0 d0 01 00 4f c2 00 08 00:00:00.000 SMART READ DATA
b0 d5 01 06 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
Error -1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d4 00 01 4f c2 00 08 00:00:00.000 SMART EXECUTE OFF-LINE IMMEDIATE
b0 da 00 00 4f c2 00 08 00:00:00.000 SMART RETURN STATUS
b0 d0 01 00 4f c2 00 08 00:00:00.000 SMART READ DATA
b0 d5 01 06 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d0 01 00 4f c2 00 08 00:00:00.000 SMART READ DATA
b0 d1 01 01 4f c2 00 08 00:00:00.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
b0 da 00 00 4f c2 00 08 00:00:00.000 SMART RETURN STATUS
b0 d5 01 00 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d0 01 00 4f c2 00 08 00:00:00.000 SMART READ DATA
b0 d1 01 01 4f c2 00 08 00:00:00.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
b0 da 00 00 4f c2 00 08 00:00:00.000 SMART RETURN STATUS
b0 d5 01 00 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 40 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d0 01 00 4f c2 00 08 00:00:00.000 SMART READ DATA
b0 d1 01 01 4f c2 00 08 00:00:00.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
b0 da 00 00 4f c2 00 08 00:00:00.000 SMART RETURN STATUS
b0 d5 01 00 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 90% 812 -
# 2 Short offline Completed without error 00% 788 -
# 3 Short offline Completed without error 00% 768 -
# 4 Short offline Completed without error 00% 604 -
# 5 Short offline Interrupted (host reset) 90% 601 -
# 6 Short offline Completed without error 00% 450 -
# 7 Short offline Completed without error 00% 450 -
# 8 Short offline Interrupted (host reset) 90% 450 -
# 9 Short offline Completed without error 00% 449 -
#10 Short offline Completed without error 00% 449 -
#11 Short offline Completed without error 00% 449 -
#12 Short offline Interrupted (host reset) 90% 449 -
#13 Short offline Completed without error 00% 448 -
#14 Extended offline Completed without error 00% 369 -
#15 Short offline Completed without error 00% 281 -
#16 Short offline Completed without error 00% 225 -
#17 Short offline Completed without error 00% 225 -
#18 Short offline Interrupted (host reset) 90% 225 -
#19 Extended offline Interrupted (host reset) 20% 224 -
#20 Short offline Interrupted (host reset) 50% 223 -
#21 Extended offline Interrupted (host reset) 50% 222 -
Selective Self-tests/Logging not supported
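
The value 41 on the "Self-test execution status" line of the output above packs two fields into one byte: the high four bits hold the status code and the low four bits hold the remaining percentage in steps of ten. As a minimal Python sketch (the `decode_selftest_status` helper is hypothetical, and its table lists only a few of the status codes, not all of them):

```python
# Partial table of self-test status codes (high nibble of the status byte).
STATUS_TEXT = {
    0: "completed without error, or no self-test has been run",
    1: "aborted by the host",
    2: "interrupted by the host with a hard or soft reset",
    15: "self-test in progress",
}


def decode_selftest_status(byte_value):
    """Return (status code, description, percent remaining) for a status byte."""
    code = (byte_value >> 4) & 0x0F          # high nibble: what happened
    remaining = (byte_value & 0x0F) * 10     # low nibble: tenths of the test left
    return code, STATUS_TEXT.get(code, "not listed in this sketch"), remaining


code, text, remaining = decode_selftest_status(41)
print(f"code={code}: {text}; {remaining}% remaining")
```

For 41 (0x29) this yields code 2 with 90% remaining, which agrees with entry # 1 of the self-test log: "Interrupted (host reset) 90%".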
 