Unexpected SSD lifetime degradation

GerritTSnail · Jul 17, 2023

Five months ago I installed a mirror pool of two Crucial MX500 SSDs and a mirror pool of two old random SSDs in my system. The MX500 pool runs the applications and a HAOS VM. The other pool runs an Ubuntu VM. Now I am already getting "202 Percent_Lifetime_Remain" SMART errors on the MX500 pool. The old pool of random SSDs is still doing fine.

Is this amount of wear normal? How worried should I be about these drives failing? I have all data replicated to another system periodically.

I also read somewhere that these SSDs can sometimes have weird firmware quirks.

sretalla · Jul 18, 2023

GerritTSnail said:
202 Percent_Lifetime_Remain

That refers to SMART attribute number 202, which stores the "percentage lifetime remain(ing)".

It does not indicate that 202% of lifetime is consumed.
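
To make the distinction concrete, here is a minimal Python sketch that splits such a label into the attribute ID and the attribute name. The regular expression and the helper name `split_attribute_label` are illustrative assumptions, not part of smartmontools.

```python
import re

# Hypothetical pattern: a 1-3 digit attribute ID followed by its name.
LABEL = re.compile(r"\b(\d{1,3})\s+([A-Za-z][\w-]*)")

def split_attribute_label(text):
    """Return (attribute_id, attribute_name) from a label such as
    '202 Percent_Lifetime_Remain', or None if no label is found."""
    match = LABEL.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

attr_id, attr_name = split_attribute_label("202 Percent_Lifetime_Remain")
print(f"attribute ID = {attr_id}, name = {attr_name}")
```

The number is only an index into the attribute table; the actual reading lives in the VALUE and RAW_VALUE columns of that row.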

Can you be a bit clearer about what the SMART errors being reported are?

Can you show the output of smartctl -a /dev/daX (replacing the last part with the actual disk reference)?

GerritTSnail · Jul 18, 2023

SMART still passes, but I get this alert in the GUI and by mail:

Device: /dev/sdf [SAT], Failed SMART usage Attribute: 202 Percent_Lifetime_Remain..

This is the output for both drives:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT500MX500SSD1
Serial Number: 2247E68A4219
LU WWN Device Id: 5 00a075 1e68a4219
Firmware Version: M3CR045
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jul 18 19:23:11 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 4212
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 252 252 000 Old_age Always - 1047
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 8
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 61
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 049 000 Old_age Always - 35 (Min/Max 0/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 252 252 001 Old_age Offline - 104
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 84484933620
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1018832649
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 7619386413

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 ec 00 00 00 00 00

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
c8 00 00 00 00 00 00 00 00:00:00.000 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4170 -
# 2 Extended offline Completed without error 00% 4003 -
# 3 Extended offline Completed without error 00% 3836 -
# 4 Extended offline Completed without error 00% 3669 -
# 5 Extended offline Completed without error 00% 3502 -
# 6 Extended offline Completed without error 00% 3335 -
# 7 Extended offline Completed without error 00% 3166 -
# 8 Extended offline Completed without error 00% 2998 -
# 9 Extended offline Completed without error 00% 2830 -
#10 Extended offline Completed without error 00% 2661 -
#11 Extended offline Completed without error 00% 2493 -
#12 Extended offline Completed without error 00% 2324 -
#13 Extended offline Completed without error 00% 2156 -
#14 Extended offline Completed without error 00% 1987 -
#15 Extended offline Completed without error 00% 1818 -
#16 Extended offline Completed without error 00% 1652 -
#17 Extended offline Completed without error 00% 1483 -
#18 Extended offline Completed without error 00% 1315 -
#19 Extended offline Completed without error 00% 1148 -
#20 Extended offline Completed without error 00% 981 -
#21 Extended offline Completed without error 00% 815 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT500MX500SSD1
Serial Number: 2247E68A4385
LU WWN Device Id: 5 00a075 1e68a4385
Firmware Version: M3CR045
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jul 18 19:23:51 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 4237
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 253 253 000 Old_age Always - 1039
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 8
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 65
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 067 049 000 Old_age Always - 33 (Min/Max 0/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 253 253 001 Old_age Offline - 103
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 84423062379
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1018134464
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 7624467752

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 ec 00 00 00 00 00

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:00.000 IDENTIFY DEVICE
c8 00 00 00 00 00 00 00 00:00:00.000 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4194 -
# 2 Extended offline Completed without error 00% 4026 -
# 3 Extended offline Completed without error 00% 3858 -
# 4 Extended offline Completed without error 00% 3689 -
# 5 Extended offline Completed without error 00% 3520 -
# 6 Extended offline Completed without error 00% 3351 -
# 7 Extended offline Completed without error 00% 3182 -
# 8 Extended offline Completed without error 00% 3012 -
# 9 Extended offline Completed without error 00% 2843 -
#10 Extended offline Completed without error 00% 2673 -
#11 Extended offline Completed without error 00% 2504 -
#12 Extended offline Completed without error 00% 2334 -
#13 Extended offline Completed without error 00% 2164 -
#14 Extended offline Completed without error 00% 1994 -
#15 Extended offline Completed without error 00% 1825 -
#16 Extended offline Completed without error 00% 1658 -
#17 Extended offline Completed without error 00% 1488 -
#18 Extended offline Completed without error 00% 1319 -
#19 Extended offline Completed without error 00% 1152 -
#20 Extended offline Completed without error 00% 985 -
#21 Extended offline Completed without error 00% 818 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

samarium · Jul 18, 2023

https://forums.tomshardware.com/thr...-smart-but-only-in-wd-data-lifeguard.3672638/ references a micron pdf about the smart attributes and their meaning. Interesting reading. I've saved a copy in my tech docs stash.
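
For a rough sense of how much has actually been written, the raw counters in the two outputs above can be turned into host writes and a write amplification estimate. This is a sketch under two assumptions that the thread does not confirm: that attribute 246 counts 512-byte logical sectors, and that write amplification can be estimated as (Host_Program_Page_Count + FTL_Program_Page_Count) / Host_Program_Page_Count. The function names are made up for illustration.

```python
# Raw values copied from the two smartctl outputs posted in this thread.
drives = {
    "2247E68A4219": {"hours": 4212, "lbas": 84484933620,
                     "host_pages": 1018832649, "ftl_pages": 7619386413},
    "2247E68A4385": {"hours": 4237, "lbas": 84423062379,
                     "host_pages": 1018134464, "ftl_pages": 7624467752},
}

SECTOR_BYTES = 512  # assumption: attribute 246 counts 512-byte logical sectors

def host_terabytes_written(lbas):
    """Host writes in decimal terabytes, under the sector-size assumption."""
    return lbas * SECTOR_BYTES / 1e12

def write_amplification(host_pages, ftl_pages):
    """Assumed estimate: all NAND page programs divided by host page programs."""
    return (host_pages + ftl_pages) / host_pages

for serial, d in drives.items():
    tb = host_terabytes_written(d["lbas"])
    gb_per_day = tb * 1000 / (d["hours"] / 24)
    waf = write_amplification(d["host_pages"], d["ftl_pages"])
    print(f"{serial}: {tb:.1f} TB host writes, "
          f"{gb_per_day:.0f} GB/day, WAF ~{waf:.1f}")
```

If those assumptions hold, both drives come out at roughly 43 TB of host writes in about 4,200 power-on hours, with the flash layer programming roughly eight and a half pages for every page the host asked for.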

sretalla · Jul 18, 2023

Interestingly, the value in question for "202 Percent_Lifetime_Remain" in both cases is just over 100% (103 and 104).

I wonder if this is triggering the warning somehow due to being outside the "expected" range of 0-100.

Perhaps it indicates additional blocks reserved for bad sector replacement are still available.
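
The "outside the expected range" idea can be checked with simple arithmetic. The sketch below assumes (this is a guess, not something confirmed anywhere in the thread) that the firmware computes the normalized VALUE as 100 minus the raw value and stores it in an unsigned 8-bit field, so a negative result wraps around. The function name is hypothetical.

```python
def wrapped_normalized(raw_value):
    """Hypothetical model: VALUE = (100 - raw) stored as an unsigned byte."""
    return (100 - raw_value) % 256

# (raw, reported VALUE) pairs for attribute 202 from the two outputs above
observed = [(104, 252), (103, 253)]

for raw, reported in observed:
    predicted = wrapped_normalized(raw)
    print(f"raw={raw}: predicted VALUE={predicted}, reported VALUE={reported}")
```

Under that assumption the model reproduces both reported values (252 and 253), which would be consistent with the raw number being a percent-used counter that has passed 100. It does not by itself explain why the alert fires.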

WI_Hedgehog · Jul 19, 2023

Consumer SSDs are known for saving cost by having fewer reserve blocks. I would guess (without digging into the gritty details of drives I don't own) that this reports your drives have (had) about 60 blocks left before failure:

Drive 1:

Drive #	ID#	ATTRIBUTE_NAME	VALUE	WORST	THRESH	TYPE	WHEN_FAILED	RAW_VALUE
1	180	Unused_Reserve_NAND_Blk	000	000	000	Pre-fail	-	61
2	180	Unused_Reserve_NAND_Blk	000	000	000	Pre-fail	-	65
-
1	194	Temperature_Celsius	065	049	000	Old_age	-	35
2	194	Temperature_Celsius	067	049	000	Old_age	-	33

(Note that in this format, which is more common in SSD S.M.A.R.T. reporting than in HDD reporting, a "relative VALUE" of 100 is "best" [perhaps think of it as 100%] and 0 is the failure threshold [0% or "none left"]. The Raw Value is the actual stored number, and it does not always represent what one might think it should.)
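
To illustrate the difference between the normalized VALUE, the THRESH column and the RAW_VALUE, here is a minimal Python sketch that parses rows in the smartctl attribute format posted earlier in the thread. The helper names are hypothetical, and the rule used (normalized VALUE at or below a non-zero THRESH counts as below threshold) is the conventional reading, stated here as an assumption rather than taken from the smartmontools source.

```python
from typing import NamedTuple

class Attribute(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized value
    worst: int    # worst normalized value seen so far
    thresh: int   # vendor threshold for the normalized value
    raw: str      # raw value, kept as text because some rows carry extras

def parse_row(line):
    """Parse one row of the smartctl attribute table (hypothetical helper)."""
    f = line.split()
    return Attribute(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     " ".join(f[9:]))

def below_threshold(attr):
    # Assumed rule: a zero threshold means no threshold is defined.
    return attr.thresh > 0 and attr.value <= attr.thresh

rows = [
    "180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 61",
    "194 Temperature_Celsius 0x0022 065 049 000 Old_age Always - 35 (Min/Max 0/51)",
    "202 Percent_Lifetime_Remain 0x0030 252 252 001 Old_age Offline - 104",
]

for row in rows:
    a = parse_row(row)
    print(f"{a.attr_id:>3} {a.name:<24} VALUE={a.value:>3} THRESH={a.thresh:>3} "
          f"RAW={a.raw!r} below_threshold={below_threshold(a)}")
```

Keeping the raw field as text avoids guessing at vendor-specific encodings such as the temperature row's "(Min/Max 0/51)" suffix.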

Also, it seems your drives may have been running hot (which in my experience can cause the issues you're apparently seeing). I realize the maximum recorded temperature is 51°C (about 124°F), though in my estimation anything over 32/33°C on a writable drive contributes to shortened drive life.*

This is just a starting point; sometimes drives report things in ways other than expected.

---
* Over 32/33°C seems to contribute to SSD memory write issues (possibly depending on the speed/amount of data written) and HDD bearing issues. This is from my own data logging which is a small sample set and not conclusive. I consider it an "observable trend." With that said, manufacturer rated drive maximum operating temperatures have been increasing over the history of drive manufacturing, and with that drive service life also seems to be increasing, however my observations show that a Maximum Operating Temperature (M.O.T.) increase of 10° means a drive is better able to sustain a continual temperature increase, but still has maximum lifespan with the same operating temperature as previous. I would presume this is due to advances in bearing technology and manufacturing, with the laws of thermodynamics being constant.

As 20°C=68°F (standard room temperature) is an unrealistic goal, we should ask what a realistic goal is. I think +10°C Standard Operating Temperature (S.O.T.) and +13°C M.O.T. is a bit optimistic though achievable in ideal situations. While I try for +10/13 in system builds, I often find that impractical and settle for +12/14. That seems to "greatly extend drive life," meaning similar OEM systems experience drive life expectancy of 5 years and I'm running 10, at which point the power/space/speed vs. storage capacity/cost leans toward drive replacement.

In the bigger picture I'm not sure anyone tracks as much data on drive life, so it's hard to say if I'm on to something or it's not really a contributing factor (Backblaze 2022 drive stats). I present it as something to consider during system design and testing and drive life tracking.
