SSD Wear Monitoring?

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
SMART is the basic tool... it includes wear monitoring.
We'd like to go further and provide alerts/predictions... it's easier for the drives we know well, but not for generic drives.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The forum might be able to help crowdsource this by providing smartctl output with model/family/firmware and the attribute list.

Samsung uses SMART 177 "Wear_Leveling_Count" as a percentage of lifespan, decreasing from 100 to 0.
Intel DC S3500 uses 233 "Media_Wearout_Indicator" in a similar manner.

Predictions would have to be based on sampling the wearout indicator at a regular interval (daily? hourly?) and projecting a trendline, maybe using the "total LBAs written" value if that's exposed as well.
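A rough sketch of that kind of projection in Python, assuming the normalized wearout value is already being sampled and stored somewhere. The function name and the sample data are made up for illustration, and a plain least-squares line is only one of several ways to draw the trend:

```python
from datetime import date, timedelta


def project_wearout(samples, floor=0.0):
    """Fit a least-squares line through (date, wear_value) samples and
    return the date on which that line reaches `floor`.

    Returns None when there are too few samples or the value is not falling.
    """
    if len(samples) < 2:
        return None
    start = samples[0][0]
    xs = [(d - start).days for d, _ in samples]   # days since first sample
    ys = [float(v) for _, v in samples]           # normalized wear value
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None                               # all samples on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope >= 0:
        return None                               # no measurable wear yet
    intercept = mean_y - slope * mean_x
    days_to_floor = (floor - intercept) / slope
    return start + timedelta(days=round(days_to_floor))


# Hypothetical history: the indicator drops one point every 30 days.
samples = [(date(2021, 1, 1) + timedelta(days=30 * i), 100 - i) for i in range(5)]
print(project_wearout(samples))
```

With only a handful of points the projection will swing a lot, so a real implementation would probably want a minimum history length before showing anything.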

But because every manufacturer uses different standards, we'd need a "translation table" to get TrueNAS to see the value(s) that it needs.
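As a sketch of what such a translation table might look like, here is a minimal Python version. Everything in it is hypothetical apart from the two attribute IDs mentioned above; the substring matching on the model family is just an assumption to keep the example short:

```python
# Hypothetical translation table: model-family substring -> wear attribute.
# Only the two drives mentioned in this thread are listed; both report a
# normalized VALUE that counts down from 100.
WEAR_ATTRIBUTES = {
    "Samsung": (177, "Wear_Leveling_Count"),
    "Intel 730 and DC S35x0": (233, "Media_Wearout_Indicator"),
}


def wear_remaining(model_family, attributes):
    """Return the remaining-life value for a drive, or None if unknown.

    `attributes` maps SMART attribute ID -> normalized VALUE.
    """
    family = model_family.lower()
    for needle, (attr_id, _name) in WEAR_ATTRIBUTES.items():
        if needle.lower() in family:
            return attributes.get(attr_id)
    return None  # drive family not in the table yet


# Values taken from an S3500 with attribute 233 at 001.
print(wear_remaining("Intel 730 and DC S35x0/3610/3700 Series SSDs", {5: 98, 233: 1}))
print(wear_remaining("Some Generic SSD", {5: 100}))
```

Crowdsourced smartctl output would then just mean new rows in the table, not new code.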
 

morganL

There are several variables...

1) How do TB written translate to erase cycles? (This changes with ZFS version and workload.)
2) How do erase cycles translate to bad blocks?
3) How do bad blocks turn into failed drives or errored data?

For example, just because an SSD reports its lifetime at zero doesn't mean it has actually failed or will fail soon.
Formatting the drives to smaller sizes increases lifetime considerably.

This is coupled with the fact that SSD models (and flash technology) change every year...

SMART is a pretty useful indicator... even if not perfect.
 

HoneyBadger

Pardon the rambling nature of this reply; it was written in an "on and off" fashion across no fewer than three different devices.

morganL said:
There are several variables...

1) How do TB written translate to erase cycles? (This changes with ZFS version and workload.)
2) How do erase cycles translate to bad blocks?
3) How do bad blocks turn into failed drives or errored data?

The workload variable does make it very difficult to pin down the relationship between "logical TB written to TrueNAS" and "physical TB written to pool devices", especially if there are multiple workloads with varying recordsizes per pool.

If I can hazard a guess, the kind of user who's looking for an SSD wearout value is planning for proactive replacement, such as evergreening pool devices or cycling out SLOG/metadata vdevs (especially if hotswap isn't available). I'm not going to speak for @TECK, but personally my first request would be for something more generalized, along the lines of "how many days did it take to decrease the wearout indicator by 1%?"

The next step is to take the ratio of "total logical GB written to the pool" to "GB written to a given device" and make a very general statement of "it took you X days and Y GB of logical writes to consume 1% of this SSD", with the final widget showing "Estimated life remaining at a rate of (Y/X) GB/day = Z days" and refreshing every time the wearout indicator drops a point. Perhaps I can try to code a plugin/widget myself if there's demand.
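To make the X/Y/Z arithmetic concrete, here is a minimal Python sketch. The function name and the example numbers are invented, and it assumes the write rate stays roughly constant, which is a big assumption for a mixed workload:

```python
def estimate_life(days_per_point, logical_gb_per_point, points_left):
    """Turn "X days and Y GB consumed 1% of this SSD" into a forecast.

    days_per_point       -- X: days it took the indicator to drop one point
    logical_gb_per_point -- Y: logical GB written to the pool in that time
    points_left          -- current value of the wearout indicator
    Returns (GB/day write rate, estimated days remaining, logical GB remaining).
    """
    rate_gb_per_day = logical_gb_per_point / days_per_point   # Y / X
    days_left = points_left * days_per_point                  # Z
    gb_left = points_left * logical_gb_per_point
    return rate_gb_per_day, days_left, gb_left


# Hypothetical drive: 12 days and 3000 GB per point, indicator currently at 80.
rate, days_left, gb_left = estimate_life(12, 3000, 80)
print(f"Estimated life remaining at a rate of {rate:.0f} GB/day = {days_left} days")
```

Recomputing X and Y each time the indicator drops a point would let the estimate follow changes in workload instead of averaging over the drive's whole life.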

Failed P/E cycles are usually tracked separately and unfortunately can happen at any point in a device's lifespan ...

morganL said:
For example, just because an SSD reports its lifetime at zero doesn't mean it has actually failed or will fail soon.
Formatting the drives to smaller sizes increases lifetime considerably.

I've got an S3500 that would agree with you there.

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BB480G4
---trimmed---
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   098   098   000    -    0
170 Available_Reservd_Space PO--CK   100   100   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    35789268
232 Available_Reservd_Space PO--CK   100   100   010    -    0
233 Media_Wearout_Indicator -O--CK   001   001   000    -    0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    35789268
242 Host_Reads_32MiB        -O--CK   100   100   000    -    3386178


For those playing along at home, that's a drive claiming to be fully worn out (233 value 001) with just over 1 PB of writes and zero program/erase failures. It's been relegated to external USB usage for a non-ZFS workload now.
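For anyone who wants to check the arithmetic, a small Python sketch that pulls the two relevant rows out of that output and converts the raw Host_Writes_32MiB count to bytes. The parser is a simplification that assumes whitespace-separated columns in the layout shown above and a plain integer raw value:

```python
# Two rows copied from the smartctl output above.
SMART_TEXT = """\
225 Host_Writes_32MiB       -O--CK   100   100   000    -    35789268
233 Media_Wearout_Indicator -O--CK   001   001   000    -    0
"""

UNIT_BYTES = 32 * 1024 ** 2  # the attribute counts units of 32 MiB


def parse_attributes(text):
    """Parse attribute rows into {id: (name, normalized_value, raw_string)}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) >= 8 and fields[0].isdigit():
            rows[int(fields[0])] = (fields[1], int(fields[3]), fields[7])
    return rows


attrs = parse_attributes(SMART_TEXT)
written = int(attrs[225][2]) * UNIT_BYTES
print(f"{written / 1000 ** 5:.2f} PB, {written / 1024 ** 5:.2f} PiB")
print("wearout value:", attrs[233][1])
```

That works out to roughly 1.2 PB in decimal units, or about 1.07 PiB.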

morganL said:
This is coupled with the fact that SSD models (and flash technology) change every year...

SMART is a pretty useful indicator... even if not perfect.

It'd likely be a fool's errand to try to claim full support for all SSDs, but certainly the ones that are used in commercial TrueNAS hardware would be a good place to start. (As well as providing an unofficial "if you're looking for suggestions on good SSDs to use in your DIY unit" list.)
 