SSD Wear

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Just for amusement / interest.

On Saturday 15/07/2023 I replaced a worn-out (very old) SanDisk SD8SBAT128G1122, down to 5% wear left, with a brand-new Patriot P220 (128GB). This is a boot drive for my primary TrueNAS (part of a mirror).

Note that the system dataset is on these boot drives.

One month later, it's 15/08/2023 and the wear indicator is down to 85% - at 15% per month, the remaining 85% works out to just under six more months.

Original SMART attributes when new
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x1300   100   100   050    Old_age   Offline      -       0
  9 Power_On_Hours          0x1200   100   100   000    Old_age   Offline      -       2
 12 Power_Cycle_Count       0x1200   100   100   000    Old_age   Offline      -       2
167 Unknown_Attribute       0x2200   100   100   000    Old_age   Offline      -       0
168 Unknown_Attribute       0x1200   100   100   000    Old_age   Offline      -       0
169 Unknown_Attribute       0x1300   100   100   010    Old_age   Offline      -       131076
173 Unknown_Attribute       0x1200   200   200   000    Old_age   Offline      -       4295098369
175 Program_Fail_Count_Chip 0x2200   100   100   010    Old_age   Offline      -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x3300   100   100   000    Old_age   Offline      -       60
192 Power-Off_Retract_Count 0x1200   100   100   000    Old_age   Offline      -       2
194 Temperature_Celsius     0x2200   032   032   000    Old_age   Offline      -       32 (Min/Max 25/36)
231 Unknown_SSD_Attribute   0x2300   100   100   005    Old_age   Offline      -       0
233 Media_Wearout_Indicator 0x2300   100   100   000    Old_age   Offline      -       35
234 Unknown_Attribute       0x3200   100   100   005    Old_age   Offline      -       3530
241 Total_LBAs_Written      0x3200   100   100   000    Old_age   Offline      -       25
242 Total_LBAs_Read         0x3200   100   100   000    Old_age   Offline      -       0


Current SMART Attributes
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x1300   100   100   050    Old_age   Offline      -       0
  9 Power_On_Hours          0x1200   100   100   000    Old_age   Offline      -       677
 12 Power_Cycle_Count       0x1200   100   100   000    Old_age   Offline      -       6
167 Unknown_Attribute       0x2200   100   100   000    Old_age   Offline      -       0
168 Unknown_Attribute       0x1200   100   100   000    Old_age   Offline      -       0
169 Unknown_Attribute       0x1300   100   100   010    Old_age   Offline      -       131076
173 Unknown_Attribute       0x1200   200   200   000    Old_age   Offline      -       8605073517
175 Program_Fail_Count_Chip 0x2200   100   100   010    Old_age   Offline      -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x3300   100   100   000    Old_age   Offline      -       60
192 Power-Off_Retract_Count 0x1200   100   100   000    Old_age   Offline      -       4
194 Temperature_Celsius     0x2200   033   033   000    Old_age   Offline      -       33 (Min/Max 15/44)
231 Unknown_SSD_Attribute   0x2300   097   097   005    Old_age   Offline      -       3
233 Media_Wearout_Indicator 0x2300   100   100   000    Old_age   Offline      -       6490
234 Unknown_Attribute       0x3200   100   100   005    Old_age   Offline      -       1186135
241 Total_LBAs_Written      0x3200   100   100   000    Old_age   Offline      -       3355
242 Total_LBAs_Read         0x3200   100   100   000    Old_age   Offline      -       123
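
For anyone wanting to reproduce the arithmetic, here is a rough sketch - assuming (and this is an assumption; many SanDisk/Phison-based consumer drives do this) that the raw value of attribute 241 is GiB written rather than actual LBAs, with /dev/sda as a stand-in for the boot device:
Code:
DEV=/dev/sda   # placeholder for the boot device
HOURS=$(smartctl -j -A "$DEV" | jq '.ata_smart_attributes.table[] | select(.id == 9) | .raw.value')
WRITES=$(smartctl -j -A "$DEV" | jq '.ata_smart_attributes.table[] | select(.id == 241) | .raw.value')
# With the table above: 3355 GiB over 677 hours comes out around 1.4 MiB/s
echo "$WRITES $HOURS" | awk '{printf "%.2f MiB/s average host writes\n", $1 * 1024 / ($2 * 3600)}'

If attribute 233 is NAND writes in the same units (another assumption), then 6490 against 3355 would also put write amplification at roughly 2.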


In the same timescale the second SanDisk (same size and age as the original, i.e. ancient) went from 32% to 29% - so the new disk appears to be wearing at five times the rate of the old one. Given that changing the disk requires surgery (it's not in a hot-swap bay, and there might be velcro involved) plus a power-down because there are too many cables around, I think I am going to have to do something a bit better. I am less than impressed with the potential lifetime of the Patriot drive (OK, it was cheap, so I am not complaining as such).

I have picked up a pair of Intel DC S3710 100GB drives, which are high-endurance drives that I think I will use (post testing). That will require a re-install and an upload of the config file as they are smaller - but that's easy.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
My suggestion is to move the system dataset to the HDD pool.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I agree - but I deliberately haven't, as I was interested in the figures above.
It does show that the system dataset generates a lot of writes over time - which is why I posted the data.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
The Patriot is a cheap QLC drive - so I shouldn't be (and am not) surprised at its shitty endurance. I am surprised by how much the system dataset is affecting the drive (but not concerned). Before I use the Intel DC drives I may try out the Silicon Power A55 first (also very cheap) - a TLC drive that I bought to swap out the second SanDisk when it runs out. However, at this rate the new Patriot will run out before the old, very used second SanDisk - which does amuse me immensely.
 
Joined
Jan 18, 2017
Messages
525
Okay, this is interesting. Does that mean Scale writes something like 3 MB/s to your system dataset continuously?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
[screenshot: disk I/O graph from the TrueNAS reporting page]


No - it averages 1.18 MiB/s according to iX's own reporting, so the figure you gave is triple (ish) the actual rate.
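
As a back-of-the-envelope check on those two numbers (my arithmetic, with the drive's rated endurance as the unknown): 1.18 MiB/s sustained is about 3 TiB of writes per month, so if that genuinely costs 15% of the wear indicator per month, the implied total endurance is only around 20 TiB - not implausible for a cheap 128GB QLC drive once write amplification is counted.
Code:
# Rough numbers only: sustained MiB/s -> TiB per month -> implied endurance
awk 'BEGIN { tib = 1.18 * 86400 * 30 / 1048576; printf "%.2f TiB/month -> ~%.0f TiB implied endurance at 15%%/month\n", tib, tib / 0.15 }'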
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Are you doing something particularly intensive on your system? I'm writing at less than half that with both the System and Apps datasets residing on the same SSD pool (mean I/O rate over a 24-hour period is ~130KB/s per disk, 4-wide RAIDZ1, ~390KB/s total data writes).
 
Joined
Jan 18, 2017
Messages
525
Well, I was ballparking, so I did not expect the numbers to be right on, lol.
This is considerably more than my Core install is doing at 86.98 KiB/s - very interesting indeed. Thank you for posting.
 
Joined
Jan 18, 2017
Messages
525
Are you doing something particularly intensive on your system? I'm writing at less than half that with both the System and Apps datasets residing on the same SSD pool (mean I/O rate over a 24-hour period is ~130KB/s per disk, 4-wide RAIDZ1, ~390KB/s total data writes).
On a Scale install?
 
Joined
Jan 18, 2017
Messages
525
Interesting..... I wonder what the difference is between your system and NugentS's that causes the extra writes.
 
Joined
Jun 15, 2022
Messages
674
In the same timescale the second SanDisk (same size and age as the original, i.e. ancient) went from 32% to 29% - so the new disk appears to be wearing at five times the rate of the old one. Given that changing the disk requires surgery (it's not in a hot-swap bay, and there might be velcro involved) plus a power-down because there are too many cables around, I think I am going to have to do something a bit better. I am less than impressed with the potential lifetime of the Patriot drive (OK, it was cheap, so I am not complaining as such).
Would you mind explaining how this is represented in the S.M.A.R.T. attributes and how that led to your conclusions? I'm a bit fuzzy on this one.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Are you doing something particularly intensive on your system? I'm writing at less than half that with both the System and Apps datasets residing on the same SSD pool (mean I/O rate over a 24-hour period is ~130KB/s per disk, 4-wide RAIDZ1, ~390KB/s total data writes).
Yes

I do have a continuous process running that uses 50% of my CPU and writes continuously to one disk (a single-disk pool), with batch writes to another pool - so the system is constantly busy.

Does that count?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Would you mind explaining how this is represented in the S.M.A.R.T. attributes and how that led to your conclusions? I'm a bit fuzzy on this one.
I am using @joeschmuck's script (multi_report) to decode the values. I am assuming that it is correct.
 
Joined
Jun 15, 2022
Messages
674
I am using @joeschmuck's script (multi_report) to decode the values. I am assuming that it is correct.
With my limited knowledge (much of it obtained while hungover):

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x1300 100 100 050 Old_age Offline - 0 :smile:
175 Program_Fail_Count_Chip 0x2200 100 100 010 Old_age Offline - 0 :smile:

180 Unused_Rsvd_Blk_Cnt_Tot 0x3300 100 100 000 Old_age Offline - 60 :rolleyes:
That looks like a reasonably decent consumer SSD; from an "industrial standpoint" it doesn't seem like a lot of extra blocks held in reserve, but none have been used yet so all good there.

233 Media_Wearout_Indicator 0x2300 100 100 000 Old_age Offline - 6490 :smile:
This is what I (perhaps incorrectly) focus on. To me the raw value is "something," which is interpreted as 100% life left under the VALUE column.

If the following are accurate, the drive is like new, which seems probable given the system has only rebooted 6 times (again, that's only my interpretation - I could be wrong):
241 Total_LBAs_Written 0x3200 100 100 000 Old_age Offline - 3355
242 Total_LBAs_Read 0x3200 100 100 000 Old_age Offline - 123
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
@joeschmuck
Can you help out here?
How does the wear indicator = 85%, having gone from 100%?

@WI_Hedgehog and 2 of those reboots were by mistake - an issue with the UPS support
:-(
 
Joined
Jun 15, 2022
Messages
674
@WI_Hedgehog and 2 of those reboots were by mistake - an issue with the UPS support
:-(
Well, like other things in life, how about we just sweep that under the rug and act as if it never happened? :confused:
12 Power_Cycle_Count 0x1200 100 100 000 Old_age Offline - 6
192 Power-Off_Retract_Count 0x1200 100 100 000 Old_age Offline - 4

By the way, you might want to run a ZFS integrity check (scrub); powering off an SSD before it completes its flush/cleanup can create random issues depending on the wear leveling.
(You may already have scrubs scheduled on the pool; if not, you can kick off a manual one - see the example below.)
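For reference, kicking one off by hand is just the following (assuming boot-pool, the default boot pool name on SCALE - substitute your own pool name):
Code:
zpool scrub boot-pool     # start a scrub of the boot pool
zpool status boot-pool    # watch progress and check for any errors found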
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
All pools run scrubs on a regular, even frequent, basis
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
233 Media_Wearout_Indicator 0x2300 100 100 000 Old_age Offline - 6490 :smile:
This is what I (perhaps incorrectly) focus on. To me the raw value is "something," which is interpreted as 100% life left under the VALUE column.
Nope - you need to be looking at the VALUE column, which is 100. It can be tricky to know when to use the RAW value and when not to. I've started using the JSON-formatted output; it's more consistent and easier to grab the data from.
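
For example, pulling attribute 233 out of the JSON output with jq - a sketch only, not the script's actual code, and /dev/sda is a placeholder:
Code:
# -j makes smartctl emit JSON; jq then picks out the normalized and raw values
smartctl -j -A /dev/sda | jq '.ata_smart_attributes.table[] | select(.id == 233) | {name, value, raw: .raw.value}'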

Let me tell you, there are a lot of different ways drives report wear level, and it drives me nuts. Which one takes precedence? How is the data to be interpreted? Not all of them are percentage values; some have to be calculated. Nuts, I say!

Here are some of the values I have to examine (and pray I selected the correct one), listed in the order in which I look for them:

SSDs
Percentage Used Endurance Indicator --- I prefer this one if it exists; it tells you how much has been used. The math is 100 - value = % left.
ID 231
Percent_Lifetime_Remain
Media_Wearout_Indicator
Wear_Leveling_Count
SSD_Life_Left

In addition to the above values:

NVMe
.nvme_smart_health_information_log.available_spare

And if it's SCSI
.scsi_percentage_used_endurance_indicator

There is a rhyme and reason to all this. As far as I'm aware, the multi_report script reports every drive sample I've tested, but I'm not infallible. If you open up the script and search for "# Get Wear Level" you will find the start of that section, so you can see what is going on.
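
As a rough illustration of the precedence idea - a sketch only, not the script's real logic (the SCSI and NVMe branches and the ID 231 check are left out, and /dev/sda is a placeholder):
Code:
#!/bin/sh
# Report the first wear-related SMART attribute found, in rough precedence order.
# Requires smartctl and jq.
DEV="${1:-/dev/sda}"
JSON=$(smartctl -j -A "$DEV")
for NAME in Percent_Lifetime_Remain Media_Wearout_Indicator Wear_Leveling_Count SSD_Life_Left; do
    VAL=$(echo "$JSON" | jq -r --arg n "$NAME" '.ata_smart_attributes.table[]? | select(.name == $n) | .value' | head -n 1)
    [ -n "$VAL" ] && { echo "$NAME: normalized value $VAL (~${VAL}% life left)"; exit 0; }
done
echo "no recognized wear attribute on $DEV"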

Can you help out here?
How does the wear indicator = 85%, having gone from 100%?
What data are you looking at? Nothing should suddenly jump from 100% to 85%. 100% = full life, 0% = dead.

I hope all the crap I typed helps some.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I'll send you a datadump - and sent.
If the value is wrong then my amusement at the 15% loss will vanish very quickly. I'll almost be disappointed.

:cool:

@joeschmuck has spent so much time and effort on this report - I never considered that the value MIGHT be wrong.
Without looking at all the historical reports - I have looked at a few, and the value does go down a bit every time.
 