SSDs are dying one at a time - trying to find the culprit

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
I have a DIY TrueNAS SCALE system that I set up 5-6 months ago. About once a month one of my SSDs starts throwing read/write errors and faults. For a while I can just reboot to clear it, but after a few weeks the drive eventually faults for good and never comes back. If I remove it and connect it to another system, it doesn't show up at all.

I've replaced 4 drives so far, about one a month, and I started to think I had one bad power/HBA cable until I started seeing faults on 2 separate drives. Due to life events I just RMA'd a couple of drives, thinking it was a bad batch, then started buying new ones to keep it chugging along. I couldn't get any more of the SSDs I started with, so I now have a mix of vendors, and those are starting to fault too. I'd like to figure out what is going on, as I have 2 faulted drives and only 1 spare to extend its life a bit. Thankfully I have a remote spinning-rust server to back up my mission-critical data [read: family photos], but I'd rather not lose all my other data.

I'm not looking for a miracle, but I'd love to figure out what is causing this so I can replace just that (hopefully) one part, since I don't have many spare parts, just a couple of 2TB SSDs that I planned to use to keep everything running for a bit. Please let me know what info I can provide to help narrow down the issue.

System specs:
  • TrueNAS-SCALE-22.12.3.2
  • LSI SAS 9300-16I
  • 16 x 2TB SSDs for storage connected to the 9300-16I:
  • 10 x PNY CS900 2TB
  • 6 x TEAMGROUP T-Force Vulcan Z 2TB
  • 2 x 240GB HPE SSDs in the boot pool, connected to motherboard SATA
  • ASRock Rack E3C246D4U Motherboard
  • Intel core i3-9100T
  • 16GB DDR4-2400MHz ECC NEMIX memory
  • Not sure on the PSU, I think it was a plain Corsair 600 or 700W ATX PSU
Attached are some pictures of the faults and some additional info:
[screenshot]


The latest SSD to fault:
[screenshot]


SSD that has been faulted for a bit while I waited for a new SSD to arrive:
[screenshot]
 


Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Have you flashed your HBAs in IT mode?
Have you tried replacing the PSU?
What is your usage? Do you have auto-trim enabled?
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Have you flashed your HBAs in IT mode?
Have you tried replacing the PSU?
What is your usage? Do you have auto-trim enabled?
The HBA was supposed to be flashed to IT mode before I purchased it. It's hard to find, from a Google search, a command I can run to verify this, as most of the results don't seem to work for me, i.e. sas3flash and sas2flash both state command not found.

I haven't replaced the PSU. I am certainly open to replacing it if I can narrow it down to that with reasonable suspicion, but budget is tight this time of year :/ I am using a few SATA power splitters, which I know isn't ideal, but I needed some way to power 18 drives ha..

Apologies if this isn't what you meant, but my usage is:
Used: 5.23 TiB
Available: 18.5 TiB

Auto-trim is off
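For reference, the pool property can be double-checked with something like this (a sketch; your-pool-name is a placeholder for the actual pool):
Code:
zpool get autotrim your-pool-name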
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Which user are you using to run those commands? root or admin? If it's the admin user, try sudo sas3flash or sudo sas2flash.
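Something along these lines should confirm it (a sketch; sas3flash is the right utility for a 9300-series card, sas2flash is for the older 6 Gb/s cards):
Code:
sudo sas3flash -list

In the output, the IT firmware usually identifies itself in the Firmware Product ID line, and the firmware version is listed there too.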
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Which user are you using to run those commands? root or admin? If it's the admin user, try sudo sas3flash or sudo sas2flash.
Wow, I feel stupid. I'm not new to Linux; I don't know why it didn't occur to me to do that... but if I'm reading it right, I should be in IT mode.

[screenshot: sas3flash output]
 


drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
I went ahead and upgraded the firmware to 16.00.12.00 as suggested in the TrueNAS post about 9300 cards, but everything is as it was; I still have 2 faulted SSDs. At this point, after a reboot the first one will show fine and increment errors slowly, but then after about 15-20 minutes it errors enough to fault.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Why are you rebooting? In order to clear errors you can use zpool clear poolname. Actually, before doing so please post the output of zpool status.

The HBA looks like it's in IT mode.
About the power splitters... it could be the issue, but you have to actually tell us what is connected to what and how.

Suggested reading:

Auto-trim is off
Does this mean you have a cron job that periodically trims your SSDs?

EDIT: regarding usage, I mean what are you using your pools for (i.e. block storage, video editing, etc.).
 
Last edited:

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
The latest reboot was just because of the firmware upgrade; before that, I guess it's more accurate to say I was shutting down and powering back on after trying to reseat cables / replace drives.

Here's the requested output:

[screenshots: zpool status output]


I'll have to open up the box later tonight, but the more I think about it, the more I'm fairly certain almost all of the drives are on splitters. I'll read through the post you linked as well, thank you.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The thing that perplexes me is that usually power-related issues cause cksum/CRC errors, at least on HDDs.

Running a zpool clear storage-pool and a zpool scrub storage-pool might help troubleshoot the issue, but make sure you have a backup of the pool first.
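Roughly this sequence (a sketch, assuming storage-pool is indeed the pool name from your screenshots):
Code:
zpool status -v storage-pool   # note which devices are accumulating read/write/cksum errors
zpool clear storage-pool       # reset the error counters
zpool scrub storage-pool       # re-read and verify every block in the pool
zpool status -v storage-pool   # check again once the scrub finishes; errors will return if the fault persists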
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
The thing that perplexes me is that usually power-related issues cause cksum/CRC errors, at least on HDDs.

Running a zpool clear storage-pool and a zpool scrub storage-pool might help troubleshooting the issue, but make sure you have a backup of the pool first.
Thank you, I'll crack open my NAS tonight and figure out everything I can about the PSU. The math suggests I should be good, aside from the splitters.. but I do have my old Synology that I can spin up again to back everything up to. Once that's done I can run the scrub; it'll be at least a day before I can, of course, pending everything being backed up.
If anyone can think of anything I can check to narrow it down further in the meantime, I'm all ears.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The math suggests I should be good, aside from the splitters..
Double-check the amps on each rail, as I suspect that to be the issue.
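As a rough, back-of-the-envelope example (assumed figures only, so check your drives' data sheets and the PSU label): 2.5" SATA SSDs draw from the 5 V rail, and a consumer drive can pull somewhere around 0.5-1.5 A during sustained writes, so 18 drives could ask for anywhere from roughly 9 A to 27 A at peak. Budget 600-700 W ATX units often rate the 5 V rail at only 20-25 A, and each SATA connector and splitter run has its own current limit on top of that, so a nominally big PSU can still sag under an all-SSD load.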
 
Joined
Jan 18, 2017
Messages
525
Honestly, with PNY and TEAMGROUP branded SSDs I would suspect them... if these were WD Blues or another well-known brand I would be less likely to suspect them.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Honestly with PNY and TEAMGROUP branded SSD's I would suspect them... if these were WD Blues or another well known brand I would be less likely to suspect them.
Actually, I vaguely remember a thread about issues with PNY SSDs, but it has been a while.
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Honestly with PNY and TEAMGROUP branded SSD's I would suspect them... if these were WD Blues or another well known brand I would be less likely to suspect them.
I would agree. I do plan to upgrade them eventually, but I've gone through 4 (2 PNY and 2 Teamgroup) already in as many months. I can understand not expecting a long life, but that seems excessive.

At this point I'm worried about upgrading them to enterprise drives, just to have them fault due to something else :/
 
Joined
Jan 18, 2017
Messages
525
I have some Patriot Burst drives that use the same controller as your CS900s (PS3111-S11-13), and I've just had another up and die on me, all within the warranty period. It is worthwhile to check and make sure it isn't something else causing the issue, but the drives themselves should not be overlooked. Even the best manufacturers have issues.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Run smartctl -x /dev/<disk> and look for the "Percentage Used Endurance Indicator" value.
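For example, something like this (a sketch; sdX is whichever device you want to look at, and the case-insensitive match also catches the NVMe "Percentage Used:" variant of the field):
Code:
sudo smartctl -x /dev/sdX | grep -i 'percentage used'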

I monitor my SSDs with this script (written for CORE, would need minor adjustments for SCALE):
Code:
#! /bin/sh

PREFIX='servers.'
SMARTCTL='/usr/local/sbin/smartctl -x'

time=$(/bin/date +%s)
hostname=$(/bin/hostname | /usr/bin/tr '.' '_')
drives=$(/bin/ls /dev | /usr/bin/egrep '^(a?da|nvme)[0-9]+$')

for drive in ${drives}
do
    case ${drive} in
        nvme*)
            wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used:/ { printf "%d", $3 }')
        ;;

        da*|ada*)
            wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used Endurance Indicator/ { printf "%d", $4 }')
        ;;
    esac

    # catch the case that $drive is not an SSD ...
    if [ "x${wear}" != 'x' ]
    then
        echo "${PREFIX}${hostname}.diskwear.${drive}.wear-percent ${wear} ${time}"
    fi
done


The output goes --> InfluxDB --> Grafana via a cron job and the result looks like this:
[screenshot: Grafana wear-percentage graph]
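For what it's worth, a minimal sketch of those SCALE adjustments (untested; assumes Debian paths and Linux device names):
Code:
# SCALE / Linux: smartctl lives under /usr/sbin and disks show up as sdX / nvmeXnY
SMARTCTL='/usr/sbin/smartctl -x'
drives=$(/bin/ls /dev | /usr/bin/grep -E '^(sd[a-z]+|nvme[0-9]+n[0-9]+)$')
# ...and in the case statement, match sd*) instead of da*|ada*)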
 
Last edited:

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Ahh, now that's a neat script. I might have to look into that more once I have everything settled. As for the wear percentage, I'm not sure if something is wrong or what, but both the bad drives show 0 for the percentage:

sdh stats:
[screenshot: smartctl output for sdh]


sdq stats:
[screenshot: smartctl output for sdq]


If I run it for my other drives, they are mostly 1 or 0 as well, which I suppose could be valid since these drives were brand new 4-5 months ago, and basically only used for video streaming and photo backups.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Ok, so the problem is not that they have a ridiculously low TBW value from the start. Just wanted to make sure you are not really wearing them out.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
They have a 540 TBW rating according to this document, which is also the reason I was interested in understanding the OP's use case.

Btw, @drobo, you did not answer the question regarding the trim cron job: not trimming the drives might cause issues.
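If you do want them trimmed, two common approaches (a sketch; assuming storage-pool is the pool name):
Code:
zpool set autotrim=on storage-pool   # let ZFS trim freed blocks continuously
zpool trim storage-pool              # or run a manual trim periodically, e.g. from a weekly cron task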
 