SSDs are dying one at a time - trying to find the culprit

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
I have a DIY TrueNAS SCALE system that I set up 5-6 months ago. About once a month one of my SSDs starts throwing read/write errors and faults. For a while I can just reboot to clear it, but after a few weeks the drive eventually faults for good and never comes back. If I remove it and connect it to another system, it doesn't show up at all.

I've replaced 4 drives so far, about one a month, and I started to think I had one bad power/HBA cable until I started seeing faults on 2 separate drives. Due to life events I just RMA'd a couple of drives, thinking it was a bad batch, then started buying new ones to keep it chugging along. I couldn't get any more of the SSDs I started with, so I now have a mix of vendors, and those are starting to fault too. I'd like to figure out what is going on, as I have 2 faulted drives and only 1 spare to extend its life a bit. Thankfully I have a remote spinning-rust server to back up my mission-critical data [read: family photos], but I'd rather not lose all my other data.

I'm not looking for a miracle, but I'd love to figure out what is causing this so I can replace just that (hopefully) one part, since I don't have many spare parts, just a couple of 2TB SSDs that I planned to use to keep everything running for a bit. Please let me know what info I can provide to help narrow down the issue.

System specs:
  • TrueNAS-SCALE-22.12.3.2
  • LSI SAS 9300-16I
  • 16 x 2TB SSDs for storage connected to the 9300-16I:
  • 10 x PNY CS900 2TB
  • 6 x TEAMGROUP T-Force Vulcan Z 2TB
  • 2 x 240GB HPE SSDs in the boot pool, connected to motherboard SATA
  • ASRock Rack E3C246D4U Motherboard
  • Intel core i3-9100T
  • 16GB DDR4-2400MHz ECC NEMIX memory
  • Not sure on the PSU, I think it was a plain Corsair 600 or 700W ATX PSU
Attached are some pictures of the faults and some additional info:
[screenshot]


The latest SSD to fault:
[screenshot]


SSD that has been faulted for a bit while I waited for a new SSD to arrive:
[screenshot]
 


Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Have you flashed your HBAs in IT mode?
Have you tried replacing the PSU?
What is your usage? Do you have auto-trim enabled?
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Have you flashed your HBAs in IT mode?
Have you tried replacing the PSU?
What is your usage? Do you have auto-trim enabled?
The HBA was supposed to be flashed to IT mode before I purchased it. It's hard to find, from a Google search, a command I can run to verify this, as most of the results don't seem to work for me, i.e. sas3flash and sas2flash both state command not found.

I haven't replaced the PSU. I am certainly open to replacing it if I can narrow it down to that with reasonable suspicion, but budget is tight this time of year :/ I am using a few SATA power splitters, which I know isn't ideal, but I needed some way to power 18 drives ha..

Apologies if this isn't what you meant, but my usage is:
Used: 5.23 TiB
Available: 18.5 TiB

Auto-trim is off
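For reference, the pool property can be double-checked with something like this (a sketch; your-pool-name is a placeholder for the actual pool):
Code:
zpool get autotrim your-pool-name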
 

LarsR

Guru
Joined
Oct 23, 2020
Messages
719
Which user are you using to run those commands? root or admin? If it's the admin user, try sudo sas3flash or sudo sas2flash.
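Something along these lines should confirm it (a sketch; sas3flash is the right utility for a 9300-series card, sas2flash is for the older 6 Gb/s cards):
Code:
sudo sas3flash -list

In the output, the IT firmware usually identifies itself in the Firmware Product ID line, and the firmware version is listed there too.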
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Which user are you using to run those commands? root or admin? If it's the admin user, try sudo sas3flash or sudo sas2flash.
Wow, I feel stupid. I'm not new to Linux; I don't know why it didn't occur to me to do that... but if I'm reading it right, I should be in IT mode.

[screenshot: sas3flash output]
 


drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
I went ahead and upgraded the firmware to 16.00.12.00 as suggested in the TrueNAS post about 9300 cards, but everything is as it was; I still have 2 faulted SSDs. At this point, after a reboot the first one will show fine and increment errors slowly, but then after about 15-20 minutes it errors enough to fault.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Why are you rebooting? In order to clear errors you can use zpool clear poolname. Actually, before doing so please post the output of zpool status.

The HBA looks like it's in IT mode.
About the power splitters... it could be the issue, but you have to actually tell us what is connected to what and how.

Suggested reading:

Auto-trim is off
Does this mean you have a cron job that periodically trims your SSDs?

EDIT: regarding usage, I mean what are you using your pools for (i.e. block storage, video editing, etc.).
 
Last edited:

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
The latest reboot was just because of the firmware upgrade; before that, I guess it's more accurate to say I was shutting down and powering back on after trying to reseat cables / replace drives.

Here's the requested output:

[screenshots: zpool status output]


I'll have to open up the box later tonight, but the more I think about it, the more I'm fairly certain almost all of the drives are on splitters. I'll read through the post you linked as well, thank you.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The thing that perplexes me is that usually power-related issues cause cksum/CRC errors, at least on HDDs.

Running a zpool clear storage-pool and a zpool scrub storage-pool might help troubleshoot the issue, but make sure you have a backup of the pool first.
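Roughly this sequence (a sketch, assuming storage-pool is indeed the pool name from your screenshots):
Code:
zpool status -v storage-pool   # note which devices are accumulating read/write/cksum errors
zpool clear storage-pool       # reset the error counters
zpool scrub storage-pool       # re-read and verify every block in the pool
zpool status -v storage-pool   # check again once the scrub finishes; errors will return if the fault persists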
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
The thing that perplexes me is that usually power-related issues cause cksum/CRC errors, at least on HDDs.

Running a zpool clear storage-pool and a zpool scrub storage-pool might help troubleshooting the issue, but make sure you have a backup of the pool first.
Thank you, I'll crack open my NAS tonight and figure out everything I can about the PSU. The math suggests I should be good, aside from the splitters.. but I do have my old Synology that I can spin up again to back everything up to. Once that's done I can run the scrub; it'll be at least a day before I can, of course, pending everything being backed up.
If anyone can think of anything I can check to narrow it down further in the meantime, I'm all ears.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The math suggests I should be good, aside from the splitters..
Double-check the amps on each rail, as I suspect that to be the issue.
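As a rough, back-of-the-envelope example (assumed figures only, so check your drives' data sheets and the PSU label): 2.5" SATA SSDs draw from the 5 V rail, and a consumer drive can pull somewhere around 0.5-1.5 A during sustained writes, so 18 drives could ask for anywhere from roughly 9 A to 27 A at peak. Budget 600-700 W ATX units often rate the 5 V rail at only 20-25 A, and each SATA connector and splitter run has its own current limit on top of that, so a nominally big PSU can still sag under an all-SSD load.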
 
Joined
Jan 18, 2017
Messages
525
Honestly, with PNY and TEAMGROUP branded SSDs I would suspect them... if these were WD Blues or another well-known brand I would be less likely to suspect them.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Honestly with PNY and TEAMGROUP branded SSD's I would suspect them... if these were WD Blues or another well known brand I would be less likely to suspect them.
Actually, I vaguely remember a thread about issues with PNY SSDs, but it has been a while.
 

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Honestly with PNY and TEAMGROUP branded SSD's I would suspect them... if these were WD Blues or another well known brand I would be less likely to suspect them.
I would agree. I do plan to upgrade them eventually, but I've gone through 4 (2 PNY and 2 Teamgroup) already in as many months. I can understand not expecting a long life, but that seems excessive.

At this point I'm worried about upgrading them to enterprise drives, just to have them fault due to something else :/
 
Joined
Jan 18, 2017
Messages
525
I have some Patriot Burst drives that use the same controller as your CS900s (PS3111-S11-13), and I've just had another up and die on me, all within the warranty period. It is worthwhile to check and make sure it isn't something else causing the issue, but the drives themselves should not be overlooked. Even the best manufacturers have issues.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Run smartctl -x /dev/<disk> and look for the "Percentage Used Endurance Indicator" value.
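For example, something like this (a sketch; sdX is whichever device you want to look at, and the case-insensitive match also catches the NVMe "Percentage Used:" variant of the field):
Code:
sudo smartctl -x /dev/sdX | grep -i 'percentage used'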

I monitor my SSDs with this script (written for CORE, would need minor adjustments for SCALE):
Code:
#! /bin/sh

PREFIX='servers.'
SMARTCTL='/usr/local/sbin/smartctl -x'

time=$(/bin/date +%s)
hostname=$(/bin/hostname | /usr/bin/tr '.' '_')
drives=$(/bin/ls /dev | /usr/bin/egrep '^(a?da|nvme)[0-9]+$')

for drive in ${drives}
do
    case ${drive} in
        nvme*)
            wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used:/ { printf "%d", $3 }')
        ;;

        da*|ada*)
            wear=$(${SMARTCTL} /dev/${drive} | awk '/Percentage Used Endurance Indicator/ { printf "%d", $4 }')
        ;;
    esac

    # catch the case that $drive is not an SSD ...
    if [ "x${wear}" != 'x' ]
    then
        echo "${PREFIX}${hostname}.diskwear.${drive}.wear-percent ${wear} ${time}"
    fi
done


The output goes --> InfluxDB --> Grafana via a cron job and the result looks like this:
[screenshot: Grafana wear-percentage graph]
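For what it's worth, a minimal sketch of those SCALE adjustments (untested; assumes Debian paths and Linux device names):
Code:
# SCALE / Linux: smartctl lives under /usr/sbin and disks show up as sdX / nvmeXnY
SMARTCTL='/usr/sbin/smartctl -x'
drives=$(/bin/ls /dev | /usr/bin/grep -E '^(sd[a-z]+|nvme[0-9]+n[0-9]+)$')
# ...and in the case statement, match sd*) instead of da*|ada*)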
 
Last edited:

drobo

Dabbler
Joined
Mar 4, 2019
Messages
18
Ahh, now that's a neat script. I might have to look into that more once I have everything settled. As for the wear percentage, I'm not sure if something is wrong or what, but both the bad drives show 0 for the percentage:

sdh stats:
[screenshot: smartctl output for sdh]


sdq stats:
[screenshot: smartctl output for sdq]


If I run it for my other drives, they are mostly 1 or 0 as well, which I suppose could be valid since these drives were brand new 4-5 months ago, and basically only used for video streaming and photo backups.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Ok, so the problem is not that they have a ridiculously low TBW value from the start. Just wanted to make sure you are not really wearing them out.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
They have a 540 TBW rating according to this document, which is also the reason I was interested in understanding the OP's use case.

Btw, @drobo, you did not answer the question regarding the trim cron job: not trimming the drives might cause issues.
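If you do want them trimmed, two common approaches (a sketch; assuming storage-pool is the pool name):
Code:
zpool set autotrim=on storage-pool   # let ZFS trim freed blocks continuously
zpool trim storage-pool              # or run a manual trim periodically, e.g. from a weekly cron task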
 