NVMe SSD pool intermittently removing devices

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I have two pools: an HDD pool for data and an NVMe SSD pool for VMs. Since yesterday, the NVMe SSD pool has been intermittently removing devices.

In particular, the SSD pool is a three-way mirror, 3 x 4 TB, of consumer-grade SSDs, which I know some people don't recommend, but they see low utilization: the pool is only 18.9% full.

The SSD pool has functioned perfectly for about a year. Yesterday, however, I got a notice from SCALE that one device had been removed by the administrator and that the pool was therefore unhealthy. I ordered a replacement SSD and, today, replaced the SSD that was taken offline. The pool resilvered, after which it reported healthy, with all three SSDs functioning properly.

About an hour later, however, I got a notice from SCALE that one of the other SSDs had been removed by the administrator, this time from a different manufacturer (a TeamGroup SSD rather than the Crucial SSD removed yesterday). It seems to me the chance of two SSDs from different manufacturers failing within a day of each other, after functioning properly for a year, is low. And in fact, rebooting the server brought the supposedly "failed" SSD right back online.

So now I'm in the strange situation where all three disks in the three-way mirror are online, but the pool still shows a status of "not healthy." Also, unfortunately, SMART tests in the SCALE UI do not work for NVMe SSDs, so I don't know how to diagnose the problem, if there even is a problem. Does anyone know what to do? I'm afraid I'm going to keep getting intermittent notices that devices in the pool are removed, even if they're not failing.

The SCALE version is 23.10.2 and, in case it's relevant, the CPU is a Ryzen 7 Pro 5750G and the motherboard is an ASRock Rack X570D4U-2L2T with 128 GB of ECC RAM.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results? (If you don't have one, Sabrent makes a really good one.) If you are running Windows, you can achieve this by downloading a copy of GSmartControl. I'm asking because resilvering and scrubs put a lot of wear on the NAND chips, and we all know that NAND has a limited number of writes before it fails. Consumer-grade chips, especially from lesser-known brands like TeamGroup, can be a lot less resilient to repeated full-drive writes than drives from better-known companies like Sabrent or Samsung.

Another thing I would check is whether there are any known issues with the BIOS on the motherboard, the firmware on the drives, or the firmware on any HBAs you may be using. Also, check the drive temps; NVMe drives hate getting hot.
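For the temps, something like this should do it (assuming nvme-cli is installed and the drive shows up as /dev/nvme0; adjust the device name to match yours):

nvme smart-log /dev/nvme0 | grep -i temperature    # composite controller temperature plus any extra sensors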

Finally, if you reboot the box when all the drives are working, does the pool come back healthy when the system comes back up, or does it fail a drive? If it fails a drive, is it always the same drive or a different one? And what happens if you do all of that with the drives in different sockets: does the same drive fail then, or a different one?

The issue you are having is slightly different from the one I was having. I had a drive marked as "unavailable", but it wasn't marked as faulty, failed, or degraded, and it wasn't producing any errors; when I tried to re-add it before wiping it, TrueNAS refused because the drive already carried a partition for the very pool it was being added to. It was almost like ZFS forgot the drive was part of the vdev and then didn't recognise the existing partition on the drive as a partition for the pool.

I can't give any better advice because you haven't stated the exact drive models, so I can't do any research into them. Combine this with no SMART logs, and you are asking us to take stabs in the dark.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results? (If you don't have one, Sabrent makes a really good one.) If you are running Windows, you can achieve this by downloading a copy of GSmartControl. I'm asking because resilvering and scrubs put a lot of wear on the NAND chips, and we all know that NAND has a limited number of writes before it fails. Consumer-grade chips, especially from lesser-known brands like TeamGroup, can be a lot less resilient to repeated full-drive writes than drives from better-known companies like Sabrent or Samsung.

Another thing I would check is whether there are any known issues with the BIOS on the motherboard, the firmware on the drives, or the firmware on any HBAs you may be using. Also, check the drive temps; NVMe drives hate getting hot.

Finally, if you reboot the box when all the drives are working, does the pool come back healthy when the system comes back up, or does it fail a drive? If it fails a drive, is it always the same drive or a different one? And what happens if you do all of that with the drives in different sockets: does the same drive fail then, or a different one?

The issue you are having is slightly different from the one I was having. I had a drive marked as "unavailable", but it wasn't marked as faulty, failed, or degraded, and it wasn't producing any errors; when I tried to re-add it before wiping it, TrueNAS refused because the drive already carried a partition for the very pool it was being added to. It was almost like ZFS forgot the drive was part of the vdev and then didn't recognise the existing partition on the drive as a partition for the pool.

I can't give any better advice because you haven't stated the exact drive models, so I can't do any research into them. Combine this with no SMART logs, and you are asking us to take stabs in the dark.
Thanks for your suggestions.

I have now learned how to run SMART tests on NVMe drives. Even though the SCALE UI doesn't allow it, it works from the command line via SSH. The SMART tests show all three NVMe SSDs as not having any problems.
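For anyone else in the same spot, the general shape of what I ran was something like this (a sketch, assuming the first drive shows up as /dev/nvme0; repeat for each device):

nvme device-self-test /dev/nvme0 -s 1    # start a short self-test (-s 2 for the long one)
nvme self-test-log /dev/nvme0            # check the outcome once it finishes
smartctl -a /dev/nvme0                   # overall health, error log, and wear counters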

I have determined, however, that in both cases where SCALE removed an SSD, it was the SSD installed in a PCIe NVMe adapter card, whereas the two SSDs installed directly on the motherboard have not shown any problems. It's an old PCIe NVMe card, so I ordered a new one and replaced it. I'm hoping that will solve the problem.
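For anyone who wants to check which device is on the card versus the board, something like this will map them (generic commands; the device names here are just examples):

nvme list                        # model and serial for each /dev/nvmeX
ls -l /sys/class/nvme/           # each controller's symlink includes its PCI address
lspci | grep -i 'non-volatile'   # match those addresses to the add-in card vs. the motherboard slots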

Bottom line: it's possible that the problem was due to an old, and not very reliable, PCIe NVMe card. Since I replaced the card, I've had no more problems; hopefully that will remain the case.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
Sorry for the delay in replying; I have been very busy.

I use a HighPoint 7204 card so I can have some NVMe storage, since my Supermicro X10SRL-F motherboard doesn't support it. It has worked flawlessly in both the Windows and Linux boxes I've put it in since I bought it. It doesn't support bootable RAID, but I don't need that, so it's fine. I would definitely recommend HighPoint to anyone looking for a server-grade NVMe card.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
Sorry for the delay in replying; I have been very busy.

I use a HighPoint 7204 card so I can have some NVMe storage, since my Supermicro X10SRL-F motherboard doesn't support it. It has worked flawlessly in both the Windows and Linux boxes I've put it in since I bought it. It doesn't support bootable RAID, but I don't need that, so it's fine. I would definitely recommend HighPoint to anyone looking for a server-grade NVMe card.
Thanks. However, the only HighPoint cards I see with high-profile brackets are x16 (16 lanes), which is way overkill for me and would take away all the bandwidth currently going to my network card in the bottom slot. All I need is a card that holds one NVMe SSD, because I have two NVMe slots on the motherboard and the pool is a three-way mirror. So I need a card that's x4 or, at most, x8.
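For what it's worth, you can check how many lanes a slot has actually negotiated with something like this (01:00.0 is a placeholder address; take the real one from plain lspci):

lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'    # LnkCap = maximum supported, LnkSta = currently negotiated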

Still, I appreciate your post because it has me thinking that maybe I need a better card, something that is server grade rather than consumer grade. I will do some research on this.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
And that is the disadvantage of using consumer-grade hardware for your NAS: not enough PCIe lanes. I've been there, and it is the reason I have the rig I do now. But hopefully a new card will fix your issue.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results?
That is a bit much just to run a SMART test. Here is an easier way on SCALE:
Short test: nvme device-self-test /dev/nvme0 -s 1
Long test: nvme device-self-test /dev/nvme0 -s 2

Once smartmontools is upgraded to version 7.4 in TrueNAS, you can run the test normally with smartctl.

I know, how do I check the results?
nvme self-test-log /dev/nvme0

Now the tricky part: decoding it. Look for "Self Test Result[0]"; under it, "Operation Result" is 0 for pass and 1 for fail, and "Self Test Code" is 1 for a short test and 2 for a long test.

WARNING !!!! WARNING !!!
Do not do anything with the 'nvme' command other than what I posted above unless you know what you are doing.
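For reference, the log output looks roughly like this (illustrative only; the exact fields vary a little between nvme-cli versions and drives):

Device Self Test Log for NVME device: nvme0
Current operation  : 0        <- 0 means no test is running right now
Current Completion : 0%
Self Test Result[0]:
  Operation Result : 0        <- 0 = Pass, 1 = Fail
  Self Test Code   : 2        <- 1 = Short, 2 = Long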
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
That is a bit much just to run a SMART test. Here is an easier way on SCALE:
Short test: nvme device-self-test /dev/nvme0 -s 1
Long test: nvme device-self-test /dev/nvme0 -s 2

Once smartmontools is upgraded to version 7.4 in TrueNAS, you can run the test normally with smartctl.

I know, how do I check the results?
nvme self-test-log /dev/nvme0

Now the tricky part: decoding it. Look for "Self Test Result[0]"; under it, "Operation Result" is 0 for pass and 1 for fail, and "Self Test Code" is 1 for a short test and 2 for a long test.

WARNING !!!! WARNING !!!
Do not do anything with the 'nvme' command other than what I posted above unless you know what you are doing.
Thanks. As I mentioned in one of the comments subsequent to my initial post, I figured out how to do this (run SMART tests in SCALE from the command line), and the drives are fine.

At this point, I have replaced the PCIe card and am monitoring the situation. Even with the new PCIe card, I had a couple of drops where the SSD went offline and the pool was degraded, but for the last 24 hours everything has been fine and the pool is healthy.

My next step, if the SSD keeps dropping out of the pool, will be to buy a higher-end PCIe card and see if that solves the problem.
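In the meantime, standard checks like these should catch another drop early (just the usual commands):

zpool status -x            # prints "all pools are healthy" when everything is fine
dmesg -T | grep -i nvme    # kernel messages about controller resets or timeouts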
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
I just went through something similar with Samsung 990 Pros. It ended up being two bad drives... I replaced the two drives and have been up for 40 days now. Not sure how much this helps; it's just my experience. I spent days thinking the coincidence of two bad drives was just unreal.

Since then I have added the multi_report script so I get emails in case a drive drops again and I don't lose a pool.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I just went through something similar with Samsung 990 Pros. It ended up being two bad drives... I replaced the two drives and have been up for 40 days now. Not sure how much this helps; it's just my experience. I spent days thinking the coincidence of two bad drives was just unreal.

Since then I have added the multi_report script so I get emails in case a drive drops again and I don't lose a pool.
That's interesting feedback, although in my case, one of the drives is brand new. Still, two bad drives is possible.

It's always good to get email notifications when the pool is unhealthy. The ability to fine-tune email notification conditions is one of my favorite features of TrueNAS, both SCALE and Core.
 