NVMe SSD pool intermittently removing devices

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I have two pools: an HDD pool for data and an NVMe SSD pool for VMs. Since yesterday, the NVMe SSD pool has been intermittently removing devices.

In particular, the SSD pool is a three-way mirror, 3 x 4 TB, of consumer-grade SSDs, which I know some people don't recommend, but they see low utilization: the pool is only 18.9% full.

The SSD pool has functioned perfectly for about a year. Yesterday, however, I got a notice from SCALE that one device had been removed by the administrator and that the pool was therefore unhealthy. I ordered a replacement SSD and, today, replaced the SSD that was taken offline. The pool resilvered, after which it reported healthy, with all three SSDs functioning properly.

About an hour later, however, I got a notice from SCALE that one of the other SSDs had been removed by the administrator, this time from a different manufacturer (a TeamGroup SSD rather than the Crucial SSD removed yesterday). It seems to me the chance of two SSDs from different manufacturers failing within a day of each other, after functioning properly for a year, is low. And in fact, rebooting the server brought the supposedly "failed" SSD right back online.

So now I'm in the strange situation where all three disks in the three-way mirror are online, but the pool still shows a status of "not healthy." Also, unfortunately, SMART tests in the SCALE UI do not work for NVMe SSDs, so I don't know how to diagnose the problem, if there even is a problem. Does anyone know what to do? I'm afraid I'm going to keep getting intermittent notices that devices in the pool are removed, even if they're not failing.

The SCALE version is 23.10.2 and, in case it's relevant, the CPU is a Ryzen 7 Pro 5750G and the motherboard is an ASRock Rack X570D4U-2L2T with 128 GB of ECC RAM.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results? (If you don't have one, Sabrent makes a really good one.) If you are running Windows, you can achieve this by downloading a copy of GSmartControl. I'm asking because resilvering and scrubs put a lot of wear on the NAND chips, and we all know that NAND has a limited number of writes before it fails. Consumer-grade chips, especially from lesser-known brands like TeamGroup, can be a lot less resilient to repeated full-drive writes than drives from better-known companies like Sabrent or Samsung.

Another thing I would check is whether there are any known issues with the BIOS on the motherboard, the firmware on the drives, or the firmware on any HBAs you may be using. Also, check the drive temps; NVMe drives hate getting hot.
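For the temps, something like this should do it (assuming nvme-cli is installed and the drive shows up as /dev/nvme0; adjust the device name to match yours):

nvme smart-log /dev/nvme0 | grep -i temperature    # composite controller temperature plus any extra sensors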

Finally, if you reboot the box when all the drives are working, does the pool come back healthy when the system comes back up, or does it fail a drive? If it fails a drive, is it always the same drive or a different one? And what happens if you do all of that with the drives in different sockets: does the same drive fail then, or a different one?

The issue you are having is slightly different from the one I was having. I had a drive marked as "unavailable", but it wasn't marked as faulty, failed, or degraded, and it wasn't producing any errors; when I tried to re-add it before wiping it, TrueNAS refused because the drive already carried a partition for the very pool it was being added to. It was almost like ZFS forgot the drive was part of the vdev and then didn't recognise the existing partition on the drive as a partition for the pool.

I can't give any better advice because you haven't stated the exact drive models, so I can't do any research into them. Combine this with no SMART logs, and you are asking us to take stabs in the dark.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results? (If you don't have one, Sabrent makes a really good one.) If you are running Windows, you can achieve this by downloading a copy of GSmartControl. I'm asking because resilvering and scrubs put a lot of wear on the NAND chips, and we all know that NAND has a limited number of writes before it fails. Consumer-grade chips, especially from lesser-known brands like TeamGroup, can be a lot less resilient to repeated full-drive writes than drives from better-known companies like Sabrent or Samsung.

Another thing I would check is whether there are any known issues with the BIOS on the motherboard, the firmware on the drives, or the firmware on any HBAs you may be using. Also, check the drive temps; NVMe drives hate getting hot.

Finally, if you reboot the box when all the drives are working, does the pool come back healthy when the system comes back up, or does it fail a drive? If it fails a drive, is it always the same drive or a different one? And what happens if you do all of that with the drives in different sockets: does the same drive fail then, or a different one?

The issue you are having is slightly different from the one I was having. I had a drive marked as "unavailable", but it wasn't marked as faulty, failed, or degraded, and it wasn't producing any errors; when I tried to re-add it before wiping it, TrueNAS refused because the drive already carried a partition for the very pool it was being added to. It was almost like ZFS forgot the drive was part of the vdev and then didn't recognise the existing partition on the drive as a partition for the pool.

I can't give any better advice because you haven't stated the exact drive models, so I can't do any research into them. Combine this with no SMART logs, and you are asking us to take stabs in the dark.
Thanks for your suggestions.

I have now learned how to run SMART tests on NVMe drives. Even though the SCALE UI doesn't allow it, it works from the command line via SSH. The SMART tests show all three NVMe SSDs as not having any problems.
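For anyone else in the same spot, the general shape of what I ran was something like this (a sketch, assuming the first drive shows up as /dev/nvme0; repeat for each device):

nvme device-self-test /dev/nvme0 -s 1    # start a short self-test (-s 2 for the long one)
nvme self-test-log /dev/nvme0            # check the outcome once it finishes
smartctl -a /dev/nvme0                   # overall health, error log, and wear counters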

I have determined, however, that in both cases where SCALE removed an SSD, it was the SSD installed in a PCIe NVMe adapter card, whereas the two SSDs installed directly on the motherboard have not shown any problems. It's an old PCIe NVMe card, so I ordered a new one and replaced it. I'm hoping that will solve the problem.
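For anyone who wants to check which device is on the card versus the board, something like this will map them (generic commands; the device names here are just examples):

nvme list                        # model and serial for each /dev/nvmeX
ls -l /sys/class/nvme/           # each controller's symlink includes its PCI address
lspci | grep -i 'non-volatile'   # match those addresses to the add-in card vs. the motherboard slots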

Bottom line: it's possible that the problem was due to an old, and not very reliable, PCIe NVMe card. Since I replaced the card, I've had no more problems; hopefully that will remain the case.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
Sorry for the delay in replying; I have been very busy.

I use a HighPoint 7204 card so I can have some NVMe storage, since my Supermicro X10SRL-F motherboard doesn't support it. It has worked flawlessly in both the Windows and Linux boxes I've put it in since I bought it. It doesn't support bootable RAID, but I don't need that, so it's fine. I would definitely recommend HighPoint to anyone looking for a server-grade NVMe card.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
Sorry for the delay in replying; I have been very busy.

I use a HighPoint 7204 card so I can have some NVMe storage, since my Supermicro X10SRL-F motherboard doesn't support it. It has worked flawlessly in both the Windows and Linux boxes I've put it in since I bought it. It doesn't support bootable RAID, but I don't need that, so it's fine. I would definitely recommend HighPoint to anyone looking for a server-grade NVMe card.
Thanks. However, the only HighPoint cards I see with high-profile brackets are x16 (16 lanes), which is way overkill for me and would take away all the bandwidth currently going to my network card in the bottom slot. All I need is a card that holds one NVMe SSD, because I have two NVMe slots on the motherboard and the pool is a three-way mirror. So I need a card that's x4 or, at most, x8.
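For what it's worth, you can check how many lanes a slot has actually negotiated with something like this (01:00.0 is a placeholder address; take the real one from plain lspci):

lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'    # LnkCap = maximum supported, LnkSta = currently negotiated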

Still, I appreciate your post because it has me thinking that maybe I need a better card, something that is server grade rather than consumer grade. I will do some research on this.
 

M3PH

Dabbler
Joined
Mar 3, 2024
Messages
21
And that is the disadvantage of using consumer-grade hardware for your NAS: not enough PCIe lanes. I've been there, and it is the reason I have the rig I do now. But hopefully a new card will fix your issue.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Do you have an NVMe-to-USB enclosure? If so, would it be possible for you to put the drives, one at a time, into it, plug it into your workstation/desktop/laptop, run smartctl on them, and post the results?
That is a bit much just to run a SMART test. Here is an easier way on SCALE:
Short test: nvme device-self-test /dev/nvme0 -s 1
Long test: nvme device-self-test /dev/nvme0 -s 2

Once smartmontools is upgraded to version 7.4 in TrueNAS, you can run the test normally with smartctl.

I know, how do I check the results?
nvme self-test-log /dev/nvme0

Now the tricky part: decoding it. Look for "Self Test Result[0]"; under it, "Operation Result" is 0 for pass and 1 for fail, and "Self Test Code" is 1 for a short test and 2 for a long test.

WARNING !!!! WARNING !!!
Do not do anything with the 'nvme' command other than what I posted above unless you know what you are doing.
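For reference, the log output looks roughly like this (illustrative only; the exact fields vary a little between nvme-cli versions and drives):

Device Self Test Log for NVME device: nvme0
Current operation  : 0        <- 0 means no test is running right now
Current Completion : 0%
Self Test Result[0]:
  Operation Result : 0        <- 0 = Pass, 1 = Fail
  Self Test Code   : 2        <- 1 = Short, 2 = Long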
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
That is a bit much just to run a SMART test. Here is an easier way on SCALE:
Short test: nvme device-self-test /dev/nvme0 -s 1
Long test: nvme device-self-test /dev/nvme0 -s 2

Once smartmontools is upgraded to version 7.4 in TrueNAS, you can run the test normally with smartctl.

I know, how do I check the results?
nvme self-test-log /dev/nvme0

Now the tricky part: decoding it. Look for "Self Test Result[0]"; under it, "Operation Result" is 0 for pass and 1 for fail, and "Self Test Code" is 1 for a short test and 2 for a long test.

WARNING !!!! WARNING !!!
Do not do anything with the 'nvme' command other than what I posted above unless you know what you are doing.
Thanks. As I mentioned in one of the comments subsequent to my initial post, I figured out how to do this (run SMART tests in SCALE from the command line), and the drives are fine.

At this point, I have replaced the PCIe card and am monitoring the situation. Even with the new PCIe card, I had a couple of drops where the SSD went offline and the pool was degraded, but for the last 24 hours everything has been fine and the pool is healthy.

My next step, if the SSD keeps dropping out of the pool, will be to buy a higher-end PCIe card and see if that solves the problem.
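In the meantime, standard checks like these should catch another drop early (just the usual commands):

zpool status -x            # prints "all pools are healthy" when everything is fine
dmesg -T | grep -i nvme    # kernel messages about controller resets or timeouts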
 

loca5790

Dabbler
Joined
Oct 16, 2023
Messages
18
I just went through something similar with Samsung 990 Pros. It ended up being two bad drives... I replaced the two drives and have been up for 40 days now. Not sure how much this helps; it's just my experience. I spent days thinking the coincidence of two bad drives was just unreal.

Since then I have added the multi_report script so I get emails in case a drive drops again and I don't lose a pool.
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I just went through something similar with Samsung 990 Pros. It ended up being two bad drives... I replaced the two drives and have been up for 40 days now. Not sure how much this helps; it's just my experience. I spent days thinking the coincidence of two bad drives was just unreal.

Since then I have added the multi_report script so I get emails in case a drive drops again and I don't lose a pool.
That's interesting feedback, although in my case, one of the drives is brand new. Still, two bad drives is possible.

It's always good to get email notifications when the pool is unhealthy. The ability to fine-tune email notification conditions is one of my favorite features of TrueNAS, both SCALE and Core.
 