NVME in top PCIE slot intermittently drops out

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
I have an NVME SSD that's part of a pool and is on an x4 PCIE card in the top PCIE slot on my motherboard. The bottom PCIE slot has an x4 network card (x710-D2) in it.

This setup has worked fine for about a year, but within the past week, the NVME SSD has started dropping out. When this happens, the SCALE UI gives the following message: "Pool SSD_pool state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state."

The problem is likely not due to a faulty SSD, because I've tried two different SSDs, one of them brand new, and the problem occurs no matter what. Also, the problem likely isn't due to the PCIE card, because I've tried two of them also, one brand new, and the problem occurs no matter what.

What happens is that, after I reboot SCALE, the SSD is typically recognized, but after an hour or so it drops out and the pool shows as degraded.

Moreover, in the IPMI web interface, the SSD in question shows as present and enabled even when SCALE doesn't recognize it. On the other hand, "fdisk -l" does not show the SSD.
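
To pin down exactly when the drive disappears from the OS (as opposed to what IPMI reports), one option is to poll sysfs and log a timestamp whenever the set of NVMe controllers changes. This is only a rough sketch, not a tested tool: it assumes a Linux system that exposes NVMe controllers under /sys/class/nvme, and the function names (`list_nvme_controllers`, `watch`) and the 30-second interval are made up for illustration.

```python
import time
from pathlib import Path


def list_nvme_controllers(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: model} for every NVMe controller the kernel exposes.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory; the default is the usual Linux location.
    """
    root = Path(sysfs_root)
    controllers = {}
    if not root.is_dir():
        return controllers
    for entry in sorted(root.iterdir()):
        try:
            model = (entry / "model").read_text().strip()
        except OSError:
            # The attribute may be missing or unreadable while a device is going away.
            model = "unknown"
        controllers[entry.name] = model
    return controllers


def watch(interval_s=30, sysfs_root="/sys/class/nvme"):
    """Print a timestamped line whenever a controller appears or disappears."""
    previous = list_nvme_controllers(sysfs_root)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "initial:", previous)
    while True:
        time.sleep(interval_s)
        current = list_nvme_controllers(sysfs_root)
        now = time.strftime("%Y-%m-%d %H:%M:%S")
        for name in sorted(previous.keys() - current.keys()):
            print(now, "disappeared:", name, previous[name])
        for name in sorted(current.keys() - previous.keys()):
            print(now, "appeared:", name, current[name])
        previous = current
```

Calling `watch()` from a Python session (for example inside tmux) and leaving it running would give a drop-out time that can be compared against the kernel log for the same moment.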

Does anyone know how to troubleshoot this issue? The motherboard is ASRock Rack x570d4u-2L2T, and the CPU is Ryzen 7 Pro 5750G.
 

nabsltd

Contributor
Joined
Jul 1, 2022
Messages
133
What happens is that, after I reboot SCALE, the SSD is typically recognized, but after an hour or so it drops out and the pool shows as degraded.
This sounds a lot like an overheating issue. Is it an M.2 or U.2 drive? Both can benefit from direct cooling, but U.2 drives often require it.

OTOH, the slot could be part of the problem. Have you tried swapping the two x4 cards?
 

Patrick_3000

Contributor
Joined
Apr 28, 2021
Messages
167
This sounds a lot like an overheating issue. Is it an M.2 or U.2 drive? Both can benefit from direct cooling, but U.2 drives often require it.

OTOH, the slot could be part of the problem. Have you tried swapping the two x4 cards?
It's M.2. Temperatures have been OK, nothing in excess of 40 Celsius around the time it was dropping out of the pool according to the SCALE UI, and around 30 Celsius when idling.
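
For cross-checking the UI readings, the temperature can also be read straight from the drive with nvme-cli. A small sketch, under some assumptions: nvme-cli is installed, the drive is /dev/nvme0 (a guess, adjust as needed), and the JSON output of `nvme smart-log` reports the composite temperature in Kelvin under the key "temperature", which is how it behaves as far as I know. The function names are hypothetical.

```python
import json
import subprocess


def parse_temperature_c(smart_log_json):
    """Convert the 'temperature' field (assumed to be Kelvin) of smart-log JSON to Celsius."""
    data = json.loads(smart_log_json)
    return data["temperature"] - 273


def read_temperature_c(device="/dev/nvme0"):
    """Run nvme-cli against the given device (assumed path) and return degrees Celsius."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_temperature_c(result.stdout)
```

Logging `read_temperature_c()` every minute or so alongside a timestamp would show whether the temperature climbs just before a drop, which a UI graph with coarse sampling might miss.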

I have swapped out x4 cards, and no matter which card I used, there have been drops.

On the other hand, it hasn't happened for the past couple days. I'm not sure why, but the pool has been healthy with no problems for around two days now.
 