Weird NVMe behavior - Hardware issue?

Migsi

Dabbler
Joined
Mar 3, 2021
Messages
40
Hello there,

Yesterday one of my TrueNAS 12 (U5.1) systems had a strange hiccup. I noticed an outage starting at some point in the morning, but didn't think too much about it, as the internet connection to that site is not very reliable. After the system had been unreachable all day and my brother, who is on site, confirmed that the problem was with the machine itself, I started to debug. Until he performed a hard reset, neither the web interface nor any service hosted by the system was reachable. Once it was up again, we got a warning about one of the two NVMe drives causing slow I/O, with a timestamp close to when the system became unreachable. I checked the reporting graphs: they kept recording for a while after the outage became noticeable, then stopped, resumed in the afternoon, and stopped again in the evening some time before the hard reset. They show that this SSD indeed had latencies as high as 8 seconds.
1633510684406.png

For reference, this is the latency graph of the other NVMe drive during the same time period.
1633510763007.png
As the system continued to run fine after the hard reset, I ran long SMART tests on both NVMe drives, which showed no signs of failure, so I suspect the controller is the culprit. I'm unsure how to proceed now, though, besides (probably) buying a replacement up front in case this was a "warning" about an imminent drive failure.
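A passing self-test does not say much about controller hiccups, so it can also help to look at the drive's SMART/health log counters. The following is only a rough sketch of how that could be scripted: it assumes smartmontools 7.x (for `smartctl -j` JSON output), and the device path `/dev/nvme0` and both function names are hypothetical placeholders, not anything taken from this system.

```python
import json
import subprocess


def summarize_nvme_health(report):
    """Return a list of warnings derived from a smartctl JSON report (dict)."""
    # Key names follow smartctl's JSON output for NVMe devices.
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []

    critical = log.get("critical_warning", 0)
    if critical != 0:
        warnings.append(f"critical_warning flags set: {critical:#04x}")

    media_errors = log.get("media_errors", 0)
    if media_errors > 0:
        warnings.append(f"media errors reported: {media_errors}")

    # Many drives log benign entries here, so treat this as a hint only.
    err_entries = log.get("num_err_log_entries", 0)
    if err_entries > 0:
        warnings.append(f"error log entries: {err_entries}")

    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        warnings.append(f"available spare {spare}% is below threshold {threshold}%")

    return warnings


def read_smartctl_json(device):
    """Run smartctl and parse its JSON output (device path is an assumption)."""
    # smartctl's exit status is a bit mask, so it is not checked here.
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Usage would look like `summarize_nvme_health(read_smartctl_json("/dev/nvme0"))`, run as root. An empty list only means these particular counters look clean; it does not rule out a flaky controller.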

NOTE: I'll update my signature with technical details about my systems. In case you've read this thread and there is no information visible yet, please just wait a minute before demanding the obvious. ;)

EDIT: System information is now available in my signature, this post refers to the "Small NAS".
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I imagine based on my reading of the manual (https://download.asrock.com/Manual/A320M-HDV R4.0.pdf) that there may be PCIe contention since you have an Athlon CPU installed.

The M.2 slot only supports 2 lanes with that CPU, not the 4 that the SSD is capable of.

Do you know which of the SSDs is experiencing the problem of extremely high write latency? (if it's the one in the M.2 slot, maybe there's something to my theory).
 

Migsi

Dabbler
Joined
Mar 3, 2021
Messages
40
You are right that the native port only supports Gen3 x2 with this CPU; I was already aware of that. I'll have to ask my brother to check which drive is in which slot, as I'm not on site (and won't be anytime soon). As soon as he has checked, I'll add the info to this post.

Anyway, I don't think that could actually cause the issue, as the system had been running fine for well over a year. Also, PCIe 3.0 x2 is still capable of roughly 16 Gb/s (about 2 GB/s) in theory, which would be very hard to saturate on that system. Judging by the reports, the disk simply misbehaved and took ages to complete requests, while the other NVMe drive finished its requests within milliseconds. I'll attach some more report screenshots.
1633514568078.png

1633514648535.png

1633514667591.png
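For anyone wanting to double-check the PCIe 3.0 x2 figure mentioned above, here is a small back-of-the-envelope sketch. It assumes the usual PCIe 3.0 parameters (8 GT/s per lane, 128b/130b line encoding) and ignores packet and protocol overhead, so real-world throughput will be somewhat lower. The function name is made up for illustration.

```python
def pcie_bandwidth_gbit(gt_per_s, lanes, payload_bits=128, total_bits=130):
    """Theoretical link bandwidth in Gb/s after line encoding.

    Ignores TLP/DLLP protocol overhead, so this is an upper bound.
    """
    return gt_per_s * lanes * payload_bits / total_bits


# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding
x2 = pcie_bandwidth_gbit(8.0, lanes=2)
x4 = pcie_bandwidth_gbit(8.0, lanes=4)

print(f"x2: {x2:.2f} Gb/s = {x2 / 8:.2f} GB/s")
print(f"x4: {x4:.2f} Gb/s = {x4 / 8:.2f} GB/s")
```

With two lanes this comes out at about 15.75 Gb/s (roughly 1.97 GB/s), half of what the same drive would get on four lanes. That still leaves plenty of headroom on a small NAS, which is why a link-width limit alone seems unlikely to explain multi-second latencies.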
 

Migsi

Dabbler
Joined
Mar 3, 2021
Messages
40
Excuse me if I *bump* this once, I'd still need some help figuring this out. :S
 