SOLVED NVME SSD Dataset periodically offline

Joined
Feb 22, 2020
Messages
5
Hi,
please help me debug this issue:

My ssdPool (which consists of a single M.2 NVMe SSD) periodically goes offline. The time it runs normally before dropping seems random: anywhere from 2 hours to about 2 days. After it happens, the device cannot be found anywhere, as if it were disconnected. It does not reappear even after a reboot; I have to shut the NAS down completely for a couple of minutes, and then it shows up normally as if nothing happened.

My pool looks like this - it contains my apps dataset.
1681157471916.png


My first guess was that the SSD was overheating, but I ruled that out: adding a cooler and more airflow didn't help at all, and the temperature seems fine, staying under 50C all the time.
The SSD doesn't support SMART, so I don't know how to debug further.

My plan B is to move everything to the second pool, but I'm not sure how hard that would be, considering I need to move the apps folder and all the permissions as well.

Any advice appreciated. :)

EDIT: solved in post #11
 
Joined
Oct 22, 2019
Messages
3,641
You can use nvmecontrol to check its info and logs.

Examples for error, info, and tests log pages:
Code:
nvmecontrol logpage -p 1 nvme0
nvmecontrol logpage -p 2 nvme0
nvmecontrol logpage -p 6 nvme0


and the temperatures seems fine under 50c all the time.
It's idling at 50C?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You might also investigate power saving features. It is possible that the NVMe drive is going to power save automatically after a specific period of time since last access. Access would be random enough that you might not be able to make a correlation between last access and drop.

Further, PCIe power save might also be a factor.

Plus, I agree with @winnielinnie, 50c sounds very high.
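
For anyone wanting to see what the OS side is currently doing about power saving, here is a minimal sketch. It assumes a Linux-based system (such as TrueNAS SCALE) where the kernel exposes the `pcie_aspm` and `nvme_core` module parameters under `/sys/module`; the helper name `active_aspm_policy` is made up for illustration. It only reads settings and changes nothing.

```shell
#!/bin/sh
# Print the active PCIe ASPM policy from text such as
# "[default] performance powersave powersupersave" read on stdin.
# The kernel marks the active policy with square brackets.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a Linux host (these files exist only if the kernel exposes them):
#   active_aspm_policy < /sys/module/pcie_aspm/parameters/policy
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

The second file shows the latency limit the NVMe driver uses when enabling the drive's autonomous power state transitions. BIOS-level settings are separate from both of these and have to be checked in the firmware setup itself.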
 

tsm37

Dabbler
Joined
Feb 19, 2023
Messages
46
Agreed. 50C at idle is high for an M.2 NVMe drive. Is it a PCIe Gen 4 drive? If so, do you have a heatsink and thermal pad on it? With a heatsink and thermal pad, a Gen 4 drive should idle around 30-35C, and in any case below 40C, unless it sits close to, say, an HBA or another card with a heatsink that generates heat.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
You can use nvmecontrol to check its info and logs.
Not on SCALE...

you need to use nvme:

nvme smart-log /dev/nvme0n1 for example.

You can run nvme list first to find the actual list of devices you can work with.
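
If there are several drives, the two commands above can be combined into a loop. This is only a sketch: it assumes the usual Linux naming scheme (`/dev/nvme0` for the controller, `/dev/nvme0n1` for the namespace, `/dev/nvme0n1p1` for a partition), and `nvme_namespaces` is a hypothetical helper name, not part of nvme-cli.

```shell
#!/bin/sh
# Keep only whole-namespace device nodes (e.g. /dev/nvme0n1), skipping
# controller nodes (/dev/nvme0) and partitions (/dev/nvme0n1p1).
nvme_namespaces() {
    for dev in "$@"; do
        case "${dev##*/}" in
            nvme*n*p*) ;;                       # partition: skip
            nvme[0-9]*n[0-9]*) echo "$dev" ;;   # namespace: keep
        esac
    done
}

# Usage (needs root and the nvme-cli package):
#   for dev in $(nvme_namespaces /dev/nvme*); do
#       nvme smart-log "$dev"
#   done
```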
 
Joined
Feb 22, 2020
Messages
5
Thanks for the correct command. My log is as follows:
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 29 C
available_spare : 100%
available_spare_threshold : 1%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 8235004
data_units_written : 6185963
host_read_commands : 34676013
host_write_commands : 111061253
controller_busy_time : 108
power_cycles : 82
power_on_hours : 489
unsafe_shutdowns : 54
media_errors : 0
num_err_log_entries : 150
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 29 C
Temperature Sensor 2 : 0 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
For me nothing seems suspicious there.
You might also investigate power saving features. It is possible that the NVMe drive is going to power save automatically after a specific period of time since last access. Access would be random enough that you might not be able to make a correlation between last access and drop.

Further, PCIe power save might also be a factor.

Plus, I agree with @winnielinnie, 50c sounds very high.
Thanks. That sounds possible. I've already disabled all power saving for the disk in the GUI, but I'll also look at the BIOS options.
Unfortunately, my SSD went offline again after about 2 days of use.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Except for the 150 error log entries, nothing suspicious.

You can read those with a command like this:
nvme error-log /dev/nvme0n1 -e 150 (obviously adjusting for your actual NVME drive name)
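
To make counters like that stand out in a long smart-log, a small filter can help. The sketch below assumes the plain-text `key : value` layout shown earlier in the thread; the function name `flag_smart_counters` and the choice of which fields to flag are my own, not anything nvme-cli defines.

```shell
#!/bin/sh
# Read `nvme smart-log` text output on stdin and print any of a few
# selected counters whose value is not zero.
flag_smart_counters() {
    awk -F':' '
        {
            # Split "key : value" and trim surrounding whitespace.
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key ~ /^(critical_warning|media_errors|num_err_log_entries|unsafe_shutdowns)$/ && val + 0 != 0 {
            print key "=" val
        }
    '
}

# Usage (device name is an example):
#   nvme smart-log /dev/nvme0n1 | flag_smart_counters
```

A non-zero value is only a hint about where to look next, not proof of a fault.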
 
Joined
Feb 22, 2020
Messages
5
Wow, thanks.

I found this error:
0x2002(INVALID_OPCODE: The associated command opcode field is not valid)

Found the problem mentioned here, but no solutions: https://github.com/linux-nvme/nvme-cli/issues/627

sudo nvme id-ctrl /dev/nvme0n1 | grep oacs
gives me 0x17
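
For reference, `oacs` is a bit field that says which optional admin commands the controller supports. Here is a rough sketch for decoding it; the bit names follow my reading of the NVMe base specification (bits 0 to 9), and `decode_oacs` is just an illustrative helper name.

```shell
#!/bin/sh
# Print the optional admin commands whose bits are set in an OACS value.
# Bit order (0..9) follows the NVMe base specification as I understand it.
decode_oacs() {
    oacs=$(( $1 ))
    bit=0
    for name in \
        "Security Send/Receive" \
        "Format NVM" \
        "Firmware Commit/Download" \
        "Namespace Management" \
        "Device Self-test" \
        "Directives" \
        "NVMe-MI Send/Receive" \
        "Virtualization Management" \
        "Doorbell Buffer Config" \
        "Get LBA Status"
    do
        if [ $(( oacs & (1 << bit) )) -ne 0 ]; then
            echo "$name"
        fi
        bit=$(( bit + 1 ))
    done
}

# Usage:
#   decode_oacs 0x17
```

With 0x17 this lists Security Send/Receive, Format NVM, Firmware Commit/Download and Device Self-test, so Namespace Management is not among the supported commands on this drive.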
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
If it's a Samsung drive, it seems to be a known bug. Unless it's blocking you from doing something, there seems to be no need to do anything about it.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Out of curiosity, what is the NVMe drive used for? I only ask due to how much data it's written in such a short period of time. I'm curious if it's overheating as well. But I don't know much about NVMe's as my only one is in my Windoze computer, not on a server. Maybe when I get really wealthy I can afford a vdev of NVMe's.
 
Joined
Feb 22, 2020
Messages
5
After some more debugging I got it to work :)
It works fine after I changed those settings in BIOS from their defaults:
NAS SSD fix.jpg

The most likely culprit seems to be the power management setting.
I have an ASUS board and a 12th gen i3, so this information might be useful for others.
Thanks to all for the great advice :)
 