SOLVED NVME SSD Dataset periodically offline

Johnnnnnnnnnny · Apr 10, 2023

Hi,
please help me debug this issue:

My ssdPool (which consists of a single M.2 NVMe SSD) periodically goes offline. The time it runs without problems seems random: anywhere from 2 hours to about 2 days. After this happens, the device cannot be found anywhere, as if it were disconnected. Even after a reboot it does not appear; I must shut down the NAS completely for a couple of minutes, and then it shows up normally as if nothing happened.

My pool looks like this - it contains my apps dataset.

My first guess was that the SSD was overheating, but I ruled that out: adding a cooler and more airflow to the SSD didn't help at all, and the temperature seems fine, staying under 50C all the time.
The SSD doesn't support SMART so I don't know how to debug further.

My plan B is to move everything to the second pool, but I'm not sure how hard that would be, considering I need to move the apps folder and all the permissions as well.

Any advice appreciated. :)

EDIT: solved in post #11

winnielinnie · Apr 10, 2023

You can use nvmecontrol to check its info and logs.

Examples for error, info, and tests log pages:

Code:

nvmecontrol logpage -p 1 nvme0
nvmecontrol logpage -p 2 nvme0
nvmecontrol logpage -p 6 nvme0

Johnnnnnnnnnny said:
and the temperatures seem fine under 50c all the time.

It's idling at 50C?

Arwen · Apr 10, 2023

You might also investigate power saving features. It is possible that the NVMe drive is going to power save automatically after a specific period of time since last access. Access would be random enough that you might not be able to make a correlation between last access and drop.

Further, PCIe power save might also be a factor.

Plus, I agree with @winnielinnie, 50c sounds very high.
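
Editor's note: on a Linux-based system such as TrueNAS SCALE, the two power-saving mechanisms mentioned above (NVMe autonomous power state transitions and PCIe ASPM) can be inspected from a shell. The following is a minimal sketch, not a confirmed fix for this thread. The sysfs paths are the usual locations on mainline Linux kernels; whether they are present on a particular SCALE build is an assumption to verify, and `show_param` is a hypothetical helper name.

```shell
# Print a kernel parameter if its sysfs file is readable.
# $1 = label to print, $2 = file to read
show_param() {
    if [ -r "$2" ]; then
        echo "$1: $(cat "$2")"
    else
        echo "$1: not available"
    fi
}

# PCIe Active State Power Management policy; the active one is shown in [brackets].
show_param "PCIe ASPM policy" /sys/module/pcie_aspm/parameters/policy

# Maximum exit latency (microseconds) the NVMe driver accepts for autonomous
# power-state transitions; 0 means the driver does not enable them.
show_param "NVMe APST max latency (us)" /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

If these report power saving as active, the kernel boot parameters `pcie_aspm=off` and `nvme_core.default_ps_max_latency_us=0` are the commonly used switches for turning the two features off for testing. BIOS-level settings are separate and are not visible through these files.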

tsm37 · Apr 10, 2023

Agreed. 50C at idle is high for an M.2 NVMe drive. Is it a PCIe Gen 4 drive? If so, do you have a heatsink and thermal pad on it? With a heatsink and thermal pad, a Gen 4 drive should idle at around 30-35C, and in any case below 40C, unless your M.2 slot is close to, say, an HBA or other cards that have heatsinks and generate heat.

sretalla · Apr 11, 2023

winnielinnie said:
You can use nvmecontrol to check its info and logs.

Not on SCALE...

you need to use nvme:

nvme smart-log /dev/nvme0n1 for example.

You can run nvme list first to find the actual list of devices you can work with.

Johnnnnnnnnnny · Apr 13, 2023

sretalla said:
Not on SCALE...

you need to use nvme:

nvme smart-log /dev/nvme0n1 for example.

You can run nvme list first to find the actual list of devices you can work with.

Thanks for the correct command. My log is as follows:


Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 29 C
available_spare                         : 100%
available_spare_threshold               : 1%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 8235004
data_units_written                      : 6185963
host_read_commands                      : 34676013
host_write_commands                     : 111061253
controller_busy_time                    : 108
power_cycles                            : 82
power_on_hours                          : 489
unsafe_shutdowns                        : 54
media_errors                            : 0
num_err_log_entries                     : 150
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 29 C
Temperature Sensor 2           : 0 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

Nothing there looks suspicious to me.
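
Editor's note: the counters in a log like this can be screened mechanically. Below is a minimal sketch, assuming the `name : value` text layout shown in the log above. `check_smart_log` is a hypothetical helper name, not part of nvme-cli, and the choice of which counters to flag is an illustrative assumption.

```shell
# Read `nvme smart-log` text on stdin and print the counters worth a closer look.
check_smart_log() {
    awk -F':' '
        {
            key = $1; val = $2
            # Trim surrounding whitespace from both the name and the value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            v[key] = val + 0    # numeric part only, so "29 C" becomes 29
        }
        END {
            if (v["critical_warning"] != 0)
                print "critical_warning is set: " v["critical_warning"]
            if (v["media_errors"] > 0)
                print "media_errors: " v["media_errors"]
            if (v["num_err_log_entries"] > 0)
                print "num_err_log_entries: " v["num_err_log_entries"] " (read them with nvme error-log)"
            if (v["power_cycles"] > 0 && v["unsafe_shutdowns"] > 0)
                printf "unsafe_shutdowns: %d of %d power cycles\n", v["unsafe_shutdowns"], v["power_cycles"]
        }'
}

# Usage (adjust the device name to your system):
#   nvme smart-log /dev/nvme0n1 | check_smart_log
```

Fed the log above, this would report the 150 error log entries and 54 unsafe shutdowns out of 82 power cycles. A high share of unsafe shutdowns would be consistent with a drive that keeps dropping off the bus, though it can have other causes as well.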

Arwen said:
You might also investigate power saving features. It is possible that the NVMe drive is going to power save automatically after a specific period of time since last access. Access would be random enough that you might not be able to make a correlation between last access and drop.

Further, PCIe power save might also be a factor.

Plus, I agree with @winnielinnie, 50c sounds very high.

Thanks. That sounds possible. I've already disabled all power saving for the disk in the GUI, but I'll also look at the BIOS options.
Unfortunately, my SSD went offline again after about 2 days of use.

sretalla · Apr 14, 2023

Except for the 150 error log entries, nothing suspicious.

You can read those with a command like this:
nvme error-log /dev/nvme0n1 -e 150 (obviously adjusting for your actual NVME drive name)

Johnnnnnnnnnny · Apr 14, 2023

sretalla said:
Except for the 150 error log entries, nothing suspicious.

You can read those with a command like this:
nvme error-log /dev/nvme0n1 -e 150 (obviously adjusting for your actual NVME drive name)

Wow, thanks.

I found this error:
0x2002(INVALID_OPCODE: The associated command opcode field is not valid)

Found the problem mentioned here, but no solutions: https://github.com/linux-nvme/nvme-cli/issues/627

sudo nvme id-ctrl /dev/nvme0n1 | grep oacs
gives me 0x17
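
Editor's note: `oacs` is the Optional Admin Command Support bitmask from the controller's identify data, so 0x17 can be decoded bit by bit. The sketch below uses the bit assignments from the NVMe base specification (revision 1.4) as I understand them; `decode_oacs` is a hypothetical helper name, not an nvme-cli command.

```shell
# Decode an OACS value into the optional admin commands it advertises.
decode_oacs() {
    value=$(( $1 ))   # accepts hex input such as 0x17
    bit=0
    for name in \
        "Security Send/Receive" \
        "Format NVM" \
        "Firmware Commit/Image Download" \
        "Namespace Management/Attachment" \
        "Device Self-test" \
        "Directives" \
        "NVMe-MI Send/Receive" \
        "Virtualization Management" \
        "Doorbell Buffer Config" \
        "Get LBA Status"
    do
        # Test one bit at a time, starting from the least significant.
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $name: supported"
        else
            echo "bit $bit: $name: not supported"
        fi
        bit=$(( bit + 1 ))
    done
}

decode_oacs 0x17
```

For 0x17 (binary 10111), bits 0, 1, 2 and 4 are set: the drive advertises Security Send/Receive, Format NVM, firmware update commands and Device Self-test, but not Namespace Management. An admin command sent for a feature the controller does not advertise is one plausible source of INVALID_OPCODE entries, though that link is an assumption here and not something the log itself confirms.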

sretalla · Apr 17, 2023

If it's a Samsung drive, it seems to be a known bug. Unless it's blocking you from doing something, there seems to be no need to do anything about it.

joeschmuck · Apr 17, 2023

Out of curiosity, what is the NVMe drive used for? I only ask due to how much data it's written in such a short period of time. I'm curious if it's overheating as well. But I don't know much about NVMe's as my only one is in my Windoze computer, not on a server. Maybe when I get really wealthy I can afford a vdev of NVMe's.

Johnnnnnnnnnny · May 16, 2023

After some more debugging I got it to work :)
It works fine after I changed those settings in BIOS from their defaults:

The most likely culprit seems to be the power management setting.
I have an ASUS board and a 12th-gen i3, so this information might be useful for others.
Thanks to all for the great advice :)
