NVMe controller crash (Samsung 1725b NVMe U.2)

Zofoor

Patron
Joined
Aug 16, 2016
Messages
219
Hi all!
I am going crazy with a new clean TrueNAS deployment.

CPU AMD 5950x
Mainboard x570-f
RAM: 3200 MHz ECC (the mainboard supports ECC RAM)
1 x 3.2 TB Samsung 1725b NVMe U.2
1 x SSD used as the boot disk

I have a single volume with the Samsung 1725b.
The system can't stay online for more than 24 hours. The log shows:

Mar 21 00:04:31 truenas kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Mar 21 00:04:31 truenas kernel: block nvme0n1: no usable path - requeuing I/O
Mar 21 00:04:31 truenas kernel: block nvme0n1: no usable path - requeuing I/O
Mar 21 00:04:33 truenas kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Mar 21 00:04:34 truenas kernel: nvme nvme0: 32/0/0 default/read/poll queues
Mar 21 00:04:42 truenas kernel: block nvme0n1: no usable path - requeuing I/O
Mar 21 00:05:04 truenas kernel: block nvme0n1: no usable path - requeuing I/O
Mar 21 00:05:06 truenas kernel: nvme nvme0: I/O 917 QID 11 timeout, disable controller
Mar 21 00:05:06 truenas kernel: nvme nvme0: failed to mark controller live state
Mar 21 00:05:06 truenas kernel: nvme nvme0: Removing after probe failure status: -19
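
For anyone trying to read the register values in the log above, here is a minimal sketch that splits them into named bits. The bit positions are taken from the NVMe base specification (CSTS register) and the standard PCI Status register as I understand them; the function names are hypothetical and this is not part of any TrueNAS tooling. Under those assumptions, CSTS=0x3 means the controller reports both "ready" and "fatal status", while PCI_STATUS=0x10 only has the ordinary capabilities-list bit set and no error bits.

```python
# Hypothetical helpers; bit layouts assumed from the NVMe base spec (CSTS)
# and the standard PCI Status register.

def decode_csts(value: int) -> dict:
    """Split an NVMe Controller Status (CSTS) value into named fields."""
    if value == 0xFFFFFFFF:
        # All ones usually means the register read itself failed
        # (the device did not answer), not that every bit is set.
        return {"unreadable": True}
    shutdown_names = {
        0: "normal operation",
        1: "shutdown in progress",
        2: "shutdown complete",
        3: "reserved",
    }
    return {
        "unreadable": False,
        "ready": bool(value & 0x1),                      # bit 0, RDY
        "fatal": bool(value & 0x2),                      # bit 1, CFS
        "shutdown": shutdown_names[(value >> 2) & 0x3],  # bits 3:2, SHST
        "processing_paused": bool(value & 0x20),         # bit 5, PP
    }


def decode_pci_status(value: int) -> dict:
    """Split a 16-bit PCI Status value into its error-related bits."""
    if value == 0xFFFF:
        return {"unreadable": True}
    return {
        "unreadable": False,
        "capabilities_list": bool(value & 0x0010),
        "master_data_parity_error": bool(value & 0x0100),
        "signaled_target_abort": bool(value & 0x0800),
        "received_target_abort": bool(value & 0x1000),
        "received_master_abort": bool(value & 0x2000),
        "signaled_system_error": bool(value & 0x4000),
        "detected_parity_error": bool(value & 0x8000),
    }


if __name__ == "__main__":
    print(decode_csts(0x3))
    print(decode_pci_status(0x10))
```
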

I suspect some issues with some Samsung controllers:

The same disk works without issues under Clear Linux.
The system is almost empty and runs just a few Docker containers to test stability (Shinobi CCTV, Uptime Kuma, Traefik).
 

c77dk

Patron
Joined
Nov 27, 2019
Messages
468
Which version of TrueNAS? The log looks like SCALE?

There are some settings in the BIOS which historically had to be changed on AMD boards. I don't know if that is still required, or whether it could trigger what you are seeing.
 

Zofoor

Yes, TrueNAS SCALE 22.02.0.
Regarding the BIOS, the PCIe slots are configured to run as version 3.0.
 

Zofoor

I tried upgrading the BIOS (from 4021 to 4204) and forcing ACS to Enable in the BIOS, following the advice in this topic: https://forums.unraid.net/topic/93793-nvme-cache-drive-routinely-goes-missing/ (the available options are Enable, Disable, and Auto).

It seems strange to me that the issue could be fixed in the BIOS, as I didn't have this issue with other OSes. But it is definitely worth a try.
I didn't find any other BIOS settings that could help, but as there are so many options I may have missed something.
 

c77dk

If I remember correctly, there should be some C-state setting, but I don't own a system with an AMD CPU at the moment. Try searching here in the forums; I know it's been mentioned in other threads.

It would surprise me if it fixes the problem, but until it has been tried we can't rule it out completely.
 

Zofoor

The system crashed again a few minutes ago.
I found the topics you mentioned and applied the suggested changes (C-State set to Disabled, Power Supply Idle Current set to Typical Current Idle).
Let's see the impact of those additional changes.
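
One way to check from the running system whether a BIOS C-state change actually took effect is to look at the idle states the kernel exposes. The following is a minimal, read-only sketch that assumes the standard Linux cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle; the function name is hypothetical, and which states appear depends on the kernel, the idle driver, and the BIOS. It only reports what is visible and does not change anything.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state_name, disabled)] for one CPU, or [] if cpuidle is absent."""
    base = Path(cpu_dir)
    # Sort numerically so state10 comes after state2.
    state_dirs = sorted(
        base.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        # "1" in the disable file means the state was turned off at runtime.
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    for name, disabled in list_cpuidle_states():
        print(name, "disabled" if disabled else "enabled")
```

If only a polling state and a shallow C1-like state are listed after the BIOS change, the deeper states are most likely no longer offered to the OS; an empty list suggests cpuidle is not exposed at all on that system.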
 

Zofoor

Another crash, so it didn't help.
I tried all possible values for C-State, in addition to all the other options I found in this forum plus a few other things found elsewhere.

But with the latest changes (C-State disabled + Power Supply Idle Current set to Typical Current Idle) the error changed, and the system crashed within just a few minutes, so things seem to have gotten worse:

[Mar21 20:57] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ +9.196012] nvme 0000:08:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ +1.008200] sched: RT throttling activated
[ +0.604047] nvme nvme0: Removing after probe failure status: -19
[ +0.149148] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=1 offset=368308146176 size=36864 flags=180880
[ +0.003120] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008200704 size=131072 flags=180880
[ +0.000062] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ +0.003284] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008462848 size=131072 flags=180880
[ +0.000002] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=1 offset=551747952640 size=32768 flags=180880
[ +0.000006] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195007807488 size=131072 flags=180880
[ +0.000002] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195007938560 size=131072 flags=180880
[ +0.000002] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008724992 size=131072 flags=180880
[ +0.000001] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3179394756608 size=12288 flags=180880
[ +0.000003] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3179394686976 size=69632 flags=180880
[ +0.000004] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3179394105344 size=12288 flags=180880
[ +0.003533] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=1 offset=3198483701760 size=8192 flags=b08c1
[ +0.000003] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=1 offset=3198483963904 size=8192 flags=b08c1
[ +0.003746] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008069632 size=131072 flags=180880
[ +0.018552] WARNING: Pool 'nvme_volume' has encountered an uncorrectable I/O failure and has been suspended.

[ +0.003245] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008331776 size=131072 flags=180880
[ +0.040862] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3179394580480 size=36864 flags=180880
[ +0.006675] blk_update_request: I/O error, dev nvme0c0n1, sector 6244445592 op 0x1:(WRITE) flags 0x2000000 phys_seg 32 prio class 0
[ +0.006859] zio pool=nvme_volume vdev=/dev/disk/by-partuuid/9746adf5-dc66-4c17-888a-4c7bf8d64c03 error=5 type=2 offset=3195008593920 size=131072 flags=180880
[Mar21 20:59] INFO: task asyncio_loop:18030 blocked for more than 120 seconds.
[ +0.003564] Tainted: P OE 5.10.93+truenas #1
[ +0.003619] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.003666] task:asyncio_loop state:D stack: 0 pid:18030 ppid: 1 flags:0x00000000
[ +0.003725] Call Trace:
[ +0.003698] __schedule+0x282/0x870
[ +0.003652] schedule+0x46/0xb0
[ +0.003614] io_schedule+0x42/0x70
[ +0.003580] cv_wait_common+0xac/0x130 [spl]
[ +0.003571] ? add_wait_queue_exclusive+0x70/0x70
[ +0.003570] txg_wait_synced_impl+0xc9/0x110 [zfs]
[ +0.003486] txg_wait_synced+0xc/0x40 [zfs]
[ +0.003413] dmu_tx_wait+0x380/0x390 [zfs]
[ +0.003351] dmu_tx_assign+0x16d/0x480 [zfs]
[ +0.003301] zfs_write+0x408/0xcd0 [zfs]
[ +0.003198] ? smp_call_function_many_cond+0x264/0x2d0
[ +0.003197] zpl_iter_write+0x103/0x170 [zfs]
[ +0.003084] new_sync_write+0x11c/0x1b0
[ +0.003035] vfs_write+0x1c2/0x260
[ +0.002957] ksys_write+0x5f/0xe0
[ +0.002889] do_syscall_64+0x33/0x80
[ +0.002819] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.002792] RIP: 0033:0x7f82f03c7fef
[ +0.002717] RSP: 002b:00007f827e679800 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ +0.002700] RAX: ffffffffffffffda RBX: 00000000029f16f0 RCX: 00007f82f03c7fef
[ +0.002673] RDX: 000000000000007d RSI: 00007f82d412a7a0 RDI: 0000000000000035
[ +0.002613] RBP: 00007f827e67e680 R08: 0000000000000000 R09: 0000000000000000
[ +0.002550] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[ +0.002494] R13: 000000000000007d R14: 0000000000000035 R15: 00007f82d412a7a0
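
The long run of zio lines in the log above is easier to reason about when summarized. Below is a minimal sketch that counts the failed ZFS I/Os per pool, error code, and I/O type, and totals the bytes involved. It assumes the line format shown in this log and the OpenZFS zio type numbering (1 = read, 2 = write); the function name and regular expression are my own, not part of any ZFS tool. On Linux, error=5 corresponds to EIO, which fits a device that has disappeared from the bus.

```python
import re
from collections import Counter

# zio type numbers assumed from OpenZFS's zio_type enum.
ZIO_TYPES = {1: "read", 2: "write"}

ZIO_RE = re.compile(
    r"zio pool=(\S+) vdev=(\S+) error=(\d+) type=(\d+) offset=(\d+) size=(\d+)"
)


def summarize_zio_errors(lines):
    """Count failed zios and sum their sizes per (pool, errno, io_type)."""
    counts = Counter()
    total_bytes = Counter()
    for line in lines:
        match = ZIO_RE.search(line)
        if match is None:
            continue  # not a zio error line
        pool, _vdev, err, ztype, _offset, size = match.groups()
        io_type = ZIO_TYPES.get(int(ztype), f"type {ztype}")
        key = (pool, int(err), io_type)
        counts[key] += 1
        total_bytes[key] += int(size)
    return counts, total_bytes


if __name__ == "__main__":
    import sys

    # Usage sketch: dmesg | python3 this_script.py
    counts, total_bytes = summarize_zio_errors(sys.stdin)
    for key, n in sorted(counts.items()):
        print(key, n, "errors,", total_bytes[key], "bytes")
```
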