NVMe Crashing System

nexusone

Dabbler
Joined
Nov 3, 2014
Messages
15
Supermicro X12STL-IF-O
WD_BLACK SN850X 4000GB
TrueNAS-SCALE-23.10.0.1


I am experiencing intermittent system crashes. The console produces an error like this:

[58013.996053] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[58013.996053] nvme nvme0: Does your device have a faulty power saving mode enabled?
[58013.996243] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug



I tried adding the parameter via UI → Advanced → Sysctl, but received an error:
"Sysctl 'nvme_core.default_ps_max_latency_us' does not exist in kernel."

Am I doing this in the wrong place?




EDIT:

This appears to have added the parameters successfully:

root@truenas[~]# midclt call system.advanced.update '{"kernel_extra_options": "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"}'

These seem to survive reboots. Time will tell if they survive upgrades.

admin@truenas[~]$ cat /proc/cmdline
BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
admin@truenas[~]$
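
If you want to script that check (for example, to catch an upgrade silently dropping the options), here is a minimal sketch; the helper name is my own and not part of TrueNAS:

```python
# Minimal sketch (my own helper, not part of TrueNAS): confirm the
# NVMe power-saving workaround options made it onto the kernel
# command line after a reboot or upgrade.

def has_boot_options(cmdline: str,
                     required=("nvme_core.default_ps_max_latency_us=0",
                               "pcie_aspm=off")) -> bool:
    """True only if every required option appears as a whole token."""
    tokens = cmdline.split()
    return all(opt in tokens for opt in required)

# Usage on the TrueNAS box:
#   with open("/proc/cmdline") as f:
#       print(has_boot_options(f.read()))
```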
 

fredbean

Cadet
Joined
Jan 3, 2024
Messages
3
I'm running into the same issue as above. Could you share whether the change above turned out to be the fix for this?

Jan 2 15:58:56 TrueNAS kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 2 15:58:56 TrueNAS kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jan 2 15:58:56 TrueNAS kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 2 15:58:56 TrueNAS kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 2 15:58:56 TrueNAS kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Jan 2 15:58:56 TrueNAS kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 2 15:58:56 TrueNAS kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 2 15:58:56 TrueNAS kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jan 2 15:58:56 TrueNAS kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 2 15:58:56 TrueNAS kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 2 15:58:56 TrueNAS kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Jan 2 15:58:57 TrueNAS kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 2 15:59:01 TrueNAS kernel: nvme nvme0: Removing after probe failure status: -19
Jan 2 15:59:01 TrueNAS kernel: nvme nvme1: Removing after probe failure status: -19
Jan 2 15:59:01 TrueNAS kernel: nvme0n1: detected capacity change from 8001573552 to 0
Jan 2 15:59:01 TrueNAS kernel: nvme1n1: detected capacity change from 8001573552 to 0
 

nexusone

Dabbler
Joined
Nov 3, 2014
Messages
15
I have not experienced any crashes since making this change. So far so good.
 

fredbean

Cadet
Joined
Jan 3, 2024
Messages
3
Apparently it did not work for me, even after several reboots. However, I decided to boot from an older version of TrueNAS-SCALE-23.10.0.1, and once booted I was able to detect the drives and successfully create a pool and datasets.

Since this is still a fresh setup, I wanted to test my suspicion about the data protection tasks to make sure I don't run into the same issues I hit last time.

So I ran a data protection scrub task, and sure enough I received errors with the trace logs below, which looks like a bug. I'm now skeptical about reliability and worried about losing all of my data by running TrueNAS SCALE at this time.

Jan 4 01:21:33 truenas kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 4 01:21:33 truenas kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jan 4 01:21:33 truenas kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 4 01:21:33 truenas kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 4 01:21:33 truenas kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Jan 4 01:21:33 truenas kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 4 01:21:33 truenas kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 4 01:21:33 truenas kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jan 4 01:21:33 truenas kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 4 01:21:34 truenas kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jan 4 01:21:34 truenas kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Jan 4 01:21:34 truenas kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jan 4 01:21:38 truenas kernel: nvme nvme1: Removing after probe failure status: -19
Jan 4 01:21:38 truenas kernel: nvme nvme0: Removing after probe failure status: -19
Jan 4 01:21:39 truenas kernel: nvme1n1: detected capacity change from 8001573552 to 0
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/774b9f80-bfcf-44ce-b338-374cd6dafbd6 error=5 type=1 offset=25774534656 size=20480 flags=1074267312
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/774b9f80-bfcf-44ce-b338-374cd6dafbd6 error=5 type=1 offset=25774608384 size=106496 flags=1074267312
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/774b9f80-bfcf-44ce-b338-374cd6dafbd6 error=5 type=1 offset=25774333952 size=81920 flags=1074267312
Jan 4 01:21:39 truenas kernel: nvme0n1: detected capacity change from 8001573552 to 0
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/bbe7e73a-c110-4941-87e8-c04cc9254d30 error=5 type=1 offset=25774272512 size=4096 flags=1573040
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/bbe7e73a-c110-4941-87e8-c04cc9254d30 error=5 type=1 offset=25774129152 size=118784 flags=1074267312
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/bbe7e73a-c110-4941-87e8-c04cc9254d30 error=5 type=1 offset=25774333952 size=81920 flags=1074267312
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/bbe7e73a-c110-4941-87e8-c04cc9254d30 error=5 type=1 offset=25773998080 size=126976 flags=1074267312
Jan 4 01:21:39 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/774b9f80-bfcf-44ce-b338-374cd6dafbd6 error=5 type=1 offset=25774465024 size=8192 flags=1074267312
Jan 4 01:21:40 truenas kernel: nvme nvme2: Removing after probe failure status: -19
Jan 4 01:21:40 truenas kernel: nvme nvme3: Removing after probe failure status: -19
Jan 4 01:21:40 truenas kernel: nvme2n1: detected capacity change from 8001573552 to 0
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/2c66db5f-c066-4442-870b-bbed32acfc79 error=5 type=1 offset=25773998080 size=131072 flags=1074267312
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/2c66db5f-c066-4442-870b-bbed32acfc79 error=5 type=1 offset=21480095744 size=90112 flags=1074267312
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/2c66db5f-c066-4442-870b-bbed32acfc79 error=5 type=1 offset=25774272512 size=4096 flags=1573040
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/2c66db5f-c066-4442-870b-bbed32acfc79 error=5 type=1 offset=25774129152 size=118784 flags=1074267312
Jan 4 01:21:40 truenas kernel: nvme3n1: detected capacity change from 8001573552 to 0
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/65bc81f6-3416-45ac-bb88-6de08341c002 error=5 type=1 offset=25774272512 size=4096 flags=1573040
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/65bc81f6-3416-45ac-bb88-6de08341c002 error=5 type=1 offset=25774329856 size=118784 flags=1074267312
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/65bc81f6-3416-45ac-bb88-6de08341c002 error=5 type=1 offset=25774129152 size=118784 flags=1074267312
Jan 4 01:21:40 truenas kernel: zio pool=Flash vdev=/dev/disk/by-partuuid/65bc81f6-3416-45ac-bb88-6de08341c002 error=5 type=1 offset=25773998080 size=131072 flags=1074267312
Jan 4 01:21:40 truenas kernel: WARNING: Pool 'Flash' has encountered an uncorrectable I/O failure and has been suspended.

Jan 4 01:25:12 truenas kernel: task:agents state:D stack:0 pid:2100 ppid:1 flags:0x00004002
Jan 4 01:25:12 truenas kernel: Call Trace:
Jan 4 01:25:12 truenas kernel: <TASK>
Jan 4 01:25:12 truenas kernel: __schedule+0x2ed/0x860
Jan 4 01:25:12 truenas kernel: schedule+0x5a/0xb0
Jan 4 01:25:12 truenas kernel: io_schedule+0x42/0x70
Jan 4 01:25:12 truenas kernel: cv_wait_common+0xaa/0x130 [spl]
Jan 4 01:25:12 truenas kernel: ? cpuusage_read+0x10/0x10
Jan 4 01:25:12 truenas kernel: txg_wait_synced_impl+0xc0/0x110 [zfs]
Jan 4 01:25:12 truenas kernel: txg_wait_synced+0xc/0x40 [zfs]
Jan 4 01:25:12 truenas kernel: spa_vdev_state_exit+0x91/0x150 [zfs]
Jan 4 01:25:12 truenas kernel: zfs_ioc_vdev_set_state+0xe6/0x1c0 [zfs]
Jan 4 01:25:12 truenas kernel: zfsdev_ioctl_common+0x67c/0x770 [zfs]
Jan 4 01:25:12 truenas kernel: ? __kmalloc_node+0xbf/0x150
Jan 4 01:25:12 truenas kernel: zfsdev_ioctl+0x4f/0xd0 [zfs]
Jan 4 01:25:12 truenas kernel: __x64_sys_ioctl+0x8d/0xd0
Jan 4 01:25:12 truenas kernel: do_syscall_64+0x58/0xc0
Jan 4 01:25:12 truenas kernel: ? exc_page_fault+0x70/0x170
Jan 4 01:25:12 truenas kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
Jan 4 01:25:12 truenas kernel: RIP: 0033:0x7fc5638f8afb
Jan 4 01:25:12 truenas kernel: RSP: 002b:00007fc56284d240 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 4 01:25:12 truenas kernel: RAX: ffffffffffffffda RBX: 00007fc55c007910 RCX: 00007fc5638f8afb
Jan 4 01:25:12 truenas kernel: RDX: 00007fc56284d2b0 RSI: 0000000000005a0d RDI: 000000000000000a
Jan 4 01:25:12 truenas kernel: RBP: 00007fc562850ca0 R08: 0000000000000001 R09: 0000000000000000
Jan 4 01:25:12 truenas kernel: R10: 13e10cb64d0425b0 R11: 0000000000000246 R12: 00007fc562850860
Jan 4 01:25:12 truenas kernel: R13: 000055cb4dc80590 R14: 00007fc55c009f50 R15: 00007fc56284d2b0
Jan 4 01:25:12 truenas kernel: </TASK>
Jan 4 01:25:12 truenas kernel: task:txg_sync state:D stack:0 pid:9165 ppid:2 flags:0x00004000
Jan 4 01:25:12 truenas kernel: Call Trace:
Jan 4 01:25:12 truenas kernel: <TASK>
Jan 4 01:25:12 truenas kernel: __schedule+0x2ed/0x860
Jan 4 01:25:12 truenas kernel: schedule+0x5a/0xb0
Jan 4 01:25:12 truenas kernel: schedule_timeout+0x94/0x150
Jan 4 01:25:12 truenas kernel: ? __bpf_trace_tick_stop+0x10/0x10
Jan 4 01:25:12 truenas kernel: io_schedule_timeout+0x4c/0x80
Jan 4 01:25:12 truenas kernel: __cv_timedwait_common+0x12a/0x160 [spl]
Jan 4 01:25:12 truenas kernel: ? cpuusage_read+0x10/0x10
Jan 4 01:25:12 truenas kernel: __cv_timedwait_io+0x15/0x20 [spl]
Jan 4 01:25:12 truenas kernel: zio_wait+0x10b/0x220 [zfs]
Jan 4 01:25:12 truenas kernel: dsl_pool_sync_mos+0x37/0xa0 [zfs]
Jan 4 01:25:12 truenas kernel: dsl_pool_sync+0x3b7/0x410 [zfs]
Jan 4 01:25:12 truenas kernel: spa_sync_iterate_to_convergence+0xdc/0x200 [zfs]
Jan 4 01:25:12 truenas kernel: spa_sync+0x306/0x5d0 [zfs]
Jan 4 01:25:12 truenas kernel: txg_sync_thread+0x1e4/0x250 [zfs]
Jan 4 01:25:12 truenas kernel: ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Jan 4 01:25:12 truenas kernel: ? sigorsets+0x10/0x10 [spl]
Jan 4 01:25:12 truenas kernel: thread_generic_wrapper+0x57/0x70 [spl]
Jan 4 01:25:12 truenas kernel: kthread+0xe6/0x110
Jan 4 01:25:12 truenas kernel: ? kthread_complete_and_exit+0x20/0x20
Jan 4 01:25:12 truenas kernel: ret_from_fork+0x1f/0x30
 