SOLVED: NVMe "Device /dev/nvd1p2 is causing slow I/O on pool boot-pool", controller resets due to timeouts

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
Hi there,
I'm managing a server at a remote location (a 3-hour drive from here) which pops up the following alerts every couple of weeks:

* Device /dev/nvd1p2 is causing slow I/O on pool boot-pool.
* Boot pool status is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

All other data disks and pools are SATA based and show no problems at all.
A server reboot does not solve the issue; I have to power cycle the machine. Then the mirror is up and running again for a couple of weeks until the problem occurs again.

Most of the time it's device nvd0, but it happens on nvd1 as well, so I'm not really sure it's a defective SSD. There are also no SMART anomalies.
Googling for this error I found some hints that FreeBSD might have trouble handling NVMe devices,
like this one: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713

Speed is not relevant for this boot configuration, so what options do I have (boot parameters, tunables) to get rid of the problem?
Any other ideas or recommendations?
Thanks for your support.

-----------
installed: TrueNAS 12.0-U8
HW: HP DL380 Gen9, single Xeon E5-2620 v3 @ 2.40 GHz, 32 GB RAM; NVMe SSDs in question: "ADATA XPG SX6000 PRO, 256 GB"

root@server2[/var/log]# nvmecontrol devlist
nvme0: ADATA SX6000PNP
nvme0ns1 (244198MB)
nvme1: ADATA SX6000PNP
nvme1ns1 (244198MB)


Some more stuff from the logs:
May 31 06:04:07 server2 nvme1: resetting controller
May 31 06:04:07 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:04:38 server2 nvme1: Resetting controller due to a timeout.
May 31 06:04:38 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:04:38 server2 nvme1: resetting controller
May 31 06:04:38 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1: READ sqid:4 cid:127 nsid:1 lba:500112424 len:256
May 31 06:04:38 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
May 31 06:04:38 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1: READ sqid:4 cid:124 nsid:1 lba:500112680 len:256
May 31 06:04:38 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:124 cdw0:0
May 31 06:05:08 server2 nvme1: Resetting controller due to a timeout.
May 31 06:05:08 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:08 server2 nvme1: resetting controller
May 31 06:05:08 server2 nvme1: aborting outstanding i/o
May 31 06:05:38 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:38 server2 nvme1: Resetting controller due to a timeout.
May 31 06:05:38 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:38 server2 nvme1: resetting controller
May 31 06:05:38 server2 nvme1: aborting outstanding i/o
May 31 06:06:08 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:08 server2 nvme1: Resetting controller due to a timeout.
May 31 06:06:08 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:08 server2 nvme1: resetting controller
May 31 06:06:09 server2 nvme1: aborting outstanding i/o
May 31 06:06:39 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:39 server2 nvme1: Resetting controller due to a timeout.
May 31 06:06:39 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:39 server2 nvme1: resetting controller
May 31 06:06:39 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:07:09 server2 nvme1: Resetting controller due to a timeout.
May 31 06:07:09 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:07:09 server2 nvme1: resetting controller
May 31 06:07:09 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1: READ sqid:4 cid:124 nsid:1 lba:500112936 len:256
May 31 06:07:09 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:124 cdw0:0
May 31 06:07:09 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1: READ sqid:4 cid:127 nsid:1 lba:500113192 len:256
May 31 06:07:09 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0

root@server2[/var/log]# smartctl -a /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: ADATA SX6000PNP
Serial Number: ***************
Firmware Version: VC0S032E
PCI Vendor/Subsystem ID: 0x10ec
IEEE OUI Identifier: 0x00e04c
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00e04c 14c3b36201
Local Time is: Tue May 31 10:20:57 2022 CEST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001c): DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 115 Celsius
Critical Comp. Temp. Threshold: 120 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.00W - - 0 0 0 0 0 0
1 + 4.00W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.0300W - - 3 3 3 3 5000 10000
4 - 0.0050W - - 4 4 4 4 60000 45000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 199,736 [102 GB]
Data Units Written: 148,396 [75.9 GB]
Host Read Commands: 1,872,516
Host Write Commands: 8,561,421
Controller Busy Time: 0
Power Cycles: 41
Power On Hours: 315
Unsafe Shutdowns: 12
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged


root@server2[/var/log]# nvmecontrol logpage -p 1 nvme1
Error Information Log
=====================
No error entries found
root@server2[/var/log]# nvmecontrol logpage -p 1 nvme0
Error Information Log
=====================
No error entries found
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
* Check if there is a firmware update for those SSDs (a quick way to read the current revision from the shell is shown below).
* Try to upgrade to TN 13 - you can always roll back as long as you do not upgrade any of your zpools.
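Something along these lines should print the currently installed revision (from memory, untested here; adjust the controller/device name as needed). The first command reads the controller's identify data, the second pulls the same field your smartctl output above already shows:

# nvmecontrol identify nvme0 | grep -i firmware
# smartctl -a /dev/nvme0 | grep -i 'Firmware Version'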
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
* Firmware is up to date.
* I'm planning the TN13 upgrade for the end of June (once it's ready for production with U1).
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
So, it took me some days to get the upgrade organized:
The system is now on TrueNAS 13.0-U1, but the problem still persists. Approx. 8 days after the upgrade and restart, the same thing happened again:
"Device /dev/nvd0p2 is causing slow I/O on pool boot-pool."

What can I try now?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Hey @Benni.blanko,

I completely forgot about the obvious shotgun: have you tried disabling TRIM?

Kind regards,
Patrick
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
I haven't changed anything from the default settings regarding TRIM. I can't even find where I would have to change the settings for the boot-pool.
Nothing under System/Boot, Storage/Disks, or the cron jobs.
Where do I see the actual TRIM setting for the boot-pool?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Something something sysctl ... sorry. You will have to search. I'd start with the forum. Although with OpenZFS now, it might also be a pool property.
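From memory, something like the following should turn up the relevant knobs; the exact sysctl names may have changed with OpenZFS in 13, so treat this only as a starting point:

# sysctl -a | grep -i trim
# zpool get autotrim boot-pool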
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
Good hint; after some google-fu I found this tunable:
vfs.zfs.trim.enabled=0

But shell shows:
# sysctl vfs.zfs.trim.enabled=0
sysctl: unknown oid 'vfs.zfs.trim.enabled'

According to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261212
this parameter is gone from FreeBSD 13 and up.

I also tried tunefs, but it failed:
# tunefs -t disable /dev/nvme0
tunefs: /dev/nvme0: Input/output error

But this command at least showed that autotrim is not activated:
# zpool get autotrim
NAME PROPERTY VALUE SOURCE
TIER1 autotrim off default
TIER2 autotrim off default
boot-pool autotrim off default

As long as there are no other mechanisms that initiate a TRIM, this seems to be disabled.
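For completeness: if I read the OpenZFS manpage correctly, zpool status can also report per-vdev TRIM state with the -t flag, so this should confirm nothing is trimming in the background:

# zpool status -t boot-pool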
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
tunefs is for UFS only, not ZFS. And the disk device is nvd0, not nvme0 ... but you found the property it seems. Hmmm ...
 

GodZilla

Cadet
Joined
Feb 15, 2022
Messages
4
Similar story here, but on SCALE. I tried to turn it off:
# systemctl disable fstrim.timer
No freezes so far; let's see what happens next.
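To make sure the timer is really off I'm checking it like this (plain systemd commands, nothing SCALE-specific):

# systemctl is-enabled fstrim.timer
# systemctl list-timers fstrim.timer

The first should report "disabled", the second should no longer list a scheduled run.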

Edit: it did not help...
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
I'm also still losing ... I had to reboot the system in the meantime to upgrade to 13.0-U1.1. But now, after 4.5 weeks, the problem occurred again.
I'm now trying the tunable hw.nvme.force_intx=1 ... let's see if it fixes it.
My next try will then be: hw.nvme.use_nvd=0
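(As far as I understand, both are loader tunables, so they have to be set as type "loader" under System -> Tunables and only take effect after a reboot; on disk they should end up in /boot/loader.conf (or loader.conf.local) roughly like this:)

hw.nvme.force_intx="1"   # fall back from MSI-X to legacy INTx interrupts
hw.nvme.use_nvd="0"      # attach NVMe disks via CAM as nda(4) instead of nvd(4)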
 

GodZilla

Cadet
Joined
Feb 15, 2022
Messages
4
Benni.blanko said:
I'm also still losing ... I had to reboot the system in the meantime to upgrade to 13.0-U1.1. But now, after 4.5 weeks, the problem occurred again.
I'm now trying the tunable hw.nvme.force_intx=1 ... let's see if it fixes it.
My next try will then be: hw.nvme.use_nvd=0
I have a guess which I am testing now. Looking at the system logs (journalctl), I see an "ntpd" error every 1-5 minutes, and after a few days a buffer overflow. Then I get the "causing slow I/O on pool boot-pool" error and the system becomes completely inaccessible. I changed NTPD_OPTS='-4 -g' in /etc/default/ntp (i.e. disabled IPv6 in ntpd). No more errors so far... let's wait :smile:
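For reference, on SCALE that file just carries the command-line options for the ntpd service, so the whole change is one line (Debian-style /etc/default/ntp; I don't know whether CORE has an equivalent):

NTPD_OPTS='-4 -g'   # -4 = use IPv4 only, -g = allow one large initial clock step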
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
The problem occurred again anyway.
The next tunable I'm trying is hw.nvme.use_nvd=0.
@GodZilla: Where did you find that NTPD_OPTS setting? There is no /etc/defaults/ntp file in my installation.
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
The problem occurred again after a few days. I'm out of tunables to test for now.
There seems to be a new firmware release for these NVMe SSDs (ADATA XPG SX6000 PRO),
but there is only a Windows updater available from ADATA. I'm thinking about possible ways to get this done (maybe remote-booting a WinPE and trying the updater).
 

Benni.blanko

Dabbler
Joined
Dec 18, 2021
Messages
31
38 days of uptime so far and the problem has not reappeared, so the firmware update seems to be the solution. None of the tunables were successful. I was lucky that, approx. 6-7 months after I set up the system, ADATA finally published the new firmware.

The remote update was kind of tricky: I had to remote-boot a WinPE-based Windows to run the firmware updater. It took approx. one hour for both boot devices.
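For the record, the new revision shows up in the same smartctl field as before, so a quick check after the update is something like:

# smartctl -a /dev/nvme0 | grep -i 'Firmware Version'
# smartctl -a /dev/nvme1 | grep -i 'Firmware Version'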
 