Benni.blanko
Dabbler
- Joined: Dec 18, 2021
- Messages: 31
Hi there,
I'm managing a server at a remote location (a three-hour drive from here) which pops up every couple of weeks with this:
* Device /dev/nvd1p2 is causing slow I/O on pool boot-pool.
* Boot pool status is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
All other data disks and pools are SATA-based and show no problems at all.
A server reboot does not solve the issue; I have to power cycle the machine, after which the mirror is up and running for a couple of weeks until the problem occurs again.
Most of the time it's device nvd0, but it happens on nvd1 as well, so I'm not really sure it's a defective SSD. There are also no SMART anomalies.
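For completeness, these are the standard ZFS commands I use to check the pool state and attempt a remote recovery before driving out for a power cycle (the pool and device names are from my setup):

```sh
# Show which vdev is faulted and the read/write/checksum error counters
zpool status -v boot-pool

# Clear the error counters and let ZFS retry the device
zpool clear boot-pool

# Explicitly bring the partition back online in the mirror
zpool online boot-pool /dev/nvd1p2
```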
Googling for this error, I found some hits suggesting that FreeBSD might have trouble handling certain NVMe devices.
Like this one: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
Speed is not relevant for this boot configuration, so what boot parameters could I use to get rid of the problem?
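To make the question concrete: the workarounds I've seen mentioned for this class of NVMe timeout are loader tunables along the lines of the ones below. I have not verified which of these (if any) apply to TrueNAS 12 / FreeBSD 12.2, so please treat them as candidates, not a recommendation:

```sh
# /boot/loader.conf -- candidate workarounds, unverified on this box

# Attach NVMe namespaces via the CAM-based nda(4) driver instead of nvd(4)
hw.nvme.use_nvd="0"

# Reduce queue pressure: one shared I/O queue pair instead of per-CPU queues
hw.nvme.per_cpu_io_queues="0"

# Fall back to legacy INTx interrupts instead of MSI/MSI-X
hw.nvme.force_intx="1"
```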
Any other ideas or recommendations?
Thanks for your support.
-----------
Installed: TrueNAS 12.0-U8
HW: HP DL380 Gen9, single Xeon E5-2620 v3 @ 2.40 GHz, 32 GB RAM; NVMe SSDs in question: "ADATA XPG SX6000 Pro, 256 GB"
root@server2[/var/log]# nvmecontrol devlist
nvme0: ADATA SX6000PNP
nvme0ns1 (244198MB)
nvme1: ADATA SX6000PNP
nvme1ns1 (244198MB)
Some more stuff from the logs:
May 31 06:04:07 server2 nvme1: resetting controller
May 31 06:04:07 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:04:38 server2 nvme1: Resetting controller due to a timeout.
May 31 06:04:38 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:04:38 server2 nvme1: resetting controller
May 31 06:04:38 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1: READ sqid:4 cid:127 nsid:1 lba:500112424 len:256
May 31 06:04:38 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
May 31 06:04:38 server2 nvme1: aborting outstanding i/o
May 31 06:04:38 server2 nvme1: READ sqid:4 cid:124 nsid:1 lba:500112680 len:256
May 31 06:04:38 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:124 cdw0:0
May 31 06:05:08 server2 nvme1: Resetting controller due to a timeout.
May 31 06:05:08 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:08 server2 nvme1: resetting controller
May 31 06:05:08 server2 nvme1: aborting outstanding i/o
May 31 06:05:38 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:38 server2 nvme1: Resetting controller due to a timeout.
May 31 06:05:38 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:05:38 server2 nvme1: resetting controller
May 31 06:05:38 server2 nvme1: aborting outstanding i/o
May 31 06:06:08 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:08 server2 nvme1: Resetting controller due to a timeout.
May 31 06:06:08 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:08 server2 nvme1: resetting controller
May 31 06:06:09 server2 nvme1: aborting outstanding i/o
May 31 06:06:39 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:39 server2 nvme1: Resetting controller due to a timeout.
May 31 06:06:39 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:06:39 server2 nvme1: resetting controller
May 31 06:06:39 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1[1387]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:07:09 server2 nvme1: Resetting controller due to a timeout.
May 31 06:07:09 server2 nvme1[1387]: Last message 'Resetting controller' repeated 1 times, suppressed by syslog-ng on server2.local
May 31 06:07:09 server2 nvme1: resetting controller
May 31 06:07:09 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1: READ sqid:4 cid:124 nsid:1 lba:500112936 len:256
May 31 06:07:09 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:124 cdw0:0
May 31 06:07:09 server2 nvme1: aborting outstanding i/o
May 31 06:07:09 server2 nvme1: READ sqid:4 cid:127 nsid:1 lba:500113192 len:256
May 31 06:07:09 server2 nvme1: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
root@server2[/var/log]# smartctl -a /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: ADATA SX6000PNP
Serial Number: ***************
Firmware Version: VC0S032E
PCI Vendor/Subsystem ID: 0x10ec
IEEE OUI Identifier: 0x00e04c
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00e04c 14c3b36201
Local Time is: Tue May 31 10:20:57 2022 CEST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001c): DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 115 Celsius
Critical Comp. Temp. Threshold: 120 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.00W - - 0 0 0 0 0 0
1 + 4.00W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.0300W - - 3 3 3 3 5000 10000
4 - 0.0050W - - 4 4 4 4 60000 45000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 199,736 [102 GB]
Data Units Written: 148,396 [75.9 GB]
Host Read Commands: 1,872,516
Host Write Commands: 8,561,421
Controller Busy Time: 0
Power Cycles: 41
Power On Hours: 315
Unsafe Shutdowns: 12
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged
root@server2[/var/log]# nvmecontrol logpage -p 1 nvme1
Error Information Log
=====================
No error entries found
root@server2[/var/log]# nvmecontrol logpage -p 1 nvme0
Error Information Log
=====================
No error entries found