smartmontools 7.2 is causing errors to accumulate on HPE NVMe drives

dvaezazizi

  • HPE DL325 Gen10 Plus V2
  • Epyc 7513
  • 256GB RAM
  • 16x NVMe direct attach
  • QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller
  • Linux truenas 5.10.142+truenas #1 SMP Mon Sep 26 18:20:46 UTC 2022 x86_64 GNU/Linux
smartmontools is causing a large number of entries to accumulate in the error logs of 6x HPE/Kioxia NVMe drives (see below); each log line below shows the count rising by 116 entries. I'm not sure whether this issue is fixed in 7.3, but if it is, can we please get a patch? I also cross-posted this to smartmontools, as it seems related to https://www.smartmontools.org/ticket/1134, although that was fixed ages ago. (Kioxia is Toshiba.)

2022 Oct 30 16:38:18 truenas Device: /dev/nvme4n1, number of Error Log entries increased from 230917 to 231033
2022 Oct 30 16:38:18 truenas Device: /dev/nvme1n1, number of Error Log entries increased from 214056 to 214172
2022 Oct 30 16:38:18 truenas Device: /dev/nvme0n1, number of Error Log entries increased from 213937 to 214053
2022 Oct 30 16:38:18 truenas Device: /dev/nvme5n2, number of Error Log entries increased from 36162 to 36278
2022 Oct 30 16:38:18 truenas Device: /dev/nvme7n1, number of Error Log entries increased from 27371 to 27487
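
To make the pattern in the log lines above easier to see, here is a minimal Python sketch that extracts the before/after counts and prints the per-drive increase. The regular expression and the function name `error_log_deltas` are my own for illustration; only the log text itself comes from the system.

```python
import re

# The five log lines quoted above, pasted verbatim.
LOG_LINES = """\
2022 Oct 30 16:38:18 truenas Device: /dev/nvme4n1, number of Error Log entries increased from 230917 to 231033
2022 Oct 30 16:38:18 truenas Device: /dev/nvme1n1, number of Error Log entries increased from 214056 to 214172
2022 Oct 30 16:38:18 truenas Device: /dev/nvme0n1, number of Error Log entries increased from 213937 to 214053
2022 Oct 30 16:38:18 truenas Device: /dev/nvme5n2, number of Error Log entries increased from 36162 to 36278
2022 Oct 30 16:38:18 truenas Device: /dev/nvme7n1, number of Error Log entries increased from 27371 to 27487
"""

# Hypothetical pattern written for this message format only.
PATTERN = re.compile(
    r"Device: (\S+), number of Error Log entries increased from (\d+) to (\d+)"
)


def error_log_deltas(text):
    """Return {device: new_count - old_count} for every matching log line."""
    deltas = {}
    for match in PATTERN.finditer(text):
        device = match.group(1)
        old_count = int(match.group(2))
        new_count = int(match.group(3))
        deltas[device] = new_count - old_count
    return deltas


deltas = error_log_deltas(LOG_LINES)
for device, delta in sorted(deltas.items()):
    print(device, delta)
```

Every drive in this sample gains the same number of entries in the same second, which is what makes me suspect the monitoring tool rather than the drives.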


Model Number: TCM615T4P5xnFTRI
Serial Number: <removed>
Firmware Version: 3P01
PCI Vendor ID: 0x1e0f
PCI Vendor Subsystem ID: 0x1590
IEEE OUI Identifier: 0x8ce38e
Total NVM Capacity: 15,360,950,534,144 [15.3 TB]
Unallocated NVM Capacity: 312,458,870,784 [312 GB]
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 15,048,491,663,360 [15.0 TB]
Namespace 1 Utilization: 2,165,793,648,640 [2.16 TB]
Namespace 1 Formatted LBA Size: 4096
Namespace 1 IEEE EUI-64: 8ce38e e209a49501
Local Time is: Sun Oct 30 16:37:26 2022 PDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x025f): Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Get_LBA_Sts
Optional NVM Commands (0x00ff): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 8192 Pages
Warning Comp. Temp. Threshold: 76 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Namespace 1 Features (0x10): NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    27.50W  25.00W       -    0  0  0  0   500000  500000
 1 +    19.80W  18.00W       -    0  0  1  1   500000  500000
 2 +    17.60W  16.00W       -    0  0  2  2   500000  500000
 3 +    15.40W  14.00W       -    1  1  3  3   500000  500000
 4 +    12.10W  11.00W       -    2  2  4  4   500000  500000
 5 +     9.90W   9.00W       -    3  3  5  5   500000  500000
 6 -     5.00W       -       -    6  6  6  6   500000  500000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 -     512       8         0
 2 -       1       0         0
 3 +    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0

=== START OF SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Available Spare Threshold: 26%
Percentage Used: 0%
Data Units Read: 76,623,421 [39.2 TB]
Data Units Written: 59,365,282 [30.3 TB]
Host Read Commands: 614,079,901
Host Write Commands: 549,353,795
Controller Busy Time: 544
Power Cycles: 59
Power On Hours: 3,526
Unsafe Shutdowns: 24
Media and Data Integrity Errors: 25
Error Information Log Entries: 214,053
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 109 Celsius
Temperature Sensor 2: 101 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0     214053     0  0xc00a  0xc004  0x02e            -     0     -
  1     214052     0  0x7019  0xc004  0x02e            -     0     -
  2     214051     0  0xc008  0xc004  0x02e            -     0     -
  3     214050     0  0xb00b  0xc004  0x02e            -     0     -
  4     214049     0  0x601b  0xc004  0x02e            -     0     -
  5     214048     0  0x601a  0xc004  0x02e            -     0     -
  6     214047     0  0x6018  0xc004  0x02e            -     0     -
  7     214046     0  0x501b  0xc004  0x02e            -     0     -
  8     214045     0  0x700c  0xc004  0x02e            -     0     -
  9     214044     0  0x600f  0xc004  0x02e            -     0     -
 10     214043     0  0x600d  0xc004  0x02e            -     0     -
 11     214042     0  0x600c  0xc004  0x02e            -     0     -
 12     214041     0  0x3013  0xc004  0x02e            -     0     -
 13     214040     0  0x3012  0xc004  0x02e            -     0     -
 14     214039     0  0x3010  0xc004  0x02e            -     0     -
 15     214038     0  0x2013  0xc004  0x02e            -     0     -

---

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See http://www.gnu.org for further details.

smartmontools release 7.2 dated 2020-12-30 at 16:48:30 UTC
smartmontools SVN rev 5155 dated 2020-12-30 at 16:49:18
smartmontools build host: x86_64-pc-linux-gnu
smartmontools build with: C++14, GCC 10.2.1 20210110
smartmontools configure arguments: '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--build=x86_64-linux-gnu' '--host=x86_64-linux-gnu' '--prefix=/usr' '--sysconfdir=/etc' '--mandir=/usr/share/man' '--with-initscriptdir=no' '--docdir=/usr/share/doc/smartmontools' '--with-attributelog=/var/lib/smartmontools/attrlog.' '--with-drivedbdir=/var/lib/smartmontools/drivedb' '--with-exampledir=/usr/share/doc/smartmontools/examples/' '--with-savestates=/var/lib/smartmontools/smartd.' '--with-smartdplugindir=/etc/smartmontools/smartd_warning.d' '--with-smartdscriptdir=/usr/share/smartmontools' '--with-systemdenvfile=/etc/default/smartmontools' '--with-systemdsystemunitdir=/lib/systemd/system' '--with-libsystemd=auto' '--with-selinux' 'build_alias=x86_64-linux-gnu' 'host_alias=x86_64-linux-gnu' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/dpkg-src=. -fstack-protector-strong -Wformat -Werror=format-security -fsigned-char -Wall' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CFLAGS=-g -O2 -ffile-prefix-map=/dpkg-src=. -fstack-protector-strong -Wformat -Werror=format-security -fsigned-char -Wall'

---

.................
error_count : 214030
sqid : 0
cmdid : 0x7003
status_field : 0xc004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x2e
lba : 0xffffffffffffffff
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0xffffffffffffffff
trtype_spec_info: 0
.................
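
For anyone who wants to decode the `status_field` and `parm_err_loc` values above by hand, here is a rough Python sketch. It assumes the bit layout I understand from the NVMe base specification for an Error Information Log entry (bit 0 is the phase tag, bits 8:1 the status code, bits 11:9 the status code type, bit 14 More, bit 15 Do Not Retry; the parameter error location holds a byte offset in bits 7:0 and a bit offset in bits 10:8). The function names are mine, and the layout should be checked against the spec version the drive reports.

```python
def decode_status(status):
    """Split a 16-bit Error Log status value into its fields (assumed layout)."""
    return {
        "phase_tag": status & 0x1,
        "status_code": (status >> 1) & 0xFF,
        "status_code_type": (status >> 9) & 0x7,
        "more": (status >> 14) & 0x1,
        "do_not_retry": (status >> 15) & 0x1,
    }


def decode_parm_err_loc(loc):
    """Locate the offending byte/bit inside the 64-byte submission queue entry."""
    byte_offset = loc & 0xFF
    bit_offset = (loc >> 8) & 0x7
    return {
        "byte": byte_offset,
        "bit": bit_offset,
        # Commands are 16 dwords of 4 bytes each, so byte // 4 gives the dword.
        "dword": byte_offset // 4,
        "byte_in_dword": byte_offset % 4,
    }


status = decode_status(0xC004)
location = decode_parm_err_loc(0x2E)
print(status)
print(location)
```

Under those assumptions, 0xc004 decodes to status code 0x02 with status code type 0 (matching the INVALID_FIELD text from nvme-cli) with the More and Do Not Retry bits set, and 0x2e points at byte 46 of the command, which falls in command dword 11.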