Cache NVMe error count increase

zkvvoob

Cadet
Joined
Dec 18, 2023
Messages
4
Hello all,

I have a pool of four 16 TB HDDs and two 1 TB NVMe drives used as a cache vdev. A week or so ago I noticed that whenever I'm logged in to the server via SSH, the console randomly shows a message that the error count of nvme01 has increased. Here's the output of smartctl -a /dev/nvme0:

Code:
=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SNV2S1000G
Serial Number:                      50026B77855C33BE
Firmware Version:                   ELFK0S.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 7855c33be5
Local Time is:                      Sun Jan 14 10:12:14 2024 EET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3     3000    2000
 4 -   0.0035W       -        -    4  4  4  4    10000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,156,105 [1.61 TB]
Data Units Written:                 25,576,803 [13.0 TB]
Host Read Commands:                 14,883,940
Host Write Commands:                288,586,691
Controller Busy Time:               247
Power Cycles:                       47
Power On Hours:                     442
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,280,895
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1280895     0  0x401d  0xc004  0x000            0     0     -
  1    1280894     0  0x501c  0xc004  0x000            0     0     -
  2    1280893     0  0xd011  0xc004  0x000            0     0     -
  3    1280892     0  0xd010  0xc004  0x000            0     0     -
  4    1280891     0  0x201c  0xc005  0x000            0     0     -
  5    1280890     0  0x7008  0xc005  0x000            0     0     -
  6    1280889     0  0x001d  0xc005  0x000            0     0     -
  7    1280888     0  0x101c  0xc005  0x000            0     0     -
  8    1280887     0  0x7019  0xc004  0x000            0     0     -
  9    1280886     0  0x7018  0xc004  0x000            0     0     -
 10    1280885     0  0xf01c  0xc004  0x000            0     0     -
 11    1280884     0  0xd01d  0xc004  0x000            0     0     -
 12    1280883     0  0xf00a  0xc005  0x000            0     0     -
 13    1280882     0  0xf009  0xc005  0x000            0     0     -
 14    1280881     0  0x501a  0xc005  0x000            0     0     -
 15    1280880     0  0x5019  0xc005  0x000            0     0     -
... (47 entries not read)


Could someone please help me understand what these errors mean, what could have caused them and how to avoid them in the future, if possible?

Thank you!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, NVMe drives are not as transparent about these things as SATA drives, but clearly errors are bad. You may be able to get more info by updating the smartctl database (update-smart-drivedb) and using smartctl -x to get all data.
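For reference, the Status column in the error log from the first post can be decoded by hand. The sketch below assumes that smartctl prints the raw 16-bit status word from the NVMe Error Information log entry, with the phase tag in bit 0 and the status field in bits 15:1, as laid out in the NVMe base specification. The function name and the small lookup table are illustrative only and not part of any tool.

```python
# Minimal sketch: split a 16-bit NVMe status word into its fields.
# Assumption: bit 0 is the phase tag, bits 8:1 the status code (SC),
# bits 11:9 the status code type (SCT), bit 14 "More", bit 15 "Do Not Retry".

def decode_nvme_status(raw: int) -> dict:
    return {
        "phase_tag": raw & 0x1,
        "status_code": (raw >> 1) & 0xFF,
        "status_code_type": (raw >> 9) & 0x7,
        "more": bool((raw >> 14) & 0x1),
        "do_not_retry": bool((raw >> 15) & 0x1),
    }

# A few generic (SCT 0) status codes, for illustration only.
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}

if __name__ == "__main__":
    for raw in (0xC004, 0xC005):  # the two values seen in the log
        fields = decode_nvme_status(raw)
        name = "unknown"
        if fields["status_code_type"] == 0:
            name = GENERIC_STATUS.get(fields["status_code"], "unknown")
        print(f"0x{raw:04x}: {fields} -> {name}")
```

If that assumption holds, both 0xc004 and 0xc005 decode to generic status 0x02 (Invalid Field in Command) with Do Not Retry set, and they differ only in the phase tag. Since SQId 0 is the admin queue, that would point to admin commands being rejected by the drive rather than to media errors, which is consistent with Media and Data Integrity Errors staying at 0.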
 

zkvvoob

Cadet
Joined
Dec 18, 2023
Messages
4
Thanks for the suggestion! Unfortunately, my smart-drivedb turned out to be up to date and smartctl -x didn't show any additional information.
 

MrGuvernment

Patron
Joined
Jun 15, 2017
Messages
268
Not necessarily related, but how much system RAM do you have? Is the cache on the NVMes even being hit?
 

zkvvoob

Cadet
Joined
Dec 18, 2023
Messages
4
My system has 64 GB of RAM, around half of which is used for the ZFS cache. I don't know whether the NVMes are actually in use...
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Go to the shell and type arc_summary.

The generated report will tell you in the L2ARC section what's used.
 

zkvvoob

Cadet
Joined
Dec 18, 2023
Messages
4
Here are the results. Can you help me interpret them, please?
Code:
L2ARC status:                                                    HEALTHY
        Low memory aborts:                                             0
        Free on write:                                                 0
        R/W clashes:                                                   0
        Bad checksums:                                                 0
        Read errors:                                                   0
        Write errors:                                                  0

L2ARC size (adaptive):                                           1.8 TiB
        Compressed:                                    86.6 %    1.6 TiB
        Header size:                                    0.1 %    1.4 GiB
        MFU allocated size:                             3.9 %   62.2 GiB
        MRU allocated size:                            95.3 %    1.5 TiB
        Prefetch allocated size:                        0.9 %   14.1 GiB
        Data (buffer content) allocated size:          99.7 %    1.6 TiB
        Metadata (buffer content) allocated size:       0.3 %    5.3 GiB

L2ARC breakdown:                                                    1.0M
        Hit ratio:                                     34.9 %     350.8k
        Miss ratio:                                    65.1 %     655.3k

L2ARC I/O:
        Reads:                                       36.8 GiB     350.8k
        Writes:                                      11.2 GiB      16.5k

L2ARC evicts:
        L1 cached:                                                     0
        While reading:                                                 0
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
So your L2ARC is holding 1.8 TiB of data (1.6 TiB compressed), and of the roughly 1 million requests that reached it, it was able to serve about 35% (350.8k hits), so the pool disks did not have to serve that content.
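To make the arithmetic explicit, here is a minimal sketch that recomputes the hit ratio from the hit and miss counts in the report, plus the average size of an L2ARC read. It assumes the "k" and "M" suffixes in arc_summary are decimal multipliers (1,000 and 1,000,000); the helper name parse_count is made up for this example.

```python
# Minimal sketch: recompute L2ARC figures from the arc_summary values quoted above.
# Assumption: count suffixes are decimal (k = 1e3, M = 1e6, G = 1e9).

def parse_count(text: str) -> float:
    """Convert a count such as '350.8k' into a plain number."""
    suffixes = {"k": 1e3, "M": 1e6, "G": 1e9}
    text = text.strip()
    if text and text[-1] in suffixes:
        return float(text[:-1]) * suffixes[text[-1]]
    return float(text)

hits = parse_count("350.8k")      # "Hit ratio" line
misses = parse_count("655.3k")    # "Miss ratio" line
total = hits + misses             # roughly the 1.0M shown under "L2ARC breakdown"

hit_ratio = hits / total

read_bytes = 36.8 * 2**30         # "Reads: 36.8 GiB"
avg_read_kib = read_bytes / hits / 1024

if __name__ == "__main__":
    print(f"total lookups: {total:,.0f}")
    print(f"hit ratio:     {hit_ratio:.1%}")
    print(f"avg read size: {avg_read_kib:.0f} KiB")
```

The recomputed ratio agrees with the 34.9 % that arc_summary reports. Whether that figure is good enough depends on the workload, which the report alone does not show.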
 