Hello all,
I have a pool of four 16 TB HDDs and two 1 TB NVMe drives used as a cache vdev. A week or so ago I noticed that whenever I'm logged in to the server via SSH, the console randomly shows a message that the error count of nvme01 has increased. Here's the output of smartctl -a /dev/nvme0:
Code:
=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SNV2S1000G
Serial Number:                      50026B77855C33BE
Firmware Version:                   ELFK0S.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 7855c33be5
Local Time is:                      Sun Jan 14 10:12:14 2024 EET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning Comp. Temp. Threshold:      77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3     3000    2000
 4 -   0.0035W       -        -    4  4  4  4    10000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,156,105 [1.61 TB]
Data Units Written:                 25,576,803 [13.0 TB]
Host Read Commands:                 14,883,940
Host Write Commands:                288,586,691
Controller Busy Time:               247
Power Cycles:                       47
Power On Hours:                     442
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,280,895
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1280895     0  0x401d  0xc004  0x000            0     0     -
  1    1280894     0  0x501c  0xc004  0x000            0     0     -
  2    1280893     0  0xd011  0xc004  0x000            0     0     -
  3    1280892     0  0xd010  0xc004  0x000            0     0     -
  4    1280891     0  0x201c  0xc005  0x000            0     0     -
  5    1280890     0  0x7008  0xc005  0x000            0     0     -
  6    1280889     0  0x001d  0xc005  0x000            0     0     -
  7    1280888     0  0x101c  0xc005  0x000            0     0     -
  8    1280887     0  0x7019  0xc004  0x000            0     0     -
  9    1280886     0  0x7018  0xc004  0x000            0     0     -
 10    1280885     0  0xf01c  0xc004  0x000            0     0     -
 11    1280884     0  0xd01d  0xc004  0x000            0     0     -
 12    1280883     0  0xf00a  0xc005  0x000            0     0     -
 13    1280882     0  0xf009  0xc005  0x000            0     0     -
 14    1280881     0  0x501a  0xc005  0x000            0     0     -
 15    1280880     0  0x5019  0xc005  0x000            0     0     -
... (47 entries not read)
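In case it helps anyone reading the log, here is a rough Python sketch of how the Status column might be split into its sub-fields. It rests on an assumption I have not verified for this smartctl version: that the printed value is the raw 16-bit status field of the NVMe Error Information Log entry, with bit 0 as the phase tag and bits 15:1 as the status. The names decode_status, describe and GENERIC_STATUS are made up for this example, and the lookup table is only a small excerpt of the generic status codes.

```python
# Sketch only. Assumes the Status column is the raw 16-bit field from the
# NVMe Error Information Log entry: bit 0 = phase tag, bits 15:1 = status.

# Small excerpt of Generic Command Status codes (status code type 0).
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
    0x03: "Command ID Conflict",
    0x04: "Data Transfer Error",
}


def decode_status(raw: int) -> dict:
    """Split a raw 16-bit status value into its sub-fields."""
    return {
        "phase_tag": raw & 0x1,                   # bit 0
        "status_code": (raw >> 1) & 0xFF,         # bits 8:1
        "status_code_type": (raw >> 9) & 0x7,     # bits 11:9
        "more": (raw >> 14) & 0x1,                # bit 14
        "do_not_retry": (raw >> 15) & 0x1,        # bit 15
    }


def describe(raw: int) -> str:
    """Return a readable name for generic status codes in the excerpt."""
    fields = decode_status(raw)
    if fields["status_code_type"] != 0:
        return "non-generic status (not covered by this sketch)"
    return GENERIC_STATUS.get(fields["status_code"], "unknown generic status")


if __name__ == "__main__":
    # The two Status values that appear in the log above.
    for value in (0xC004, 0xC005):
        print(hex(value), decode_status(value), describe(value))
```

If that assumption holds, 0xc004 and 0xc005 would differ only in the phase tag bit and would both be the same generic status code, but I would appreciate confirmation from someone who knows the smartctl output format better.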
Could someone please help me understand what these errors mean, what could have caused them and how to avoid them in the future, if possible?
Thank you!