Trying to read/understand smartctl -a information.

DenisInternet

Dabbler
Joined
Jun 14, 2022
Messages
28
Hey folks, I have a failed NVMe drive and am waiting for the replacement to arrive. The pool is currently running degraded (slow, but it seems stable). Everything is backed up on a secondary RAID box.

While I wait for the replacement NVMe (lesson learned about keeping a spare in the future), I am trying to figure out what went wrong with the drive. I ran smartctl -a /dev/nvme3n1; could someone please help me understand the output below, and whether any of it points to the cause of the failure?

Thanks!

Code:
admin@truenas[~]$ sudo smartctl -a /dev/nvme3n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WUS3BA176C7P3E3
Serial Number:                      A068DE2E
Firmware Version:                   R0112100
PCI Vendor/Subsystem ID:            0x1b96
IEEE OUI Identifier:                0x0014ee
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            0014ee 81000aee24
Local Time is:                      Thu Mar  7 02:31:05 2024 UTC
Firmware Updates (0x19):            4 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f):   Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005a):     Wr_Unc Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0        0       0
 1 +    10.00W       -        -    0  0  0  0        0       0
 2 +     8.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +    4096       0         0
 1 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- media has been placed in read only mode
- volatile memory backup device has failed

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x18
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    136,159,627 [69.7 TB]
Data Units Written:                 78,984,419 [40.4 TB]
Host Read Commands:                 738,510,235
Host Write Commands:                549,669,094
Controller Busy Time:               105,931
Power Cycles:                       3,813
Power On Hours:                     8,793
Unsafe Shutdowns:                   221
Media and Data Integrity Errors:    0
Error Information Log Entries:      28
Warning  Comp. Temperature Time:    39399
Critical Comp. Temperature Time:    57
Temperature Sensor 1:               39 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0         28     0  0x201c  0xc004  0x029            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                6572            -     -   -   -    -

admin@truenas[~]$ print

admin@truenas[~]$ 

 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
First, you would need to run a new self-test to get current data, since the last one in the log was at 6,572 power-on hours, over 2k hours ago.
Second, it looks like it reached its endurance limit and put itself in read-only mode.

This model doesn't seem to give you much data to work with, however. Maybe trying the manufacturer's proprietary tool (if one exists) could tell you more.
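
For anyone reading the output above, the Critical Warning value (0x18) is a bitfield, and decoding it shows which flags the drive has raised. The snippet below is a minimal sketch, assuming the bit meanings of the Critical Warning field in the NVMe SMART/Health log; the function name and the wording of bits 0, 1, 2 and 5 are my own, while the text for bits 3 and 4 is copied from the smartctl messages in the first post.

```python
# Sketch: decode the "Critical Warning" byte from the NVMe SMART/Health log.
# Bit assignments are assumed from the NVMe specification; the descriptions
# for bits 3 and 4 match the messages smartctl printed above.
CRITICAL_WARNING_BITS = {
    0: "available spare has fallen below threshold",
    1: "temperature is above or below a threshold",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read only mode",
    4: "volatile memory backup device has failed",
    5: "persistent memory region is read-only or unreliable",
}


def decode_critical_warning(value):
    """Return the description of every warning bit set in value."""
    return [
        text
        for bit, text in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x18 is the value reported for /dev/nvme3n1 in the first post.
    for text in decode_critical_warning(0x18):
        print("-", text)
```

For 0x18 only bits 3 and 4 are set, which matches the two lines smartctl prints under the FAILED! result; bit 0 (available spare below threshold) is not set.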
 