Please rate: Smart-Error on new SSD critical?

saveZFS · Apr 11, 2022

Hello,
I get the following error message with one of my SSDs (not even a week old).
"Device: /dev/nvme0, number of Error Log entries increased from 34 to 36."

The following error is entered in the smart values.

Is this error critical, so that I should have the SSD replaced, or is it unimportant and I can leave the SSD in the system?

sretalla · Apr 11, 2022

Check the full SMART output from it first, but it's not looking great to have 36 bad things to say in the first week of its life.

saveZFS · Apr 11, 2022

sretalla said:
Check the full SMART output from it first, but it's not looking great to have 36 bad things to say in the first week of its life.

Thank you for your first assessment.
Is this the smart output you mean?

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 495,158 [253 GB]
Data Units Written: 2,521,738 [1.29 TB]
Host Read Commands: 2,461,910
Host Write Commands: 21,383,802
Controller Busy Time: 182
Power Cycles: 3
Power On Hours: 120
Unsafe Shutdowns: 1
Media and Data Integrity Errors: 0
Error Information Log Entries: 36
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 49 Celsius

sretalla · Apr 11, 2022

I can confirm that I have devices with 50TB or more written that have no errors logged, so I would not consider it normal to have any errors. That makes 36 sound like a reason to send it back.

I was hoping to be able to see the 36 errors via smartctl, but it seems not...

If we're on CORE here, you'll need to run nvmecontrol logpage -p 1 -x nvme0 (assuming it's nvme0 we're talking about).

What we then do with the hex table dump, I'm less sure. All I can see on mine is zeros, so I can't work out the format without some actual errors to go on.
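
On the format question: the NVMe base specification describes each entry of the Error Information log (log page 0x01) as a 64-byte little-endian record, so a dump can be decoded once it is available as raw bytes. The following is only a minimal sketch under stated assumptions: the field offsets follow the NVMe 1.4 layout as I understand it, `ErrorEntry`, `decode_error_entry` and `decode_log_page` are hypothetical names, and turning the text hex table printed by nvmecontrol into bytes is left out because its exact text format is not shown in this thread.

```python
import struct
from typing import List, NamedTuple

# Assumed layout of one Error Information log entry (NVMe 1.4, log page 0x01).
# "<" means little-endian with no padding; "x" skips reserved bytes.
ENTRY_FORMAT = "<QHHHHQIBB2xQH22x"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 64 bytes


class ErrorEntry(NamedTuple):  # hypothetical helper type
    error_count: int
    sqid: int
    cmdid: int
    phase_tag: int
    status_field: int
    parm_err_loc: int
    lba: int
    nsid: int
    vs: int
    trtype: int
    cs: int
    trtype_spec_info: int


def decode_error_entry(raw: bytes) -> ErrorEntry:
    """Decode exactly one 64-byte entry."""
    if len(raw) != ENTRY_SIZE:
        raise ValueError(f"expected {ENTRY_SIZE} bytes, got {len(raw)}")
    (error_count, sqid, cmdid, status, parm_err_loc, lba,
     nsid, vs, trtype, cs, trtype_spec_info) = struct.unpack(ENTRY_FORMAT, raw)
    # Bit 0 of the 16-bit status word is the phase tag; bits 15:1 are the status.
    return ErrorEntry(error_count, sqid, cmdid, status & 1, status >> 1,
                      parm_err_loc, lba, nsid, vs, trtype, cs, trtype_spec_info)


def decode_log_page(raw: bytes) -> List[ErrorEntry]:
    """Split a raw log page into entries and drop the unused ones.

    Assumption: an entry whose error_count is 0 is not a valid entry.
    """
    usable = len(raw) - len(raw) % ENTRY_SIZE
    entries = [decode_error_entry(raw[i:i + ENTRY_SIZE])
               for i in range(0, usable, ENTRY_SIZE)]
    return [e for e in entries if e.error_count != 0]
```

Under that assumption, a dump that is all zeros decodes to an empty list, meaning the drive is not currently reporting any valid error entries.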

NugentS · Apr 11, 2022

Please post the entire result from the smart output - in code tags. It would also help if you read the forum rules about posting your hardware setup - so we know what we are dealing with.

saveZFS · Apr 11, 2022

NugentS said:
Please post the entire result from the smart output - in code tags.

What do you mean exactly, if I may ask?
So that I can do better in the future!
I will also list my system in the signature in the future. Thanks for the hint! :)

For the two SSDs I have already decided that they will go back to the dealer and I will order others. I wasn't satisfied with the sync-write performance anyway.

winnielinnie · Apr 11, 2022

saveZFS said:
"Device: /dev/nvme0, number of Error Log entries increased from 34 to 36."

Likely to increase with every reboot or suspend. It has something to do with the NVMe's firmware and/or the OS not sending appropriate signals to the device. It's harmless.

It's perfectly fine, unless there's reason to suspect otherwise.

SMART doesn't play well with (nor understand well) NVMe drives.

One of my NVMe drives has 64 "errors".

The error log is spammed with these:

Code:

 Entry[63]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
phase_tag       : 0
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

---

Why does the Number of Error Information Log Entries in the SMART values
increase along with the number of times the computer is turned on and off?

Due to the fact that some platforms send non-NVMe command signals to the SSD,
the Number of Error Information Log Entries in the SMART values of the SSD
continues to increase. Error notification messages even appear on the screen
when booting.

The counting of this value will not cause any impact on the use of the SSD.
Customers may use it with peace of mind.

Some further reading: https://forum.proxmox.com/threads/pve-7-1-new-smart-error-on-every-hard-disk-at-every-reboot.101063/

mira said:
It seems this is an incompatibility between the kernel nvme interface and the NVMe. A future firmware update might fix this issue.

Feni said:
Same here with an older Samsung 950 Pro. No negative impact detected, other than an increasing counter.

So theoretically, these increasing "errors" may pause after a kernel update and/or NVMe firmware update. Or you can just ignore them.
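
One way to tell a noisy counter from a real problem is to look at the health fields that do signal trouble. The sketch below parses the "Key: value" lines of `smartctl -a` output and flags three conditions: a non-zero Critical Warning, non-zero Media and Data Integrity Errors, or Available Spare below its threshold. The function names and the choice of these three checks are assumptions made for illustration; this is not a smartmontools interface.

```python
from typing import Dict, List


def parse_health(text: str) -> Dict[str, str]:
    """Collect 'Key: value' lines from smartctl output into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def to_int(value: str) -> int:
    """Read the leading number from values like '100%', '0x00' or '2,521,738 [1.29 TB]'."""
    token = value.split()[0].replace(",", "").rstrip("%")
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token)


def health_problems(text: str) -> List[str]:
    """Return a list of worrying findings; an empty list means none were found.

    Fields that are missing from the output are treated as 'no problem',
    which is an assumption of this sketch.
    """
    f = parse_health(text)
    problems: List[str] = []
    if to_int(f.get("Critical Warning", "0")) != 0:
        problems.append("critical warning flag set")
    if to_int(f.get("Media and Data Integrity Errors", "0")) != 0:
        problems.append("media or data integrity errors reported")
    if "Available Spare" in f and "Available Spare Threshold" in f:
        if to_int(f["Available Spare"]) < to_int(f["Available Spare Threshold"]):
            problems.append("available spare below threshold")
    return problems
```

With the values posted earlier in this thread (Critical Warning 0x00, Media and Data Integrity Errors 0, Available Spare 100% against a 5% threshold), `health_problems` returns an empty list even though Error Information Log Entries is 36.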

saveZFS · Apr 11, 2022

winnielinnie said:
So theoretically, these increasing "errors" may pause after a kernel update and/or NVMe firmware update. Or you can just ignore them.

Thank you very much, then I will keep the SSDs and not send them back to the dealer!
Although I will still need other SSDs in the long run because the sync-write performance is just extremely bad! :(

NugentS · Apr 11, 2022

saveZFS said:
What do you mean exactly, if I may ask?
So that I can do better in the future!
I will also list my system in the signature in the future. Thanks for the hint! :)

For the two SSDs I have already decided that they will go back to the dealer and I will order others. I wasn't satisfied with the sync-write performance anyway.

You only appeared to post the second half of the SMART output. The first part would have told us what the drives were, for example:


root@newnas[~]# smartctl -a /dev/nvme1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      S5H7NS0NA72055A
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            373,901,086,720 [373 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5a0143d2fd
Local Time is:                      Mon Apr 11 23:16:00 2022 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,059,596 [542 GB]
Data Units Written:                 3,560,615 [1.82 TB]
Host Read Commands:                 16,171,177
Host Write Commands:                69,974,540
Controller Busy Time:               103
Power Cycles:                       66
Power On Hours:                     3,832
Unsafe Shutdowns:                   24
Media and Data Integrity Errors:    0
Error Information Log Entries:      161
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               30 Celsius
Temperature Sensor 2:               36 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

SirNomad49 · Jun 6, 2022

Same topic.

NVME disk - Error "number of Error Log entries increased"

Hello. I have a Dell server on which I have a Truenas installed: FreeBSD 12.2-RELEASE-p9 2ee62d665f0(HEAD) TRUENAS We have a pool of NVME disks with 3 units. Device: /dev/nvme0, number of Error Log entries increased from 1 to 2 Device: /dev/nvme2, number of Error Log entries increased from...

www.truenas.com
