One or more devices has experienced an unrecoverable error

ChrisChros

Patron
Joined
Nov 24, 2018
Messages
218
Hello guys,

Today I saw the following message in the Alerts tab:

Pool VirtualMachine state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.


Code:
root@TrueNAS:~ # zpool status -v
  pool: TankDrive
 state: ONLINE
  scan: scrub repaired 0B in 03:00:44 with 0 errors on Sat Oct 15 07:00:46 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    TankDrive                                       ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/def3c606-f4d2-11e8-8329-d05099c12845  ONLINE       0     0     0
        gptid/dfa6cded-f4d2-11e8-8329-d05099c12845  ONLINE       0     0     0
        gptid/e05b5506-f4d2-11e8-8329-d05099c12845  ONLINE       0     0     0
        gptid/e11854fd-f4d2-11e8-8329-d05099c12845  ONLINE       0     0     0
        gptid/e1d3c80e-f4d2-11e8-8329-d05099c12845  ONLINE       0     0     0

errors: No known data errors

  pool: VirtualMachine
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:01:10 with 0 errors on Sun Oct 16 17:40:29 2022
config:

    NAME            STATE     READ WRITE CKSUM
    VirtualMachine  ONLINE       0     0     0
      nvd0          ONLINE       1     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:40 with 0 errors on Fri Oct 14 03:45:40 2022
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      ada0p2      ONLINE       0     0     0

errors: No known data errors


It looks like the pool VirtualMachine has 1 read error.
How can I check what exactly happened and get rid of this message?

And on top, how can I upgrade the boot drive to enable all features?

Thanks for your help and suggestions.
Chris
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
"How can I check what exactly happened and get rid of this message?"
smartctl -a /dev/nvme0 (on FreeBSD, smartctl addresses the NVMe controller device nvme0 rather than the nvd0 disk node) and zpool clear VirtualMachine, respectively.
"And on top, how can I upgrade the boot drive to enable all features?"
zpool upgrade freenas-boot.
"Once this is done, the pool may no longer be accessible by software that does not support the features."
Suggestion: do read what the system tells you, even if it looks like a wall of text.
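
The per-device counters in the zpool status output can also be picked out with a small script instead of by eye. The following is only a minimal Python sketch, assuming the NAME / STATE / READ / WRITE / CKSUM column layout shown in the output earlier in this thread. The function name devices_with_errors is made up for illustration and is not part of any ZFS tooling, and abbreviated counters (for example values with a K suffix) are not handled.

```python
import re

# Sample text copied from the `zpool status -v` output earlier in this thread.
SAMPLE = """\
  pool: VirtualMachine
 state: ONLINE
config:

    NAME            STATE     READ WRITE CKSUM
    VirtualMachine  ONLINE       0     0     0
      nvd0          ONLINE       1     0     0

errors: No known data errors
"""

# A config row: name, upper-case state, then three plain integer counters.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
POOL = re.compile(r"^\s*pool:\s+(\S+)")


def devices_with_errors(status_text):
    """Return (pool, name, read, write, cksum) for rows with a nonzero counter."""
    hits = []
    pool = None
    for line in status_text.splitlines():
        pool_match = POOL.match(line)
        if pool_match:
            pool = pool_match.group(1)  # remember which pool the rows belong to
            continue
        row = ROW.match(line)
        if row:
            name, _state, read, write, cksum = row.groups()
            counts = (int(read), int(write), int(cksum))
            if any(counts):
                hits.append((pool, name, *counts))
    return hits


if __name__ == "__main__":
    for hit in devices_with_errors(SAMPLE):
        print(hit)
```

In practice the text would come from running zpool status on the server itself; the sample string only stands in for that here.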
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"And on top, how can I upgrade the boot drive to enable all features?"

This is probably meaningless and pointless, as there aren't any significant "features" that you need on the boot pool aside from the ability to store files. Additionally, if you *do* upgrade the boot pool, and then need to revert to an older version of TrueNAS, you could find yourself unable to.
 

ChrisChros

Hi, I received the same message again for the same drive. Do I need to worry about this and replace the drive?
Code:
  pool: VirtualMachine
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:01:37 with 0 errors on Fri Dec 15 05:01:37 2023
config:

    NAME            STATE     READ WRITE CKSUM
    VirtualMachine  ONLINE       0     0     0
      nvd0          ONLINE       7     0     2

errors: No known data errors

SMART shows me this result:
Code:
root@TrueNAS:~ # smartctl -a /dev/nvme0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      S5H9NC0N202224Z
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            288,979,996,672 [288 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 52015008b0
Local Time is:                      Tue Dec 19 19:05:02 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    17,651,281 [9.03 TB]
Data Units Written:                 94,528,920 [48.3 TB]
Host Read Commands:                 194,512,797
Host Write Commands:                2,703,696,616
Controller Busy Time:               10,864
Power Cycles:                       26
Power On Hours:                     29,131
Unsafe Shutdowns:                   8
Media and Data Integrity Errors:    2
Error Information Log Entries:      19
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               43 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         19     1  0x007f  0xc502  0x000    737101018     1     -


Any suggestions on how to proceed?

Regards Chris
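
For anyone comparing their own drive, the handful of health fields that matter here can be pulled out of the smartctl text output. This is a rough Python sketch, not an official smartmontools interface: it assumes the "Name:   value" layout of the SMART/Health Information section shown above, and the helper names parse_health, to_int and findings are invented for this example. The checks are illustrative only and are not a verdict on whether a drive should be replaced.

```python
# Lines copied from the SMART/Health Information section posted above.
SAMPLE = """\
Critical Warning:                   0x00
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Media and Data Integrity Errors:    2
Error Information Log Entries:      19
"""


def parse_health(text):
    """Map each 'Name:   value' line to its value as a string."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def to_int(value):
    """Convert '1,000' -> 1000, '10%' -> 10, '0x00' -> 0."""
    value = value.replace(",", "").rstrip("%")
    if value.lower().startswith("0x"):
        return int(value, 16)
    return int(value)


def findings(fields):
    """Return human-readable notes for fields that look unhealthy."""
    notes = []
    if to_int(fields["Critical Warning"]) != 0:
        notes.append("critical warning flag set")
    spare = to_int(fields["Available Spare"])
    threshold = to_int(fields["Available Spare Threshold"])
    if spare <= threshold:
        notes.append("available spare at or below threshold")
    media = to_int(fields["Media and Data Integrity Errors"])
    if media > 0:
        notes.append(f"{media} media and data integrity errors")
    return notes


if __name__ == "__main__":
    for note in findings(parse_health(SAMPLE)):
        print(note)
```

With the values posted above, only the media and data integrity error count is flagged; the spare capacity and critical warning fields look normal.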
 

ChrisChros

The latest available firmware for this SSD, 2B2QEXE7, is already installed.
 

ChrisChros

"If it's under warranty, file an RMA request."
It's a Samsung SSD 970 EVO with 29131 h => 1213 days of operation, slightly out of warranty.

UPDATE:
I found this; it looks like the SSD is still under warranty.
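
The hours-to-days conversion above is easy to reproduce. A small Python sketch of the arithmetic follows; the function name power_on_age is made up for this example, and 365.25 days per year is just an averaging assumption. Note that power-on hours only count time the drive was powered, whereas a warranty period normally runs from the purchase date, so the actual terms have to be checked with the vendor.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # averaging assumption, includes leap years


def power_on_age(hours):
    """Convert SMART power-on hours to (days, years) of powered-on time."""
    days = hours / HOURS_PER_DAY
    years = days / DAYS_PER_YEAR
    return days, years


if __name__ == "__main__":
    days, years = power_on_age(29_131)  # value from the smartctl output
    print(f"{days:.1f} days, about {years:.2f} years powered on")
```

For 29,131 hours this comes out at a little under 1,214 days, or roughly 3.3 years of powered-on time.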
 

Attachments

  • Bildschirmfoto 2023-12-20 um 14.00.38.png
Joined
Oct 22, 2019
Messages
3,641
Interestingly, I have the same NVMe drives as you, in a 2-way mirror vdev.

Mine have power-on hours (27,000) comparable to yours (29,000).

What stands out is that mine have significantly higher reads (44 TiB) compared to yours (9 TiB), yet, inversely, significantly lower writes (15 TiB) compared to yours (48 TiB).

This probably lends credence to the idea that SATA and NVMe SSDs wear asymmetrically: reads are cheap, whereas writes are not only more intensive but also require erasing cells that hold previously "deleted" data before those cells can be written to again.

While I've never had any ZFS read/write/checksum errors on my NVMe pool, the "error count" does occasionally climb for both drives. (They're both around 30 logged errors so far.)

For what it's worth, I disabled "Auto TRIM", but I do have a weekly cron job that issues the "zpool trim" command on this pool. Are you using any form of TRIM (auto, weekly, something else)?
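
A side note on the read/write totals being compared: the NVMe SMART log counts "data units" of 1000 x 512 bytes each, and the bracketed figures in the smartctl output posted earlier are decimal TB, not TiB. The Python sketch below converts the posted counters both ways; the function names are invented for this example.

```python
BYTES_PER_DATA_UNIT = 512 * 1000  # one NVMe data unit = 1000 blocks of 512 bytes


def data_units_to_bytes(units):
    """Convert an NVMe 'Data Units Read/Written' counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


def to_tb(num_bytes):
    """Decimal terabytes (10**12 bytes)."""
    return num_bytes / 10**12


def to_tib(num_bytes):
    """Binary tebibytes (2**40 bytes)."""
    return num_bytes / 2**40


if __name__ == "__main__":
    # Counters taken from the smartctl output posted earlier in the thread.
    for label, units in (("read", 17_651_281), ("written", 94_528_920)):
        size = data_units_to_bytes(units)
        print(f"{label}: {to_tb(size):.2f} TB = {to_tib(size):.2f} TiB")
```

Expressed in TiB, the posted drive has read about 8.2 TiB and written about 44 TiB, so the comparison holds either way but the numbers shift by roughly 9%.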
 
Joined
Oct 22, 2019
Messages
3,641
What is this pool ("VirtualMachine") being used for? To hold zvols for your VM storage devices?
 

ChrisChros

"What is this pool ("VirtualMachine") being used for? To hold zvols for your VM storage devices?"
The pool is mainly used for VM and plugin/jail storage. There is also a zvol for ISO images.
 

ChrisChros

"For what it's worth, I disabled "Auto TRIM", but I do have a weekly cron job that issues the "zpool trim" command on this pool. Are you using any form of TRIM (auto, weekly, something else)?"
Auto-TRIM is enabled for this drive.
 

Attachments

  • Bildschirmfoto 2023-12-20 um 17.59.23.png