SOLVED Help! One or more devices has experienced an unrecoverable error

kamhoe

Dabbler
Joined
Mar 30, 2016
Messages
26
I received an email alert on Friday saying:

Current alerts:
* Pool Pool1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I then followed some commands from:
One or more devices has experienced an unrecoverable error

and got:
Code:
root@TrueNAS[~]# smartctl -a /dev/ada1                         
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37240G
Serial Number:    50026B77823D15D2
LU WWN Device Id: 5 0026b7 7823d15d2
Firmware Version: R0105A
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep 17 10:58:15 2021 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       24998
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       98
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       90
194 Temperature_Celsius     0x0022   040   055   000    Old_age   Always       -       40 (Min/Max 30/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
231 SSD_Life_Left           0x0000   071   071   000    Old_age   Offline      -       71
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       40549
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       48933
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       4904
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       297
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       328
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       125836

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported


Pool1 (a mirror of 2 disks) is mainly for my jails, and I am not expert enough to understand these technical values.

Can someone tell me what the error is, and whether this disk is still safe to run or I should replace it immediately?

Thanks a lot!
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
What does zpool status show?

Your pool is now at higher risk because it survives only thanks to the second and last drive in the mirror. There is no time to waste but it is never the time to panic.

If you do not have backups, now is the time to make them.

You do not need to back up the software used by the jails. As for the configs, it is up to you to decide whether they are complex enough to justify backups. What is important is to extract and back up all the data used inside each of these jails.
 

kamhoe

Dabbler
Joined
Mar 30, 2016
Messages
26
Code:
root@TrueNAS[~]# zpool status -v
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:22:27 with 0 errors on Sun Sep  5 00:22:32 2021
config:

    NAME                                                STATE     READ WRITE CKSUM
    Pool1                                               ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        gptid/5424aac3-dbbe-11e8-bf34-f0921cf5831c.eli  ONLINE       0     0     0
        gptid/54bab0a8-dbbe-11e8-bf34-f0921cf5831c.eli  ONLINE       0     0     3

errors: No known data errors


My ada1 (gptid/54bab0a8-dbbe-11e8-bf34-f0921cf5831c.eli) is still showing ONLINE, so I wonder what the error is.

Otherwise, should I just replace it with a new disk and resilver?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Personally I would replace the drive and then put the old drive through a battery of tests to see if there is a problem with it or not. If not, then keep it as a spare (I like to have a spare and it's a good excuse to buy one).

Depending on your level of paranoia: it is only 3 checksum errors, and you could just clear the errors as the drive is still working (at the moment). If the checksum errors come back, though, something needs fixing. Also check that the cables are seated properly and not pulled tight.

I also note that they are SSDs, which are less prone to issues than HDDs.

More importantly, no SMART tests have been run. You need to run a long SMART test on the drive immediately and schedule both short and long tests on a regular basis (I think I run short tests weekly and long tests monthly).
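
As a rough illustration of which values in the smartctl output are worth a look, here is a minimal Python sketch that parses the attribute table and reports any watched attribute with a nonzero raw value. The IDs are the ones in the output posted above. Which IDs matter, and what they are called, varies by drive model, so `WATCHED_IDS` and the function names are assumptions for illustration only.

```python
# Sketch only: parse the attribute table from `smartctl -a` text.
# SAMPLE holds a few rows copied from the output posted earlier in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       24998
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
"""

# Assumed set of "interesting" attribute IDs for this particular drive.
WATCHED_IDS = {181, 187, 196, 199, 218}


def parse_attributes(smart_text):
    """Return {attribute_id: (name, raw_value_string)} from the attribute table."""
    attrs = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, then a name, then a 0x flag.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not the string "0"."""
    return {
        attr_id: pair
        for attr_id, pair in attrs.items()
        if attr_id in WATCHED_IDS and pair[1] != "0"
    }


print(nonzero_watched(parse_attributes(SAMPLE)))
```

An empty result only means the drive itself has not logged anything in those counters. It does not rule out a cable or port problem.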
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
So the drive suffered 3 checksum errors. They have been fixed and now everything is back in order. As such, your pool is as safe as it was before that.
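
To show where the "3" comes from, here is a small Python sketch that reads the config table of a `zpool status` output and lists the devices with a nonzero READ, WRITE or CKSUM counter. The sample text is copied from the post above. The function name is made up for illustration, and the parsing assumes plain integer counters (large counts that ZFS abbreviates, such as `1.2K`, are skipped).

```python
# Sketch only: find devices with nonzero error counters in `zpool status` text.
SAMPLE_STATUS = """\
  pool: Pool1
 state: ONLINE
config:

    NAME                                                STATE     READ WRITE CKSUM
    Pool1                                               ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        gptid/5424aac3-dbbe-11e8-bf34-f0921cf5831c.eli  ONLINE       0     0     0
        gptid/54bab0a8-dbbe-11e8-bf34-f0921cf5831c.eli  ONLINE       0     0     3

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {device_name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # the device table starts after this header row
            continue
        if line.strip().startswith("errors:"):
            in_config = False  # the device table has ended
        # Only rows with three plain integer counters are considered.
        if in_config and len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts = tuple(int(f) for f in fields[2:5])
            if any(counts):
                flagged[fields[0]] = counts
    return flagged


print(devices_with_errors(SAMPLE_STATUS))
```

The counters only say which device returned data that failed its checksum. They do not say whether the disk, the cable or the port was at fault.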

Now the need is to understand how and why the disk ended up with such errors. The problem may be the cabling, port, bus, disk, and so on. There are many possible explanations.

So the first thing I suggest is that you plan, implement and test your backup procedures. You always need backups in all cases, even when your server is at its best. Second, I would keep monitoring the situation to see if anything changes.

SSDs and magnetic drives are different and are not tested / evaluated / managed the same way. But the need to ultimately replace one exists for both. If you do not already have a spare at hand, it would be another thing to get ready before facing a potential emergency.

But again, and even more so given the zpool status output you posted, there is no need to panic.
 

kamhoe

Dabbler
Joined
Mar 30, 2016
Messages
26
I just checked my cron jobs, and plexdata-plexpass is backed up regularly.

I don't have many jails installed; the ones I care most about are Plex and Nextcloud.
I keep Nextcloud data in Pool2 and Plex media in Pool3.

As suggested by NugentS, I have now scheduled S.M.A.R.T. tests: short weekly and long monthly.

I also just ran a long test on ada1:
Code:
root@TrueNAS[~]# smartctl -t long /dev/ada1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 10 minutes for test to complete.
Test will complete after Mon Sep 20 11:20:13 2021 MDT
Use smartctl -X to abort test.

root@TrueNAS[~]# smartctl -a /dev/ada1
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37240G
Serial Number:    50026B77823D15D2
LU WWN Device Id: 5 0026b7 7823d15d2
Firmware Version: R0105A
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 20 11:23:23 2021 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25070
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       98
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       90
194 Temperature_Celsius     0x0022   040   055   000    Old_age   Always       -       40 (Min/Max 30/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0
231 SSD_Life_Left           0x0000   071   071   000    Old_age   Offline      -       71
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       40802
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       49074
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       4945
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       298
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       329
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       126286

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     25070         -

Selective Self-tests/Logging not supported


I am thinking of adding a third disk to the Pool1 mirror. Or am I only allowed to create a new vdev? Can I use the one SSD that I currently have on hand, or do I need two SSDs for a new vdev?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
You need a new vdev to increase the capacity. I have never tried a 3-way mirror. The 2nd vdev ought to be identical to the first.

Actually, you have made me think of something. I normally intensely dislike hot spares: they just put wear and tear on the disk for limited purpose. However, I can't think why I would object to an SSD hot spare. Time powered up doesn't bother me; it's just writes that do, and a spare doesn't get any.

I might even add a 3rd disk as a hot spare to the vdev if I didn't want to buy 2 more. You can (I think) always remove a hot spare later on.
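
For reference, the three options discussed here map to different `zpool` subcommands. The Python sketch below only builds and prints the command lines; it does not run anything. The new disk names are hypothetical placeholders. On TrueNAS, and especially with an encrypted pool like this one (note the `.eli` suffix), the web UI is the usual way to make these changes, so treat this as an illustration of the differences, not a procedure.

```python
# Sketch only: build (not run) the zpool command for each option discussed.
import shlex

POOL = "Pool1"
# Existing mirror member, taken from the zpool status output earlier in the thread.
EXISTING = "gptid/5424aac3-dbbe-11e8-bf34-f0921cf5831c.eli"
NEW_DISK = "gptid/NEW-DISK-1"    # hypothetical placeholder
NEW_DISK_2 = "gptid/NEW-DISK-2"  # hypothetical placeholder


def three_way_mirror_cmd(pool, existing, new):
    """One extra disk, attached to the existing mirror: more redundancy, same capacity."""
    return ["zpool", "attach", pool, existing, new]


def hot_spare_cmd(pool, new):
    """One extra disk, added as a hot spare: idle until a mirror member fails."""
    return ["zpool", "add", pool, "spare", new]


def new_mirror_vdev_cmd(pool, disk_a, disk_b):
    """Two extra disks, added as a second mirror vdev: more capacity."""
    return ["zpool", "add", pool, "mirror", disk_a, disk_b]


for cmd in (
    three_way_mirror_cmd(POOL, EXISTING, NEW_DISK),
    hot_spare_cmd(POOL, NEW_DISK),
    new_mirror_vdev_cmd(POOL, NEW_DISK, NEW_DISK_2),
):
    print(shlex.join(cmd))
```

So a single spare SSD is enough for either a 3-way mirror or a hot spare; only the capacity increase needs a second disk.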

@kamhoe I would simply reseat the drives (with the NAS turned off) and reroute the cables as necessary to ensure that they aren't stressed. Then boot, and if everything comes up properly, clear the error counters and wait to see if it happens again. If it does, the first thing to do is replace the cable.
 

kamhoe

Dabbler
Joined
Mar 30, 2016
Messages
26
I shut the system down, turned it back on, and received an email saying:

The following alert has been cleared:
* Pool Pool1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

@NugentS I guess I will wait until this weekend to clean up my system a little and install the 3rd SSD as a hot spare, or wait until I see a good deal on more SSDs. Thanks!
 

kamhoe

Dabbler
Joined
Mar 30, 2016
Messages
26
I would conclude that it was the SATA cable that caused the error!
 