SOLVED Boot pool: Unrecoverable error

Stromkompressor

Dabbler
Joined
Mar 13, 2023
Messages
18
Hi!

After one week of downtime (the system was simply powered off), I turned my TrueNAS SCALE system back on today. I manually started a long SMART test on all disks (4 HDDs and 1 SSD as the boot drive). After some time I received this alert:

Code:
TrueNAS @ truenas
New alerts:
    Boot pool status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected..


I checked over SSH to see what was going on:

Code:
admin@truenas[~]$ sudo zpool status -v
[sudo] password for admin:
  pool: boot-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Sat Jun 10 03:45:22 2023
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sdd3      ONLINE       0    12     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:41:34 with 0 errors on Sun Jun  4 00:41:35 2023
config:

    NAME                                      STATE     READ WRITE CKSUM
    tank                                      ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        e73ded72-01b5-488c-9b07-5db6305c3d1f  ONLINE       0     0     0
        2ad2fa7f-6b38-4d95-acd6-df02afdafc0e  ONLINE       0     0     0
        e70e9be2-8007-4bbb-a822-5cf81c50ebfd  ONLINE       0     0     0
        e0886fa9-0284-4117-8cc0-8e682467a884  ONLINE       0     0     0

errors: No known data errors



admin@truenas[~]$ sudo smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTENSO SSD
Serial Number:    AA000000000000014880
Firmware Version: V0718B0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jun 18 18:04:58 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03)    Offline data collection activity
                    is in progress.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (  41)    The self-test routine was interrupted
                    by the host with a hard or soft reset.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.
SCT capabilities:            (0x0001)    SCT Status supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       995
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       42
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       14
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       163
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       2
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5050
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       3858
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       6394
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       2640

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%       992         -
# 2  Extended offline    Interrupted (host reset)      80%       991         -
# 3  Short offline       Completed without error       00%       975         -
# 4  Short offline       Completed without error       00%       951         -
# 5  Short offline       Completed without error       00%       927         -
# 6  Short offline       Completed without error       00%       903         -
# 7  Short offline       Completed without error       00%       879         -
# 8  Short offline       Completed without error       00%       855         -
# 9  Short offline       Completed without error       00%       832         -
#10  Short offline       Completed without error       00%       808         -
#11  Short offline       Completed without error       00%       784         -
#12  Extended offline    Completed without error       00%       768         -
#13  Short offline       Completed without error       00%       760         -
#14  Short offline       Completed without error       00%       736         -
#15  Short offline       Completed without error       00%       712         -
#16  Extended offline    Completed without error       00%       706         -
#17  Short offline       Completed without error       00%       688         -
#18  Short offline       Completed without error       00%       677         -
#19  Short offline       Completed without error       00%       675         -
#20  Short offline       Completed without error       00%       655         -
#21  Short offline       Completed without error       00%       631         -

Selective Self-tests/Logging not supported


I am not aware of having interrupted the test, yet smartctl says:

Code:
The self-test routine was interrupted by the host with a hard or soft reset.


I am confused about why zpool status shows write errors for the SSD.
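
For reference, the READ/WRITE/CKSUM columns in `zpool status` are error counters kept by ZFS itself, separate from anything SMART reports, so a device can pass SMART and still show a nonzero count there. A minimal sketch of pulling the affected devices out of that table follows. The function name `devices_with_errors` and the regular expression are illustrative assumptions, not part of any ZFS tooling, and the pattern only handles plain integer counters (not abbreviated ones such as `1.2K`).

```python
import re

# One row of the `zpool status` config table: NAME STATE READ WRITE CKSUM.
# The header row is skipped automatically because its last three fields are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for every row with a nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Config table copied from the boot-pool output above.
sample = """\
    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sdd3      ONLINE       0    12     0
"""

print(devices_with_errors(sample))  # {'sdd3': (0, 12, 0)}
```

In practice the text would come from `sudo zpool status -v` rather than a pasted sample.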
 

Stromkompressor

I will just rerun the SMART test on the SSD, then run a scrub, and see what zpool and smartctl report afterwards.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Run a long SMART test on the SSD and wait about 10 minutes for it to complete. Check the SMART status; I suspect it will look fine. Then run a scrub on the boot pool. You may still see the WRITE errors. If you have no other failures (missing or corrupt files), run the command
Code:
zpool clear boot-pool
which should reset the error counters. If the problem occurs again, then maybe the SSD is suspect, or maybe you dropped power to the NAS before it had completely shut down.
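
The steps above can be written out as an ordered command list. This is only a sketch: the helper names `recovery_commands` and `run` are made up for illustration, the device `/dev/sdd` and pool `boot-pool` are taken from this thread, and the script defaults to a dry run that just prints the commands. Running them for real needs root, and you would still have to wait for the self-test and the scrub to finish before moving on to the next step.

```python
import subprocess


def recovery_commands(device, pool):
    """Ordered argv lists: long SMART test, SMART report, scrub, status, clear."""
    return [
        ["smartctl", "-t", "long", device],  # start the extended self-test
        ["smartctl", "-a", device],          # once finished: full SMART report
        ["zpool", "scrub", pool],            # re-read and verify the pool's data
        ["zpool", "status", "-v", pool],     # inspect counters and scrub result
        ["zpool", "clear", pool],            # reset the READ/WRITE/CKSUM counters
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("+", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run: prints the five commands without touching the system.
run(recovery_commands("/dev/sdd", "boot-pool"))
```

Only clear the counters after the SMART report and the scrub both look clean; otherwise the nonzero count is still useful evidence.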
 

Stromkompressor

Yup, worked, thanks. Since this is "only" the boot drive and I immediately get alerts when something begins to go wrong, I will just monitor it.

Code:
Self-test execution status:      (   0)    The previous self-test routine completed without error or no self-test has ever been run.



Code:
  scan: scrub repaired 0B in 00:00:22 with 0 errors on Sun Jun 18 20:01:48 2023
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sdd3      ONLINE       0     0     0
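
For the ongoing monitoring, one thing worth watching besides the zpool counters is whether self-tests keep getting interrupted. A rough sketch of scanning the self-test log section of `smartctl -a` output is shown here. The function name `parse_selftest_log` and the regular expression are assumptions based on the log layout posted earlier in this thread; other drives or smartctl versions may format the table differently.

```python
import re

# One self-test log row, e.g.
# "# 1  Short offline       Interrupted (host reset)      90%       992         -"
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test log row found in the given text."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            num, test, status, remaining, hours, lba = match.groups()
            entries.append({
                "num": int(num),
                "test": test,
                "status": status,
                "remaining_pct": int(remaining),
                "lifetime_hours": int(hours),
                "first_error_lba": lba,
            })
    return entries


# First three rows of the self-test log from the original post.
log = """\
# 1  Short offline       Interrupted (host reset)      90%       992         -
# 2  Extended offline    Interrupted (host reset)      80%       991         -
# 3  Short offline       Completed without error       00%       975         -
"""

interrupted = [e for e in parse_selftest_log(log)
               if e["status"].startswith("Interrupted")]
for entry in interrupted:
    print(entry["num"], entry["test"], "at", entry["lifetime_hours"], "hours")
```

If interrupted entries keep appearing after the rerun completed cleanly, that would point more towards the drive or its connection than towards a one-off event.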
 