SOLVED Pool says it is online but unhealthy

jordanthompson · Oct 29, 2022

I ran zpool status:

Code:

 zpool status -v pool

Code:

  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 07:11:55 with 0 errors on Sun Oct 16 07:12:03 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       3     0     0
            gptid/b84ae8a3-0e68-11ed-959c-14dae9124c74  ONLINE      10     0     0
            gptid/5a57cde6-2316-11ed-9eea-14dae9124c74  ONLINE       4     0     0
            gptid/fd3b142d-b916-11ec-85a6-14dae9124c74  ONLINE       2     0     0
            gptid/76e34b28-ac3c-11ec-8473-14dae9124c74  ONLINE       6     0     0
            gptid/2fd48a3c-23c2-11ed-9eea-14dae9124c74  ONLINE       2     0     0
            gptid/3e962c04-0461-11ed-9bd2-14dae9124c74  ONLINE       4     0     0
            gptid/291369e9-0df9-11ed-959c-14dae9124c74  ONLINE       6     0     0
            gptid/24a59418-0616-11ed-af63-14dae9124c74  ONLINE       8     0     0

errors: No known data errors

Does this mean my drive is already beginning to fail? (They are all new drives.)

ChrisRJ · Oct 29, 2022

A new drive is noticeably more likely to fail than one that has been running 24x7 for 2 years (early-life failures). On the other hand, it is not very likely that 8 new drives fail at pretty much the same time. So the read errors you see are a symptom, but not necessarily of all the drives going bad. It is more likely that a shared component such as a controller or the PSU has gone bad. Or perhaps the cables were disturbed when closing the case, or ...
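
The reasoning above (errors on every drive point at something the drives share) can be checked mechanically. This is only a minimal sketch: summarize_pool_errors is a hypothetical helper name, it reads zpool status text on stdin, and it assumes the leaf devices appear as gptid/... rows as in the output at the top of this thread.

```sh
#!/bin/sh
# Count how many leaf devices in `zpool status` output have a non-zero
# READ, WRITE or CKSUM counter. Hypothetical helper; reads stdin.
# Assumption: leaf devices are the rows whose name starts with gptid/.
summarize_pool_errors() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        in_config && NF == 0          { in_config = 0 }        # blank line ends the table
        in_config && $1 ~ /^gptid\// {
            total++
            if ($3 + $4 + $5 > 0) bad++
        }
        END { printf "%d of %d devices report errors\n", bad, total }
    '
}

# Example:
#   zpool status -v pool | summarize_pool_errors
```

If every device reports errors, as in the status posted above, a shared component (cabling, controller, PSU) is a more plausible suspect than the drives themselves.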

jordanthompson · Oct 29, 2022

What tests can I run on the other hardware to help eliminate everything else?

ChrisRJ · Oct 29, 2022

The very first thing is to make a backup and verify that restore works.

To eliminate things you can carefully re-seat all cables. If that does not help, replacing components one by one (never more than one at a time!) is the easiest.

Also, you can run zpool clear to reset the error counters. This only removes the warning; it does not correct anything, so keep an eye on the counters afterwards.
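
The clear-and-watch step could look roughly like the following. This is a sketch, not a prescribed procedure: the run wrapper, the clear_and_recheck name and the POOL and DRY_RUN variables are illustrative assumptions, and the pool name defaults to the one used in this thread.

```sh
#!/bin/sh
# Sketch: reset the error counters, then let ZFS re-read everything.
# POOL and DRY_RUN are illustrative variables, not from the thread.
POOL="${POOL:-pool}"

# Print each command; skip execution when DRY_RUN is set.
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

clear_and_recheck() {
    run zpool clear "$POOL"       # resets the READ/WRITE/CKSUM counters only
    run zpool scrub "$POOL"       # starts a scrub that re-reads and verifies all data
    run zpool status -v "$POOL"   # shows scrub progress; run again once it has finished
}

# Example:
#   DRY_RUN=1 clear_and_recheck   # print the commands without running them
```

If the counters start climbing again after the clear, the underlying problem is still there.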

jordanthompson · Oct 29, 2022

So, that brings up another (seemingly unrelated) question: what backup/restore method should I be using? I am currently running an rsync task (as defined in Tasks->Rsync Tasks) to an old Netgear ReadyNAS over the network. I am also running an rsync script that copies the files to a hard disk attached to the TrueNAS and then verifies the files are intact.

If I try to restore any/all of the files, should I be worried about corrupting any existing files? Which set of files should I be using for my restore?
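
For illustration only, a verify step of the kind described above could be as small as the following. verify_copy is a hypothetical helper, not the script actually in use here. Pointing it at a scratch directory that a restore was written into allows a restore to be checked without touching the live files.

```sh
#!/bin/sh
# verify_copy SRC DST: succeed only if DST holds the same files with the
# same contents as SRC. Hypothetical helper, not the poster's script.
verify_copy() {
    src=$1
    dst=$2
    if diff -rq "$src" "$dst" >/dev/null; then
        echo "OK: $dst matches $src"
    else
        echo "MISMATCH: $dst differs from $src"
        return 1
    fi
}

# Example (paths are placeholders):
#   verify_copy /mnt/pool/data /mnt/backupdisk/data
```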

jordanthompson · Nov 4, 2022

I powered down the NAS and reseated all of the drive cables. I powered it back on and the pool is reporting healthy.

I ran the long test on the drive in question and got the following:

===============================================
Device: da5
Device Model: WDC WD20EFZX-68AWUN0
Serial Number: Serial Number
had the following errors:
===============================================
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 00 c6 93 d3 f0 40 00 Error: IDNF at LBA = 0xc693d3f0 = 3331576816

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 30 00 00 00 00 c6 93 d3 f0 40 00 29d+22:53:10.909 WRITE FPDMA QUEUED
60 00 08 00 00 00 00 b4 09 8d 68 40 00 29d+22:53:09.830 READ FPDMA QUEUED
60 00 08 00 38 00 00 bf 3b 0e b0 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 30 00 00 c6 91 fc 58 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 28 00 00 c6 84 45 48 40 00 29d+22:53:09.024 READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 c6 82 e9 40 40 00 Error: UNC at LBA = 0xc682e940 = 3330468160

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 c6 82 e9 40 40 00 29d+22:50:10.570 READ FPDMA QUEUED
60 00 08 00 00 00 00 b7 82 58 d8 40 00 29d+22:50:04.406 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8a 4c 40 40 00 29d+22:50:04.400 READ FPDMA QUEUED
60 00 20 00 00 00 00 c6 89 d4 30 40 00 29d+22:50:03.395 READ FPDMA QUEUED
61 00 08 00 00 00 00 c6 93 97 58 40 00 29d+22:49:58.026 WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 b5 8f e6 b8 40 00 Error: UNC at LBA = 0xb58fe6b8 = 3046106808

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 b5 8f e6 b8 40 00 29d+22:49:15.815 READ FPDMA QUEUED
60 00 10 00 08 00 00 c6 82 e9 60 40 00 29d+22:49:15.792 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8c 7a a8 40 00 29d+22:49:15.785 READ FPDMA QUEUED
60 00 20 00 08 00 00 ba 77 5c c0 40 00 29d+22:49:13.474 READ FPDMA QUEUED
60 00 20 00 00 00 00 bd b4 08 d8 40 00 29d+22:49:13.466 READ FPDMA QUEUED

Funny thing is that the pool is still registering as healthy!

It looks like the only errors are occurring at power-on?
Is the drive failing?

Alecmascot · Nov 4, 2022

Is the SMART test unable to report the drive serial number, or did you hide it?

jordanthompson · Nov 4, 2022

LOL... I have a cron job that runs and checks all of my drives and reports the results; apparently there was a bug in displaying the serial number.
When I run from the command line I get this (same error as I initially reported for this drive). It's curious that the time (2233 hours) of the error hasn't changed (even since I originally created this post). Is this an old error that has been cleared but is still being reported?

phong% sudo smartctl -l xerror /dev/da5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 00 c6 93 d3 f0 40 00 Error: IDNF at LBA = 0xc693d3f0 = 3331576816

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 30 00 00 00 00 c6 93 d3 f0 40 00 29d+22:53:10.909 WRITE FPDMA QUEUED
60 00 08 00 00 00 00 b4 09 8d 68 40 00 29d+22:53:09.830 READ FPDMA QUEUED
60 00 08 00 38 00 00 bf 3b 0e b0 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 30 00 00 c6 91 fc 58 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 28 00 00 c6 84 45 48 40 00 29d+22:53:09.024 READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 c6 82 e9 40 40 00 Error: UNC at LBA = 0xc682e940 = 3330468160

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 c6 82 e9 40 40 00 29d+22:50:10.570 READ FPDMA QUEUED
60 00 08 00 00 00 00 b7 82 58 d8 40 00 29d+22:50:04.406 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8a 4c 40 40 00 29d+22:50:04.400 READ FPDMA QUEUED
60 00 20 00 00 00 00 c6 89 d4 30 40 00 29d+22:50:03.395 READ FPDMA QUEUED
61 00 08 00 00 00 00 c6 93 97 58 40 00 29d+22:49:58.026 WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 b5 8f e6 b8 40 00 Error: UNC at LBA = 0xb58fe6b8 = 3046106808

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 b5 8f e6 b8 40 00 29d+22:49:15.815 READ FPDMA QUEUED
60 00 10 00 08 00 00 c6 82 e9 60 40 00 29d+22:49:15.792 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8c 7a a8 40 00 29d+22:49:15.785 READ FPDMA QUEUED
60 00 20 00 08 00 00 ba 77 5c c0 40 00 29d+22:49:13.474 READ FPDMA QUEUED

Alecmascot · Nov 4, 2022

That's historical log data.
smartctl -a /dev/d5 will get everything
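
One way to see how old a logged error is: compare the power-on hour stamped on the error with the drive's current Power_On_Hours attribute. With the numbers posted in this thread (errors at 2233 hours, Power_On_Hours of 2321 in the later full report) that is 2321 - 2233 = 88 hours. A minimal sketch; both helper names are hypothetical and both simply parse smartctl text on stdin.

```sh
#!/bin/sh
# Both helpers read smartctl text on stdin; the names are hypothetical.

# Power-on hour stamp of the first (newest) entry in `smartctl -l xerror` output.
latest_error_hours() {
    awk '/occurred at disk power-on lifetime:/ {
        for (i = 1; i < NF; i++)
            if ($i == "lifetime:") { print $(i + 1); exit }
    }'
}

# Current raw Power_On_Hours value from `smartctl -A` (or -a) output.
power_on_hours() {
    awk '$2 == "Power_On_Hours" { print $NF; exit }'
}

# Example (device name as used in this thread):
#   err=$(smartctl -l xerror /dev/da5 | latest_error_hours)
#   now=$(smartctl -A /dev/da5 | power_on_hours)
#   echo "newest logged error is $((now - err)) hours old"
```

If the difference keeps growing between runs, no new errors are being logged.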

jordanthompson · Nov 4, 2022

I get the following error when I use the -a option:

phong% sudo smartctl -a /dev/d5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/d5: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

It's a SATA device, so I tried this:

phong% sudo smartctl -a /dev/d5 -d ata
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/d5 failed: No such file or directory

FWIW, here is the output of sudo smartctl --scan:

/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/da6 -d scsi # /dev/da6, SCSI device
/dev/da7 -d scsi # /dev/da7, SCSI device
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
/dev/ada2 -d atacam # /dev/ada2, ATA device
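
Taking the device names straight from the scan output avoids typing them by hand. A small sketch, assuming the scan format shown above; scan_devices is a hypothetical helper name.

```sh
#!/bin/sh
# Print only the device paths from `smartctl --scan` output on stdin.
# Hypothetical helper; assumes the path is the first field of each line.
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Example: health summary for every detected drive.
#   smartctl --scan | scan_devices | while read -r dev; do
#       echo "== $dev"
#       smartctl -H "$dev"
#   done
```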

Alecmascot · Nov 5, 2022

smartctl Command in Linux

The smartctl command is a powerful tool that comes with the smartmontools package, designed to interact with and manage the Self-Monitoring, Analysis, and Reporting Technology (SMART) system of your hard drives and solid-state drives.

www.tutorialspoint.com

jordanthompson · Nov 5, 2022

Turns out that I was fat-fingering the command:

smartctl -a /dev/d5

Should have been:

smartctl -a /dev/da5

The results from THIS command are:

smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EFZX-68AWUN0
Serial Number: WD-WX12D41361E7
LU WWN Device Id: 5 0014ee 2bee455bd
Firmware Version: 81.00B81
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Nov 5 09:52:59 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (22560) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 241) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 208 205 021 Pre-fail Always - 2558
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2321
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 4
194 Temperature_Celsius 0x0022 117 107 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2312 -
# 2 Extended offline Completed without error 00% 2303 -
# 3 Short offline Completed without error 00% 2298 -
# 4 Extended offline Completed without error 00% 2290 -
# 5 Short offline Completed without error 00% 2264 -
# 6 Short offline Completed without error 00% 2240 -
# 7 Short offline Completed without error 00% 2234 -
# 8 Short offline Completed without error 00% 2216 -
# 9 Short offline Completed without error 00% 2192 -
#10 Extended offline Interrupted (host reset) 70% 2169 -
#11 Short offline Completed without error 00% 2145 -
#12 Short offline Completed without error 00% 2121 -
#13 Short offline Completed without error 00% 2097 -
#14 Short offline Completed without error 00% 2073 -
#15 Short offline Completed without error 00% 2049 -
#16 Short offline Completed without error 00% 2025 -
#17 Extended offline Interrupted (host reset) 20% 2004 -
#18 Short offline Completed without error 00% 1977 -
#19 Short offline Completed without error 00% 1953 -
#20 Short offline Completed without error 00% 1929 -
#21 Short offline Completed without error 00% 1905 -

Selective Self-tests/Logging not supported

So I assume that the drive is healthy?

Alecmascot · Nov 5, 2022

Looks ok.
If you post data in the future use code tags to preserve formatting.
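
As a rough aid for reading a report like the one above, the raw values of a few commonly watched attributes can be pulled out automatically. This is a sketch under assumptions: the choice of attributes (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, UDMA_CRC_Error_Count) is a common rule of thumb rather than something stated in this thread, and nonzero_key_attributes is a hypothetical helper name.

```sh
#!/bin/sh
# Print NAME=RAW for each watched attribute whose raw value is non-zero.
# Reads `smartctl -A` (or -a) output on stdin; returns 1 if any were found.
# Hypothetical helper; the attribute list is a rule-of-thumb assumption.
nonzero_key_attributes() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $NF + 0 != 0 {
            print $2 "=" $NF
            found = 1
        }
        END { exit (found ? 1 : 0) }
    '
}

# Example:
#   smartctl -A /dev/da5 | nonzero_key_attributes || echo "check this drive"
```

For the report posted above all four raw values are 0, so the helper would print nothing.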

jordanthompson · Nov 8, 2022

@ChrisRJ Thank you! I had zero confidence that your sage wisdom was going to work, but I gave it the Hail Mary. I powered down, reseated all of the cables and rebooted as you suggested. The NAS has been running since with ZERO issues! It has been up for nearly 5 days straight, and before I reseated the cables I would have had errors by now.

ChrisRJ · Nov 9, 2022

Glad that things look ok now. I have a similar situation here and therefore monitor things carefully (in case it is the drive after all).
