SOLVED Pool says it is online but unhealthy

Joined
Mar 5, 2022
Messages
224
I ran zpool status:

Code:
 zpool status -v pool
Code:
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 07:11:55 with 0 errors on Sun Oct 16 07:12:03 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       3     0     0
            gptid/b84ae8a3-0e68-11ed-959c-14dae9124c74  ONLINE      10     0     0
            gptid/5a57cde6-2316-11ed-9eea-14dae9124c74  ONLINE       4     0     0
            gptid/fd3b142d-b916-11ec-85a6-14dae9124c74  ONLINE       2     0     0
            gptid/76e34b28-ac3c-11ec-8473-14dae9124c74  ONLINE       6     0     0
            gptid/2fd48a3c-23c2-11ed-9eea-14dae9124c74  ONLINE       2     0     0
            gptid/3e962c04-0461-11ed-9bd2-14dae9124c74  ONLINE       4     0     0
            gptid/291369e9-0df9-11ed-959c-14dae9124c74  ONLINE       6     0     0
            gptid/24a59418-0616-11ed-af63-14dae9124c74  ONLINE       8     0     0

errors: No known data errors


Does this mean my drives are beginning to fail already? (They are all new drives.)
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
A new drive is considerably more likely to fail than one that has been running 24x7 for 2 years. On the other hand, it is not very likely that 8 new drives fail at pretty much the same time. Therefore the read errors you see are clearly a symptom, but not necessarily caused by all drives going bad. It is more likely that a controller or the PSU has gone bad. Or perhaps the cables were disturbed when closing the case, or ...
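
To make that reasoning concrete, here is a minimal Python sketch that reads the config table from the `zpool status` output posted above and counts how many leaf devices show non-zero counters. The helper names (`leaf_error_counts`, `affected_devices`) are made up for illustration, and the regex only assumes the column layout seen in this thread. Large counters that `zpool status` abbreviates with a suffix would be skipped by it.

```python
import re

# Config table copied from the `zpool status` output earlier in this thread.
STATUS_SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        pool                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       3     0     0
            gptid/b84ae8a3-0e68-11ed-959c-14dae9124c74  ONLINE      10     0     0
            gptid/5a57cde6-2316-11ed-9eea-14dae9124c74  ONLINE       4     0     0
            gptid/fd3b142d-b916-11ec-85a6-14dae9124c74  ONLINE       2     0     0
            gptid/76e34b28-ac3c-11ec-8473-14dae9124c74  ONLINE       6     0     0
            gptid/2fd48a3c-23c2-11ed-9eea-14dae9124c74  ONLINE       2     0     0
            gptid/3e962c04-0461-11ed-9bd2-14dae9124c74  ONLINE       4     0     0
            gptid/291369e9-0df9-11ed-959c-14dae9124c74  ONLINE       6     0     0
            gptid/24a59418-0616-11ed-af63-14dae9124c74  ONLINE       8     0     0
"""

# name, state, then three plain integer counters (assumed layout).
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def leaf_error_counts(status_text):
    """Return {device: (read, write, cksum)} for the gptid/... leaf devices."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and match.group(1).startswith("gptid/"):
            counts[match.group(1)] = tuple(int(g) for g in match.groups()[2:])
    return counts


def affected_devices(counts):
    """Return the devices that have at least one non-zero counter."""
    return [device for device, c in counts.items() if any(c)]


counts = leaf_error_counts(STATUS_SAMPLE)
bad = affected_devices(counts)
print(f"{len(bad)} of {len(counts)} drives report errors")
```

When every drive in the vdev reports errors at once, as here, something shared (cabling, controller, power) is a more plausible suspect than eight independent drive failures. The script cannot tell those causes apart. It only shows how widely the errors are spread.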
 
Joined
Mar 5, 2022
Messages
224
What tests can I run on the other hardware to help eliminate everything else?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
The very first thing is to make a backup and verify that restore works.

To eliminate things you can carefully re-seat all cables. If that does not help, replacing components one by one (never more than one at a time!) is the easiest.

Also, you can run zpool clear to reset the error counters. This only removes the warning; it does not fix whatever caused the errors, so keep an eye on the counters afterwards.
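
One way to "watch out" after clearing is to compare two snapshots of the per-device counters and flag anything that has grown. The Python sketch below assumes you already have the counters as `(read, write, cksum)` tuples per device. The function name `new_errors` and the device names in the example are hypothetical.

```python
def new_errors(before, after):
    """Return {device: (d_read, d_write, d_cksum)} for devices whose counters grew."""
    grown = {}
    for device, now in after.items():
        # A device missing from the earlier snapshot is treated as starting at zero.
        then = before.get(device, (0, 0, 0))
        delta = tuple(n - t for n, t in zip(now, then))
        if any(d > 0 for d in delta):
            grown[device] = delta
    return grown


# Hypothetical snapshots: right after `zpool clear`, and at a later check.
after_clear = {"disk-a": (0, 0, 0), "disk-b": (0, 0, 0)}
later = {"disk-a": (0, 0, 0), "disk-b": (2, 0, 0)}
print(new_errors(after_clear, later))
```

If the counters stay at zero after a change such as re-seating cables, that change is a good candidate for the fix. If they climb again, move on to the next component.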
 
Joined
Mar 5, 2022
Messages
224
So, that brings up another (seemingly unrelated) question: what backup/restore method should I be using? I am currently running an rsync task (as defined in Tasks->Rsync Tasks) to an old Netgear ReadyNAS over the network. I am also running an rsync script that copies the files to a hard disk attached to the TrueNAS and then verifies the files are intact.

If I try to restore any/all of the files, should I be worried about corrupting any existing files? Which set of files should I be using for my restore?
 
Joined
Mar 5, 2022
Messages
224
I powered down the NAS and re-seated all of the drive cables. Powered it back on and the pool is reporting healthy.

I ran the long test on the drive in question and got the following:
===============================================
Device: da5
Device Model: WDC WD20EFZX-68AWUN0
Serial Number: Serial Number
had the following errors:
===============================================
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 00 c6 93 d3 f0 40 00 Error: IDNF at LBA = 0xc693d3f0 = 3331576816

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 30 00 00 00 00 c6 93 d3 f0 40 00 29d+22:53:10.909 WRITE FPDMA QUEUED
60 00 08 00 00 00 00 b4 09 8d 68 40 00 29d+22:53:09.830 READ FPDMA QUEUED
60 00 08 00 38 00 00 bf 3b 0e b0 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 30 00 00 c6 91 fc 58 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 28 00 00 c6 84 45 48 40 00 29d+22:53:09.024 READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 c6 82 e9 40 40 00 Error: UNC at LBA = 0xc682e940 = 3330468160

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 c6 82 e9 40 40 00 29d+22:50:10.570 READ FPDMA QUEUED
60 00 08 00 00 00 00 b7 82 58 d8 40 00 29d+22:50:04.406 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8a 4c 40 40 00 29d+22:50:04.400 READ FPDMA QUEUED
60 00 20 00 00 00 00 c6 89 d4 30 40 00 29d+22:50:03.395 READ FPDMA QUEUED
61 00 08 00 00 00 00 c6 93 97 58 40 00 29d+22:49:58.026 WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 b5 8f e6 b8 40 00 Error: UNC at LBA = 0xb58fe6b8 = 3046106808

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 b5 8f e6 b8 40 00 29d+22:49:15.815 READ FPDMA QUEUED
60 00 10 00 08 00 00 c6 82 e9 60 40 00 29d+22:49:15.792 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8c 7a a8 40 00 29d+22:49:15.785 READ FPDMA QUEUED
60 00 20 00 08 00 00 ba 77 5c c0 40 00 29d+22:49:13.474 READ FPDMA QUEUED
60 00 20 00 00 00 00 bd b4 08 d8 40 00 29d+22:49:13.466 READ FPDMA QUEUED

Funny thing is that the pool is still registering as healthy!

It looks like the only errors are occurring at power-on?
Is the drive failing?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Is the SMART test unable to report the drive serial number, or did you hide it?
 
Joined
Mar 5, 2022
Messages
224
LOL... I have a cron job that runs and checks all of my drives and reports the results; apparently there was a bug in displaying the serial number.
When I run from the command line I get this (same error as I initially reported for this drive). It's curious that the time (2233 hours) of the error hasn't changed (even since I originally created this post). Is this an old error that has been cleared but is still being reported?

phong% sudo smartctl -l xerror /dev/da5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 00 c6 93 d3 f0 40 00 Error: IDNF at LBA = 0xc693d3f0 = 3331576816

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 30 00 00 00 00 c6 93 d3 f0 40 00 29d+22:53:10.909 WRITE FPDMA QUEUED
60 00 08 00 00 00 00 b4 09 8d 68 40 00 29d+22:53:09.830 READ FPDMA QUEUED
60 00 08 00 38 00 00 bf 3b 0e b0 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 30 00 00 c6 91 fc 58 40 00 29d+22:53:09.024 READ FPDMA QUEUED
60 00 08 00 28 00 00 c6 84 45 48 40 00 29d+22:53:09.024 READ FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 c6 82 e9 40 40 00 Error: UNC at LBA = 0xc682e940 = 3330468160

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 c6 82 e9 40 40 00 29d+22:50:10.570 READ FPDMA QUEUED
60 00 08 00 00 00 00 b7 82 58 d8 40 00 29d+22:50:04.406 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8a 4c 40 40 00 29d+22:50:04.400 READ FPDMA QUEUED
60 00 20 00 00 00 00 c6 89 d4 30 40 00 29d+22:50:03.395 READ FPDMA QUEUED
61 00 08 00 00 00 00 c6 93 97 58 40 00 29d+22:49:58.026 WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 2233 hours (93 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 b5 8f e6 b8 40 00 Error: UNC at LBA = 0xb58fe6b8 = 3046106808

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 08 00 00 00 00 b5 8f e6 b8 40 00 29d+22:49:15.815 READ FPDMA QUEUED
60 00 10 00 08 00 00 c6 82 e9 60 40 00 29d+22:49:15.792 READ FPDMA QUEUED
60 00 08 00 00 00 00 c6 8c 7a a8 40 00 29d+22:49:15.785 READ FPDMA QUEUED
60 00 20 00 08 00 00 ba 77 5c c0 40 00 29d+22:49:13.474 READ FPDMA QUEUED
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
That's historical log data.
smartctl -a /dev/d5 will get everything
 
Joined
Mar 5, 2022
Messages
224
I get the following error when I use the -a option:
phong% sudo smartctl -a /dev/d5
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/d5: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

It's a SATA device, so I tried this:
phong% sudo smartctl -a /dev/d5 -d ata
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/d5 failed: No such file or directory

FWIW, here is the output of sudo smartctl --scan:
/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/da6 -d scsi # /dev/da6, SCSI device
/dev/da7 -d scsi # /dev/da7, SCSI device
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
/dev/ada2 -d atacam # /dev/ada2, ATA device
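
The scan output above is enough to catch a mistyped device node before passing it to smartctl. Here is a minimal Python sketch, assuming the scan lines keep the `<path> -d <type> # comment` shape shown above. The helper names `known_devices` and `check_device` are made up for this example.

```python
import difflib

# Lines copied from the `smartctl --scan` output above.
SCAN_SAMPLE = """\
/dev/da0 -d scsi # /dev/da0, SCSI device
/dev/da1 -d scsi # /dev/da1, SCSI device
/dev/da2 -d scsi # /dev/da2, SCSI device
/dev/da3 -d scsi # /dev/da3, SCSI device
/dev/da4 -d scsi # /dev/da4, SCSI device
/dev/da5 -d scsi # /dev/da5, SCSI device
/dev/da6 -d scsi # /dev/da6, SCSI device
/dev/da7 -d scsi # /dev/da7, SCSI device
/dev/ada0 -d atacam # /dev/ada0, ATA device
/dev/ada1 -d atacam # /dev/ada1, ATA device
/dev/ada2 -d atacam # /dev/ada2, ATA device
"""


def known_devices(scan_text):
    """Return the device paths listed in the first column of the scan output."""
    devices = []
    for line in scan_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            devices.append(fields[0])
    return devices


def check_device(path, devices):
    """Return (exists, suggestions); suggestions are similar-looking known paths."""
    if path in devices:
        return True, []
    return False, difflib.get_close_matches(path, devices, n=3, cutoff=0.6)


devices = known_devices(SCAN_SAMPLE)
print(check_device("/dev/da5", devices))
print(check_device("/dev/d5", devices))
```

For the mistyped `/dev/d5`, the closest listed path is `/dev/da5`, which is the device the earlier `-l xerror` command was run against.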
 
Joined
Mar 5, 2022
Messages
224
Turns out that I was fat-fingering the command:

smartctl -a /dev/d5
Should have been:
smartctl -a /dev/da5

The results from THIS command are:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EFZX-68AWUN0
Serial Number: WD-WX12D41361E7
LU WWN Device Id: 5 0014ee 2bee455bd
Firmware Version: 81.00B81
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Nov 5 09:52:59 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (22560) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 241) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   208   205   021    Pre-fail  Always       -       2558
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2321
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   117   107   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2312 -
# 2 Extended offline Completed without error 00% 2303 -
# 3 Short offline Completed without error 00% 2298 -
# 4 Extended offline Completed without error 00% 2290 -
# 5 Short offline Completed without error 00% 2264 -
# 6 Short offline Completed without error 00% 2240 -
# 7 Short offline Completed without error 00% 2234 -
# 8 Short offline Completed without error 00% 2216 -
# 9 Short offline Completed without error 00% 2192 -
#10 Extended offline Interrupted (host reset) 70% 2169 -
#11 Short offline Completed without error 00% 2145 -
#12 Short offline Completed without error 00% 2121 -
#13 Short offline Completed without error 00% 2097 -
#14 Short offline Completed without error 00% 2073 -
#15 Short offline Completed without error 00% 2049 -
#16 Short offline Completed without error 00% 2025 -
#17 Extended offline Interrupted (host reset) 20% 2004 -
#18 Short offline Completed without error 00% 1977 -
#19 Short offline Completed without error 00% 1953 -
#20 Short offline Completed without error 00% 1929 -
#21 Short offline Completed without error 00% 1905 -

Selective Self-tests/Logging not supported
So I assume that the drive is healthy?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Looks ok.
If you post data in the future use code tags to preserve formatting.
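
For anyone who wants to repeat this check, here is a rough Python sketch of it. It reads a few rows of the attribute table posted above, looks at some commonly watched raw values, and works out how long ago the logged errors happened from the "2233 hours" lifetime in the error log. Which attributes to watch is an assumption on my part, not a rule from smartmontools, and `parse_attributes` is a hypothetical helper.

```python
# A few rows copied from the `smartctl -a` attribute table in this thread.
ATTR_SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2321
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# Attributes treated as warning signs here (an assumption, not an official list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

# Power-on lifetime reported for all three entries in the extended error log.
ERROR_LOG_HOURS = 2233


def parse_attributes(text):
    """Return {attribute_name: raw_value} for rows whose raw value is a plain integer."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # skip raw values that are not a single integer
    return attrs


attrs = parse_attributes(ATTR_SAMPLE)
nonzero = [name for name in WATCHED if attrs.get(name, 0) != 0]
hours_since_error = attrs["Power_On_Hours"] - ERROR_LOG_HOURS

print("watched attributes with non-zero raw values:", nonzero)
print("hours since the logged errors:", hours_since_error)
```

With the numbers from this thread the watched attributes are all zero and the logged errors are 88 power-on hours old. That fits the "historical log data" reading, and the extended self-tests at 2290 and 2303 hours completed without error after that point. None of this proves the drive is healthy. It only shows that nothing new has been recorded since.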
 
Joined
Mar 5, 2022
Messages
224
@ChrisRJ Thank you! I had zero confidence that your sage wisdom was going to work, but I gave it the Hail Mary. I powered down, re-seated all of the cables and rebooted as you suggested. The NAS has been running since with ZERO issues! It has been up for nearly 5 days straight, and before I re-seated the cables I would certainly have had errors by now.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Glad that things look ok now. I have a similar situation here and therefore monitor things carefully (in case it is the drive after all).
 