sudden unhealthy pool

AJinNJ · Sep 1, 2022

Been running FreeNAS -> TrueNAS for a long time, but I'm still a bit of a novice (it's been very good to me and I haven't had to mess around with it much)...

Just received an alert of "Pool pool1 state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." Sure enough, the pool (creatively named "pool1" ;) ) shows unhealthy...

Code:

 zpool status -v pool1
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 08:30:27 with 0 errors on Thu Sep  1 08:30:28 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool1                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a577c0fe-ab6f-11e6-a104-10bf48b6ddea  ONLINE       0     0     0
            gptid/a67b47c3-ab6f-11e6-a104-10bf48b6ddea  ONLINE       0     0     0
            gptid/a77d0f33-ab6f-11e6-a104-10bf48b6ddea  ONLINE       1     0     0

errors: No known data errors
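
For readers who want to pick the affected member out of a longer listing, here is a minimal Python sketch that scans the device rows of the output above for non-zero READ/WRITE/CKSUM counters. The function name `devices_with_errors` and the regular expression are illustrative assumptions, not part of ZFS or TrueNAS, and the sketch only handles plain integer counters as they appear here.

```python
import re

# Device rows copied from the `zpool status` output above.
STATUS_ROWS = """\
        NAME                                            STATE     READ WRITE CKSUM
        pool1                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a577c0fe-ab6f-11e6-a104-10bf48b6ddea  ONLINE       0     0     0
            gptid/a67b47c3-ab6f-11e6-a104-10bf48b6ddea  ONLINE       0     0     0
            gptid/a77d0f33-ab6f-11e6-a104-10bf48b6ddea  ONLINE       1     0     0
"""

# name, state, then three integer counters; the header row does not match
# because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


print(devices_with_errors(STATUS_ROWS))
```

Here only the third raidz1 member is flagged, with a single read error and no write or checksum errors.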

About an hour later I got the next alert: "Device: /dev/ada2, ATA error count increased from 0 to 5."

Code:

smartctl -a /dev/ada2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N4AJUXKZ
LU WWN Device Id: 5 0014ee 2630f6d04
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Sep  1 14:41:37 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       6
  3 Spin_Up_Time            0x0027   183   180   021    Pre-fail  Always       -       5850
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       61
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       45932
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       589
194 Temperature_Celsius     0x0022   118   112   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 f5 18 e6  Error: UNC 128 sectors at LBA = 0x0618f558 = 102298968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 50 f5 18 e6 00   4d+05:34:28.543  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:25.147  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:21.750  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:18.343  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:14.973  READ DMA

Error 4 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 f5 18 e6  Error: UNC 128 sectors at LBA = 0x0618f558 = 102298968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 50 f5 18 e6 00   4d+05:34:25.147  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:21.750  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:18.343  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:14.973  READ DMA

Error 3 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 f5 18 e6  Error: UNC 128 sectors at LBA = 0x0618f558 = 102298968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 50 f5 18 e6 00   4d+05:34:21.750  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:18.343  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:14.973  READ DMA
  c8 00 80 d0 f2 18 e6 00   4d+05:34:14.961  READ DMA

Error 2 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 f5 18 e6  Error: UNC 128 sectors at LBA = 0x0618f558 = 102298968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 50 f5 18 e6 00   4d+05:34:18.343  READ DMA
  c8 00 80 50 f5 18 e6 00   4d+05:34:14.973  READ DMA
  c8 00 80 d0 f2 18 e6 00   4d+05:34:14.961  READ DMA
  c8 00 80 50 f2 18 e6 00   4d+05:34:14.961  READ DMA

Error 1 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 80 58 f5 18 e6  Error: UNC 128 sectors at LBA = 0x0618f558 = 102298968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 50 f5 18 e6 00   4d+05:34:14.973  READ DMA
  c8 00 80 d0 f2 18 e6 00   4d+05:34:14.961  READ DMA
  c8 00 80 50 f2 18 e6 00   4d+05:34:14.961  READ DMA
  c8 00 80 d0 f1 18 e6 00   4d+05:34:14.961  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     45823         -
# 2  Short offline       Completed without error       00%     45655         -
# 3  Short offline       Completed without error       00%     45487         -
# 4  Short offline       Completed without error       00%     45320         -
# 5  Short offline       Completed without error       00%     45152         -
# 6  Short offline       Completed without error       00%     44984         -
# 7  Short offline       Completed without error       00%     44816         -
# 8  Short offline       Completed without error       00%     44648         -
# 9  Short offline       Completed without error       00%     44480         -
#10  Short offline       Completed without error       00%     44313         -
#11  Short offline       Completed without error       00%     44145         -
#12  Short offline       Completed without error       00%     43977         -
#13  Short offline       Completed without error       00%     43809         -
#14  Short offline       Completed without error       00%     43641         -
#15  Short offline       Completed without error       00%     43474         -
#16  Short offline       Completed without error       00%     43314         -
#17  Short offline       Completed without error       00%     43138         -
#18  Short offline       Completed without error       00%     42974         -
#19  Short offline       Completed without error       00%     42802         -
#20  Short offline       Completed without error       00%     42634         -
#21  Short offline       Completed without error       00%     42467         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
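
As a side note on reading the error log above: all five errors report the same LBA, and that number can be rebuilt from the register row. The following is a minimal Python sketch assuming standard 28-bit LBA addressing (SN, CL and CH carry the low three bytes, and the low nibble of DH carries the top four bits). The function name `lba28_from_registers` is made up for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble
    of DH holds bits 24-27 (the high nibble of DH carries mode flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the "After command completion" row of Error 5.
lba = lba28_from_registers(sn=0x58, cl=0xF5, ch=0x18, dh=0xE6)
print(hex(lba), lba)

SECTOR_BYTES = 512  # logical sector size reported in the information section
print(f"{lba * SECTOR_BYTES / 1e9:.1f} GB from the start of the disk")
```

The result matches the `LBA = 0x0618f558 = 102298968` that smartctl prints, which suggests the five logged errors are repeated attempts to read the same spot rather than five separate bad areas.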

Those are the details I have... can anyone help me understand whether I should replace ada2 ASAP, or try some other troubleshooting first?

Thanks very much,
AJ

Heracles · Sep 1, 2022

AJinNJ said:
Error 1 occurred at disk power-on lifetime: 45931 hours (1913 days + 19 hours)

So the drive has been on for 5 years and 3 months. From that, I suspect that it is not under warranty anymore...
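
The arithmetic behind that estimate can be checked with a short Python sketch. The 365.25-day year is an averaging assumption, so the result is approximate.

```python
POWER_ON_HOURS = 45931  # from "Error 1 occurred at disk power-on lifetime"

# Whole days plus leftover hours, as smartctl prints them.
days, hours = divmod(POWER_ON_HOURS, 24)

# Approximate years, using an average year length of 365.25 days.
years = POWER_ON_HOURS / (24 * 365.25)

print(f"{days} days + {hours} hours, about {years:.2f} years")
```

This reproduces the "1913 days + 19 hours" in the log and comes out at roughly five and a quarter years of power-on time.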

So that not-so-young drive just had its first bad block.

Because your pool is RAID-Z1, your redundancy is minimal. How are your backups? Have you ever tried a restore to be sure they actually work? Now would be a good time to test them.

From here, anything can happen. Things may stay as they are, with no more errors on any drive, and the pool will remain good. But it can also turn catastrophic if the other drives are about the same age and start suffering failures that happen to affect the same data.

Had you been on RAID-Z2, waiting would have been less dangerous. Because you are on RAID-Z1, I would not waste a minute and would start getting a new drive. If you can resilver without removing the current drive, that would be ideal. If not, make sure your backups are good before dropping your pool to no redundancy.

AJinNJ · Sep 1, 2022

@Heracles you are correct, that's a WD Red - the old model, now relabeled the WD Red Plus - and those have 3 year warranties. (I think even the WD Red Pro "only" gets you 5 years...not sure there's any that go higher...)

I just ordered a replacement, which will take a day to get here.

I also understand I would be much better off with RAID-Z2. Unfortunately, I am where I am.

Are you saying there is a procedure to add the new drive without removing the failing one? (I guess then subsequently remove the failed one.)

If you would be so kind, can you please provide a link (or if it's easy/preferable, paste it in this thread)? I understand what you're saying about replacing the drive placing additional stress on the others - which, you are correct, I verified to be similar ages. So the best path forward (safest path, given my current situation) would be to add the new drive while all 3 of the current pool are online, right? But somehow designate the new disk to be a replacement for ada2?

Thanks again,
AJ

PS: The data on that pool is currently backed-up to the cloud. Thus, the data is recoverable - but it would probably be painfully slow.

Heracles · Sep 1, 2022

AJinNJ said:
Are you saying there is a procedure to add the new drive without removing the failing one? (I guess then subsequently remove the failed one.)

Yep... All you need is a free slot in your server so you can plug in that extra drive alongside all your others.

You first need to plug that new drive into your server. If your server is not hot-plug capable, you need to power it down for that. Be aware that powering down a server whose disks have been running for a very long time is risky: they may not come back. So before powering down, be sure your backups are good.

Once your server is up with all 4 drives, go to the disk section of the pool's menu and choose to replace the failing drive with the new one. That way, the new drive is resilvered while the problematic drive stays in place, and the old drive is only dropped once the resilvering process is completed.
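
For context only, the menu action described above corresponds to the ZFS operation `zpool replace <pool> <old-device> <new-device>`. The thread recommends doing this through the TrueNAS menu, so the Python sketch below only builds and prints the command as a dry run and does not execute anything. The function name `build_replace_command` and the new-device identifier are hypothetical placeholders; the old-device identifier is the member showing READ=1 in the first post.

```python
import shlex


def build_replace_command(pool, old_device, new_device):
    """Return the argv list for `zpool replace` without running it."""
    return ["zpool", "replace", pool, old_device, new_device]


cmd = build_replace_command(
    pool="pool1",
    # Member showing READ=1 in the `zpool status` output from the first post.
    old_device="gptid/a77d0f33-ab6f-11e6-a104-10bf48b6ddea",
    # Hypothetical placeholder: the real identifier depends on the new disk.
    new_device="gptid/NEW-DISK-PLACEHOLDER",
)

# Dry run: print the command instead of executing it.
print(shlex.join(cmd))
```

Because both devices are named, ZFS keeps the old member attached while the new one resilvers, which is the behaviour the post describes.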

AJinNJ · Sep 1, 2022

Heracles said:
You first need to plug that new drive into your server. If your server is not hot-plug capable, you need to power it down for that. Be aware that powering down a server whose disks have been running for a very long time is risky: they may not come back. So before powering down, be sure your backups are good.

Yeah, I've seen that movie before ;)

Since it uses GUID in many places, and I don't think it's hot-pluggable (but I'll double check), can I use a USB-to-SATA cable temporarily, just to perform the initial sync? Like: 1) Add the new drive via a USB adapter. 2) Perform "Replace Disk" via the Pool Management menu. 3) Once the resilver is complete, remove the failing disk: power down and physically replace the disk in the server. 4) Power up.

Or would that procedure be more trouble than it's worth?

souporman · Sep 1, 2022

AJinNJ said:
Yeah, I've seen that movie before ;)

Since it uses GUID in many places, and I don't think it's hot-pluggable (but I'll double check), can I use a USB-to-SATA cable temporarily, just to perform the initial sync? Like: 1) Add the new drive via a USB adapter. 2) Perform "Replace Disk" via the Pool Management menu. 3) Once the resilver is complete, remove the failing disk: power down and physically replace the disk in the server. 4) Power up.

Or would that procedure be more trouble than it's worth?

You would probably increase your chances of losing everything if you did that. Don't use USB-to-SATA for any reason in TrueNAS. Think of it as a significantly worse way of connecting your drives, because it is. Would you want to connect a significantly worse drive to your pool in an attempt to resilver? No. If/when that significantly worse drive fails for any reason during the resilver, your pool is 100% dead. Just do what the guy above said:
1. Back up your data if you can (this is a critical first step).
2. Replace the dead drive with another SATA drive using a SATA cable only.

What you are proposing would be significantly more risky than powering down and replacing the disk. Really, though... you gotta get that data backed up before you do anything. I'd sooner buy an X TB external USB drive and copy everything to that than just shut it down and cross my fingers that everything works out all right... but you have to decide if you care enough about this data to spend the money.

AJinNJ · Sep 1, 2022

@souporman Understood. My idea is a no-go.

As I said, that pool/mount is backed up to the cloud (and I have checked the consistency of that). It would be a painful restore, but that cloud backup is meant for Disaster Recovery, so I knew it would be slow when I chose it.

Making a temporary copy to a local USB drive via rsync or something similar is a great idea. I might have that available already.

Thx
