SOLVED: After replacing a disk and resilvering, the pool is still in degraded status and the drive says it's still replacing

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
do I have to somehow initialise or create a partition layout on the new drive before it can be used to replace a failing drive?

No, the middleware should do that for you. It creates a small swap partition (2 GB by default, I think), which also allows for slight drive size differences, and then a main ZFS partition for the rest.

People have created those partitions manually when using the CLI to resilver a drive in, but that's exactly what the UI / middleware is there for: so you don't have to do it yourself.

Again: I think, from reading. Have not actually ever replaced a drive, myself.
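For reference, the manual CLI route looks roughly like this. This is only a sketch of what I understand the middleware to do, using the device ada5 and pool nasferatuvolume from this thread; the angle-bracket placeholders are not real values:
Code:
gpart create -s gpt ada5
gpart add -a 4k -s 2G -t freebsd-swap ada5   # small swap partition, absorbs size differences
gpart add -a 4k -t freebsd-zfs ada5          # data partition using the remaining space
glabel status | grep ada5p2                  # note the gptid of the new data partition
zpool replace nasferatuvolume <old-member-id> gptid/<new-gptid>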
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Thanks.
In that case I'll take ada5 offline, take it physically out, wipe it, and replace it with itself.
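For the record, the CLI equivalent of that plan is roughly the following (a sketch only; the GUI's Offline / Wipe / Replace flow does the same thing, and the placeholder stands in for the failing member's id shown in zpool status):
Code:
zpool offline nasferatuvolume <id-or-gptid-of-the-ada5-member>
gpart destroy -F ada5                         # throw away the old partition table
dd if=/dev/zero of=/dev/ada5 bs=1m count=100  # quick-wipe the start of the disk
# then resilver it back in with the Replace action on the pool's status page in the UI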
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
All righty, after wiping ada5 and replacing it with itself, resilvering kicked in.
I might be lucky this time because now the new drive has a gptid assigned to it:
Code:
# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri May  8 17:13:47 2020
    20.3G scanned at 417M/s, 452M issued at 9.04M/s, 25.2T total
    0 resilvered, 0.00% done, no estimated completion time
config:

    NAME                                              STATE     READ WRITE CKSUM
    nasferatuvolume                                   DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        replacing-5                                   OFFLINE      0     0     0
          12067865134856464115                        OFFLINE      0     0     0  was /dev/ada5
          gptid/e460ca0b-9146-11ea-9468-001b21275bb9  ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614    ONLINE       0     0     0

errors: No known data errors


Code:
glabel status
                                      Name  Status  Components
gptid/271e756e-9605-11e5-b66a-382c4abd4614     N/A  ada0p2
gptid/2c32e27b-9605-11e5-b66a-382c4abd4614     N/A  ada1p2
gptid/31362dba-9605-11e5-b66a-382c4abd4614     N/A  ada2p2
gptid/36387c1c-9605-11e5-b66a-382c4abd4614     N/A  ada3p2
gptid/3b7b207e-9605-11e5-b66a-382c4abd4614     N/A  ada4p2
gptid/45e1321a-9605-11e5-b66a-382c4abd4614     N/A  ada6p2
gptid/4b337f7c-9605-11e5-b66a-382c4abd4614     N/A  ada7p2
gptid/2621435c-8701-11ea-9bf7-001b21275bb9     N/A  da0p1
gptid/e460ca0b-9146-11ea-9468-001b21275bb9     N/A  ada5p2
gptid/e44bd287-9146-11ea-9468-001b21275bb9     N/A  ada5p1
gptid/270982c1-9605-11e5-b66a-382c4abd4614     N/A  ada0p1


and gpart finally shows partitions on ada5:
Code:
gpart show ada5
=>         40  15628053088  ada5  GPT  (7.3T)
           40           88        - free -  (44K)
          128      4194304     1  freebsd-swap  (2.0G)
      4194432  15623858696     2  freebsd-zfs  (7.3T)



PS: Before I wiped ada5 I checked its partition layout, and it had only one large partition, so something definitely must have gone wrong during the previous resilvering.
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Oh Yiiiiiiiiiiiiiiisssssssssssssss!!!
Finally things started to look good!



Code:
zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: ONLINE
  scan: resilvered 3.08T in 0 days 13:49:27 with 0 errors on Sat May  9 07:03:14 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/e460ca0b-9146-11ea-9468-001b21275bb9  ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0

errors: No known data errors


I'm going to replace ada3 today.
Hope everything goes well :)
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
All righty. I've replaced the second drive and so far everything looks just fine.
Tomorrow we'll know whether everything went well.

Thank you @Yorick & @sretalla for the priceless tips.
I wish the official documentation https://www.ixsystems.com/documentation/freenas/11.3-U2/storage.html#replacing-a-failed-disk mentioned clearly that before you replace a disk, you should ensure that your pool is in an error-free state.
It would've saved me, and people in a similar situation, a lot of time and trouble.
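For anyone landing here later, a pre-replacement sanity check along those lines could look something like this (a sketch; the pool name is mine):
Code:
zpool status -x                    # "all pools are healthy" is what you want to see
zpool status -v nasferatuvolume    # look for READ/WRITE/CKSUM errors and known data errors
zpool scrub nasferatuvolume        # ideally scrub first and let it finish cleanly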
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
mentioned clearly that before you replace a disk, you should ensure that your pool is in an error-free state

I agree, that'd be helpful.

I am wondering how you ended up with three permanently damaged files. I can see two failing drives leading to a situation where a file can't be repaired, but how did those files get damaged in the first place? Power outage? Another drive with errors?
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
I'm not entirely sure.
I think we have one, at most two, power outages a year. In all cases my UPS did its job and powered the NAS off just in time.

re. errors:
I've been seeing these errors for quite a while, and they were one of the two reasons behind the disk replacement.
The second was the slowly increasing number of bad sectors reported by the SMART tests for both ada3 & ada5.
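(Those numbers come from the scheduled SMART tests; checking a drive by hand looks roughly like this, as a sketch:)
Code:
smartctl -t long /dev/ada5      # start a long self-test
smartctl -l selftest /dev/ada5  # review the self-test log once it has finished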

Here's a rough timeline:

Feb 12th:
Code:
Current alerts:
* Device: /dev/ada5, 8 Currently unreadable (pending) sectors
* Device: /dev/ada5, 8 Offline uncorrectable sectors


Apr 6th:
Code:
Current alerts:
* Device: /dev/ada0, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada0, 8 Offline uncorrectable sectors.
* Device: /dev/ada3, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada3, 8 Offline uncorrectable sectors.
* Device: /dev/ada5, ATA error count increased from 55 to 60.
* Device: /dev/ada5, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada5, 8 Offline uncorrectable sectors.
* Device: /dev/ada6, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada6, 8 Offline uncorrectable sectors.


Apr 26th:
Code:
Current alerts:
* Device: /dev/ada5, ATA error count increased from 55 to 60.
* Device: /dev/ada0, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada0, 8 Offline uncorrectable sectors.
* Device: /dev/ada3, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada3, 8 Offline uncorrectable sectors.
* Device: /dev/ada5, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada5, 8 Offline uncorrectable sectors.
* Device: /dev/ada5, ATA error count increased from 65 to 70.
* Device: /dev/ada5, Self-Test Log error count increased from 8 to 9.
* Device: /dev/ada6, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada6, 8 Offline uncorrectable sectors.


May 3rd:
Code:
New alerts:
* Device: /dev/ada5, Self-Test Log error count increased from 9 to 10.

Current alerts:
* Device: /dev/ada5, ATA error count increased from 65 to 70.
* Device: /dev/ada5, Self-Test Log error count increased from 8 to 9.
* Device: /dev/ada5, ATA error count increased from 70 to 75.
* Device: /dev/ada5, ATA error count increased from 75 to 80.
* Device: /dev/ada3, ATA error count increased from 19 to 20.
* Device: /dev/ada5, ATA error count increased from 80 to 85.
* Device: /dev/ada5, 8 Currently unreadable (pending) sectors.
* Device: /dev/ada5, 8 Offline uncorrectable sectors.
* Device: /dev/ada5, ATA error count increased from 85 to 90.


May 8th (after replacing ada5 and before replacing ada3)
Code:
Current alerts:
* Device: /dev/ada3, ATA error count increased from 20 to 22.



My plan is to replace the remaining 6x4TB drives over the next 3 to 6 months (as I can't easily afford more than 2 new drives a month :) )
When do you think I should start getting worried about bad sectors?
 
Last edited:

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
When do you think I should start getting worried about bad sectors

Okay, that makes sense; you have errors on ada0 and ada6 as well. So four bad drives, and hence a few files were corrupted beyond recovery. That actually shows the strength of ZFS: you have dual parity, four drives are bad, and ZFS can pinpoint the few files that suffered and get you back up and running. Impressive.
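(For completeness: zpool status -v is what pinpoints them; it prints a "Permanent errors have been detected in the following files:" list. A rough sketch of the cleanup:)
Code:
zpool status -v nasferatuvolume   # lists the permanently damaged files
# restore those files from backup or delete them, then:
zpool scrub nasferatuvolume
zpool clear nasferatuvolume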

Mechanical devices don't self-heal; they just get worse. Personally, I'd start to worry as soon as I see any uncorrectable or unreadable sectors. Hard drives have scratch space for a "grown defect list" or "reallocated sector count"; that's handled automatically by the firmware and exposed through SMART. I don't know much about how hard drives operate internally, but I'd assume that once the file system finds unreadable sectors, either the drive's grown defect list is full or the drive is encountering physical damage that can't be handled by reallocating sectors. At that point, it's time to count pennies and plan a hard drive replacement.
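The relevant SMART attributes are easy to eyeball; something like this (a sketch, and attribute names vary a little between vendors):
Code:
smartctl -A /dev/ada5 | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'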

Good news is you are now back to "two bad drives", so unless another one fails, you shouldn't lose any more files. And as you replace ada0 and then ada6, you get your full pool redundancy back.
 

psssttt

Dabbler
Joined
Aug 24, 2019
Messages
29
Thank you @Yorick for explaining it in layman's terms.

Resilvering of ada3 has finished without problems and the pool is healthy again :)
As I wrote earlier, I'm going to replace the remaining 4TB drives as soon as I have enough funds to do so.
 

kappclark

Explorer
Joined
Oct 16, 2019
Messages
99
I got some excellent guidance on this very topic on this forum a short time ago. I was trying to replace a hard drive which had previously been unavailable. In fact, I just reviewed the thread this AM to add a new Seagate IronWolf. The solution requires some CLI, but it's not too bad.

Here is the thread
 

Haathi

Explorer
Joined
Mar 27, 2015
Messages
79
Finally I made a little progress:
I've detached the new (OFFLINE) drive via the CLI:
Code:
zpool detach nasferatuvolume 9511904525519547236


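(The numeric id passed to detach is the stale member shown under the replacing-N entry in zpool status; roughly, as a sketch:)
Code:
zpool status -v nasferatuvolume | grep -A 3 replacing   # find the stale OFFLINE member's id
zpool detach nasferatuvolume <numeric-id-of-that-member>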
And the pool is finally in a normal state :)

Code:
# zpool status -v nasferatuvolume
  pool: nasferatuvolume
 state: ONLINE
  scan: resilvered 37.7M in 0 days 02:22:52 with 0 errors on Fri May  1 20:43:36 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    nasferatuvolume                                 ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/271e756e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/2c32e27b-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/31362dba-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/36387c1c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/3b7b207e-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/40a65355-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/45e1321a-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0
        gptid/4b337f7c-9605-11e5-b66a-382c4abd4614  ONLINE       0     0     0

errors: No known data errors
Thank you for posting this thread and keeping it updated with your debugging. The detach is what I needed to fix my resilver process.
 