SOLVED Cancel a resilver for data integrity?

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
This is probably an edge case, but here's my situation:
One Pool/vdev consisting of 3 HDDs in Raid-Z. One of the drives is lower capacity, and it refuses to die (the other two have been replaced multiple times). So after many years I decided to upgrade that one old drive to expand the total capacity of my vdev, even though it was not faulty.

So I replaced the drive and it starts to resilver.

But now the resilver is taking forever. ETA is five days. Just last week I replaced one of the other drives in the same vdev and the resilver took six hours. So already I'm worried. Suddenly the other two drives start showing read/checksum errors in the Pool Status page, and drive statuses have gone from Online to Degraded (except the new drive being resilvered). Now I am concerned about my data integrity.

Should I:
  1. Remove the new drive, replace the old (still good) one, and scrub the dataset (and then update my offline backups)? Will this work as I expect (i.e. recognize old drive and cancel resilver)?
  2. Let the insanely long resilver run its course (at risk of data loss if the other two drives are indeed failing)?
Anyone see anything like this before? Thanks!
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Now I am concerned about my data integrity.
AFAIK ZFS will tell you which files are corrupted, and you seem to have backups, so in the end you'll probably be okay.

Seems like odd timing. Did you by any chance loosen the data cables of the other drives during the physical replacement?

The correct course of action in your situation needs to be determined by someone else, I have no experience here. You should however list your hardware in detail to attract helpful responses.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
Hi Chuck thanks for your response. I didn't know zfs could tell me exactly which files are corrupted (wouldn't it normally be a sector? But maybe the fs is smart enough to know which file maps to that sector?). If this is true, then that is definitely a relief. I should be able to replace those specific files from backup. Cheers.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
You're still not in the clear though, both of your other drives throwing errors during resilvering need to be investigated.

When my boot pool degraded I didn't get such a list, but @joeschmuck suggested zpool status -xv.

From zpool-status.8
Displays verbose data error information, printing out a complete list of all data errors since the last complete pool scrub. If the head_errlog feature is enabled and files containing errors have been removed then the respective filenames will not be reported in subsequent runs of this command
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@Unis_Torvalds - Please supply the make and model of all 3 hard drives.

There is a known, serious problem with some Western Digital Red drives that could explain your slowdown. Other vendors' drives also have this SMR (Shingled Magnetic Recording) problem, and SMR drives are generally not suitable for use with ZFS.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
@Arwen don't worry the drives are all CMR. I made sure of that when buying them. Furthermore the new drive taking forever to resilver is a Seagate.

@chuck32 here's the status:
Code:
root@freenas:~ # zpool status -xv
  pool: Entrepot_2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar  9 14:43:38 2024
        3.13T scanned at 16.6M/s, 2.48T issued at 13.2M/s, 7.30T total
        848G resilvered, 33.99% done, 4 days 10:37:00 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Entrepot_2                                        DEGRADED     0     0   0
          raidz1-0                                        DEGRADED     0     0   0
            gptid/347f1dc3-d773-11ee-920c-003018a24e12    DEGRADED 27.5K     0   0  too many errors
            replacing-1                                   DEGRADED     0     0 19.6K
              gptid/d22edd31-781c-11e4-b7b7-003018a24e12  OFFLINE      0     0   0
              gptid/52ee7882-de4d-11ee-920c-003018a24e12  ONLINE       0     0   0  (resilvering)
            gptid/1ca3c8a8-ffd8-11ed-95e2-003018a24e12    DEGRADED     0     0 19.6K  too many errors

errors: Permanent errors have been detected in the following files:

It then lists only one file (whew!).
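As a sanity check (my own arithmetic, not from zpool), the five-day ETA is consistent with the issued rate in that output: (7.30T total − 2.48T issued) ÷ 13.2M/s ≈ 4 days 10 hours.

```shell
# Rough ETA check using the numbers from the status output above,
# in binary units (TiB/MiB) as zpool reports them.
remaining_mib=$(( (730 - 248) * 1024 * 1024 / 100 ))  # 7.30T - 2.48T, in MiB
secs=$(( remaining_mib * 10 / 132 ))                  # divide by 13.2 MiB/s
echo "$(( secs / 86400 )) days $(( secs % 86400 / 3600 )) hours"
# prints: 4 days 10 hours
```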
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
As stated here, the degraded state may explain the slow performance.

I'll say it again: the number of errors is worrisome and needs to be addressed. With RAIDZ1 there is no redundancy to spare while a disk is being replaced, so errors on the remaining drives can mean data loss.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
@chuck32 so what are my options? Can I cancel and restore the old drive for a scrub? (And maybe address any other failed drives if that be the case?)
Googling around forums before I posted here, everything I saw suggested it is impossible to cancel a resilver once started.
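For what it's worth, a sketch of what I believe should work (untested on this pool): a "replacing" vdev behaves like a temporary mirror, so per zpool-detach(8) the replacement itself can usually be cancelled by detaching the new device. The gptids are taken from the status output above; the old disk would still need to be physically re-inserted and onlined afterwards.

```shell
# Assumption: detaching the NEW device from the "replacing" vdev cancels
# the replacement, since a replacing vdev acts like a temporary mirror.
zpool detach Entrepot_2 gptid/52ee7882-de4d-11ee-920c-003018a24e12

# After re-inserting the old disk, bring it back online so ZFS can catch
# it up, then verify everything with a scrub:
zpool online Entrepot_2 gptid/d22edd31-781c-11e4-b7b7-003018a24e12
zpool scrub Entrepot_2
```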
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The correct course of action in your situation needs to be determined by someone else, I have no experience here. You should however list your hardware in detail to attract helpful responses.
I'm sorry, I can just guess here. See below for what I would do in absence of better knowledge.

Both of your other drives being degraded doesn't sound too good to me. When did you last run SMART tests? Can you post their output?

Realize you may have to nuke the pool and restore completely from backup.
  1. Shut down the server and check all cabling; make sure everything is properly seated.
  2. Power the server back on.
  3. Let the resilver finish.
  4. Run zpool status -xv again and replace all damaged files.
  5. Run long SMART tests on all drives.

Now you could either verify the integrity of the data against your backups and also run a scrub (not sure if that's needed after resilvering, as I said I'm lacking experience here).
Or you destroy the pool and restore from backup completely.
Hope that in the meantime someone else with more experience chimes in.

You didn't mention your hardware. Are you using an HBA?
What was your scrub and smart test schedule prior to this?

My best guess is that you accidentally loosened the connections on the other drives during the physical replacement.
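If it helps, the long SMART tests suggested above can be started from the shell. The device names ada0-ada2 are an assumption; check yours with camcontrol devlist on CORE/FreeBSD or lsblk on SCALE.

```shell
# Start a long SMART self-test on each drive; the test runs inside the
# drive itself in the background and typically takes several hours.
for d in ada0 ada1 ada2; do
    smartctl -t long /dev/$d
done

# Once finished, inspect the self-test log and the reallocated/pending
# sector counters for each drive:
smartctl -a /dev/ada0
```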
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
Hi @chuck32 and thanks again for your advice.
I double checked all the connections and everything seems sound on a hardware level. No bus adapter, the drives are all plugged directly to the mobo. Scrub and smart tests were daily crons. No alerts before I started this foolhardy operation.

But now operations have slowed to a crawl (weirdly, my share mounts on client systems still read at full speed) and I get all sorts of strange errors when I try to swap in new disks. I've been operating my Free/TrueNAS for over ten years without issue. This isn't normal.

Looks like I'm going to have to start over from a clean slate.

Thanks anyways for your help!

UPDATE: While fiddling around in the case I've located a loose power connection from the PSU. Hopefully this connector was feeding the drives and explains the issues. Good instincts @chuck32! If this still doesn't solve the issue, I'll just start over and restore from backups.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
Final update: I think it was in fact the loose power connector. I tested the 'failed' HDDs in another system and they seem to check out. Upon leaving the system to resilver (which went much faster once the cables were properly re-seated), the pool still showed as DEGRADED. But zpool status -xv now shows no checksum errors nor any corrupted files. So I ran zpool clear to clear the degraded state, and now I'm waiting for a scrub to finish. If anything is truly faulty, the scrub will turn up errors, but I have a hunch it won't find any.
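For anyone following along, the clear-then-verify sequence (pool name from this thread) was roughly:

```shell
zpool clear Entrepot_2       # reset error counters and the DEGRADED state
zpool scrub Entrepot_2       # re-read every block and verify against parity
zpool status -v Entrepot_2   # watch progress; real faults would show up again
```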
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Glad I could help and thanks for the update.

Scrub and smart tests were daily crons.
Daily scrubs seem excessive. I scrub twice per month, with daily short and weekly long SMART tests.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
My mistake, yes you're right. The scrub cron job runs daily, but the threshold is set to 20 days, so a scrub only actually starts when the last one is more than 20 days old.
 

Unis_Torvalds

Dabbler
Joined
Jan 29, 2015
Messages
15
FINAL TAKEAWAY for anyone else who stumbles on this thread:
  • You cannot stop a resilver once in progress. It persists through system reboots.
  • Even if you replace the original pool member disk, it will not reintegrate as before but will require resilvering.
 