Resilvering on a SMART failed drive

gtzx

Cadet
Joined
Apr 4, 2022
Messages
1
Hello,

We have an issue with our TrueNAS-12.0-U2.1 system.
We have a 22 TB RAIDZ2 pool with seven 7,200 RPM drives of 7.8 TB each (6 data, 1 hot spare). LZ4 compression is enabled.
The pool is served over iSCSI.

We have an ongoing resilver after a disk replacement. It is extremely slow: the estimated time remaining started at 2.5 days and is still growing (490 GB read of 6.7 TB total).

We tried to replace the hot spare drive, which was reporting SMART errors. Instead, a healthy drive was swapped out, so the faulty drive was pulled into production and is now resilvering. We then installed the new healthy disk as the hot spare.

Code:
  pool: DATASTORE-TRUENAS
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr  4 11:29:21 2022
        492G scanned at 35.9M/s, 361G issued at 26.3M/s, 6.17T total
        48.8G resilvered, 5.71% done, 2 days 16:32:01 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        DATASTORE-TRUENAS                               ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/0d05cf36-ae74-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0  (resilvering)
            gptid/044257b0-ae52-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0
            gptid/eff4a480-8e21-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0
            gptid/efed61d3-8e21-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0
            gptid/effe0944-8e21-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0
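
As a rough sanity check, the progress and time remaining can be reproduced from the "issued" figures in the scan lines of the output above. This is only a sketch: it assumes binary units (1T = 1024G, 1G = 1024M) and a constant issue rate, which may not be how ZFS computes its own estimate, and `resilver_eta` is a hypothetical helper name.

```python
def resilver_eta(total_tib, issued_gib, rate_mib_s):
    """Return (percent_done, days_remaining) from zpool status scan figures.

    Assumes binary units and a constant issue rate (both are assumptions).
    """
    total_gib = total_tib * 1024
    percent_done = 100 * issued_gib / total_gib
    remaining_mib = (total_gib - issued_gib) * 1024
    seconds_remaining = remaining_mib / rate_mib_s
    return percent_done, seconds_remaining / 86400


# Figures taken from the scan lines: 6.17T total, 361G issued at 26.3M/s.
pct, days = resilver_eta(total_tib=6.17, issued_gib=361, rate_mib_s=26.3)
print(f"{pct:.2f}% done, about {days:.1f} days to go")
```

With the numbers above this lands close to the reported 5.71% and a little under 2.7 days, so the estimate follows directly from the current issue rate. If the rate drops, the estimate grows.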


We also see these entries in /var/log/messages:
Code:
Apr  4 15:25:56 truenas ctl_datamove: tag 0x9ebab39 on (11:3:0) aborted
Apr  4 15:25:56 truenas ctl_datamove: tag 0x9ebab3b on (11:3:0) aborted
Apr  4 15:25:56 truenas ctl_datamove: tag 0x9ebab3d on (11:3:0) aborted
Apr  4 15:25:56 truenas ctl_datamove: tag 0x9ebab3f on (11:3:0) aborted


Since the resilver is extremely slow, is it possible to cancel it? Or can we replace the faulty disk that is currently resilvering with the hot spare?

Thanks for your help.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
When using a single RAIDZ vdev, a hot spare never makes sense. In your case, you could have easily gone to RAIDZ3. The only time a hot spare makes sense is when you are striping data across multiple vdevs, and even then, it usually only makes sense when you're striping across mirrored vdevs.

You can interrupt a resilver by detaching the disk. For example: zpool detach mypool gptid/5fe33556-3ff2-11e2-9437-f46d049aaeca (make sure to replace "mypool" and the drive ID with the correct values for your use case).
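
To avoid mistyping the drive ID, it could be picked out of the `zpool status` text first. Below is a rough Python sketch, assuming the status layout shown in the first post. `resilvering_devices` and `detach_command` are hypothetical helper names, and the command is only printed, never executed. Whether ZFS accepts the detach for a given vdev layout is not checked here.

```python
def resilvering_devices(status_text):
    """Return device names that the status text marks as (resilvering)."""
    devices = []
    for line in status_text.splitlines():
        fields = line.split()
        # The marker is the last whitespace-separated field on the device line.
        if fields and fields[-1] == "(resilvering)":
            devices.append(fields[0])
    return devices


def detach_command(pool, device):
    """Build (but do not run) the argv for `zpool detach`."""
    return ["zpool", "detach", pool, device]


# Sample lines copied from the status output in the first post.
sample = """\
  raidz2-0                                      ONLINE       0     0     0
    gptid/0d05cf36-ae74-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0  (resilvering)
    gptid/044257b0-ae52-11eb-8c12-f4e9d4a018f0  ONLINE       0     0     0
"""

for dev in resilvering_devices(sample):
    # Review the printed command before running it by hand.
    print(" ".join(detach_command("DATASTORE-TRUENAS", dev)))
```

Printing the command instead of running it leaves the final decision with you, which seems safer for a destructive operation on a production pool.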
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
While we know it's a bit late, I agree with @Nick2253 that a hot spare is not of much use when you could have just made it a RAIDZ3. When you get an opportunity to rebuild your pool, you should make a RAIDZ3.

So I see that you have one drive resilvering. Are you saying that it is the bad drive? I'm not following your English very well. If that is a new drive and it is resilvering as it should be, maybe just leave it alone until it finishes.

You also didn't say whether the machine is in use while it's resilvering. As you know, that will slow things down, just as running a scrub or a SMART Long/Extended test will add a lot of time.

I would think you could also interrupt a resilver by rebooting the machine, but I have never tried it. To be honest, I would not stop the resilvering unless the bad drive is being resilvered back into the machine.

I'm curious how this turns out.

Best of luck to you.
 