Resilvering percent complete progression not very linear

danjb

Dabbler
Joined
Aug 2, 2014
Messages
26
I had an 8TB drive fail in a storage pool consisting of 2 RAIDZ2 vdevs, each composed of six 8TB Western Digital WD80EFZX drives. I replaced the failed drive with a Toshiba HDWG180 because it was the soonest replacement option I had available.

The resilvering proceeded very quickly at first, reaching something like 53% complete in the first 12 hours. However, it then slowed to a crawl and took about 4 days to reach 58%. It then sped up a little, took another 3 days to reach 67%, and is currently projecting completion in a little over 3 more days from now. The pool is something like 72% full.

This is on TrueNAS-12.0-U3.1 running on a Xeon E5-2620 with 64GB of RAM. The system runs 24x7 and is not super loaded down. I am not really concerned about the total amount of time the resilvering process takes; I had expected it to run very slowly. However, I am curious about the jumpy percentage-complete progress. It progressed at about 4-5% per hour for the first 12 hours, then slowed down by a factor of 100X for the next 4 days, then sped up about 2X for the next 3 days.

What is the major driving factor to resilvering speed that could cause this orders of magnitude variability? Some things I thought of:
  1. Numbers of files or sizes of files. I have some metadata directories with huge amounts of tiny files. I have other media directories with relatively small numbers of extremely large files. Does one or the other of these cause speedups or slowdowns in resilvering?
  2. System usage. As I say this is not a heavily loaded system, but it is performing variable amounts of I/O all the time. Could high levels of I/O cause 100X resilvering slowdowns? Could CPU utilization cause that large an amount of variability?
  3. I would have liked to use an identical drive, but like I mentioned I didn't. The old drives are 5400RPM slower drives, the new drive is a 7200RPM faster drive. Other than it being bad to have a mismatch, could this cause performance issues? It doesn't seem like it would contribute to the variability though.
  4. Is it possible this variability indicates an issue with any drives? No errors are currently being reported.
Current zpool status reads:
Code:
  pool: storage
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug 31 17:50:30 2021
        47.4T scanned at 84.2M/s, 46.8T issued at 83.2M/s, 69.6T total
        1.22T resilvered, 67.29% done, 3 days 07:42:05 to go
 

QonoS

Explorer
Joined
Apr 1, 2021
Messages
87
A resilver checks the hash of every piece of data in your pool, and when ZFS detects an error it rebuilds the bad data from redundant information.
47.4T has been scanned at an average speed of 84.2M/s, and 1.22T of that data has been found with errors that had to be corrected (resilvered).
67.29% done means: 47.4T of 69.6T has been scanned so far. At 84.2M/s, the remaining data to be scanned will presumably take "3 days 07:42:05".
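As a rough sanity check on that arithmetic, the ETA is essentially the remaining data divided by the current rate. A sketch using the figures from the zpool status output above (using the "issued" numbers, which track actual repair I/O):

```shell
# Rough sketch of how the ETA line is derived: remaining data to issue
# divided by the issue rate (figures copied from the status output above).
awk -v total=69.6 -v issued=46.8 -v rate=83.2 'BEGIN {
  remaining_bytes = (total - issued) * 1024^4   # TiB -> bytes
  secs = remaining_bytes / (rate * 1024^2)      # MiB/s -> bytes/s
  printf "%d days %02d:%02d to go\n", int(secs / 86400), \
         int(secs % 86400 / 3600), int(secs % 3600 / 60)
}'
```

This prints roughly "3 days 07:49 to go", close to zpool's own "3 days 07:42:05"; the small difference is expected because zpool's estimate moves with the instantaneous rate.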

2 things to consider here:
1) Data errors are not distributed evenly ;) so when ZFS rebuilds bad data from redundant copies, it is slower than just acknowledging good data. Processing small files takes more time on HDDs because of the small I/O, so huge amounts of small files can make a huge speed impact.
2) 84.2M/s is not very fast for 2 RAIDZ2 vdevs of six 8TB drives each. This could be because of the huge amount of small files, but also because of other HDDs that are just short of an error. Have you checked the SMART data of the other HDDs? You should, just to be sure.
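For the SMART check, something along these lines works from a TrueNAS shell. The device names below are placeholders, not taken from the thread; list your actual disks with `camcontrol devlist` or in the UI:

```shell
# Sketch: queue a long SMART self-test on every pool member, not just the
# replacement. Device names are examples -- substitute your own.
for dev in da0 da1 da2 da3 da4 da5; do
  echo "smartctl -t long /dev/$dev"   # drop 'echo' to actually start the tests
done
# Once a test completes, inspect results and error counters with:
#   smartctl -a /dev/daN
```

Long tests run in the background on the drive itself, so this is safe to do while the pool is online.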
 

danjb

Dabbler
Joined
Aug 2, 2014
Messages
26
Thank you for your reply, and sorry for the delay in my response. Your reply prompted some investigating I should have done before. I found out my scheduled SMART tests had not been running since my upgrade to TrueNAS earlier this year, so I had no SMART results to look at. I reestablished those, but did not see anything unusual, so I let the resilver run to completion, which took another 10+ days.

At that point I realized something much worse than a failed disk was happening: I had actually had 2 failed disks in the vdev. I gather the second disk had failed during the resilver of the first, and I assume it was interfering with or otherwise slowing down the resilvering process. This second failed drive showed in the pool status display as offline, but I had interpreted that as being the first failed disk (I've had disk failures in the past, but not often enough to remember exactly how they're displayed during resilvering).

Anyway, this all became apparent when the resilvering of the first disk finished, the zpool stayed degraded, and the drive continued to show offline. A careful inventory of the drives allowed me to locate the second failed drive and replace it; its full resilvering process took about 3 days, exactly as you say it should.

This was definitely a bullet dodged by pure luck. Everything shows healthy now.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Numbers of files or sizes of files. I have some metadata directories with huge amounts of tiny files. I have other media directories with relatively small numbers of extremely large files. Does one or the other of these cause speedups or slowdowns in resilvering?
Not files per se, strictly speaking. Small blocks, definitely, as they mean more metadata per byte stored. Although tiny blocks small enough to fit in the block pointer once compressed can end up being faster...
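To put illustrative numbers on the "more metadata per byte" point (pure arithmetic, not measured from any pool): the smaller the blocks, the more block pointers the resilver has to traverse per TiB of data.

```shell
# Illustrative only: how many blocks (and thus block pointers to walk)
# one TiB of data breaks into at different ZFS block sizes.
awk 'BEGIN {
  for (kib = 8; kib <= 128; kib *= 4)   # 8K, 32K, 128K blocks
    printf "%3d KiB blocks: %d per TiB\n", kib, 1024^4 / (kib * 1024)
}'
```

At 8 KiB blocks there are 16x as many blocks per TiB as at the default 128 KiB recordsize, which is the kind of multiplier that shows up as dramatic swings in scan speed.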
System usage. As I say this is not a heavily loaded system, but it is performing variable amounts of I/O all the time. Could high levels of I/O cause 100X resilvering slowdowns? Could CPU utilization cause that large an amount of variability?
I wouldn't say 100x, but some slowdown is expected.
I would have liked to use an identical drive, but like I mentioned I didn't. The old drives are 5400RPM slower drives, the new drive is a 7200RPM faster drive. Other than it being bad to have a mismatch, could this cause performance issues? It doesn't seem like it would contribute to the variability though.
Not a concern.
Is it possible this variability indicates an issue with any drives? No errors are currently being reported.
As you've seen, the answer is "yeah, definitely look into it!"
 