Slow Resilver on RAID-Z

Status
Not open for further replies.

Cstdenis

Cadet
Joined
May 30, 2014
Messages
6
I had a drive with SMART errors, so I powered down the system and swapped it out for a new one.

The system is now resilvering, but it is very slow (on track to take over a week to complete). It was going at 10 MB/s, but has now slowed to 8.7 MB/s (the drive should be able to write at around 100 MB/s, and read even faster).

CPU is basically idle and nothing else is using the filesystem. Why is it so slow?


Oddly, gstat shows ada0 and ada1 as the bottleneck; they are much busier than the other drives (ada2 is the replacement).

All drives are identical except the new ada2, which is a newer hardware revision of the same model. I don't understand why so much more data is being read off ada0 and ada1 (shouldn't the data be spread fairly evenly across the drives?).
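(For anyone reproducing this: the snapshot below was captured with something like the following one-shot gstat invocation; flags per the FreeBSD gstat man page.)

Code:
# One report averaged over a 1-second interval, physical disks only
gstat -b -p -I 1s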

Code:
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name   
    2    123     71    967   18.6     50    376    1.0   88.1| ada0            
    2    124     74    875   19.1     48    380    1.1   92.9| ada1            
    0    182      0      0    0.0    180   1095    1.6   50.4| ada2            
    2    191    144    871    4.5     45    376    0.2   56.5| ada3            
    2    211    166    887    3.8     43    336    0.2   42.3| ada4



I've tried setting vfs.zfs.resilver_delay=0, but this may be making it even slower...
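For the record, this is roughly what I ran (these are the legacy FreeBSD ZFS sysctls of this era; names and defaults can differ on other releases):

Code:
# Show the current scrub/resilver throttling knobs
sysctl vfs.zfs.resilver_delay vfs.zfs.scrub_delay vfs.zfs.resilver_min_time_ms

# What I tried: remove the per-I/O resilver delay entirely
sysctl vfs.zfs.resilver_delay=0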
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The data doesn't have to be evenly distributed. That's why we say RAIDZ1 is *like* RAID5 (and likewise RAIDZ2 is like RAID6): similar, but not exactly the same. At the deeper levels ZFS makes things work even when the layout is unbalanced, and all you have to do is keep the pool healthy. This behavior doesn't significantly affect your ability to use ZFS as a non-developer (and it's not well understood outside of developers and math wizards anyway), so it's best to just let ZFS do its magic behind the proverbial curtain.

Yes, that resilver speed is pretty low. Without the hardware and pool details we ask for when creating a thread, I can't offer more specific advice. BUT, that ~300 KB/sec being written to your pool shows the pool is somewhat busy. Actual pool I/O will suspend the resilver for short periods. This is by design, to keep a resilver from hurting pool performance the way replacing a disk in a hardware RAID controller does.

Also keep in mind that throughput will fluctuate throughout the resilver. It might estimate a week right now, but in an hour it might be cruising along at 300 MB/sec. The locations and sizes of the blocks play a part in how fast (or slow) a scrub/resilver goes.

So there's a lot that can affect throughput, and unless you can find an actual problem, it's best to leave well enough alone and let it run. Unless something is genuinely wrong, it should finish and be fine.
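If you want to watch how the estimate moves around, a quick loop like this works (substitute your pool's name for "tank", which is just a placeholder here):

Code:
# Log the resilver speed/ETA once a minute
while true; do
    date
    zpool status tank | grep -A 2 'scan:'
    sleep 60
done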
 

Cstdenis

Cadet
Joined
May 30, 2014
Messages
6
Hardware:
* Supermicro X7SBL-LN2
* 8GB RAM
* CPU Q6700 @ 2.66GHz
* 850 W PSU
* 5 x WD Green 3 TB drives (plugged into motherboard SATA ports)


It did speed up after a while, finally finishing after 17.5 hours. Much faster than the initial estimate, but still many times longer than it should have taken given the speed of the drives.

The I/O load across the drives evened out for a while, but when I checked in the morning the first two drives were once again the bottleneck. Either it's very inefficient to distribute data that unevenly, or something is causing two of the identical drives to be much slower than the others.

BUT, that 300KB/sec that's being written to your pool shows that your pool is somewhat busy
I can only assume ZFS is writing something (the ZIL, maybe?) to the disks, causing this. There should be no other activity.
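One way to confirm that would be to watch per-device traffic while the pool is supposed to be idle, something like this ("tank" again standing in for the pool name):

Code:
# Per-vdev I/O every 5 seconds; steady writes with no clients
# attached would point at background activity
zpool iostat -v tank 5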


"zpool status" now shows gptid instead of device name. How do I fix this?

Is there any point in doing a scrub now, or does the resilver basically include an implicit scrub?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Was this pool created using FreeNAS? FreeNAS uses gptids when labeling disks, and anything else could possibly break FreeNAS.
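Nothing to fix there. If you just want to see which disk each gptid corresponds to, glabel will show the mapping without touching the pool:

Code:
# List gptid/... labels alongside their adaXpY partitions
glabel status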

There isn't any reason to do a scrub now. Just let your automated scrubs do their thing.
 

Cstdenis

Cadet
Joined
May 30, 2014
Messages
6
It was created using an older version of FreeNAS and imported into a clean install of 9.3.
 