Resilver slower than scrub and gstat report unexpected statistics

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
Unfortunately I did not collect any data during the specific resilver but the thing is this.

I have a pool of 4 mirrored vdevs of two drives each. One vdev is made of 2TB drives, the others of 3TB drives. I replaced da7, which was in a mirror, with da4.

In the last week I had to perform more than one resilver of the same drive because I had to correct some lost data, probably caused by a faulty power supply. I did not lose much, so no real problem.

What I noticed is that the resilver took almost 36 hours, while a full scrub of the pool takes only around 10 hours. The other strange statistic from gstat was that da7 was reading and da4 was writing; shouldn't it be exactly the opposite?
 

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
Does no one have an insight for me?

It seems fairly strange that the resilver takes so much longer than the scrub.
 
Joined
May 10, 2017
Messages
838
What model are the disks? SMR disks will take longer to resilver, but scrub times stay normal since read speed is not an issue with SMR drives.
 

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
The two specific drives are ST3000LM024-2AN17R. I have 2 more of them, plus two MQ01ABB200 and two MQ03ABB300. Lately I have had many failures in the MQ03ABB300 family, with ATA errors and reallocated sectors. The MQ01ABB200 instead lasted some years without problems; I replaced them when I saw the first reallocated sector (I should never have done that and just lived with one reallocated sector). The ST3000LM024-2AN17R gave me problems with Current_Pending_Sector, but I suspect that was a matter of how I connected their power supply; I am now trying a total reconfiguration.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
What I noticed is that the resilver took almost 36 hours, while a full scrub of the pool takes only around 10 hours. The other strange statistic from gstat was that da7 was reading and da4 was writing; shouldn't it be exactly the opposite?
Resilver should take longer since it's doing a lot of the same work as a scrub, but also writing the data to one of the members.

I guess you must be reading the charts incorrectly (or the da device numbers were reassigned during a reboot after you replaced the disk).
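One way to double-check which device is which: during a resilver, `zpool status` nests the old and new disks under a `replacing` vdev. A minimal sketch of pulling those two names out; the awk parsing, the `parse_replacing` helper name, and the sample output fragment below are illustrative assumptions, not output from your pool:

```shell
# On a live system you would pipe the real command in:
#   zpool status | parse_replacing
parse_replacing() {
  # Print the first two device names listed under a "replacing-N" vdev
  awk '/replacing-/ {f=1; next}
       f && NF      {print $1; if (++n == 2) exit}'
}

# Assumed zpool status fragment standing in for the live output:
parse_replacing <<'EOF'
            replacing-0    ONLINE       0     0     0
              da7          ONLINE       0     0     0
              da4          ONLINE       0     0     0
EOF
```

Cross-checking the names this prints against the serial numbers shown by `smartctl -i /dev/daN` would settle whether the device numbers moved.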
 

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
I don't think so: in gstat, the "r" columns showed activity for da7 and the "w" columns for da4. I checked the status multiple times, both from the UI and the command line, to understand whether the disks had been reassigned, but it does not seem so: throughout all these resilvers the "replacing" disk was always da7, and the operations reporting in the UI confirmed this strange "swap".

Fortunately it worked in the end and all the errors were gone, so I did not have to run another resilver, but if it ever happens again I will keep an eye on this.

Do you all think the problem is just related to SMR? Can it cause a discrepancy of 36 vs 10 hours?
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Do you all think the problem is just related to SMR? Can it cause a discrepancy of 36 vs 10 hours?

Yes, this is well documented across multiple systems and pool topologies. On my personal system, swapping an SMR WD40EFAX into a RAIDZ2 pool resulted in a resilver time of 42 hours. Changing that out for a CMR WD40EFRX resulted in a resilver time of 5 hours.
 

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
@Samuel Tai thank you for the advice. This is incredible, I would not have expected such a big difference. I could have expected a slowdown in the low tens of percent, like 20% or 30%, but you are talking about a slowdown of about 700% O_O
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Yes, a resilver is pretty much the worst-case workload for an SMR disk, which is optimized for write-once/read-many archival workloads. An SMR disk has a small CMR area to handle intermittent writes, and will relocate data written to the CMR area to the larger SMR area when idle. However, during a resilver, this CMR area fills up quickly, and then the resilver has to wait for the drive to empty out the CMR area before continuing.
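As a back-of-the-envelope illustration of why the collapse is so dramatic (all numbers below are assumptions chosen for the arithmetic, not specs of any particular drive): once the CMR cache fills, the drive can only accept writes as fast as it drains them into the SMR zones.

```shell
# Toy model: the CMR cache fills at (incoming - drain) MB/s; after that,
# sustained write throughput drops to the background drain rate.
cache_mb=20480      # assumed 20 GB CMR cache
incoming_mbs=120    # assumed sequential resilver write rate, MB/s
drain_mbs=30        # assumed background CMR->SMR rewrite rate, MB/s

fill_seconds=$(( cache_mb / (incoming_mbs - drain_mbs) ))
echo "cache full after ~${fill_seconds}s; writes then crawl at ${drain_mbs} MB/s"
```

With numbers in this ballpark, the cache is exhausted within minutes of a resilver starting, so almost the entire operation runs at the slow drain rate rather than the drive's nominal speed.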
 

Tommaso

Dabbler
Joined
Oct 19, 2016
Messages
12
Thank you for the explanation, this makes things clearer. Even knowing how the SMR mechanism works, I would never have guessed such a huge loss.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Thank you for the explanation, this makes things clearer. Even knowing how the SMR mechanism works, I would never have guessed such a huge loss.
A nice approach to debugging or collecting metrics during a resilver is to install and run Netdata.
To isolate issues related to SMR drives, look at the following graphs for each disk:

"Disk I/O Bandwidth" and "Disk Utilization Time"

If "Disk Utilization Time" is stuck at 100% but "Disk I/O Bandwidth" is only a fraction of the disk's maximum (a few MB/s), that is an indication the disk is trying to clean up its CMR cache by constantly jumping back and forth between the SMR and CMR regions.
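That signature (utilization pegged while bandwidth is tiny) is easy to flag mechanically. A sketch using made-up per-disk numbers in place of the real Netdata/gstat metrics:

```shell
# Columns: disk, utilization %, bandwidth MB/s (sample values are invented).
# Flag any disk that is >=95% busy yet moving under 10 MB/s.
awk '$2 >= 95 && $3 < 10 {
       print $1 " looks SMR-bound: " $2 "% busy at only " $3 " MB/s"
     }' <<'EOF'
da4 99 3
da7 45 80
EOF
```

A healthy CMR disk under resilver load tends to show high utilization *and* high bandwidth, so only the SMR-bound disk trips the filter.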
 