Start of scrub kills performance

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Recently we've noticed that during the monthly scrub, for the first hour or so it slams the disks to 100%, increasing latency and taking the site down. Specs:

Motherboard: Supermicro X10DRH
CPU: Xeon E5-2620 v3 @ 2.40GHz (12 cores)
Ram: 64GB ECC
OS disk: 2 x Kingston DataTraveler 3.0 64GB mirrored
Data disk: 12 x HGST Ultrastar 7K6000 4 TB SAS 12Gb/s (6 x 2 disk mirrored vdevs)
FreeNAS 11.2-U5

This happened even with vfs.zfs.scrub_delay set to 40. I've since set it back to the default of 4 because I noticed that after the initial spike, the additional load is manageable (and scrubs take over a week otherwise).

Pool is at 76% utilization and was 33% fragmented last I checked.

Scrubs and SMART long tests also seem to be taking longer each month, to the point where I can only run one of each per month or they'll overlap. I wonder if the utilization + fragmentation is contributing to this?
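(In case it's useful to anyone: you can check how long the long tests have been taking from the self-test log, and kick one off by hand to time it. /dev/da0 below is just an example device.)

Code:
# History of completed self-tests, including the power-on hours when each ran
smartctl -l selftest /dev/da0

# Start an extended (long) self-test manually
smartctl -t long /dev/da0
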

I just ordered a new server with 12 x 14TB drives and SSDs for OS. This will put us at 44% utilization.

Until we get this server installed, I'm wondering if there is any way to throttle down whatever the scrub is doing in that first hour. It doesn't seem to respect scrub_delay at all. Any help here would be much appreciated.
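(For anyone searching later, these are the knobs I've been experimenting with. The values are examples only, and I'm not sure all of them still apply to the newer scrub code in 11.2.)

Code:
# Current values of the legacy scrub throttle tunables
sysctl vfs.zfs.scrub_delay vfs.zfs.scan_idle vfs.zfs.top_maxinflight

# Example: delay each scrub I/O by more ticks when the pool is busy
sysctl vfs.zfs.scrub_delay=8

# Example: disable scrub prefetch (sometimes suggested for the metadata-heavy phase)
sysctl vfs.zfs.no_scrub_prefetch=1

On FreeNAS these would normally be added under System -> Tunables (type sysctl) so they survive a reboot.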
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Recently we've noticed that during the monthly scrub, for the first hour or so it slams the disks to 100% increasing latency and taking the site down.
Site down? What kind of site? Scrubs should be scheduled to run at a time when they will have minimal impact.
Pool is at 76% utilization and was 33% fragmented last I checked.
What is the storage used for?
I wonder if the utilization + fragmentation is contributing to this?
Yes. You need to plan for an expansion of your pool, especially if performance is important to you. The closer your disks get to full, the slower the storage will be. Where I work, we have a department that manages storage for a supercomputer, and it needs to be very fast storage, so they test the drives to see what portion of the platter is fastest and partition them to use only that part. That often limits the usable capacity to less than 50% of the drive. The inner tracks tend to be the slow part, and it only gets worse the closer to full the drive is.
I just ordered a new server with 12 x 14TB drives and SSDs for OS. This will put us at 44% utilization.
You should consider going with more drives. IOPS generally scale with vdev count, so if performance matters, more drives generally means more performance. I have a system at work that we built with ten vdevs to get more IO. Limiting yourself to 12 drives in mirror vdevs caps you at 6 vdevs. There is no reason for that unless you just don't have room for a larger server or an external SAS-attached drive shelf.
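(Just to illustrate the layout side of it; the disk names are hypothetical and this is only a sketch of the idea, not a recommendation for your exact hardware.)

Code:
# 6 mirror vdevs from 12 disks -- random IO scales with the number of vdevs
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5 \
  mirror da6 da7   mirror da8 da9   mirror da10 da11

# With 20 disks you would get 10 mirror vdevs and roughly 10 disks' worth of read IOPS
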
Until we get this server installed, I'm wondering if there is any way to throttle down whatever the scrub is doing in that first hour.
It is first analyzing all the metadata / checksum data in the pool, which means a massive number of small reads. Again, this would not impact your performance as much if you had more drives.
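(If you want to watch that phase yourself, both of these tools ship with FreeNAS; the pool name is just an example.)

Code:
# Per-vdev operations and bandwidth, refreshed every 5 seconds
zpool iostat -v tank 5

# Per-disk busy %, queue depth and latency for physical providers
gstat -p
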
I am still curious about how this storage is being used; it makes a difference.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Site down? What kind of site? Scrubs should be scheduled to run at a time when they will have minimal impact.

High traffic website. Scrubs are scheduled for when traffic dips, of course, but it's always relatively busy.

What is the storage used for?

Videos, mp3s, images, HTML, Javascript, etc.

You should consider going with more drives.

Would if I could. We're at a hosted facility that only has so many options. 12 drives are the max.

We have six datasets that could be spread out to additional servers, but I don't know how much IO is going to each dataset, and it doesn't look like there's an easy way to figure that out. I wouldn't know how to balance it, and we can't just order and remove servers all that easily.

We have cache servers taking the brunt of the traffic. The file server is only serving a very small subset of traffic, and over 97% of that is coming out of the ARC (which isn't even fully used). Performance isn't a problem at all until scrubs run.
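(For reference, that 97% figure is just a rough calculation from the ARC hit/miss counters exposed through sysctl, nothing fancier than this. Bourne shell syntax.)

Code:
# Raw ARC counters and current/maximum ARC size
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max

# Rough lifetime hit rate as a percentage
h=$(sysctl -n kstat.zfs.misc.arcstats.hits); m=$(sysctl -n kstat.zfs.misc.arcstats.misses)
echo "scale=2; 100 * $h / ($h + $m)" | bc
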

Surely there has to be a way to throttle the initial part of the process. That's the only thing causing problems.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
For anyone looking to see what each dataset/filesystem is doing in terms of ops and throughput, this dtrace script appears to work on 11.2-U5:

Code:
#!/usr/sbin/dtrace -s

/*
 * vfsstat emulation
 */

#pragma D option quiet
#pragma D option defaultargs
#pragma D option switchrate=10hz
#pragma D option bufsize=8m
#pragma D option dynvarsize=32m

/*
 * Refer to man 9 VFS and the files <sys/vnode.h>,
 * </usr/src/sys/kern/vfs_default.c> and various
 * information on vnode in each fs.
 */

/*
 * This script only records successful operations. However, it
 * is trivial to modify so that failures are also recorded.
 * This script is intended to be used with your favourite scripting
 * language to process the output into something like the
 * Solaris fsstat. Also, note that read and write byte counts
 * will differ significantly from the disk IOs due to IO
 * inflation or deflation.
 */

/* On entry, save the requested byte count so the return probe can
 * work out how many bytes were actually transferred. */
vfs::vop_read:entry, vfs::vop_write:entry
{
        self->bytes[stackdepth] = args[1]->a_uio->uio_resid;
}

/* The predicate doubles as an assignment: delta is the number of bytes
 * actually transferred, and the clause is skipped when it is zero. */
vfs::vop_read:return, vfs::vop_write:return
/this->delta = self->bytes[stackdepth] - args[1]->a_uio->uio_resid/
{
        this->fi_mount = args[0]->v_mount ?
                stringof(args[0]->v_mount->mnt_stat.f_mntonname) :
                        "<none>";
        @bytes[this->fi_mount, probefunc] = sum(this->delta);
        @ops[this->fi_mount, probefunc] = count();
}

/* Clear the saved byte count after the return probes have fired. */
vfs::vop_read:return, vfs::vop_write:return
{
        self->bytes[stackdepth] = 0;
}

/* You may add or remove operations of interest here. */

vfs::vop_rename:return, vfs::vop_create:return, vfs::vop_remove:return,
vfs::vop_getattr:return, vfs::vop_access:return, vfs::vop_open:return,
vfs::vop_close:return, vfs::vop_setattr:return, vfs::vop_mkdir:return,
vfs::vop_rmdir:return, vfs::vop_readdir:return, vfs::vop_lookup:return,
vfs::vop_cachedlookup:return
/errno == 0/
{
        this->fi_mount = args[0]->v_mount ?
                stringof(args[0]->v_mount->mnt_stat.f_mntonname) :
                        "<none>";
        @ops[this->fi_mount, probefunc] = count();
}

tick-1s
{
        printf("Number of operations\n");
        printf("%-40s %-18s %20s\n", "FILESYSTEM", "OPERATIONS", "COUNTS");
        printa("%-40s %-18s %20@d\n", @ops);
        printf("\nBytes read or write\n");
        printf("%-40s %-18s %20s\n", "FILESYSTEM", "OPERATIONS", "BYTES");
        printa("%-40s %-18s %20@d\n", @bytes);
        printf("\n--------------------------------------------------\n\n");
}


I'm totally new to dtrace so I have no idea if any of the output is correct (it looks correct). This has all the raw output that could be used for graphing individual filesystem performance.
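(If anyone wants to try it: save it as something like vfsstat.d and run it with dtrace. The filename is just what I used.)

Code:
# Either make it executable and run it directly...
chmod +x vfsstat.d
./vfsstat.d

# ...or pass it to dtrace explicitly
dtrace -s vfsstat.d
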
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Bumping this because scrubs started killing performance again. Setting scrub_delay doesn't seem to have any effect. There will be periods where disk IO is clearly higher (with scrub_delay at 4) but not totally slamming the disks. Then for some reason the scrub will consume 100% of available IO, which takes the site down (serving files over NFS is the ONLY thing this server does). If I pause the scrub, everything goes back to normal, and oddly, when I resume it, it seems to be fine again. Also, if I just leave it be, things go back to normal after an hour or two (but the site is down in the meantime).
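(For anyone else hitting this, pausing and resuming is just zpool scrub with the pause flag; the pool name is an example.)

Code:
# Pause the running scrub
zpool scrub -p tank

# Resume it later -- it picks up where it left off
zpool scrub tank

# Check progress and whether it's paused
zpool status tank
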

Scrub performance is so erratic. Last month it seemed to work fine; this month it has been alternating between working OK and slamming the disks to 100%. I don't get it.

Does anyone know what I can do to fix this? Setting scrub_delay to a much higher value doesn't seem to do anything. Something appears to be going on outside of normal, expected behavior, and I can't figure out what.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
OK, so having scrub_delay set to 4 was fine for the most part, but we were seeing unexpected periods of extremely high latency on the web servers (Apache/PHP serving files from the FreeNAS server), so I increased it to 8. This is the impact that change had. You can see how the load had been low aside from a few spikes, but then for some reason it steadily increased to over 100 until I increased scrub_delay to 8 (where the load drops off):

[Attachment 1570037551138.png: web server load average over time]


The spikes appear to be directly related to the scrub. If I pause it, the load immediately goes down. Nothing else is running at this time and web server traffic is what it normally is.

Here you can see when the scrub starts and the impact this had. The beginning of the scrub is where scrub_delay appears to be getting completely ignored:

[Attachment 1570037709144.png: load graph showing the start of the scrub]


Tuesday at 4am EST is when the scrub started and caused load to climb to over 100, which took the site down. Here's a corresponding IO graph for one of the data disks:

[Attachment 1570038180858.png: IO graph for one of the data disks]


And this is when I set scrub_delay to 8:

[Attachment 1570037100859.png: graph after setting scrub_delay to 8]


Nothing too surprising here, I guess, but what I don't get is why latency spikes while IO stays around 300ish with scrub_delay set to 4. I would expect IO to be low when latency is low and to top out around 300 when latency is high, but that doesn't appear to be the case.

It's as if the scrub gradually starves everything else of IO. Setting scrub_delay to 8 seems to be working well for now, so I'll leave it at that (but I wouldn't be surprised if latency spikes again).

I had scrub_delay set to 40 for a long time there and scrubs were still causing problems. While troubleshooting this, I removed all the autotune and other custom settings that were put in back in 2017 when I first installed these servers and tried to get everything back to a baseline.

I believe there is something that happens at the beginning of the scrub (metadata gets read or something?), and this process appears to be the main problem. It doesn't seem to respect scrub_delay. I can't find any good documentation on this, but I'll keep searching.
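(The next thing I plan to try is the ZFS I/O scheduler's per-vdev limits on concurrent scrub reads, which are separate from scrub_delay. I haven't confirmed how much difference they make on 11.2, so treat these as an experiment rather than a known fix.)

Code:
# Current per-vdev scrub queue limits
sysctl vfs.zfs.vdev.scrub_max_active vfs.zfs.vdev.scrub_min_active

# Example: allow only one concurrent scrub I/O per vdev
sysctl vfs.zfs.vdev.scrub_max_active=1
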

If anyone has any input on this I'd sure appreciate it. For months now I've been pulling my hair out trying to figure out what's going on with scrubs and how to fix it.
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
If you've really sunk that many man hours into it, perhaps you could contract iX : https://www.ixsystems.com/support/

I would if I could, but their support requires a server built by them. Even if that were a possibility, it would be way outside our budget (we've gotten quotes in the past).

Either way, I'm setting up a new server now and moving everything over to FreeBSD (for a variety of reasons). The current system is at nearly 80% usage and 37% fragmentation, so perhaps that has something to do with it, though I'm skeptical since we never really had any issues with scrubs until 11.2. We'll see what happens when I get everything moved over (assuming the scrub code is the same in FreeBSD).
 