ZFS lives and dies by its caching. The performance of ZFS varies widely based on whether a system is running out of cache or off the disks. Sometimes you are interested in how your disk subsystem can perform; other times you are interested in how ZFS itself is performing. The following paragraphs detail how to get both types of telemetry.
The key to getting disk subsystem numbers is blowing out the read cache. You need to exhaust system resources so that ZFS is forced to hit the disks. This is easier to do on low-resource systems; blowing out the read cache on a system with 192 GB of RAM and 600 GB of L2ARC is non-trivial.
zpool iostat only tells you about disk I/O, not whether anything is being satisfied from cache, so it can't be trusted in this case. gstat and iostat are in the same boat.
In /usr/local/www/freenasUI/tools there is a utility called arcstat.py. This tool can tell you all about the ARC and L2ARC.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# arcstat.py -f read,hits,miss,hit%,arcsz,l2read,l2hits,l2miss,l2hit%,l2size,l2bytes
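Under the hood arcstat.py is just reading the ZFS kstat counters, so if the script isn't handy you can query the same numbers directly with sysctl (a minimal sketch using the stock FreeBSD kstat names; the counters are cumulative since boot, so sample twice and diff to get per-interval rates).

Code:
# raw ARC hit/miss counters; arcstat.py derives its columns from these
[root@freenas] ~# sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses kstat.zfs.misc.arcstats.size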
If you don't have an L2ARC, leave off the l2arc columns when running arcstat.py.
Here's a sample from a box that is hitting cache quite well.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# ./arcstat.py -f read,hits,miss,hit%,arcsz 1
    read  hits  miss  hit%  arcsz
       0     0     0     0   9.0G
    2.9K  2.9K     0   100   9.0G
     179   179     0   100   9.0G
       5     5     0   100   9.0G
     471   471     0   100   9.0G
     182   182     0   100   9.0G
      14    14     0   100   9.0G
     505   505     0   100   9.0G
     157   157     0   100   9.0G
Code:
[root@freenas] /usr/local/www/freenasUI/tools# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0    142      0   727K
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
That's the corresponding zpool iostat output; indeed, the disks aren't working at all. The ARC still satisfied nearly 3K IOPS, which the underlying disk subsystem couldn't handle on its own. (In this case the pool is four 7200 RPM SATA drives in RAIDZ.)
Notice the disks aren't doing anything, even though the system is under read load.
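If you want a second opinion at the device level, gstat shows the same thing, since (as noted above) it only sees I/O that actually reaches the disks:

Code:
# per-provider I/O; during a cache-served workload the da/ada devices sit at or near zero
[root@freenas] ~# gstat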
If we want to test the disk performance, we need to blow out the ARC.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# sysctl vfs.zfs.arc_max
vfs.zfs.arc_max: 11377840128
OK, so the largest the ARC can get on this system is ~11 GB, and the pool has no L2ARC at all, so I can use a 24 GB test file to blow out the cache and force the system onto the disks.
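Just to sanity-check the arithmetic (a quick sketch; the "more than twice the ARC" margin is simply the ratio of the test size to arc_max, not anything iozone requires):

Code:
# arc_max = 11377840128 bytes, roughly 10.6 GiB (integer division below shows 10);
# with no L2ARC, a 24 GB test file is more than twice what the cache can hold
[root@freenas] ~# echo "11377840128 / 1024 / 1024 / 1024" | bc
10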
Code:
[root@freenas] /usr/local/www/freenasUI/tools# iozone -a -s 24g -r 4096
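As an aside, if the full automatic run takes too long, iozone can be limited to specific tests with its -i flags (a sketch, not part of the original run; 0 = write/rewrite, 1 = read/reread, 2 = random read/write, and test 0 has to be included so the file exists to be read):

Code:
# same file and record size, but only the write, sequential read, and random tests
[root@freenas] /usr/local/www/freenasUI/tools# iozone -s 24g -r 4096 -i 0 -i 1 -i 2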
I'll let that crunch away while watching zpool iostat and arcstat in other windows.
Code:
[root@freenas] ~# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.82T  1.81T      6     46   153K   613K
tank        1.82T  1.81T      3  1.47K   511K   149M
tank        1.82T  1.81T      0  1.63K      0   159M
tank        1.82T  1.81T     18  1.25K   164K   125M
tank        1.82T  1.81T     13  1.27K   162K   126M
tank        1.82T  1.81T      4  1.46K   136K   175M
tank        1.82T  1.81T      7  1.39K  12.5K   165M
tank        1.82T  1.81T     14  1.38K   145K   142M
tank        1.82T  1.81T     13  1.23K  19.0K   119M
tank        1.82T  1.81T     20  1.36K  38.9K   132M
The first test is a write test. The system is doing about 130 MB/sec, or roughly 8 GB/min, so I can expect this phase of the test to last three minutes or so.
The second test is rewrite, which takes another three minutes or so...
And then the read test starts.
Code:
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:48:48  1.4K   551     39   242   22   309   99    41    6   8.5G  8.5G
09:48:49  1.9K   864     45   164   16   700   79    41    7   8.5G  8.5G
09:48:50  1.0K   289     27    29    6   260   45     7    2   8.5G  8.5G
09:48:51  1.2K   472     39    97   16   375   62    16    7   8.5G  8.5G
09:48:52  1.7K   768     46   135   14   633   83    24    5   8.5G  8.5G
09:48:53  1.6K   499     31   161   15   338   62    25    3   8.5G  8.5G
09:48:54  2.2K  1.0K     46   175   13   834   99    20    2   8.5G  8.5G
09:48:55  1.3K   441     32   139   13   302   86    20    3   8.5G  8.5G
09:48:56  2.2K   494     22   210   11   284  100    33    2   8.5G  8.5G
09:48:57  2.2K  1.2K     53   338   24   843   99    86    8   8.5G  8.5G
09:48:58  1.5K   726     49   232   27   494   82    60   34   8.5G  8.5G
09:48:59  1.3K   684     53   233   30   451   85    38    9   8.5G  8.5G
09:49:00  2.6K   974     37    96    5   878   89    14    1   8.5G  8.5G
09:49:01  1.3K   656     50   178   22   478   92    26    6   8.5G  8.5G
Looking at the read columns you can see that I didn't do a good enough job blowing the read cache out: the average miss percentage was around 40, so over half of the reads still came from cache. (In arcstat's output, dmis/pmis/mmis are demand, prefetch, and metadata misses, and c is the ARC's current target size.) Well, that's sequential read; maybe random read will be better.
Code:
      KB  reclen |  write | rewrite |   read | reread | random read | random write
25165824    4096 | 141809 |  139618 | 285847 | 297634 |       52813 |       181362
It's on to the random tests now:
Code:
[root@freenas] /usr/local/www/freenasUI/tools# ./arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:55:13    16    16    100    16  100     0    0     0    0   8.5G  8.5G
09:55:14  1.6K  1.3K     80   461   63   806   93     0    0   8.5G  8.5G
09:55:15  1.3K   844     62   546   59   298   70     0    0   8.5G  8.5G
09:55:16  1.2K   745     60   330   43   415   88     0    0   8.5G  8.5G
09:55:17  1.8K  1.4K     75   467   52   889   97     0    0   8.5G  8.5G
09:55:18  1.9K  1.2K     61   686   49   466   99     0    0   8.5G  8.5G
09:55:19  1.5K  1.1K     69   564   57   501   89     0    0   8.5G  8.5G
09:55:20  1.4K  1.1K     83   206   50   941   97     0    0   8.5G  8.5G
09:55:21  1.3K   805     59   418   59   387   60     0    0   8.5G  8.5G
09:55:22  1.4K   903     66   698   70   205   56     0    0   8.5G  8.5G
09:55:23  1.6K   877     56   209   24   668   94     0    0   8.5G  8.5G
09:55:24  1.6K  1.1K     70   389   52   737   85     0    0   8.5G  8.5G
09:55:25  1.2K   738     64   210   36   528   90     0    0   8.5G  8.5G
09:55:26  1.0K   889     85   492   76   397  100     0    0   8.5G  8.5G
As you can see, the miss percentages are higher, but a sizeable chunk of the reads is still being served from cache.
What this tells us is that we either need a larger test size (the -s 24g parameter to iozone) or we need to remove RAM from the system. Increasing the test size will also increase the time to run the tests, because the write tests have to complete first... and then you have to ask yourself: since ZFS is so effective at caching, do you really care what the performance is without the cache?
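Rather than physically pulling RAM, another option is to cap the ARC with the vfs.zfs.arc_max tunable (on FreeNAS, add it as a loader Tunable in the GUI or set it in /boot/loader.conf and reboot). The 2 GiB value below is just an example:

Code:
# /boot/loader.conf -- limit the ARC to 2 GiB (value is in bytes) so tests hit disk sooner
vfs.zfs.arc_max="2147483648"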
Code:
[root@freenas] /usr/local/www/freenasUI/tools# iozone -a -s 1g -r 4096
     KB  reclen |  write | rewrite |    read |  reread | random read | random write
1048576    4096 | 459033 | 2928320 | 4709995 | 4615880 |     4606312 |      3561581
arcstat shows no ARC misses during that test, so you can see it satisfied the reads entirely from cache. Look at random read: nearly 4.5 GB/sec. That's a level of performance that would take hundreds of disks to satisfy if caching weren't involved.
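To put that random read number in perspective (the per-drive figure is my own ballpark assumption, a 7200 RPM SATA drive typically manages on the order of one to two hundred random IOPS, not a measurement from this box):

Code:
# 4,606,312 KB/sec of 4 KB records is over a million random read IOPS
[root@freenas] ~# echo "4606312 / 4" | bc
1151578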
Another tool included in FreeNAS is xdd. xdd can give you IOPS numbers as well as latency numbers.
I'll post about that in a separate article.
Since ARC size is so critical, there is a tool to help determine sizing.
Code:
/usr/local/www/freenasUI/tools/arc_summary.py
ARC Size:                               57.82%  6.13    GiB
        Target Size: (Adaptive)         82.18%  8.71    GiB
        Min Size (Hard Limit):          12.50%  1.32    GiB
        Max Size (High Water):          8:1     10.60   GiB
If the target size is banging into the max size limit, the ARC would likely grow larger if there were more headroom.
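The same sizing numbers are also available as raw sysctls if you'd rather watch them over time (standard FreeBSD ZFS kstat names; all values are in bytes):

Code:
# size = current ARC size, c = adaptive target, c_min/c_max = the hard limits
[root@freenas] ~# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.c_min kstat.zfs.misc.arcstats.c_max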