ZFS lives and dies by its caching. The performance of ZFS varies widely based on whether a system is running out of cache or off the disks. Sometimes you are interested in how your disk subsystem can perform; other times you are interested in how ZFS itself is performing. The following paragraphs detail how to get both types of telemetry.
The key to getting disk subsystem numbers is blowing out the read cache. You need to exhaust system resources so that ZFS is forced to hit the disks. This is easier to do on low-resource systems; blowing out the read cache on a system with 192 GB of RAM and 600 GB of L2ARC is non-trivial.
zpool iostat only tells you about disk I/O, not whether anything is being satisfied from cache, so it can't be trusted in this case. gstat and iostat are in the same boat.
In /usr/local/www/freenasUI/tools there is a utility called arcstat.py. This tool can tell you all about the ARC and L2ARC.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# arcstat.py -f read,hits,miss,hit%,arcsz,l2read,l2hits,l2miss,l2hit%,l2size,l2bytes
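Under the hood arcstat.py is just reading the ZFS kstat counters, so if the script isn't handy you can query the same numbers directly with sysctl (a minimal sketch using the stock FreeBSD kstat names; the counters are cumulative since boot, so sample twice and diff to get per-interval rates).

Code:
# raw ARC hit/miss counters; arcstat.py derives its columns from these
[root@freenas] ~# sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses kstat.zfs.misc.arcstats.size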
If you don't have an L2ARC, leave off the l2arc columns when running arcstat.py.
Here's a sample from a box that is hitting cache quite well.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# ./arcstat.py -f read,hits,miss,hit%,arcsz 1
    read  hits  miss  hit%  arcsz
       0     0     0     0   9.0G
    2.9K  2.9K     0   100   9.0G
     179   179     0   100   9.0G
       5     5     0   100   9.0G
     471   471     0   100   9.0G
     182   182     0   100   9.0G
      14    14     0   100   9.0G
     505   505     0   100   9.0G
     157   157     0   100   9.0G
Code:
[root@freenas] /usr/local/www/freenasUI/tools# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0    142      0   727K
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
tank        1.81T  1.81T      0      0      0      0
That's the corresponding zpool iostat output; indeed, the disks aren't working at all. The ARC still satisfied nearly 3K IOPS, which the underlying disk subsystem couldn't handle on its own. (In this case the pool is four 7200 RPM SATA drives in RAIDZ.)
Notice the disks aren't doing anything, even though the system is under read load.
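If you want a second opinion at the device level, gstat shows the same thing, since (as noted above) it only sees I/O that actually reaches the disks:

Code:
# per-provider I/O; during a cache-served workload the da/ada devices sit at or near zero
[root@freenas] ~# gstat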
If we want to test the disk performance, we need to blow out the ARC.
Code:
[root@freenas] /usr/local/www/freenasUI/tools# sysctl vfs.zfs.arc_max
vfs.zfs.arc_max: 11377840128
OK, so the largest the ARC can get on this system is ~11 GB, and the pool has no L2ARC at all, so I can use a 24 GB test file to blow out the cache and force the system onto the disks.
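Just to sanity-check the arithmetic (a quick sketch; the "more than twice the ARC" margin is simply the ratio of the test size to arc_max, not anything iozone requires):

Code:
# arc_max = 11377840128 bytes, roughly 10.6 GiB (integer division below shows 10);
# with no L2ARC, a 24 GB test file is more than twice what the cache can hold
[root@freenas] ~# echo "11377840128 / 1024 / 1024 / 1024" | bc
10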
Code:
[root@freenas] /usr/local/www/freenasUI/tools# iozone -a -s 24g -r 4096
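As an aside, if the full automatic run takes too long, iozone can be limited to specific tests with its -i flags (a sketch, not part of the original run; 0 = write/rewrite, 1 = read/reread, 2 = random read/write, and test 0 has to be included so the file exists to be read):

Code:
# same file and record size, but only the write, sequential read, and random tests
[root@freenas] /usr/local/www/freenasUI/tools# iozone -s 24g -r 4096 -i 0 -i 1 -i 2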
I'll let that crunch away while watching zpool iostat and arcstat in other windows.
Code:
[root@freenas] ~# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.82T  1.81T      6     46   153K   613K
tank        1.82T  1.81T      3  1.47K   511K   149M
tank        1.82T  1.81T      0  1.63K      0   159M
tank        1.82T  1.81T     18  1.25K   164K   125M
tank        1.82T  1.81T     13  1.27K   162K   126M
tank        1.82T  1.81T      4  1.46K   136K   175M
tank        1.82T  1.81T      7  1.39K  12.5K   165M
tank        1.82T  1.81T     14  1.38K   145K   142M
tank        1.82T  1.81T     13  1.23K  19.0K   119M
tank        1.82T  1.81T     20  1.36K  38.9K   132M
The first test is a write test. The system is doing about 130 MB/sec, or roughly 8 GB/min, so I can expect this phase of the test to last three minutes or so.
The second test is rewrite, which takes another three minutes or so...
And then the read test starts.
Code:
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:48:48  1.4K   551     39   242   22   309   99    41    6   8.5G  8.5G
09:48:49  1.9K   864     45   164   16   700   79    41    7   8.5G  8.5G
09:48:50  1.0K   289     27    29    6   260   45     7    2   8.5G  8.5G
09:48:51  1.2K   472     39    97   16   375   62    16    7   8.5G  8.5G
09:48:52  1.7K   768     46   135   14   633   83    24    5   8.5G  8.5G
09:48:53  1.6K   499     31   161   15   338   62    25    3   8.5G  8.5G
09:48:54  2.2K  1.0K     46   175   13   834   99    20    2   8.5G  8.5G
09:48:55  1.3K   441     32   139   13   302   86    20    3   8.5G  8.5G
09:48:56  2.2K   494     22   210   11   284  100    33    2   8.5G  8.5G
09:48:57  2.2K  1.2K     53   338   24   843   99    86    8   8.5G  8.5G
09:48:58  1.5K   726     49   232   27   494   82    60   34   8.5G  8.5G
09:48:59  1.3K   684     53   233   30   451   85    38    9   8.5G  8.5G
09:49:00  2.6K   974     37    96    5   878   89    14    1   8.5G  8.5G
09:49:01  1.3K   656     50   178   22   478   92    26    6   8.5G  8.5G
Looking at the read columns you can see that I didn't do a good enough job blowing the read cache out: the average miss percentage was around 40, so over half of the reads still came from cache. (In arcstat's output, dmis/pmis/mmis are demand, prefetch, and metadata misses, and c is the ARC's current target size.) Well, that's sequential read; maybe random read will be better.
Code:
      KB  reclen |  write | rewrite |   read | reread | random read | random write
25165824    4096 | 141809 |  139618 | 285847 | 297634 |       52813 |       181362
It's on to the random tests now:
Code:
[root@freenas] /usr/local/www/freenasUI/tools# ./arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:55:13    16    16    100    16  100     0    0     0    0   8.5G  8.5G
09:55:14  1.6K  1.3K     80   461   63   806   93     0    0   8.5G  8.5G
09:55:15  1.3K   844     62   546   59   298   70     0    0   8.5G  8.5G
09:55:16  1.2K   745     60   330   43   415   88     0    0   8.5G  8.5G
09:55:17  1.8K  1.4K     75   467   52   889   97     0    0   8.5G  8.5G
09:55:18  1.9K  1.2K     61   686   49   466   99     0    0   8.5G  8.5G
09:55:19  1.5K  1.1K     69   564   57   501   89     0    0   8.5G  8.5G
09:55:20  1.4K  1.1K     83   206   50   941   97     0    0   8.5G  8.5G
09:55:21  1.3K   805     59   418   59   387   60     0    0   8.5G  8.5G
09:55:22  1.4K   903     66   698   70   205   56     0    0   8.5G  8.5G
09:55:23  1.6K   877     56   209   24   668   94     0    0   8.5G  8.5G
09:55:24  1.6K  1.1K     70   389   52   737   85     0    0   8.5G  8.5G
09:55:25  1.2K   738     64   210   36   528   90     0    0   8.5G  8.5G
09:55:26  1.0K   889     85   492   76   397  100     0    0   8.5G  8.5G
As you can see, the miss percentages are higher, but a sizeable chunk of the reads is still being served from cache.
What this tells us is that we either need a larger test size (the -s 24g parameter to iozone) or we need to remove RAM from the system. Increasing the test size will also increase the time to run the tests, because the write tests have to complete first... and then you have to ask yourself: since ZFS is so effective at caching, do you really care what the performance is without the cache?
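Rather than physically pulling RAM, another option is to cap the ARC with the vfs.zfs.arc_max tunable (on FreeNAS, add it as a loader Tunable in the GUI or set it in /boot/loader.conf and reboot). The 2 GiB value below is just an example:

Code:
# /boot/loader.conf -- limit the ARC to 2 GiB (value is in bytes) so tests hit disk sooner
vfs.zfs.arc_max="2147483648"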
Code:
[root@freenas] /usr/local/www/freenasUI/tools# iozone -a -s 1g -r 4096
     KB  reclen |  write | rewrite |    read |  reread | random read | random write
1048576    4096 | 459033 | 2928320 | 4709995 | 4615880 |     4606312 |      3561581
arcstat shows no ARC misses during that test, so you can see it satisfied the reads entirely from cache. Look at random read: nearly 4.5 GB/sec. That's a level of performance that would take hundreds of disks to satisfy if caching weren't involved.
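To put that random read number in perspective (the per-drive figure is my own ballpark assumption, a 7200 RPM SATA drive typically manages on the order of one to two hundred random IOPS, not a measurement from this box):

Code:
# 4,606,312 KB/sec of 4 KB records is over a million random read IOPS
[root@freenas] ~# echo "4606312 / 4" | bc
1151578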
Another tool included in FreeNAS is xdd. xdd can give you IOPS numbers as well as latency numbers.
I'll post about that in a separate article.
Since ARC size is so critical, there is a tool to help determine sizing.
Code:
/usr/local/www/freenasUI/tools/arc_summary.py
ARC Size:                               57.82%  6.13    GiB
        Target Size: (Adaptive)         82.18%  8.71    GiB
        Min Size (Hard Limit):          12.50%  1.32    GiB
        Max Size (High Water):          8:1     10.60   GiB
If the target size is banging into the max size limit, the ARC would likely grow larger if there were more headroom.
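The same sizing numbers are also available as raw sysctls if you'd rather watch them over time (standard FreeBSD ZFS kstat names; all values are in bytes):

Code:
# size = current ARC size, c = adaptive target, c_min/c_max = the hard limits
[root@freenas] ~# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.c_min kstat.zfs.misc.arcstats.c_max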