jpaetzel
Guest
ZFS has a variable blocksize, with a default of 128K. This means you should test with 128K blocks. If you test with a smaller blocksize using a tool that does not create discrete files (iozone, for example), ZFS will be tricked into creating the large files used for the test with its biggest blocksize, 128K.
This causes issues when testing, as ZFS is forced to do large read-modify-write operations to satisfy the test's requests. For instance, doing a random write of a 4K block when the recordsize is set to 128K means that ZFS has to read in a 128K block, modify it, then write the 128K block back out.
Here's an example of an iozone test using 128K blocks, with the default ZFS recordsize of 128K and an NFS mount using 128K:
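(The original post doesn't show the exact invocation; a throughput run along these lines would produce output like the below. The per-thread file size and mount path are assumptions.)

cd /mnt/bench            # the NFS mount; path is an assumption
iozone -t 8 -s 10g -r 128k -i 0 -i 1 -i 2    # 8 threads, 10GB files, 128K records; write, read, random tests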
Children see throughput for 8 initial writers = 379307.77 KB/sec
Children see throughput for 8 readers = 954874.97 KB/sec
Children see throughput for 8 random readers = 656026.40 KB/sec
Children see throughput for 8 random writers = 301620.48 KB/sec
Writes are bottlenecked by the SSD ZIL. Sequential reads with readahead saturate the 10GbE link. The high random read and write numbers tell you this is an SSD pool.
Now I'll request 4K blocks from iozone, but leave the ZFS recordsize and NFS mount size at 128K:
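(Same sketch as before, with only the record size changed; the file size is still an assumption.)

iozone -t 8 -s 10g -r 4k -i 0 -i 1 -i 2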
Children see throughput for 8 initial writers = 59323.97 KB/sec
Children see throughput for 8 readers = 966890.23 KB/sec
Children see throughput for 8 random readers = 90986.52 KB/sec
Children see throughput for 8 random writers = 15234.34 KB/sec
Write speed tanks, though not for lack of trying. The pipe is nearly saturated with TCP traffic, yet only a fraction of that makes it to the disk. Readahead still works, and the sequential read test saturates the pipe. The random tests plummet; however, if you monitor the link during these tests it's saturated in both directions, and the pool is doing over 1GB/sec of reads and writes to satisfy the test. It's like trying to drive 60 MPH with your car in 2nd gear: it's doing a lot more work than it would in overdrive.
Finally, I set the ZFS recordsize to 4K, do a 4K NFS mount, and request
4K blocks from iozone:
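(Roughly, assuming a dataset named tank/bench and a client mount point of /mnt/bench, both hypothetical. Note that recordsize only applies to files written after the change.)

zfs set recordsize=4K tank/bench
mount -t nfs -o rsize=4096,wsize=4096 server:/mnt/tank/bench /mnt/bench
cd /mnt/bench && iozone -t 8 -s 10g -r 4k -i 0 -i 1 -i 2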
Children see throughput for 8 initial writers = 88655.63 KB/sec
Children see throughput for 8 readers = 306055.79 KB/sec
Children see throughput for 8 random readers = 164795.67 KB/sec
Children see throughput for 8 random writers = 71582.13 KB/sec
Sequential write increases, readahead suffers on the sequential read
test, and random read and random write see substantial gains, as the
overhead on the storage drops way down.
Trying to use industry-standard benchmarks on ZFS to test 4K blocks will produce poor results unless you take steps to ensure you don't end up in a pathological case. You'll see a LOT of material on the web about how bad ZFS is at small-block random I/O for this exact reason. Artificially testing 4K access to a 10GB file is very different from actually having 4K files, which lets ZFS vary the blocksize correctly.
Another place where this can bite you is with iSCSI and file extents. Because the default recordsize is 128K and an iSCSI file extent is a giant multi-GB or multi-TB file, ZFS leaves its recordsize at 128K. Then you put a filesystem inside it that invariably uses 4K or 8K blocks, and random performance craters. Zvols avoid this by setting the volblocksize to 8K; for file extents, it's a good idea to set the recordsize of the dataset that will hold the extent to 8K, before you create the extent.
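As a sketch (the dataset and zvol names are assumptions), that means doing the 8K tuning up front:

zfs set recordsize=8K tank/extents                 # before creating the file extent; only affects new files
zfs create -V 100g -o volblocksize=8K tank/vol0    # zvols instead fix the block size at creation time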