jpaetzel
Guest
ZFS has a variable blocksize, with a default of 128K. This means you should test with 128K blocks. If you test with a smaller blocksize using a tool that does not create discrete files (iozone, for example), ZFS will be tricked into creating the large files used for the test with its biggest blocksize, 128K.
This causes issues when testing, as ZFS is forced to do large read-modify-write operations to satisfy the test's requests. For instance, doing a random write of a 4K block when the recordsize is set to 128K means that ZFS has to read in a 128K block, modify it, then write the 128K block back out.
Here's an example of an iozone test using 128K blocks, with the default ZFS recordsize of 128K and an NFS mount using 128K:
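(The original post doesn't show the exact invocation; a throughput run along these lines would produce output like the below. The per-thread file size and mount path are assumptions.)

cd /mnt/bench            # the NFS mount; path is an assumption
iozone -t 8 -s 10g -r 128k -i 0 -i 1 -i 2    # 8 threads, 10GB files, 128K records; write, read, random tests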
Children see throughput for 8 initial writers = 379307.77 KB/sec
Children see throughput for 8 readers = 954874.97 KB/sec
Children see throughput for 8 random readers = 656026.40 KB/sec
Children see throughput for 8 random writers = 301620.48 KB/sec
Writes are bottlenecked by the SSD ZIL. Sequential reads with readahead saturate the 10GbE link. The high random read and write numbers tell you this is an SSD pool.
Now I'll request 4K blocks from iozone, but leave the ZFS recordsize and NFS mount size at 128K:
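(Same sketch as before, with only the record size changed; the file size is still an assumption.)

iozone -t 8 -s 10g -r 4k -i 0 -i 1 -i 2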
Children see throughput for 8 initial writers = 59323.97 KB/sec
Children see throughput for 8 readers = 966890.23 KB/sec
Children see throughput for 8 random readers = 90986.52 KB/sec
Children see throughput for 8 random writers = 15234.34 KB/sec
Write speed tanks, though not for lack of trying. The pipe is nearly saturated with TCP traffic, yet only a fraction of that makes it to the disk. Readahead still works, and the sequential read test saturates the pipe. The random tests plummet; however, if you monitor the link during these tests it's saturated in both directions, and the pool is doing over 1GB/sec of reads and writes to satisfy the test. It's like trying to drive 60 MPH with your car in 2nd gear: it's doing a lot more work than it would in overdrive.
Finally, I set the ZFS recordsize to 4K, do a 4K NFS mount, and request
4K blocks from iozone:
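(Roughly, assuming a dataset named tank/bench and a client mount point of /mnt/bench, both hypothetical. Note that recordsize only applies to files written after the change.)

zfs set recordsize=4K tank/bench
mount -t nfs -o rsize=4096,wsize=4096 server:/mnt/tank/bench /mnt/bench
cd /mnt/bench && iozone -t 8 -s 10g -r 4k -i 0 -i 1 -i 2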
Children see throughput for 8 initial writers = 88655.63 KB/sec
Children see throughput for 8 readers = 306055.79 KB/sec
Children see throughput for 8 random readers = 164795.67 KB/sec
Children see throughput for 8 random writers = 71582.13 KB/sec
Sequential write increases, readahead suffers on the sequential read
test, and random read and random write see substantial gains, as the
overhead on the storage drops way down.
Trying to use industry-standard benchmarks on ZFS to test 4K blocks will produce poor results unless you take steps to ensure you don't end up in a pathological case. You'll see a LOT of material on the web about how bad ZFS is at small-block random I/O for this exact reason. Artificially testing 4K access to a 10GB file is very different from actually having 4K files, which lets ZFS vary the blocksize correctly.
Another place where this can bite you is with iSCSI and file extents. Because the default recordsize is 128K and an iSCSI file extent is a giant multi-GB or multi-TB file, ZFS leaves its recordsize at 128K. Then you put a filesystem inside it that invariably uses 4K or 8K blocks, and random performance craters. Zvols avoid this by setting the volblocksize to 8K; for file extents, it's a good idea to set the recordsize of the dataset that will hold the extent to 8K, before you create the extent.
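As a sketch (the dataset and zvol names are assumptions), that means doing the 8K tuning up front:

zfs set recordsize=8K tank/extents                 # before creating the file extent; only affects new files
zfs create -V 100g -o volblocksize=8K tank/vol0    # zvols instead fix the block size at creation time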