So I have been troubleshooting the awful raw disk write speed (~15MB/s) over the past few days, and it turns out to be a combination of the disabled on-disk cache (the 128M cache on each HDD), FreeBSD's hard limit of 128K max per IO, and the queue depth of one that dd issues: https://forums.freenas.org/index.php?threads/troubleshooting-low-disk-write-speed.70217/
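For context, the raw-device test behind that ~15MB/s number was a plain single-stream dd straight at the disk, something like this (the device name is illustrative, and this overwrites the disk!):
Code:
# single-stream write to the raw device (destroys data on da2!);
# the kernel splits bs=1M into 128K IOs (the MAXPHYS cap) issued one
# at a time, which with the on-disk cache off lands around ~15MB/s
dd if=/dev/zero of=/dev/da2 bs=1M count=1024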
Now with this sorted out, I decided to do some simple testing to see the disk cache's impact on ZFS performance, which I would like to share.
I created two single-disk pools, cacheon (on-disk cache enabled) and cacheoff (on-disk cache disabled), and within each pool two datasets, syncon (sync=always) and syncoff (sync=disabled). Compression is disabled in all datasets.
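For anyone who wants to reproduce the layout, the setup looks roughly like this (a sketch, not my exact commands; da2/da3 match the gstat output below, and the camcontrol mode page edit assumes SCSI/SAS-attached disks):
Code:
# toggle the on-disk write cache: opens the caching mode page (8) in
# an editor; set WCE to 1 on da2 (cacheon) and 0 on da3 (cacheoff)
camcontrol modepage da2 -m 8 -e
camcontrol modepage da3 -m 8 -e

# one single-disk pool per cache setting
zpool create cacheon da2
zpool create cacheoff da3

# two datasets per pool, compression off everywhere
zfs create -o sync=always -o compression=off cacheon/syncon
zfs create -o sync=disabled -o compression=off cacheon/syncoff
zfs create -o sync=always -o compression=off cacheoff/syncon
zfs create -o sync=disabled -o compression=off cacheoff/syncoff
The 1M sequential dd results, with the matching gstat line for each run: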
Code:
root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 33.871098 secs (31700827 bytes/sec)

 L(q)  ops/s    r/s   kBps   ms/r    w/s    kBps   ms/w   %busy Name
    6    306      0      0    0.0    276   31577    2.9    95.9| da2

root@freenas:/mnt/cacheon/syncoff # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.594406 secs (1806411267 bytes/sec)

 L(q)  ops/s    r/s   kBps   ms/r    w/s    kBps   ms/w   %busy Name
    0   1543      0      0    0.0   1543  197442    2.8    91.8| da2

root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 27.064734 secs (39673097 bytes/sec)

 L(q)  ops/s    r/s   kBps   ms/r    w/s    kBps   ms/w   %busy Name
    5    458      0      0    0.0    416   40820   14.0    96.7| da3

root@freenas:/mnt/cacheoff/syncoff # dd if=/dev/zero of=ddtest bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.723659 secs (1483767229 bytes/sec)

 L(q)  ops/s    r/s   kBps   ms/r    w/s    kBps   ms/w   %busy Name
    8    550      0      0    0.0    550   70345   11.9    97.1| da3
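The gstat lines above came from a second shell, with an invocation along these lines (the exact flags and filter regex here are illustrative):
Code:
# watch only the two test disks, refreshing every second
gstat -f 'da[23]$' -I 1s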
The first thing I noticed was that ZFS is smart enough to mitigate the awful raw speed, which is brilliant. With the disk cache disabled and sync=always, it still manages ~2.5x the raw speed by queuing up commands; with sync=disabled, everything gets written into RAM very quickly, and in the background data streams to disk even faster (>4x raw speed).
Now with the disk cache enabled, things get more interesting. With sync=disabled, data again gets dumped into RAM at light speed, and in the background the disk writes at its full potential, thanks to the cache. This could be a benefit if ZFS is subject to heavy sustained sequential writes, like video surveillance? With sync=always, it is surprisingly ~20% slower compared to the disk cache disabled. I can only assume this is because ZFS has to constantly tell the disk to flush its cache, which takes time.
This is more pronounced with a smaller block size. I didn't bother to include sync=disabled because (I believe) if you are facing lots of small IOs, more likely than not the use case calls for sync=always (VMs, databases). I also didn't include the gstat results since, as you can see above, they agree with dd pretty well when sync=always.
Code:
root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=16K count=8K
8192+0 records in
8192+0 records out
134217728 bytes transferred in 89.989479 secs (1491482 bytes/sec)

root@freenas:/mnt/cacheon/syncon # dd if=/dev/zero of=ddtest bs=4K count=8K
8192+0 records in
8192+0 records out
33554432 bytes transferred in 82.307550 secs (407671 bytes/sec)

root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=16K count=8K
8192+0 records in
8192+0 records out
134217728 bytes transferred in 27.925497 secs (4806279 bytes/sec)

root@freenas:/mnt/cacheoff/syncon # dd if=/dev/zero of=ddtest bs=4K count=8K
8192+0 records in
8192+0 records out
33554432 bytes transferred in 23.053595 secs (1455497 bytes/sec)
We are seeing a >300% performance increase from disabling the on-disk cache (16K: 4.8MB/s vs 1.5MB/s; 4K: 1.46MB/s vs 0.41MB/s)! This may be counter-intuitive, but it makes sense if you think about the overhead of flushing the cache after every small-ish IO.
Now some thoughts:
1. Is it safe to enable the on-disk cache? Obviously, if data integrity is at risk, performance is meaningless. AFAIK the cache on an HDD is volatile, which means data stored in it is gone in the event of a power failure. Given that ZFS is clearly flushing the cache for sync writes, I can only imagine it also flushes the cache before considering a transaction group committed to stable storage; however, I don't have a reference for this (though the sysctl noted after this list hints at it).
2. Should one disable the on-disk cache to get better performance with small IOs? The oversimplified test above may suggest so, but anyone in this situation (sync=always, small IOs) might be better off getting a proper SLOG (sketched after this list). My understanding is that a SLOG effectively puts the HDDs in the same situation as sync=disabled, where data is written to them in transaction groups, and the on-disk cache certainly helps with that.
3. About increasing the max IO size: setting it larger than 128K should benefit many cases. I believe this is governed by MAXPHYS, which unfortunately cannot be changed short of recompiling the kernel: http://freebsd.1045724.x6.nabble.com/Time-to-increase-MAXPHYS-td6189400.html. I am not going to do it for fear of breaking something, but it should be doable (see the sketch below).
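On thought 1, one hint that the flush is baked into ZFS's commit logic: legacy FreeBSD ZFS exposes a sysctl for exactly this, which only makes sense if ZFS flushes on every commit (assuming I'm reading the knob right; leave it at 0):
Code:
# 0 (default) = ZFS issues a cache flush to the disks at transaction
# group commit and ZIL writes; 1 skips it (unsafe with volatile caches)
sysctl vfs.zfs.cache_flush_disable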
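On thought 2, adding a SLOG is a one-liner once you have a suitable power-loss-protected SSD (device name hypothetical):
Code:
# dedicated ZIL device; sync writes land here instead of on ZIL blocks
# on the data disks, which then only see transaction-group-sized writes
zpool add cacheon log da4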
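And on thought 3, for the record, the recompile route would look something like this (untested by me, pieced together from the thread above):
Code:
# in a custom kernel config (a copy of GENERIC), raise the per-IO cap:
#   options  MAXPHYS=(1024*1024)
# then rebuild and install the kernel:
cd /usr/src
make buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL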
I hope you find this interesting. Any input is appreciated; even a simple confirmation of my thoughts would help me build confidence as a n00b in FreeNAS/BSD/ZFS.