js_level2
I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic, since it is not well understood by many people. In short, there are 3 things that you need for a "great" SLOG:
1. Fast throughput
2. Long lifespan with regards to 100% write workloads
3. Fast cache flushes
SLOG workloads are a bit unique. So what goes into an SLOG?
Only sync writes will actually go to an SLOG. A sync write is a write that cannot be acknowledged as complete until the data has been committed to non-volatile storage. ESXi is particularly rough on this because its NFS implementation defaults to making all writes sync writes, as it cannot differentiate between important file system writes inside a VM and some file update that may not be important. This conservatism is what will protect your VMs from corruption if something goes wrong. For this reason, it is typically not a good idea to try to make writes asynchronous.
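As a rough illustration of what "cannot be acknowledged until it is on non-volatile storage" costs, here is a minimal Python sketch. It is not taken from any ZFS or NFS code. It writes the same block to a scratch file twice, once buffered and once followed by os.fsync(), and times each. The helper name timed_write and the file name are made up for the example, and the numbers you get depend entirely on the filesystem the scratch directory lives on.

```python
import os
import tempfile
import time


def timed_write(path, data, sync):
    """Write data to path and return the elapsed time in microseconds.

    With sync=True the call does not return until os.fsync() reports
    that the data has been pushed to stable storage.
    """
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        if sync:
            os.fsync(fd)  # block until the OS says the data is durable
    finally:
        os.close(fd)
    return (time.perf_counter() - start) * 1e6


if __name__ == "__main__":
    block = b"\0" * 4096
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "sync_demo.bin")
        print("buffered: %.1f usec" % timed_write(target, block, sync=False))
        print("synced:   %.1f usec" % timed_write(target, block, sync=True))
```

The gap between the two timings is the part of the write path that a sync-heavy client such as ESXi over NFS pays on every write.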
Samba itself has no sync write function, so you would have to handle that on the ZFS layer. iSCSI supports it, but I have not seen any iSCSI clients that use the sync flag yet. I'm not sure about AFP as it is slowly dying and is less often used with each passing day.
I'll be focusing mostly on NFS sync writes because that's where many people's problems lie.
NFS prior to 11.1 let you use FHA (file handle association). This would have 2 effects:
1. Speed up reads from NFS shares.
2. Slow down writes to NFS shares.
The reverse happens if you disable FHA.
As you could only turn on FHA for the entire NFS service, you had to pick the one you cared about, and the other was a situation where you "got what you got". The sysctl was vfs.nfsd.fha.enable.
Starting with 11.1, there are new sysctls (thanks to @mav@ at iX for the changes!):
vfs.nfsd.fha.write
vfs.nfsd.fha.read
These allow you to choose the best of both worlds if you desire and if your workloads use FHA properly.
Anyway, to get back on topic: SLOG writes hit the SSD in a way that ordinary workloads don't, so benchmarking a SLOG candidate requires a very specific test. Here's how to do it:
1. Make sure the disk is not partitioned.
2. Make sure no data is on the disk. This test is DESTRUCTIVE.
3. Make sure the disk is not in use by anything else (gstat can be your friend here).
This should only be used on SSDs, and this test is only useful for SLOG workloads. If you are not planning to use the disk as an SLOG, then this test is not going to give you results that are truly useful. It doesn't tell you how good or bad a spinning disk is, nor how good an L2ARC the device would make... only how it will do as an SLOG.
The command to test:
# diskinfo -wS /dev/XXX
Here's a test on my system with one of the old (10+ year old) Sun Microsystem ZFS PCIe accelerator cards:
Code:
root@freenas:~ # diskinfo -wS /dev/da24
/dev/da24
        512             # sectorsize
        24575868928     # mediasize in bytes (23G)
        47999744        # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        2987            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA MARVELL SD88SA02    # Disk descr.
        0946M01NAP      # Disk ident.
        Not_Zoned       # Zone Mode

Synchronous random writes:
         0.5 kbytes:   9990.9 usec/IO =      0.0 Mbytes/s
           1 kbytes:   9977.0 usec/IO =      0.1 Mbytes/s
           2 kbytes:   9986.2 usec/IO =      0.2 Mbytes/s
           4 kbytes:   9977.1 usec/IO =      0.4 Mbytes/s
           8 kbytes:   9991.1 usec/IO =      0.8 Mbytes/s
          16 kbytes:  10002.9 usec/IO =      1.6 Mbytes/s
          32 kbytes:   9854.5 usec/IO =      3.2 Mbytes/s
          64 kbytes:   9982.3 usec/IO =      6.3 Mbytes/s
         128 kbytes:   7868.0 usec/IO =     15.9 Mbytes/s
         256 kbytes:   9977.6 usec/IO =     25.1 Mbytes/s
         512 kbytes:   9982.8 usec/IO =     50.1 Mbytes/s
        1024 kbytes:  10029.8 usec/IO =     99.7 Mbytes/s
        2048 kbytes:  19998.0 usec/IO =    100.0 Mbytes/s
        4096 kbytes:  39990.5 usec/IO =    100.0 Mbytes/s
        8192 kbytes:  70000.2 usec/IO =    114.3 Mbytes/s
The important thing for sync writes is the latency in microseconds; lower is better. (Hint: Those numbers are relatively terrible!) The throughput does matter, however, as it is potentially going to be your bottleneck if you start getting large amounts of sync writes.
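If you want to compare results from several devices, the "Synchronous random writes" lines are easy to pull apart. The following is a small Python sketch, not part of diskinfo, that parses a few of the lines from the output above and derives IOPS and throughput from the latency. The regular expression assumes the line layout shown in this post; the names SAMPLE, LINE, and parse_sync_writes are made up for the example.

```python
import re

# A few lines copied from the diskinfo -wS output in this post.
SAMPLE = """\
   0.5 kbytes:   9990.9 usec/IO =      0.0 Mbytes/s
     4 kbytes:   9977.1 usec/IO =      0.4 Mbytes/s
   128 kbytes:   7868.0 usec/IO =     15.9 Mbytes/s
  1024 kbytes:  10029.8 usec/IO =     99.7 Mbytes/s
  8192 kbytes:  70000.2 usec/IO =    114.3 Mbytes/s
"""

# block size, latency per IO, and the throughput diskinfo printed
LINE = re.compile(
    r"([\d.]+)\s+kbytes:\s+([\d.]+)\s+usec/IO\s+=\s+([\d.]+)\s+Mbytes/s"
)


def parse_sync_writes(text):
    """Return a list of dicts, one per block size found in text."""
    rows = []
    for size_kb, usec, mbps in LINE.findall(text):
        size_kb, usec, mbps = float(size_kb), float(usec), float(mbps)
        rows.append({
            "kbytes": size_kb,
            "usec_per_io": usec,
            "reported_mbps": mbps,
            # Derived: treating the writes as one at a time, IOPS = 1 s / latency.
            "iops": 1e6 / usec,
            # Derived: megabytes per IO divided by seconds per IO.
            "computed_mbps": (size_kb / 1024.0) / (usec / 1e6),
        })
    return rows


if __name__ == "__main__":
    for row in parse_sync_writes(SAMPLE):
        print("%7.1f kbytes  %9.1f usec/IO  %7.1f IOPS  %6.1f MB/s" % (
            row["kbytes"], row["usec_per_io"], row["iops"],
            row["computed_mbps"]))
```

For the card above, roughly 10,000 usec/IO works out to about 100 sync writes per second at every block size up to 1024 kbytes, which is why the small-block throughput figures are so low.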
dd is really just not a suitable alternative, as the disk has to be given the flush commands and the time the flush command takes to execute is important.
I'd like to compile a list of devices out there and their outputs. If people can test their devices and post the results along with the model number, disk size, and firmware version, that would help people shopping for a disk see what works and compare it against current prices. That info can be obtained by looking at the output of smartctl -a /dev/XXX.
If we get enough data points, I'll try to put a table or something on here for easy access.
Thanks and happy NASing!