SLOG bottleneck on sync writes with smaller block sizes


xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Quick Summary: ESXi NFS sequential writes seem to be capped at 250 MB/s when talking to a beefy NFS pool over 10GbE.

NOTE: This isn't the usual "why are sync writes slow" question. At least I don't think so.

Box #1
VMware ESXi 6.0 U2
Supermicro X9SRE-F, E5-1650V2, 16GB ECC memory
Intel X710-DA4 10GbE
1 local SSD with 1 test VM VMDK

Box #2
FreeNAS 9.10.2 U2
Supermicro X9SRE-F, E5-1650V2, 64GB ECC memory
Intel X710-DA4 10GbE
1 pool:
- 3 x Intel DC S3710 200GB striped SLOG
- 3 x Samsung SSD 850 Pro 1TB striped
- atime=off, sync=always
- recordsize: I've tried 16K and 128K (property settings sketched below)
- Local tests show this pool can handle sync writes @ 700 MB/s
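
For reference, the dataset properties above were set with commands along these lines (just a sketch on my part; I'm assuming the pool/dataset is vol1, as in the iostat output later):
Code:
zfs set atime=off vol1
zfs set sync=always vol1
zfs set recordsize=128K vol1	# also tested with recordsize=16K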

Both boxes are connected via DAC. iperf testing shows that the boxes can talk at nearly 10 Gb/s in either direction. I've tried with and without jumbo frames.
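
(The iperf check is just the stock client/server pair; something like the following, with the IP as a placeholder:)
Code:
# on the FreeNAS box
iperf -s
# on a client on the same 10GbE segment
iperf -c 192.168.1.10 -t 30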

On the test VM, I added a 40 GB thin-provisioned disk that's served from the FreeNAS box via NFS. When I run various disk tests (Iometer, CrystalDiskMark), the sequential write tests on this disk seem to be capped at around 250 MB/s even though the FreeNAS filesystem can easily exceed that.

When I check the FreeNAS graphs, I don't see any CPU, memory or disk bottlenecks. I've run zpool iostat -v 1 on the FreeNAS box and I see the writes evenly spread across the striped SLOG and striped data disks.

The most curious thing is that the network graphs on ESXi and FreeNAS during the write tests show a nearly flat ceiling at 2 Gb/s.

Is there something in ESXi or FreeNAS that limits/caps the NFS write performance? Maybe a setting that defaults to a single stream/thread/connection that needs to be increased?
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
One additional detail --

Sequential reads on the same VM from the same NFS disk are ~930 MB/s. These operations appear to fully use the 10GbE bandwidth. Not sure why sequential writes would be so much slower than the pool can support.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Try turning off sync and see what happens.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Setting sync=disabled returned results in excess of 1000 MB/s, basically saturating the 10GbE link. I think that clearly rules out the VMware side.

That leads me back to the "SLOG is the bottleneck" theory. But oddly enough, if I add more drives to the SLOG (in a stripe), it doesn't help. Check these numbers out:

3 pool drives, 0 SLOG drive = 17 MB/s
3 pool drives, 1 SLOG drive = 274 MB/s
3 pool drives, 2 SLOG drives = 276 MB/s
3 pool drives, 3 SLOG drives = 267 MB/s

I would think that the SLOG activity would be distributed across the SLOG drives, and for the most part it is. While running my tests, I ran "zpool iostat -v 1 | grep gpt". Please see below for representative samples from each scenario (with headers added back in for clarity).

What am I doing wrong on the FreeNAS side?

3 pool drives, 0 SLOG drives


capacity operations bandwidth
pool alloc free read write read write
-------------------------------------------- ----- ----- ----- ----- ----- -----
vol1
gptid/0c749a10-3d1c-11e7-9503-3cfdfe9eff20 2.29G 942G 0 117 0 5.97M
gptid/0ca71f8f-3d1c-11e7-9503-3cfdfe9eff20 2.33G 942G 0 114 0 5.72M
gptid/0cd63e98-3d1c-11e7-9503-3cfdfe9eff20 2.33G 942G 0 118 0 5.84M


3 pool drives, 1 SLOG drive

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------------- ----- ----- ----- ----- ----- -----
vol1
gptid/0c749a10-3d1c-11e7-9503-3cfdfe9eff20 6.70G 937G 0 846 0 84.8M
gptid/0ca71f8f-3d1c-11e7-9503-3cfdfe9eff20 6.74G 937G 0 839 0 84.6M
gptid/0cd63e98-3d1c-11e7-9503-3cfdfe9eff20 6.72G 937G 0 843 0 84.6M
logs
gptid/7999da57-3da8-11e7-8eec-3cfdfe9eff20 126M 15.8G 0 4.17K 0 275M


3 pool drives, 2 SLOG drives

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------------- ----- ----- ----- ----- ----- -----
vol1
gptid/0c749a10-3d1c-11e7-9503-3cfdfe9eff20 3.33G 941G 0 1.03K 0 107M
gptid/0ca71f8f-3d1c-11e7-9503-3cfdfe9eff20 3.37G 941G 0 1.02K 0 107M
gptid/0cd63e98-3d1c-11e7-9503-3cfdfe9eff20 3.36G 941G 0 1.02K 0 106M
logs
gptid/7999da57-3da8-11e7-8eec-3cfdfe9eff20 63.0M 15.8G 0 2.08K 0 8.31M
gptid/1e60783d-3da9-11e7-8eec-3cfdfe9eff20 63.1M 15.8G 0 2.08K 0 266M


3 pool drives, 3 SLOG drives

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------------- ----- ----- ----- ----- ----- -----
vol1
gptid/0c749a10-3d1c-11e7-9503-3cfdfe9eff20 2.87G 941G 0 847 0 85.7M
gptid/0ca71f8f-3d1c-11e7-9503-3cfdfe9eff20 2.91G 941G 0 847 0 85.1M
gptid/0cd63e98-3d1c-11e7-9503-3cfdfe9eff20 2.90G 941G 0 849 0 85.7M
logs
gptid/7999da57-3da8-11e7-8eec-3cfdfe9eff20 42.1M 15.8G 0 1.33K 0 87.7M
gptid/1e60783d-3da9-11e7-8eec-3cfdfe9eff20 42M 15.8G 0 1.33K 0 87.7M
gptid/e8ee0a62-3da9-11e7-8eec-3cfdfe9eff20 42.0M 15.8G 0 1.33K 0 87.7M


Many thanks in advance!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Funnily enough, your SLOG is running at SATA2 speeds. How are the drives connected to the mobo?

Maybe you should consider a PCIe NVMe SLOG?

Disabling sync and your problem going away shows it's the SLOG. I'm concerned that your SLOG is bottlenecked somehow.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are the SLOG drives attached to the PCH? If so, that sounds like the culprit. Cramming all sorts of stuff through a measly 4x PCIe 2.0 lanes is not fun.
Maybe you should consider a PCIe NVMe SLOG?
+1.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Funnily enough, your SLOG is running at SATA2 speeds. How are the drives connected to the mobo?

2 of the SLOG drives are connected to the onboard 6 Gbps ports. The other one plus the 3 data drives are connected to an LSI 9305-16i. The OS indicates all 6 drives are talking at 6 Gbps.

Maybe you should consider a PCIe NVMe SLOG?

Unfortunately, my mobo has limited PCIe slots. Plus, I have a strong preference for hot-swap bays so I can replace/rearrange drives without needing to pull the box out of the closet and crack it open. :)

Disabling sync and your problem going away shows it's the SLOG. I'm concerned that your SLOG is bottlenecked somehow.

Me too. But I can't understand why going from 1 SLOG to 2 SLOGs would effectively make no difference.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
My guess is that it's a latency issue more than it is a bandwidth issue. Can you try moving the drives to the SAS controller?
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
My guess is that it's a latency issue more than it is a bandwidth issue. Can you try moving the drives to the SAS controller?

I had assumed the onboard ports would be slightly faster, but maybe the LSI controller is more solid?

Unfortunately, I only have 1 free SFF-8643 to 4x SATA cable at the moment so I can only hook up 4 drives to the SAS controller.

I tried 3 scenarios with all drives connected to the 9305-16i:

1 pool drive, 1 SLOG drive = 248 MB/s
2 pool drives, 2 SLOG drives = 244 MB/s
1 pool drive, 3 SLOG drives = 245 MB/s

Here is some representative output from "zpool iostat -v 1 | grep gpt":
Code:
1 pool drive, 1 SLOG drive		248.45 MB/s
gptid/954bc975-3dd8-11e7-b81e-3cfdfe9eff20  3.39G   941G	  0  2.23K	  0   258M
gptid/958184da-3dd8-11e7-b81e-3cfdfe9eff20   126M  15.8G	  0  3.69K	  0   244M

2 pool drives, 2 SLOG drives	243.89 MB/s
gptid/7f8f3b9c-3dd9-11e7-b81e-3cfdfe9eff20  1.35G   943G	  0  1.22K	  0   130M
gptid/7fc1bbc0-3dd9-11e7-b81e-3cfdfe9eff20  1.35G   943G	  0  1.22K	  0   130M
gptid/7ffa5a56-3dd9-11e7-b81e-3cfdfe9eff20  63.4M  15.8G	  0  1.83K	  0   234M
gptid/923e89e0-3dd9-11e7-b81e-3cfdfe9eff20  63.1M  15.8G	  0  1.83K	  0  7.30M

1 pool drive, 3 SLOG drives		245.74 MB/s
gptid/295123a4-3ddb-11e7-b81e-3cfdfe9eff20  2.03G   942G	  0  2.32K	  0   256M
gptid/2989fa42-3ddb-11e7-b81e-3cfdfe9eff20  42.1M  15.8G	  0  1.22K	  0  80.2M
gptid/3c889490-3ddb-11e7-b81e-3cfdfe9eff20  42.1M  15.8G	  0  1.22K	  0  80.4M
gptid/49a7e7af-3ddb-11e7-b81e-3cfdfe9eff20  42.1M  15.8G	  0  1.22K	  0  80.2M
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Sorry if I missed it, but how exactly are you testing this sequential write? If the writes don't have sufficient queue depth, sync=always may limit throughput due to latency and make multiple SLOG devices useless. Multiple SLOG devices only make sense if you have more data in flight at once than a single SLOG device can handle in one request (>128KB).
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Sorry if I missed it, but how exactly are you testing this sequential write? If the writes don't have sufficient queue depth, sync=always may limit throughput due to latency and make multiple SLOG devices useless. Multiple SLOG devices only make sense if you have more data in flight at once than a single SLOG device can handle in one request (>128KB).

I have a single VM with its main drive locally on the ESXi host and this test drive over NFS. On the VM, I run Iometer with a 2 MB seq write QD1 test for 15 seconds.

What's strange is that with 1 pool drive and 1 SLOG, I see both pushing roughly 250 MB/sec, so I know each drive can move that much data. When I add one more of each, I'm surprised to see the drives slow down, with each pair effectively splitting the 250 MB/sec instead of going faster.

With regard to queue depth, I covered that in earlier Iometer tests ranging from QD1 to QD32 (increasing in factors of 2). In each case, the results were in the 250-270 MB/s range.
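
(If anyone wants to reproduce this without Iometer, a roughly equivalent run with fio inside the guest would look something like the following; fio being installed and the target path are assumptions on my part:)
Code:
# 2 MB sequential writes at queue depth 1 for 15 seconds against the NFS-backed disk
fio --name=seqwrite --filename=testfile --size=8g --rw=write --bs=2m --iodepth=1 --runtime=15 --time_based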
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
I don't know if this helps but I did some additional testing on the FreeNAS box with NFS out of the equation.

I temporarily created a 1.5GB RAM disk and placed a 1GB random file on it using the following commands on the FreeNAS box:
Code:
mdconfig -a -t swap -s 1500m -u 1
newfs -U md1
mkdir /mnt/md1
mount /dev/md1 /mnt/md1
dd if=/dev/random of=/mnt/md1/rand1G bs=1024k count=1024

I did this to create a "source" that would be faster than my destination ZFS pool and because pulling from /dev/random on the fly is too slow.

Using this source, I ran dd on the FreeNAS box to copy the file from the RAM disk to a 1-data-drive / 3-SLOG pool with varying output block sizes:
Code:
dd if=/mnt/md1/rand1G of=/mnt/vol1/rand1G ibs=1024k obs={blocksize}

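Roughly, the whole sweep can be scripted like this (just a sketch of one way to do it; dd prints its throughput summary on stderr):
Code:
#!/bin/sh
# sweep dd output block sizes against the sync=always pool (paths as above)
for obs in 1024k 512k 256k 128k 64k 32k 16k; do
	echo "obs=${obs}"
	dd if=/mnt/md1/rand1G of=/mnt/vol1/rand1G ibs=1024k obs=${obs} 2>&1 | tail -1
done
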
Here are the results:

obs=1024k: 694 MB/s
obs=512k: 537 MB/s
obs=256k: 409 MB/s
obs=128k: 247 MB/s
obs=64k: 189 MB/s
obs=32k: 127 MB/s
obs=16k: 77 MB/s

That 247 MB/s looks awfully familiar. However, my understanding is that ESXi's NFS uses 64KB transfer sizes and that it's not changeable. I'm not sure how to verify this.

I kept an eye on the FreeNAS graphs during this test and the CPU, memory and disks weren't being pushed anywhere near their limits. I've also toyed around with different ZFS recordsize values but it didn't make much of a difference.

This doesn't explain why using 2 or 3 SLOGs isn't better than 1. But it does seem to show that the pool performs better with larger blocks (not completely surprising). From this, I'm assuming that whatever is happening under the hood during my Iometer tests is feeding the pool blocks on the smaller end of the spectrum even though lots of data is coming through. Is there a way to tweak this?
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
I've renamed this thread since this is clearly not NFS related.

Instead, it seems that when using sync=always with smaller writes, the SLOG mechanism becomes the bottleneck, and upgrading to a multi-drive striped SLOG doesn't help.

Does anyone have any ideas? I really need to finish the system build in the next few days and am totally stuck on this issue.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
ESXi's NFS uses 64KB transfer sizes
That is not true. ESXi negotiates that value with the NFS server. Old FreeNAS versions limited it to 64KB, but starting around 9.10 it is 128KB, so your observation about 247 MB/s has some grounding.

This doesn't explain why the using 2 or 3 SLOGs isn't better than 1.
For the same reason nine women can't make a baby in a month -- latency vs. throughput. ZFS can benefit from multiple SLOG devices only when a single transaction is larger than a single log block (128KB), and something, probably somewhere in the NFS stack, does not allow it to get one. I've noticed this problem recently too, but haven't investigated it yet; I was just about to create a ticket to track it.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Thanks for your response... I really appreciate it, especially since I realize you're quite the guru. :)
That is not true. ESXi negotiates that value with NFS server. Old FreeNAS versions had this limited at 64KB, while starting from about some 9.10 it is 128KB, so you observation about 247MB/s has some ground.

I'm a bit confused about this one. The (admittedly old) VMware white paper accessible here says:

"The rsize and wsize parameters, defined at mount time, define the I/O chunk transfer sizes between the host and the target. These sizes are not tunable and are hard set to 64KB."

That said, when I've used ESXTOP to watch the 10GbE link that's dedicated for NFS, I've generally seen the PSZTX (Average Packet Size Tx) variable in the neighborhood of 32K during my tests. It's possible that's not a fair test but I didn't know of any other way to check.

Is there a way to check on the FreeNAS side what the packet size really is?
For the same reason why 9 women can't make a baby in a month -- latency v/s throughput. ZFS can benefit from multiple SLOG devices only when single transaction size is above single log block (128KB), and something, probably somewhere in NFS stack does not allow it get one. I've noticed this problem recently too, but haven't investigated yet, was just going to create a ticket to track it.

I'm not sure I'm completely following. Are you saying that NFS isn't able to feed the data to the pool fast enough to take advantage of the extra SLOG?

Also, what about my tests that removed NFS from the equation and simply piped a bunch of data from a RAM disk to the same pool?
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
One more thought...
ZFS can benefit from multiple SLOG devices only when single transaction size is above single log block (128KB)
By "log block" are you referring to the ZFS recordsize?

If so, I tried changing that from 128K to 16K and re-ran the "dd" tests with varying output block sizes. The results were slightly worse for 128K and smaller block sizes and significantly worse for the larger block sizes.

obs=1024k: 551 MB/s
obs=512k: 489 MB/s
obs=256k: 384 MB/s
obs=128k: 240 MB/s
obs=64k: 181 MB/s
obs=32k: 124 MB/s
obs=16k: 76 MB/s

PS: All of the drives are now connected to the LSI card.

Many thanks in advance!
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Is there a way to check on the FreeNAS side what the packet size really is?
You can use `nfsstat -s 1` to see the number of requests per second and divide the throughput by that number.
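
For example (numbers purely illustrative): at 250 MB/s and about 2,000 write RPCs per second, that works out to roughly 128 KB per request.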

I'm not sure I'm completely following. Are you saying that NFS isn't able to feed the data to the pool fast enough to take advantage of the extra SLOG?
The question is not about speed, but about the ability to pipeline/queue multiple requests, and about their synchronicity policy. Theoretically, NFS is able to queue multiple sync requests at the same time so that the ZFS log can aggregate them, but it seems not everything there works as it should.

Also, what about my tests that removed NFS from the equation and simply piped a bunch of data from a RAM disk to the same pool?
Your test showed exactly what it should: the bigger the write size, the better the throughput, gradually approaching the peak SLOG device throughput, since bigger write sizes compensate for the per-operation cache flush latency implied by sync=always. Or did you have more than one SLOG device there?
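
Purely as an illustration (the latency number is made up): if each log write is capped at 128 KB and each sync commit costs about 0.5 ms of device/flush latency, a queue-depth-1 stream tops out around 128 KB / 0.5 ms ≈ 256 MB/s, which is in the same ballpark as the numbers above.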

By "log block" are you referring to the ZFS recordsize?
No. ZFS ZIL blocks are not related to specific dataset blocks. The ZIL has its own hardcoded block sizes, with a maximum of 128KB.
 

xyzzy

Explorer
Joined
Jan 26, 2016
Messages
76
Thank you very much mav@.

I ran the test with nfsstat and came up with numbers right around 133KB, so it looks like ESXi and NFS are in fact using 128KB.

So is there anything else I can try to fix this?

Is there a chance the "NFS not queuing multiple requests" issue is fixed, or at least improved, in 9.10.2 U4 or 11 RC3?

Is iSCSI able to take advantage of multiple SLOG devices?

Would a PCIe NVMe SLOG really help or would it also be limited due to this issue?
 