Performance of sync writes


Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Hi
I am benchmarking my system in preparation for ESXi. I have very little understanding of the internal mechanisms of ZFS, so I would like to hear your thoughts.
Here are the details of the test pool:
Code:
root@freenas:/mnt/m8 # zpool list -v m8
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
m8									  14.5T   447G  14.1T		 -	 0%	 3%  1.00x  ONLINE  /mnt
  mirror								3.62T   109G  3.52T		 -	 0%	 2%
	gptid/1956925c-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
	gptid/1a38d15c-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
  mirror								3.62T   113G  3.51T		 -	 0%	 3%
	gptid/1b4b8875-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
	gptid/18050e27-c9f1-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
  mirror								3.62T   113G  3.51T		 -	 0%	 3%
	gptid/1d5c5df5-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
	gptid/1e5ee446-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
  mirror								3.62T   111G  3.52T		 -	 0%	 2%
	gptid/3cdb7a3e-c9f1-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
	gptid/6cbeef4b-c9e5-11e8-b164-ecf4bbcb7e90	  -	  -	  -		 -	  -	  -
log										 -	  -	  -		 -	  -	  -
  nvd0									40G   128K  40.0G		 -	 0%	 0%
root@freenas:/mnt/m8 # zfs get all m8
NAME  PROPERTY			  VALUE				  SOURCE
m8	type				  filesystem			 -
m8	creation			  Sat Oct  6 20:57 2018  -
m8	used				  447G				   -
m8	available			 13.6T				  -
m8	referenced			447G				   -
m8	compressratio		 1.00x				  -
m8	mounted			   yes					-
m8	quota				 none				   local
m8	reservation		   none				   local
m8	recordsize			16K					local
m8	mountpoint			/mnt/m8				default
m8	sharenfs			  off					default
m8	checksum			  on					 default
m8	compression		   off					local
m8	atime				 on					 default
m8	devices			   on					 default
m8	exec				  on					 default
m8	setuid				on					 default
m8	readonly			  off					default
m8	jailed				off					default
m8	snapdir			   hidden				 default
m8	aclmode			   passthrough			local
m8	aclinherit			passthrough			local
m8	canmount			  on					 default
m8	xattr				 off					temporary
m8	copies				1					  default
m8	version			   5					  -
m8	utf8only			  off					-
m8	normalization		 none				   -
m8	casesensitivity	   sensitive			  -
m8	vscan				 off					default
m8	nbmand				off					default
m8	refquota			  none				   local
m8	refreservation		none				   local
m8	primarycache		  all					default
m8	secondarycache		all					default
m8	usedbysnapshots	   0					  -
m8	usedbydataset		 447G				   -
m8	usedbychildren		37.5M				  -
m8	usedbyrefreservation  0					  -
m8	logbias			   latency				default
m8	dedup				 off					default
m8	mlslabel									 -
m8	sync				  always				 local
m8	refcompressratio	  1.00x				  -
m8	written			   447G				   -
m8	logicalused		   447G				   -
m8	logicalreferenced	 447G				   -
m8	volmode			   default				default
m8	filesystem_limit	  none				   default
m8	snapshot_limit		none				   default
m8	filesystem_count	  none				   default
m8	snapshot_count		none				   default
m8	redundant_metadata	all					default


Key points: 4 mirrored vdevs, low utilization and fragmentation, recordsize=16K, sync=always, compression off
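For anyone reproducing this, the non-default properties above could be set along these lines (a sketch; m8 is the pool/dataset from the listing):
Code:
zfs set recordsize=16K m8
zfs set sync=always m8
zfs set compression=off m8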

Code:
root@freenas:/mnt/m8 # iozone -a -s 512M -O
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Tue Oct  9 09:50:10 2018

		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	 9281	 9510   162554   340426   250513	 8166   135176	 10920	160345	10034	 9485   147176   315245
		  524288	   8	 8563	 8531   110181   260793   214246	 7899	98585	 10013	 97223	 8552	 8609   103698   155445
		  524288	  16	 6614	 6526	83307	87849   152698	 7025	76399	  8758	 76104	 6624	 6651	71240	77383
		  524288	  32	 4552	 4337	52874	62135   107487	 4575	52359	  5898	 51267	 4313	 4279	45693	48314
		  524288	  64	 2717	 2703	29693	30646	34548	 2801	29705	  3497	 29035	 2784	 2717	22968	25298
		  524288	 128	 1666	 1529	16272	16223	16586	 1698	16168	  1869	 15943	 1539	 1565	11979	12701
		  524288	 256	 1013	 1006	 8422	 8435	16472	 1186	 8359	  1284	  8376	 1032	 1019	 6034	 6038
		  524288	 512	  587	  593	 4388	 4411	 7132	  724	 4493	  1424	  4473	  611	  609	 3048	 3095
		  524288	1024	  318	  315	 2253	 2245	 2348	  466	 2237	   427	  2136	  326	  327	 1538	 1571
		  524288	2048	  173	  168	 1136	 1128	 1741	  188	 1174	   241	  1150	  170	  167	  753	  801
		  524288	4096	  114	   85	  559	  583	  637	   90	  581	   128	   642	   88	   85	  398	  452
		  524288	8192	   43	   42	  248	  261	  282	   42	  285		57	   274	   43	   43	  148	  168
		  524288   16384	   21	   21	  109	  128	  136	   23	  109		26	   108	   26	   21	   58	   71




Now with the SLOG in play, I was expecting the write IOPS to approach that of the SLOG itself, especially with a small (512M) test file, but I only get ~10K IOPS. While that is a vast improvement over running without a SLOG (~150 IOPS), I was hoping for something like 30-40K IOPS. Is there anything I can do short of throwing in more hardware?
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Since we had lots of fun with NUMA nodes in the other thread - which CPU is managing the PCIe slot that your P3700 is in? ;)
Good question, but the answer is I don't know, short of digging up the manual for this info. However, I did cpuset -l 0-8 and cpuset -l 30-38 for the iozone runs, and they all yielded about 10K IOPS for 4K writes. While ZFS itself may be running on any node, I don't think there is anything I can do about that.
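For reference, the pinning was done roughly like this (a sketch of FreeBSD cpuset usage with the ranges mentioned above):
Code:
# Pin the benchmark to CPUs 0-8, then repeat on 30-38 and compare
cpuset -l 0-8 iozone -a -s 512M -O
cpuset -l 30-38 iozone -a -s 512M -O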
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thinking a bit more about it: although the iozone benchmark can be pinned to specific thread(s), the internal ZFS threads will go wherever they see fit, and might not run where you want/expect them to (although I assume they'd trend lower).

Looking at the Tech Guide for your R620, page 30:
https://i.dell.com/sites/content/sh...ments/dell-poweredge-r620-technical-guide.pdf

For your config (3 slots, 2 procs), slots 1 and 2 appear to be pegged to processor 2; slot 3 is for processor 1 (and I'm assuming the H310 mini as well?).
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Thinking a bit more about it: although the iozone benchmark can be pinned to specific thread(s), the internal ZFS threads will go wherever they see fit, and might not run where you want/expect them to (although I assume they'd trend lower).

Looking at the Tech Guide for your R620, page 30:
https://i.dell.com/sites/content/sh...ments/dell-poweredge-r620-technical-guide.pdf

For your config (3 slots, 2 procs), slots 1 and 2 appear to be pegged to processor 2; slot 3 is for processor 1 (and I'm assuming the H310 mini as well?).
Yeah, I guess the best thing I can do here is to put the HBA and SLOG on the same processor? Will try that when I get home.

Slightly off topic: why is latency, rather than IOPS, the most important metric for a SLOG? Maybe because command queuing does not work well here? If so, why?
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Thinking a bit more about it: although the iozone benchmark can be pinned to specific thread(s), the internal ZFS threads will go wherever they see fit, and might not run where you want/expect them to (although I assume they'd trend lower).

Looking at the Tech Guide for your R620, page 30:
https://i.dell.com/sites/content/sh...ments/dell-poweredge-r620-technical-guide.pdf

For your config (3 slots, 2 procs), slots 1 and 2 appear to be pegged to processor 2; slot 3 is for processor 1 (and I'm assuming the H310 mini as well?).
Tried moving the HBA to the middle riser, which should be wired to the same CPU as the P3700 (right slot).

Code:
root@freenas:/mnt/m8 # iozone -a -s 512M -O -r 4K
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Tue Oct  9 18:42:01 2018

		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Record Size 4 kB
		Command line used: iozone -a -s 512M -O -r 4K
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	10472	 9483	78355   112765   112381	 5589	78246	 12121	 76743	 9863	 9545	86083	80625

iozone test complete.


Write performance does not change much, but reads dropped quite a bit, probably because the ARC was emptied by the reboot.

However, bumping up the thread count does increase IOPS, though not linearly:
Code:
root@freenas:/mnt/m8 # iozone -i 0 -s 512M -O -r 4K -t 1
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Tue Oct  9 18:46:38 2018

		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Record Size 4 kB
		Command line used: iozone -i 0 -s 512M -O -r 4K -t 1
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
		Throughput test with 1 process
		Each process writes a 524288 kByte file in 4 kByte records

		Children see throughput for  1 initial writers  =   10504.46 ops/sec
		Parent sees throughput for  1 initial writers   =   10502.78 ops/sec
		Min throughput per process					  =   10504.46 ops/sec
		Max throughput per process					  =   10504.46 ops/sec
		Avg throughput per process					  =   10504.46 ops/sec
		Min xfer										=  131072.00 ops

		Children see throughput for  1 rewriters		=	9518.84 ops/sec
		Parent sees throughput for  1 rewriters		 =	9517.45 ops/sec
		Min throughput per process					  =	9518.84 ops/sec
		Max throughput per process					  =	9518.84 ops/sec
		Avg throughput per process					  =	9518.84 ops/sec
		Min xfer										=  131072.00 ops



iozone test complete.
root@freenas:/mnt/m8 # iozone -i 0 -s 512M -O -r 4K -t 4
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Tue Oct  9 18:47:19 2018

		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Record Size 4 kB
		Command line used: iozone -i 0 -s 512M -O -r 4K -t 4
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
		Throughput test with 4 processes
		Each process writes a 524288 kByte file in 4 kByte records

		Children see throughput for  4 initial writers  =   20337.84 ops/sec
		Parent sees throughput for  4 initial writers   =   20332.72 ops/sec
		Min throughput per process					  =	5083.84 ops/sec
		Max throughput per process					  =	5085.06 ops/sec
		Avg throughput per process					  =	5084.46 ops/sec
		Min xfer										=  131041.00 ops

		Children see throughput for  4 rewriters		=   19845.50 ops/sec
		Parent sees throughput for  4 rewriters		 =   19843.74 ops/sec
		Min throughput per process					  =	4959.08 ops/sec
		Max throughput per process					  =	4962.51 ops/sec
		Avg throughput per process					  =	4961.38 ops/sec
		Min xfer										=  130982.00 ops



iozone test complete.
root@freenas:/mnt/m8 # iozone -i 0 -s 512M -O -r 4K -t 16
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Tue Oct  9 18:50:59 2018

		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Record Size 4 kB
		Command line used: iozone -i 0 -s 512M -O -r 4K -t 16
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
		Throughput test with 16 processes
		Each process writes a 524288 kByte file in 4 kByte records

		Children see throughput for 16 initial writers  =   35383.53 ops/sec
		Parent sees throughput for 16 initial writers   =   35378.10 ops/sec
		Min throughput per process					  =	2211.03 ops/sec
		Max throughput per process					  =	2211.61 ops/sec
		Avg throughput per process					  =	2211.47 ops/sec
		Min xfer										=  131038.00 ops

		Children see throughput for 16 rewriters		=   35048.79 ops/sec
		Parent sees throughput for 16 rewriters		 =   35046.80 ops/sec
		Min throughput per process					  =	2190.31 ops/sec
		Max throughput per process					  =	2190.78 ops/sec
		Avg throughput per process					  =	2190.55 ops/sec
		Min xfer										=  131045.00 ops



iozone test complete.



I guess the apparent performance will depend on how NFS/iSCSI queue up I/Os.
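For the record, the scaling from the three runs above works out to:
Code:
#  1 thread : 10504 ops/s -> baseline
#  4 threads: 20338 ops/s -> 1.94x aggregate (~48% per-thread efficiency)
# 16 threads: 35384 ops/s -> 3.37x aggregate (~21% per-thread efficiency)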
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Slightly off topic: why is latency, rather than IOPS, the most important metric for a SLOG? Maybe because command queuing does not work well here? If so, why?

Latency and IOPS are two sides of the same coin - the higher your latency, the longer it takes for an IO to complete, and the fewer you get per second.

Queueing doesn't work with SLOG devices because the nature of a sync write is "take this, put it on stable storage, and I'm going to stand right here and wait until you do."
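As a back-of-the-envelope illustration (invented latencies, purely for the arithmetic): at queue depth 1, IOPS is capped at 1 / per-write latency, which is consistent with the ~10K figure above:
Code:
# QD=1: each sync write must be acknowledged before the next is issued
# 100 us per round trip -> 1 / 0.000100 = 10,000 IOPS
#  25 us per round trip -> 1 / 0.000025 = 40,000 IOPS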
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Latency and IOPS are two sides of the same coin - the higher your latency, the longer it takes for an IO to complete, and the fewer you get per second.

Queueing doesn't work with SLOG devices because the nature of a sync write is "take this, put it on stable storage, and I'm going to stand right here and wait until you do."
That is very true if only a single thread is doing I/O. It would be hard to imagine, for example, that ESXi only updates one part of an iSCSI target at a time. I might be wrong, but this seems very inefficient.


Back to topic, am I seeing the performance I should be expecting? Or is my benchmarking method wrong?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That is very true if only a single thread is doing I/O. It would be hard to imagine, for example, that ESXi only updates one part of an iSCSI target at a time. I might be wrong, but this seems very inefficient.

Multiple threads are sending writes to ZFS, and multiple threads will commit a transaction group to a pool; but as far as writing into the transaction groups, the timing needs to be incredibly tightly controlled for integrity reasons; basically, when you get a "write" command, there's no way (quantum computing aside) to know whether or not the very next command you receive is "read back that address I just wrote to."

Back to topic, am I seeing the performance I should be expecting? Or is my benchmarking method wrong?

You're not seeing the performance I'd expect from a P3700 at all; it's barely twice my S3700. But I can't say that your method is wrong.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Multiple threads are sending writes to ZFS, and multiple threads will commit a transaction group to a pool; but as far as writing into the transaction groups, the timing needs to be incredibly tightly controlled for integrity reasons; basically, when you get a "write" command, there's no way (quantum computing aside) to know whether or not the very next command you receive is "read back that address I just wrote to."

You're not seeing the performance I'd expect from a P3700 at all; it's barely twice my S3700. But I can't say that your method is wrong.

You are confusing me. I thought that for sync writes, the ZIL is written to the SLOG, and as soon as the ZIL record is committed, ZFS reports back to the application that the write is complete. Meanwhile, the writes are grouped into a TXG in RAM, waiting to be flushed to the pool.
For purely illustrative purposes: say ZFS receives a sync write request from one thread at time 0, prepares the ZIL record, and sends it to the SLOG. At time 0+1/∞, another thread sends another sync write request, and ZFS prepares another ZIL record for the SLOG to write. Our SLOG has HIGH latency, so it completes the first ZIL write after 1 second (t=1s) and the second at t=1s+1/∞; ZFS can now report that both writes are complete. But if the pathetic SLOG also had a QD of 1, then rather than t=1s+1/∞ it would be t=2s+1/∞ before the second write completes (from the application's point of view), and the IOPS would drop from 2 to 1. I guess I am missing something, I just don't know what it is.


Academic discussion aside, is there any way to rule out QPI latency? I heard ESXi lets you pin certain cores to a VM?
The strange thing is that the diskinfo -wS numbers are not way off (although a little lower than other people's P3700 results), yet during the test the P3700 was only about 10% busy. It almost looks like ZFS doesn't want to tax the SLOG.
I will play with the PCIe slots a few more times. Meanwhile, if you could also share the result of iozone -a -s 512M -O on your pool, that would help us rule out iozone being the culprit.
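(For anyone who wants to run the same device-level checks, a sketch, assuming the P3700 appears as nvd0 as in the pool listing above:)
Code:
# Raw synchronous-write latency test; -w is a destructive write test,
# so it is best run before the device is added to the pool as SLOG
diskinfo -wS /dev/nvd0
# Watch per-device busy% while a benchmark runs in another session
gstat -f nvd0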
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
You are confusing me. I thought that for sync writes, the ZIL is written to the SLOG, and as soon as the ZIL record is committed, ZFS reports back to the application that the write is complete. Meanwhile, the writes are grouped into a TXG in RAM, waiting to be flushed to the pool.
For purely illustrative purposes: say ZFS receives a sync write request from one thread at time 0, prepares the ZIL record, and sends it to the SLOG. At time 0+1/∞, another thread sends another sync write request, and ZFS prepares another ZIL record for the SLOG to write. Our SLOG has HIGH latency, so it completes the first ZIL write after 1 second (t=1s) and the second at t=1s+1/∞; ZFS can now report that both writes are complete. But if the pathetic SLOG also had a QD of 1, then rather than t=1s+1/∞ it would be t=2s+1/∞ before the second write completes (from the application's point of view), and the IOPS would drop from 2 to 1. I guess I am missing something, I just don't know what it is.

Basically, yes. The issue is the latency through the network stack (incoming at the server) + the storage stack (why NVMe is superior to AHCI, for example) + the I/O hardware latency + the device latency + the network stack again (outgoing). (Simple model.) That is the minimum time for the completion of a sync write, i.e. your 1s in the first case. And we know there is a maximum parallelism that can be achieved sending data to the device. Beyond that, a new transaction has to be generated, which at minimum has to deal with the device latency again. So yes, the biggest issue is the 1s part you mention above, which is the minimum latency for every I/O that can be batched into the first SLOG I/O. Make that 0.5s and obviously you go 2x as fast.

If you toss out the network part, on average the device is going to account for at least 75% of the access time for a write (I think) for NVMe. Obviously the device write access time for a given I/O will partly depend on the state of the on-device DRAM buffers at the exact time of the write.
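Purely as an illustration of that model (invented numbers):
Code:
# One fully-serialized sync write over the network (made-up budget):
#   network in         ~20 us
#   storage stack      ~10 us
#   SLOG device write  ~70 us   <- the dominant term, per the ~75% estimate
#   network out        ~20 us
#   total             ~120 us  -> ~8,300 sync writes/s at QD=1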


Academic discussion aside, is there any way to rule out QPI latency? I heard ESXi lets you pin certain cores to a VM?

Yes, it can do so.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Meanwhile, if you could also share the result of iozone -a -s 512M -O on your pool, that would help us rule out iozone being the culprit

Assuming you meant the S3700 pool I mentioned earlier, since that's the one I can lean on pretty freely.

Code:
[root@badger] /mnt/mushroom/iozone# iozone -a -s 512M -O
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Thu Oct 11 12:57:52 2018

		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	 5235	 5129   103443   108903   106571	 4802   111319	  5256	104961	 5119	 5128	97164   105415
		  524288	   8	 4576	 4576	99762   100960   102068	 4291   108605	  4649	102648	 4461	 4453	92961	94352
		  524288	  16	 3939	 3892	94943	95091	98498	 3868   101782	  4006	 95826	 3906	 3899	83577	82163
		  524288	  32	 2950	 2871	56781	56022	59768	 2916	60539	  3003	 58616	 2931	 2930	48976	49538
		  524288	  64	 1998	 1954	29899	31171	33854	 1954	34261	  2018	 33769	 1966	 1958	26224	26551
		  524288	 128	 1167	 1138	15279	15927	17538	 1132	17773	  1201	 17278	 1137	 1140	13230	12498
		  524288	 256	  681	  668	 8876	 8805	 9567	  659	 8876	   715	  8393	  665	  663	 6504	 6558
		  524288	 512	  362	  361	 4293	 3934	 4360	  357	 4296	   368	  4235	  359	  357	 3641	 3603
		  524288	1024	  171	  168	 1924	 2034	 2238	  170	 2376	   173	  2199	  169	  175	 1940	 1648
		  524288	2048	   92	   89	 1073	 1065	 1116	   90	 1074		91	  1075	   90	   90	  818	  874
		  524288	4096	   45	   44	  548	  495	  563	   44	  584		45	   530	   44	   44	  452	  413
		  524288	8192	   22	   21	  271	  272	  272	   21	  293		22	   253	   21	   21	  190	  187
		  524288   16384	   10	   10	  116	  123	  134	   10	  136		10	   123	   10	   10	   87	   91

iozone test complete.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Unrelated to the SLOG discussion, but consider turning off atime on the pool or dataset:

m8 atime on default
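That is, for the dataset above:
Code:
zfs set atime=off m8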
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Assuming you meant the S3700 pool I mentioned earlier, since that's the one I can lean on pretty freely.

Code:
[root@badger] /mnt/mushroom/iozone# iozone -a -s 512M -O
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Thu Oct 11 12:57:52 2018

		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	 5235	 5129   103443   108903   106571	 4802   111319	  5256	104961	 5119	 5128	97164   105415
		  524288	   8	 4576	 4576	99762   100960   102068	 4291   108605	  4649	102648	 4461	 4453	92961	94352
		  524288	  16	 3939	 3892	94943	95091	98498	 3868   101782	  4006	 95826	 3906	 3899	83577	82163
		  524288	  32	 2950	 2871	56781	56022	59768	 2916	60539	  3003	 58616	 2931	 2930	48976	49538
		  524288	  64	 1998	 1954	29899	31171	33854	 1954	34261	  2018	 33769	 1966	 1958	26224	26551
		  524288	 128	 1167	 1138	15279	15927	17538	 1132	17773	  1201	 17278	 1137	 1140	13230	12498
		  524288	 256	  681	  668	 8876	 8805	 9567	  659	 8876	   715	  8393	  665	  663	 6504	 6558
		  524288	 512	  362	  361	 4293	 3934	 4360	  357	 4296	   368	  4235	  359	  357	 3641	 3603
		  524288	1024	  171	  168	 1924	 2034	 2238	  170	 2376	   173	  2199	  169	  175	 1940	 1648
		  524288	2048	   92	   89	 1073	 1065	 1116	   90	 1074		91	  1075	   90	   90	  818	  874
		  524288	4096	   45	   44	  548	  495	  563	   44	  584		45	   530	   44	   44	  452	  413
		  524288	8192	   22	   21	  271	  272	  272	   21	  293		22	   253	   21	   21	  190	  187
		  524288   16384	   10	   10	  116	  123	  134	   10	  136		10	   123	   10	   10	   87	   91

iozone test complete.
Thanks, the numbers you have look to be in line with the diskinfo -wS benchmark on the S3700. Now I've got to figure out why mine aren't. Would you mind sharing your system specs here?

Is this FreeNAS machine a VM?
No, it's bare metal. However, I was thinking about booting into ESXi, creating a FreeNAS VM, pinning it to the CPU the P3700 is connected to, and running the test. That should help rule out QPI latency. However, I still have a hard time believing QPI has so big an impact on performance.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Would you mind sharing your system specs here?
Dell C2100, single Xeon X5650, 72GB RAM, H200 with LSI P20 FW, 12x Constellation ES SAS drives in mirrored pairs, Intel S3700 100GB as SLOG.

However, I still have a hard time believing QPI has so big an impact on performance.
QPI isn't a big issue for raw bandwidth situations, but the added latency could significantly impact the sync write throughput since it "waits" for completion every time.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Basically, yes. The issue is the latency through the Network stack (incoming at the server) + the storage stack (why NVMe is superior to ACHI for example) + the I/O hardware latency + the device latency + the network stack again (outgoing). (Simple model.) That is the min time for the completion of a sync write, i.e. your 1s in the first case. And we know there is a maximum parallelism that can be achieved sending data to the device. Beyond that a new transaction has to be generated which at minimum has to deal with the device latency. So yes, the biggest issue involved is the 1s part you mention above, which is the minimum latency for every I/O that can be batched in the first SLOG I/O. Make that 0.5s and obviously you go 2x as fast now.

If you toss out the network part, on average the device is going to account for at least 75% of the access time for a write (I think) for NVMe. Obviously the device write access time for a given i/o will in part depend on the state of the on device dram buffers at the exact time of the write.




Yes, it can do so.
Thanks for your detailed explanation.
Unrelated to the SLOG discussion, but consider turning off atime on the pool or dataset:

m8 atime on default
I tried this, but it had little to no impact on performance.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Dell C2100, single Xeon X5650, 72GB RAM, H200 with LSI P20 FW, 12x Constellation ES SAS drives in mirrored pairs, Intel S3700 100GB as SLOG.


QPI isn't a big issue for raw bandwidth situations, but the added latency could significantly impact the sync write throughput since it "waits" for completion every time.
Yeah, I guess I will have to find out. Any resources on how to pin a VM to the correct cores and local (NUMA) memory?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Edit your VM settings, look under CPU Options/Advanced (not sure what they're calling it these days) for the "CPU Scheduling Affinity" field.

Quick Googling turned up this document for 5.1; I believe it's pretty similar in newer versions

https://pubs.vmware.com/vsphere-51/index.jsp?topic=/com.vmware.vsphere.resmgmt.doc/GUID-F40F901D-C1A7-43E2-90AF-E6F98C960E4B.html

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Edit your VM settings, look under CPU Options/Advanced (not sure what they're calling it these days) for the "CPU Scheduling Affinity" field.

Quick Googling turned up this document for 5.1; I believe it's pretty similar in newer versions

https://pubs.vmware.com/vsphere-51/index.jsp?topic=/com.vmware.vsphere.resmgmt.doc/GUID-F40F901D-C1A7-43E2-90AF-E6F98C960E4B.html
Thanks, I will take a look.

I followed a guide to tune my BIOS settings to be latency-oriented, and it doubled the diskinfo -wS and iozone results. I will open another post to document my findings, but there is still a disparity between the two tests that I need to figure out.
 