Questions on vdev construction after pool re-creation, and performance

Status
Not open for further replies.

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Indeed, check out the differences in the dtrace results.
Code:
dtrace: script 'dirty.d' matched 2 probes
CPU	 ID					FUNCTION:NAME
 15  63083				 none:txg-syncing  101MB of 4096MB used


Code:
dtrace: script 'duration.d' matched 2 probes
CPU	 ID					FUNCTION:NAME
  6  63084				  none:txg-synced sync took 0.27 seconds


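For anyone who wants to reproduce these numbers: the two scripts aren't shown in the thread, but the output format matches the well-known write-throttle scripts from the OpenZFS dirty-data write-up, so they are most likely close to the following reconstruction (verify against your own copies before relying on them; run as `dtrace -s dirty.d <poolname>`):

```d
/* dirty.d -- print dirty data vs. zfs_dirty_data_max each time a txg syncs */
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

/* duration.d -- print how long each txg took to sync */
txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}
```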
I'll have to see what the network throughput is with lz4 compared to the throttled results with gzip-9.

Check out the difference in CPU load. It is more than I would have thought. The data I had copied to that point was only getting a 1.10 compression ratio. Even if lz4 compresses less, gzip-9 doesn't look like it is worth the CPU hit with the data I have in this pool.

"LZ4 is love, LZ4 is life."

Anything else you would suggest while I have a green field at the moment? I did set the VMware dataset to a 16K record size.

At this point, your SLOG device is the bottleneck (at least until your disks start to get fragmented). Since your disks are currently able to drain dirty data faster than your 900p SLOG can ingest it, increasing dirty_data_max won't have any impact.

If you want to start really messing with things, you can increase the max values of the per-vdev queues, but pushing them too far can impact latency. I don't know that I'd meddle with that on a production system.
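For reference, these are the knobs I mean, by their FreeBSD sysctl names (a sketch only; the values shown are illustrative, not recommendations, and exact names can vary by release):

```shell
# Per-vdev I/O scheduler queue depths -- raising them can hurt latency.
sysctl vfs.zfs.vdev.max_active              # overall cap per vdev
sysctl vfs.zfs.vdev.sync_write_max_active   # sync (SLOG-bound) writes
sysctl vfs.zfs.vdev.async_write_max_active  # txg sync writes

# Example only: allow more concurrent async writes per vdev
# sysctl vfs.zfs.vdev.async_write_max_active=15
```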
 
Joined
Dec 29, 2014
Messages
1,135
"LZ4 is love, LZ4 is life."
Duly noted!
At this point, your SLOG device is the bottleneck
I don't think it is using my SLOG at the moment. I have mounted the secondary FreeNAS via NFS, so the cpio is writing locally from the perspective of the primary FreeNAS. Also, zpool doesn't look like it is allocating anything in the SLOG.
Code:
zpool list -v RAIDZ2-I
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
RAIDZ2-I								14.5T   395G  14.1T		 -	 0%	 2%  1.00x  ONLINE  /mnt
  raidz2								7.25T   197G  7.06T		 -	 0%	 2%
	gptid/67a9a148-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/68893123-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/696903c2-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/6a501044-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/6b4526cb-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/6c34b281-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/6d271bd9-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/6e33d52c-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
  raidz2								7.25T   198G  7.06T		 -	 0%	 2%
	gptid/a1436a28-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a24a517e-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a3404858-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a43c8614-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a53a0b93-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a657fa7a-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a761f10f-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
	gptid/a8b3b2da-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -
log										 -	  -	  -		 -	  -	  -
  nvd0p1								15.9G	  0  15.9G		 -	 0%	 0%
cache									   -	  -	  -		 -	  -	  -
  nvd0p4								 213G  77.8G   135G		 -	 0%	36%
spare									   -	  -	  -		 -	  -	  -
  gptid/c01c4d23-de13-11e8-adca-e4c722848f30	  -	  -	  -		 -	  -	  -

It is allocating space in the L2ARC which I find puzzling.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Duly noted!

I don't think it is using my SLOG at the moment. I have mounted the secondary FreeNAS via NFS, so the cpio is writing locally from the perspective of the primary FreeNAS. Also, zpool doesn't look like it is allocating anything in the SLOG.
(zpool list -v output snipped; quoted in full in the post above)
It is allocating space in the L2ARC which I find puzzling.
zpool iostat -v should give you real-time I/O for each vdev if you are interested. If the dataset you are writing to is set to sync=always, the SLOG should be used regardless of the source.
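For anyone following along, a minimal way to watch this, using the pool name from this thread (the dataset name is illustrative; sync=always forces every write through the ZIL and thus the SLOG):

```shell
# Watch per-vdev I/O, refreshing every 5 seconds
zpool iostat -v RAIDZ2-I 5

# Force all writes on a dataset through the ZIL/SLOG.
# 'standard' (the default) honors client sync requests; 'disabled' ignores them.
zfs set sync=always RAIDZ2-I/vmware   # dataset name is hypothetical
zfs get sync RAIDZ2-I/vmware
```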
 
Joined
Dec 29, 2014
Messages
1,135
zpool iostat -v should give you real-time I/O for each vdev if you are interested. If the dataset you are writing to is set to sync=always, the SLOG should be used regardless of the source.
Here is what that shows:
Code:
zpool iostat -v RAIDZ2-I
										   capacity	 operations	bandwidth
pool									alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
RAIDZ2-I								 539G  14.0T	  0  1.56K	176   130M
  raidz2								 269G  6.99T	  0	797	  5  65.0M
	gptid/67a9a148-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 83  11.7M
	gptid/68893123-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 83  11.7M
	gptid/696903c2-de13-11e8-adca-e4c722848f30	  -	  -	  0	 86	 84  11.7M
	gptid/6a501044-de13-11e8-adca-e4c722848f30	  -	  -	  0	 86	 83  11.7M
	gptid/6b4526cb-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 83  11.7M
	gptid/6c34b281-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 84  11.7M
	gptid/6d271bd9-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 83  11.7M
	gptid/6e33d52c-de13-11e8-adca-e4c722848f30	  -	  -	  0	 87	 86  11.7M
  raidz2								 269G  6.99T	  0	824	 12  67.1M
	gptid/a1436a28-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 90  12.1M
	gptid/a24a517e-de13-11e8-adca-e4c722848f30	  -	  -	  0	 88	 86  12.1M
	gptid/a3404858-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 86  12.1M
	gptid/a43c8614-de13-11e8-adca-e4c722848f30	  -	  -	  0	 88	 87  12.1M
	gptid/a53a0b93-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 87  12.1M
	gptid/a657fa7a-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 86  12.1M
	gptid/a761f10f-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 86  12.1M
	gptid/a8b3b2da-de13-11e8-adca-e4c722848f30	  -	  -	  0	 89	 91  12.1M
logs										-	  -	  -	  -	  -	  -
  nvd0p1									0  15.9G	  0	  0	167   1010
cache									   -	  -	  -	  -	  -	  -
  nvd0p4								96.8G   116G	  0	 41	441  34.6M
--------------------------------------  -----  -----  -----  -----  -----  -----

It sure doesn't appear to be doing anything of substance with the SLOG. Sync is set to the default at the moment, which I assume means no writes are being forced to sync. iozone results to come later.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Duly noted!

I don't think it is using my SLOG at the moment. I have mounted the secondary FreeNAS via NFS, so the cpio is writing locally from the perspective of the primary FreeNAS. Also, zpool doesn't look like it is allocating anything in the SLOG.

I'm guessing you're right, then. In that case, it's either the network or the sending system that's the bottleneck now. Run zilstat from the command line to confirm that you're not hitting the SLOG.

Edit: I refreshed. Nope, you're not. The default (sync=standard) will only do sync writes if the client requests them, and I'm guessing cpio doesn't.

It is allocating space in the L2ARC which I find puzzling.

Hmm. Does cpio confirm or verify the written data somehow, perhaps? To oversimplify, L2ARC fills from "potential ARC evictions" so maybe you've managed to cycle a bunch of data through already.
 
Joined
Dec 29, 2014
Messages
1,135
Hmm. Does cpio confirm or verify the written data somehow, perhaps?
I don't actually know. As a crusty old Unix guy (System III/System V first), I tend to use cpio for copies. You can use tar for that too, but it can't handle device special files, where cpio can.
 
Joined
Dec 29, 2014
Messages
1,135
Here is the test with record size 16K, compression off, sync on.
Code:
		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	19241	19632   363252   365654   283936	15257   298374	 23661	276926	19977	19932   344145   339479
		  524288	   8	16384	16705   286439   287607   239296	14588   259848	 22381	239416	16798	17012   253231   252344
		  524288	  16	13179	13038   155426   160361   190874	13279   162836	 19380	171332	13266	13061   143856   131535
		  524288	  32	 8869	 8392	80073	83743	94841	 8706	88668	 12271	 96139	 8962	 8390	75013	72520
		  524288	  64	 5756	 5560	55685	55921	62456	 5519	26132	  6554	 36417	 5714	 5355	47024	53644
		  524288	 128	 3281	 3193	33060	15340	25378	 3024	15266	  3545	 18602	 3383	 3209	28994	12773
		  524288	 256	 1989	 1809	17208	17168	17727	 2079	13287	  2465	 15412	 1990	 2115	10841	10483
		  524288	 512	 1270	 1190	 9117	 8814	11654	 1219	 7508	  1533	  7492	 1246	 1294	 6724	 7452
		  524288	1024	  695	  662	 5171	 3226	 2347	  659	 4768	   882	  3495	  614	  716	 3488	 4022
		  524288	2048	  353	  318	 2264	 2266	 2299	  325	 1782	   422	  1891	  377	  295	 1589	 1641
		  524288	4096	  170	  161	  795	  822	  972	  174	 1173	   210	   494	  166	  184	  336	  451
		  524288	8192	   88	   75	  386	  396	  407	   93	  276		91	   208	   87	   79	  220	  220
		  524288   16384	   41	   47	  132	  127	  143	   41	  180		45	   115	   41	   43	   80	   78

iozone test complete.

Here is the test with record size 16K, compression off, sync off.
Code:
		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4   116380   127652   344527   352060   247833	99823   273595	165786	194703   128752   124925   276657   252708
		  524288	   8	88820	99770   254412   238929   228230   103992   182341	154352	160968   102290	99110   251027   267520
		  524288	  16	80644	68162   192377   188295   178194	85584   186921	152189	215805	63790	89941   161375   162325
		  524288	  32	43019	40148   117015   116020   117996	48504	59501	 82045	 86831	42988	48474   101147   113030
		  524288	  64	21802	23580	35398	48929	53824	21532	63386	 42679	 63391	28173	17739	50084	50628
		  524288	 128	10840	14728	41835	38325	15828	10722	35025	 21378	 33595	13076	13832	27586	13783
		  524288	 256	 5480	 6574	18468	22720	22004	 4588	17907	 10438	 17722	 6549	 6496	13983	16353
		  524288	 512	 2924	 2713	 6476	 7406	 7851	 2926	 9309	  5159	  9279	 3486	 2317	 7084	 7393
		  524288	1024	 1396	 1755	 5602	 5952	 5405	 1146	 4619	  2594	  4699	 1667	 1639	 3696	 4418
		  524288	2048	  715	  658	 1651	 1691	 2012	  762	 2385	  1255	  2312	  853	  569	 1691	 1764
		  524288	4096	  340	  448	 1238	  400	  715	  345	 1161	   561	  1133	  443	  318	  524	  573
		  524288	8192	  164	  225	  297	  240	  307	  169	  427	   207	   496	  167	  161	  218	  218
		  524288   16384	   85	   85	  186	  219	  121	   76	  174		96	   175	  104	   67	   97	   98

iozone test complete.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Here is the test with record size 16K, compression off, sync on.
(iozone results snipped; quoted in full in the post above)

Here is the test with record size 16K, compression off, sync off.
(iozone results snipped; quoted in full in the post above)

Thanks, very interesting results. Here is your diskinfo -wS result from earlier in the thread:
Code:
root@freenas2:/nonexistent # diskinfo -wS /dev/nvd0
/dev/nvd0
	   512			 # sectorsize
	   280065171456	# mediasize in bytes (261G)
	   547002288	   # mediasize in sectors
	   0			   # stripesize
	   0			   # stripeoffset
	   INTEL SSDPED1D280GA	 # Disk descr.
	   PHMB742401A6280CGN	  # Disk ident.

Synchronous random writes:
		0.5 kbytes:	 22.8 usec/IO =	 21.4 Mbytes/s
		 1 kbytes:	 20.6 usec/IO =	 47.3 Mbytes/s
		 2 kbytes:	 21.4 usec/IO =	 91.5 Mbytes/s
		 4 kbytes:	 15.8 usec/IO =	247.0 Mbytes/s
		 8 kbytes:	 19.6 usec/IO =	397.9 Mbytes/s
		 16 kbytes:	 22.8 usec/IO =	684.8 Mbytes/s
		 32 kbytes:	 31.8 usec/IO =	981.9 Mbytes/s
		 64 kbytes:	 51.2 usec/IO =   1221.4 Mbytes/s
		128 kbytes:	102.8 usec/IO =   1216.5 Mbytes/s
		256 kbytes:	166.0 usec/IO =   1506.2 Mbytes/s
		512 kbytes:	291.7 usec/IO =   1714.0 Mbytes/s
	   1024 kbytes:	555.4 usec/IO =   1800.6 Mbytes/s
	   2048 kbytes:   1060.2 usec/IO =   1886.4 Mbytes/s
	   4096 kbytes:   2099.1 usec/IO =   1905.5 Mbytes/s
	   8192 kbytes:   4179.4 usec/IO =   1914.2 Mbytes/s


At lower record sizes, iozone is about 1/3 of the diskinfo number. Mine was ~1/4, which could be down to the higher clock of your CPUs. Yet the overhead at higher record sizes (>512K) is still high (~60%); mine was closer to ~30% in that range. It could be something NAND vs. 3D XPoint related.

PS: I found that tuning the BIOS, especially disabling HT and board-controlled CPU power management, can significantly increase sync write performance. You may want to give it a try.
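As a sanity check on the "about 1/3 of diskinfo" figure: iozone reports ops/sec, so multiplying by the record size gives an approximate throughput to compare against the raw-device number. Using the 16K sync write figures from the two posts above:

```shell
# iozone sync write at 16K reclen: 13179 ops/s
# diskinfo -wS at 16 kbytes: 684.8 MB/s
ops=13179
reclen_kb=16
diskinfo_mbs=684.8

# ops/sec * record size (KB) / 1024 = MB/s
mbs=$(awk -v o="$ops" -v r="$reclen_kb" 'BEGIN { printf "%.1f", o * r / 1024 }')
pct=$(awk -v m="$mbs" -v d="$diskinfo_mbs" 'BEGIN { printf "%.0f", m / d * 100 }')
echo "iozone: ${mbs} MB/s, ~${pct}% of the raw device number"
# -> iozone: 205.9 MB/s, ~30% of the raw device number
```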
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I was in the middle of writing a post when that one showed up.

Ran those benchmarks against a testbed here and I was getting about 80% of diskinfo performance at 4/8/16K, beyond that it's 90-95%.

Check the HT and power-saving settings as suggested by @Ender117.
 
Joined
Dec 29, 2014
Messages
1,135
Yet the overhead at higher record sizes (>512K) is still high (~60%); mine was closer to ~30% in that range. It could be something NAND vs. 3D XPoint related.

I tuned the record size down to 16K to improve VM performance. I haven't tested the interactive performance of the VMs yet, but I can say for sure that the lower record size is having a significantly negative impact on the speed of Storage vMotion operations!

PS: I found that tuning the BIOS, especially disabling HT and board-controlled CPU power management, can significantly increase sync write performance. You may want to give it a try.

I am sure that HT is enabled. Are you suggesting that OS-controlled CPU speed is the way to go? I can certainly do that, as my FreeNAS units are both bare metal and don't do anything but storage.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I am sure that HT is enabled.

You've got enough cores that disabling HT won't hurt, so I would try that.

Are you suggesting that OS-controlled CPU speed is the way to go?

I'd start by disabling it entirely, and then go to OS-controlled. The CPU going into any kind of "sleep" state will greatly increase the latency involved in waking it back up.
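On FreeBSD/FreeNAS, the relevant knobs look roughly like this (a sketch; exact C-state names and availability depend on the platform, so check what your board exposes first):

```shell
# See which C-states the CPUs support and are actually using
sysctl dev.cpu.0.cx_supported dev.cpu.0.cx_usage

# Cap sleep at C1 so wake-up latency stays low
sysctl hw.acpi.cpu.cx_lowest=C1

# If you go OS-controlled instead, powerd handles frequency scaling
# (enable with powerd_enable="YES" in /etc/rc.conf)
```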
 
Joined
Dec 29, 2014
Messages
1,135
You've got enough cores that disabling HT won't hurt, so I would try that.

Here is how I have the BIOS settings now.
[screenshots of the BIOS settings attached]


Anything else you see that you'd suggest changing?
 
Joined
Dec 29, 2014
Messages
1,135
Here is an interesting thing: I made the above BIOS changes, and I am moving some VMs to it via vMotion. Take a look at what top says about nfsd waiting on CPU.
Code:
last pid:  8647;  load averages:  1.42,  2.82,  2.30																	  up 0+00:20:00  11:34:39
68 processes:  1 running, 67 sleeping
CPU:  0.2% user,  0.0% nice, 22.0% system,  2.8% interrupt, 75.0% idle
Mem: 29M Active, 220M Inact, 584M Laundry, 114G Wired, 9770M Free
ARC: 102G Total, 11M MFU, 95G MRU, 4913M Anon, 2098M Header, 108M Other
	 93G Compressed, 118G Uncompressed, 1.27:1 Ratio
Swap: 10G Total, 10G Free

  PID USERNAME	THR PRI NICE   SIZE	RES STATE   C   TIME	WCPU COMMAND
 3061 root		 20  21	0  6256K  2388K rpcsvc  2   8:24  49.55% nfsd
 6042 root		 15  20	0   213M   141M umtxn   6   0:31   1.92% uwsgi
 4850 root		 21  20	0 36492K 17404K uwait   5   0:02   0.23% consul
 3241 uucp		  1  20	0 14984K  8488K select  0   0:00   0.16% snmp-ups
 8626 root		  1  20	0  7948K  3464K CPU5	5   0:00   0.09% top
  264 root		 13  20	0   199M   149M kqread  5   0:21   0.04% python3.6
 4747 www		   1  20	0 31352K 10076K kqread  6   0:01   0.04% nginx
 2668 root		  3  20	0 20724K  6508K kqread  0   0:01   0.03% syslog-ng
 3137 root		  1  20	0 12512K 12620K select  0   0:01   0.01% ntpd
 8247 elliot		1  20	0 13216K  7300K select  3   0:00   0.00% sshd
 3243 uucp		  1  20	0   102M  2648K select  6   0:00   0.00% upsd
 4824 root		  1  20	0   147M   115M kqread  7   0:07   0.00% uwsgi
 3347 root		  1  20	0 85220K 55560K select  0   0:00   0.00% winbindd
 
Joined
Dec 29, 2014
Messages
1,135
Here is a sync on, compression off, record size 16K test after the BIOS changes.
Code:
		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	22088	22649   345977   342938   277856	18634   298145	 25845	247717	23279	22625   320338   316925
		  524288	   8	18699	19555   281264   272460   242009	17761   206286	 23760	213098	19425	19356   233624   238386
		  524288	  16	15348	15311   142649   132718   152798	15498   141504	 20491	152915	15551	15385   118293   109392
		  524288	  32	10823	10406	90298	90522   114350	10276	89425	 12824	110196	10967	10813	53241	66468
		  524288	  64	 6201	 6645	32870	44692	51955	 6456	33437	  7493	 32527	 6689	 6839	22839	36163
		  524288	 128	 3670	 3278	31965	31649	33708	 4043	18890	  4399	 20291	 3902	 3468	22593	27468
		  524288	 256	 2405	 2294	18718	18144	12339	 2141	17493	  2905	  7460	 2240	 2587	 5758	 8761
		  524288	 512	 1279	 1477	 4045	 6601	 7280	 1286	 8687	  1735	  4162	 1275	 1459	 3182	 4639
		  524288	1024	  665	  780	 1850	 2972	 3821	  679	 4494	   988	  5915	  596	  732	 2969	 3518
		  524288	2048	  360	  323	 2113	 2107	 2116	  332	 1545	   421	  1765	  374	  315	 1375	 1422
		  524288	4096	  183	  146	  985	  917	 1005	  191	 1472	   202	   441	  168	  160	  582	  609
		  524288	8192	   85	   79	  416	  404	  396	   92	  485		92	   238	   82	   95	  149	  159
		  524288   16384	   43	   38	  178	  169	  181	   47	  234		45		92	   43	   38	   98	   97

iozone test complete.

Here is a sync off, compression off, record size 16K test after the BIOS changes.
Code:
		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4   111667   119149   336434   347803   194594	89076   315069	148233	208375   122403   115359   317707   232941
		  524288	   8	78859   112411   182790   215608   177820	88411   164612	141886	148604	98523	95586   271420   190667
		  524288	  16	73804	84856   191229   178865   177511	74686   124422	128961	140550	76393	94896	96863   143167
		  524288	  32	38349	46196   121239   130260   117580	32701   115528	 73278	109846	47612	48851   103936	52042
		  524288	  64	20680	23983	62043	63506	61935	25529	46832	 39234	 46402	20836	24014	49599	53224
		  524288	 128	11023	13405	36126	15906	26491	10563	34033	 18418	 33484	12899	 9482	23033	22362
		  524288	 256	 5363	 6196	18120	17969	18810	 6360	 9371	  9585	 14186	 5563	 6279	12577	14723
		  524288	 512	 2937	 2996	10242	11794	11139	 2227	 8641	  4761	  9018	 3200	 3163	 7446	 5888
		  524288	1024	 1363	 1577	 4483	 4060	 4658	 1561	 4471	  2846	  3524	 1172	 1572	 3127	 3236
		  524288	2048	  658	  917	 3092	 2089	  963	  682	 2236	  1140	  2291	  804	  857	 1784	  762
		  524288	4096	  332	  376	 1022	 1045	 1022	  384	  550	   497	   654	  340	  390	  717	  913
		  524288	8192	  167	  136	  395	  397	  410	  182	  413	   211	   221	  156	  182	  217	  208
		  524288   16384	   82	   72	  176	  178	  188	  108	  103		90	   120	   79	   88	   70	   74

iozone test complete.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
I was in the middle of writing a post when that one showed up.

Ran those benchmarks against a testbed here and I was getting about 80% of diskinfo performance at 4/8/16K, beyond that it's 90-95%.

Check the HT and power-savings settings as suggested by @Ender117
For your testbed, was it the one with a pair of S3700s and a single CPU (X5690?)? While NUMA latency still seems a compelling explanation, I'm starting to wonder if it's because the ZIL/SLOG implementation was optimized for SATA/SAS SSDs, which were the fastest devices around when that code was written.

I tuned the record size down to 16K to improve VM performance. I haven't looked at testing the interactive performance of the VM's, but I can say for sure that the lower record size is having a significantly negative impact on the speed of storage Vmotion operations!
I agree with you on setting it to 16K. I should also have been clearer that my ~75% overhead in the lower (write) blocksize range and ~30% in the higher range were measured on a recordsize=16K dataset. With the dataset set to recordsize=128K, there was essentially no overhead in the higher write blocksize range.
 

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Here is how I have the BIOS settings now.
[BIOS settings screenshots]

Anything else you see that you suggest changing?
I don't know about your system, but if "Workload Configuration" has something like "latency optimized," I would select that.
For the energy/performance policy I would pick OS-controlled or similar. This can save a lot of watts at idle and doesn't seem to hurt performance.
Here is a sync on, compression off, record size 16K test after BIOS changes.
(iozone results snipped; quoted in full in the post above)

Here is a sync off, compression off, record size 16K test after BIOS changes.
(iozone results snipped; quoted in full in the post above)
That was what I was seeing: changing the BIOS yielded ~2x sync write IOPS. Now if you run diskinfo -wS again (I believe it can be done on a partition, like diskinfo -wS nvd0p1), you should see the same improvement there as well.
 
Joined
Dec 29, 2014
Messages
1,135
I don't know about your system. But if "workload configuration" have something like latency optimized I would select that.
I will see what options are available there.
That was what I was seeing: changing the BIOS yielded ~2x sync write IOPS. Now if you run diskinfo -wS again (I believe it can be done on a partition, like diskinfo -wS nvd0p1), you should see the same improvement there as well.
Yes, that can be done on a partition. Here are those results. I removed the SLOG from the pool first. I guess I could have done it on one of the other SLOG partitions too.
Code:
root@freenas2:/mnt/RAIDZ2-I/ftp # diskinfo -wS /dev/nvd0p1
/dev/nvd0p1
		512			 # sectorsize
		17179869184	 # mediasize in bytes (16G)
		33554432		# mediasize in sectors
		0			   # stripesize
		1048576		 # stripeoffset
		2088			# Cylinders according to firmware.
		255			 # Heads according to firmware.
		63			  # Sectors according to firmware.
		INTEL SSDPED1D280GA	 # Disk descr.
		PHMB742401A6280CGN	  # Disk ident.

Synchronous random writes:
		 0.5 kbytes:	 13.7 usec/IO =	 35.7 Mbytes/s
		   1 kbytes:	 13.7 usec/IO =	 71.2 Mbytes/s
		   2 kbytes:	 14.2 usec/IO =	137.4 Mbytes/s
		   4 kbytes:	 11.5 usec/IO =	340.3 Mbytes/s
		   8 kbytes:	 13.4 usec/IO =	581.7 Mbytes/s
		  16 kbytes:	 18.1 usec/IO =	865.1 Mbytes/s
		  32 kbytes:	 25.8 usec/IO =   1210.0 Mbytes/s
		  64 kbytes:	 41.3 usec/IO =   1513.1 Mbytes/s
		 128 kbytes:	 73.9 usec/IO =   1691.3 Mbytes/s
		 256 kbytes:	134.0 usec/IO =   1865.1 Mbytes/s
		 512 kbytes:	249.4 usec/IO =   2004.6 Mbytes/s
		1024 kbytes:	484.6 usec/IO =   2063.5 Mbytes/s
		2048 kbytes:	950.4 usec/IO =   2104.4 Mbytes/s
		4096 kbytes:   1889.2 usec/IO =   2117.3 Mbytes/s
		8192 kbytes:   3757.1 usec/IO =   2129.3 Mbytes/s

This system does have a faster CPU than it did when I ran the first test, so that could have had an impact as well.
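As a cross-check on the table above, diskinfo's throughput column is just the transfer size divided by the measured per-I/O latency, with "Mbytes" meaning binary megabytes (MiB). A quick sketch (the row values are taken from the 900p output above; small discrepancies come from the latency column being rounded):

```python
# Sanity-check diskinfo -wS output: throughput (Mbytes/s) equals
# transfer size / per-I/O latency, expressed in binary megabytes.

def mbytes_per_sec(size_bytes: int, usec_per_io: float) -> float:
    return size_bytes / (usec_per_io * 1e-6) / (1024 * 1024)

# (transfer size in bytes, usec/IO) rows from the 900p run above.
rows = [
    (512, 13.7),           # table reports 35.7 Mbytes/s
    (4 * 1024, 11.5),      # table reports 340.3 Mbytes/s
    (8192 * 1024, 3757.1), # table reports 2129.3 Mbytes/s
]
for size, usec in rows:
    print(f"{size:>8} bytes: {mbytes_per_sec(size, usec):7.1f} Mbytes/s")
```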
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Here is an interesting thing. I made the above BIOS changes, and I am moving some VMs to it via vMotion. Take a look at what top says about nfsd waiting on CPU.
Code:
  PID USERNAME	THR PRI NICE   SIZE	RES STATE   C   TIME	WCPU COMMAND
 3061 root		 20  21	0  6256K  2388K rpcsvc  2   8:24  49.55% nfsd

I've seen some reports that the default number of threads for the L2ARC feed is insufficient, and causes CPU blocking. Can you try, as a test, removing your L2ARC (cache) device and see if the nfsd usage drops?

Also, it's worth asking - did you ever check this metric before the changes? It might have always been an issue ...

Here is a sync=on, compression=off, recordsize=16K test after the BIOS changes.

Decent improvements across the board but still way under what the diskinfo says. Try pulling the L2ARC as suggested above, if nfsd is blocking on CPU that could be a problem.
 
Last edited:

Ender117

Patron
Joined
Aug 20, 2018
Messages
219
Here are my results
Code:
root@freenas:/mnt/Mirrors/test # diskinfo -wS /dev/nvd0p2
/dev/nvd0p2
		4096			# sectorsize
		16106127360	 # mediasize in bytes (15G)
		3932160		 # mediasize in sectors
		131072		  # stripesize
		0			   # stripeoffset
		244			 # Cylinders according to firmware.
		255			 # Heads according to firmware.
		63			  # Sectors according to firmware.
		INTEL SSDPEDMD400G4	 # Disk descr.
		CVFT7112001A400LGN	  # Disk ident.

Synchronous random writes:
		   4 kbytes:	 11.6 usec/IO =	336.5 Mbytes/s
		   8 kbytes:	 14.0 usec/IO =	558.3 Mbytes/s
		  16 kbytes:	 20.8 usec/IO =	750.2 Mbytes/s
		  32 kbytes:	 35.5 usec/IO =	879.8 Mbytes/s
		  64 kbytes:	 66.3 usec/IO =	942.6 Mbytes/s
		 128 kbytes:	135.3 usec/IO =	923.7 Mbytes/s
		 256 kbytes:	249.7 usec/IO =   1001.2 Mbytes/s
		 512 kbytes:	488.2 usec/IO =   1024.1 Mbytes/s
		1024 kbytes:	959.4 usec/IO =   1042.3 Mbytes/s
		2048 kbytes:   1919.0 usec/IO =   1042.2 Mbytes/s
		4096 kbytes:   3806.9 usec/IO =   1050.7 Mbytes/s
		8192 kbytes:   7638.7 usec/IO =   1047.3 Mbytes/s

root@freenas:~ # cd /mnt/Mirrors/test
root@freenas:/mnt/Mirrors/test # iozone -a -s 512M -O
		Iozone: Performance Test of File I/O
				Version $Revision: 3.457 $
				Compiled for 64 bit mode.
				Build: freebsd

		Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
					 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
					 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
					 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
					 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
					 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
					 Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
					 Vangel Bojaxhi, Ben England, Vikentsi Lapa,
					 Alexey Skidanov.

		Run began: Fri Nov  2 09:43:53 2018

		Auto Mode
		File size set to 524288 kB
		OPS Mode. Output is in operations per second.
		Command line used: iozone -a -s 512M -O
		Time Resolution = 0.000001 seconds.
		Processor cache size set to 1024 kBytes.
		Processor cache line size set to 32 bytes.
		File stride size set to 17 * record size.
															  random	random	 bkwd	record	stride
			  kB  reclen	write  rewrite	read	reread	read	 write	 read   rewrite	  read   fwrite frewrite	fread  freread
		  524288	   4	20557	21914   335121   349229   262885	18949   296270	 19680	267664	21718	21378   314948   323410
		  524288	   8	18297	19261   270069   278251   224346	17610   240095	 22014	233534	19014	18613   227363   239429
		  524288	  16	14989	15017   191080   197659   186458	14986   160568	 18576	169139	15112	15209   150189   167535
		  524288	  32	10317	10225   109307   117916   118349	10154	98531	 13341	103951	10432	10198	97890	89721
		  524288	  64	 6047	 5750	59981	45895	47483	 5877	59481	  7857	 54287	 6023	 5780	44383	49981
		  524288	 128	 3434	 3428	28742	31934	35218	 3421	35249	  3933	 32924	 3421	 3473	22226	23294
		  524288	 256	 2245	 2268	 8163	10684	14528	 2211	 8938	  2777	  9567	 2262	 2280	14529	13632
		  524288	 512	 1291	 1268	 8519	 8599	 9074	 1413	 4685	  1665	  5174	 1346	 1291	 6547	 6640
		  524288	1024	  713	  737	 3319	 2634	 3434	  694	 4930	   819	  2272	  719	  674	 3250	 3223
		  524288	2048	  368	  349	 1823	 1774	 2150	  365	 1328	   404	  1406	  369	  353	 1119	 1192
		  524288	4096	  162	  135	 1052	 1078	 1281	  153	  854	   189	   618	  157	  143	  750	  807
		  524288	8192	   73	   71	  293	  304	  420	   72	  311		81	   322	   74	   70	  330	  211
		  524288   16384	   35	   33	  240	  234	  230	   33	  190		38	   165	   36	   34	  123	   76

iozone test complete.



The dataset was set to sync=off compression=off recordsize=16K. It's interesting to see that while the P3700 is slower than the 900P, especially at higher I/O sizes, iozone is faster at 512K-2048K write sizes.
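To put the iozone numbers side by side with diskinfo, the ops/sec figures have to be converted to throughput (ops/sec times the record length). A rough sketch using the write column from the P3700 table above; note the comparison is only approximate since this dataset had sync=off while diskinfo -wS measures synchronous writes:

```python
# Convert iozone's operations-per-second (-O mode) into MiB/s so it
# can be compared against diskinfo -wS, which reports binary Mbytes/s.

def iozone_mbytes_per_sec(ops_per_sec: float, reclen_kb: int) -> float:
    return ops_per_sec * reclen_kb / 1024.0  # KiB/s -> MiB/s

# (reclen in KiB, write ops/s) from the P3700 iozone run above.
for reclen, ops in [(4, 20557), (16, 14989), (512, 1291), (2048, 368)]:
    print(f"{reclen:>5}K: {iozone_mbytes_per_sec(ops, reclen):7.1f} Mbytes/s")
```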
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
For your testbed, was it the one with a pair of S3700s and a single CPU (X5690?)? While NUMA latency still seems a compelling explanation, I am starting to wonder whether the ZIL/SLOG implementation was optimized for SATA/SAS SSDs, which were the fastest devices available when it was written.

Yes, this was the single-X5650 with S3700. It might be that the PCIe devices are trying and able to enter a low-power "sleep state" more readily due to their closer hardware ties; the SATA/SAS devices might always have some activity or the HBA might be keeping itself in full power mode all the time and therefore there's no wake-up latency.

I agree with you on setting it to 16K. I should also have been clearer that my ~75% overhead in the lower (write) blocksize range and ~30% in the higher range was on a recordsize=16K dataset. With the dataset set to recordsize=128K, there was essentially no overhead in the higher write blocksize range.

I don't have that much overhead on a recordsize=16K dataset. My small-block (4-8-16K) is about 80% of the diskinfo, and from there on up it's 90%+.
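The overhead figures being traded here reduce to a ratio of achieved pool throughput to raw SLOG device throughput. A minimal sketch; the input numbers below are hypothetical, chosen only to match the rough ~80% ratio quoted above:

```python
def efficiency(achieved_mb_s: float, raw_mb_s: float) -> float:
    """Fraction of the raw device throughput the pool actually achieves."""
    return achieved_mb_s / raw_mb_s

# Hypothetical small-block figures: ~272 MB/s through the pool against
# ~340 MB/s raw from diskinfo -wS, i.e. roughly 80% (or ~20% overhead).
print(f"{efficiency(272, 340) * 100:.0f}%")
```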
 
Last edited: