AutoTune, or not to AutoTune

Status
Not open for further replies.

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
Hi,

after a short conversation in the FreeNAS Facebook group, I read that turning AutoTune on is not the best idea. To get a better picture of the reasons, pro and contra, I started researching. Here is what I know as of now:
On my system, AutoTune appears to limit the ARC to about 31 GB out of 32 GB, so the limitation doesn't seem that critical. It also applies some network-related tweaks, presumably to squeeze a bit more performance out of the system (at least that's my understanding).
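
(If anyone wants to check this on their own box: the cap and the current ARC size can be read via sysctl - the names below are the standard FreeBSD ones, output omitted.)
Code:
sysctl vfs.zfs.arc_max                   # the ARC cap (bytes) that AutoTune applies
sysctl kstat.zfs.misc.arcstats.size      # the current ARC size (bytes)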

Some details about my setup:
  • Intel Xeon E3-1220 v3 @ 3.10GHz (dedicated for FreeNAS)
  • 32 GB ECC RAM
  • 2 TB RAID10 (Tuned for performance; VM Storage)
  • 8 TB RAID10 (Tuned for performance; VM Storage)
  • 28 TB RAIDZ2 (Performance not critical; Datagrave for all kind of things: media, backup, etc)
  • ZLOG and ZIL devices for each volume (SSD drives)
I'm not a ZFS expert and would appreciate any kind of details/information, shared experiences, knowledge, etc about this!

Thanks a lot in advance!

Regards,
Patrik
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi Patrik,

We're going to need a more technical level of detail here to see how to squeeze the most out of that system - mostly around your disk configuration.

Please post the following:
  1. The output of zpool status inside of CODE tags (this will tell us how many drives are set up in each of your pools)
  2. How the drives from the above pool status are connected (onboard SATA, SAS HBA, whether expanders or external shelves are used)
  3. Model and confirmation of "HBA/IT mode" firmware of any SAS HBAs used
  4. Exact manufacturer and model of the SSDs and how/where they're attached
  5. Method of network connectivity and protocols (eg: "Intel dual-port 1Gbps, serving iSCSI")
  6. Version of FreeNAS being used (current stable is 11.1-U6)
Some assumptions are made here:
  • You've left the default compression of lz4 enabled
  • You aren't using deduplication
  • Your hypervisor is VMware ESXi/vSphere
Drop some stats, and let's get our hands dirty!
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
Hi,

thanks for your reply! Unfortunately the mail notification from the FreeNAS forum landed in my Google Mail spam folder, so I only saw your answer a bit later.

However - here is the requested information about my setup:
  1. The output of zpool status
    Code:
    # zpool status
      pool: data-bronze
     state: ONLINE
      scan: scrub repaired 0 in 0 days 16:31:46 with 0 errors on Sun Oct  7 18:31:47 2018
    config:
    
    		NAME											STATE	 READ WRITE CKSUM
    		data-bronze									 ONLINE	   0	 0	 0
    		  raidz2-0									  ONLINE	   0	 0	 0
    			gptid/3a09d1b4-433a-11e8-83f1-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/3aa36154-433a-11e8-83f1-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/3b3d6a0b-433a-11e8-83f1-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/3be76ee9-433a-11e8-83f1-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/3c853af1-433a-11e8-83f1-9457a58f6bf4  ONLINE	   0	 0	 0
    		logs
    		  mirror-1									  ONLINE	   0	 0	 0
    			gptid/f6b08dce-bad8-11e8-b285-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/f6b659e9-bad8-11e8-b285-9457a58f6bf4  ONLINE	   0	 0	 0
    		cache
    		  gptid/f6bc8096-bad8-11e8-b285-9457a58f6bf4	ONLINE	   0	 0	 0
    		  gptid/f794824f-bad8-11e8-b285-9457a58f6bf4	ONLINE	   0	 0	 0
    
    errors: No known data errors
    
      pool: data-gold
     state: ONLINE
      scan: scrub repaired 0 in 0 days 01:33:13 with 0 errors on Sun Oct  7 03:33:16 2018
    config:
    
    		NAME											STATE	 READ WRITE CKSUM
    		data-gold									   ONLINE	   0	 0	 0
    		  mirror-0									  ONLINE	   0	 0	 0
    			gptid/009b2674-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/018abf52-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    		  mirror-1									  ONLINE	   0	 0	 0
    			gptid/0271cffb-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/0355eabf-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    		  mirror-2									  ONLINE	   0	 0	 0
    			gptid/0430d956-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/079533e6-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    		logs
    		  mirror-3									  ONLINE	   0	 0	 0
    			gptid/a0905e71-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    			gptid/a10fdab7-fb11-11e7-b488-9457a58f6bf4  ONLINE	   0	 0	 0
    		cache
    		  gptid/a47e082e-fb11-11e7-b488-9457a58f6bf4	ONLINE	   0	 0	 0
    		  gptid/a4df9c42-fb11-11e7-b488-9457a58f6bf4	ONLINE	   0	 0	 0
    		spares
    		  gptid/39d5c2d9-fb11-11e7-b488-9457a58f6bf4	AVAIL
    
    errors: No known data errors
    
      pool: freenas-boot
     state: ONLINE
      scan: scrub repaired 0 in 0 days 00:00:18 with 0 errors on Sun Oct  7 03:45:18 2018
    config:
    
    		NAME		STATE	 READ WRITE CKSUM
    		freenas-boot  ONLINE	   0	 0	 0
    		  mirror-0  ONLINE	   0	 0	 0
    			ada2p2  ONLINE	   0	 0	 0
    			ada3p2  ONLINE	   0	 0	 0
    
    errors: No known data errors
    


  2. How the drives from the above pool status are connected (onboard SATA, SAS HBA, whether expanders or external shelves are used)
    - Capacity disks connected via: external shelf using 2x SAS on my HP Smart Array P421, which is running in HBA mode (an option in the RAID controller firmware)
    - SSDs for data-gold and data-bronze connected via: SAS HBA (HP H220 6G SAS Dual Port HBA)
    - SSDs for freenas-boot and data-silver connected via onboard SATA (even though it's irrelevant to this question)

  3. Model and confirmation of "HBA/IT mode" firmware of any SAS HBAs used
    See above.

  4. Exact manufacturer and model of the SSDs and how/where they're attached
    2x Samsung 850 PRO 256 GB for data-gold via said SAS HBA
    2x Samsung 850 EVO 250 GB for data-bronze via said SAS HBA
    2x Samsung 860 EVO 500 GB for data-silver via onboard SATA (6 Gbps) (planned, doesn't exist yet)
    2x SanDisk 32 GB SSDs for freenas-boot via onboard SATA (3 Gbps) (even though it's probably irrelevant to this question)

    To prevent any shock when you find out anyway: yes, I'm using 2x SSDs for ZLOG and ZIL (RAID1) on the same disks, with two partitions per pool. It's not the best choice, but it's the cheaper solution and I think sufficient for my homelab. Having dedicated SSD devices for ZLOG and ZIL for my 3 pools would mean a total of 12 SSDs - which is quite pricey, and I'd have no idea where to attach them.

  5. Method of network connectivity and protocols (eg: "Intel dual-port 1Gbps, serving iSCSI")
    Currently: 4x 1 Gbps, serving iSCSI (8x 1 Gbps planned, waiting for second storage switch)

  6. Version of FreeNAS being used (current stable is 11.1-U6)
    11.1-U6

Thanks again! :)
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
It looks like it still sets memory headroom (which is why I was using it previously) and then sets additional network buffer sizes and settings depending on available system memory. Unless you have a network performance issue with the normal system-determined values, I suspect you will be fine without autotune.

For example, it sets net.inet.tcp.delayed_ack to 0. Unless you know your network performs better with that setting, it's probably not needed. If there are performance issues, one may want to tune net.inet.tcp.delacktime instead (default = 100). For example, with low-latency 10G links a setting of 20 might be better.
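
To illustrate, these can be inspected and trialled at runtime (the values below are only examples; persist anything that helps as a sysctl-type tunable in the GUI):
Code:
sysctl net.inet.tcp.delayed_ack          # autotune sets this to 0
sysctl net.inet.tcp.delacktime           # default is 100
sysctl net.inet.tcp.delacktime=20        # temporary test value for low-latency 10G links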

The point being: from the code, it looks to me (not an expert) like autotune is trying to get you "unstuck" or "out of noticeable performance issues" rather than necessarily "optimized."
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
If AutoTune is limiting the ARC to prevent the system from crashing, that may actually be a good thing. Tweaking some network buffer values also sounds good to me. It would be interesting to know whether AutoTune is boosting the network values (with performance in mind) or reducing them to prevent memory issues. So I'm really not sure whether enabling AutoTune is a good thing, or the most evil function to turn on.

However - you asked for the set tunables, and here are the ones currently applied:
[Screenshot of the currently applied tunables: upload_2018-10-12_19-42-8.png]


I researched a few of the values in the past, but unfortunately I don't really know EXACTLY what each tunable does and what the positive or negative side effects may be. I work in neither the networking nor the storage field - but of course I'm interested in learning more, which is why I'm asking here :)

Just in case: I'm also using Jumbo frames on my FreeNAS, VMware ESXi hosts and on the dedicated storage switches I have.
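
(For what it's worth, a jumbo-frame path can be verified end to end roughly like this - the address is a placeholder, and 8972 is the 9000-byte MTU minus 28 bytes of IP/ICMP headers.)
Code:
ping -D -s 8972 192.168.10.10        # from FreeNAS; -D sets the don't-fragment bit
# vmkping -d -s 8972 192.168.10.10   # the equivalent test from an ESXi host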
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@pkernstock

Sorry if I've missed it - but what are the model/size of the actual "capacity disks" backing these vdevs?

I'm assuming 10TB spinning disks for the "data-bronze" pool (since five of them in a single RAIDZ2 gives you about 28TB of space) but what are the ones backing the mirrors in "data-gold"? Are they spinning disks or SSDs?

To prevent any shock when you find out anyway: yes, I'm using 2x SSDs for ZLOG and ZIL (RAID1) on the same disks, with two partitions per pool. It's not the best choice, but it's the cheaper solution and I think sufficient for my homelab. Having dedicated SSD devices for ZLOG and ZIL for my 3 pools would mean a total of 12 SSDs - which is quite pricey, and I'd have no idea where to attach them.

None of those SSDs is really a suitable SLOG (not ZLOG, not ZIL ;) ) device due to lacking power-loss-protection - they would make fine L2ARC devices, but with only 32GB of RAM I wouldn't put too much of that in.

In addition, the I/O profile of cache and log vdevs is completely different, and mixing those two workloads on the same device tends to make it do a poor job of both.
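
If you want to see that contention for yourself, something like the following breaks the I/O out per vdev, including the log and cache devices (the interval is arbitrary):
Code:
zpool iostat -v data-gold 5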

Aside: I can't say I'm a fan of the HP P-series cards in "HBA mode", but if they're behaving and passing SMART data then carry on, I suppose.

Another shell output request in CODE blocks again please:

zfs get all data-bronze

zfs get all data-gold
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
[EDIT: this is in reply to post 6]

I think it's basically trying to give the OS itself more memory. It limits the ARC for that reason, i.e. we don't want a case where a rapid request for memory exhausts the system because the ARC cannot free memory quickly enough. (There are other ways to accomplish this, btw.) From reading the code, the network buffers are set based on the amount of memory in the system (more memory equating to larger buffers). I think the main reason is improving system operation and stability, not necessarily optimizing performance. But if the network load isn't high, then you are probably better off with that memory used for the ARC, for example.
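
As one example of those "other ways": you can cap the ARC yourself with a single tunable - the value below is just an illustration for a 32 GB box, not a recommendation.
Code:
sysctl vfs.zfs.arc_max                   # current cap in bytes
# manual alternative: add a loader-type tunable in the GUI, e.g.
#   vfs.zfs.arc_max=27917287424          # ~26 GiB, leaving headroom for the OS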

Truly optimizing the network for instance really depends on what the use case is, i.e. physical setup, services hosted, and applications. autotune can't know all of that.

From what I've seen, in general, you can let FreeNAS/FreeBSD use its defaults and you will be OK. I have only adjusted a few values after hitting a reported issue (for example, my HBA required an increase in hw.mps.max_chains).
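
If anyone wants to check theirs (this assumes an LSI SAS2 HBA using the mps(4) driver; SAS3 cards use hw.mpr.max_chains instead), the value is readable at runtime but has to be set at boot:
Code:
sysctl hw.mps.max_chains                 # current value
# persist a larger value as a loader-type tunable, e.g. hw.mps.max_chains=4096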

I'm sure some other folks with more knowledge than I can suggest which, if any, optimizations would help. The good news is I doubt you will have something break if you keep all those values. :)
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
@HoneyBadger

Sorry if I've missed it - but what are the model/size of the actual "capacity disks" backing these vdevs?

I'm assuming 10TB spinning disks for the "data-bronze" pool (since five of them in a single RAIDZ2 gives you about 28TB of space) but what are the ones backing the mirrors in "data-gold"? Are they spinning disks or SSDs?

Oh, sorry! I forgot that! But yes, you're right: data-bronze consists of 5x 10 TB disks. The disks are as follows:
  • data-gold: 5x WD Red 1 TB (+1 Spare) with RAID10
  • data-silver: 4x WD Red 4 TB with RAID10
  • data-bronze: 5x WD Red 10 TB with RAIDZ2
None of those SSDs is really a suitable SLOG (not ZLOG, not ZIL ;) ) device due to lacking power-loss-protection - they would make fine L2ARC devices, but with only 32GB of RAM I wouldn't put too much of that in.

In addition, the I/O profile of cache and log vdevs is completely different, and mixing those two workloads on the same device tends to make it do a poor job of both.

Ah! Good point. For the "power-loss protection" I'm hoping that my UPSes and (in the worst case) ZFS might be able to save me from unexpected crashes. The power here is quite stable anyway. But maybe I'll look around in the future for some cheap, used datacenter SSDs or so...

I understand. I have 6 SATA ports and 6 SSDs, and I'm not sure how to use these 6 SSDs in a way that gives me both reliability and the best possible performance. Due to the port limitation I can't attach more than 6 drives.

Aside: I can't say I'm a fan of the HP P-series cards in "HBA mode", but if they're behaving and passing SMART data then carry on, I suppose.

Me neither. I would be happier with a "real" HBA, but it's working fine so far. It was basically a decision between "re-use the HP Smart Array in HBA mode in the server" or "buy an extra HBA at additional cost and get rid of the HP Smart Array". The first choice is doing fine for my lab.

Requested settings:
zfs get all data-bronze
Code:
NAME		 PROPERTY				 VALUE					SOURCE
data-bronze  type					 filesystem			   -
data-bronze  creation				 Wed Apr 18 20:56 2018	-
data-bronze  used					 12.4T					-
data-bronze  available				13.5T					-
data-bronze  referenced			   156K					 -
data-bronze  compressratio			1.01x					-
data-bronze  mounted				  yes					  -
data-bronze  quota					none					 local
data-bronze  reservation			  none					 local
data-bronze  recordsize			   128K					 default
data-bronze  mountpoint			   /mnt/data-bronze		 default
data-bronze  sharenfs				 off					  default
data-bronze  checksum				 on					   default
data-bronze  compression			  lz4					  local
data-bronze  atime					on					   default
data-bronze  devices				  on					   default
data-bronze  exec					 on					   default
data-bronze  setuid				   on					   default
data-bronze  readonly				 off					  default
data-bronze  jailed				   off					  default
data-bronze  snapdir				  hidden				   default
data-bronze  aclmode				  passthrough			  local
data-bronze  aclinherit			   passthrough			  local
data-bronze  canmount				 on					   default
data-bronze  xattr					off					  temporary
data-bronze  copies				   1						default
data-bronze  version				  5						-
data-bronze  utf8only				 off					  -
data-bronze  normalization			none					 -
data-bronze  casesensitivity		  sensitive				-
data-bronze  vscan					off					  default
data-bronze  nbmand				   off					  default
data-bronze  sharesmb				 off					  default
data-bronze  refquota				 none					 local
data-bronze  refreservation		   none					 local
data-bronze  primarycache			 all					  default
data-bronze  secondarycache		   all					  default
data-bronze  usedbysnapshots		  0						-
data-bronze  usedbydataset			156K					 -
data-bronze  usedbychildren		   12.4T					-
data-bronze  usedbyrefreservation	 0						-
data-bronze  logbias				  latency				  default
data-bronze  dedup					off					  default
data-bronze  mlslabel										  -
data-bronze  sync					 standard				 default
data-bronze  refcompressratio		 1.00x					-
data-bronze  written				  156K					 -
data-bronze  logicalused			  7.85T					-
data-bronze  logicalreferenced		31K					  -
data-bronze  volmode				  default				  default
data-bronze  filesystem_limit		 none					 default
data-bronze  snapshot_limit		   none					 default
data-bronze  filesystem_count		 none					 default
data-bronze  snapshot_count		   none					 default
data-bronze  redundant_metadata	   all					  default
data-bronze  org.freenas:description  Datagrave				local


zfs get all data-gold
Code:
NAME	   PROPERTY				 VALUE					SOURCE
data-gold  type					 filesystem			   -
data-gold  creation				 Wed Jan 17  0:00 2018	-
data-gold  used					 1.77T					-
data-gold  available				880G					 -
data-gold  referenced			   88K					  -
data-gold  compressratio			1.56x					-
data-gold  mounted				  yes					  -
data-gold  quota					none					 local
data-gold  reservation			  none					 local
data-gold  recordsize			   128K					 default
data-gold  mountpoint			   /mnt/data-gold		   default
data-gold  sharenfs				 off					  default
data-gold  checksum				 on					   default
data-gold  compression			  lz4					  local
data-gold  atime					on					   default
data-gold  devices				  on					   default
data-gold  exec					 on					   default
data-gold  setuid				   on					   default
data-gold  readonly				 off					  default
data-gold  jailed				   off					  default
data-gold  snapdir				  hidden				   default
data-gold  aclmode				  passthrough			  local
data-gold  aclinherit			   passthrough			  local
data-gold  canmount				 on					   default
data-gold  xattr					off					  temporary
data-gold  copies				   1						default
data-gold  version				  5						-
data-gold  utf8only				 off					  -
data-gold  normalization			none					 -
data-gold  casesensitivity		  sensitive				-
data-gold  vscan					off					  default
data-gold  nbmand				   off					  default
data-gold  sharesmb				 off					  default
data-gold  refquota				 none					 local
data-gold  refreservation		   none					 local
data-gold  primarycache			 all					  default
data-gold  secondarycache		   all					  default
data-gold  usedbysnapshots		  0						-
data-gold  usedbydataset			88K					  -
data-gold  usedbychildren		   1.77T					-
data-gold  usedbyrefreservation	 0						-
data-gold  logbias				  latency				  default
data-gold  dedup					off					  default
data-gold  mlslabel										  -
data-gold  sync					 standard				 default
data-gold  refcompressratio		 1.00x					-
data-gold  written				  88K					  -
data-gold  logicalused			  663G					 -
data-gold  logicalreferenced		11.5K					-
data-gold  volmode				  default				  default
data-gold  filesystem_limit		 none					 default
data-gold  snapshot_limit		   none					 default
data-gold  filesystem_count		 none					 default
data-gold  snapshot_count		   none					 default
data-gold  redundant_metadata	   all					  default
data-gold  org.freebsd.ioc:active   yes					  local
data-gold  org.freenas:description  Fast Storage			 local
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
@toadman

Thanks a lot for your suggestions! I see. That makes sense!

I've had AutoTune enabled since the beginning of my new setup - so about 6-8 months, I'd say. I've never experienced any (serious) issues, and RAM usage is always around 95% - so the ARC seems to be used quite well. Unfortunately I don't have any performance numbers to compare against.

To get a better understanding of what the tunables do, I think I'll need to research them one by one. Even then it might still be quite hard to understand exactly what results each setting brings (especially in different variants and combinations with others).
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Ah! Good point. For the "power-loss protection" I'm hoping that my UPSes and (in the worst case) ZFS might be able to save me from unexpected crashes. The power here is quite stable anyway. But maybe I'll look around in the future for some cheap, used datacenter SSDs or so...

Assuming your SSDs actually honor the request from ZFS to "put this on safe storage" the only downside to an SSD without PLP is the horrible write performance. If the SSD lies about doing safe writes though in order to make itself "feel faster" - then you're in a world of hurt and potentially putting your VMs at risk.

UPS and stable power are always beneficial - but they won't save you in the case of something like an HBA failure or other system crash.

I assume you're using ZVOLs to back your iSCSI shares, not file-based extents.

Can you do zfs get sync,volblocksize poolname/zvol for each of the extents?

For example, if the ZVOL was called "vmfs" you would use zfs get sync,volblocksize data-gold/vmfs
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
UPS and stable power are always beneficial - but they won't save you in the case of something like an HBA failure or other system crash.

Ah, yes, right. Very good point!

I assume you're using ZVOLs to back your iSCSI shares, not file-based extents.

Yes, exactly. I'm using ZVOLs for iSCSI.

Here's the requested output:
Code:
# zfs get sync,volblocksize data-gold/vm-storage
NAME				  PROPERTY	  VALUE	 SOURCE
data-gold/vm-storage  sync		  standard  default
data-gold/vm-storage  volblocksize  128K	  -

# zfs get sync,volblocksize data-bronze/vm-storage-bronze
NAME						   PROPERTY	  VALUE	 SOURCE
data-bronze/vm-storage-bronze  sync		  standard  default
data-bronze/vm-storage-bronze  volblocksize  16K	   -


I just realized that 128K is probably way too much, and 16K way too little. Based on my research, 64K does the best job for VM storage.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Two things I've noticed.

First and foremost - you're using sync=standard, which means your SLOG devices are doing absolutely nothing, since I have yet to meet an iSCSI initiator that demands sync writes by default. You can verify this by running zilstat and watching the rows of zeroes roll up your screen.
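
A quick sketch, using your ZVOL names from above (the interval argument is optional):
Code:
zfs get sync data-gold/vm-storage        # "standard" = iSCSI writes are treated as async
zilstat 5                                # sample ZIL activity; all-zero columns = SLOG sitting idle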

Given your current SLOG devices (shared, PLP-less Samsung SSDs) I would advise against flipping on sync=always because performance will likely nosedive.

Secondly - you've got your block sizes backwards; your "media graveyard" should probably be using 128K records (and a file-based protocol like SMB/NFS, not block storage) and your performance-oriented VM storage should be 16K (if you value your I/O latency that is; if you're more interested in sequential performance you can bump this up)

Unfortunately unlike a dataset's recordsize parameter, volblocksize is immutable once set; if you have the free space you can create another ZVOL with the intended size and svMotion over to it.
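
A rough sketch of that workaround - the ZVOL name and size here are made up, so adjust them to your environment:
Code:
# create a replacement ZVOL at the intended block size (sparse, so it only consumes what's written)
zfs create -s -V 1T -o volblocksize=16K data-gold/vm-storage-16k
# add it as a new iSCSI extent/datastore, Storage vMotion the VMs across, then destroy the old ZVOL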
 

pkernstock

Cadet
Joined
Nov 10, 2017
Messages
8
Regarding ZLOG: since my ZLOG devices don't seem to be used anyway, does it make sense to just remove the ZLOG-dedicated partitions from my SSDs and use the whole disk as ZLOG? Is there any performance difference between using the whole disk and just a single partition?

Regarding sizes: in this thread I've read that 64K is probably more suitable for VM storage. Are there any technical reasons why it should be 16K?

Based on my research, the following values are recommended for VM storage:
- volblocksize: 16K (HoneyBadger) or 64K (Source #1, Source #2)
- recordsize: 512B or 4K (only these two seem to be supported by ESXi? Found here.)
- LogicalBlockSize (iSCSI): 512B or 4K (to align with recordsize)

Thanks again for your valuable feedback!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SLOG: The partition layout won't matter unless you do zfs set sync=always poolname/zvol - without this, your SLOG device, partition or full drive, won't be used at all. They will work better if the drive isn't shared between cache and log workloads, but the drives aren't good candidates for SLOG use either; you can see the throughput of an 850 Pro at various recordsizes here:
https://forums.freenas.org/index.ph...inding-the-best-slog.63521/page-5#post-486067

They'd be fine for L2ARC; if you've had your system running for a while and given it a chance to warm the cache, you could run arc_summary.py from a command line which should give you some insight into how much is actually being used.
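
Something along these lines will dump the stats (the script ships with FreeNAS 11.x and is in the PATH):
Code:
arc_summary.py | less                    # check the ARC hit ratio and the L2 ARC section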

(And remember, "SLOG" not "ZLOG" - before someone more cranky comes in to correct you. ;) )

Block sizes: Both do the same thing, just volblocksize is for ZVOLs and recordsize is for datasets - they limit the maximum size of a record that ZFS will write (and you can't change volblocksize on the fly!). The LogicalBlockSize is what will be seen by the iSCSI initiator (related to the "physical block size supported by the device" link - I believe you need VMFS6 support to handle 4K.)

The decision on what volblocksize/recordsize to use is a balancing act - a larger block size reduces overhead for large file transfers and sequential transfers, but penalizes smaller, random I/O - the issue is made even worse if a disk image is created as a "thick" or fully allocated disk, which populates with the maximum recordsize (eg: 64K) but then the VM does a smaller write (NTFS defaults to 4K clusters, for example) - which results in every 4K write needing to read the entire 64K record, modify part of it, and then write it. Your second source on the 64K write actually calls out that 32K records worked a little better - but bandwidth is only part of the calculation, latency is also an important metric for VMs.

For database workloads, you actually want to align your ZFS recordsize against the database internal recordsize - eg: 8K or 16K - to avoid the partial record modification mentioned above.
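
For example (the dataset name here is hypothetical), a dataset meant to hold InnoDB data files, which use 16K pages by default, could be created as:
Code:
zfs create -o recordsize=16K data-gold/db-mysql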

Generally speaking, lower recordsize gives better latency - higher recordsize gives better bandwidth; and since running-VM performance tends to be latency-dependent, the trend is towards lower sizes; but the smaller the recordsize, the more metadata is generated and the more sequential performance (eg: large copies, storage migrations) suffers.

Summary: Your current SSDs aren't fast enough to be good SLOG devices, so don't turn on sync=always until you get better ones. Smaller records give lower latency, larger records give better bandwidth; the balance is yours to find, I personally like 16K.

Question: Is there a particular need to use iSCSI for the media storage (data-bronze) or can you switch to a file-based protocol like SMB/NFS?
 