After assistance diagnosing VM real world performance on AIO box


Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Hi all,

I have built an ESXi/FreeNAS AIO box for home use, and while synthetic tests within my VMs using tools such as CrystalDiskMark give very reasonable read and write speeds, when I start copying actual files around performance tanks and drops to sub-1Gbps speeds.

For info, the machine is:
  • Supermicro X9SRL-F motherboard with 6 core Xeon E5-2618L v2
  • 64GB DDR3 ECC RAM
  • 24 bay chassis with 9207-8i controller and Intel 24 port SAS expander
  • ESXi booting from USB drive, and Samsung 960 Evo 250GB NVMe SSD as a datastore
  • Samsung SM953 NVMe PCIe SSD as SLOG
  • Intel quad port gigabit ethernet adapter
In terms of configuration:
  • ESXi 6.5 free latest build installed and booting from USB drive
  • 960 Evo SSD set as a datastore for hosting FreeNAS VM (and test VMs)
  • FreeNAS provided 4 vCPU and 32GB RAM, with one LAN vnic for SMB clients and one for ESXi loopback iSCSI storage (both vmxnet3 so 10Gbps capable)
  • FreeNAS and ESXi connected to same virtual switch with no physical adapters connected for storage network
  • FreeNAS has LSI controller passed through as a PCIe device, as well as the entire SM953 NVMe SSD as SLOG
  • FreeNAS has a pool of 2 mirrored vdevs consisting of 4x 7200rpm 2TB Hitachi drives, the SM953 SSD as SLOG, and a 64GB vmdk stored on the 960 Evo as L2ARC
  • One 1TB zvol created on this pool with sync=always, and iSCSI configured to allow ESXi to connect to this VM/zvol and create a datastore
  • Other VMs then stored on this iSCSI datastore
If I connect to a VM which is stored on this iSCSI datastore and run a 1GB sequential test in CrystalDiskMark, I see results close to 300MB/s reads and 720MB/s writes.
If I copy an approx 1GB file over the virtual switch from a test VM stored on the 960 Evo SSD, the transfer completes within a couple of seconds, and I see a spike up to about 5Gbps receive on the virtual NIC of the destination VM in Task Manager.

If I run the same test with an 8GB size, reads come in at around 710MB/s and writes at about 350MB/s.
If I do the same network copy with an approx 9GB file for comparison, it spikes at the beginning, then sits at around 60MB/s, then spikes again towards the end of the copy (see attachment).

I cannot figure out this behaviour. Even if the underlying pool of disks becomes the limit, it is capable of much more than 60MB/s!
When doing the 1GB file copy, I can see my SLOG hitting about 600MB/s writes, and if I have 32GB RAM, surely it can cope with a 9GB file much better than this? It fits in RAM, it fits on the SLOG, so why does it drop to such poor speeds with larger files like this?
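For reference, I'm watching the SLOG and pool activity from the FreeNAS shell with something along these lines ("tank" is just a stand-in for my actual pool name):

Code:
# per-vdev throughput for the pool, refreshed every second
zpool iostat -v tank 1

# GEOM-level throughput and latency for the physical disks and the NVMe SLOG
gstat -p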

Cheers
Eds
 

Attachments

  • File copy over network.PNG

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I can see my SLOG hitting about 600MB/s writes, and if I have 32GB RAM, surely it can cope with a 9GB file much better than this? It fits in RAM, it fits on the SLOG
Don't think of the SLOG as a cache. It's not. ZFS will throttle writes because you don't ever want to fill your SLOG/TXGs and be left waiting for your pool to catch up, as this will halt ALL writes and cause major latency spikes. If that goes on long enough, you can start dropping connections and subsequently disks. This is bad. Just imagine going into a running server and pulling the hard drive... bad things happen. The only point of the SLOG device storing the ZIL is that it's not sharing IOPS with your pool and has lower latency. This allows much better small random IO performance, but not better maximum throughput.

As for the sustained 60MB/s, I'll just blame CIFS. Try FTP and see if it's faster.
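If you want to see how much of the copy is actually going through the ZIL on your SLOG versus straight to the pool, zilstat (which I believe ships with FreeNAS) run during the transfer should show it:

Code:
# per-second ZIL statistics: bytes written to the ZIL and op counts
zilstat 1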
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Perhaps if you could correlate the FreeNAS network transfer speeds with ARC stats, that might provide some clue as to your spikes. Also don't forget Windows does caching on "local" disks.
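Off the top of my head, the raw ARC counters are exposed as sysctls, something like:

Code:
# current ARC size plus hit/miss counters (FreeBSD kstat sysctls)
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses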

It's also a bit early for me and I'm unsure of where you are copying from/to.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Yes, that has been explained to me before. My understanding was that FreeNAS's system RAM effectively acts as the write cache(?), which is why I have assigned 32GB for such a small deployment, to hopefully accommodate the occasional large file transfer like this.

My source is a Windows VM on this same ESXi host, stored on the 960 Evo (to eliminate source read bottlenecks for my tests).
I am copying to another Windows VM on the same host, whose disk is stored on the iSCSI datastore that FreeNAS is hosting (FreeNAS itself being a VM running on this same host).

So when copying the larger 9GB file to the destination VM on the iSCSI zvol on the pool with the SLOG, I get this behaviour. This is true even if I remove the cache and SLOG, so I doubt either of those is the problem.
If I instead copy from the same source to a similarly configured 4-disk mirrored-vdev pool over SMB on the same FreeNAS box, but with slower spinning disks, the write speed is lower but 100% consistent throughout the copy. The drop to about 60MB/s only happens when copying to the iSCSI datastore.

It seems specifically a problem for the iSCSI datastore access.

EDIT: I tried FTP instead of SMB and it behaves in the same fashion: the transfer starts high, then gets slower, settling around 70MB/s.
It must be an issue with the underlying storage/FreeNAS setup.
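For completeness, this is roughly how I'm checking what the iSCSI zvol is set to from the shell (pool/zvol names below are just placeholders for mine):

Code:
# confirm the sync, compression and block size settings on the iSCSI zvol
zfs get sync,compression,volblocksize tank/iscsi-zvol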
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
My understanding was that FreeNAS's system RAM effectively acts as the write cache(?)
This is incorrect. RAM is only used as read cache, in the form of the ARC (part of which gets used to map blocks in the L2ARC, if used). There is no write cache with sync writes; for instance, when a pool is set to sync=always there is no write cache. Again, the ZIL, when stored on a SLOG device, is only there for reduced latency. This will provide better apparent throughput for small random writes but not for large sequential writes like you are testing with, so it would make sense that removing the SLOG makes little difference for your type of test. With that said, you will still want it there for general VM performance, but it makes sense to remove it for testing.
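If you want to take sync handling out of the equation for a test, you can flip it on the zvol itself and put it back afterwards (dataset name is a placeholder):

Code:
# temporarily disable sync on the iSCSI zvol for testing, then restore it
zfs set sync=disabled tank/iscsi-zvol
# ...run the copy test...
zfs set sync=always tank/iscsi-zvol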

To be clear: your copy is a file going from a VMDK on the NVMe to a VMDK on the iSCSI datastore (reached via the vmk connected to the FreeNAS VM, which uses the disks passed through on your LSI controller), and both VMDKs are attached to the same VM?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Also, look into moving your non-OS VMDKs to a paravirtual SCSI (PVSCSI) controller if you have not already.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
This is incorrect. RAM is only used as read cache, in the form of the ARC (part of which gets used to map blocks in the L2ARC, if used). There is no write cache with sync writes; for instance, when a pool is set to sync=always there is no write cache. Again, the ZIL, when stored on a SLOG device, is only there for reduced latency. This will provide better apparent throughput for small random writes but not for large sequential writes like you are testing with, so it would make sense that removing the SLOG makes little difference for your type of test. With that said, you will still want it there for general VM performance, but it makes sense to remove it for testing.

To be clear: your copy is a file going from a VMDK on the NVMe to a VMDK on the iSCSI datastore (reached via the vmk connected to the FreeNAS VM, which uses the disks passed through on your LSI controller), and both VMDKs are attached to the same VM?

But RAM is used as a write cache when sync writes are disabled?
I also tried with sync disabled and got the same behaviour.

You say that a SLOG would not have any improvement for sequential writes, but as Stux has found (whose build I based mine on) when using sync writes via iSCSI or NFS, he was only getting 5MB/s within a VM. Once he added a SLOG, he was getting a far improved 700MB/s:
https://forums.freenas.org/index.ph...n4f-esxi-freenas-aio.57116/page-4#post-403374

I do indeed want to keep the SLOG, and it doesn't seem that it is what is causing this behaviour, so that's fine.

That is pretty much what I am doing, other than one VMDK is on one VM, and I am then doing an SMB network copy to the other VM, whose VMDK is on the iSCSI datastore. The connectivity between the two will be 10Gbps over the virtual switch.

Cheers
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You say that a SLOG would not have any improvement for sequential writes, but as Stux has found (whose build I based mine on) when using sync writes via iSCSI or NFS, he was only getting 5MB/s within a VM. Once he added a SLOG, he was getting a far improved 700MB/s:
I would guess that 700MB/s is the max his pool can write in a semi-sequential pattern. It is possible to effectively use your SLOG as a write cache, but that is not its intent. Sync writes are not considered done until they hit "permanent" storage (always use PLP SLOGs if your data is mission critical). Because spinning drives are incredibly slow for random IO, especially when doing double the work (when the ZIL is in the pool), a separate SLOG, preferably an SSD, is able to take the random IO, tell the system the data is persisted on media, and move on to the next IO. Meanwhile, the IO that just finished is still in RAM in the TXG, being reordered/coalesced with other writes and then written to the pool in the most sequential (fastest) way possible.

The SLOG is never read unless the system has an unclean shutdown and the last few TXGs could not be written from RAM to disk, and therefore it cannot be considered a cache. A cache is IN the data path; the SLOG sits next to it and only serves to safely acknowledge a write before it is actually written to the pool, without lying.

So yes it speeds up random writes but does little to nothing for large sequential IO. You will never get more throughput on average than your pool is capable of under ideal conditions.
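An easy way to find that ceiling is to write a big file straight to a dataset on the pool from the FreeNAS shell, taking iSCSI and the network out of the picture (the path below is a placeholder; make sure compression is off on the test dataset, otherwise the zeroes compress away and the number is meaningless):

Code:
# raw sequential write to the pool itself, roughly 8GB
dd if=/dev/zero of=/mnt/tank/ddtest bs=1m count=8192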
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
While that does clear up a couple of minor points I was unclear on, I am still unclear whether the FreeNAS RAM is considered a cache for async writes. If so, how come I still get piss-poor sequential writes with sync disabled?

I am still unclear why I see this drop to around 60MB/s after the initial spike on larger transfers.

There is no way that my physical pool is only capable of that, evidenced by the fact that smaller transfers finish far, far quicker, and a single disk alone should be capable of 100MB/s writes.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
In the case of async writes, the write does not need to be confirmed, but gets stuffed into the open TXG until it is full or a set time passes (5 seconds in FreeNAS, I think), at which point the writes are optimised and flushed to disk. I would have to do a bit more reading on this to be sure.

EDIT: I meant to imply that transaction groups are the write cache mechanism in ZFS.
 
Last edited:

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Ok, so the two points I need to be sure on:
  1. If I have 32GB RAM and copy a 9GB file, should it all fit in RAM using async writes and be acknowledged much quicker than I am seeing?
  2. What could be causing my drop to 60MB/s?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
If I have 32GB RAM and copy a 9GB file
Assuming you have that much free RAM, yes.
What could be causing my drop to 60MB/s
Can you list the full virtual specs of the Windows VM, including what type of disk controller and NIC are used?
Also please detail the file source and destination locations.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
So, a dual disk mirror actually only writes at the speed of a single disk.

My 6-way RaidZ2 does sequential writes at 4x the speed of a single disk. In theory.

For long sequential writes your transfer should be limited by your pool's sequential performance. Your pool needs to be able to store 5 seconds of data in 5 seconds, or the transfer will throttle waiting on the first txg to flush while the second is ready to flush. There are only ever two transaction groups, the open one and the flushing one, and if the open one closes but can't flush, then your pool becomes blocked for writing.

You can increase the TXG flush time, which essentially increases how much write cache you have, but eventually it has to flush, and then your transfer will block.
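On FreeNAS 11 the relevant knobs should show up roughly like this (names from memory, check on your version):

Code:
# seconds before an open txg is forced to close, and how much dirty data
# ZFS will buffer before it starts throttling incoming writes
sysctl vfs.zfs.txg.timeout
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.dirty_data_max_percent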

So, you can transfer 1GB/s via 10GbE, but your pool can't write at 1GB/s, so the pool will end up blocking on the txg flushes. This can be bad for some algorithms. I suspect the 60MB/s is occurring while it flushes all the data which was initially pushed; then, once that happens, the data continues writing at 170MB/s, which is about right for a single disk.

My 700MB/s /4 = 175MB/s.

Also, I found vmxnet3 could actually do about 20gbps on my system.
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
@Stux Thanks for clarifying.
Also, I found vmxnet3 could actually do about 20gbps on my system.
A good example of why to never underestimate the importance of the virtual hardware configuration.
So, a dual disk mirror actually only writes at the speed of a single disk.
Minor thread jacking... But you still have the cumulative write performance of all vdevs correct? So in your case two 6-way RaidZ2 vdevs should be able to write at 1400MB/s. In theory.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
@Stux Thanks for clarifying.

A good example of why to never underestimate the importance of the virtual hardware configuration.

Minor thread jacking... But you still have the cumulative write performance of all vdevs correct? So in your case two 6-way RaidZ2 vdevs should be able to write at 1400MB/s. In theory.

Yes. And the IOPS would double too.

My ESXi AIO has a single vdev though ;)

My 24 bay system has more vdevs ;)
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
When I look at memory usage, the entire 32GB is shown as wired. This is true during both the 1GB copy and the 9GB copy. As the 1GB copy finishes so quickly, it suggests to me that it is happening purely in RAM, so I would expect the 9GB file to do the same with sync disabled.
How can I tell whether that is what is happening if none of the memory is marked as free?

The two VMs I am using as source and destination are Windows 10 Pro 64-bit, VM hardware version 13. One has 4 vCPUs and 4GB RAM with a vmxnet3 vNIC, while the other has 4 vCPUs, 16GB RAM and a vmxnet3 vNIC.
The SCSI controllers on both are the standard LSI Logic SAS, as I didn't think the PVSCSI controller would benefit me in this setup.

The file source is on C:\ of the source VM, which is stored on the 960 Evo SSD; this is the smaller 4GB-RAM VM.
The destination is a share on C:\ of the other VM, which is stored on the iSCSI datastore on FreeNAS.
These two VMs are connected to the same virtual switch.

It's odd that my other mirrored-vdev pool of 4 slower disks can maintain about 300MB/s, but the same layout with faster disks can't, even if writes are limited to the speed of a single disk.
On some of my test transfers, after the drop to 60MB/s it never recovers and stays like that until the transfer completes.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Your pool needs to be able to store 5 seconds of data in 5 seconds, or the transfer will throttle waiting on the first txg to flush
Based on this you would need to expand your transaction group time/size. Assuming 20Gb/s (using vmxnet3 network cards in the VM), 9GB*8/20Gb ≈ 3.6 seconds. That's pretending you can read at 20Gb/s, i.e. 2.5GB/s.
userbenchmark.com provides some real numbers to work with. Average sequential read for the 960 Evo is about 1,496MB/s, so 9GB/1,496MB/s ≈ 6 seconds. That's more than one 5-second TXG, so we throttle down to disk speed until the pool catches up. This leaves roughly 1.5GB to write at approximately disk speed. Factor in overheads and real-world nonsense like CIFS, and suddenly this doesn't sound too far off.

This is all off the cuff and I'm sure Stux will have more insight than I do.

Just for a test, you could set vfs.zfs.txg.timeout="10" as a loader tunable, reboot, and test. Just know that if you fill your TXGs faster than you can write to the pool, further writes will pause until the pool can catch up.
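If you'd rather try it without a reboot first, it should also be settable live from the shell (the change is lost on reboot unless you add the tunable):

Code:
# runtime change; reverts to the default after a reboot
sysctl vfs.zfs.txg.timeout=10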
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
I am indeed using a separate virtual switch for the storage network, with the MTU set to 9000 on both the ESXi side and the FreeNAS side.

I have completed another test which confuses me even more:
I have taken the same source VM stored on the 960 Evo, added another vmxnet3 adapter and attached it to the storage switch (so the VM can directly address FreeNAS on the storage network).
I have taken 4x 7200rpm 500GB Barracuda ES drives in a single striped vdev and created an iSCSI target on a zvol with sync disabled.
After attaching this to the VM through Windows (Microsoft iSCSI initiator), I copied a file from the VMDK stored on the 960 Evo to this volume, and it shows similar behaviour:
There is a tiny spike of about 600MB/s for half a second, and then it drops down to between 170MB/s and 200MB/s. Isn't that still slow for a 4-disk stripe?!

If I copy to the same pool, but over SMB rather than iSCSI, the transfer speed is about 300MB/s constant.

Is this just an iSCSI problem?
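One more thing I can try is ruling out the virtual network itself with iperf between the Windows VM and FreeNAS over the storage vSwitch, something like the below (the address is a placeholder for FreeNAS's storage-network IP, and this assumes iperf3 is available on both ends):

Code:
# on FreeNAS (server side)
iperf3 -s

# on the Windows VM (client side), 30 second test towards FreeNAS's storage IP
iperf3.exe -c 192.168.100.10 -t 30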
 