After assistance diagnosing VM real world performance on AIO box


Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Hi all,

I have built an ESXi/FreeNAS AIO box for home use, and while synthetic tests within my VMs using tools such as CrystalDiskMark give very reasonable read and write speeds, when I start copying actual files around performance tanks and drops to sub-1Gbps speeds.

For info, the machine is:
  • Supermicro X9SRL-F motherboard with 6 core Xeon E5-2618L v2
  • 64GB DDR3 ECC RAM
  • 24 bay chassis with 9207-8i controller and Intel 24 port SAS expander
  • ESXi booting from USB drive, and Samsung 960 Evo 250GB NVMe SSD as a datastore
  • Samsung SM953 NVMe PCIe SSD as SLOG
  • Intel quad port gigabit ethernet adapter
In terms of configuration:
  • ESXi 6.5 free latest build installed and booting from USB drive
  • 960 Evo SSD set as a datastore for hosting FreeNAS VM (and test VMs)
  • FreeNAS provided 4 vCPU and 32GB RAM, with one LAN vnic for SMB clients and one for ESXi loopback iSCSI storage (both vmxnet3 so 10Gbps capable)
  • FreeNAS and ESXi connected to same virtual switch with no physical adapters connected for storage network
  • FreeNAS has LSI controller passed through as a PCIe device, as well as the entire SM953 NVMe SSD as SLOG
  • FreeNAS has a pool of 2 mirrored vdevs consisting of 4x 7200rpm 2TB Hitachi drives, the SM953 SSD as SLOG, and a 64GB vmdk stored on the 960 Evo as L2ARC
  • One 1TB zvol created on this pool with sync=always, and iSCSI configured to allow ESXi to connect to this VM/zvol and create a datastore
  • Other VMs then stored on this iSCSI datastore
If I connect to a VM which is stored on this iSCSI datastore and run a 1GB sequential test in CrystalDiskMark, I see results close to 300MB/s reads and 720MB/s writes.
If I copy an approx 1GB file over the virtual switch from a test VM stored on the 960 Evo SSD, the transfer completes within a couple of seconds, and I see a spike up to about 5Gbps receive on the virtual NIC of the destination VM in Task Manager.

If I run the same test with an 8GB size, reads come in at around 710MB/s and writes at about 350MB/s.
If I do the same network copy with an approx 9GB file for comparison, it spikes at the beginning, then sits at around 60MB/s, then spikes again towards the end of the copy (see attachment).

I cannot figure out this behaviour. Even if the underlying pool of disks becomes the limit, it is capable of much more than 60MB/s!
When doing the 1GB file copy, I can see my SLOG hitting about 600MB/s writes, and if I have 32GB RAM, surely it can cope with a 9GB file much better than this? It fits in RAM, it fits on the SLOG, so why does it drop to such poor speeds with larger files like this?
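For reference, I'm watching the SLOG and pool activity from the FreeNAS shell with something along these lines ("tank" is just a stand-in for my actual pool name):

Code:
# per-vdev throughput for the pool, refreshed every second
zpool iostat -v tank 1

# GEOM-level throughput and latency for the physical disks and the NVMe SLOG
gstat -p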

Cheers
Eds
 

Attachments

  • File copy over network.PNG

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I can see my SLOG hitting about 600MB/s writes, and if I have 32GB RAM, surely it can cope with a 9GB file much better than this? It fits in RAM, it fits on the SLOG
Don't think of the SLOG as a cache. It's not. ZFS will throttle writes because you don't ever want to fill your SLOG/TXGs and be left waiting for your pool to catch up, as this will halt ALL writes and cause major latency spikes. If that goes on long enough, you can start dropping connections and subsequently disks. This is bad. Just imagine going into a running server and pulling the hard drive... bad things happen. The only point of the SLOG device storing the ZIL is that it's not sharing IOPS with your pool and has lower latency. This allows much better small random IO performance, but not better maximum throughput.

As for the sustained 60MB/s, I'll just blame CIFS. Try FTP and see if it's faster.
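If you want to see how much of the copy is actually going through the ZIL on your SLOG versus straight to the pool, zilstat (which I believe ships with FreeNAS) run during the transfer should show it:

Code:
# per-second ZIL statistics: bytes written to the ZIL and op counts
zilstat 1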
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Perhaps if you could correlate the FreeNAS network transfer speeds with ARC stats, that might provide some clue as to your spikes. Also don't forget Windows does caching on "local" disks.
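Off the top of my head, the raw ARC counters are exposed as sysctls, something like:

Code:
# current ARC size plus hit/miss counters (FreeBSD kstat sysctls)
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses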

It's also a bit early for me and I'm unsure of where you are copying from/to.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Yes, that has been explained to me before. My understanding was that FreeNAS's system RAM effectively acts as the write cache(?), which is why I have assigned 32GB for such a small deployment, to hopefully accommodate the occasional large file transfer like this.

My source is a Windows VM on this same ESXi host, stored on the 960 Evo (to eliminate source read bottlenecks for my tests).
I am copying to another Windows VM on the same host, whose disk is stored on the iSCSI datastore that FreeNAS is hosting (FreeNAS itself being a VM running on this same host).

So when copying the larger 9GB file to the destination VM on the iSCSI zvol on the pool with the SLOG, I get this behaviour. This is true even if I remove the cache and SLOG, so I doubt either of those is the problem.
If I instead copy from the same source to a similarly configured 4-disk mirrored-vdev pool over SMB on the same FreeNAS box, but with slower spinning disks, the write speed is lower but 100% consistent throughout the copy. The drop to about 60MB/s only happens when copying to the iSCSI datastore.

It seems specifically a problem for the iSCSI datastore access.

EDIT: I tried FTP instead of SMB and it behaves in the same fashion: the transfer starts high, then gets slower, settling around 70MB/s.
It must be an issue with the underlying storage/FreeNAS setup.
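For completeness, this is roughly how I'm checking what the iSCSI zvol is set to from the shell (pool/zvol names below are just placeholders for mine):

Code:
# confirm the sync, compression and block size settings on the iSCSI zvol
zfs get sync,compression,volblocksize tank/iscsi-zvol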
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
My understanding was that FreeNAS's system RAM effectively acts as the write cache(?)
This is incorrect. RAM is only used as read cache, in the form of the ARC (part of which gets used to map blocks in the L2ARC, if used). There is no write cache with sync writes; for instance, when a pool is set to sync=always there is no write cache. Again, the ZIL, when stored on a SLOG device, is only there for reduced latency. This will provide better apparent throughput for small random writes but not for large sequential writes like you are testing with, so it would make sense that removing the SLOG makes little difference for your type of test. With that said, you will still want it there for general VM performance, but it makes sense to remove it for testing.
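If you want to take sync handling out of the equation for a test, you can flip it on the zvol itself and put it back afterwards (dataset name is a placeholder):

Code:
# temporarily disable sync on the iSCSI zvol for testing, then restore it
zfs set sync=disabled tank/iscsi-zvol
# ...run the copy test...
zfs set sync=always tank/iscsi-zvol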

To be clear: your copy is a file going from a VMDK on the NVMe to a VMDK on the iSCSI datastore (reached via the vmk connected to the FreeNAS VM, which uses the disks passed through on your LSI controller), and both VMDKs are attached to the same VM?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Also, look into moving your non-OS VMDKs to a paravirtual SCSI (PVSCSI) controller if you have not already.
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
This is incorrect. RAM is only used as read cache, in the form of the ARC (part of which gets used to map blocks in the L2ARC, if used). There is no write cache with sync writes; for instance, when a pool is set to sync=always there is no write cache. Again, the ZIL, when stored on a SLOG device, is only there for reduced latency. This will provide better apparent throughput for small random writes but not for large sequential writes like you are testing with, so it would make sense that removing the SLOG makes little difference for your type of test. With that said, you will still want it there for general VM performance, but it makes sense to remove it for testing.

To be clear: your copy is a file going from a VMDK on the NVMe to a VMDK on the iSCSI datastore (reached via the vmk connected to the FreeNAS VM, which uses the disks passed through on your LSI controller), and both VMDKs are attached to the same VM?

But RAM is used as a write cache when sync writes are disabled?
I also tried with sync disabled and got the same behaviour.

You say that a SLOG would not have any improvement for sequential writes, but as Stux has found (whose build I based mine on) when using sync writes via iSCSI or NFS, he was only getting 5MB/s within a VM. Once he added a SLOG, he was getting a far improved 700MB/s:
https://forums.freenas.org/index.ph...n4f-esxi-freenas-aio.57116/page-4#post-403374

I do indeed want to keep the SLOG, and it doesn't seem that it is what is causing this behaviour, so that's fine.

That is pretty much what I am doing, other than one VMDK is on one VM, and I am then doing an SMB network copy to the other VM, whose VMDK is on the iSCSI datastore. The connectivity between the two will be 10Gbps over the virtual switch.

Cheers
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
You say that a SLOG would not have any improvement for sequential writes, but as Stux has found (whose build I based mine on) when using sync writes via iSCSI or NFS, he was only getting 5MB/s within a VM. Once he added a SLOG, he was getting a far improved 700MB/s:
I would guess that 700MB/s is the max his pool can write in a semi-sequential pattern. It is possible to effectively use your SLOG as a write cache, but that is not its intent. Sync writes are not considered done until they hit "permanent" storage (always use PLP SLOGs if your data is mission critical). Because spinning drives are incredibly slow for random IO, especially when doing double the work (when the ZIL is in the pool), a separate SLOG, preferably an SSD, is able to take the random IO, tell the system the data is persisted on media, and move on to the next IO. Meanwhile, the IO that just finished is still in RAM in the TXG, being reordered/coalesced with other writes and then written to the pool in the most sequential (fastest) way possible.

The SLOG is never read unless the system has an unclean shutdown and the last few TXGs could not be written from RAM to disk, and therefore it cannot be considered a cache. A cache is IN the data path; the SLOG sits next to it and only serves to safely acknowledge a write before it is actually written to the pool, without lying.

So yes it speeds up random writes but does little to nothing for large sequential IO. You will never get more throughput on average than your pool is capable of under ideal conditions.
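An easy way to find that ceiling is to write a big file straight to a dataset on the pool from the FreeNAS shell, taking iSCSI and the network out of the picture (the path below is a placeholder; make sure compression is off on the test dataset, otherwise the zeroes compress away and the number is meaningless):

Code:
# raw sequential write to the pool itself, roughly 8GB
dd if=/dev/zero of=/mnt/tank/ddtest bs=1m count=8192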
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
While that does clear up a couple of minor points I was unclear on, I am still unclear whether the FreeNAS RAM is considered a cache for async writes. If so, how come I still get piss-poor sequential writes with sync disabled?

I am still unclear why I see this drop to around 60MB/s after the initial spike on larger transfers.

There is no way that my physical pool is only capable of that, evidenced by the fact that smaller transfers finish far, far quicker, and a single disk alone should be capable of 100MB/s writes.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
In the case of async writes, the write does not need to be confirmed, but gets stuffed into the open TXG until it is full or a set time passes (5 seconds in FreeNAS, I think), at which point the writes are optimised and flushed to disk. I would have to do a bit more reading on this to be sure.

EDIT: I meant to imply that transaction groups are the write cache mechanism in ZFS.
 
Last edited:

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
Ok, so the two points I need to be sure on:
  1. If I have 32GB RAM and copy a 9GB file, should it all fit in RAM using async writes and be acknowledged much quicker than I am seeing?
  2. What could be causing my drop to 60MB/s?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
If I have 32GB RAM and copy a 9GB file
Assuming you have that much free RAM, yes.
What could be causing my drop to 60MB/s
Can you list the full virtual specs of the Windows VM, including what type of disk controller and NIC are used?
Also please detail the file source and destination locations.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
So, a dual disk mirror actually only writes at the speed of a single disk.

My 6-way RaidZ2 does sequential writes at 4x the speed of a single disk. In theory.

For long sequential writes your transfer should be limited by your pool's sequential performance. Your pool needs to be able to store 5 seconds of data in 5 seconds, or the transfer will throttle waiting on the first txg to flush while the second is ready to flush. There are only ever two transaction groups, the open one and the flushing one, and if the open one closes but can't flush, then your pool becomes blocked for writing.

You can increase the TXG flush time, which essentially increases how much write cache you have, but eventually it has to flush, and then your transfer will block.
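On FreeNAS 11 the relevant knobs should show up roughly like this (names from memory, check on your version):

Code:
# seconds before an open txg is forced to close, and how much dirty data
# ZFS will buffer before it starts throttling incoming writes
sysctl vfs.zfs.txg.timeout
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.dirty_data_max_percent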

So, you can transfer 1GB/s via 10GbE, but your pool can't write at 1GB/s, so the pool will end up blocking on the txg flushes. This can be bad for some algorithms. I suspect the 60MB/s is occurring while it flushes all the data which was initially pushed; then, once that happens, the data continues writing at 170MB/s, which is about right for a single disk.

My 700MB/s /4 = 175MB/s.

Also, I found vmxnet3 could actually do about 20gbps on my system.
 
Last edited:

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
@Stux Thanks for clarifying.
Also, I found vmxnet3 could actually do about 20gbps on my system.
A good example of why to never underestimate the importance of the virtual hardware configuration.
So, a dual disk mirror actually only writes at the speed of a single disk.
Minor thread jacking... But you still have the cumulative write performance of all vdevs correct? So in your case two 6-way RaidZ2 vdevs should be able to write at 1400MB/s. In theory.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
@Stux Thanks for clarifying.

A good example of why to never underestimate the importance of the virtual hardware configuration.

Minor thread jacking... But you still have the cumulative write performance of all vdevs correct? So in your case two 6-way RaidZ2 vdevs should be able to write at 1400MB/s. In theory.

Yes. And the IOPS would double too.

My ESXi AIO has a single vdev though ;)

My 24 bay system has more vdevs ;)
 

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
When I look at memory usage, the entire 32GB is shown as wired. This is true during both the 1GB copy and the 9GB copy. As the 1GB copy finishes so quickly, it suggests to me that it is happening purely in RAM, so I would expect the 9GB file to do the same with sync disabled.
How can I tell whether that is what is happening if none of the memory is marked as free?

The two VMs I am using as source and destination are Windows 10 Pro 64-bit, VM hardware version 13. One has 4 vCPUs and 4GB RAM with a vmxnet3 vNIC, while the other has 4 vCPUs, 16GB RAM and a vmxnet3 vNIC.
The SCSI controllers on both are the standard LSI Logic SAS, as I didn't think the PVSCSI controller would benefit me in this setup.

The file source is on C:\ of the source VM, which is stored on the 960 Evo SSD; this is the smaller 4GB-RAM VM.
The destination is a share on C:\ of the other VM, which is stored on the iSCSI datastore on FreeNAS.
These two VMs are connected to the same virtual switch.

It's odd that my other mirrored-vdev pool of 4 slower disks can maintain about 300MB/s, but the same layout with faster disks can't, even if writes are limited to the speed of a single disk.
On some of my test transfers, after the drop to 60MB/s it never recovers and stays like that until the transfer completes.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Your pool needs to be able to store 5 seconds of data in 5 seconds, or the transfer will throttle waiting on the first txg to flush
Based on this you would need to expand your transaction group time/size. Assuming 20Gb/s (using vmxnet3 network cards in the VM), 9GB*8/20Gb ≈ 3.6 seconds. That's pretending you can read at 20Gb/s, i.e. 2.5GB/s.
userbenchmark.com provides some real numbers to work with. Average sequential read for the 960 Evo is about 1,496MB/s, so 9GB/1,496MB/s ≈ 6 seconds. That's more than one 5-second TXG, so we throttle down to disk speed until the pool catches up. This leaves roughly 1.5GB to write at approximately disk speed. Factor in overheads and real-world nonsense like CIFS, and suddenly this doesn't sound too far off.

This is all off the cuff and I'm sure Stux will have more insight than I do.

Just for a test, you could set vfs.zfs.txg.timeout="10" as a loader tunable, reboot, and test. Just know that if you fill your TXGs faster than you can write to the pool, further writes will pause until the pool can catch up.
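If you'd rather try it without a reboot first, it should also be settable live from the shell (the change is lost on reboot unless you add the tunable):

Code:
# runtime change; reverts to the default after a reboot
sysctl vfs.zfs.txg.timeout=10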
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419

Eds89

Contributor
Joined
Sep 16, 2017
Messages
122
I am indeed using a separate virtual switch for the storage network, with the MTU set to 9000 on both the ESXi side and the FreeNAS side.

I have completed another test which confuses me even more:
I have taken the same source VM stored on the 960 Evo, added another vmxnet3 adapter and attached it to the storage switch (so the VM can directly address FreeNAS on the storage network).
I have taken 4x 7200rpm 500GB Barracuda ES drives in a single striped vdev and created an iSCSI target on a zvol with sync disabled.
After attaching this to the VM through Windows (Microsoft iSCSI initiator), I copied a file from the VMDK stored on the 960 Evo to this volume, and it shows similar behaviour:
There is a tiny spike of about 600MB/s for half a second, and then it drops down to between 170MB/s and 200MB/s. Isn't that still slow for a 4-disk stripe?!

If I copy to the same pool, but over SMB rather than iSCSI, the transfer speed is about 300MB/s constant.

Is this just an iSCSI problem?
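One more thing I can try is ruling out the virtual network itself with iperf between the Windows VM and FreeNAS over the storage vSwitch, something like the below (the address is a placeholder for FreeNAS's storage-network IP, and this assumes iperf3 is available on both ends):

Code:
# on FreeNAS (server side)
iperf3 -s

# on the Windows VM (client side), 30 second test towards FreeNAS's storage IP
iperf3.exe -c 192.168.100.10 -t 30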
 