Lab VM NFS Storage - Sanity check

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Hi all,
I am looking to redo / rebuild our lab VM Storage. I would like to get your feedback on this build.

Option A:
Intel 2U server with 24x 2.5" slots
2x LSI 9400-16i HBAs
Drives are split across 3 bays of 8 disks each; each bay has 2 connectors, one wired to each HBA
2x Xeon E5-2680 v2
384GB DDR3 ECC reg.
2x RMS-200/8G
6x ZeusIOPS SAS SSD 800GB
1x Intel P4500 1TB NVMe
2x random Intel 80GB SATA SSDs on the internal chipset ports for the OS
30x 1.8TB 10k SAS

Physical disk layout, per 8-drive bay:
- 6x 1.8TB 10k SAS
- 2x ZeusIOPS SAS SSD 800GB

Pool layout:
One big pool consisting of 9 mirrors of the 1.8TB SAS disks
1x Intel P4500 1TB for L2ARC
2x RMS-200/8G for SLOG
3x ZeusIOPS for metadata
3x ZeusIOPS for dedup table

I would run the pool with dedup and lz4, roughly as sketched below.
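
For reference, a rough sketch of what that layout would look like at pool-creation time (pool name and device names are placeholders; the TrueNAS GUI would normally handle partitioning and naming):

Code:
# hypothetical device names: da0-da17 = 1.8TB SAS, da18-da23 = ZeusIOPS,
# nvd0/nvd1 = RMS-200 cards, nvd2 = P4500
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5 \
  mirror da6 da7   mirror da8 da9   mirror da10 da11 \
  mirror da12 da13 mirror da14 da15 mirror da16 da17 \
  special mirror da18 da19 da20 \
  dedup mirror da21 da22 da23 \
  log mirror nvd0 nvd1 \
  cache nvd2
zfs set compression=lz4 tank
zfs set dedup=on tank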

Any ideas or recommendations regarding this setup?
At most it will run up to 120 VMs; usually the load is <20, with a ton of nearly identical VMs, hence dedup.
Storage need would be about 9TB.

Option B:
Upgrade our current TrueNAS storage with external shelves:
Supermicro 2U with 12x 3.5" SAS
2x Xeon E5-2680 v2
256GB DDR3 ECC reg.
1x 12-disk 3.5" expansion shelf
1x 25-disk 2.5" expansion shelf

Pool layout:
One big pool consisting of 10 mirrors of the 1.8TB SAS disks
2x RMS-200/8G for SLOG
2x ZeusIOPS for metadata
2x ZeusIOPS for dedup table
1x 1.8TB hot spare

Most likely using the same HBAs with an internal-to-external bracket.
Due to the limited free PCIe ports, no L2ARC. The rest stays the same.


Optane is not an option, as the PCIe lanes in those systems are far too limited. Everything that can be run off SAS is preferred. Additionally, these old systems offer no U.2 and no hot-swap for NVMe options like Optane, so I'm not going down that road. The RMS-200 cards are already a big step in that direction and the furthest I am willing to go on this old hardware.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I would go with "Option A" here - I assume your meta/dedup vdevs will be mirror3 - but be careful with deduplication overall. It doesn't alleviate the necessary RAM footprint, and often a pure VM workload doesn't deduplicate as well as you expect.

Data block misalignment can be mitigated somewhat by cloning VMs from a template. But divergence over time is generally inevitable unless you plan to re-clone from a template image periodically.

I'd say: build the Option A system, create some ZVOLs with dedup and some without, and migrate or clone a number of varied test machines to it. Make sure you are getting good, solid wins from the deduplication data reduction, and that it's not adversely impacting performance. If not, then disable it.
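
If you go that route, a quick way to sanity-check the actual savings during the trial might be something like this from the shell (pool/dataset names hypothetical):

Code:
# overall dedup ratio for the pool
zpool list -o name,size,alloc,free,dedupratio tank

# compare logical vs. physical space on the deduped dataset
zfs get used,logicalused,compressratio tank/vm-dedup

# dedup table histogram: on-disk/in-core DDT size and how many blocks are shared
zdb -DD tank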
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Past dedup results from those VMs were about 4.35x, so I would assume it will work quite well. The issue in the past was the CPU of the older unit: only 2x 6-core Xeon E5-2400 at 2GHz with 192GB of memory and fewer disks. I usually saw the CPU hammered at 90 to 100% while performing bulk changes on those VMs.
Speed was quite decent as well, to be fair.
Yes, those are 3-way mirrors, just for peace of mind.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Went with Option A.
So far I have only been able to test sequential performance a bit, by moving 9TB of data, mostly offline VMs and some online ones. It never dropped below 500MB/s, but the source system was mostly the limiting factor.
Oh, and yes, dedup is on. CPU load while writing 500MB/s with dedup and metadata vdevs is between 30 and 60%, usually 40-50%.
 
Last edited:

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Some first benchmarks:
1611160711865.png

Looks quite good. At the same time I was hitting the system with about 3Gbit of VM Storage vMotion traffic.
CPU load on the TrueNAS spiked to 90% during the benchmark plus Storage vMotion.
Using the VMs feels super snappy; I can't tell a difference from our main all-flash SAN with FC.
And all that on a deduped dataset.

Anything I could optimize for more performance? atime is off. To be fair, I am happy; just wondering if I could be even happier.

@HoneyBadger
One thing I noticed: I was not able to set up a 3-way mirror for the metadata or dedup vdev.
The system did not allow it; I always got the message that those vdevs must have the same configuration as the data vdevs.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Since you're using VMFS, have you applied the 12.0-U1.1 hotfix yet, just to avoid the potential data integrity issues raised by some others?

One thing I noticed: I was not able to set up a 3-way mirror for the metadata or dedup vdev.
The system did not allow it; I always got the message that those vdevs must have the same configuration as the data vdevs.

You can check the "Force" option to bypass that on creation. ZFS will allow you to add additional mirror copies after a vdev is formed (e.g. stripe into mirror, mirror into mirror3), but I'm not sure if the middleware allows that through the GUI.

For added performance, depending on your usage patterns, you could look at tuning your L2ARC to fill from MFU only (not feeding from MRU or prefetch buffers); that way it has "more value" and hopefully churns less. You could also make it persistent across reboots; not that you're planning to restart the system often, but with an L2ARC of that size you wouldn't want it to have to re-warm itself.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Since you're using VMFS, have you applied the 12.0-U1.1 hotfix yet, just to avoid the potential data integrity issues raised by some others?
Not yet... thanks, I hadn't noticed there is one... damn it. I'll have to wait for more VMs to be moved over.

You can check the "Force" option to bypass that on creation. ZFS will allow you to add additional mirror copies after a vdev is formed (e.g. stripe into mirror, mirror into mirror3), but I'm not sure if the middleware allows that through the GUI.
Not via the GUI. Do you have the command at hand to do it via the shell?

For added performance, depending on your usage patterns, you could look at tuning your L2ARC to fill from MFU only (not feeding from MRU or prefetch buffers); that way it has "more value" and hopefully churns less. You could also make it persistent across reboots; not that you're planning to restart the system often, but with an L2ARC of that size you wouldn't want it to have to re-warm itself.
I would go for MFU. That makes much more sense than filling it up with MRU. Additionally, making it persistent - great idea.
The next reboot will be for the mentioned hotfix... and the next one after that? 2023?

1611162521161.png

I just noticed I benchmarked that VM while it was still in transit.
I benchmarked it again on the target system; one can see that the system is also being hammered by the Storage vMotion.
I went and checked the network speed arriving at the system on the switch: 9.89Gbit... guess there is my limiting factor.
Another question: why are reads so much slower? It looks like the max speed of a single SSD used for metadata... so going NVMe would be an upgrade, but costly.
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Not via the GUI. Do you have the command at hand to do it via the shell?

According to the last comment on the ticket here (https://jira.ixsystems.com/browse/NAS-100033), you can do it via these steps (shown as single disk to mirror, but it should be similar for mirror to mirror3):

1) Go to Storage > Pools

2) Click the gear icon and select “Status”

3) Next to the single disk, click the three vertical dots and select “Extend”

4) Select the disk you would like to add and press “Extend”

Double-check that it's showing the same amount of usable space after the extend; maybe test this with a different pool first to ensure it actually makes a mirror3 out of it.
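
If the GUI route doesn't work for the dedup vdev, the underlying shell command is zpool attach, roughly like this (pool and disk names are placeholders; note that a manually attached disk won't get the partition layout the middleware normally creates):

Code:
# find the current member disks of the special/dedup vdev
zpool status tank

# attach a new disk to an existing member to widen the mirror (2-way -> 3-way)
zpool attach tank gptid/existing-member-uuid gptid/new-disk-uuid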

I would go for MFU. That makes much more sense than filling it up with MRU. Additionally, making it persistent - great idea.
The next reboot will be for the mentioned hotfix... and the next one after that? 2023?

Set the following sysctls through Tunables (explanations follow the block):

Code:
vfs.zfs.l2arc.rebuild_enabled=1
vfs.zfs.l2arc.mfuonly=1
vfs.zfs.l2arc_noprefetch=1


The first enables persistent L2ARC, the second makes L2ARC feed from the MFU tail only, and the last prevents it from trying to fill from prefetches, because generally speaking a sufficiently concurrent VM workload is random.
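
If you want to apply or verify them immediately from the shell as well (the Tunables entries make them persistent), something along these lines should work on 12.0:

Code:
sysctl vfs.zfs.l2arc.rebuild_enabled=1
sysctl vfs.zfs.l2arc.mfuonly=1
sysctl vfs.zfs.l2arc_noprefetch=1

# confirm the current values
sysctl vfs.zfs.l2arc.rebuild_enabled vfs.zfs.l2arc.mfuonly vfs.zfs.l2arc_noprefetch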

Another question: why are reads so much slower? It looks like the max speed of a single SSD used for metadata... so going NVMe would be an upgrade, but costly.

Reads will be impacted by having to go to disk for any cache misses, while writes always hit your RAM/NVRAM combination and will be fast regardless. You could consider something like HCIbench (https://flings.vmware.com/hcibench) as a measure of concurrent VM performance, since there's only so fast you can go from a single VM.
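
If standing up HCIbench is too much effort, a rough stand-in is a handful of parallel random-I/O jobs with fio inside a Linux test VM; the parameters below are illustrative, not a tuned profile:

Code:
fio --name=vm-randread --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --numjobs=8 --iodepth=16 \
    --size=4g --runtime=60 --time_based --group_reporting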
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Extend works for the special vdev but not for the dedup vdev via the GUI.
Even more strange: I had set up the dedup vdev as a mirror as well, but it just shows as 2 disks.
1611165951763.png
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If that image is accurate, it's showing your dedup vdev as a stripe with no redundancy.

You should be able to remove and re-add it since all top-level vdevs are mirrors, but you'll want to build it back up ASAP or face some pretty awful performance repercussions.
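
From the shell that would, in principle, look something like the lines below (pool and device names hypothetical); be aware that zpool remove can still refuse, for example if top-level vdevs have mismatched ashift values:

Code:
# identify the striped dedup disks
zpool status tank

# evacuate and drop them, then re-add as a 3-way mirror
zpool remove tank da21 da22
zpool add -f tank dedup mirror da21 da22 da23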
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Already moving the lab VMs off... just about 40 VMs remaining.
Then remove the special vdev and re-add it?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Already moving the lab VMs off... just about 40 VMs remaining.
Then remove the special vdev and re-add it?

Just the dedup vdev but yes. You could theoretically do it live, although you would have to evacuate all the data and you'd be dealing with the additional memory overhead of the redirection pointers until it all gets updated. Clean-slate is best.

If it still creates it as a stripe even when you specify a mirror (or mirror3 with +force) in the GUI, file a Jira bug since that's a pretty big issue if people are going to think they have redundancy on a vdev when they don't.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Removal from the pool was not possible via the GUI or the CLI. I recreated the pool and datasets after all VMs were moved off, and I'm now moving the VMs back again.
All tunables are set.
Let's see how it performs over the next weeks/months under load.
So far I am really happy.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Benchmarked during more or less normal load:
1611310869368.png

1611311032649.png

Looks good.

For comparison, a benchmark of the 8G FC all-flash SAN (with a ton of VMs running):
1611311073942.png

1611311058305.png

 

Last edited:

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
Last 24h of system load. You can see the working hours... :D
1611333099346.png

Judging by the CPU use, I think I could have gone for even more compression. Currently zstd-5.
1611333142391.png
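
Bumping the zstd level is a per-dataset change, e.g. as below (dataset name hypothetical); note it only affects newly written blocks, so existing data keeps its current compression until rewritten:

Code:
zfs set compression=zstd-7 tank/vm-dedup
zfs get compression,compressratio tank/vm-dedup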

We'll see how much ends up in the L2ARC over time:
1611333222514.png
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
An average (mean) L2 hit rate nearing 50% is impressive, especially with L2ARC being a simple ring-buffer.

What happened at 1300-1330h though? Your ARC plummeted and your L2HIT% cratered as well. Big delete, svMotion of something off the array?
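
For digging into moments like that, the command-line ARC tools (if they're present on your CORE install) give a bit more detail than the reporting graphs:

Code:
# one-shot summary of ARC/L2ARC sizes and hit ratios
arc_summary

# rolling view, sampled every 10 seconds (see arcstat(1) for per-field selection)
arcstat 10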
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
I guess this is a bug?
1611755273446.png

It seems to be "fixing" itself; it's going back down now.
1611763792644.png
 
Last edited:

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
1611824061549.png

Dedup dataset for a ton of VMs vs. a non-dedup dataset.
 

Herr_Merlin

Patron
Joined
Oct 25, 2019
Messages
200
A few months later: the system is still going strong.
I just noticed that those SAS HDDs are ST1800MM0018s, which have 32GB of onboard MLC SSD cache.
As I had powerline maintenance scheduled, I could test some things and upgrade to the current release.

I noticed that if I remove all the fancy stuff (SLOG / L2ARC), I simply get "HDD performance".
As I don't have a free machine with SAS to test the disks, I went into the warehouse and picked up an old Seagate laptop HDD with SSD cache and another one without, but with the same size and rpm.
Testing with Windows, there was a huge difference between the two.

Can TrueNAS make use of the SSD cache in those disks? How do I enable it?
The system has its own UPS.
 