FreeNAS prosumer HCI build


kngpwr

Dabbler
Joined
Dec 7, 2014
Messages
13
So I recently decided it was time to retire the Dell C2100/FS12-TY build and make a new one from scratch. This is to be an HCI build (https://en.wikipedia.org/wiki/Hyper-converged_infrastructure). I've been doing burn-in for over a month and it has been rock solid. Before I move it into production, I wanted to ask here if anyone has anything they'd like tested on a setup like this before they potentially make a similar investment/build. I can't promise I will do everything asked, but I think some of you might have some fun stuff I can demonstrate.

Software:
  • ESXi 6.5.0 (5310538)
  • FreeNAS-11.0-RELEASE (a2dc21583) - virtual machine
  • Windows 10 Creator Update - virtual machine
  • Ubuntu 16.04 LTS - virtual machine hosting plex w/ HW transcoding
  • pfsense 2.3.4 - virtual machine
Hardware:
  • Supermicro X11SSH-CTF w/ BIOS 2.0a
  • 64GB Kingston DDR4 2133MHz ECC Unbuffered RAM (4x16GB)
  • Intel Xeon E3-1275 v6 (Kaby Lake)
  • 8x Western Digital Red 8TB (WD80EFZX)
  • 400GB Intel DC P3700 NVMe SSD
  • 512GB Samsung 850 Pro SATA SSD
  • 40GB Intel DC S3700 SATA SSD
ESXi Configuration notes:
  • Intel DC P3700 has PCI passthrough enabled and is attached to the FreeNAS VM
  • Intel DC S3700 has been turned into an RDM via the ESXi command line and attached to the FreeNAS VM (a command sketch follows this list)
  • Intel HD Graphics P630 iGPU has PCI passthrough enabled and is attached to the Ubuntu VM for Plex hardware transcoding (once it's fixed; still in beta)
  • LSI 3008 HBA has PCI passthrough enabled and is attached to the FreeNAS VM
  • Booting from the Samsung 850 Pro
  • The rest of the Samsung 850 Pro is used as a datastore
  • 2 vSwitches configured (storage & LAN)
  • An NFS v3 datastore is configured and hosted by FreeNAS via the storage network
  • An iSCSI LUN backed datastore is configured and hosted by FreeNAS via the storage network
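For reference, the RDM step can be sketched from the ESXi shell roughly as follows; the device identifier and datastore path below are hypothetical:

    ls /vmfs/devices/disks/                        # find the S3700's device identifier
    vmkfstools -z /vmfs/devices/disks/t10.ATA_____INTEL_SSDSC2BA_example_serial \
        /vmfs/volumes/datastore1/freenas/s3700-rdm.vmdk
    # -z creates a physical-compatibility RDM mapping file (-r would create a virtual-compatibility one);
    # the resulting .vmdk is then attached to the FreeNAS VM as an existing disk.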
FreeNAS VM Configuration:
  • Booting from a vmdk on the Samsung 850 Pro datastore
  • LSI 3008 (via passthrough) for HBA for the WD Reds
  • Intel P3700 (via passthrough) for SLOG
  • Intel S3700 as RDM to the VM for L2ARC
  • RAID-Z2 configured on the 8 WD Reds with "sync=always" (see the sketch after this list)
  • 32GB RAM provisioned
  • 2 vCPUs
  • vNIC for dedicated HCI storage network (no egress) with MTU 9000
  • vNIC on the LAN for protocol access with MTU 1500
  • iSCSI, SMB, NFS v3 enabled
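A minimal sketch of the SLOG/L2ARC attachment and the sync setting above, with hypothetical pool and device names (the FreeNAS UI would normally do this with gptid labels):

    zpool add tank log nvd0      # P3700 (NVMe, via passthrough) as SLOG
    zpool add tank cache da1     # S3700 RDM as L2ARC
    zfs set sync=always tank     # force every write through the ZIL/SLOG
    zpool status tank            # confirm the log and cache vdevs are attached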
pfsense VM Configuration:
  • 1 vCPU
  • 512MB RAM
  • 2 vNICs (LAN, WAN)
Ubuntu VM Configuration:
  • 4 vCPUs
  • 8GB RAM
  • Intel iGPU drivers installed from 01.org (hardware transcoding tested and working)
  • NFS mounting the "media" dataset for plex from FreeNAS
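The Plex media mount on the Ubuntu side looks roughly like this; the IP address and dataset path are hypothetical:

    sudo apt-get install -y nfs-common
    sudo mkdir -p /mnt/media
    sudo mount -t nfs -o vers=3,proto=tcp 10.10.10.2:/mnt/tank/media /mnt/media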
Windows 10 CU VM Configuration:
  • 4 vCPUs
  • 8GB RAM
  • VMDK on ESXi NFS datastore attached
  • VMDK on ESXi iSCSI datastore attached
  • iSCSI LUN attached directly from FreeNAS via the storage network
  • SMB v3 share mounted directly from FreeNAS via the storage network
Additional thoughts:
  • Making the Intel DC S3700 into an RDM avoids passing through the entire SATA controller on the motherboard and removes the overhead of VMFS from the SSD. Since an L2ARC device can be lost with no data loss, using it this way is an accepted risk: the data stays safe even if the drive fails without warning (the RDM gives no SMART visibility).
  • I realize the Intel DC P3700 is being used as a SLOG without a mirror. That will be fixed, since I am generally paranoid about data integrity. Right now the only exposure is if the SLOG fails and the system crashes at very nearly the same time. As I understand it, if the SLOG fails the system should simply revert to the on-disk ZIL for subsequent write flushes. This is the smallest size the P3700 comes in, and even so it's heavily over-provisioned, which should only help its long-term reliability.
  • I have passed the iGPU through to both Windows and Linux separately without any obvious issues. I tested hardware transcoding in Plex Media Server on both setups to compare speeds and loads, but the current Plex "beta" that supports the feature has some showstopper bugs that prevent me from recommending it to anyone yet. When it works, it works well and the drop in CPU overhead is dramatic, especially with HEVC/H.265 content (a quick verification sketch follows this list).
  • The Samsung 850 Pro SSD is being used to boot ESXi and to host the FreeNAS boot vmdk right now. I have also configured it for VM swap should the need ever arise but I could have certainly gone a different route for that piece of the system. I just happened to have it available on the test bench so I put it to use.
  • The Windows VM has been my primary benchmarking rig. That's why it has so many types of storage attached.
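As a quick way to confirm the passed-through iGPU is usable for transcoding on the Ubuntu guest, something like the following works (a sketch; vainfo is the stock Ubuntu package, and the 01.org driver still has to be installed separately):

    ls /dev/dri                    # expect card0 / renderD128 once the iGPU is passed through
    sudo apt-get install -y vainfo
    vainfo                         # should list VA-API H.264/HEVC profiles for the P630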
Results:
  • Write speeds with various benchmarks are showing 250-600MB/sec depending on the I/O pattern and protocol
  • Read speeds vary from 450-1200MB/sec depending on the I/O pattern and protocol
  • iSCSI vmdk's have been impressive on this rig. Passmark 9.0 Disk Mark "very long" tests (2 iterations) are showing results of over 5300 (compared to 4400 on NFS vmdk's). For comparison, a Samsung 840 EVO direct attached gets ~4400 and a Samsung 950 Pro direct attached gets ~7500.
  • The WD Reds are coasting. Most of the testing I have done has not pushed any individual drive past 55MB/sec. I'm glad I went with the cooler/quieter 5400 RPM models.
  • I have tested datasets twice the size of the VM's RAM (to push the ARC) and the results were still similar to what is stated above.
Now that I'm done poking and playing with it, I'm taking a break and opening it up to suggestions here. I'll probably run one final week-long punisher of some kind before I flip the switch to "production". I work in the storage industry and I NEVER trust new drives ;-)
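A week-long fio mixed-I/O run along these lines is one way to build that kind of punisher; the target path and parameters are only an example:

    fio --name=punisher --directory=/mnt/tank/scratch --ioengine=posixaio \
        --rw=randrw --rwmixread=60 --bs=128k --size=50G --numjobs=4 --iodepth=16 \
        --time_based --runtime=604800 --group_reporting    # 604800 s = 7 days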

Is there anything you all want to know or want me to try?
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Is there anything you all want to know or want me to try?
I'd be interested in benchmarks or real heavy usage patterns to test SLOG size for performance.
Briefly, to bring the conversation up to speed: a SLOG needs to be as large as the amount of data captured per transaction group flush/cycle, which translates to a few GB on a Gbit LAN. Somewhere around 10GB for 10Gbit seems quite common. For an in-depth treatment, check the primers on the topic.

Now, when everything happens on one machine, there is no LAN interface to size the SLOG drive against. Rather, the interface is effectively "unlimited".
IIRC, ZFS performance will tank if the pool cannot keep up with the output of the SLOG.

What you could perhaps figure out a way to check/measure/share are some good data points on what SLOG size is actually required in your setup. That is, what is obviously too small, and what is just slightly overkill. Any insight would be useful.

I'm running a 40GB SLOG for my setup, which, to be honest, is just a number corresponding to "worst case scenario" on physical NICs.
I'd be interested in making a more informed decision.
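A rough back-of-napkin version of that rule of thumb, assuming the default ~5 second transaction group interval and roughly two transaction groups in flight at once:

    # 10 Gbit/s ≈ 1.25 GB/s;  1.25 GB/s x 5 s x 2 ≈ 12.5 GB of SLOG actually in use
    # Actual ZIL traffic can be watched on FreeNAS with zilstat, if it is available:
    zilstat 1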
 

kngpwr

Dabbler
Joined
Dec 7, 2014
Messages
13
Disclaimer: If I make a statement about the inner workings of ZFS it is based on my current understanding. My background is with similar (non-free) enterprise storage which has many of the same concepts. I'm totally ok being wrong so feel free to correct me.

Regarding some of your points: since this is HCI and all the storage networking for VMs is local to the box (and therefore not actually limited by the physical network), I ran some network tests. I can achieve well over 30Gbps aggregate throughput between my Ubuntu and Windows VMs via iperf benchmarks. Benchmarks being what they are, I never expected to achieve that speed for "real" work.
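A minimal iperf run of that sort looks like this (iperf3 shown; the IP on the storage vSwitch is hypothetical):

    iperf3 -s                          # on the Windows VM (receiver)
    iperf3 -c 10.10.10.20 -P 4 -t 30   # on the Ubuntu VM: 4 parallel streams for 30 seconds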

Even so, let's start with that: 30Gbps is roughly 3.75GB/sec. The Intel DC P3700 will peak at about 2GB/sec of writes during a "rigged demo". In my testing on this system I've never seen the SLOG go much past 800MB/sec in bursts (as expected). As an additional note for those of you who didn't pull the specs, my motherboard has dual 10GbE on it. I doubt I'll be doing anything that "needs" 20Gbps in my home anytime soon, which makes this more of an academic exercise for me :)

With those baselines stated, remember that the SLOG is only read for replay following a crash or similar event; the ZIL still has to do all the work during normal operations. That means I just want my SLOG to never get in the way: if it can outrun my disks then it's probably fast enough for my personal use case.

For the first part of the testing, I decided to simply remove the SLOG from the pool and see how much all that low-latency acknowledging brings to the party. You already know I was able to get a Passmark Disk Mark score of 5300 with the SLOG. With the SLOG removed from the pool that number drops to 2880, and the sequential write score goes from 243 to 9. That's not a typo.
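For anyone repeating the test, removing and re-attaching a log device is non-destructive; a sketch with hypothetical pool/device names:

    zpool remove tank nvd0     # drop the SLOG; the pool falls back to the on-disk ZIL
    zpool add tank log nvd0    # re-attach it after the benchmark
    zpool status tank          # confirm the log vdev is back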

So you have the results for a 400GB SLOG (way bigger than required, but harmless) and for no SLOG at all. Restricting the SLOG to something smaller than 400GB would require booting Linux and using hdparm trickery to shrink the size of the drive visible to the OS (over-provisioning the hard way). Maybe these numbers and some quick napkin math can give you part of what you seek.
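For the record, the "hdparm trickery" refers to the host protected area feature on ATA/SATA drives (an NVMe drive like the P3700 would need a vendor tool instead); a sketch for a hypothetical SATA SSD at /dev/sdb, capping it at roughly 50GB:

    hdparm -N /dev/sdb              # show current and native max sector counts
    hdparm -N p97656250 /dev/sdb    # permanently expose ~50GB (512-byte sectors); power cycle afterwards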

Is there some specific test you would like beyond this very quick and dirty exercise that does not require booting another OS to hack the drive? ;-)
 

r0nski2000

Dabbler
Joined
Jun 20, 2017
Messages
20
I am considering a build with a similar/same software configuration as the OP, but with an X11SSM / E3-1245 v6 / H200 instead.
This is for a home lab with nothing really critical on the NAS side.
I am wondering whether the CPU will provide sufficient performance for multiple VMs - the only other option I am considering is to wait for better AMD support and get an 8C/16T part instead.
Maybe some of the more knowledgeable people can give me some advice?

In case it helps, I am new to all of this, so I do see a benefit in sticking with more stable/supported Intel hardware...
 

kngpwr

Dabbler
Joined
Dec 7, 2014
Messages
13
For my purposes a 4C/8T CPU has been plenty. It is handling hardware-accelerated encrypted ZFS with 2 vCPUs, pfsense with 1 vCPU (on a 300/50 WAN connection with VPN and QoS heavily utilized), and Plex Media Server on Ubuntu with 6 vCPUs without any real issues.

Plex is, by far, the beast in the box and regularly supports 4 live streams. I have tested 4 simultaneous streams of 1080p H.264 software transcoding (intentionally not "Direct Play") and it did fine. The CPU is fast enough that once the 120-second buffers I have configured are full, it juggles the workloads effectively. The Plex preview releases that support hardware transcoding are just not mature/ready yet, but when they work the impact is very noticeable.

Unless you think you're going to pound on the CPU all the time, I suspect you'll be fine with mild CPU oversubscription. Of course I say that without knowing your actual application(s). Also, don't forget to consider the speed of your LAN connection: if this is at home and limited to 1GbE, you may find a machine like this is overkill ;-) Another thought is that once we have mature container support in FreeNAS 11 you may not need VMs at all to accomplish your goal, which would lower the overhead of several use cases that seem typical when reading the forums.

As always, the devil is in the details. Just my $.02
 

r0nski2000

Dabbler
Joined
Jun 20, 2017
Messages
20
Thank you for the feedback. Sorry to hijack your thread.
I don't expect high load at all - all I really need is a NAS, but I want overpowered hardware that would allow me to explore and learn what is possible with some VMs (pfSense, cameras, gaming?, Plex)...
Based on your build and response, it seems that sticking with Intel would be sufficient.
 

r0nski2000

Dabbler
Joined
Jun 20, 2017
Messages
20
@kngpwr
In the ESXi configuration you have 2 vSwitches, and in the pfSense configuration you have 2 vNICs. How do they map?
I read the pfSense wiki on ESXi installation, and it suggests WAN/LAN configuration for vSwitches/portgroups/vNICs - wouldn't it be better/safer to just do a PCI passthrough of the physical WAN NIC to the pfSense VM?
 

kngpwr

Dabbler
Joined
Dec 7, 2014
Messages
13
I can't believe I missed this question when it was posted. I'll reply for completeness and in case anyone revisits this thread. My apologies for the delayed response.

I have a vSwitch with MTU 9000 for pure storage I/O. This includes a VMkernel NIC for ESXi to use when communicating with FreeNAS. My VMs all have a vNIC on this switch/network, and any I/O directed at FreeNAS follows that path. There is no gateway on this network and the traffic never traverses a physical network.
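A minimal sketch of that storage vSwitch from the ESXi shell, with hypothetical names and addressing (no physical uplink is attached, so the traffic stays inside the host):

    esxcli network vswitch standard add --vswitch-name=vSwitch-Storage
    esxcli network vswitch standard set --vswitch-name=vSwitch-Storage --mtu=9000
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch-Storage --portgroup-name=Storage
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=Storage --mtu=9000
    esxcli network ip interface ipv4 set --interface-name=vmk1 --type=static --ipv4=10.10.10.1 --netmask=255.255.255.0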

The other vSwitch has MTU 1500 and is used for all my routable traffic and external connections. This is where all the non-storage stuff happens.

As for whether it's "safer" for WAN traffic to use a passthrough NIC: I have my two 10GbE ports connected to a managed switch as an LACP pair, and I separate the LAN and WAN traffic with VLANs on that switch. My cable modem is connected directly to the switch on a port whose default VLAN is my WAN VLAN ID. I have a port group on my non-storage vSwitch that is tagged with my WAN VLAN ID, I have trunked that VLAN down to the LACP group, and I have given pfsense a vNIC in the WAN port group.
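The VLAN-tagged WAN port group side of that can be sketched the same way; the port group name and VLAN ID here are hypothetical:

    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch-LAN --portgroup-name=WAN
    esxcli network vswitch standard portgroup set --portgroup-name=WAN --vlan-id=100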

This allows me to have some level of redundancy and load balancing for all ingress/egress traffic to the HCI server. I don't consider it unsafe enough to engineer around in a home use case. At work it would be a different story. There are a LOT of things I would do differently for that use case but that's a topic for another thread & another time.

Again, I'm very sorry for the delayed response.
 

r0nski2000

Dabbler
Joined
Jun 20, 2017
Messages
20
:)) wow, thanks!
It all made sense once I started playing with it. And I think I also posted on pfsense forums and they recommended vSwitches in ESXi instead of passthrough.
 