XenServer Storage High-Performance Build check

Joined
Mar 29, 2016
Messages
2
First time builder here, been lurking and researching for 2-3 weeks now. I'm looking at building a high-performance FreeNAS storage box to use as primary storage for 5 XenServer hypervisors. This build would be replacing some old Dell MD3000i SAS 3.0Gbps arrays (60 spindles total) running iSCSI with a new FreeNAS system of >35TB of usable space running NFS. I'm hoping to get 5 years of worry-free usage out of this platform.

Currently I have 2 VM file servers, and 1 database server that do non-trivial amounts of IOPS, and a whole lot (40-60) VMs that are idle 98% of the time. Additionally, nightly incremental host-based backups are performed of important VMs and full backups on weekends.

I figure the total IOPS of my current Dell arrays is less than 12,000, and I'm looking to rely on FreeNAS's read cache / ZIL write-cache to bring up IOPS while bringing down the total number of disks. A SLOG may be considered in the future if RAM doesn't suffice for read cache. A ZIL would be used because NFS is how the storage would be attached to the XenServers.

I'm planning on upgrading networking to 10Gbps for this build, and would hope to connect the FreeNAS via 20Gbps bond. Servers may get upgraded to 10Gbps, or may continue to use a 2 or 4Gbps bond.

After reading the forums, I thought I would run the build past some knowledgeable folks before moving on with it.

(Build via Thinkmate: http://www.thinkmate.com/system/superstorage-server-5028r-e1cr12l/138752 )

Base System: Supermicro SuperStorage Server 5028R-E1CR12L
CPU: (1) Intel Quad-core E5-1620 v3 3.50GHz
Memory: 256GB - (8) 32GB PC4-17000 2133MHz DDR4 ECC Registered (running at an effective 1866MHz)
System Disks: (2) 128GB Micron M600 SATA 6.0Gbps SSD
Storage Disks: (12) 6.0TB SAS-3 12.0Gb/s 7200RPM 3.5" Hitachi Ultrastar HE8 (512e)
ZIL: Intel SSD DC P3600 Series 400GB PCIe 3.0 x4 NVMe SSD, or Intel 750 400GB PCIe 3.0 x4 NVMe SSD (overprovisioned)
Controller: Integrated LSI 3008

A couple of questions I had that I can't find clear answers for:

  1. What is the state of support for the LSI 3008? As of mid-2015 the advice was "don't do it - wait", but I can't find any statements on it for 9.10. Is it considered stable in FreeNAS 9.10? Or are there other 12G SAS controllers that would be better?

  2. Would I get the performance with FreeNAS (with ZIL, 256GB RAM read-cache) that I need when shrinking the spindles from 60 to 12?

  3. What would be the best way to configure my disks for performance and space? Two 6-disk RAIDZ2 vdevs (4 data disks × 6TB × 2 vdevs = 48TB raw, up to 38.4TB at 80% usable), or six mirrored vdevs (6TB × 6 vdevs = 36TB raw, up to 28.8TB at 80% usable)? Rough numbers for this and question 5 are sketched after the list. I'm assuming a 12-disk RAIDZ3 would be too CPU intensive and beyond recommending to even consider.

  4. Even with massive over-provisioning, would the Intel P3600 or 750 not have enough endurance for a ZIL? I could go with a P3700 if necessary, but I figure it would be mostly wasted as a ZIL.

  5. Considering a 20Gbps network bond, what size ZIL should I shoot for?

  6. And last, is a single quad-core 3.5GHz CPU fast enough, or should I redesign with a dual-socket system?
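
To make questions 3 and 5 concrete, here's the rough arithmetic I'm working from (a sketch only; the 80% fill level, the 20Gbps bond, and the SLOG rule of thumb are my own assumptions):

```python
# Back-of-envelope space and SLOG sizing; every figure here is my own
# assumption, not a recommendation from anyone else.

def raidz2_usable_tb(disks_per_vdev, vdevs, disk_tb, fill=0.8):
    """RAIDZ2 keeps (n - 2) data disks per vdev."""
    raw = (disks_per_vdev - 2) * vdevs * disk_tb
    return raw, round(raw * fill, 1)

def mirror_usable_tb(vdevs, disk_tb, fill=0.8):
    """2-way mirrors keep one disk of capacity per vdev."""
    raw = vdevs * disk_tb
    return raw, round(raw * fill, 1)

print(raidz2_usable_tb(6, 2, 6.0))   # (48.0, 38.4)  two 6-disk RAIDZ2 vdevs
print(mirror_usable_tb(6, 6.0))      # (36.0, 28.8)  six 2-way mirrors

# SLOG sizing: it only has to absorb a few seconds of incoming sync
# writes (a couple of transaction groups), so it scales with the link
# speed, not the pool size.
link_gbps = 20          # planned 2 x 10GbE bond
seconds_buffered = 10   # generous: ~2 txgs at the default 5s interval
print(link_gbps / 8 * seconds_buffered)   # 25.0 GB would be plenty
```
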
Thank you in advance for any thoughts on the build.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The ZIL isn't a write cache. It's an intent log. You don't seem to understand what a SLOG is. Read that link in blue please.

Your IOPS estimate for your Dell array may even be a little optimistic.

The E5-1620 v3 is either a good choice or one step too small. The 1650, with two more cores, would be more resilient to heavy loads.

The thing that will make ZFS go fast for block storage isn't the SLOG device. It's proper design, which specifically includes keeping your pool utilization low. RAIDZ2? Horrible idea. Read any of the things I write here on a daily basis about block storage. For block storage, you want mirrors. You also want to make sure your pool is relatively empty; this helps ZFS quickly allocate space and increases performance. 80% usable space? No fricking way. That's a reasonable goal for a typical moderate usage office fileserver. What you want is a lot more demanding.

The two variables here are the pool percent-full and the pool fragmentation. A pool that's 90% full and 0% fragmented will seem really fast. However, over time, what happens in a copy-on-write (CoW) filesystem like ZFS is that you get a lot of new writes, which causes fragmentation, which causes suck. Free space combats fragmentation. Over time, speeds will decrease until you get to a "steady state" where things settle into a "this is about as bad as it gets."

If we look at a graph of steady state performance on a single drive:

[Attached graph: delphix-small.png - steady-state write performance vs. pool occupancy for a single drive]


So, see, the thing here is that a single drive with random I/O on it would normally only be capable of maybe 150 IOPS or about 600KB/sec. What you'll notice is that ZFS is actually making it possible for that drive to go a lot faster for writes. But there are some caveats: first, ZFS treats sequential and random writes very similarly. Normally storage guys think of sequential and random I/O as very different things, but ZFS basically just looks for a nice contiguous run of blocks to allocate, and starts writing. What this means is that as a pool fills, both sequential and random write speeds suffer. It also means that fragmentation can become a significant performance problem. We mitigate the read performance issues associated with fragmentation by using lots of RAM (ARC) and L2ARC. We mitigate the write performance issue primarily by throwing lots of extra space at the pool.

Do note in particular that at 10% occupancy, a one disk pool, at its worst, is likely to seem a full TEN TIMES faster at random writes than the underlying storage device would be capable of in a normal environment.

The real world implications get hard to wrap your head around. But you can take some rough guesses here. So let's say you want 20,000 IOPS. You have 12 disks in mirrors. That's six vdevs, or (conservatively) around 900 IOPS raw. At 25% occupancy, you can probably expect to see something that is about seven to ten times as fast as the underlying devices, so maybe 9000 write IOPS with a tailwind. You may have to get down near 10% occupancy to get 20,000 write IOPS. No guarantees, just extrapolation there. Real world is always different.
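
If you want to play with those numbers yourself, it's napkin math like this (the 7-10x multiplier is eyeballed from that graph for a pool kept around 25% full, so treat it as a guess, not a guarantee):

```python
# Napkin math for mirrored-pool write IOPS. The "boost" range is
# eyeballed from the steady-state graph above; real pools will differ.

def est_write_iops(mirror_vdevs, per_vdev_iops=150, boost=(7, 10)):
    """Return a (low, high) guess at steady-state random write IOPS."""
    raw = mirror_vdevs * per_vdev_iops          # ~900 raw for six vdevs
    return raw * boost[0], raw * boost[1]

print(est_write_iops(6))    # (6300, 9000) -> "maybe 9000 with a tailwind"
# Hitting 20,000 means pushing occupancy way down, adding vdevs, or both.
```
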

So the problem is, then, that if you were really hoping to use 80% of the space of a dozen 6TB drives, that implies you're looking for about 28TB of space. Back-calculating that to a 12 drive array: keeping roughly 30TB at 10% occupancy means a 300TB pool (30TB / 10% = 300TB); divide by six vdevs, and you actually need much larger drives, like a dozen ~50TB drives, to pull off such a stunt.
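
Spelled out as a sketch (same caveat: 10% occupancy is the aggressive end of the target):

```python
# The back-calculation: how big a pool (and how big a drive) do you
# need to actually store ~30TB while staying near 10% full?

def required_sizes_tb(data_tb, occupancy_pct, mirror_vdevs):
    pool_tb = data_tb * 100 / occupancy_pct   # 30TB at 10% -> 300TB of pool
    per_vdev_tb = pool_tb / mirror_vdevs      # 300 / 6     -> 50TB per vdev
    return pool_tb, per_vdev_tb               # 2-way mirror: drive size == vdev size

print(required_sizes_tb(30, 10, 6))           # (300.0, 50.0) -> a dozen ~50TB drives
```
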

Hopefully you don't need 20K write IOPS. Give it some thought.

As for read IOPS, you can fix that with RAM and L2ARC. Whatever your working set size is, you make it fit in ARC/L2ARC, and it will Go Really Fast.
 
Joined
Mar 29, 2016
Messages
2
Thanks for the reply - you are right, I was confusing ZIL/SLOG and the L2ARC. I meant to say that I was going to avoid an L2ARC if possible, but was planning on setting up a SLOG using the NVMe SSD. I have no real way of knowing how many IOPS are actually happening on my old array (due to total lack of statistics in the Dell software), so I was making a super high estimate with the 12k IOPS number.

However, given the math that you put out here and the fragmentation that is involved, it sounds like ZFS / FreeNAS probably isn't the right solution for me for VM storage. The intent was to cut back the total number of spindles, but since I'll probably be over 50% in-use on the pool, it sounds ill-advised to move forward with it. Would something more traditional be a better fit for this? (I'm thinking Linux bcache/dm-cache?)
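
For what it's worth, here's the back-of-envelope pushing me to that conclusion (the occupancy targets are just what I took away from the reply above, and the drive size is the 6TB HE8 I'd planned on):

```python
# Rough check on how much raw disk it would take to hold my ~35TB on
# 2-way mirrors at the kinds of occupancy levels discussed above.

def drives_needed(data_tb, occupancy_pct, drive_tb=6, mirror_width=2):
    pool_tb = data_tb * 100 / occupancy_pct          # pool size to stay at target
    return pool_tb, pool_tb / drive_tb * mirror_width

for pct in (50, 25, 10):
    pool, drives = drives_needed(35, pct)
    print(f"{pct}% full: {pool:.0f} TB pool, roughly {drives:.0f} x 6TB drives")
# 50% full: 70 TB pool, roughly 23 x 6TB drives
# 25% full: 140 TB pool, roughly 47 x 6TB drives
# 10% full: 350 TB pool, roughly 117 x 6TB drives
```
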
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
For VM workloads, L2ARCs really are just a hair from a necessity unless your workload is low or you plan to buy a boatload of very expensive DIMMs. I generally describe the L2ARC as something that "supercharges VM workloads", because the kinds of workloads that VMs typically put on the system are precisely what the L2ARC happens to work best for.

Of course, the slog is a mandatory need if you plan to use sync writes. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks for the reply - you are right, I was confusing ZIL/SLOG and the L2ARC. I meant to say that I was going to avoid an L2ARC if possible, but was planning on setting up a SLOG using the NVMe SSD. I have no real way of knowing how many IOPS are actually happening on my old array (due to total lack of statistics in the Dell software), so I was making a super high estimate with the 12k IOPS number.

However, given the math that you put out here and the fragmentation that is involved, it sounds like ZFS / FreeNAS probably isn't the right solution for me for VM storage. The intent was to cut back the total number of spindles, but since I'll probably be over 50% in-use on the pool, it sounds ill-advised to move forward with it. Would something more traditional be a better fit for this? (I'm thinking Linux bcache/dm-cache?)

I can't tell you whether or not it'd be a good choice for you. I definitely put some numbers out there so you weren't as likely to get hurt by unrealistic expectations, but there's no guarantee you need a big heavy bruiser like that. As I said, that was for a filer I would have given pretty good odds on being able to hit 20K write IOPS. I totally hear ya on the "it's a guess", because that seems to be a common issue. All you can do is look at the stats and estimate. :-(

You probably want to look at the sorts of things your VM's are actually doing and how demanding they actually are. ZFS *could* be the right solution for you, but you need to be able to characterize your workload a little bit.

For example, here, I have a workload that largely consists of FreeBSD VM's that do very little writing (no atime updates, etc) and only do the stuff necessary to do their jobs. These are so well-optimized that I bet I could run a few dozen of them off a RAIDZ2. But there's other stuff, especially various development VM's and also the Windows stuff, which are constantly write-chatty with their storage. That's the workload I primarily worry about. Windows updates can be a real problem.

As cyberjock says, L2ARC is the turbocharger for reads. The VM storage pool here is 52TB of raw disk, providing a 16TB pool, that in turn services about 7TB of datastore. Currently there's 768GB of L2ARC and 128GB of RAM in the system. It stabilized around 450GB of L2ARC after the last boot and then only very slowly grew to fill the entire amount; that means the working set size I need to worry about is ~550GB (100GB of ARC). This is a really happy situation for a VM filer because it means that almost all of the pool IOPS are available for write, so things are about as fast as they can get. I've got another 512GB NVMe SSD sitting here waiting for the next downtime.
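
Put as arithmetic, rounding liberally:

```python
# Working-set arithmetic for the filer above. The point: if ARC + L2ARC
# cover the working set, nearly every pool IOP is free to service writes.

arc_gb = 100             # roughly what 128GB of RAM leaves for ARC
l2arc_settled_gb = 450   # where the L2ARC stabilized after the last boot

working_set_gb = arc_gb + l2arc_settled_gb
print(working_set_gb)    # ~550 GB of hot data out of a ~7TB datastore

l2arc_installed_gb = 768
print(arc_gb + l2arc_installed_gb - working_set_gb)  # ~318 GB of slack left
```
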
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
@jgreco Not to derail the thread, but how do you deal with maintenance/downtime in your pool? I.e., in my pool bad things happen when I'm turning VMs on and the L2ARC isn't fully populated; lots don't correctly start or just have problems. For the maintenance I did this weekend I paced it at 5 VMs started per 2 minutes, but that was still very aggressive and had poor overall results.

If there is anything I've learned during the last 2.5 years of running a production FreeNAS system with ESXi, it is to take however many disks you think you actually need, and double or triple it. My next build sometime this fall is likely to be fully SSD-based to try and resolve the IOPS problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Three words: "storage vmotion, dude." Move the VM's elsewhere and leave them run.

Doing a cold start is hell. You have the worst of all possible situations - lots of idiotic IOPS while OS's thrash around for stupid boot stuff that's probably been updated numerous times, resulting in lots of reads of highly fragmented data. And the L2ARC is empty, and the ARC is empty, so no help there.

How long do your VM's actually take to boot and stabilize? That is, to stop doing stupid frickin' stuff related to the booting process. You might actually want to look at that and simulate some scenarios because it sounds like your 5 VM's per 2 minutes is way too aggressive.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Three words: "storage vmotion, dude." Move the VM's elsewhere and leave them run.

Yeah, just maintaining a second instance that is similar to the main one is expensive. Assuming you have a second system, how much slower is it than the first? Do you do snapshots/backups on it while performing maintenance on your primary?

Doing a cold start is hell. You have the worst of all possible situations - lots of idiotic IOPS while OS's thrash around for stupid boot stuff that's probably been updated numerous times, resulting in lots of reads of highly fragmented data. And the L2ARC is empty, and the ARC is empty, so no help there.

How long do your VM's actually take to boot and stabilize? That is, to stop doing stupid frickin' stuff related to the booting process. You might actually want to look at that and simulate some scenarios because it sounds like your 5 VM's per 2 minutes is way too aggressive.

Yeah, it is all bad. I think many if not most of them can stabilize in the 1-2 min timeframe with a populated L2ARC, but based on this test, it probably needs more like 3-5x that timeframe, given all the thrashing for limited IOPS on the mechanical disks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, just maintaining a second instance that is similar to the main one is expensive. Assuming you have a second system, how much slower is it than the first? Do you do snapshots/backups on it while performing maintenance on your primary?

So you don't maintain an instance that's similar to the main one. You just maintain something that's merely sufficient to suffer your way through. Putting some local storage on the hypervisors works out okay here. Having some other NAS/SAN storage also helps. If you storage vmotion the VM's onto other storage, you're taking a huge hit on performance during the migration, but that can be managed one or two at a time (and VMware limits concurrent migrations).

Yeah, it is all bad. I think many if not most of them can stabilize in the 1-2 min timeframe with a populated L2ARC, but based on this test, it probably needs more like 3-5x that timeframe, given all the thrashing for limited IOPS on the mechanical disks.

So then you need to run the delay up higher and boot fewer simultaneously, most likely.
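
If you'd rather put numbers on the pacing than guess, it's something like the sketch below; every input is a placeholder you'd measure yourself, not a value I'm handing you:

```python
# Toy pacing calculator for a cold-cache mass boot. Both inputs are
# placeholders to be measured on your own VMs and pool.

def seconds_between_starts(stabilize_minutes, max_booting_at_once):
    """Spacing so that no more than max_booting_at_once VMs are ever
    thrashing through their boot at the same time."""
    return stabilize_minutes * 60 / max_booting_at_once

# e.g. if a VM really needs ~6 minutes to settle on a cold cache and the
# pool only tolerates a couple of VMs booting at once:
print(seconds_between_starts(6, 2))   # 180.0s apart, vs. 24s for "5 per 2 min"
```
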

The VM filer here is built out of 2TB 2.5" drives. 26 of them in eight three-wide mirrors. The individual drives are capable of about 100 write IOPS and 60 read IOPS but ZFS accelerates all of that as expected. The pool is only storing about 7TB of data, and of that, there's a TB of L2ARC. The pool itself is mostly doing writes since the reads are almost entirely serviced by ARC/L2ARC.
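
For reference, the raw numbers on that layout, same napkin math as before:

```python
# Raw capability of that layout: eight 3-wide mirror vdevs of 2TB drives,
# ~100 write / ~60 read IOPS per drive as stated above.

vdevs = 8
write_iops_per_vdev = 100    # a 3-wide mirror writes no faster than one disk
read_iops_per_disk = 60

print(vdevs * write_iops_per_vdev)        # ~800 raw write IOPS for the pool
print(vdevs * 3 * read_iops_per_disk)     # ~1440 raw read IOPS, mostly idle
# ...and the CoW allocation behavior above multiplies the effective write
# rate, while ARC/L2ARC soak up nearly all of the reads.
```
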
 