ESXi NFS storage SSD RAIDZ3

Status
Not open for further replies.

mstinaff

Dabbler
Joined
Jan 21, 2014
Messages
34
From numerous searches, I've seen people warned off using anything but striped mirrors for ESXi storage. It seems the non-sequential performance of RAIDZ3 would be detrimental to virtual machine performance.

However, the resiliency of RAIDZ3 is very appealing: -any- three drives can fail, as opposed to mirrors, where multiple drives can fail only as long as no two failures land in the same mirror.

With competitive SSD prices, one can get a 512GB supercap SSD for less than a 450GB enterprise spinning-rust drive. So I was thinking of this build and config:

ASRock EP2C602-2T/D16 Dual LGA 2011 with Intel X540 10GBase-T
LSI 9201-16i 16 port 6Gb/s SAS/SATA HBA
128GB (16GB x 8 sticks) DDR3 1600 ECC
Dual Xeon E5-2609 v2
However many Crucial M550 512GB SSDs it takes (my plan is 11 in RAIDZ3: 8 data + 3 parity)

The ESXi server will host the following for a 60-person team:
fistful of build machines
source repository
defect tracking
requirements management

Questions
1. Am I completely insane?
2. Would the increased performance of SSD offset the RAIDZ3 speed cost?
3. Would additional SSDs for ZIL and/or L2ARC be wasted?
4. Can anyone point me to performance tests of pure-SSD RAIDZ configs vs. spinning rust? Rebuild times, non-sequential access, etc.

My Google fu must not be strong, as I keep getting results for RAIDZ with SSD ZIL/L2ARC and not much in the way of pure-SSD configs.


Many thanks
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Heh. I wouldn't say insane, but nicely on the cutting edge. You jump well out of the range of common experience. :)

So there's not much available information. There's a good discussion in "Building a High Performance SAN"; I'd note jgreco's comments on SLOG, etc.
jpaetzel, one of the developers, did an all-SSD rig.

I'd love to say I've done it, but I haven't had a project to scale there yet. If it were my money, supporting 60 people, I'd phone iX and see about paying for support and some hardware/advice. Their recently announced TrueFlash Z50 is all flash and pretty similar to what you are proposing, not to mention VMware certified... so someone has tested this thoroughly. :)

Lots of gotchas with ZFS, NFS, and ESXi. Probably the most interesting and challenging configuration to master, imho. Not sure if iX was forced to use some sort of Fusion-io or ZeusRAM SLOG to get the numbers up, but it wouldn't surprise me. I would also bet Z3 is a performance killer. IOPS are pretty much controlled by the number of vdevs; one big RAIDZ vdev ≈ one drive's worth of random I/O.
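To put rough numbers on that rule of thumb (purely illustrative; the only assumption is that each vdev delivers roughly one member device's worth of random I/O):

```python
# Rule-of-thumb sketch only: random IOPS scale with vdev count, not drive count,
# because each vdev behaves roughly like a single device for random I/O.
layouts = {
    "1 x 11-drive RAIDZ3": 1,  # one big vdev
    "5 x 2-drive mirrors": 5,  # five vdevs from roughly the same number of drives
}

baseline = layouts["1 x 11-drive RAIDZ3"]
for name, vdevs in layouts.items():
    print(f"{name}: ~{vdevs / baseline:.0f}x the random IOPS of the single-vdev pool")
```

SSDs raise the absolute numbers a lot, but the scaling argument stays the same.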

Good luck, hopefully we hear of your successful project.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Questions
1. Am I completely insane?
2. Would the increased performance of SSD offset the RAIDZ3 speed cost?
3. Would additional SSDs for ZIL and/or L2ARC be wasted?
4. Can anyone point me to performance tests of pure-SSD RAIDZ configs vs. spinning rust? Rebuild times, non-sequential access, etc.

1. Nope, but don't expect amazing performance considering the money you are going to spend.
2. Yes, but not enough that you're going to be happy with your purchase.
3. If you are doing RAIDZ3 the money is better spent going to mirrors.
4. I don't know of any that are public, but I've worked on a few myself. Yes, it's an improvement, but no, I don't generally recommend RAIDZ of any kind for VMs and such. Not that you can't do it (just as you can do a 50-disk RAIDZ1), but it just doesn't "work well" in my experience. As for rebuild times and non-sequential numbers, those are going to be specific to each server setup. Even with identical hardware those numbers will change, so you can't really read one person's experience and immediately assume the same for your setup even if it's identical in every way.
 

mstinaff

Dabbler
Joined
Jan 21, 2014
Messages
34
Thanks for the prompt replies.

mjws00: I've been using FreeNAS at home for so long that the thought of contacting iX flew right past me. I'll look into that.

cyberjock: glad for the first-hand experience. Could you elaborate on your answer to 3? Are you suggesting a mirror of two RAIDZ3 vdevs or a RAIDZ3 vdev of mirrors? I guess this raises the question of other permutations (rough arithmetic sketched below the list):

Original crazy plan:
one 11-drive RAIDZ3 = 11 drives total ≈ 8 drives' worth of usable space; can withstand any 3 failures

Alternative plans:
a stripe of four 3-way mirrors = 12 drives ≈ 4 drives' worth of usable space; can withstand 2 failures in any one mirror
a stripe of two 7-drive RAIDZ3 vdevs = 14 drives ≈ 8 drives' worth of usable space; can withstand 3 failures in any one vdev
a stripe of six 2-way mirrors + 2 online spares = 14 drives ≈ 6 drives' worth of usable space; can handle 2+ failures as long as they don't land on a mirror that is rebuilding
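The arithmetic behind those numbers, as a sketch ("usable" here just counts data drives and ignores ZFS overhead and free-space headroom):

```python
# Back-of-the-envelope comparison of the candidate layouts.
# "usable" counts raw data drives only; it ignores ZFS metadata overhead
# and the free-space headroom you'd want to leave on the pool.
layouts = [
    # (description, vdevs, drives per vdev, redundant drives per vdev, spares)
    ("1 x 11-drive RAIDZ3",        1, 11, 3, 0),
    ("4 x 3-way mirrors",          4,  3, 2, 0),
    ("2 x 7-drive RAIDZ3",         2,  7, 3, 0),
    ("6 x 2-way mirrors + spares", 6,  2, 1, 2),
]

for name, vdevs, width, redundancy, spares in layouts:
    total = vdevs * width + spares
    usable = vdevs * (width - redundancy)
    print(f"{name}: {total} drives, ~{usable} drives of usable space, "
          f"survives {redundancy} failure(s) per vdev")
```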

Do I have an irrational fear of failure during rebuild? I run an 11-drive 3TB RAIDZ3 at home, and when one of the drives failed it took about 3.5 days to rebuild. The two extra parity drives gave me a nice warm fuzzy feeling during that rebuild.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm talking mirrors... as in mirrors.. no RAIDZ(x) at all. There is no such thing as "a mirror of two RAIDZ3 vdevs". ;)

Your alternative plans are all possible. I'd go with six 2-disk mirrors. If you want to be extra safe, you can do 3-disk mirrors too. I'm not sure what your server was doing with that 11x3TB RAIDZ3, but I've got a bigger pool that was way fuller than it should have been and it resilvered in less than 12 hours. The fact that it took 3.5 days makes me wonder what was going on, or whether other disks are failing and you don't know about it.

The fear isn't irrational; that's what backups are for. But if the time lost to recovering from backups would be excessive, then the obvious answer is to go with 3-disk mirrors.

As I said above, RAIDZ of any kind is not going to be recommended. The whole goal is to have as many vdevs as possible, which means mirrors.
 

mstinaff

Dabbler
Joined
Jan 21, 2014
Messages
34
I guess a set of six 2-disk mirrors with online spares gives the best balance of resiliency, speed, and usable space for what I need,
assuming I can guarantee that someone is available to swap in the spare 24x7.

cyberjock, I read your stance on hot spares and I can respect that, but I'm thinking about the worst-case "the guy with the on-call pager got hit by a bus (or some part of the notification chain broke down)" type of scenario. It would be nice to have the NAS auto-add the spare disk if the failure is not dealt with manually within an hour or some such time frame.

Hmm... seems it could be scriptable :rolleyes:. Am I missing anything blindingly obvious that would prevent running a script that says, "Hey, I've got a mirror that has been degraded for over an hour and I have an available spare. I'll scream for help for another 15 minutes and then swap it in myself"?
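Just to make the idea concrete, something along these lines is what I'm picturing. Very much a sketch and not something I'd trust yet: the pool name, spare device, and timings are placeholders, the status parsing is naive, and it glosses over plenty of odd cases (flaky cables, an admin already mid-replace, etc.):

```python
#!/usr/bin/env python3
# Sketch only: swap in a spare if a mirror stays degraded too long.
# Pool name, spare device, and timings are placeholders.
import subprocess
import time

POOL = "tank"            # placeholder pool name
SPARE = "/dev/da12"      # placeholder spare device
GRACE_PERIOD = 60 * 60   # scream for help for an hour first
FINAL_WARNING = 15 * 60  # then nag for another 15 minutes

def zpool_status() -> str:
    return subprocess.run(["zpool", "status", POOL],
                          capture_output=True, text=True).stdout

def pool_degraded() -> bool:
    return "DEGRADED" in zpool_status()

def failed_device():
    # Naive parse: return the first device line marked FAULTED/UNAVAIL/REMOVED.
    for line in zpool_status().splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] in ("FAULTED", "UNAVAIL", "REMOVED"):
            return fields[0]
    return None

def scream_for_help(message: str) -> None:
    # Stand-in for real alerting (e-mail, pager, etc.).
    print(message)

if pool_degraded():
    scream_for_help(f"{POOL} is degraded; fix it or I'll attach the spare myself")
    time.sleep(GRACE_PERIOD)
    if pool_degraded():
        scream_for_help("last chance before I swap in the spare")
        time.sleep(FINAL_WARNING)
        bad = failed_device()
        if pool_degraded() and bad:
            subprocess.run(["zpool", "replace", POOL, bad, SPARE])
```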

To further hijack my own thread (I promise I'll start a new one if it warrants it), my home rig is:
Dual Xeon L5520
48GB ECC DDR3
IBM ServeRAID M1015 flashed to IT mode
six 4TB 3.5" 7200RPM SATA and
five 3TB 3.5" 7200RPM SATA in one 11 disk raidz3 vdev for file storage
two 1TB 3.5" 7200RPM SATA mirrored for jails

The biggest loads would be seven or so concurrent Plex streams and the CrashPlan jail, which is almost done with my initial 12TB backup. When the drive (a 3TB one) failed I was out of town, so the vdev was in a degraded state for about 3 days. When I did get home I was able to swap in a new 3TB drive and start the resilver.
The vdev was roughly 50% full (12TB of 24TB) and took about 3 days to rebuild.

Short question after the long lead-up: anything obvious as to why it took surprisingly long to rebuild?
I was seeing SMART errors before the 3TB drive went bad, but I haven't seen similar errors on the other drives. Anywhere else to look?

Again, thanks for your time.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hmm... seems it could be scriptable :rolleyes:. Am I missing anything blindingly obvious that would prevent running a script that says, "Hey, I've got a mirror that has been degraded for over an hour and I have an available spare. I'll scream for help for another 15 minutes and then swap it in myself"?

Assuming you can reliably identify when a drive has failed, and maintain state correctly through the process, your main liability is that you're injecting dumb scripted automation into a process that usually has an intelligent admin poking at things and looking at the failure. In my experience this can occasionally lead to a cascade failure when your script doesn't handle some odd case.

The vdev was roughly 50% full (12TB of 24TB) and took about 3 days to rebuild.

Short question after the long lead-up: anything obvious as to why it took surprisingly long to rebuild?

Resilvers work through a process similar to scrubs: if scrubs take a long time, so too will a resilver. It can also happen if there are disk errors that the drives are valiantly trying to recover from.
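As a rough sanity check, assuming the resilver had to walk roughly the 12TB of allocated data over about three days and ignoring everything else:

```python
# Very rough average traversal rate for the resilver described above.
data_tb = 12                     # approximate allocated data on the pool
seconds = 3 * 24 * 3600          # roughly three days
print(f"~{data_tb * 1_000_000 / seconds:.0f} MB/s average")   # about 46 MB/s
```

That's slow for eleven spindles, which would be consistent with something (load, or an ailing disk) dragging the traversal down.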
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Heh. I wouldn't say insane, but nicely on the cutting edge. You jump well out of the range of common experience. :)

I have a lot of oddball crap on my plate for this fall, but one of the things I'd really like to do is to make a VM storage engine out of SSDs. The environment here isn't particularly stressy (we're very careful about designing VMs) but it'd be super-fun. We have about 2.5TB on the SAN that'd be eligible for transition, and I note Supermicro now has

http://www.supermicro.com/products/system/1U/1028/SYS-1028U-TRT_.cfm

a 10-drive 2011-3 system whose primary failing is that there isn't an SFP+ riser just yet (see http://www.supermicro.com/support/resources/aoc/aoc_compatibility_ultra.cfm ).

But just piddlin' around with some basic numbers, 10 drives gives you up to 5 mirror vdevs. Or 4 plus a warm spare plus a SLOG.

Four 1TB vdevs would be a 4TB array, which puts that 2.5TB basically right at my 60% max occupancy guideline, but dedup and compression would be available...
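Quick check on that, using the ~2.5TB figure from above:

```python
# Occupancy if the ~2.5TB of eligible data lands on four 1TB mirror vdevs.
usable_tb = 4 * 1.0
data_tb = 2.5
print(f"{data_tb / usable_tb:.1%} full")   # 62.5%, right around the 60% guideline
```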
 

mstinaff

Dabbler
Joined
Jan 21, 2014
Messages
34
jgreco: That Supermicro system is interesting, but how does onboard Intel SATA compare to something like an LSI HBA? Is it like the difference between Realtek and Intel in the NIC world? Calomel.org has some interesting numbers on this page under the heading "All SATA controllers are NOT created equal".

Also, to your point about the lack of SFP+: I've read that the main difference between SFP+ direct attach and 10GBASE-T is latency. SFP+ is about 0.4 microseconds and 10GBASE-T is 6 to 8 microseconds. Any info on how much impact that would have on ESXi datastore performance? Would it be noticeable, or is it more of a difference that only matters in compute clusters?


I will be making the hardware purchase within the next few months and may even have enough time to try some different configs and get performance numbers. I would be happy to share the results; let me know what configs and tests you would be interested in.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Hate to break it to you mstinaff, but if a simple script would solve that problem it would have existed a *long* time ago.

The real problem is that things aren't totally cut and dried, and building in enough logic and fail-safes so you don't accidentally kill your pool while trying to restore it to full redundancy is not easy. Automatic hot-spare handling was promised in 8.x, then moved to 9.x, then to 10.x, and now they are saying 11.x (and they are saying that this is for real this time).

I'd say if it's been a problem for developers for that long there's no way a little script is going to be something you should rely on.

The other thing I disagree with is the "worst-case scenario". As someone who is ex-military, I know you do have to consider those worst-case scenarios; for all you know they could be very real tomorrow. But for the general public you'll find that scenario so unlikely that it's not something you can ever truly plan for. You'll find your workgroup much more strained trying to pick up the pieces of the projects your former co-worker was working on than by anything else. Unfortunately, too many people seem to think that the solution is to just cross-train everyone and cc everyone on all emails. That's all fine and dandy, but the efficiency is so incredibly poor that it's disgusting to even consider. Not to mention that email alone rarely conveys the whole story, so you'll still be left picking up pieces.

In short, don't worry about that worst case; the odds of seeing it in your lifetime are pretty close to zero. The bigger problems stem from things like a building fire that destroys all of the hardware in your building, leaving you with zero hardware and zero software licenses while your employer says, "Let's get the backups out so we can start recovery." You'll be responding with, "Pardon me, but what!? We have no hardware, no ability to write a check to purchase hardware, no clients to serve anyway, etc."

We IT people seem to think that "worst-case scenarios" involve a smooth transition. They never do. No matter how much planning you do ahead of time, there's just no way you can be prepared for all of those scenarios and still remain productive.

Intel SATA actually isn't bad ("bad" being Realtek-NIC bad). I've found the controllers to be very reliable. They may not be the highest-performing, but since most people's bottleneck isn't the SATA controller, I don't generally consider it a problem if they are using one. Just put a few SSDs on an Intel controller and you'll see that you can often do over 1GB/sec easily. Considering that most people are running Gb LAN and only a handful are on 10Gb, 1GB/sec is enough to give even 10Gb LAN a run for its money.

Just put things in perspective, keep good backups, and do mirrored vdevs. You can thank me later. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What CJ said. The Intel SATA parts are the best of what you're going to find included on a mainboard, short of something like a built-in LSI HBA. An external HBA may well outperform them, but probably not by that much. Four 6Gbps SATA ports connected to four SSDs in a two-mirror-vdev configuration theoretically gives you 12Gbps minimum to the pool (6Gbps per vdev), which exceeds even 10Gbps. You're not likely to GET that out of it with ZFS due to overhead, but you need to think about it like that to understand that the bottleneck in most systems is the network interface: even cruddy SATA ports, while not desirable, are not a huge issue, whereas a crappy network interface is a performance disaster.
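Roughly, and only counting the 8b/10b line coding as overhead:

```python
# Rough line-rate comparison: two mirror vdevs on onboard 6Gbps SATA vs 10GbE.
SATA_GBPS = 6
payload_mb_s = SATA_GBPS * 1000 / 10   # ~600 MB/s per port after 8b/10b coding
vdevs = 2
pool_floor = vdevs * payload_mb_s      # writes hit both disks of a mirror, so ~one port per vdev
ten_gbe = 10 * 1000 / 8                # ~1250 MB/s of raw 10GbE line rate

print(f"pool: ~{pool_floor:.0f} MB/s, 10GbE: ~{ten_gbe:.0f} MB/s")
```

Same ballpark, which is the point: even the onboard ports aren't what caps a 10GbE datastore.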
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
lack of SFP+. I've read that the main difference between SFP+ direct attach and 10GBASE-T is latency. SFP+ is about 0.4 microseconds and 10GBASE-T is 6 to 8 microseconds. Any info on how much impact that would have on ESXi datastore performance? Would it be noticeable, or is it more of a difference that only matters in compute clusters?

I prefer SFP+ for the reduced latency, but also for the much lower power consumption and better choice of cards.

I do not have a 10Gbps SAN setup for any of our ESXi clusters and don't have any comparisons available for you. All I can tell you is that we've largely dismissed 10GbaseT because it makes less sense; we're upgrading switching and routing to 10G slowly but it is all SFP+. Electrical isolation (if fiber), short or long haul by choice of modules, QSFP+, lower power, lower latency...

The edge network is also getting redone with Cat6A as time permits but even Apple dissed 10GbaseT on the new Mac Pro so ... who knows.
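For a sense of scale on the latency question (the per-operation figure here is purely an assumed placeholder, not a measurement from any of our gear):

```python
# Toy comparison: the quoted PHY latencies vs an assumed per-operation service time.
assumed_op_us = 100.0    # placeholder for a sync NFS write on an all-SSD datastore
for name, phy_us in (("SFP+ DAC", 0.4), ("10GBASE-T", 7.0)):   # 7.0 = middle of the 6-8 range
    print(f"{name}: PHY adds {phy_us / assumed_op_us:.1%} per {assumed_op_us:.0f} us op")
```

Whether that slice matters depends entirely on how latency-sensitive the workload is, and it compounds per hop.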
 