HW RAID for ZIL Question


eroji

Contributor
Joined
Feb 2, 2015
Messages
140
Is it possible (or reliable) to use a HW RAID controller with BBU and two prosumer-grade SSDs mirrored as the ZIL device, to get around having to have an enterprise-grade SLC SSD with power protection built in? It seems like this would potentially offer better sync write performance than the latter implementation and may still be cheaper as well.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
@jgreco has some interesting thoughts on this, including some success without SSDs at all, simply using fast HDDs behind the controller. Latency is a HUGE part of the game with an SLOG, so the RAID controller's memory can in fact be abused as a very small, extremely fast SLOG to good effect. Just utilizing the controller as battery backup for the SSDs hasn't been discussed at length, but even striped very fast SSDs would be acceptable in many cases, since if the SLOG fails we simply fall back to the in-pool ZIL and uncommitted writes are still protected.
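To make the latency point concrete, here is a rough back-of-the-envelope sketch in Python; every device latency below is an illustrative assumption rather than a measured figure, and real numbers vary widely by device and workload.

```python
# Rough sketch: how SLOG device latency caps a single-threaded sync write rate.
# All latency figures are illustrative assumptions, not measurements.
LATENCIES_US = {
    "RAID controller BBU-backed DRAM cache": 50,
    "fast NVMe/PCIe SSD": 100,
    "SATA consumer SSD": 300,
    "7200rpm HDD (falling back to in-pool ZIL)": 8000,
}

for device, latency_us in LATENCIES_US.items():
    # With one outstanding sync write at a time, throughput ~= 1 / latency.
    ops_per_sec = 1_000_000 / latency_us
    print(f"{device:<42} ~{ops_per_sec:>8,.0f} sync writes/sec")
```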

Even nicer is that power-protected PCIe SSDs with specs that would have made the enterprise crowd blush are becoming affordable. Can't wait.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was really hoping to find that FreeNAS would have good support for the LSI 3108 controller, as SuperMicro has a nice 2GB-cache version of it (AOC-S3108-H8IR). Alas, no such luck. Well, it's supported, but it is way slow.

I think this'll rapidly become a nonissue with the advent of cheap NVMe SSD like the Intel 750, which I'm thinking I might order a few of.
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
Right, the idea came about after reading about jgreco's setup with the laptop drives. I am just trying to find a more cost-effective solution for my upcoming build, since I would like to use the pool for some light datastore purposes. Enterprise-grade SSDs are just expensive, and their write speeds are at a level sufficient to saturate gigabit but not 10GbE, which is what I'm planning to use to link FreeNAS to the ESXi host. Assume I get two Samsung 850 PRO 128GB drives and mirror them (while underprovisioned) using an LSI 9240 with BBU, for instance: I'd imagine that should satisfy redundancy and cost, while writes should come significantly closer to using up more of the 10GbE link. Furthermore, if for whatever reason one or both drives start to fail, it would be easy and cheap to swap in replacements.
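As a sanity check on the gigabit-vs-10GbE point, here is a quick Python estimate of how much of each link a single SATA SSD's rated sequential write could keep busy, using the ~470MB/s figure quoted later in the thread for the 128GB 850 PRO; sustained sync-write throughput through an SLOG will be lower in practice.

```python
# Link throughput vs. one SATA SSD's rated sequential write speed.
# 470 MB/s is the 128GB 850 PRO's rated sequential write; real sustained
# sync-write throughput will be lower.
GIGABIT_MBS = 1_000 / 8      # ~125 MB/s raw, before protocol overhead
TEN_GBE_MBS = 10_000 / 8     # ~1250 MB/s raw
SSD_WRITE_MBS = 470

for name, link_mbs in (("1GbE", GIGABIT_MBS), ("10GbE", TEN_GBE_MBS)):
    fraction = min(SSD_WRITE_MBS / link_mbs, 1.0)
    print(f"{name}: one SSD covers roughly {fraction:.0%} of the link")
```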
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The LSI 9240-4i with BBU is going to run you around $400, and the two Samsung 850s and cabling will be another $200. $600 for the solution.

The Intel 750 will come out the door at NewEgg at $409 and you could probably omit the redundancy. I'm having a rough time seeing the LSI as being competitive pricewise with this, but of course, who knows what real world performance will be like.

The other thing is to ponder whether or not your application actually requires SLOG. What level of risk is there if you lose power and there's some damage to a VM? Do you have backups? Can you easily restore from backups, or would that (for example) totally hose an ecommerce application you're hosting?
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
Actually, you can find the 9240 or 9260 easily on eBay for less, closer to $150 even with BBU. The 850s will be around $200, true, but there are alternative MLC drives out there that are cheaper and potentially sufficient for the use case. As for the Intel 750, I agree it's a great potential alternative (if nothing else, most definitely faster). However, with NVMe being new, I am unsure about motherboard or FreeNAS support. Moreover, in the event of failure, it would be purely the cost of replacing drives rather than replacing the whole NVMe device.

The question of whether I need a SLOG is something I thought about for a little while, but since I do want to put the large pool (8x4TB RAIDZ2 is the current plan) to good use beyond just serving some files, and I do plan on hosting a number of things, including a small ecommerce website, it would be a more efficient use of resources to put in this investment so I can have a datastore with reasonably good sync write performance for ESXi.
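For reference, a quick sketch of that layout's raw capacity, ignoring ZFS metadata/padding overhead and the TB-vs-TiB conversion:

```python
# 8x4TB RAIDZ2: two drives' worth of parity; ignores metadata/padding overhead
# and the TB-vs-TiB difference.
drives, size_tb, parity = 8, 4, 2
raw_tb = drives * size_tb
usable_tb = (drives - parity) * size_tb
print(f"raw: {raw_tb} TB, usable before overhead: ~{usable_tb} TB")
```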
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The BBUs do fail periodically on the 9240/9260s, which I hate, and which kinda ruins the point of having redundant SSDs. The 9260 does come in a CV variant, but my guess is you'll be paying a pretty penny for a good RAID controller and SSD combo.

I'm pretty sure the NVMe support will be fine since we've had people using P3700's. If it makes you nervous, wait until some of us pick up the 750.
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
Hmm, let me ask this then. Do you think the Intel 750 will work with Supermicro X10SL7-F? That is the board I think I have my mind set on now.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'd be dumbfounded if not. But, again, feel free to let others blaze the trail. I can assure you it'll happen quickly.
 

eroji

Contributor
Joined
Feb 2, 2015
Messages
140
The reason I ask is that my workstation motherboard (ASUS X99-Deluxe) only recently received a BIOS update for NVMe support, which was surprising. I had thought it would've been built in, but since that happened, I have been somewhat skeptical about server board support for the new NVMe products that are coming out.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Most likely that'd be support for booting off NVMe devices. You do not need to boot off a SLOG device.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
We aren't booting from it; we just need the device to be recognized by FreeNAS. NVMe is already supported for other devices, so it should work.

<too slow>
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, I posted the answer more than two minutes prior. What, are you typing at 11WPM? Geeeeeeeeeeeeeeeeeeeeeeeeeeeeeez.
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
I wouldn't favour that HW RAID solution either. The cheap SSDs still have their internal write cache. Writes to that cache are ack'd to the parent layers, but in the case of power loss the data is *gone*, because the on-SSD write cache is not power-loss protected. Also, 128GB SSDs have fewer NAND cells -> less write speed... and what do you want from a SLOG? Write speed.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I wouldn't favour that HW RAID solution either.

That really depends. Infinite endurance is hard to get any other way. That 128GB Samsung Pro has a 150TBW write endurance rating. That's pretty pathetic; if you were making a VMware storage server in the several-dozen-TB range or larger, I'd be worried that you could blow through that within the first year.

The cheap SSDs still have their internal write cache. Writes to that cache are ack'd to the parent layers, but in the case of power loss the data is *gone*, because the on-SSD write cache is not power-loss protected.

Ok, good catch, I don't really keep track of that on the non-Intels and didn't pull the specs until this post. Getting lazy in my old age.

Also, 128GB SSDs have fewer NAND cells -> less write speed... and what do you want from a SLOG? Write speed.

Unconvincing argument. Properly underprovisioned to a reasonable size, you are giving the drive controller a chance to maintain a massive pile of free pages. I actually asked for this to be supported years ago (see https://bugs.freenas.org/issues/2365 ) and given that most SSD's can easily outpace gigE these days, all you really need is to make sure that you're not running that pool out of space.

I note that the Samsung Pro's write speed is 470MB/sec for the 128GB model and 520MB/sec for the others. This is effectively very near the SATA 6Gbps practical limit, so while you could go a LITTLE faster by bumping up to the 256GB model, it is a trite difference, and in a scenario where the speed difference actually mattered on a consistent basis, you'd be hitting the 150TBW write endurance limit quickly, I'd think...!
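For context on the "SATA 6Gbps practical limit" remark, here is a small Python calculation of the interface ceiling; the ~8% protocol overhead factor is an illustrative assumption.

```python
# SATA 6Gbps ceiling: 8b/10b line coding leaves 600 MB/s of payload bandwidth,
# and framing/protocol overhead eats a bit more in practice (the 8% factor
# here is an illustrative assumption).
line_rate_mbps = 6_000
payload_mbs = line_rate_mbps / 10        # 8 data bits per 10 line bits -> 600 MB/s
practical_mbs = payload_mbs * 0.92
print(f"theoretical payload: {payload_mbs:.0f} MB/s, practical: ~{practical_mbs:.0f} MB/s")
print("vs. rated writes: 128GB 850 PRO 470 MB/s, larger models 520 MB/s")
```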

The Intel 750 is a tad better at 219TBW... but it isn't limited by SATA and sports 230K write IOPS at 900MB/sec and 20us latency (read specs are obviously irrelevant for a SLOG).

For endurance, the Intel P3700 is a hell of a thing. The 400GB model sports 7.3PBW, with the higher-capacity versions increasing at least linearly. The 400GB model does 75K write IOPS at 1080MB/sec and 25us latency.

I've gone with my typical selection of the random write speeds rather than sequential write speeds, because I'm a pessimist.

Now, what this really comes down to is how much data is getting written to the pool. Some of us have pools that do nothing but soak up backups. So if you've got a pool that's 150TB, and you were cramming data at it at 10GbE speeds, you could be writing 100TB/day to it. Even with the P3700 400GB unit, you will exceed the PBW rating within three months, and you can measure that cost in dollars.
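A quick Python back-of-the-envelope for that endurance math, using the TBW/PBW figures quoted above and assuming a sustained, saturated 10GbE write stream (~1.25GB/s, roughly 100TB/day) as the worst case:

```python
# Days until the rated write endurance is exhausted at a sustained write rate.
# Endurance figures are the ones quoted above; the write rate assumes a
# saturated 10GbE link (~1.25 GB/s ~= 108 TB/day), a worst-case scenario.
TEN_GBE_TB_PER_DAY = 1.25 * 86_400 / 1_000   # ~108 TB/day

endurance_tbw = {
    "Samsung 850 PRO 128GB": 150,     # 150 TBW
    "Intel 750 400GB": 219,           # 219 TBW
    "Intel P3700 400GB": 7_300,       # 7.3 PBW
}

for name, tbw in endurance_tbw.items():
    days = tbw / TEN_GBE_TB_PER_DAY
    print(f"{name:<22} rated {tbw:>6} TBW -> worn out in ~{days:,.1f} days")
```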

You can also get some zippy enterprise HDD's in RAID somethingorother (0, 10...) and gain virtually infinite endurance. The question basically becomes one of cost and whether or not one needs the endurance.

Give it another year and the flash will likely be the winner here.
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
That really depends. Infinite endurance is hard to get any other way. That 128GB Samsung Pro has a 150TBW write endurance rating. That's pretty pathetic; if you were making a VMware storage server in the several-dozen-TB range or larger, I'd be worried that you could blow through that within the first year.
That's where I would throw an HGST ZeusRAM at the problem. Obviously not feasible for everybody, but ~$1.2k is pretty good for such an unlimited-endurance device with dual-port SAS capability.

Unconvincing argument. Properly underprovisioned to a reasonable size, you are giving the drive controller a chance to maintain a massive pile of free pages. I actually asked for this to be supported years ago (see https://bugs.freenas.org/issues/2365 ) and given that most SSD's can easily outpace gigE these days, all you really need is to make sure that you're not running that pool out of space.

I note that the Samsung Pro's write speed is 470MB/sec for the 128GB model and 520MB/sec for the others. This is effectively very near the SATA 6Gbps practical limit, so while you could go a LITTLE faster by bumping up to the 256GB model, it is a trite difference, and in a scenario where the speed difference actually mattered on a consistent basis, you'd be hitting the 150TBW write endurance limit quickly, I'd think...!
But a 128GB drive's controller still cannot write to 16+ NAND chips in parallel to boost performance. It's stuck at 8 or so chips, crippling write performance, even if there is lots of space left over. I don't deal with 128GB SSDs in performance-sensitive environments, but IIRC the IOPS rating was quite a lot lower than the usual rating for 256+GB models. Also, non-Intel SSDs usually lap their TBW rating 10+ times, maybe even 15x; the NAND usually dies gracefully, but the controllers don't.
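That chip-count argument can be sketched as a toy model in Python; the per-channel program bandwidth and channel counts are illustrative assumptions, not specs for any particular drive.

```python
# Toy model: SSD write throughput scales with NAND channel/die parallelism
# until the host interface becomes the bottleneck. All numbers here are
# illustrative assumptions, not specs for any real drive.
PER_CHANNEL_MBS = 60          # assumed effective program bandwidth per channel
SATA_LIMIT_MBS = 550          # approximate practical SATA 6Gbps ceiling

for channels in (4, 8, 16):
    raw = channels * PER_CHANNEL_MBS
    effective = min(raw, SATA_LIMIT_MBS)
    print(f"{channels:>2} channels: raw ~{raw} MB/s, interface-limited to ~{effective} MB/s")
```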

For endurance, the Intel P3700 is a hell of a thing. The 400GB model sports 7.3PBW, with the higher capacity versions increasing at least linearly. 75K IOPS at 1080MB/sec write speeds, 25us.
There you can see that more NAND chips/capacity do help with write performance. That's mostly due to an "internal RAID0" across all those flash chips: if you take some chips ("vdevs") away, you get less write performance and less endurance.

Now, what this really comes down to is how much data is getting written to the pool. Some of us have pools that do nothing but soak up backups. So if you've got a pool that's 150TB, and you were cramming data at it at 10GbE speeds, you could be writing 100TB/day to it. Even with the P3700 400GB unit, you will exceed the PBW rating within three months, and you can measure that cost in dollars.
ZeusRAM. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's where I would throw an HGST ZeusRAM at the problem. Obviously not feasible for everybody, but ~$1.2k is pretty good for such an unlimited-endurance device with dual-port SAS capability.

Last I looked they were twice that cost (~$2.5K), and 3.5" devices on top of it, AND crippled by that SAS interface, which, while it offers sub-23us latency, still has to traverse the HBA and go through all that command queueing, serialization over SAS, etc.

[Attached image: nvmestack.png, a graph comparing the latency contributions of the driver, controller, and drive across the NVMe and SAS/SATA stacks]


Pretending for a moment that all devices sport a 20us drive latency, it should still be obvious that the relatively modest driver shim for NVMe and total lack of a drive controller for NVMe totally beats the round trip a SAS device needs to commit data.

The hardware RAID for SLOG question falls into an edge case not illustrated in the above graph; we eliminate the drive latency and a large part of the controller latency, because that is all post-write-cache activity. The relevant bits (assuming the backing store is sufficiently fast) are merely the driver and whatever latency the controller introduces while writing the data into its local write cache, which is pretty fast, but still at least all of the green plus a little of the purple.
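To illustrate the stack comparison, here is a rough Python latency-budget sketch. Every component figure is an illustrative assumption (the graph's actual numbers aren't reproduced here), apart from the 20us media latency carried over from the "pretend all devices are 20us" premise above.

```python
# Rough latency budgets for one committed sync write on each path.
# Every figure is an illustrative assumption except the 20us media commit,
# taken from the "pretend all devices are 20us" premise above.
paths_us = {
    "NVMe SSD": {
        "driver shim": 5,
        "media commit": 20,
    },
    "SAS/SATA SSD behind an HBA": {
        "driver": 10,
        "HBA firmware + queueing": 15,
        "SAS serialization": 10,
        "drive controller": 10,
        "media commit": 20,
    },
    "HW RAID BBU write-back cache": {
        "driver": 10,
        "controller commit to DRAM cache": 20,
        # drive/media latency eliminated: the ack comes from the cache
    },
}

for path, parts in paths_us.items():
    total = sum(parts.values())
    breakdown = ", ".join(f"{name} {us}us" for name, us in parts.items())
    print(f"{path:<29} ~{total:>2}us  ({breakdown})")
```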

There you can see that more NAND chips/capacity do help with write performance. That's mostly due to an "internal raid0" across all these flash chips - if you take some chips ("vdevs") away, you have less write performance and endurance.

Yes, I understand the design principles. However, more chips/capacity also translates to higher cost. For conventional SATA SSDs, performance at the low end is often already near the SATA 6Gbps limit, and handily exceeds gigabit or even 2x gigabit Ethernet speeds.

ZeusRAM. ;)

Yes, indeed, we have a winner. ZeusRAM, the Quantum BigFoot of the SSD world. Too expensive, too large, and rapidly rendered obsolete by more practical, less specialized technology.

From hot stuff to yesterday's has-been in three short years. Such is the fate of most high performance technology. :-/
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
Tell me how you are connecting one SLOG SSD with unlimited endurance to two host servers. Some people do require HA capabilities... buuut, yes indeed, it seems NVMe does have specs for dual-port drives. However, those aren't available right now... and they don't offer unlimited endurance.
 