Decided on the majority of my spec, now I need to make choices about SLOG and L2ARC


Stilez

Guru
I'm putting together a new FreeNAS build. The main hardware decision left is SLOG/L2ARC, to give the best chance of good performance and reliability.

Hardware -
  • Xeon E5-1620 v3 (4 core, 3.5 GHz+ as I might heavily use CIFS or other single-thread services)
  • 32GB RDIMM 2400 (Amazon have an insanely good offer on Crucial DDR4 RDIMM kits, go check it out!) - I can always increase to 64GB in future if needed.
  • Baseboard - either a Supermicro X10-something or perhaps a well-built X99 (Asus, ASRock)
  • NICs - Chelsio T420 (10G dedicated links to an ESXi server and to a Windows workstation) + 1G Intel for LAN file shares + server admin.
  • Data drives - redundant 7200 rpm 4 TB and 6 TB HDDs, to be migrated from my old Windows server.
I want to use it for -
  • ESXi file store - VMs will mostly be Windows, not many but could have 1 or 2 heavy use (computational workstations).
  • LAN general-purpose file-share - a mix of folder/file uses, with some large file manipulation at times (mass renames/moves/uploads/downloads of photo archives and data folders up to 50-80GB, though mostly less). Clients are something like 4 Windows PCs, 2 Windows workstations, and 1 Mac
  • ZFS snapshots
My priorities are -
  • Data integrity (goes without saying)
  • Performance - Once reliability is taken care of, I want a noticeable improvement over my current setup: faster and more consistent. For example, I want ESXi to keep running fast and smoothly during VMDK loads, saves, and snapshots. I don't need blinding speed, but I go back and forth between VMs a lot and take quite a few snapshots as I test things, and since these are 5-10GB loads/saves it's worth attention. I'd like the same for the LAN file shares: accessing, modifying, and copying back and forth, including large folders when needed, should run smoothly.
  • Keeping raw disk space down if possible - The workstation creates a lot of photo-processing datasets which include a lot of duplicate data, and I have to keep disk costs down. I'm hoping it won't be needed, but I'm planning for the possibility of enabling dedup, even though it hits performance much harder than compression. I reckon about 2 TB of the datastore would benefit from dedup at the moment, and as much as 4TB in future; I can put these on their own dataset. I'm hoping to compensate by making the file server high enough spec to comfortably manage 2-4 TB of deduped data in RAM, if that's possible (a rough estimate of what that implies is sketched just below).
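As a sanity check on that hope, here's my back-of-the-envelope estimate of the dedup-table (DDT) RAM this could imply. It's only a sketch: it assumes the commonly quoted figure of roughly 320 bytes of metadata per unique block and 128K records; actual usage depends on block size, dedup ratio, and how much of the DDT stays in ARC.

```python
# Rough dedup-table (DDT) RAM estimate -- a sketch, not a guarantee.
# Assumes ~320 bytes per DDT entry and 128 KiB average block size.

def ddt_ram_gib(deduped_data_tib, avg_block_kib=128, bytes_per_entry=320):
    blocks = deduped_data_tib * 2**40 / (avg_block_kib * 2**10)
    return blocks * bytes_per_entry / 2**30

for tib in (2, 4):
    print(f"{tib} TiB deduped @ 128K blocks -> ~{ddt_ram_gib(tib):.0f} GiB of DDT")
# Roughly 5 GiB per 2 TiB at 128K records; smaller blocks (e.g. 16K zvols
# for VMs) multiply this by 8, which is why dedup RAM advice varies so much.
```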
There are some things I've already done. For example, I've tried to avoid the 1 GbE bottleneck by buying 10G cards and using dedicated connections, and to reduce the CIFS single-thread bottleneck and any dedup slowdown (if they're used) by getting a processor with 3.5GHz single-thread speed and modern instruction-set additions, plus 32GB (or 64GB if needed) of 2400 RAM. What's left needs more experience than I have, and I need help.

My questions:
  1. Where might SSDs help, and what size would be sensible? - Will it help to add SSDs for SLOG, L2ARC, or any other kind of caching? I want ESXi/NFS sync writes enabled for data safety, but with performance on large (5 - 20 GB) reads/writes not too badly impaired. From reading the forum, a SLOG is only useful up to a given size, based on the amount of data that can be queued for writing during about 5 seconds, and capped at 1/8 of RAM by default (a rough sizing sketch follows these questions). I'd be happy to add SSDs (SATA or NVMe) to my build, but it's hard to figure out whether they'll definitely do good, whether there might be no point, and where to use them. If I add SSDs, they would probably be mirrored NVMe SSDs on PCIe (128 - 256GB Samsung SM/PM series, 380-500k IOPS, 1.5-2.5 GB/s) or fast SATA/SAS SSDs, for SLOG or other caching. The forum also seems to warn that it can be a problem to have so much RAM that 1/8 of RAM every 5 seconds at peak could overload the disk system. But I'm stuck at that point and confused about what to do.
  2. Possible use of caching/BBU on a RAID card used as an HBA - I have an old MegaRAID card with 8 ports and CacheCade 2.0 (LSI's write-back SSD caching system) and a near-new battery, so while ZFS doesn't like hardware RAID, I could use it as a pure HBA while making use of its onboard battery-backed SSD caching mechanism. Reviews say it's very fast indeed and improves HDD write speeds a lot. Would this help, or isn't this card useful for my build? Will a UPS do much if the SSDs are battery-backed?
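For reference, here's the back-of-the-envelope version of the sizing rule I quoted in question 1, using the ~5 second interval and 1/8-of-RAM figures from the forum. It's only my own rough sketch, not an authoritative formula.

```python
# Back-of-the-envelope SLOG sizing from the rules of thumb in question 1:
# the SLOG only holds the sync writes arriving between transaction-group
# commits (~5 s by default), further capped by the dirty-data limit (~1/8 RAM).

def slog_gib_needed(link_gbit=20, interval_s=5, intervals=2, ram_gib=32):
    by_throughput = link_gbit / 8 * interval_s * intervals   # GiB of ingest
    by_dirty_data = ram_gib / 8                               # default cap
    return min(by_throughput, by_dirty_data), by_throughput, by_dirty_data

print(slog_gib_needed())             # (4.0, 25.0, 4.0): RAM cap dominates at 32GB
print(slog_gib_needed(ram_gib=64))   # (8.0, 25.0, 8.0): still tens of GiB at most
```

Either way the SLOG only ever needs tens of GiB at most, so anything extra on a 128-256GB device is effectively over-provisioning.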
Thanks for the read and look forward to comments and help to get it into use
 

Dice

Wizard
hello.
I'll offer some comments and sort of answer your inquiries. What I gather is that you want a lot, but are missing a few links to get there, or need to accept certain trade-offs.

The CPU is well suited.
RAM is waaay on the low side, especially considering your temptations for L2ARC and SLOG additions. I'd suggest you revisit the sticky on the topic, "Some insights into SLOG/ZIL with ZFS".
Once you're well read and informed on this matter, you should arrive at some conclusions. First and foremost, a need to bump RAM. Secondly, perhaps the realization that L2ARC is not necessarily for you.

Regarding storage.
VMs tend to pull hardware specs away from RAIDZ2, towards mirrored vdevs, to achieve the required performance. Since you want both "keeping raw disk space down if possible" and performance, you're at a loss. ZFS gives you the trade-off: redundancy and speed at the cost of cheap hardware - but lots of it. For VM performance in particular, plenty of free space is key, typically before looking into L2ARC, though obviously heavily dependent on the type of workload.
If the storage situation is such that the VMs and the most intensive work could be handled on a fairly small portion of the total required space, you might want to build two pools. One that is blazing fast, using a couple of mirrored SSDs in multiple vdevs to form a pool, i.e. "striping the mirrors".
For the less speed-intensive storage area, settle for a RAIDZ2 configuration, single or multiple vdevs.
Dedup - I've no memory of reading anything positive about this recently. Rather the opposite. If I'm wrong about that, I'm sure others will contribute and correct it. I'd stay away from dedup. Cool on paper, IIRC, less cool in action.

There are some things I've already done. For example I've tried to avoid the 1 GbE bottleneck by buying 10G cards and using dedicated connections, and maybe reduce the CIFS single thread bottleneck and dedup slowdown (if they're used) by getting a processor with 3.5GHz single thread speed and modern instruction set additions, and 32GB (or 64GB if needed) of 2400 RAM. What's left needs more experience than I have, and I need help.
Good.

Where might SSDs help, and what size would be sensible? -
For a dedicated high performance VM pool.
Size according to your needs. I'd follow the general rules of thumb - at least 50% free space, mirror-configured vdevs. If you need, let's say, 1TB of space to handle the intensive VM workload, you'd end up [very rough approximation] with something like 4x 600GB SSDs.
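A minimal sketch of that sizing arithmetic, purely as my own illustration (assuming 2-way mirrors and a 50%-free target; the drive count you land on depends on the sizes actually available):

```python
# Sketch of the mirror-pool sizing arithmetic: raw SSD capacity needed for a
# given amount of "hot" VM data, given a mirror width and a maximum fill target.

def raw_capacity_tb(hot_data_tb, mirror_way=2, max_fill=0.5):
    usable_needed = hot_data_tb / max_fill   # stay at or below the fill target
    return usable_needed * mirror_way        # each byte is stored mirror_way times

print(raw_capacity_tb(1.0))                  # 1 TB hot data -> 4.0 TB raw (2-way, 50% full)
print(raw_capacity_tb(1.0, mirror_way=3))    # 3-way mirrors -> 6.0 TB raw
```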
I'd also consider a SLOG, given the requirement to run with sync writes enabled.
I'll not comment further on choices of SLOG, since I'd only retell a less in depth version than already provided in the sticky.

Possible use of caching/BBU on RAID card used as HBA -
Only as mentioned in the linked SLOG sticky. Or, if you would like to completely separate your VM datastores from FreeNAS, that could be a use case. Personally, I'd discard that RAID card from this project. Completely.

cheers
 

Stilez

Guru
Thank you Dice. I was away a day or two, so only just saw the reply.
RAM is waaay on the low side, especially considering your temptations for L2ARC and SLOG additions. I'd suggest you revisit the sticky on the topic, "Some insights into SLOG/ZIL with ZFS". Once you're well read and informed on this matter, you should arrive at some conclusions. First and foremost, a need to bump RAM. Secondly, perhaps the realization that L2ARC is not necessarily for you.
I just bought another 64GB, bringing it to 96GB (RDIMM 2400), and if needed I can always increase to 128GB - 256GB for ARC or anything else (instead of L2ARC) once I can see stats. Cheap NVMe drives are another L2ARC option, but the stats will say what's needed.

I had read that page about the SLOG (ZIL), but not carefully enough. So I'll be adding an NVMe-based SLOG for lowest latency (dual 10G). The choice of SLOG device brings its own issues...

PCIe SSDs that are fast enough not to slow down dual 10G (ruling out SATA/SAS and older NVMe), have multi-PB lifetime write endurance, and have full power-loss protection tend to be high capacity (wasted on a SLOG) and expensive. Mirrored SATA won't help, because the issue is the latency of each write (I think), and even striped+mirrored I'd need 4 drives.

Most SLOG options seem to call for undesirable compromises. The Intel P36xx is a bit too expensive even on eBay after already buying an extra 64GB of RDIMM, and I certainly couldn't afford to mirror it. The S37xx is affordable mirrored but limited to SATA speed and latency, and would consistently hamper write caching and latency on the incoming 10G links. It's unclear whether or not the Intel 750 has full loss protection. The Samsung NVMe M.2 drives are fast enough and reliable but have no loss protection and aren't PB-scale endurance. Older cards are slow.

One option that I'm taking seriously is using a PCIe HBA/RAID controller card with onboard write caching - but trying to do it properly and/or use it for SLOG only. Ideally a card with a large battery/flash-protected write-back cache (2GB - 4GB), high write speed (150k writes/s and 1 - 2.5 GB/s), and a couple of SATA/SAS SSDs attached. I've read the pages on the ZIL and on cards like the MegaRAID and the issues they have with passthrough and true JBOD/HBA functionality, and it sounds like this is grudgingly seen as viable, but I need to ask a bit more about the mode of failure and whether cards exist that I can afford and that do what I want:
  • If power is lost and the ZIL fails as well (so it can't be read back when FreeNAS reboots), do I just lose a few seconds of recent writes, or am I at risk of the whole pool being unreadable? The risk/cost/benefit of a non-redundant ZIL would be different if I felt that a small and rare data loss equivalent to a few seconds of writing (but not a large loss) were tolerable, and that this was the limit of the risk.

  • Are there any true HBA cards, or cards that support a JBOD/HBA mode suitable for 1 or 2 ZIL drives, which also have a 2GB or 4GB onboard flash-backed RAM write cache? (I think flash-backed is better than battery-backed for this.) The reason being that if so, I could use two cheap SATA SSDs with loss protection as a mirrored ZIL, and most of the time the SSDs wouldn't impact latency or write speed, because sync writes would be acknowledged once cached. As a bonus I could also attach the main pool drives to such a card to speed up writing data to the big HDDs. I've read the warnings and would not want a card that prevented ZFS doing its job... do such cards exist? Meaning, HBAs with a large write cache that work well with FreeBSD/ZFS, or RAID/controller cards with a true JBOD/HBA mode and cache? I've seen the HP Smart Array cards suggested for this, but it's hard to tell which models, and I've seen MegaRAID not recommended for the main pool but nothing clear about using them for just the ZIL.

  • Do the MegaRAID cards work (other than SMART and failure detection)? What about the ones flashed to HBA mode - does the write-back cache still work at speed, or is that lost with the firmware change?
TechReport's reviews seem to show that, other than Intel's P36xx/P37xx or specialised/high-cost devices, almost all SSDs (enterprise or consumer) are likely to have very poor sustained write speeds after a while. I can't tell if that's mostly due to thermal throttling (fixable with a heatsink+fan) or if it's inherent in the controller/RAM (not fixable). This also limits my choice a lot.

Hopefully in a year or 2 prices will come down and loss protection will be more available on enthusiast NVMe, but for now... hmm

Regarding storage.
VMs tend to pull hardware specs away from RAIDZ2, towards mirrored vdevs, to achieve the required performance. Since you want both "keeping raw disk space down if possible" and performance, you're at a loss. ZFS gives you the trade-off: redundancy and speed at the cost of cheap hardware - but lots of it. For VM performance in particular, plenty of free space is key, typically before looking into L2ARC, though obviously heavily dependent on the type of workload. If the storage situation is such that the VMs and the most intensive work could be handled on a fairly small portion of the total required space, you might want to build two pools. One that is blazing fast, using a couple of mirrored SSDs in multiple vdevs to form a pool, i.e. "striping the mirrors". For the less speed-intensive storage area, settle for a RAIDZ2 configuration, single or multiple vdevs.

Dedup - I've no memory of reading anything positive about this recently. Rather the opposite. If I'm wrong about that, I'm sure others will contribute and correct it. I'd stay away from dedup. Cool on paper, IIRC, less cool in action.

For a dedicated high performance VM pool. Size according to your needs. I'd follow the general rules of thumb - at least 50% free space, mirror-configured vdevs. If you need, let's say, 1TB of space to handle the intensive VM workload, you'd end up [very rough approximation] with something like 4x 600GB SSDs. I'd also consider a SLOG, given the requirement to run with sync writes enabled.
The VMs will mostly be desktops (remote desktop, PCoIP, or whatever it may be), a few very minor servers, and test setups. The user data for these will be a second "chunk" of data. The third "chunk" of data is mostly WORM and accumulates over time - ESXi/ZFS snapshots, movie AVIs + MP3s, photo backups, software/drivers/ISOs, and user data backups. This third "chunk" needs a lot of organising, tagging and so on, so it gets moved and copied a lot, and the photo backups often contain hundreds of GB of duplicates, but basically that chunk is static. So there is a good opportunity to differentiate data as you suggest, in terms of read/write speed, caching requirements, need for striping as well as mirroring, etc. If I use 3-way mirrors, that will also improve read speeds a lot, won't it? (I'm assuming ZFS reads from all 3 disks of a 3-way mirror.)

Dedup is only really relevant to the user data backups - they contain hundreds of GB of photos and experience shows that a lot of them are dups. I could probably save a TB or so and some extra HDDs if I dedup these, and the older copies are not often needed. I make local backups from habit and I might not need to given ZFS snapshots and replication. Again the tools will show what's possible and I can trial it on a couple of spare 4TBs to see what it does to memory and performance and whether or not I like the cost/benefit.

I hadn't looked at the free-space implications beyond knowing that going above about 80-90% full is a problem. I'll re-read this.

SLOG/ZIL yes, see above.
 

Dice

Wizard
I see you've done a fair amount of research. I can appreciate that.
Looking forward to further unfolding of this project. Keep us informed ;)

If you can formulate some more specific questions, others might be encouraged to chime in with opinions.
As of now, good luck.
 

Stilez

Guru
I see you've done a fair amount of research. I can appreciate that.
Looking forward to further unfolding of this project. Keep us informed ;)
If you can formulate some more specific questions, others might be encouraged to chime in with opinions.
As of now, good luck.
That's really appreciated, Dice.

Most of the hardware has arrived (the 10G transceivers came this morning); just the motherboard (in transit), a bit more pool storage, and sorting out the ZIL remain. The boot drive will be mirrored Intel 320 40GBs; I'm guessing FreeNAS doesn't use the boot drive other than at boot and to save config/logs, so speed doesn't matter. I haven't got the ESXi server up yet, so I won't be able to fully test anything except ordinary file serving for now. Comments welcome on these points:

Main drives - Reviews oddly seem to show enterprise SAS 12G drives not being much faster in data throughput than enterprise SATA 6G, even though there's no tunnelling overhead and it's nominally a faster bus (and both kinds of drive have enough built-in cache to use any bus bandwidth). I can't find any review that shows a clear SAS data advantage, other than in terms of controllability/recovery. I looked at Backblaze's idea of cheap commodity drives, but decided I prefer the certainty of warranty coverage - the economics change if you only count a significant "fail" when out of warranty and a new drive must be bought. I find that cost per TB per warranty year is a useful metric for this.
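For what it's worth, the metric is just this (a trivial sketch; the drives and prices below are made-up placeholders, not quotes):

```python
# "Cost per TB per warranty year" -- the metric mentioned above.
# Model names and prices are placeholders for illustration only.

def cost_per_tb_year(price, capacity_tb, warranty_years):
    return price / (capacity_tb * warranty_years)

candidates = [
    ("6TB NAS drive, 3yr warranty (example)", 220.0, 6, 3),
    ("6TB enterprise drive, 5yr warranty (example)", 300.0, 6, 5),
]
for name, price, tb, years in candidates:
    print(f"{name}: {cost_per_tb_year(price, tb, years):.2f} per TB per warranty year")
```

On those made-up numbers the longer-warranty drive wins despite the higher sticker price, which is the point of counting warranty years.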

ZIL - still not happy with any choices, and budget squeezed hard on both RAM and additional pool drives for extra redundancy. These are the options I seem to have reached (easiest ruled out 1st!), but it's hard to know which to go for, so this is a bit of a brain-dump on it and any advice from anyone is really appreciated...
  1. Mirrored SATA SSDs: Loads of SSDs can saturate a SATA 3 connection, so it's easy to choose one that is both fast and has loss protection, and that I can afford two of. Problem part-solved: an Intel S3700 or something similar with good sustained write speeds. Downside: a significant write source will be ESXi saves with sync enabled over 2 x 10G, and SATA latency will continually kill performance - that's what I'm trying to get away from. I really *really* want to go the NVMe route if at all possible.
  2. Mirrored NVMe SSDs with loss protection: I can't afford mirrored P3xxx or specialised Zeus/RAM-disk cards (mirrored or not). Most NVMe cards are currently M.2 without loss protection (Samsung etc.), or enterprise modules that cost too much to mirror, or older and slower. I've checked eBay; miracles might happen, but I'm not holding my breath. Bottom line: I can't see a way to afford mirrored NVMe drives that also have loss protection. Maybe in a year's time the market will include enthusiast NVMe with loss protection, but not affordably for this, for now.
  3. Mirrored NVMe SSDs without loss protection: Viable if - and the "if" is crucial - the failure of the ZIL could in no way imperil more than the small amount of data acknowledged by the ZIL but not yet written to the SSD's non-volatile flash, and if I'm convinced that, for my needs, that wouldn't be an unmanageable loss. If loss of the ZIL when it's needed meant a risk, however small, that I might not recover the pool, or might have to roll the pool back more than a couple of minutes (say 15 - 120 seconds), then this starts to look dodgy. The flip side is that unexpected power loss is exactly the situation the ZIL is needed for most. The data acknowledged but not yet saved to non-volatile flash could be tiny - a small fraction of a second's worth, given the speed at which SSDs work. But I don't have a way to tell how many milliseconds/seconds of data would be at risk, which is what I need to know to decide the risk.
  4. Single NVMe SSD with loss protection: Affordable but not very reassuring, especially as the lifetime (PBW) is often too low for a ZIL. That said, a P36xx is pretty reliable, and if the worst that can happen is a small amount of lost data and no more (up to 15-120 seconds as above), this would be viable. But a single device always leads to a single point of failure. The same issues apply as above. How much data is actually at risk - a few milliseconds, 5 seconds, 30 seconds, the entire pool?
  5. 6G HBA with 2-4GB battery/flash-backed write caching (even in HBA mode), backed by a single SATA SSD: An HBA with a large DRAM loss-protected write cache would be viable. Except for extreme and sustained loads, most of the time it would buffer the SSD, so the server wouldn't be affected by SATA latency, it would get near-immediate acknowledgements, and there would be loss protection on both HBA and SSD. Although an intermediate cache and a full view of the drive sound contradictory, such cards do exist. But not many of them, and it's often not confirmed which specific cards have a true HBA/JBOD mode and also a 2-4GB write cache that still works when that mode is enabled. If it did work as hoped, this could also be used to buffer the main pool drives, perhaps (added bonus!). I haven't come across a confirmed report of an LSI card that does true HBA/JBOD and keeps battery/flash-backed write cache enabled as well, but maybe some LSI card does. Some HP Smart Array (440/840) cards might, too. Areca 1881/1883 don't seem to be affordable. Confirmation of a viable card by someone who's used it would be useful.
  6. 6G RAID card with 2-4GB battery/flash-backed write caching (even in HBA mode), backed by single or mirrored SATA SSDs configured as single drives/RAID0/almost-JBOD: Not ideal, but the final fallback option I can think of that won't slow things down as much as plain SATA. It would probably work, but ZFS would be trusting the ZIL "blind". That shouldn't be a problem in itself (other than it can't report SMART/failures automatically), but it's far from ideal, even if it probably could in fact be trusted. It also adds an extra layer, which is against the philosophy of ZFS.
What I really need help with is figuring out the "worst case" for how much data ZFS can lose at #3, given a really fast, low-latency NVMe SSD ZIL - is it milliseconds of data in flight within the SSD, a few seconds of data in recent transaction groups, tens of seconds for a pool rollback, or at worst the entire pool? Information on specific cards known to work for #5 or (as a final fallback) #6, risks notwithstanding, would also be really useful.

Tentative backup strategy - The backup server will be offsite (20-30 Mbps connection). On the basis that both servers are incredibly unlikely to fail at the same time, for now I'm going with either a 2-way mirror on both the main and backup servers, or a 3-way mirror on the main and single non-mirrored drives on the backup. It's a tricky risk-balancing exercise, but cash is tight. If a drive dies on the main server, will I be better served until a replacement arrives in the post by (1) a main server that's unaffected because it still has 2 more mirrors, plus a backup that's also usable unless its drive also dies, or (2) a main server that's usable but now down to a single non-redundant copy (having just had one drive die), which could be restored from backup, but slowly and losing the most recent data? So I think there may be an argument for a redundancy policy of 3-way main + 1-way offsite backup, stopping immediately if I lose a second drive - at which point I would still have either 2 mirrors 100% up to date, or 1 mirror 100% up to date plus the backup. So 3+1 may not be entirely insane if I want to continue uninterrupted without being at very high risk from a second failure.

File-serving - If I serve ESXi through NFS, can I "see" and copy VM files to/from the ESXi store via the FreeNAS GUI, or from other file shares? I'm assuming this is an advantage of an NFS share over an iSCSI share, as both seem to be competitive on speed.

Authentication - Can FreeNAS use pfSense's FreeRadius authentication service, certs, etc, rather than adding a second authentication process, and if so is there a "How-to" on this?

Any comments really appreciated, especially on the ZIL options.
 

depasseg

FreeNAS Replicant
My vote would be for #4
But a single device always leads to a single point of failure.
While, yes, if it fails you no longer have a low-latency SLOG and performance will slow, you won't, however, have a data-loss failure.
How much data is actually at risk
None. The data to be written is in RAM and in the SLOG (in order to be ack'd), and then it's written to disk. If the SLOG fails, data will be in RAM and written to disk before being ack'd. The ONLY time a SLOG is ever read from is after a failure (power, system crash, etc.), when the pool is checked for consistency.
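To make that ordering concrete, here's a tiny toy model of the flow described above. It's purely illustrative Python, not ZFS code, and the names are mine:

```python
# Illustrative toy model (not ZFS code) of the sync-write ordering above:
# data goes to RAM and the SLOG before the ack; the SLOG is only ever read
# back during crash recovery, never in normal operation.

ram_txg, slog, pool = [], [], []

def sync_write(data, slog_healthy=True):
    ram_txg.append(data)          # always buffered in RAM (the open txg)
    if slog_healthy:
        slog.append(data)         # logged -> safe to ack immediately
    else:
        txg_commit()              # no SLOG: ack only after the pool write
    return "ack"

def txg_commit():
    pool.extend(ram_txg)          # periodic commit of the txg to the pool
    ram_txg.clear()
    slog.clear()                  # log records are now superseded

def crash_recovery():
    pool.extend(slog)             # only after a crash: replay acked-but-uncommitted writes
    slog.clear()

sync_write("block A")
sync_write("block B")
crash_recovery()                  # pretend we crashed before txg_commit ran
print(pool)                       # ['block A', 'block B']
```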
 

Stux

MVP
I'd go for the single NVMe with PLP. To experience a failure which would involve a possible rollback, you need to have a catastrophic failure (i.e. crash, power loss, etc.) followed by a SLOG failure on restart.

Consider a wider raidz2/3 for your backup. It doesn't need to have the IOPS performance if it's just a backup.

If you need mirrors for performance, then stepping up to 3 way is a large parity loss. Perhaps best to stick to 2-way with semi-constant replication offsite. And perhaps have a burnt in spare ready to go, or even a hot spare.

Mirror rebuilds are fast
 

Stilez

Guru
My vote would be for #4
if it fails you no longer have a low-latency SLOG and performance will slow, you won't, however, have a data-loss failure.
The data to be written is in RAM and in the SLOG (in order to be ack'd), and then it's written to disk. If the SLOG fails, data will be in RAM and written to disk before being ack'd. The ONLY time a SLOG is ever read from is after a failure (power, system crash, etc.), when the pool is checked for consistency.
If the SLOG or its controller, but nothing else, has an issue (or dies), then whatever happens to the SLOG is very soon 100% irrelevant for the reason you give - there's no crash, so the system will handle the vanished SLOG gracefully within seconds. So questions about the reliability of the SLOG, or "what ifs" about how it's set up, probably ought to assume the worst case of a wider power loss/failure in which the system recovers but the SLOG has been lost (perhaps in a pathological way or with "edge-casey" timing); otherwise the SLOG is either never needed or is known to be handled gracefully, and the question is meaningless. So, in thinking about SLOG devices, I'm assuming that at some point the SLOG contents are needed after a reboot, but the system has suffered the worst possible timing and manner of SLOG loss consistent with the rest of the system and disks not being harmed.

Concrete if slightly fabricated example - very heavy duty writing of a sync-enabled VM snapshot at 10G or 40G, transaction groups almost full and having to pause, worst case timing, and PSU detonates taking CPU and ZIL with it so everything stops without warning. Main disks + contents intact. New system, reboot... and what's the worst case of lost data? Milliseconds? Seconds? Minutes? Or "anything up to the entire pool"? How much could not having done the SLOG to the highest standard have cost us in data loss at the pool, in the worst case (assuming a better ZIL design which would have by magic survived)?

I'd go for the single NVMe with PLP. To experience a failure which would involve a possible rollback, you need to have a catastrophic failure (i.e. crash, power loss, etc.) followed by a SLOG failure on restart.
Consider a wider raidz2/3 for your backup. It doesn't need to have the IOPS performance if it's just a backup.
If you need mirrors for performance, then stepping up to 3 way is a large parity loss. Perhaps best to stick to 2-way with semi-constant replication offsite. And perhaps have a burnt in spare ready to go, or even a hot spare.
Mirror rebuilds are fast
I decided on pure mirroring because the backup isn't just a data backup - it's possible I may need to use it during a rebuild or repair of the main server too, and for performance and ease of resilvering when that happens, mirroring is hard to beat. It's a good point, though, and needs thinking about. It would save disk space too, if I can identify archived files that aren't likely to see frequent use.

I don't understand the comment about a 3-way mirror leading to a large parity loss, as pure mirrors don't have parity disks or the ZFS equivalent - or so I thought? Semi-constant replication makes sense for sure. SLOG failure - see the comment above.
 

Stux

MVP
Well, yes, technically mirrors don't use parity, and thus don't have a parity loss. But you still have a loss to redundancy: 50% with a 2-way mirror, and 66% with a 3-way mirror.

Vs, say, 25% with an 8-wide RAIDZ2.
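Those percentages are just the layout arithmetic; a quick sketch for anyone following along:

```python
# Space lost to redundancy for the layouts mentioned above.

def mirror_overhead(way):
    return 1 - 1 / way            # only one copy's worth of space is usable

def raidz_overhead(width, parity):
    return parity / width         # parity disks out of the vdev width

print(f"2-way mirror:  {mirror_overhead(2):.0%}")    # 50%
print(f"3-way mirror:  {mirror_overhead(3):.0%}")    # ~66-67%
print(f"8-wide RAIDZ2: {raidz_overhead(8, 2):.0%}")  # 25%
```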

You *can* run iSCSI/NFS off a RaidZ2... the performance would not be as good, but for a backup system... while the primary is being brought back... is that critical... only you can say.

I'd far prefer a RaidZ2 backup system to a backup system with no redundancy.
 

depasseg

FreeNAS Replicant
Concrete if slightly fabricated example - very heavy duty writing of a sync-enabled VM snapshot at 10G or 40G, transaction groups almost full and having to pause, worst case timing, and PSU detonates taking CPU and ZIL with it so everything stops without warning. Main disks + contents intact. New system, reboot... and what's the worst case of lost data? Milliseconds? Seconds? Minutes? Or "anything up to the entire pool"?
That's quite a scenario (and not to be pedantic, but it's the SLOG, not the ZIL). In that case, I'd say that whatever wiped out the CPU and SLOG would likely wipe out a mirrored SLOG as well, if you put one in there, so you are looking at, by default, up to 5 seconds of data. IIRC, there is a small chance that some critical pool data or metadata could be caught up in that destruction, which could cause the loss of the pool. So if you think that doomsday scenario is likely, having a replicated backup is key.
 

Dice

Wizard
I looked at Backblaze's idea of cheap commodity drives, but decided I prefer the certainty of warranty coverage - the economics change if you only count a significant "fail" when out of warranty and a new drive must be bought. I find that cost per TB per warranty year is a useful metric for this.
I like this approach to putting numbers on the investment. Too few people plan ahead, past the first failures.

I'd go for #4.
The key things to focus on for the SLOG are power-loss protection and low latency. This pretty much narrows your selection down to the NVMe options. Since mirroring does not add value here, a single unit is the way to go.
The life expectancy, i.e. write endurance, of the NVMe P3500 - P3700 range comes in a lot of different variations, ranging from 0.3 DWPD (drive writes per day, over 5 years) to 17 DWPD(!). 17 DWPD on a 400GB unit is about 6.8TB of data written per day, on average, over 5 years. Add to that the over-provisioning you get from sizing down the SLOG partition to fit the 10Gbit links - somewhere around a 10GB SLOG (there is a recent benchmark thread on this topic). If the over-provisioning works as expected, you'd have a hell of a time burning that out...
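Sketching that arithmetic out (rough numbers only; ratings vary by model and capacity):

```python
# Endurance arithmetic for the DWPD figures above (a sketch, not a datasheet).

def tb_per_day(dwpd, capacity_gb):
    return dwpd * capacity_gb / 1000           # TB written per day

def total_pb(dwpd, capacity_gb, years=5):
    return tb_per_day(dwpd, capacity_gb) * 365 * years / 1000

print(tb_per_day(17, 400))      # ~6.8 TB/day for a 17 DWPD, 400 GB drive
print(total_pb(17, 400))        # ~12.4 PB over a 5-year rating period
print(total_pb(0.3, 400))       # ~0.2 PB for a 0.3 DWPD model of the same size
```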
 

Stilez

Guru
I need quick advice on a choice between two ZIL devices.

I've mostly settled on Intel NVMe for the ZIL. The 910 400GB is likely to be affordable; the P3700 400GB is likely to be twice the price, and I don't know if the difference in performance will justify the extra $400 or so. (The P3600 was the other option, but the 400GB version is slow and the 800GB is hard to find.) Relevant paper specs, with a rough latency comparison after them:
  • 910 400GB - seq write 750 MB/s, random write 4k 38k IOPS, write latency 65 usec, probable price USD 225 - 375 (friend willing to sell a lightly used one, if I can decide by tomorrow)
  • P3700 400GB - seq write 1080 MB/s, random write 4k 75k IOPS, write latency 20 usec, probable price USD 700 (online shops)
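To give myself a rough feel for what those latency numbers could mean, here's my own sketch based purely on the paper figures (not a benchmark, and QD1 sync writes are only one part of the picture):

```python
# Rough ceiling on single-stream sync writes implied by the quoted latencies:
# a QD1 sync write can't complete faster than the device's write latency,
# so 1/latency gives an upper bound on sync write ops per second per stream.

def qd1_sync_iops(write_latency_us):
    return 1_000_000 / write_latency_us

for name, lat_us in (("Intel 910", 65), ("Intel P3700", 20)):
    print(f"{name}: ~{qd1_sync_iops(lat_us):,.0f} QD1 sync writes/s ceiling")
# ~15,000 vs ~50,000 on paper -- a real gap, but only visible if the workload
# actually issues that many small, latency-bound sync writes back to back.
```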
How much will I notice the difference in responsiveness? Or, asked another way, is the difference likely to be big enough that the P3700 is worth considering anyway, for ESXi VM store use and Windows over CIFS, all over 10G? I know it's a bit "YMMV", but any idea of what these numbers translate into would help. My use is light about 60% of the time, and when busy it's slightly bursty, with the big writes being occasional ESXi VM saves/snapshots/clones, ZFS replication, and large file/folder copies/writes from Windows shares.

Thanks for the comments
 

SweetAndLow

Sweet'NASty
The CIFS share will never use the SLOG; only the NFS mounts to ESXi will use it. And at that point I think it's "you get what you pay for". It all depends on the workload of those VMs. Do they need to be performance VMs, or are they OK being slow?

 

Stilez

Guru
At the moment I'm running 2 medium-use desktops on real hardware (quad-core Sandy Bridge era + 16GB, to give an idea), another 3 desktops on a single VMware Workstation install running on an overclocked but rock-solid Ivy Bridge-E (4960X + 64GB @ 2400), and a couple of tiny servers (negligible load). On real hardware the virtualised desktops would need a similar spec to the real ones (quad-core SB/IB, but perhaps 24-32GB rather than 16GB). The desktops aren't usually fully used at the same time, and often some are idle, so there's scope for a big saving if they're on one platform with shared CPU/RAM. Storage is local SSDs, mostly mirrored Samsung 840 Pros, and the local VM store is currently an HDD mirror on an LSI MegaRAID with write-back acceleration.

I'm consolidating all of these onto a single ESXi octo-core server (E5-2667 v3), with a separate FreeNAS box for the VM store and CIFS file store, plus offsite ZFS replication, which should improve reliability, resilience, and flexibility. It'll also mean I can maintain them more easily and focus on one good hardware platform with resource sharing, not 3 separate platforms, and hopefully using an enterprise-grade platform will improve day-to-day use as well.

I'll admit to being a bit performance-focussed in my end use, and programs I use on the desktop can be intense on their random I/O, mostly data updates on local disk or accessed over CIFS. Along with ESXi snapshots/restores/clones those will be the main ongoing FreeNAS use, but snapshots are periodical, not continuous. I'm also aiming for the VMs themselves to fit completely in RAM when running. I value reliability more than getting the last few % of performance, so while the 4960X is overclocked, it's only that way after several 48 hour stress tests/memtests and it's never had a known issue since building.

My main concern is really just that the end-use doesn't slow down after all this work and hardware cost (or hopefully can be improved). A lot of that depends on ESXi (hardware and config), but I'm a bit apprehensive about the impact of the VM store and VM local files, and that moving to a separate server could slow down my workflow noticeably, just because it's separate now and a non-local device. So I want to be sure that any slow-down is offset by improved hardware spec, larger RAM to work with, 10G not 1G, file store not in critical data paths if avoidable, suitable FreeNAS caching, etc.

That's the background, and why I'm trying to second-guess whether I'm likely to notice a difference between the 910 and the P3700 400GB used as a SLOG, and if so whether it's likely to be significant - to have some basis for judging whether it's worth the extra cash to me. It would be a big chunk of extra budget, and the value depends on whether paying for the P3700 is likely to have a noticeable impact.
 

Dice

Wizard
I don't think the difference in performance between the two suggested SLOGs will have any noticeable impact in this scenario. From what I gather, your VMs are more memory/CPU-intensive than IOPS-grinding.

The day you can trace a significant performance hit to an underperforming SLOG, you can sell the 'cheap' one and step up your game. It is not a one-time affair.
Chances are other aspects will impact the perceived performance.

I used to run a remote-desktop-controlled Windows box that fed a file server. Upon re-configuring the servers, I ended up with a Windows VM replacing the bare-metal Windows box. While the CPU benchmarks were fairly similar, there was a significant perceived latency/performance hit compared to bare metal. The VM was hosted on SSD datastores, not involving FreeNAS. I won't imply this is significant to your particular situation; the point is rather to prepare a mindset for the 'VM-rationalization project' of converting bare-metal junk into fewer, more powerful setups.

In retrospect, I consider the virtualization endeavor to behave similarly to my FreeNAS experience:
The correct mindset is to think of FreeNAS as a mastermind that needs to be fed cookies. It eats more than it gives back. Give it enough cookies (resources) until it performs to your expectations.

It just takes a while to accept that point.
 

Stilez

Guru
I don't think the difference in performance between the two suggested SLOGs will have any noticeable impact in this scenario. From what I gather, your VMs are more memory/CPU-intensive than IOPS-grinding.

The day you can trace a significant performance hit to an underperforming SLOG, you can sell the 'cheap' one and step up your game. It is not a one-time affair. Chances are other aspects will impact the perceived performance.
The correct mindset is to think of FreeNAS as a mastermind that needs to be fed cookies. It eats more than it gives back. Give it enough cookies (resources) until it performs to your expectations.
It just takes a while to accept that point.
I reckon that's good advice, and I went ahead with the 910. The case also just arrived; room for about a dozen HDDs and plenty of spare room to add an SSD rack, nicely built. Which means... my FreeNAS hardware questions are dealt with, thank you. Next come the config choices :D:eek:

When I get to the point of config decisions, is it better to continue here (same build) or start a new thread (different topic area)?
 

Dice

Wizard
When I get to the point of config decisions, is it better to continue here (same build) or start a new thread (different topic area)?
I'd say you'd be best off, in terms of getting proper replies, creating new threads in the relevant sections matching each particular problem.
That way, anyone reading back through the discussion will find it sticks to topic a lot better, improving the mood by the time they reach the last comment ;)
(It's good etiquette to provide your hardware specs in any thread you start, to aid troubleshooting and to rule things out.) You've shown good character and mindset.

Cheers,
 