BUILD Critique production VM build


hammong

Dabbler
Joined
Mar 18, 2014
Messages
22
I've decided to build a small production NAS for hosting some moderately busy small (300-400 GB) VMware ESX 5.0.1 guests. The total size as planned will be between 8 and 12 TB of usable storage, and there are about 10 VMs running, spread across two VMware hosts.

VMware host #1 is an older HP DL580 G5: 64 GB ECC RAM, four quad-core Xeon E7340 CPUs at 2.4 GHz, 2x 146 GB 10K drives in RAID 1 for the boot volume, and an HP NC522SFP dual-port 10 GbE SFP+ NIC with iSCSI offload dedicated to iSCSI.

VMware host #2 is a newer home-built Supermicro H8DG6-F-O dual-socket G34 AMD platform with two 16-core Opteron 6272 CPUs at 2.1 GHz, 112 GB ECC PC3-10800 RAM, and an HP NC522SFP dual-port 10 GbE NIC dedicated to iSCSI.

Switch fabric is dual HP 5406zl chassis, with iSCSI on 8-port 10 GbE zl modules and HP SFP+ cables.

My proposed "production" FreeNAS box has:

Supermicro X9SRL-F motherboard
E5-2630 2.3 GHz Xeon
4x 8 GB (32 GB total) ECC DDR3-1333 unregistered RAM
3x 3 TB HGST Ultrastar NAS 7200 RPM 64 MB cache SATA drives, 2 attached via 6.0 Gbps motherboard SATA ports, 2 attached via 3.0 Gbps motherboard SATA ports.

I've got the NAS set up and configured in a temporary $70 rack chassis, but will be moving it to a Supermicro 2U CSE-826E 12-drive chassis with redundant fans and PSUs before it hits production.

So, my primary concerns are ...

1) Should I plan for an L2ARC or ZIL with this configuration? I was thinking of 2x 120 GB Samsung 840 Pros mirrored for L2ARC, and trying to find a pair of less expensive 32 GB or 64 GB SLC SSDs for the ZIL. Is the performance worth it? The VM guests are mostly mail servers, web servers with low transaction volume, and some relatively light SQL Server and MongoDB (on Ubuntu x64).

2) The NAS has 3x 3 TB disks, currently configured as RAIDZ1. I am waiting for my local Microcenter to restock more drives so I can run 6 of them in striped mirrors, but it's RAIDZ1 for now.

3) The Opteron host has a pretty fast LSI 9266-8i SAS controller with an LSI battery backup unit, and 6x 4 TB Deskstar 7K4000 SATA drives in RAID 10 + hot spare as local storage. I may pull the Deskstars out and put them in the NAS once the production chassis gets ordered, and set up a second volume. The LSI 9266 would be reconfigured as JBOD and the RAID battery removed.

How much RAM should I put in my NAS? I have 32GB in there now, and can put another 32GB in the open 4 slots. I have the extra memory on hand, but was thinking of putting it into my Opteron VM host if I don't need it for the NAS. Is 32GB sufficient for 6TB storage now and 12-18TB total storage later?

Thanks!

Greg
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
1a) L2ARC, maybe. At only 32GB of RAM that's a little stressy on the ARC. You shouldn't exceed maybe about 4x RAM, so limit yourself to a single 120GB or quite possibly don't even do it (see the sizing sketch after 1c). There is no need to mirror L2ARC; loss of L2ARC just means it reads from the pool.
1b) You always have ZIL. If you don't have a separate log (SLOG) device, it is just part of the pool.
1c) Only you can know your write load to judge whether or not you need SLOG. It is highly suggested for VM environments where you are writing lots of sync data, which should be "all the time." But you can always try without and then add it later.
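
As a rough back-of-the-envelope sketch of that 4x rule, using only the numbers already in this thread (the multiplier is a rule of thumb, not an exact limit):

Code:
# Rule-of-thumb L2ARC sizing: cap total L2ARC at roughly 4x RAM, since
# every L2ARC record needs a header held in RAM (ARC) to be useful.

RAM_GB = 32                 # current RAM in the proposed NAS
L2ARC_CAP_MULTIPLIER = 4    # "shouldn't exceed maybe about 4x RAM"
SSD_SIZES_GB = [120, 120]   # the two Samsung 840 Pros under consideration

max_l2arc_gb = RAM_GB * L2ARC_CAP_MULTIPLIER
proposed_l2arc_gb = sum(SSD_SIZES_GB)

print(f"RAM: {RAM_GB} GB -> suggested L2ARC ceiling: {max_l2arc_gb} GB")
verdict = "over" if proposed_l2arc_gb > max_l2arc_gb else "within"
print(f"Proposed L2ARC: {proposed_l2arc_gb} GB ({verdict} the ceiling)")

Re-run it with RAM_GB = 64 and the same two SSDs come in under the ceiling, which is the point made further down about adding both 120GB SSDs once you have 64GB of RAM.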

2) Wrong way to go. Use mirrors. Just set up two drives in mirrored mode and it'll be much more pleasant serving VMs than your RAIDZ1. You can add more vdevs to your pool later, so trying to use Z1 now and then redo it later is basically crazy and making lots of extra work for yourself. RAIDZ1 IOPS are essentially restricted to the speed of the slowest single component, which makes both writes and reads problematic. Not like "it won't work," just "it's slowish." A rough illustration is sketched below.
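
As a rough illustration, using the usual rule of thumb that a RAIDZ vdev delivers about one disk's worth of random IOPS while a pool of mirrors scales with the number of vdevs (the per-disk figure below is just an assumption for a 7200 RPM SATA drive):

Code:
# Rough random-IOPS comparison: one RAIDZ1 vdev vs. a pool of 2-way mirrors.
# Rule of thumb: a RAIDZ vdev behaves like roughly one disk for random I/O;
# striped mirrors scale with vdev count, and reads can be served by either
# side of each mirror. PER_DISK_IOPS is an illustrative assumption only.

PER_DISK_IOPS = 100   # hypothetical random IOPS for one 7200 RPM SATA disk

def raidz1_pool(disks: int) -> dict:
    # Disk count in a single RAIDZ1 vdev helps throughput, not random IOPS.
    return {"read": PER_DISK_IOPS, "write": PER_DISK_IOPS}

def mirror_pool(pairs: int) -> dict:
    # Writes scale with the number of mirror vdevs; reads roughly double that.
    return {"read": 2 * pairs * PER_DISK_IOPS, "write": pairs * PER_DISK_IOPS}

print("3-disk RAIDZ1:    ", raidz1_pool(3))
print("3x 2-way mirrors: ", mirror_pool(3))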

3) Peeling the battery off a 9266 doesn't make it an HBA. It merely makes it a eunuch RAID controller. RAID controllers are strongly discouraged for reasons outlined in the LSI sticky and elsewhere.

For VM storage, you will really be more comfortable with 64GB or quite possibly even more RAM. If you put in 64GB, then I would say you should be fine to add your two 120GB SSDs as L2ARC devices, giving you 240GB of L2ARC.

One further note about VM usage: restrict your pool to about 50% capacity to reduce fragmentation over time. That may mean you need to buy larger disks than you expected. Trust me, it pays off in the long run, especially if you have something transactional like mail or database that is ratcheting up the fragmentation pain level.
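
To put rough numbers on that 50% rule (purely illustrative, assuming 2-way mirrors and the 3 TB drives mentioned above):

Code:
# How much raw disk a VM pool needs if you want to keep it around 50% full.
# Assumes 2-way mirrors; disk size and the amount of VM data are examples.

DISK_TB = 3.0          # the 3 TB HGST drives from the build
TARGET_USED_TB = 6.0   # VM data you actually plan to store

usable_needed = TARGET_USED_TB / 0.50              # stay at or below ~50% capacity
mirror_pairs = int(-(-usable_needed // DISK_TB))   # ceiling division
total_disks = mirror_pairs * 2

print(f"To hold {TARGET_USED_TB:.0f} TB at <=50% full you want about "
      f"{usable_needed:.0f} TB usable,")
print(f"i.e. roughly {mirror_pairs} mirror pairs ({total_disks} x {DISK_TB:.0f} TB disks).")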

Anyways, welcome to the forums and thanks for coming in with a rational starting point. It is very disappointing when new posters appear wondering how they can make their old Pentium 4 with 1GB act as a 16TB ZFS NAS... it is much nicer making minor suggestions than it is trying to engineer an entire system for someone.
 

hammong

Dabbler
Joined
Mar 18, 2014
Messages
22
jgreco - Thanks for the detailed answers. I appreciate your time!

Regarding the L2ARC, thanks for clearing up the mirror vs. no-mirror decision. I had read specifically that the ZIL device should be mirrored because it holds uncommitted write data, and I assumed incorrectly that L2ARC should be mirrored too to catch corruption on the fly, but I suppose the ZFS checksums would spot any bad data on the way through when reading. Maybe I was lured in by some people using multiple L2ARC SSDs in a stripe for better read performance off the L2ARC. We'll start with no SLOG and no L2ARC and see where that leaves us performance-wise. Most of our VMs cache their own data in RAM, so almost everything read off the array is cache-miss traffic and VM boots.

I'll definitely go mirrored then vs. RAIDZ1. Just waiting for more HDDs to come into stock. I don't need more than 3TB initially, so a single pair of disks will do. I do have a question about simple 2-drive mirror performance, though: will ZFS interleave the reads so that I get double (or close to double) single-drive read performance, or do you need 4/6/8+ drives to get higher performance from striping across mirrors?

My 9266-8i has a true JBOD presentation mode; it won't be a bunch of 1-drive RAID 0 volumes emulating JBOD. I've done it with a test machine and an LSI 9560SE-24M8 that has 24 ports on it: FreeNAS saw the individual disks and could read the SMART data on them. I guess I should probably just get an M1015 and crossflash it to IT mode, and save the $700 RAID controller for another project.

I'll stick the extra 32GB of RAM in the NAS and go with 64GB. If nothing else, it should boost my ARC read hit rate. I can't go beyond 64GB without getting into registered DIMMs, and that's not an expense on the radar just yet.
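
If it helps anyone following along, a quick way to watch that hit rate on a FreeNAS box is to read the FreeBSD arcstats kstats; a rough sketch, assuming the standard counters are present:

Code:
# Sketch: compute the ARC hit ratio from FreeBSD's ZFS kstat sysctls.
# Meant to run on the FreeNAS box itself, with sysctl in the PATH.

import subprocess

def kstat(name: str) -> int:
    out = subprocess.run(["sysctl", "-n", name], capture_output=True, text=True)
    return int(out.stdout.strip())

hits = kstat("kstat.zfs.misc.arcstats.hits")
misses = kstat("kstat.zfs.misc.arcstats.misses")
total = hits + misses

if total:
    print(f"ARC hits: {hits}  misses: {misses}  hit rate: {100.0 * hits / total:.1f}%")
else:
    print("No ARC activity recorded yet.")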

Thank you!

Greg
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The advice for SLOG used to be to mirror it because on older ZFS versions a loss of SLOG would kill your pool. Today, that is still best practice but mostly only mission critical systems bother with the mirrored SLOG.

ZFS typically interleaves mirror vdev reads, which may not actually be ideal for a VM environment. I'm a big fan of concurrent accesses, and would actually prefer that each drive be fulfilling different read requests.

As to the RAID controller, you're doing the right things by checking SMART, etc. I strongly suggest that you validate your ability to access the disks on a standard SATA port, just in case your controller is lying to you. The main concern beyond SMART and RAID caching is inadvertently getting married to a controller that you cannot later swap out. As long as it truly acts as an HBA, hey, great.
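
If you want to script that SMART check, something like this rough sketch works, assuming smartmontools is installed; the device names are examples and will differ on your system:

Code:
# Sketch: confirm SMART health is readable on each disk behind the controller.
# Requires smartmontools (smartctl) and typically root privileges.

import subprocess

DEVICES = ["/dev/da0", "/dev/da1", "/dev/da2"]  # example device nodes; adjust

for dev in DEVICES:
    result = subprocess.run(
        ["smartctl", "-H", dev],   # -H reports overall SMART health
        capture_output=True, text=True
    )
    status = "PASSED" if "PASSED" in result.stdout else "check manually"
    print(f"{dev}: smartctl exit code {result.returncode}, health: {status}")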

Good luck.
 
Joined
Mar 24, 2014
Messages
1
jgreco said:
The advice for SLOG used to be to mirror it because on older ZFS versions a loss of SLOG would kill your pool. Today, that is still best practice but mostly only mission critical systems bother with the mirrored SLOG.


Hi.
This is true. From zpool version 19 you can remove the SLOG safely, but that's not the whole story. If a failure happens, you can still end up with data corruption. Often it can be repaired without problems, but not always. This might not be a big deal if you are using ZFS for NFS with a lot of small files, which ZFS can keep track of, and you can restore from a backup or a snapshot. But even with copy-on-write, when it comes to the big files that hold the VMs' disks, losing the ZIL can leave corruption inside the virtual machine's filesystem layer. EXT3, for example, suffers from this a lot. So that is the reason to have a mirrored SLOG: if the SLOG dies, your VMs can be corrupted.

Also, you need an SSD that can write all the data from its internal DRAM to flash if a power loss happens, so you need a supercap or something equivalent (a DRAM-based PCI device, etc.).

It's a better and safer choice to put in more RAM rather than risk data integrity by adding complexity. At some point, if you need to increase write performance, you may still have to use a SLOG, but you can wait, and keep your pool 50% free for best performance.

Regards,

Gianluca
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The only time that the ZIL is relevant is after an unexpected event (crash, panic, power loss). Under normal operation, a SLOG device is only being written to. A failure of the SLOG results in ZFS falling back to using the in-pool ZIL methods. There is a short window of several seconds where, if the SLOG device failed and then the system panicked, yes, there would be data loss. Otherwise, if your SLOG fails, the current transaction groups flush to disk and ZFS begins using the in-pool ZIL; there is no risk of data loss.

There is certainly no data loss if your SLOG fails and the system remains up; it just starts doing sync writes very slowly as the in-pool ZIL takes over. The ZIL/SLOG is NOT a write cache.
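
To make the sync-write distinction concrete from the application side, here is a minimal sketch (path and sizes are just examples): the fsync() call is what has to be acknowledged out of the ZIL, in-pool or on a SLOG, before control returns to the writer.

Code:
# Sketch: buffered writes vs. sync writes, from the application's point of
# view. On ZFS-backed VM storage, the fsync() path is what the ZIL (or a
# SLOG device) has to commit before the call returns.

import os
import time

PATH = "/tmp/sync-write-demo.bin"   # illustrative path
BLOCK = b"x" * 4096
COUNT = 1000

def timed_write(do_fsync: bool) -> float:
    start = time.time()
    with open(PATH, "wb") as f:
        for _ in range(COUNT):
            f.write(BLOCK)
            if do_fsync:
                f.flush()
                os.fsync(f.fileno())   # force the block to stable storage
    return time.time() - start

print(f"buffered writes: {timed_write(False):.3f}s")
print(f"fsync per block: {timed_write(True):.3f}s")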
 