I'm building a high-performance ZFS SAN for education


jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
I'm looking to bounce what we're doing off some knowledgeable people. I figured this would be the place to look.

You can see a graphical representation of what we're doing and our specs here.

https://docs.google.com/file/d/0B3_C5r_n0NxLMUp6c1RrejNKMVk

We've been running some sort of ZFS SAN since 2008, and performance has almost always been an issue for us. We use the storage for typical stuff: NFS shares for home directories and NFS shares for VMware shared storage.
We don't have huge data in the grand scheme of things, but we have thousands of staff and students with lots of little preference and home directory files. We never come close to maxing out our gigabit Ethernet, but with so many little files the lack of disk IOPS has been killing the SAN's performance. I ran an ls -R | wc -l on the home directories last night and we're at about 19 million files.

Also, whatever bad luck could come our way with storage has. In 2009 we had some sort of power event. At the time we had a SAN which backed up to another SAN, and both of those drive shelves had SAS drives in them. When the power event happened it killed over 30 SAS drives across the two shelves.
Luckily we only lost one week's worth of work, because we had a third backup on a SATA shelf.

We recently invested in an APC Symmetra, so hopefully we won't see that issue again. But because of our past issues we are very data-paranoid. Also, with literally millions of small files, backing up has been a nightmare. If we needed to restore an entire pool from a backup it could take weeks to copy all the data, not because it's a lot size-wise, but because with so many small files the copy speed is at a crawl.

So our plan is to move around some existing hardware, and purchase some new hardware.

The new hardware is:
Head unit:
Supermicro SYS-2027R-72RFTP
128GB of RAM
Single 2.40GHz Xeon E5-2609 4-core processor
STEC SAS ZeusRAM for ZIL
Hopefully a STEC SAS ZeusIOPS for L2ARC
Both connected to the LSI 2208 on the motherboard

JBOD:
Supermicro 216BAC-R920LPB
24 x 512GB Samsung 840 Pro SSDs, connected to the
LSI 9207-4i4e PCIe SAS2 controller in the head unit
The 512GB Samsung 840 Pro SSDs seem to be sold out everywhere, though. Since I'm in a time crunch to get this done, we may be looking at other SSDs.

Running FreeNAS - installed on a 4GB Apacer SATADOM SLC SSD
Pool set up as RAIDZ2, 4 devices per vdev, 6 vdevs total
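For reference, the equivalent zpool layout would look roughly like this (da0 through da23 are placeholder device names; in practice we'd build it through the FreeNAS GUI):

  zpool create tank \
    raidz2 da0 da1 da2 da3 \
    raidz2 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 \
    raidz2 da12 da13 da14 da15 \
    raidz2 da16 da17 da18 da19 \
    raidz2 da20 da21 da22 da23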

This "hotstor" SAN will do a zfs send to the "Warmstor" san every 5 minutes.
Also it will back up to the offsite "coldstor" nightly.
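The 5-minute replication would be incremental snapshots, something along these lines (the hostname, dataset names, and snapshot names here are made up; FreeNAS's periodic snapshot and replication tasks do essentially this for you):

  # take a new recursive snapshot, then send only the changes since the last one
  zfs snapshot -r tank@auto-1205
  zfs send -R -i tank@auto-1200 tank@auto-1205 | ssh warmstor zfs receive -dF backup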

So worst case, if the hotstor SAN goes down or loses its pool, we can remap the servers to use the warmstor and be up and running in a few minutes. If we lose both pools we'll have the offsite backup.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Opinions:

-Get a CPU with HT. I've been very disappointed with the CPUs that don't have HT. I wouldn't have expected it to matter "that" much, but it does. http://www.newegg.com/Product/Product.aspx?Item=N82E16819117269 is probably about the same but with HT. You said only 1 CPU, which I think is a wise choice. Keep yourself open to adding a second CPU (and utilizing the extra RAM slots) if necessary. For you, the bottleneck is probably your zpool and not the CPU.

-If you are going all-SSD I'm not sure a ZIL or L2ARC is going to be helpful. In theory it will make your numbers go higher, but I doubt it's really going to matter with only 1Gb LAN connections. SSDs can do very high IOPS, and you're trying to cache SSDs with "higher speed" SSDs. You can probably save some money by skipping the ZIL and L2ARC. Remember that the main bottleneck is usually the physical limitation of moving hard drive parts, and that goes away with an all-SSD zpool.

-If you have a fixed budget, I'd probably move some of the cash spent on the ZIL and L2ARC to more RAM, or at least larger sticks of RAM with some slots left free in case you want to upgrade later.

I'm not sure how fast or slow your Warmstor SAN is, but now that you are removing the bottleneck from the Hotstor you may need to change how often the zfs send occurs. If the Hotstor isn't slow anymore you can expect activity to increase. Some people have had problems where one zfs send doesn't finish before the next one starts, and then things can get messed up, so be careful and keep an eye on it. I'm not sure if that was user error, a ZFS issue, or just something administrators are supposed to know and avoid.
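If the replication ends up being scripted by hand, even a crude lock file keeps one run from starting while the previous one is still going. Just a sketch; the snapshot names, paths, and hostname are placeholders:

  #!/bin/sh
  # crude guard so two replication runs never overlap
  if [ -e /var/run/repl.lock ]; then
      exit 0
  fi
  touch /var/run/repl.lock
  # tank@prev and tank@now stand in for the real snapshot names
  zfs send -i tank@prev tank@now | ssh warmstor zfs receive -F backup/tank
  rm /var/run/repl.lock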
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134

Thanks for taking the time to read over my stuff and give me some tips. I'll take your advice on the processor. As for the ZIL and L2ARC, the reason I went for those even though we're doing an all-SSD pool is that it sounds like SSD performance drops off over time, and the STEC stuff is supposed to keep the same IOPS for the lifespan of the drive. I know it's a lot more money, but I'm going after our performance issues with a nuclear bomb instead of snipers. I'm just tired of it.
I'd get some larger RAM sticks, but I couldn't find anything bigger than 16GB at that RAM speed. I'll look around some more. Thanks again!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
RAM speed won't matter compared to RAM quantity. For example, if the choice is 64GB of 1600 or 128GB of 1333, I'd recommend the 128GB of 1333 any day.

I always recommend that people not spend money on a ZIL or L2ARC until they determine they need it; home users almost never do anyway. They're easy to add later, but when problems do come up, both add a lot of complexity. FreeNAS 9 includes some utilities that will help you determine whether they will even help.
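Even before that, the ARC counters are sitting right in sysctl on FreeBSD, so you can at least get a feel for your hit rate (a rough check, not a benchmark; the counters live under the kstat.zfs.misc.arcstats tree):

  # a consistently high hits-to-misses ratio suggests L2ARC won't buy you much
  sysctl kstat.zfs.misc.arcstats.size
  sysctl kstat.zfs.misc.arcstats.hits
  sysctl kstat.zfs.misc.arcstats.misses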
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Yeah, the last box we bought from a "VAR" came set up with 16 7200 RPM SATA drives in 2 RAIDZ2 vdevs of 8 drives each. It came with an L2ARC SSD but no ZIL SSD, and 32GB of RAM.

Needless to say, having NFS send sync writes to a pool that could only handle maybe 200 IOPS was a problem. Not this time! :)

Thanks for the tip on the slower RAM. That might open up some funds for more RAM!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Agreed that the non-HT CPUs seem to be kind of lethargic, though I'm working from an exceedingly small sample size in making that statement. But since you were suggesting it: ours is an E5-2609 too, and it seems to compare poorly against our common E3-1230s.

If I had to pick an E5 for file service, I'd consider the E5-2643, but I haven't actually laid hands on one, so this is theoretical and based on the fact that it more closely resembles the E3-1230 that people find to be a nice CPU. I disagree on the 2620, mainly because core speed ends up being important to some aspects of a fileserver, like CIFS.

That machine looks gorgeous for your purposes. I have something similar on the bench, a Supermicro 846 chassis with X9DR7-TF+. So I have some specific advice based on six months of tinkering with it.

1) It supports 768GB of RAM, but only 384GB with a single CPU, and only 256GB if you want 1600 and not just 1066. And with 32GB parts costing so much more than similar 16GB parts, $5000-$6000 for 256GB is frustratingly expensive when 128GB is only $1000ish. With a single CPU, you are limited to 128GB of fast RAM or 192GB of slow RAM unless you've got a large budget. I'm assuming that part is also compatible with your board but haven't verified that.

2) An all-SSD pool is still somewhat slow for ZFS sync write purposes, so a separate SLOG device is a good idea. I'm a little more skeptical of the need for L2ARC.

3) Do not connect any ZFS pool drives to the LSI 2208 on the system mainboard; you already have the 9207-4i4e for that. I expect that'd be a reasonable choice. I haven't actually used one with ZFS, but it's a 2308-based controller and I believe it can be configured to look like an HBA.

4) And now, for something clever. Omit the ZeusRAM you wanted to buy. Instead, buy the BBU for the onboard LSI 2208. Then attach the 2208 to a conventional hard drive sufficiently fast to handle your sync write load. The BBU plus cabling should be about $300. Now the thing is, by doing this, you gain a SLOG device that has virtually unlimited write endurance, and the latency of having the mpt driver shove data out to the RAID controller write cache is very low compared to having a write command issued and completion waited-for over the SAS channel. Bonus: the cache is MUCH faster than 6Gbps, so your SLOG device can accept the first NNN megabytes of sync writes at incredible speed.
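Once the 2208 exports that drive as a single logical volume, ZFS just sees another disk. Assuming it shows up via the mfi driver as something like mfid0 (the device name will vary, and "tank" here stands in for whatever your pool is called), adding it as the SLOG is just:

  # add the RAID-cache-backed drive as a dedicated log (SLOG) device
  zpool add tank log mfid0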

I'm doing something entirely different than you, so I'll explain a bit first: we're using the 2208 with a pair of Seagate Momentus XTs in RAID1 for ESXi local storage and boot. Since most of our ESXi storage is SAN-based, the 2208 is only lightly used. The system is slated to handle a bunch of VMs and also a FreeNAS instance for general file service. There is normally no need for a SLOG device there, but by just adding a virtual disk on the local ESXi server, I magically get a SLOG device with awesome characteristics... and since the 2208 is usually not at all busy, I get a SLOG device that can sustain almost 50MB/sec (craptacular throughput on the XTs) but bursts the first few hundred MB in a second.
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Why only 4 devices per vdev? That doesn't sound very efficient in either disk I/O or CPU usage.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Why only 4 devices per vdev? That doesn't sound very efficient in either disk I/O or CPU usage.

Fantastic fault tolerance combined with awesome performance potential? That seemed clear from the first post. If you are not a home user, doing things as cheaply as possible can be a secondary concern. For example, I wanted 30TB of storage... and basically 12 4TB drives (11 in Z3 plus a hot spare) is about $2000, so adding another $180 for a cold spare doesn't really show as a significant expense. Yet it likely seems excessive to you to have 13 disks when only 9 would do the same thing in RAIDZ1. The extra cost may save time and effort later, and is totally cool for our purposes here.
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Yes, I see that. But still, 4 devices in RAIDZ2? I would say 6 or even 8 would still be perfectly acceptable with dual parity.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
For reliability, yes. For performance AND reliability... no.

His workload is lots of small writes, and that demands smaller vdevs. For home users, where the writes are 5GB+ files, a wider array can still provide excellent performance and reliability.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Again, I'll repeat it: "Fantastic fault tolerance combined with awesome performance potential."

Your suggestion doesn't get there. A wider RAIDZ2 is slower than several striped RAIDZ2 vdevs. And since SSDs have limited write endurance, the likelihood of several failing in close temporal proximity could be substantially higher. Also, when a vdev is degraded or rebuilding, the performance impact is confined to a smaller slice of the pool. It makes perfect sense to build smaller units that are protected more aggressively. Your suggestion makes more sense if you are less focused on minimizing failures and can accept lower performance.
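To put rough numbers on it, using the usual rule of thumb that a RAIDZ vdev delivers about the random IOPS of a single member device: his 24 SSDs as 6 x 4-disk RAIDZ2 is roughly six SSDs' worth of small-block IOPS with 12 data disks of capacity, while 3 x 8-disk RAIDZ2 is roughly three SSDs' worth with 18 data disks. He's trading capacity for IOPS and fault tolerance, which for a small-file NFS workload is the right trade.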
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
2) An all-SSD pool is still somewhat slow for ZFS sync write purposes, so a separate SLOG device is a good idea. I'm a little more skeptical of the need for L2ARC.
I thought all-SSD pools were quite performant? Can't say I've personally tried any. The rule of thumb for a SLOG or L2ARC is that it should be an order of magnitude faster than the main pool devices, though there may be specific workloads where one or the other is still beneficial. In this case not only will the BBU-backed cache be faster, but it will help reduce some wear and tear on the pool drives.

Depending on how paranoid you want to be, you may want to "prematurely age" the SSDs by differing amounts and then spread them out across different vdevs.
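By "prematurely age" I just mean burning a different amount of writes through each drive before the pool is ever created. Something as crude as dd against the raw device works (destructive, obviously; da5 and the amount written are placeholders, and you'd vary the count per drive):

  # burn roughly 200GB of writes into one SSD before it joins the pool (DESTROYS any data on it)
  dd if=/dev/random of=/dev/da5 bs=1m count=200000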
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
All-SSD pools aren't a replacement for a SLOG. The way ZFS works, the sync write that pushes ZIL data out to a RAIDZ2 vdev is substantially less efficient than a single write to, and wait for, a SLOG device. It is counterintuitive until you consider that the pool is designed to be writing large blocks of data, up to 128KB in the case of v28; sync-writing a 512-byte block turns inefficient quickly.

My expectation would be that an SSD pool renders L2ARC largely useless, though. Read speeds from a single SSD are often pretty good even if you don't have the fastest thing available, and I would imagine that a pile of them would easily outpace a single L2ARC device. Given the price of the STEC stuff, I'd be more in favor of going big on RAM and relying on pool speed. A 200GB STEC ZeusIOPS is $2000. You can get 256GB of system memory for $2000, though you probably need a second CPU to do it inexpensively. But the system RAM has massive speed in its favor, and there's no L2ARC to warm up when the system starts - every pool read gets stuffed into ARC. And an all-SSD pool should read fast.

So I think my suggestion is: 128GB RAM, the E5-2643, the SSD pool, and hijack the 2208 for SLOG with a fast-ish small SAS drive. If there are problems, add a second E5-2643 (just to enable more memory) and a second 128GB of RAM. If that's still insufficient for ARC, it can be bumped up to 384GB of RAM at the cost of the system memory dropping in speed to 1066.

I bet the substantial cost savings from avoiding the STEC devices pays for the E5 and the extra RAM.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Thanks for all the advice, guys. It really has been helpful. I'm sending out the quote request tonight, and we'll see what the vendors come back with. I think I'm going to hold off on the ZeusIOPS as L2ARC and do the second processor and additional RAM instead, and just have a giant ARC.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
3) Do not connect any ZFS pool drives to the LSI 2208 on the system mainboard; you already have the 9207-4i4e for that. I expect that'd be a reasonable choice. I haven't actually used one with ZFS, but it's a 2308-based controller and I believe it can be configured to look like an HBA.


Yeah, I'm going to connect the zpool to the 9207; I was only going to connect the ZeusRAM to the 2208. Maybe this is old-school thinking and maybe I'm incorrect, but the ZeusRAM is SAS and the SSDs are SATA, and I seem to remember there being an issue with mixing SAS and SATA. So I figured I'd just put them on totally different controllers :)
 