Example recommended builds?

Status
Not open for further replies.

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
Is there a document showing recommended builds?
I'm looking to put together an array that can serve approximately 2 GB/s, or possibly two arrays at 1 GB/s each, to a pair of 16Gb FC switches.
The host application needs approximately 2 GB/s for running simulations.
The sims need about 120 TB and will be running for approximately 6 months to complete a single sim, so reliability is important.
Hosts are HP DL380 Gen9 boxes with 2x E5-2690 v4 processors and 192 GB of RAM.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Hardware thread. It's in my signature.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Have a look through the threads in my sig. They are a collection of what I deem the most important reading on the forum to get a good grasp of the hardware and of setting up a FreeNAS box.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The sims need about 120 TB and will be running for approximately 6 months to complete a single sim.
You need 6 months uninterrupted uptime? If so, you need to call iXsystems and ask for a TrueNAS quote.
 

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
You need 6 months uninterrupted uptime? If so, you need to call iXsystems and ask for a TrueNAS quote.
6 months doesn't seem like any big deal. Most Linux systems stay online for years before needing any reboots. That's not the hard part here; getting the 2 GB/s seems to be the tricky part.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
6 months doesn't seem like any big deal.
It's more a question of unforeseen circumstances, like a random motherboard failure. Everything fails eventually, and if a failure means you lose five and a half months of work (I'm guessing your environment isn't that sensitive, but you never know), that's a pretty crappy day, and one that could have been avoided with High Availability.

If you're feeling lucky, make sure to burn in the hardware for a very long time.

getting the 2 GB/s seems to be the tricky part.
Eeh, that's the part you can control by throwing hardware at the problem. FC makes things more complicated (it's not very popular with FreeNAS/TrueNAS), but 10GbE or 40GbE would both be viable, well-known solutions.

You'd probably be looking at triple-mirror vdevs for this, possibly SSDs. Need more performance? Just add more vdevs (and make sure you stay below 50% full, otherwise IOPS will drop - a lot).

I assume this would need SLOG, which means you need it in a form factor that can be replaced with the server running - that means a U.2 backplane for an Intel P3700 or two.

The biggest problem is the 120 TB... Just what kind of workload is the storage going to see? Because if you need SSD speeds from all 120 TB, your life is not going to be pleasant. If accesses focus on a sort of sliding window, with lots of repeated accesses, ARC and L2ARC can significantly speed up reads.
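For a rough sense of scale, here's a back-of-the-envelope sketch of what triple mirrors plus the 50%-full rule would imply. The 8 TB drive size and ~150 MB/s per-drive figure are assumptions for illustration, not a recommendation, and real numbers will be lower once ZFS overhead, sync writes and fragmentation enter the picture.

```python
# Back-of-the-envelope for triple mirrors with the pool kept below ~50% full.
# Assumed numbers (not from this thread): 8 TB drives, ~150 MB/s sustained each.
import math

DATASET_TB = 120        # working data set
MAX_FILL = 0.5          # keep the pool under ~50% full to preserve IOPS
DRIVE_TB = 8            # assumed drive size
DRIVE_MBPS = 150        # assumed sustained throughput per drive
MIRROR_WIDTH = 3        # triple mirror: 3 drives per vdev, capacity of 1 drive

pool_tb_needed = DATASET_TB / MAX_FILL         # 240 TB of pool capacity
vdevs = math.ceil(pool_tb_needed / DRIVE_TB)   # 30 vdevs
drives = vdevs * MIRROR_WIDTH                  # 90 drives
write_ceiling = vdevs * DRIVE_MBPS             # ~4500 MB/s, ideal sequential writes

print(f"{vdevs} vdevs, {drives} drives, ~{write_ceiling} MB/s sequential write ceiling")
```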
 

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
Our thinking is that FreeNAS would run on server-grade HP DL380 Gen9 or DL560 Gen9 boxes.
They use redundant SAS ports to the storage trays. I've never had an HP board go south in over 7 years of working with them. They are pretty fault tolerant and allow for hot swap of many parts, but not (obviously) the CPU, RAM or motherboard.
The thought is that we were going to run the client-side simulation software under ESXi, which allows us to tune performance dynamically between VMs. The "arrays" would be using HW RAID so that I can keep the workload off the CPUs and also allow for dynamic rebuilds of failed drives.
We are not planning to use SSDs; rather, we are looking at the 6 TB or 8 TB SAS 7.2k enterprise drives from Seagate. From the description, each drive should be able to support 100 MB/s, so a RAID-5 7+1 should achieve at least 500 MB/s. Using 4 RAID groups on separate storage trays gets us to the 2 GB/s throughput target while staying cost effective. We are still thinking about using RAID-6 instead, to get rebuilds without the headache of a second drive failure, brought on by the rebuild workload, taking the array down.
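Roughly, the napkin math I'm working from looks like this (illustrative only; it ignores the RAID-5/6 write penalty and IOPS limits):

```python
# Napkin math behind the 4-group RAID-5 layout above (illustrative only;
# it ignores the RAID-5/6 write penalty and random-IOPS limits).
PER_DRIVE_MBPS = 100    # vendor sustained figure for a 7.2k SAS drive
DATA_DRIVES = 7         # RAID-5 7+1: seven data drives plus one parity
GROUPS = 4              # one RAID group per storage tray

per_group_best_case = PER_DRIVE_MBPS * DATA_DRIVES   # 700 MB/s on paper
per_group_assumed = 500                              # padded down to "at least 500 MB/s"
total_target = per_group_assumed * GROUPS            # 2000 MB/s = the 2 GB/s target

print(per_group_best_case, per_group_assumed, total_target)
```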
 
Joined
Feb 2, 2016
Messages
574
The raw performance required is certainly available through FreeNAS given appropriate hardware.

What is your working data set size, Ron?

I understand your total data set is 120TB but how much of that is touched inside, say, a four-hour period? Is your data set primarily random or are you reading and writing large streams? I'm trying to figure out how much of your data is going to come out of cache as opposed to having to be read from storage.

Fibre Channel instead of Ethernet makes me a bit nervous given your uptime requirements. Ethernet is much better supported and tested. When you say six months to complete a simulation, certainly it can be interrupted? If interrupted unceremoniously, what are the ramifications?

SSDs are sweet but, with enough spindles, conventional disks should be able to provide the throughput required at a much lower price point. (If your storage were five times faster, could you finish in a third of the time and make twice the money? If so, SSDs are delicious.)

How are you accessing the data? NFS? CIFS? iSCSI?

Do you need snapshots and replication?

Cheers,
Matt
 

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
The simulator works mostly in RAM, but will be reading through the entire dataset and rewriting the entire dataset each iteration. Since the dataset doesn't fit in RAM, the reads and writes overlap: it moves in a portion of the data, recomputes, then writes it back out again, moving through the dataset as a whole.
The 100 TB is actually more like 120 TB. It ramps up to about 50% of that size pretty quickly, within a few days, then will be reading/writing that for the duration of the simulation. At this point I'm estimating about 6 months, but depending on the disk throughput it may be faster or slower. This is the reason for the high throughput requirement.
I could run this on a crappy storage system, but then it would take 1-2 years to complete the simulation. If I can get around 2 GB/s then I should be closer to the 6-month estimate.
Naturally, if I could get hold of 100 TB of RAM, this wouldn't be an issue and I could probably finish the simulation in a few days. The speed of the I/O (in MB/s) will be the major factor in the runtime of the simulation.
I've already run this simulation on smaller datasets to get a feel for how it works. Basically it has a cache built into it, so the reads and writes are not synchronized. It will fill the buffer with a large chunk of data and let the CPU recompute, which marks the data as dirty and forces it to get flushed back to disk.
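As a very rough model of why throughput dominates the runtime (the pass count below is hypothetical, chosen so that 2 GB/s lines up with the 6-month estimate):

```python
# Rough model of runtime vs. throughput. The pass count is hypothetical,
# picked so that 2 GB/s lands on the ~6-month estimate above.
DATASET_TB = 120
PASSES = 130            # hypothetical number of full read+rewrite iterations

def months_of_io(throughput_gb_per_s):
    bytes_moved = DATASET_TB * 1e12 * 2 * PASSES   # each pass reads and rewrites everything
    seconds = bytes_moved / (throughput_gb_per_s * 1e9)
    return seconds / (30 * 24 * 3600)

for gbps in (0.5, 1.0, 2.0):
    print(f"{gbps} GB/s -> ~{months_of_io(gbps):.1f} months of pure I/O")
```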
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The "arrays" would be using HW raid
That's not going to work with FreeNAS.
so that I can keep the workload off the CPUs
What else are they there for? The simulation is running elsewhere.
and also allow for dynamic rebuilds of failed drives.
What on earth is a "dynamic rebuild"? Whatever it is, ZFS does it better.
From the description, each drive should be able to support 100 MB/s, so a RAID-5 7+1 should achieve at least 500 MB/s.
That's an excessively simplistic assumption. IOPS are probably going to be a much bigger issue.
 

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
I wanted to use HW RAID to provide the reliability factor and the ability to keep the LUN active while the simulation is running, even with a failed drive. I've tested software RAID on Linux (md) and found that I could easily saturate the CPUs and slow things down.
As I understand RAID-Z/ZFS, it also provides some protection, and if I layer that on top of HW RAID then it should be even more reliable.
After all, how would the host know the difference between a physical disk and a LUN? Both come out of the controller the same way, just one is more reliable than the other. If I totally rely on ZFS and have any issues, I'm stuck. However, if I use HW RAID under ZFS, then either could fail and the other would protect me.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
As I understand RAID-Z/ZFS, it also provides some protection
Not "some" protection, all the protection that can be reasonably provided.
and if I layer that on top of HW RAID then it should be even more reliable.
No, the opposite is true.

ZFS is designed from top to bottom to interface as directly as possible with disks. That means absolutely no RAID controller. Off the top of my head, here's a sampling of everything that will go wrong, otherwise:
  • Conflicting, uncoordinated caching mechanisms will absolutely destroy performance.
  • Errors detected by ZFS are impossible to correct because ZFS doesn't have any redundancy to work from.
  • Should the array stall for whatever reason, ZFS will not be happy and, at the very least, make the stall longer. Using ZFS properly avoids this scenario because ZFS will simply drop any offending drives and carry on normally.
  • FreeNAS has exactly zero facilities for dealing with HW RAID controllers and whatever stupid software they might need. Disk failure notifications? Nope, you'd better hope someone notices the red light that the HW RAID controller hopefully turned on.
After all, how would the host know the difference between a physical disk and a LUN? Both come out of the controller the same way
Wrong, a physical disk has SMART data, for one. Good luck monitoring the drives without proprietary crap.
If I totally rely on ZFS and have any issues, I'm stuck.
ZFS doesn't randomly have issues. It's an immensely tested filesystem.
However, if I use HW RAID under ZFS, then either could fail and the other would protect me.
Wrong, either fails and you're screwed. You just added points of failure and complexity to the system, with no advantage.
 
Joined
Feb 2, 2016
Messages
574
each drive should be able to support 100 MB/s, so a RAID-5 7+1 should achieve at least 500 MB/s.

Seems optimistic. At 80% reads, it might be seeing 500 MB/s. As soon as you get into a 50:50 mix of reads and writes, it'll be more along the lines of 320 MB/s according to an online RAID calculator.

I've laid out the disks a half dozen different ways and the only conclusion I've come to is that you're going to need a lot of spindles. I'd go with 12 RAIDZ1 groups of six drives each. With that many drives, my back-of-a-napkin math says you'll meet your throughput and space requirements (with 3TB or 4TB drives). IOPS won't be great - under 2,000 - but you haven't listed that as a requirement.
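The napkin math, spelled out (the per-drive throughput and IOPS figures are rough assumptions, not measurements):

```python
# Napkin math for 12 x 6-drive RAIDZ1 (rough per-drive assumptions).
VDEVS = 12
DRIVES_PER_VDEV = 6
DRIVE_TB = 4            # 4 TB drives; 3 TB drops capacity proportionally
DRIVE_MBPS = 100        # sustained sequential, 7.2k spinner
DRIVE_IOPS = 150        # rough random-IOPS figure for a 7.2k spinner

data_drives = DRIVES_PER_VDEV - 1              # RAIDZ1: one drive of parity per vdev
raw_tb = VDEVS * data_drives * DRIVE_TB        # 240 TB before ZFS overhead
stream_mbps = VDEVS * data_drives * DRIVE_MBPS # ~6000 MB/s, ideal sequential
pool_iops = VDEVS * DRIVE_IOPS                 # a RAIDZ vdev does roughly one drive's IOPS: ~1800

print(raw_tb, stream_mbps, pool_iops)
```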

Also of note, that's raw disk speed in the best case scenario. Protocol overheads and reality are going to slow it further. How are you attaching your processor nodes to FreeNAS? NFS? CIFS? iSCSI?

Cheers,
Matt
 

Ron Watkins

Dabbler
Joined
Oct 27, 2016
Messages
13
The FreeNAS node(s) will be connected via dual 8Gb or 16Gb FC ports. If the numbers are as bad as you suggest, a pair of 8Gb ports per FreeNAS node will be more than sufficient, as each will yield only 800 MB/s max. Do you really think that if I use 6 boxes, each with 2 RAIDZ1 sets of SSDs, they will only pump out around 2 GB/s total across all 6? That seems odd and quite a bit lower than the manufacturer suggests. They seem to think that you can push 500 MB/s per SSD.
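Here's the arithmetic I'm questioning, spelled out (the per-port and per-SSD figures are nominal/vendor numbers, not measurements):

```python
# The arithmetic in question (per-port and per-SSD figures are nominal/vendor numbers).
NODES = 6
FC_PORTS_PER_NODE = 2
FC8_MBPS = 800          # usable throughput of one 8 Gb FC port
SSD_MBPS = 500          # manufacturer's sequential figure per SSD

fc_per_node = FC_PORTS_PER_NODE * FC8_MBPS   # 1600 MB/s ceiling per node
fc_total = fc_per_node * NODES               # 9600 MB/s across all 6 nodes
ssd_vdev = 5 * SSD_MBPS                      # one 6-wide RAIDZ1 of SSDs: ~2500 MB/s on paper

print(fc_per_node, fc_total, ssd_vdev)
```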
Also, I'm not that familiar with SuperMicro, but our HP rep said they are pretty bad compared to HP (naturally, since he's the HP rep).
He suggested using a pair of HP DL380 Gen9 boxes with dual E5-2690 v4 chips and 128 GB of RAM, each with 24 of the 3.82TB enterprise SSDs.

I'm assuming HP boxes are supported, along with the QLE 2562 controllers?
 

Simon Sparks

Explorer
Joined
May 24, 2016
Messages
57
24 of the 3.82TB enterprise SSDs will NOT meet your storage requirements unless you are going to be using compression and deduplication on FreeNAS, which will need HBAs and direct disk access to work correctly. NOT a RAID card. NEVER a RAID card.
 

Simon Sparks

Explorer
Joined
May 24, 2016
Messages
57
1 x 8 Gigabit per second Fibre Channel port, after the 8b/10b encoding, provides roughly 800 Megabytes per second
1 x 16 Gigabit per second Fibre Channel port, after the 64b/66b encoding, provides roughly 1600 Megabytes per second
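Spelled out, using the nominal FC line rates (8.5 and 14.025 Gbaud, which are slightly higher than the marketing "8" and "16"):

```python
# The encoding arithmetic above, spelled out.
# 8GFC runs at 8.5 Gbaud with 8b/10b; 16GFC at 14.025 Gbaud with 64b/66b.
def usable_mb_per_s(line_rate_gbaud, efficiency):
    return line_rate_gbaud * 1e9 * efficiency / 8 / 1e6   # bits -> bytes -> MB

print(round(usable_mb_per_s(8.5, 8 / 10)))      # ~850 MB/s (commonly quoted as ~800)
print(round(usable_mb_per_s(14.025, 64 / 66)))  # ~1700 MB/s (commonly quoted as ~1600)
```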

Should you lose one of your 8 Gigabit per second Fibre Channel connections, you will have destroyed your predicted timescale for the simulation to complete.

It is always best to over-provision the links in case of failure, because you should be designing for High Availability, NOT load balancing.
 