BUILD New high-end build with FreeNAS


notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
I am going to be building a large storage array for the company and wanted to get some advice from you guys, as I will be using FreeNAS to power the build.

The main requirements for me are data throughput, both reads and writes, and making sure that it is a rock-solid build and fully redundant.

I am considering two systems with identical specs; the specs I'll be listing are per server. I plan on using ZFS send/receive to keep these machines in sync.

The platform I've picked is the Supermicro chassis. Here is a link to it: https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR60L.cfm
Specs of what will be in each server:

Memory: 256GB DDR4 2400MHz ECC Registered
CPU: Intel Xeon, 6 cores, 1.7GHz, 15MB cache
Storage drives: 30x 6TB drives (every two 6TB drives paired as a mirror; the mirrored pairs together make up the overall storage pool)
ZIL drives: 2x 1TB MLC SSDs (mirrored)
L2ARC drives: 2x 1TB MLC SSDs
HDD connectivity: 2x LSI 3008 HBAs in IT mode (JBOD)

Networking: 1 NIC w/ 2x 10Gb ports

Is there anything I might be missing? Anything I should be concerned about, or anything you guys would change about the build?

Thanks in Advance!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Using a 1TB ZIL is way overkill. Plus, you want something that has both power-loss
protection and a high write-endurance (writes per day) rating. The actual size is based on
a formula which goes something like, (don't quote me :), 2 x maximum throughput
for 5 seconds. Also, the need for one is highly application specific, meaning some
uses don't benefit from one at all.
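To put rough numbers on that rule of thumb, here is a back-of-the-envelope sketch (my own illustrative math, assuming the network line rate is the fastest the pool can ingest sync writes; the 5-second window and 2x factor are just the folklore figures above):

```python
# Rough SLOG sizing sketch based on the "2 x max throughput for 5 seconds" folklore.
# Assumption: the network is the ingest bottleneck for synchronous writes.

def slog_size_gib(link_gbps, seconds=5.0, factor=2.0):
    """Rough SLOG size in GiB for a given ingest rate in gigabits per second."""
    bytes_per_second = link_gbps * 1e9 / 8
    return factor * bytes_per_second * seconds / 2**30

# Two 10Gb ports flat out: roughly 23 GiB -- nowhere near 1TB.
print(round(slog_size_gib(20), 1))
```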

It's better to maximize RAM than to add an L2ARC, since some RAM is used for the
directory of the L2ARC anyway.

The term you are looking for in regards to using a pair of disks is a Mirrored vDev.
ZFS uses the term vDev, (virtual device), for the sub-units of a pool. A pool made up
of Mirrored vDevs tends to work better for VM storage and applications that need
high IOPS. Each vDev gives roughly the IOPS of its slowest component, (disk).
Thus, a pool of 15 vDevs has the IOPS equivalent of 15 disks, pretty good.
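As a rough illustration of that IOPS point (the per-disk figure is just a typical ballpark for 7200rpm drives, not a measurement):

```python
# Ballpark pool IOPS for mirrored vDevs: each vDev contributes roughly the
# IOPS of a single disk, and the pool sums its vDevs.

disk_iops = 150      # assumed ballpark for one 7200rpm SATA drive
vdevs = 15           # 30 drives arranged as 15 mirrored pairs

print(vdevs * disk_iops)   # ~2250 IOPS for the pool, vs ~150 for one wide RAIDZ vdev
```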

You might want to mention your use case, and your network topology. Some people think
that aggregating 2 network ports on a FreeNAS server will double throughput for a
single client; it won't.

PS. The real name for the ZIL drives is SLOG, (Separate intent LOG). All ZFS pools
come with a ZIL inside the pool unless a SLOG is configured. Using the word
SLOG automatically implies a separate device, unlike ZIL.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
What the Half-Elven lady said... we really do need to know your intended use of the server.

If I understand you correctly, your plan is to use 15 mirrored pairs in your pool. That may or may not be appropriate... mirrors generally give the best performance, but are horribly space-inefficient -- 50%! So your 30 x 6TB drives will only yield the storage capacity of 15 x 6TB drives. And bear in mind that you will lose your data if both drives in any particular vdev fail. ZFS does provide for 3-way mirrors, which would increase the safety factor - but at the cost of even less space efficiency - 33 1/3%!

If you're planning on using the server for general-purpose file sharing, you may want to consider using RAIDZ2 or RAIDZ3, as these are both more space-efficient. Some examples:
  • Pool = 5 x 6-disk RAIDZ2 vdevs (2 parity drives per vdev): 66 2/3%
  • Pool = 3 x 10-disk RAIDZ3 vdevs (3 parity drives per vdev): 70%
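Here's the same comparison as quick napkin math, assuming 30 identical 6TB drives and ignoring ZFS overhead:

```python
# Usable-space sketch for 30 x 6TB drives under the layouts discussed above.
drives, size_tb = 30, 6

layouts = {
    "15 x 2-way mirrors":  (15, 2, 1),   # (vdevs, disks per vdev, redundant disks per vdev)
    "5 x 6-disk RAIDZ2":   (5, 6, 2),
    "3 x 10-disk RAIDZ3":  (3, 10, 3),
}

for name, (vdevs, width, redundancy) in layouts.items():
    usable = vdevs * (width - redundancy) * size_tb
    print(f"{name}: {usable} TB usable ({usable / (drives * size_tb):.0%})")
# -> 90 TB (50%), 120 TB (67%), 126 TB (70%)
```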
@Arwen is right about the L2ARC; you don't need one unless you've installed the maximum memory supported by your motherboard and still have performance issues. And you probably won't need a SLOG device unless you're providing block storage. Typically this is to one or more hypervisors for running virtual machines.

Good luck!
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
Thanks for the replies...This is exactly the sort of thing that I am looking for!

Workloads:
The storage array is mainly going to be used to work with a lot of heavy VR videos. We might be rendering, encoding, or otherwise working directly off this storage array. I can easily see a few hundred Mbps being sustained between each of the editors and/or encoding machines. Right now we will have 1 video editor and 3 machines accessing the array. I can easily see that jumping up to 2 or 3 editors and 5 or 6 encoding machines. We might actually end up using even more network bandwidth if we start building encoding machines with dual GPUs and dual CPUs. There will be a lot of writing and reading taking place at the same time.

Networking:
The dual-port NIC is just because. I wasn't planning on using port spanning, and I knew that it would not double the speed that the storage array can achieve.

Storage Configuration:
I know that by using the above configuration I am losing 50% of the space. The reason for this configuration is that if a drive does fail, it is easily replaced without any sort of RAID rebuild process taking place, which would hinder the operation by slowing everything down. Sure, it'll have to re-sync the data to the new drive, but that's a lot less intensive a task than resilvering a parity array and takes a lot less time. Supermicro provisions with a minimum of 30 drives. That'll give me 90TB worth of storage to start, with another 90TB of storage available later (or more if I use bigger drives to create the new vdevs). I don't plan on using the dedupe option, so that should help out with the memory. Would 256GB of memory be a good starting point for this configuration?

I don't foresee block storage being needed yet, but who knows about the future.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Would 256GB of memory be a good starting point for this configuration?
As with so very many things in life... 'it depends'. :smile: But yes, sir, 256GB is a good start. The rough rule-of-thumb is "16GB + 1GB for every 1TB of storage", which you meet handily. However, given your workload, you may need more RAM. The savvy thing to do would be to leave yourself plenty of memory slots so that you can add additional memory in the future, i.e., don't fully populate the board with smaller-capacity RAM modules.
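For what it's worth, here's that rule of thumb applied to this build (a sketch only, counting the usable mirrored capacity as 'storage'):

```python
# "16GB + 1GB per TB of storage" rule-of-thumb check for 15 x 6TB mirrored pairs.
usable_tb = 15 * 6             # 90 TB usable
suggested_gb = 16 + usable_tb  # ~106 GB suggested
print(suggested_gb, "GB suggested; 256GB leaves plenty of headroom")
```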

You've obviously given this design some thought... kudos!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You might plan ahead and get a card with dual 10Gbps & dual 25Gbps ports. Even
if you don't use the 25Gbps ports to a switch, you can use one of them on each server
as the inter-connect. Simply set up a private subnet and a private host name for each host,
then use that for the replication. This also keeps the replication traffic off the main port.

Of course, you could implement such a scheme with your second 10Gbps port instead.

Plus, if you ever do need more performance, you can use the 2nd 25Gbps port to a newer
switch as FreeNAS's main port.

If I have the choice, I never buy an under-provisioned card, (Fibre Channel, network or
SAS), if I can help it. So even if you don't get a network card with dual 25Gbps ports,
you could get a 4 x 10Gbps card instead. That gives you the option of multiple subnets
that serve your different needs, (editors versus encoders), and a free port for the inter-connect
to your other server.
 
Joined
Feb 2, 2016
Messages
574
Would 256GB of memory be a good starting point for this configuration?

Absolutely, yes. The entire configuration looks solid. You did your research.

As others have noted, examine the SSDs you're considering and do the math to see if they have the latency, endurance and power characteristics necessary. Using a much larger SSD than needed for SLOG is fine if you're doing it for over-provisioning and the enhanced write endurance, not because you actually expect to use all that space for SLOG.

L2ARC may not be much use if your most recently used or most frequently used files are all in excess of its size. The more testing and experience I have with L2ARC, the less useful I find it for the data sets I've thrown at it. Of course, I've never thrown video its way and don't have nearly as much RAM as you do. Still, I'd do performance testing without it first and see if it meets your needs.

For RAM, if raw performance is the overriding goal, you'll need to look at your active data sets. How much data is actually moving around? Are you editing a minute at a time? Five minutes? Grading an entire movie? If you can keep your entire active data set or often-used elements in ARC by jumping from 256GB to 512GB, that's the first upgrade I'd make. On the other hand, if 2TB of RAM still wouldn't be enough to keep it all in ARC, I'm not sure it would be worth upgrading to even 512GB.

Cheers,
Matt
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Lots of good advice so far, and this is a bit out of my experience level, but one thing I can recommend (and I believe @MatthewSteinhoff will chime in on this) is that you don't need a similarly spec'd backup target. Save your money and invest more in the main machine if need be.

Replication (built into FreeNAS, and the best way to back up from one ZFS system to another) is very efficient and extremely light on the CPU. If the replication target machine is solely going to be used for that, you can get by with minimal specs, letting the backups run overnight without the machine breaking a sweat.
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
Absolutely, yes. The entire configuration looks solid. You did your research.

As others have noted, examine the SSDs you're considering and do the math to see if they have the latency, endurance and power characteristics necessary. Using a much larger SSD than needed for SLOG is fine if you're doing it for over-provisioning and the enhanced write endurance, not because you actually expect to use all that space for SLOG.

L2ARC may not be much use if your most recently used or most frequently used files are all in excess of its size. The more testing and experience I have with L2ARC, the less useful I find it for the data sets I've thrown at it. Of course, I've never thrown video its way and don't have nearly as much RAM as you do. Still, I'd do performance testing without it first and see if it meets your needs.

For RAM, if raw performance is the overriding goal, you'll need to look at your active data sets. How much data is actually moving around? Are you editing a minute at a time? Five minutes? Grading an entire movie? If you can keep your entire active data set or often-used elements in ARC by jumping from 256GB to 512GB, that's the first upgrade I'd make. On the other hand, if 2TB of RAM still wouldn't be enough to keep it all in ARC, I'm not sure it would be worth upgrading to even 512GB.

Cheers,
Matt

Thanks for the thoughtful reply. Video files can be anywhere from 10GB to 50GB. The main editor and I have been discussing things, and we might end up not using videos at all, but rather rendering the videos to image sequences. Each video is 60fps and each frame is about 50MB, give or take, so there will be millions of files. I suspect there might be a hit on the initial caching, but if the person is reusing those image sequences then it should be fine. Maybe 512GB of memory would be worthwhile to help cache all these images/videos.

I definitely like your approach, though. I'll start off with 256GB of memory and see how that works out first. I believe it should be enough to hold the data the editors will be working with in memory; if not, I'll upgrade to 512GB and see how much of an improvement I gain. After that, it's either another CPU + memory or an L2ARC cache.
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
Lots of good advice so far, and this is a bit out of my experience level, but one thing I can recommend (and I believe @MatthewSteinhoff will chime in on this) is that you don't need a similarly spec'd backup target. Save your money and invest more in the main machine if need be.

Replication (built into FreeNAS, and the best way to back up from one ZFS system to another) is very efficient and extremely light on the CPU. If the replication target machine is solely going to be used for that, you can get by with minimal specs, letting the backups run overnight without the machine breaking a sweat.

The backup is pretty important. We need to be able to replace one machine with the other in the event of a failure. We're willing to accept some downtime/data loss, so the replication doesn't need to be real time; applying snapshots to the backup machine every 30 minutes should be fine. A lot of $$$ is spent on people working and on the content. The storage array being offline for a day costs us in many ways.
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
So, I've been doing some more research and I came across this site with some recommendations about ZIL/SLOG devices.

I am thinking that the Intel DC P3700 drive that has power loss protection would be the way to go for the ZIL. I'll hold off on SLOG until I've reached memory capacity of the motherboard.

In case any of you are interested, the article can be found here: https://www.servethehome.com/buyers...as-servers/top-picks-freenas-zil-slog-drives/
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
So, I've been doing some more research and I came across this site with some recommendations about ZIL/SLOG devices.

I am thinking that the Intel DC P3700 drive that has power loss protection would be the way to go for the ZIL. I'll hold off on SLOG until I've reached memory capacity of the motherboard.

In case any of you are interested, the article can be found here: https://www.servethehome.com/buyers...as-servers/top-picks-freenas-zil-slog-drives/
The Intel DC P3700 has been mentioned around here as a decent SLOG.

The SLOG has little to do with the memory capacity; that's the L2ARC, (aka the Cache pool entry). It's
confusing, I know, and it will get worse since OpenZFS is introducing metadata cache devices, (so your
directory and file attributes can live on SSD).

Think of it this way: SLOGs are write-only unless you have a server crash. Then you want something
that survived the crash so you can pick up any synchronous writes that still need to be written to the normal disks.
Thus the desire to have one with power-loss protection. To be practical, it should be noticeably faster
than your normal devices, (generally disks). So SSDs come into play, with both low latency and high
write speeds.

L2ARC / Cache devices are the opposite; they are read-mostly. Only when data is read a lot and can't
fit into RAM do you want it written to a higher-speed, lower-latency device. There is a directory
entry in RAM for each L2ARC / Cache entry, thus the desire to max out RAM before using an L2ARC.
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
Yeah, I always referred to it as ZIL but then found out it was SLOG and got confused since they're the same!
 
Joined
Feb 2, 2016
Messages
574
To be practical, it should be noticeably faster than your normal devices, (generally disks). So SSDs come into play, with both low latency and high write speeds.

Read that a couple times then do the math, @notjoe: noticeably faster. With a 15-wide striped mirror of 7200rpm drives, you're going to have 15 * 175 MB/s, give or take, of throughput: 2,625 MB/s aggregate. The Intel P3700 has 1,900 MB/s of throughput. If you SLOG your pool with the P3700, you could actually have lower performance, give or take.

Your machine - and requirements - are full beast. Rules of thumb are going to break down at that level.

For your replication host, @nojohnny101 is correct; I believe you can really cut corners there, depending on your service-level requirements. If the primary node fails, it may be quicker to pull parts out of the replication target and put them in the primary to get it back up and running than it would be to fail over all services to the secondary. When I look at what cascade of failures would be required to cause the primary to fail, those failures are so unlikely that I'm happy to limp along on an underpowered secondary while waiting for parts for the primary to arrive.

Cheers,
Matt
 

notjoe

Explorer
Joined
Nov 25, 2015
Messages
63
Read that a couple times then do the math, @notjoe: noticeably faster. With a 15-wide striped mirror of 7200rpm drives, you're going to have 15 * 175 MB/s, give or take of throughput: 2,625 MB/s aggregate. The Intel P3700 has 1,900 MB/s of throughput. If you SLOG your pool with the P3700, you could actually have lower performance, give or take.

Your machine - and requirements - are full beast. Rules of thumb are going to break down at that level.

For your replication host, @nojohnny101 is correct; I believe you can really cut corners there, depending on your service-level requirements. If the primary node fails, it may be quicker to pull parts out of the replication target and put them in the primary to get it back up and running than it would be to fail over all services to the secondary. When I look at what cascade of failures would be required to cause the primary to fail, those failures are so unlikely that I'm happy to limp along on an underpowered secondary while waiting for parts for the primary to arrive.

Cheers,
Matt

I am starting to see the light! Taking it a step further, if we assume that I will consume the other 30 drive bays, for a total of 60, then the math would be 30 * 175 MB/s, give or take, of throughput: 5,250 MB/s aggregate, and a ZIL drive would most definitely kill performance. So, ZIL/SLOG is out of the question then? Whether the data gets redistributed across newly added disks, or those newly added disks get used more until they hit a capacity similar to the disks in the rest of the raid array, is beyond me. I might be overthinking this.

As for the backup machine: the expectation is that with a failed server we can be operational within minutes using the backup. I'd imagine it should only take a few minutes to switch IP addresses over. Personally, I'd probably just stock some replacement parts and call it a day, but the requirement is a 1:1 mirror of the primary. I'll happily spend money on kit though ;)

Thanks for saving me ;)
 

ChriZ

Patron
Joined
Mar 9, 2015
Messages
271
Hello..
Just one thing.
In your first post you mentioned your CPU specs, from which I understand it is an E5-2603 v4.
Don't know if this CPU is good enough for your intended workload.
Don't quote me on this, though... Let's see what others think about it.
 
Joined
Feb 2, 2016
Messages
574
So, ZIL/SLOG is out of the question then?

Short answer: maybe, I don't know.

Long answer: Not out of the question. Just a more complicated answer. If one is too slow, maybe two in a stripe would work? You're way beyond what I've built. Here's my napkin math...

With a 30-wide, 60-disk mirrored stripe, you're moving up to 10,500 MB/s, (5,250 MB/s per side of the mirror, assuming 175 MB/s per disk). Your LSI 3008 is limited to 6,000 MB/s, right? Further, PCIe 3.0 is just 985 MB/s per lane: an 8x card such as the LSI SAS 9300 is limited to 985 * 8, or 7,880 MB/s.

So, while the 5,250 MB/s you could be moving is less than the 7,880 MB/s you have available, you're running harder than most. You also haven't accounted for everything else chewing up PCIe lanes, such as the 10G NIC.

Your NIC, by the way, will likely be the bottleneck: 10G is 1,250 MB/s. You'd need 40G of network to reach 5,000 MB/s of disk.
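Putting those rough ceilings side by side (all figures are the approximate ones quoted in this thread, not measurements):

```python
# Napkin comparison of the bandwidth ceilings discussed above, in MB/s.
ceilings = {
    "single 10GbE port":        1250,        # 10 Gb/s on the wire
    "Intel P3700 SLOG":         1900,        # quoted sequential write throughput
    "pool, 30 striped mirrors": 30 * 175,    # ~175 MB/s per 7200rpm drive
    "one LSI 3008 HBA":         6000,        # quoted SAS-side limit
    "PCIe 3.0 x8 slot":         985 * 8,     # ~985 MB/s per lane
}
for name, mb_s in sorted(ceilings.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} ~{mb_s} MB/s")
# The network is the first wall you hit, then the SLOG, well before the platters.
```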

Whether the data gets redistributed across newly added disks or those newly added disks get used more until they hit a capacity similar to the disks in the rest of the raid array is beyond me.

If your data is not sedentary and your original pool isn't 100% full, rebalancing should happen magically as data is written. You should receive improved performance along with the additional space as soon as you bring the VDEVs into the pool. Further, since you're adding the same number of drives as was in the original pool, the worst case scenario is that the performance won't change.

(Where you would see a problem is if you had a ten-spindle pool that was really close to being full and then added two more spindles. Until the data rebalanced, those two relatively empty spindles would be hammered while the original ten spindles would not see nearly as much use. If your performance was 1,000 MB/s initially, you might choke down to 200 MB/s. Your IOPS, too, would suffer. But, like I said, that isn't your scenario. An archival pool where data is added and then never deleted or changed is harder to rebalance, since nothing is naturally moving around.)

an E5-2603 v4... Don't know if this CPU is good enough for your intended workload.

An E5-2603 v4 does seem slow. For most situations, any CPU will have plenty of performance as long as you're not running applications on the FreeNAS host itself (most notably, Plex). In this case, I'd throw a bit more horsepower at the server.

Cheers,
Matt
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Uh, let's not forget that SSDs have MUCH lower latency. Meaning I could write data and get
an acknowledgement that it's written in a millisecond, but a hard disk might take 8 milliseconds
just to seek to the right point on the disk, then wait a bit more until the disk spins to the right block.

Further, the pool can also be busy with normal reads, backups, scrubs, re-silvers, async writes
and SLOG write-backs, (which come from RAM). But a SLOG is a dedicated device, (or
generally should be).

Remember, in the case of synchronous writes, what matters is how fast you can get the data into secure
storage, (the pool's built-in ZIL, or an external SLOG). Because until that happens, the write is not
complete and the writing program is waiting.

That's why it's application specific. If the application does not need or want synchronous writes,
the NAS does not need a SLOG.
 
Joined
Feb 2, 2016
Messages
574
You are correct, @Arwen. For many applications, latency would be paramount.

I discounted latency in my analysis because video file writes strike me as being larger than any reasonable SLOG - you're going to end up waiting on the spinning platters anyway. Large, streaming writes instead of bursty writes.

At least, that's how I'm imagining it. Truth is, most of my clients don't do video. They are still photographers with hefty archives of largish (35MB - 200MB) images. For their workflow, an SSD SLOG is sweet.

Cheers,
Matt
 
