BUILD Large build cluster shared storage

Status
Not open for further replies.

ahunt

Cadet
Joined
Apr 27, 2015
Messages
3
I'm looking for second opinions on a rather large build. We will be using this as a shared storage for our population analytics clusters, meaning at least 50 (and if it can handle it, 250+) nodes potentially all writing at once, over NFS. Extremely random sync writes, several TB of data each day.

MB - 1x- Supermicro X10DRC-T4+
CPU - 2x - Xeon E5-2687W v3
RAM - 12x (384GB total) - Samsung M386A4G40DM0-CPB
Chassis - 1x - Supermicro SC847BE1C-R1K28LPB
External HBA - 1x - LSI00343 9300-8e
JBOD - 1x - Supermicro SC847E16-R1K28JBOD
HDD - 81x - HGST Deskstar NAS 6TB H3IKNAS600012872SN
SLOG - 2x (mirrored) - Intel P3700 400GB NVMe SSD

We were planning to set up the HDDs with 9x 8-disk raidz2 vdevs, leaving 9 for warm spares and yielding 324TB after parity. Our IT department doesn't want to have to constantly be changing out drives, hence the large number of spares. Would raidz3 be a better use of these drives, potentially running in a degraded state for an extended period?
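To sanity-check the layout options, here is the usable-capacity arithmetic as a small Python sketch (disk counts and the 6TB size are from the post; "TB" means the drive's decimal terabytes, ignoring ZFS metadata overhead). Interestingly, folding the nine spares into 9x 9-disk raidz3 vdevs yields the same usable capacity as 9x 8-disk raidz2 plus spares:

```python
def usable_tb(vdevs, disks_per_vdev, parity, disk_tb=6):
    """Raw capacity after parity for a pool of identical raidz vdevs."""
    return vdevs * (disks_per_vdev - parity) * disk_tb

# Proposed: 9x 8-disk raidz2 + 9 warm spares (81 drives total)
raidz2 = usable_tb(vdevs=9, disks_per_vdev=8, parity=2)   # 324 TB

# Alternative: 9x 9-disk raidz3, no spares (81 drives total)
raidz3 = usable_tb(vdevs=9, disks_per_vdev=9, parity=3)   # 324 TB

print(raidz2, raidz3)   # 324 324
```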

The OS will probably be installed on a SATA DOM; alternatively, we could reduce the spares by two and install it on a pair of mirrored HDDs. We haven't really decided yet.

For the SLOG, is the P3700 considered reliable enough that we don't need to mirror it? I would hate to lose 300TB+ just because we wanted to save $1200.

Based on the theoretical throughput of the 4x 10Gb network (I know I won't actually get this, and likely won't set up the network this way), 10 seconds (two transaction groups) of network writes would be 50GB, leaving quite a bit of extra space on the P3700. If we allocate 100GB for the SLOG (extra space for potential local writes and to improve IOPS), is it problematic (or even worth it) to use the other 300GB as L2ARC?
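The back-of-envelope SLOG sizing above works out like this (all figures are theoretical upper bounds from the post: 4x 10GbE at line rate, and roughly two 5-second transaction groups of dirty data outstanding at once):

```python
# Theoretical SLOG sizing: bandwidth * time the data can sit in flight.
links = 4
gbit_per_link = 10
seconds = 2 * 5                              # two txgs at the default 5 s interval

throughput_gbs = links * gbit_per_link / 8   # 5.0 GB/s theoretical line rate
slog_gb = throughput_gbs * seconds           # 50 GB

print(slog_gb)   # 50.0
```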

We do not plan on using deduplication, but do plan on enabling compression. I know the general recommendation is to give as much ram as possible; at least 1GB for every 1TB of storage. Would going higher than 384GB give us a significant performance boost? With those ram modules, we could go as high as 768GB.
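For reference, applying the 1GB-of-RAM-per-1TB-of-storage rule of thumb to this build (a rough community heuristic, not a hard requirement; figures from the post):

```python
raw_tb = 81 * 6            # 486 TB of raw disk
usable_tb = 324            # 9x 8-disk raidz2 after parity
installed_gb = 384

# The guideline lands between the two depending on whether you count
# raw or usable capacity, so 384 GB sits right in the recommended range.
print(usable_tb <= installed_gb <= raw_tb)   # True
```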

I know this is a FreeNAS forum, and I could not find the hardware above in the FreeBSD 9.3 hardware compatibility list. I did a search, though, and found at least one other person using that motherboard, so I am unsure whether this will work with FreeNAS 9.3. If not, you guys are still the most knowledgeable ZFS people I could find, and I hope you would still be willing to discuss this if I have to go to a later version of pure FreeBSD.
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
Don't you need high availability for that? FreeNAS does not offer it. TrueNAS supports two controllers that fail over.

The P3700 400GB is rated for a max of 4TB written daily. Don't use it as a mirrored SLOG; just keep one spare around. With the 2687W v3 you get 32GHz total per chip, with the 2690 v3 36GHz. If you have that many clients, don't plan for high single-threaded speed but for many cores. Socket 2011-3 wants 4, 8 or 12 DIMMs per CPU for optimal performance - so 8 or 16x 16GB RDIMMs; only at 24 DIMMs do LRDIMMs become faster. The JBOD chassis is SAS 6Gbps, but the external controller (which isn't required) is 12Gbps - I'd match those up and replace those consumer-grade NAS drives with enterprise SAS drives. That mobo also comes with an LSI 3108 RAID controller, which you don't want for ZFS.
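Where the "4TB written daily" figure comes from: the P3700 is specced at 10 drive-writes-per-day over its 5-year warranty (spec-sheet numbers as commonly quoted; worth verifying against Intel's datasheet for the exact SKU):

```python
capacity_gb = 400
dwpd = 10                                    # drive writes per day (assumed spec)
daily_tbw = capacity_gb * dwpd / 1000        # 4.0 TB/day
warranty_days = 5 * 365
total_pbw = daily_tbw * warranty_days / 1000 # ~7.3 PB lifetime endurance

print(daily_tbw, round(total_pbw, 1))   # 4.0 7.3
```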

If you want to improve IOPS, there's no way around striped mirrors. Each raidz vdev only has roughly single-disk performance for random I/O, especially writes. Maybe tier that storage so that writes go to striped mirrors and are then moved to 10-disk raidz2 or 11-disk raidz3 vdevs for archival storage.
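To see why vdev count dominates random-write IOPS, here is a rough comparison (the ~150 IOPS per 7200rpm disk is an assumed ballpark, not a measurement; each vdev is credited with about one disk's worth of random IOPS):

```python
disk_iops = 150                      # assumed per-spindle random IOPS
data_disks = 72                      # 81 drives minus 9 spares

raidz2_vdevs = 9                     # 9x 8-disk raidz2
mirror_vdevs = data_disks // 2       # 36x 2-way mirrors from the same disks

print(raidz2_vdevs * disk_iops, mirror_vdevs * disk_iops)   # 1350 5400
```

Same spindles, roughly four times the random-write IOPS from striped mirrors, at the cost of half the usable capacity.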

Short writeup if you really don't want the high availability and business support of TrueNAS:
http://www.supermicro.nl/products/system/4U/6048/SSG-6048R-E1CR36N.cfm <- barebone complete with mobo, PSUs, coolers, cables, JBOD expansion ports etc
http://www.supermicro.nl/products/accessories/addon/AOC-S3008L-L8E.cfm <- HBA instead of RAID-controller
http://www.supermicro.nl/products/chassis/4U/847/SC847E1C-R1K28JBOD.cfm <- 44bay SAS 12Gbps JBOD
2x Xeon E5-2690 v3
16x16GB DIMMs for a start
The SATADOMs are fine, get two for mirrored boot at least.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'd do mirrored SATA DOMs if this is for production (sounds like it is).

Not enough experience with the P3700 to decide not to mirror it. But keep in mind that after any kind of unplanned shutdown, the most recent sync writes exist only on the SLOG. So if your single-drive SLOG fails at that point, you'll lose that data (but not the zpool) - data that your NFS clients believe was safely written will have been lost. :(

You should not use the same device for a SLOG and L2ARC. So if you want an L2ARC, buy another device (or devices). Mirroring an L2ARC is unnecessary and not recommended.
 

ahunt

Cadet
Joined
Apr 27, 2015
Messages
3
Thank you so much for your input. HA would be very nice, but it generally comes with a massive cost increase. I could build the proposed system for under $50k, even before any vendor discounts. Before building this, I will definitely ask for a TrueNAS quote, but I want to know how much a DIY solution would cost first, for comparison.

We are currently using a bunch of smaller machines nearing the end of lease (also unfortunately not HA), which this would be replacing. I was thinking we could try something like this out by itself as a POC (it can't possibly be worse than our current solution), and if it works, we could provide not-quite-HA by building another and using ZFS replication on a very frequent schedule. We are running batch jobs, so it isn't too incredibly difficult to re-run the most recent stuff if we lose a little.

If we use ZFS replication from a machine with mirrors to a machine with 10 disk raidz2 or 11 disk raidz3, does that replication also perform random writes? Would it benefit from having a fast SLOG?

That 6048R-E1CR36N server seems to also have an LSI 3108. I assume this would just need to be replaced with a 3008 in IT mode.

I will definitely be following your suggestions to change to:
2x Xeon E5-2690 v3
X10DRi-T4
16x16GB DIMMs
2x SATA DOMs
847E1C-R1K28JBOD
non-mirrored P3700
3008 HBAs instead of the 3108

Switching to SAS disks would be a large increase in cost, which I was hoping to avoid. How critical is it to use them over the SATA disks?
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
This is why I linked the SAS HBA between the barebone and the JBOD.

HA requires SAS HDDs, and those are expensive. If you price up enterprise disks (the HGST Deskstar NAS is consumer grade, not enterprise grade), you will find there's not much of a price difference between SAS and SATA. SAS provides dual-pathing across dual expanders/controllers, true bidirectional throughput, better compatibility with SAS expanders (SATA disks can be flaky behind them) and better vibration resistance as well. With 44 disks in one chassis shaking each other, you may encounter performance and reliability issues from that alone.
This is also the biggest cost factor. The TrueNAS itself isn't expensive compared to other solutions: unlimited snapshots, unlimited remote replication, unlimited disks and unlimited iSCSI LUNs add up pretty quickly on comparable arrays.

In any case, just shoot iXsystems an email and get a quote. They also sell the Supermicro systems.
 