BUILD Replacement OpenFiler


JeroenvdBerg

Dabbler
Joined
Aug 27, 2016
Messages
11
I'm looking for a replacement for our 5-year-old OpenFiler 'SAN' that is running about 20-30 VMs. We currently have an old HP machine with about 3TB of storage, but we need more capacity and more speed (the current drives are a mix of 15k SAS, 7.2k NL-SAS and a bunch of SATA; depending on the workload, VMs get placed on a different datastore comprised of a set of drives).

We are a small IT shop that runs databases, a file server and Exchange, mainly with Windows as the OS, and we use VMware ESXi as the hypervisor (3 hosts).
This will be our main storage, connected via 10Gbit over iSCSI (no Samba or NFS).
I want this thing to be screaming fast, expandable, stable and redundant, but at low cost, so:

Superserver 2028R-E1CR24L (X10DRH-iT motherboard)
Intel Xeon E5-1650 v3 6C @ 3,5-3,8GHz (with option to upgrade to a second processor)
Samsung M393A4K40BB0-CPB 32GB x 4
Crucial MX300 2,5" 1TB x 13 in a stripe of 6xRAID1 (1 spare, with option to add 12 more drives in sets of 2)
16GB SATA DOM boot device

Expected maximum performance characteristics:
Usable capacity: 6TB
Redundancy: 1 drive failure per RAID1 set
Speed: 742560.78 IOPS / 4443.29MB/s (@60% read)

Price: +- 5500,- (euro)

As an extra option for more read performance I was debating whether to add an Intel 750 SSD Half Height PCIe 3.0 x4 400GB for L2ARC, but I don't think it will be necessary.
And maybe 2x Samsung SM863 240GB in RAID1 for log writing (SLOG).

Any concerns/suggestions?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Welcome to the forum!

Wow, you are planning on building a really nice, high performance system! I'm green with envy! :)
Superserver 2028R-E1CR24L (X10DRH-iT motherboard)
Intel Xeon E5-1650 v3 6C @ 3,5-3,8GHz (with option to upgrade to a second processor)
Samsung M393A4K40BB0-CPB 32GB x 4
Crucial MX300 2,5" 1TB x 13 in a stripe of 6xRAID1 (1 spare, with option to add 12 more drives in sets of 2)
16GB SATA DOM boot device

Expected maximum performance characteristics:
Usable capacity: 6TB
Redundancy: 1 drive failure per RAID1 set
Speed: 742560.78 IOPS / 4443.29MB/s (@60% read)

Price: +- 5500,- (euro)
Standard advice for iSCSI block VM storage is to use mirrors instead of RAIDZ1/2/3. IOPS scale by vdev, so multiple mirrors deliver more IOPS than an equivalent number of disks in a RAIDZn topology. But you're planning on SSD-based storage instead of hard drives, so you may get perfectly acceptable performance from the striped RAIDZ1 pool you describe.
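Just to illustrate the difference, here is roughly what the two layouts look like as zpool commands (da0-da12 are placeholder device names; in practice you'd build this through the FreeNAS volume manager, but the vdev layout is what matters):

# Thirteen SSDs as six striped 2-way mirrors plus a hot spare: six vdevs' worth of IOPS
zpool create tank \
  mirror da0 da1  mirror da2 da3  mirror da4 da5 \
  mirror da6 da7  mirror da8 da9  mirror da10 da11 \
  spare da12

# The same disk count as two 6-disk RAIDZ1 vdevs plus a spare: more usable space,
# but only two vdevs' worth of IOPS
zpool create tank \
  raidz1 da0 da1 da2 da3 da4 da5 \
  raidz1 da6 da7 da8 da9 da10 da11 \
  spare da12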

Because FreeNAS is a copy-on-write system, another rule of thumb with iSCSI is to use no more than 50-60% of your block storage capacity, while noting that using an even smaller proportion would be better.

Therefore, your ideal design would be to install 12 mirrors in your 24-bay cabinet, yielding ~12TB of usable space, with the goal of never using more than ~6TB of that space. Also, this configuration would have 6 times the IOPS of your two-vdev RAIDZ1 design! But I understand that's double the SSD disks you'd planned on using. Note that you'd have to mount your spare drive(s) in the optional 2-drive bay at the back of the cabinet.
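A simple way to hold yourself to that limit is to only carve out about half the pool as the zvol you export over iSCSI, something like this (the dataset name and volblocksize are just example values):

# ~6TB sparse zvol on a ~12TB pool of mirrors; the rest stays as free space for ZFS
zfs create -s -V 6T -o volblocksize=16k tank/vmstore
zpool list tank    # keep an eye on the CAP column and keep it well under 50-60%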

But again, these rules-of-thumb may not apply in your case.

Using just 3 more SSD drives than your original design, you could stripe three 5-disk RAIDZ1 vdevs with 1 spare drive, for a total of 16 drives. This would give you a third more IOPS and ~2TB more space than your original design.
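That 16-drive layout would look roughly like this (placeholder device names again):

# Three 5-disk RAIDZ1 vdevs striped together, plus one hot spare
zpool create tank \
  raidz1 da0 da1 da2 da3 da4 \
  raidz1 da5 da6 da7 da8 da9 \
  raidz1 da10 da11 da12 da13 da14 \
  spare da15
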
As an extra option for more read performance I was debating whether to add an Intel 750 SSD Half Height PCIe 3.0 x4 400GB for L2ARC, but I don't think it will be necessary.
And maybe 2x Samsung SM863 240GB in RAID1 for log writing (SLOG).

Any concerns/suggestions?
With 128GB RAM, you may be correct about not needing an L2ARC. It's better to add more memory before adding an L2ARC, and you can install up to 512GB or 2000GB(!) on your board, depending on memory type.

But you will need a SLOG device (see "Some Insights Into SLOG ZIL"). These need to have 'supercapacitors/battery backup', low latency, high speed, and high write durability. I'm not familiar with the Samsung SM863 device, which may be perfectly suitable. The thread I linked recommends good models. In your case, you might be best served by the STEC ZeusRAM or Intel DC P3700. The latter has the advantage of being housed in a half-height card form factor, using a PCI-e 3.0 interface and is more-or-less the 'enterprise' version of the Intel 750 you mentioned.
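Whichever device you settle on, attaching it as a SLOG is a one-liner; for a PCIe NVMe card like the P3700 the FreeBSD device node is typically nvd0 (adjust to whatever your system reports), and a pair of SATA SSDs can be mirrored:

# Single NVMe SLOG
zpool add tank log nvd0

# Or a mirrored pair of SATA SSDs as the SLOG
zpool add tank log mirror da13 da14

zpool status tank    # the log vdev shows up in its own section of the output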

Search the forum for postings by user @jgreco - he has quite a bit of expertise in this area.

Good luck!
 

ChriZ

Patron
Joined
Mar 9, 2015
Messages
271
If you want the option to upgrade to a second CPU, you must use one of the E5-26xx CPUs
The 16xx CPUs don't support it.
Just my 2c...
 

JeroenvdBerg

Dabbler
Joined
Aug 27, 2016
Messages
11
Welcome to the forum!

Wow, you are planning on building a really nice, high performance system! I'm green with envy! :)
Thank you :) I have been trying to get my manager to replace the old, slow, underperforming, randomly purple-screening machine for 2 years now, and it looks like I'm finally getting a much-needed upgrade.

Standard advice for iSCSI block VM storage is to use mirrors instead of RAIDZ1/2/3. IOPS scale by vdev, so multiple mirrors deliver more IOPS than an equivalent number of disks in a RAIDZn topology. But you're planning on SSD-based storage instead of hard drives, so you may get perfectly acceptable performance from the striped RAIDZ1 pool you describe.

Because FreeNAS is a copy-on-write system, another rule of thumb with iSCSI is to use no more than 50-60% of your block storage capacity, while noting that using an even smaller proportion would be better.

Therefore, your ideal design would be to install 12 mirrors in your 24-bay cabinet, yielding ~12TB of usable space, with the goal of never using more than ~6TB of that space. Also, this configuration would have 6 times the IOPS of your two-vdev RAIDZ1 design! But I understand that's double the SSD disks you'd planned on using. Note that you'd have to mount your spare drive(s) in the optional 2-drive bay at the back of the cabinet.

But again, these rules-of-thumb may not apply in your case.

Using just 3 more SSD drives than your original design, you could stripe three 5-disk RAIDZ1 vdevs with 1 spare drive, for a total of 16 drives. This would give you a third more IOPS and ~2TB more space than your original design.
With 128GB RAM, you may be correct about not needing an L2ARC. It's better to add more memory before adding an L2ARC, and you can install up to 512GB or 2000GB(!) on your board, depending on memory type.

Just to make sure we understand each other: the planned setup is on the left (6x RAID1, striped) and your suggestion is on the right. Since the parity needs to be written to all drives, that would cripple write performance?

[Attached diagram comparing the two layouts]




But you will need a SLOG device (see "Some Insights Into SLOG ZIL"). These need to have 'supercapacitors/battery backup', low latency, high speed, and high write durability. I'm not familiar with the Samsung SM863 device, which may be perfectly suitable. The thread I linked recommends good models. In your case, you might be best served by the STEC ZeusRAM or Intel DC P3700. The latter has the advantage of being housed in a half-height card form factor, using a PCI-e 3.0 interface and is more-or-less the 'enterprise' version of the Intel 750 you mentioned.

Search the forum for postings by user @jgreco - he has quite a bit of expertise in this area.

The Samsung SM863 is Samsung's answer to the Intel DC line of drives: they have a supercapacitor, consistent performance and a TBW rating of 3PB (480GB version) vs 3.0PB for the 400GB version of the S3610, and they are a little bit cheaper.

The STEC ZeusRAM is not really an option; I see it priced at more than 2k (euro) in a local shop.
Also: the Intel DC P3600 PCIe 3.0 x4 400GB has an IOPS rating of 320.000 read / 30.000 write and a TBW of 2,1PB; the Intel 750 PCIe 3.0 x4 400GB has an IOPS rating of a whopping 430.000 read / 230.000 write, but its TBW rating is scarily low at 125TB, so that might not be a good option for a SLOG/write cache.

I think the options are:
Intel DC P3700 PCI-e 400GB 450.000 read / 75.000 write / €450,-
2 x Intel DC S3610 100GB 84000 read / 28000 write / €210,-
2 x Intel DC S3710 200GB 85.000 read / 43.000 write / €375,-
2 x Samsung SM863 240GB 97.000 read / 20.000 write / €250,-

But I don't think I'm bold enough to run the SLOG on a single P3700, basically creating a single point of failure; if that thing fails during a heavy write I lose the entire array.
I think I will go with the S3710 in RAID1, but that means I get 43.000 IOPS to the array max, if I'm not mistaken?

Good luck!

Thank you and thanks for looking into my build :)


If you want the option to upgrade to a second CPU, you must use one of the E5-26xx CPUs
The 16xx CPUs don't support it.
Just my 2c...

Shoot, that is not 2c advice my friend, that is worth a lot more.

Changed to Intel Xeon E5-2637 v4
 

JeroenvdBerg

Dabbler
Joined
Aug 27, 2016
Messages
11
Updated:

Superserver 2028R-E1CR24L (X10DRH-iT motherboard)
Intel Xeon E5-2637 v4 4C @ 3,4-3,7GHz (with option to upgrade to a second processor)
Samsung M393A4K40BB0-CPB 32GB x 4
Crucial MX300 2,5" 1TB x 13
Intel DC S3710 200GB x 2 (RAID1)
16GB SATA DOM boot device

Price: +- €6500,-
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Just to make sure we understand each other: the planned setup is on the left (6x RAID1, striped) and your suggestion is on the right. Since the parity needs to be written to all drives, that would cripple write performance?

[Attached diagram comparing the two layouts]
I misunderstood your specifications! I thought you were going to set up 2 RAIDZ1 vdevs, each comprising 6 drives, for a total of ~10TB of usable space. But you were planning on using mirrors instead. Ooops! :oops:

I apologize for the confusion. Mirrors are the preferred topology for iSCSI block storage and you designed your system to use 6 of them, for a total capacity of ~6TB. Still, everything I said about space utilization remains true. When I read your original post, I also assumed that you need ~5-6TB of storage capacity for your VM images. Is this true? Or did I misunderstand that as well? If you do need that much space, then your design won't work as it stands because you will be using 100% of your storage capacity. However, if your VM images only occupy 2-3TB then your design will be fine. The important thing is to design your system so that you only use roughly half (or less) of the total available storage capacity.
The Samsung SM863 is Samsung's answer to the Intel DC line of drives: they have a supercapacitor, consistent performance and a TBW rating of 3PB (480GB version) vs 3.0PB for the 400GB version of the S3610, and they are a little bit cheaper.

The STEC ZeusRAM is not really an option; I see it priced at more than 2k (euro) in a local shop.
Also: the Intel DC P3600 PCIe 3.0 x4 400GB has an IOPS rating of 320.000 read / 30.000 write and a TBW of 2,1PB; the Intel 750 PCIe 3.0 x4 400GB has an IOPS rating of a whopping 430.000 read / 230.000 write, but its TBW rating is scarily low at 125TB, so that might not be a good option for a SLOG/write cache.

I think the options are:
Intel DC P3700 PCI-e 400GB 450.000 read / 75.000 write / €450,-
2 x Intel DC S3610 100GB 84000 read / 28000 write / €210,-
2 x Intel DC S3710 200GB 85.000 read / 43.000 write / €375,-
2 x Samsung SM863 240GB 97.000 read / 20.000 write / €250,-

But I don't think I'm bold enough to run the SLOG on a single P3700, basically creating a single point of failure; if that thing fails during a heavy write I lose the entire array.
I think I will go with the S3710 in RAID1, but that means I get 43.000 IOPS to the array max, if I'm not mistaken?
It used to be true that you would lose your pool if the SLOG device failed, but that is no longer the case. Because of this, you can use a single P3700 if you prefer. The only reason to mirror your SLOG devices is if you're concerned about the drop in performance that would result if the SLOG failed.
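And if a SLOG does die, the pool keeps running (the ZIL simply falls back to the main pool), so recovery is just a matter of swapping the device out, roughly:

zpool status tank            # a failed log device shows up as FAULTED/UNAVAIL
zpool remove tank nvd0       # detach the dead log device (placeholder name)
zpool add tank log nvd0      # add the replacement once it's installed
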
Thank you and thanks for looking into my build :)
You're welcome!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
With all the IOPS of those SSDs, is mirroring still the right way to go?

Also, the 50% thing is to avoid fragmentation and thus multiple random seeks on writes.... Right?

Again, with SSDs, can you not assume you'd be able to go higher... say 70%?

I dunno. Just saying :)
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
With all the IOPS of those SSDs, is mirroring still the right way to go?
I'm not sure, which is why I hedged and said "these rules-of-thumb may not apply". But, given the same model of SSD, it's a certainty that mirroring them will provide more IOPS than putting them into any RAIDZn configuration.
Also, the 50% thing is to avoid fragmentation and thus multiple random seeks on writes.... Right?

Again, with SSDs, can you not assume you'd be able to go higher... say 70%?
Right. And you may be correct about a higher utilization rate being okay for SSD-based block storage.
I dunno. Just saying :)
Understood. We are in fairly new territory here; there are very few all-flash systems to compare.

It may be that @JeroenvdBerg would be better served using HDDs instead of SSDs for his pool and maxing out his memory instead. After all, the board he's chosen can support up to 500GB or 2000GB of RAM, depending on CPU & RAM type. But I can't recommend this, in good conscience, because I don't know for certain whether it's true.

I wish @jgreco would stop by and give his opinion as he is the one most likely to have experience on this subject. But he hasn't been around in several weeks...
 

JeroenvdBerg

Dabbler
Joined
Aug 27, 2016
Messages
11
I misunderstood your specifications! I thought you were going to set up 2 RAIDZ1 vdevs, each comprising 6 drives, for a total of ~10TB of usable space. But you were planning on using mirrors instead. Ooops! :oops:

I apologize for the confusion. Mirrors are the preferred topology for iSCSI block storage and you designed your system to use 6 of them, for a total capacity of ~6TB. Still, everything I said about space utilization remains true. When I read your original post, I also assumed that you need ~5-6TB of storage capacity for your VM images. Is this true? Or did I misunderstand that as well? If you do need that much space, then your design won't work as it stands because you will be using 100% of your storage capacity. However, if your VM images only occupy 2-3TB then your design will be fine. The important thing is to design your system so that you only use roughly half (or less) of the total available storage capacity.
It used to be true that you would lose your pool if the SLOG device failed, but that is no longer the case. Because of this, you can use a single P3700 if you prefer. The only reason to mirror your SLOG devices is if you're concerned about the drop in performance that would result if the SLOG failed.
You're welcome!

Okay, good. Just to be clear, this is what we have in our current 'SAN':
530GB 15k SAS - 606GB provisioned - 30GB free
1730GB 7.2 NL-SAS - 1670GB provisioned - 127GB free
440GB SATA - 482GB provisioned - 202GB free
Total: 2700GB - 2758GB provisioned - 359GB free

Most VMs (like Exchange and SQL) are thick provisioned, but as you can see there are some thin provisioned VMs as well (I'm 70GB short on the fast pool; with just 30GB free, that's not good to say the least).
So with maximum utilization and no compression I'm still well under the 3TB best-practice/rule-of-thumb mark.
But of course I'm going to keep an eye on the performance and just go 'oh, you want another VM? Then we have to buy 2 more SSDs so we don't compromise performance', giving me more IOPS for the entire array and more space for VMs.
Also, if management decides the initial price is too high I can easily reduce the number of SSDs to 11 or 9, giving 5TB or 4TB of storage; not ideal, but still within safety limits.
The ESXi hosts have some local disks that are just used for booting ESXi; I might repurpose them into a small 'slow storage' RAIDZ1 pool for temporary backup locations or unimportant VMs.

I'm not sure, which is why I hedged and said "these rules-of-thumb may not apply". But, given the same model of SSD, it's a certainty that mirroring them will provide more IOPS than putting them into any RAIDZn configuration.
Right. And you may be correct about a higher utilization rate being okay for SSD-based block storage.
Understood. We are in fairly new territory here; there are very few all-flash systems to compare.

It may be that @JeroenvdBerg would be better served using HDDs instead of SSDs for his pool and maxing out his memory instead. After all, the board he's chosen can support up to 500GB or 2000GB of RAM, depending on CPU & RAM type. But I can't recommend this, in good conscience, because I don't know for certain whether it's true.

That's the main problem: all-flash doesn't make much sense (yet) for homelabs or mass-storage arrays, so nobody is really doing it except businesses like TrueNAS (and I was not impressed with their prices) and EMC (we got a quote for about 40k all-flash, while an array with 24 x 10k SAS would cost us 19k). We got a quote from DELL as well, but they are out of their minds (50k for hybrid).
I have been searching online for quite some time, but all-flash ZFS is not something that is easy to find.
There have been some other users who have tried to build a high-performance homelab SAN:
https://forums.freenas.org/index.php?threads/high-performance-freenas-build.28820
https://forums.freenas.org/index.ph...-choice-and-the-rest-of-the-components.28671/
https://blog.pivotal.io/labs/labs/high-performing-mid-range-nas-server
http://www.hyperionepm.com/category/hyperion-home-lab/

All of them weigh the price of SSDs against 7.2k spinning drives; that is not a compromise I want or have to make, especially since SSDs are a lot cheaper than the 10k/15k SAS drives I would otherwise use. I can buy 2 SSDs for the price of 1 SAS drive; an SSD might fail sooner, but hey, I have 2, so I just replace it and send the failed one back to the supplier. That's why I want redundancy in the system, as much as possible, but without compromising performance.

But for our purposes EMC/TrueNAS/DELL/Nutanix/PureStorage/HP cost way too much and are overdesigned. Let me explain: we are just a small IT shop, so most of us know our stuff, but we have to deal with decisions made by our no-longer-employed predecessors:
The storage LAN is on the same network/switch as the rest of the machines (no VLAN)
The 'SAN' has just 1 power supply and has failed twice in the last 4 years
The switch that connected the ESXi hosts, the 'SAN' and the data network was 100Mbit (thank god it failed)
The OS of the 'SAN' (OpenFiler) purple-screens at least once a year and is no longer updated
The NICs on the SAN's motherboard don't work with the OS, so a PCI card with 2 x 1000Mbit was installed, but no multipathing is configured and jumbo frames are off
All LUNs are 100% allocated to VMware
Warranty is 1 year (including drives), etc.
So just fixing a few of these issues would be a relief for me and my colleagues. /rant

But consider the differences between this:

Superserver 6028R-E1CR12L 12x3,5" + 2x2,5" (Super X10DRH-iT) - €1,715.56
Intel Xeon E5-2637 v4 - €869.83
Samsung M393A4K40BB0-CPB 32GB 4x - €492.56
Intel DC S3710 200GB - 1x - €188.99
Intel DC P3700 PCIe 3.0 x4 400GB - 1x - €455.95
WD Black WD1003FZEX 2TB 7.2k SATA 12x - €1,391.26
MCP-220-82611-0N 1x - €30.00
Total: €5,037.13 / 12TB usable (with no spare) / 10TB (2 spare)

And this:

Superserver 2028R-E1CR24L - 24x2,5" + 2x2,5" (Super X10DRH-iT) - €1,865.74
Intel Xeon E5-2637 v4 - €869.83
Samsung M393A4K40BB0-CPB 32GB 4x - €492.56
Intel DC S3710 200GB - 1x - €188.99
Crucial MX300 2,5" 1TB - 9x/13x - €2,043.00
Total: €5,460.12 / 4TB usable
Total: €6,368.12 / 6TB usable

The price difference is between €300 and €1200. Of course, with the first option using 3,5" drives at 2TB each, you get double the capacity, but no room for expansion and far less performance.
If using proper enterprise-grade SAS drives: 13 x 10k 2,5" SAS drives (HP 718162-B21 SAS2 10k 1,2TB) total €6,296.14 (7,2TB usable), or 13 x 15k 2,5" SAS (Seagate Enterprise Turbo SSHD ST600MX0052 600GB) total €7,053.26 (3,6TB usable).
So a RAID (Redundant Array of Inexpensive Disks) of SSDs starts making more sense.
The prices above are just what I could find online; I expect the system itself plus CPU and memory to be cheaper from a vendor we recently contacted for a quote on the system (including assembly and testing).
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
I thought so; you'll have to carefully evaluate your performance yourself :(

If the plan (and it's a good one) is to grow the array over time by adding pairs of drives, then you should definitely go with mirrors.
 

JeroenvdBerg

Dabbler
Joined
Aug 27, 2016
Messages
11
Okay, I finally got the go-ahead; the machine is up and running and I'm doing some testing, any input would be helpful.
I have enabled auto-tune; so far it has come up with vm.kmem_size = 171614888960.
Any tuning commands I should use (and why)?
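For reference, watching the pool while the tests below run only needs the standard FreeBSD tools, e.g.:

zpool iostat -v volume1 5    # per-vdev bandwidth and IOPS every 5 seconds
gstat -p                     # per-disk latency and %busy
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max vm.kmem_size    # ARC size vs. the limits autotune set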

Final specs:
Supermicro Superserver 2028R-E1CR24L (X10DRH-iT motherboard)
Intel Xeon CPU E5-2637 v4 @ 3.50GHz (HT disabled)
4 x Samsung M393A4K40BB0-CPB 32GB (128GB)
13 x Crucial MX300 2,5" 1TB (Data)
2 x Intel DC S3520 120GB (LOG)
1 x 16GB SATA DOM (boot)

[Attached screenshots]


/mnt/volume1# iozone -a -s 200g -r 4096
Iozone: Performance Test of File I/O
Version $Revision: 3.457 $
Compiled for 64 bit mode.
Build: freebsd

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
Vangel Bojaxhi, Ben England, Vikentsi Lapa,
Alexey Skidanov.

Run began: Sat Apr 8 14:43:58 2017

Auto Mode
File size set to 209715200 kB
Record Size 4096 kB
Command line used: iozone -a -s 200g -r 4096
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Results (all values in kBytes/sec), file size 209715200 kB, record size 4096 kB:
write            4895068
rewrite          4259914
read             5271392
reread           5246237
random read      5221923
random write     4132339
bkwd read        5429667
record rewrite  12583534
stride read      5289570
fwrite           4322788
frewrite         4330756
fread            3859841
freread          3863263

iozone test complete.


If I'm not mistaken, that is 5,2GB/s random read and 4,1GB/s random write?
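Since iozone with a 4MB record size is mostly a streaming-throughput test, a 4K random mix would be closer to the VM workload; a possible follow-up with fio (installed from pkg; the parameters are only a first guess, and the ARC will flatter the numbers unless the file size is much larger than RAM):

# 60/40 random read/write at 4K, spread across several jobs to build queue depth
fio --name=vmmix --directory=/mnt/volume1 --ioengine=posixaio \
    --rw=randrw --rwmixread=60 --bs=4k --size=20g --numjobs=4 \
    --iodepth=16 --runtime=60 --time_based --group_reporting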
 