Proposed All SSD Build for VM SAN

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Any thoughts/improvements for this build? It will serve as an iSCSI target for VMware on a dedicated 10Gb storage network.

Chassis: SUPERMICRO CSE-216BE2C-R920LPB
Motherboard: SUPERMICRO MBD-X10SRH-CLN4F-O
CPU: Xeon E5-2637 v4
RAM: 4x 32GB Samsung DDR4-2400
HBA: LSI 9300-8i with SFF-8643 to SFF-8643 cables
OS: SSD-DM064-SMCMVN1 (64GB DOM)
SLOG: Intel S4610 240GB
Array drives: Samsung 860 DCT 1.92TB x 8
Network: Chelsio T520-BT

I plan to provision it as 5 mirrored vdevs for about 8TB of capacity to start with, but may move it up to 12. Too much RAM? Too powerful a CPU? Concerns with any of the components?

Much appreciated as always.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Too much RAM?
Nope. ZFS loves RAM and the more you feed it the better. Go with 256GB if you can afford it.

Concerns with any of the components?

SATA SLOG is not nearly fast enough for small recordsize writes (which VMs will drive a lot of) and will end up being a bottleneck. Go NVMe, Intel DC P3700 or Optane P4801X. Check the link in my signature for benchmarks (https://forums.freenas.org/index.php?threads/slog-benchmarking-and-finding-the-best-slog.63521/)

Other than that, you've got a great-looking setup with lots of room to expand with additional vdevs as you grow. Remind me to dig up the usual tunables for ZFS on all-flash. Make sure you set sync=always and enjoy.
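
If it helps, sync is just a dataset/zvol property. A minimal sketch, assuming a hypothetical zvol called tank/vmstore (substitute your own):

Code:
# force every write through the ZIL/SLOG, not just the ones the client flags as sync
zfs set sync=always tank/vmstore
# confirm the setting
zfs get sync tank/vmstore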
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Any thoughts/improvements for this build? It will serve as an iSCSI target for VMware on a dedicated 10Gb storage network.

Chassis: SUPERMICRO CSE-216BE2C-R920LPB
Motherboard: SUPERMICRO MBD-X10SRH-CLN4F-O
CPU: Xeon E5-2637 v4
RAM: 4x 32GB Samsung DDR4-2400
HBA: LSI 9300-8i with SFF-8643 to SFF-8643 cables
OS: SSD-DM064-SMCMVN1 (64GB DOM)
SLOG: Intel S4610 240GB
Array drives: Samsung 860 DCT 1.92TB x 8
Network: Chelsio T520-BT

I plan to provision it as 5 mirrored vdevs for about 8TB of capacity to start with, but may move it up to 12. Too much RAM? Too powerful a CPU? Concerns with any of the components?

Much appreciated as always.

Wrong CPU. The 2637 v4 is a quad-core, 3.5GHz (turbo to 3.7) part with 15MB cache that's designed for dual-socket operation and costs around $1000.

The 1620 v4 is almost the same CPU except it turbos to 3.8 and only has 10MB of cache, and it costs a third as much ($294). The 1650 v4 is actually a sweet CPU: six cores, 3.6GHz with turbo to 4.0, around $600.

The one downside to the E5-16xx CPUs is that they do not support LRDIMMs. This probably isn't a big deal.

Having built something like this but with HDDs, I'll also note that I had a hard time getting the CPU to do much more than yawn, even under fairly heavy load. My *impression* is that the 1620 would be fine, but once you're paying for all the other components, a few hundred extra for more cores and the slight speed bump is really nothing.
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
SATA SLOG is not nearly fast enough for small recordsize writes (which VMs will drive a lot of) and will end up being a bottleneck. Go NVMe, Intel DC P3700 or Optane P4801X. Check the link in my signature for benchmarks (https://forums.freenas.org/index.php?threads/slog-benchmarking-and-finding-the-best-slog.63521/)

Very helpful (as were jgreco's comments on CPU choice).

A few kind of basic questions on the SLOG (I've read through the benchmarking thread):

First, really basic: is there a recommendation for a solid PCIe-to-M.2 NVMe adapter that I could use with a P4801X and my mobo? I see there are such things available, but I didn't know if quality was an issue with any of them.

Second, any advantage to U.2 over M.2? (There seem to be even fewer U.2-to-PCIe slot adapters.)

Third, I couldn't find any benchmarking of the 200GB vs. 100GB P4801X. Thoughts on that? Is overprovisioning necessary with Optane?

As you note, my use case is more focused on the smaller end of the recordsize writes.

Thanks!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Oh, and P.S.: the Addonics adapters definitely feel a bit more like cheesy PC hardware than server-grade stuff, but it's really the result that counts.
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Also curious whether, for a VM use case, it wouldn't be better to go with a 900p vs. a P4801X. From the benchmarking thread it looks like the 900p performs better on the lower end of recordsizes; nearly twice as fast at 4K, for example. (And it would be an easier fit for me because of the PCIe form factor.) Is the concern there just that it isn't enterprise grade?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
First, really basic: is there a recommendation for a solid PCIe-to-M.2 NVMe adapter that I could use with a P4801X and my mobo? I see there are such things available, but I didn't know if quality was an issue with any of them.

The Addonics one @jgreco posted has been popular; StarTech also makes a few. Bear in mind that the ones providing multiple M.2 NVMe slots may require your motherboard to support PCIe bifurcation. Normally not an issue with a modern server board, but it might be on older gear or repurposed "workstation" stuff.

Second, any advantage to U.2 over M.2? (There seem to be even fewer U.2-to-PCIe slot adapters.)

Larger form factor, better heat dissipation, the potential for hotswap if you eventually get to the point of hotswappable NVMe backplanes.

Third, I couldn't find any benchmarking of the 200GB vs. 100GB P4801X. Thoughts on that? Is overprovisioning necessary with Optane?

Optane P4801X 200G:

Code:
Synchronous random writes:
     0.5 kbytes:     24.8 usec/IO =     19.7 Mbytes/s
       1 kbytes:     25.0 usec/IO =     39.1 Mbytes/s
       2 kbytes:     25.3 usec/IO =     77.2 Mbytes/s
       4 kbytes:     22.9 usec/IO =    170.5 Mbytes/s
       8 kbytes:     25.1 usec/IO =    310.8 Mbytes/s
      16 kbytes:     29.8 usec/IO =    523.5 Mbytes/s
      32 kbytes:     41.0 usec/IO =    762.9 Mbytes/s
      64 kbytes:     60.3 usec/IO =   1037.3 Mbytes/s
     128 kbytes:     96.6 usec/IO =   1293.5 Mbytes/s
     256 kbytes:    162.0 usec/IO =   1543.1 Mbytes/s
     512 kbytes:    291.4 usec/IO =   1715.6 Mbytes/s
    1024 kbytes:    551.4 usec/IO =   1813.6 Mbytes/s
    2048 kbytes:   1073.1 usec/IO =   1863.8 Mbytes/s
    4096 kbytes:   2109.1 usec/IO =   1896.5 Mbytes/s
    8192 kbytes:   4183.8 usec/IO =   1912.1 Mbytes/s


Optane P4801X 100G:

Code:
Synchronous random writes:
         0.5 kbytes:     28.0 usec/IO =     17.4 Mbytes/s
           1 kbytes:     27.0 usec/IO =     36.1 Mbytes/s
           2 kbytes:     27.5 usec/IO =     71.1 Mbytes/s
           4 kbytes:     23.4 usec/IO =    167.1 Mbytes/s
           8 kbytes:     30.4 usec/IO =    257.0 Mbytes/s
          16 kbytes:     44.1 usec/IO =    354.3 Mbytes/s
          32 kbytes:     64.6 usec/IO =    483.5 Mbytes/s
          64 kbytes:    103.7 usec/IO =    602.7 Mbytes/s
         128 kbytes:    161.1 usec/IO =    776.1 Mbytes/s
         256 kbytes:    285.8 usec/IO =    874.6 Mbytes/s
         512 kbytes:    527.5 usec/IO =    947.9 Mbytes/s
        1024 kbytes:    988.2 usec/IO =   1012.0 Mbytes/s
        2048 kbytes:   1905.6 usec/IO =   1049.5 Mbytes/s
        4096 kbytes:   3730.2 usec/IO =   1072.3 Mbytes/s
        8192 kbytes:   7398.6 usec/IO =   1081.3 Mbytes/s


The major difference is at the top end, and since you're focusing on small recordsize writes (VMs) you'll see less of a difference. But even at 16K you're getting 523MB/s vs. 354MB/s. You could also consider the Optane 900p or the non-Optane P3700.

Overprovisioning isn't necessary (or possible) with Optane, but the NAND-based P-series drives (like the P3700) should be changed to use 4K native sector emulation as well as overprovisioned down to a smaller size.
https://www.intel.com/content/www/us/en/support/articles/000006392/memory-and-storage.html
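
On the overprovisioning side for the NAND drives, Intel's isdct tool handles it. A rough sketch, assuming the drive shows up at index 0 (and note this is destructive, so do it before building the pool):

Code:
# list attached Intel SSDs and their indexes
isdct show -intelssd
# shrink drive 0 to ~75% of its LBAs for extra overprovisioning (destroys data)
isdct set -intelssd 0 MaximumLBA=75%
# the 4K sector change is done through isdct's NVMe format option;
# check the isdct documentation for the exact parameters on your version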
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Finally have all the parts and am putting this together. (I went with a P4800X SLOG, BTW...)

One question on my backplane. It's a SAS3-216EL2, which has a primary and a secondary expander. I'm using it with a single LSI 9300-8i HBA. In this configuration, is my only option to connect one cable from the HBA to each of the 'A' ports on the expanders and get failover capability? In other words, is there no way to get increased throughput from the dual expanders? That's how I read the Supermicro manual on this backplane, but thought I'd check here as well.

Thanks all!
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Also, is there some recommended number of molex connectors you need to connect to the backplane or is one enough?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
In this configuration, is my only option to connect one cable from the HBA to each of the 'A' ports on the expanders and get failover capability?

You're using SATA SSDs, which only have a single path, so "failover" in that sense won't be possible.

Check page 3-5, figure 3-8 - for a single internal HBA it shows cabling to both the A and B ports; this should give you an effective 8 lanes to the expander.

Also, is there some recommended number of molex connectors you need to connect to the backplane or is one enough?

Connect them all; the SSDs won't draw nearly enough power to actually require it but the backplane will very likely complain if there's a plug left unplugged.
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
You're using SATA SSDs, which only have a single path, so "failover" in that sense won't be possible.

Check page 3-5, figure 3-8 - for a single internal HBA it shows cabling to both the A and B ports; this should give you an effective 8 lanes to the expander.

So:

1. The situation described on page 3-2 isn't applicable to SATA drives? I.e., one expander can't just take over for the other?

2. If I had two HBAs, does that give me any better alternatives? E.g., diagram 3-9 on page 3-5? Or is there no real point?

Thanks!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So:

1. The situation described on page 3-2 isn't applicable to SATA drives? I.e., one expander can't just take over for the other?

2. If I had two HBAs, does that give me any better alternatives? E.g., diagram 3-9 on page 3-5? Or is there no real point?

Thanks!

1. SATA devices only support a single data path; the second expander board won't "see" any connected devices downstream.

2. If you have two HBAs you could connect one to EL1-A and the other to EL1-B. EL1 would still be a single point of failure but you could survive a blown HBA. The BSD multipath driver would handle the load-balancing across HBAs.

Since you're using a SAS3008 card, I don't think a single HBA would be a bottleneck, but if you were on older 2008/2308-based cards, a config like you're proposing might actually have performance benefits. I haven't had a chance to explore this yet, though.
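
If you do try the dual-HBA layout, a quick sanity check from the shell (these just report status, nothing destructive):

Code:
# show multipath geoms and the providers (paths) behind each disk
gmultipath status
# more detail, including which path is currently active
gmultipath list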
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
I've got this build up and running with the specs as per the original post, other than (a) now using an E5-1650 v4 as the CPU, (b) a P4800X as the SLOG, and (c) no separate HBA, since this mobo has a built-in 3008 (which I flashed to IT mode, of course). I'm going to document my testing as I get it ready, in case it's helpful to someone else and in the hopes I'll get some helpful comments from the knowledgeable folks here.

I ran diskinfo -wS on all the array SSDs to make sure none of them was out of whack. They all looked identical; here's a sample:

Code:
root@freenas[~]# diskinfo -wS /dev/da7
/dev/da7
        512             # sectorsize
        1920383410176   # mediasize in bytes (1.7T)
        3750748848      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        233473          # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA SAMSUNG MZ7LH1T9    # Disk descr.
        S455NY0M210671          # Disk ident.
        id1,enc@n500304801f1e047d/type@0/slot@8/elmdesc@Slot07  # Physical path
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM
        Not_Zoned       # Zone Mode

Synchronous random writes:
         0.5 kbytes:    125.2 usec/IO =      3.9 Mbytes/s
           1 kbytes:    124.1 usec/IO =      7.9 Mbytes/s
           2 kbytes:    126.5 usec/IO =     15.4 Mbytes/s
           4 kbytes:    130.9 usec/IO =     29.8 Mbytes/s
           8 kbytes:    138.9 usec/IO =     56.2 Mbytes/s
          16 kbytes:    153.3 usec/IO =    101.9 Mbytes/s
          32 kbytes:    196.6 usec/IO =    158.9 Mbytes/s
          64 kbytes:    275.3 usec/IO =    227.0 Mbytes/s
         128 kbytes:    422.3 usec/IO =    296.0 Mbytes/s
         256 kbytes:    722.0 usec/IO =    346.2 Mbytes/s
         512 kbytes:   1342.7 usec/IO =    372.4 Mbytes/s
        1024 kbytes:   2543.1 usec/IO =    393.2 Mbytes/s
        2048 kbytes:   4896.4 usec/IO =    408.5 Mbytes/s
        4096 kbytes:   9583.3 usec/IO =    417.4 Mbytes/s
        8192 kbytes:  19007.7 usec/IO =    420.9 Mbytes/s


And here's the same for the P4800X I'm using as a SLOG (I cross-posted these results to the SLOG thread):

Code:
root@freenas[~]# diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        375083606016    # mediasize in bytes (349G)
        732585168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPED1K375GA     # Disk descr.
        PHKS7481007J375AGN      # Disk ident.
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:     14.6 usec/IO =     33.4 Mbytes/s
           1 kbytes:     14.6 usec/IO =     66.8 Mbytes/s
           2 kbytes:     14.9 usec/IO =    130.8 Mbytes/s
           4 kbytes:     12.0 usec/IO =    326.1 Mbytes/s
           8 kbytes:     13.6 usec/IO =    573.8 Mbytes/s
          16 kbytes:     18.1 usec/IO =    864.2 Mbytes/s
          32 kbytes:     24.7 usec/IO =   1264.5 Mbytes/s
          64 kbytes:     40.7 usec/IO =   1535.0 Mbytes/s
         128 kbytes:     74.5 usec/IO =   1678.4 Mbytes/s
         256 kbytes:    132.2 usec/IO =   1891.3 Mbytes/s
         512 kbytes:    233.7 usec/IO =   2139.8 Mbytes/s
        1024 kbytes:    436.1 usec/IO =   2292.8 Mbytes/s
        2048 kbytes:    851.0 usec/IO =   2350.3 Mbytes/s
        4096 kbytes:   1673.7 usec/IO =   2389.9 Mbytes/s
        8192 kbytes:   3304.5 usec/IO =   2420.9 Mbytes/s


I'm not going to "burn in" any of the drives since this is an all-flash setup. Any disagreement there?

My next step is to set up the pool (I have 8 drives, which I'm going to use as 4 mirrored vdevs for max performance in my VM SAN use case) and then run some dd tests with compression off. Any other local tests to try before I move on to testing the 10G Ethernet?
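
For reference, the layout I have in mind looks roughly like this from the CLI (device names are placeholders; I'll actually build it through the GUI, which uses gptids):

Code:
zpool create flashpool \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7
# compression off just for the raw dd tests
zfs set compression=off flashpool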
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
DD results:

4 vdev pool (no compression, no sync):

Code:
root@freenas[/mnt/flashpool]# dd if=/dev/zero of=testfile bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 10.925645 secs (1919476641 bytes/sec)

now with sync, no slog:

Code:
root@freenas[/mnt/flashpool]# dd if=/dev/zero of=testfile bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 52.849869 secs (396813096 bytes/sec)

and now with the slog added:

Code:
root@freenas[/mnt/flashpool]# dd if=/dev/zero of=testfile bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 14.681608 secs (1428421222 bytes/sec)

and here are the read speeds (these are mirrored vdevs):

Code:
root@freenas[/mnt/flashpool]# dd if=testfile of=/dev/null bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 4.770633 secs (4395961619 bytes/sec)

So, bottom line:

1.9 GB/s async (no-sync) writes
1.4 GB/s sync writes with the P4800X SLOG
4.4 GB/s reads

Assuming I'm doing this right, that's enough to saturate a 10Gb link on sequential reads and writes.
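
For anyone reproducing this, the toggles between runs were roughly as follows (a sketch; the pool name matches the prompts above):

Code:
zfs set sync=disabled flashpool   # run 1: no sync
zfs set sync=always flashpool     # run 2: sync forced, no SLOG yet
zpool add flashpool log nvd0      # run 3: P4800X attached as SLOG, rerun dd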
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You'll likely be limited by the volblocksize for the live VM writes; speeds closer to the bs=1M tests will likely only be seen in svMotions or similar "async" operations.

I really need to take some time to benchmark and document performance with all-SSD setups. There are some notes stating that with an all-flash solution that does data reduction, VMDKs should be created as "Eager Zeroed Thick" in order to speed things up, but I'm not sure whether that applies in a ZFS scenario as well.
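
For reference, volblocksize is fixed when the zvol is created, so it's worth checking what's backing the extent. A quick sketch (the zvol names here are hypothetical):

Code:
# check the block size of the zvol backing an existing extent
zfs get volblocksize flashpool/vmstore
# it can only be chosen at creation time, e.g. a sparse 2T zvol with 16K blocks
zfs create -s -V 2T -o volblocksize=16K flashpool/vmstore16k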
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Here is my first test on a VM accessing this as an iSCSI target over a dedicated 10Gb network. Sync is on.

Capture.PNG

Pretty much saturating the link on the sequential reads/writes, so the network is not an issue. Thoughts on the performance of the other tests? It's about double what I get on a similar system with HDDs. I feel like I could do better...

What about tuning? HoneyBadger, you mentioned some all-flash tuning parameters you had. I'd be interested if those are easy to dig up.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is the network a single iSCSI link or MPIO? If MPIO, you might realize some gains by adjusting VMware's default round-robin IOPS policy from 1000 down to 1 (and if you aren't using round-robin, get on that).

Since this is an all-flash setup, set vfs.zfs.metaslab.lba_weighting_enabled=0, since you don't need to be concerned about writing to the "fast portion" of a spinning disk.
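
From memory, the VMware side looks roughly like this per device (substitute your own naa identifier and double-check against the docs for your ESXi version); the FreeNAS sysctl can be tested live and then added as a tunable in the GUI to persist:

Code:
# ESXi: use round-robin on the device and switch paths every 1 IO instead of every 1000
esxcli storage nmp device set --device naa.XXXXXXXX --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.XXXXXXXX --type iops --iops 1

# FreeNAS: disable LBA weighting, since there's no "fast edge" on flash
sysctl vfs.zfs.metaslab.lba_weighting_enabled=0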
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Single link. I didn't do MPIO because I didn't actually have enough spare 10Gb ports in the test setup (10Gb copper ports are still a bit pricey...). But I'd also think a single 10Gb link would be enough for anything but pure sequential transfers. The numbers above seem to bear that out: other than the sequential test, everything else falls well below line speed.

Will try the tunable you suggest.
 

Ctiger256

Dabbler
Joined
Nov 18, 2016
Messages
40
Here are the results from the same host and same VM but targeting an NFS datastore. Significantly lower performance on the high queue-depth/thread-count random tests.

NFS.PNG
 