Testing my new FreeNAS build with NVMe

Evi Vanoost

Current setup
2x Intel DC P3700 (intended for write log)
2x Intel DC P3500 (intended for read cache)
6x Samsung 950 Pro NVMe (intended for either read cache or a small SSD pool)

Supermicro 1028U-TN10RT+
2x 2.4GHz E5-2640V3
16x 32GB RAM (DDR4 ECC RDIMM) - 512GB total
64x 6TB SAS3 hard drives in 2x 36-slot SAS3 SuperMicro enclosures

I'm going to incrementally update this post as testing goes along.

I haven't done any pool creation or anything yet; I'm just trying to work out which of these SSDs is fastest in this setup for reads and writes.

The problem seems to be that some of these drives use compression algorithms, so writing zeros skews the results. There isn't a fast enough random number generator available (I tried OpenSSL at ~300MB/s and /dev/random at ~60MB/s). So instead I created a 25GB memory disk and filled it with AES-encrypted zeros from OpenSSL (commands below), which I can then read back at ~3GB/s. Zeros can be read at 24GB/s.
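Broken out, the two steps look like this (note that as written the -pass argument is the literal string "dd if=...", not the output of that dd; wrapping it in $( ) would derive the key from actual random bytes, though for generating incompressible filler either way works):

# create a 25GB swap-backed memory disk as /dev/md1
mdconfig -a -t swap -s 25G -u 1
# fill it with incompressible data by AES-encrypting /dev/zero
openssl enc -aes256 -pass pass:"dd if=/dev/urandom bs=128 count=1" -nosalt < /dev/zero | dd of=/dev/md1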

I updated the firmware and BIOS of all controllers, the SSDs, the motherboard, etc. I saw a slight improvement, so I'm re-posting with new numbers (~1s faster on each test across the board).

I made a little script that does 100,000 I/Os per run.
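Roughly, the script boils down to three dd passes per drive and block size, something like this (the nvd0 device name is just an example; the real script loops over the block sizes and devices):

#!/bin/sh
DEV=/dev/nvd0    # NVMe device under test (example name)
RND=/dev/md1     # memory disk pre-filled with random data
BS=128k          # repeated with 32k and 4k
COUNT=100000     # 100,000 I/Os per run

dd if=$DEV of=/dev/null bs=$BS count=$COUNT    # read
dd if=/dev/zero of=$DEV bs=$BS count=$COUNT    # write (zeros)
dd if=$RND of=$DEV bs=$BS count=$COUNT         # write (random data)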

P3700
read 128k
13107200000 bytes transferred in 9.058048 secs (1447022580 bytes/sec)
write 128k
13107200000 bytes transferred in 13.474332 secs (972753225 bytes/sec)
write 128k (random)
13107200000 bytes transferred in 14.872023 secs (881332681 bytes/sec)

P3700
read 32k
3276800000 bytes transferred in 2.159134 secs (1517645395 bytes/sec)
write 32k
3276800000 bytes transferred in 3.514131 secs (932463860 bytes/sec)
write 32k (random)
3276800000 bytes transferred in 4.581190 secs (715272689 bytes/sec)

P3700
read 4k
409600000 bytes transferred in 1.067073 secs (383853737 bytes/sec)
write 4k
409600000 bytes transferred in 1.186724 secs (345151879 bytes/sec)
write 4k (random)
409600000 bytes transferred in 1.708557 secs (239734481 bytes/sec)

P3500
read 128k
13107200000 bytes transferred in 10.264678 secs (1276922666 bytes/sec)
write 128k
13107200000 bytes transferred in 13.479953 secs (972347617 bytes/sec)
write 128k (random)
13107200000 bytes transferred in 14.398046 secs (910345750 bytes/sec)

P3500
read 32k
3276800000 bytes transferred in 2.331348 secs (1405538805 bytes/sec)
write 32k
3276800000 bytes transferred in 3.818419 secs (858156221 bytes/sec)
write 32k (random)
3276800000 bytes transferred in 4.367819 secs (750214225 bytes/sec)

P3500
read 4k
409600000 bytes transferred in 1.024098 secs (399961754 bytes/sec)
write 4k
409600000 bytes transferred in 1.197006 secs (342187094 bytes/sec)
write 4k (random)
409600000 bytes transferred in 2.155121 secs (190058926 bytes/sec)

950Pro
read 128k
13107200000 bytes transferred in 9.070986 secs (1444958680 bytes/sec)
write 128k
13107200000 bytes transferred in 8.548027 secs (1533359679 bytes/sec)
write 128k (random)
13107200000 bytes transferred in 11.666276 secs (1123511909 bytes/sec)

950Pro
read 32k
3276800000 bytes transferred in 3.316984 secs (987885399 bytes/sec)
write 32k
3276800000 bytes transferred in 2.714811 secs (1207008479 bytes/sec)
write 32k (random)
3276800000 bytes transferred in 4.060683 secs (806957842 bytes/sec)

950Pro
read 4k
409600000 bytes transferred in 1.359640 secs (301256188 bytes/sec)
write 4k
409600000 bytes transferred in 1.389694 secs (294741150 bytes/sec)
write 4k (random)
409600000 bytes transferred in 2.436863 secs (168084939 bytes/sec)


The two Intels are fairly close; the 950 Pro is definitely 'slower' overall, although it holds up well against the Intels for larger writes (128k and up, i.e. 'desktop' loads), because at that point a number of controller algorithms kick in. In some cases it can write zeros about twice as fast, probably due to compression. On the 950 Pro there is a huge difference between writing zeros and writing random data (about half the speed), which makes it very hard to predict how it will do with your data.

Please let me know if there are any particular tests you want me to run, or if you have feedback on which commands to use for IOPS testing on the raw devices (and later on the pool).

So I created a pool on each SSD and tested it, then added the second SSD of the same type to the pool (stripe). Doing fio tests with 8 threads shows I have too much memory to do accurate testing: I get 1M IOPS and 3GB/s transfer rates for every SSD. In each write case the first run showed almost double the IOPS, but that just reflects how fast SSDs can write when they're empty (or recently TRIMmed), and I want to exclude that because it isn't an accurate reflection of long-term or worst-case behavior.

So I ran the test 5 times with 8 threads, IO depth 1 and a 5GB file per thread (40GB written per run), and only took the last measurement, to make sure the SSDs and ZFS copy-on-write actually have to re-write blocks. For 'new' writes, all measurements are practically double. Please also note that the 950 Pro has 512MB of DRAM cache, which it happily uses to acknowledge fsync writes (100k+ IOPS if you test with files <500MB), but AFAIK it has no backup battery. That would make these puppies BAD for ZIL, because you could easily lose 512MB worth of "committed" data when you lose power. I haven't found a way to bypass the DRAM buffer.
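For reference, a fio job along these lines matches that setup (a sketch rather than the literal command; the 4k block size is implied by the IOPS-to-KB/s ratio below, and the directory name is just an example):

fio --name=syncwrite --directory=/mnt/testpool \
    --rw=randwrite --bs=4k --size=5G --numjobs=8 \
    --iodepth=1 --ioengine=psync --fsync=1 --group_reporting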

P3700: 25614 random writes - 102456KB/s
2x P3700: 47707 random writes - 190831KB/s
P3500: 18948 random writes - 75793KB/s
2x P3500: 26382 random writes - 105529KB/s
950Pro: 10753 random writes - 43015KB/s
2x 950Pro: 7170 random writes - 28683KB/s

On paper the 950 Pro should be 'faster' in raw writes (110k vs 75k IOPS) than the P3700, but it seems that once you 'actually' use the thing, the speeds tumble.

I'm not sure what is causing the 950 Pro to do so badly, though. I created a pool with all six 950 Pros striped and still get really, really 'bad' performance, ~1000 IOPS on average for the pool. I do see, however, that turning off fsync through fio (not on the pool) yields better performance (20k IOPS on the pool). Looking at the activity LEDs, some of the tests apparently triggered TRIM (high LED activity on the drives without any process driving it), which may also be degrading performance.

So I put 61 drives in the enclosures. With just the drives in a 30-mirror layout I get ~4k IOPS of fsynced random writes and about 20k IOPS of straight writes. It doesn't really matter how many processes I launch; 1-16 processes yield approximately the same results. Read speed is overshadowed by the ARC and thus yields no significant results, but I'd assume read and write IOPS on a hard drive are roughly the same. Using RAIDZ2 (7 vdevs of 8 drives), however, I get ~2k IOPS fsynced, and with multiple processes that drops to ~700 IOPS.
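For reference, the two layouts correspond roughly to the following (the pool name "tank" and the da0..da59 disk names are placeholders):

# 30 two-way mirrors from 60 disks
VDEVS=""
for i in $(seq 0 2 59); do
    VDEVS="$VDEVS mirror da$i da$((i+1))"
done
zpool create tank $VDEVS

# RAIDZ2 alternative: 7 vdevs of 8 disks each
# (an alternative to the mirrored layout above - destroy the pool first if re-creating)
VDEVS=""
for i in $(seq 0 8 55); do
    VDEVS="$VDEVS raidz2"
    for j in $(seq 0 7); do VDEVS="$VDEVS da$((i+j))"; done
done
zpool create tank $VDEVS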

So RAIDZ2 is OUT for my purposes; it's just way too slow. Now I'm adding the P3700s as SLOG and the P3500s as read cache (L2ARC). Peak IOPS for fsynced random writes is now ~10k, and with 16 processes I can get up to 225K IOPS of straight, non-synced writes.
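That step is roughly the following (nvd device numbers are placeholders; the log is shown mirrored here, striping the two P3700s is also an option):

zpool add tank log mirror nvd0 nvd1    # 2x P3700 as SLOG
zpool add tank cache nvd2 nvd3         # 2x P3500 as L2ARC (read cache)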

Now for a straight 70/30 read/write test on that pool: 205K read / 88.2K write IOPS simultaneously; with fsync on: 85.7K / 36.1K IOPS.
I ended up putting the six 950 Pros in 3 mirror vdevs: 504K/217K on the same test without fsync, and with fsync 3236/1392 at 7.42 microseconds latency.

So TL;DR:
The Intels are WAY better at random reads/writes with fsync enabled; they are about twice as fast. They don't suffer as heavily from performance loss over time and are very consistent in their speeds.

The Samsungs may be a little faster at random reads/writes without fsync, but at the expense of losing your pool if you lose power. I'd say the Samsungs are probably a good option if you have a small pool and want a read cache (it's better than bare drives), but don't use them for 'big boy stuff'.
 

cookiesowns

Have you tried over-provisioning the 950s? They should perform quite a bit more consistently, but peak performance should be roughly the same, simply due to architecture and form-factor limitations.
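(In case it's useful, one simple way to over-provision is to partition only part of the drive and leave the rest untouched, ideally starting from a freshly secure-erased drive; the device name and size below are just examples.)

gpart create -s gpt nvd4                  # example device
gpart add -t freebsd-zfs -s 400G nvd4     # leaves ~20% of a 512GB drive unallocated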

Curious to see how FreeNAS performs with 512GB of RAM on the host. Are you using LRDIMMs by any chance, or regular RDIMMs?
 

Evi Vanoost

Hi,

I haven't tried over-provisioning the SSDs. I don't think I'll require steady performance out of them at this point; it's just an interesting data point to see how 'badly' cheap desktop SSDs actually perform in 'datacenter' situations.

FreeNAS (so far) performs quite well with 512GB. I am using 32GB RDIMMs. LRDIMMs are significantly slower and way more expensive.
 

SMnasMAN

Evi - this is REALLY cool, surprised I've missed it till now!
I've been looking and testing for nearly 6 months for a good way to bench FN (ideally on FN itself, to be as close to the disks/disk IO as possible).

(And THANKS A LOT for all those results, VERY helpful, and similar to what I've tested on various P3700, P3605 and other NVMe disks - I have the 2U version of your X10 SM server, among others.)

In the past I had tried doing something similar with just dd if=/dev/urandom of=testfile.dd (then cp-ing that around, or cp testfile.dd /dev/null), but your way is better (and I like the memdisk/ramdisk!). In any scenario it's still very difficult (impossible?) to test reads on FN, as the ARC ends up being tested.

A few questions:

1 - How are you getting around the ARC in your benchmarks? (It looks like you are, as those reads are not ARC speeds, yet you have A LOT of RAM; I'm guessing you are testing direct to disk, i.e. no ZFS involved, not even a single-disk ZFS volume.)

2 - When you test the 128k or 32k iterations, are you recreating the openssl enc ... fill but using:
(i.e. for the 128k test):
... pass:"dd if=/dev/urandom bs=128 count=1" -nosalt < /dev/zero | dd of=/dev/md1
(i.e. for the 32k test):
... pass:"dd if=/dev/urandom bs=32 count=1" -nosalt < /dev/zero | dd of=/dev/md1

(Or are you just changing the block size when you run the dd command against the actual disk, while reusing the same data generated once on your md1?)

3 - I'm guessing your results are direct to disk, i.e. no ZFS filesystem created, not even a single-disk pool, correct? (i.e. you're using if=/dev/da2 OR of=/dev/da2 vs. something like if=/mnt/p3700/opensslOutput.dd?)

Thanks a lot! I'm curious what you ended up running, as you have very, VERY nice HW (did you end up using FN / ZFS?)

(I'll post some results using your method but with a ZFS FS in place; most likely only the write results will be valid though, due to the ARC.)
 

Evi Vanoost

I used direct I/O to the raw disks for the SSD tests under FreeNAS. I then made the disks into a pool to test further, but as I mentioned, those tests proved useless due to the ARC.

You can avoid the ARC by rebooting and doing only a single run, or you can do as I did and test with data sets larger than your ARC can hold, then take only the tail end of the test for the most accurate results (as above: 5 runs, 8 threads, IO depth 1, 5GB per file / 40GB written, keeping only the last measurement).
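Concretely, on a box with 512GB of RAM that means something like the job below, where the total working set (8 x 80G here; both numbers are just examples) exceeds what the ARC can hold:

fio --name=arcbuster --directory=/mnt/tank \
    --rw=randread --bs=128k --size=80G --numjobs=8 \
    --ioengine=psync --group_reporting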

Note that your ARC is likely much smaller than you expect when your ZIL is large; that's due to the memory allocation the ZIL requires. So if read performance is a necessity, make your ZIL smaller.
 