FreeNAS 9.1 - Performance

Status
Not open for further replies.
Joined
Jun 25, 2013
Messages
41
Hi mav@,

I totally agree - it is not kosher to use an OS/appliance-centric counter for a real test. But it was only used as a very general way of knowing whether we were on the right track. The counter should be taken from the bonnie++ VM.

And the 200 GB dataset was sized so that it could fit in ARC - at least most of it.

The LSI 9207 should be good for about 700K IOPS, and my cache devices "only" give about 75K each, and we have 4.

We have set vfs.zfs.write_limit_shift = 9 ( according to the all-knowing Internet :smile: ) - at a glance, does that seem to be in order?
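
(For context on what shift 9 means, as I understand it: the write limit is derived by dividing physical memory by 2^write_limit_shift, so with roughly 256 GB of RAM - an assumption consistent with the autotune values below - 2^9 = 512 gives a per-transaction-group write limit of about 512 MB, which matches the vfs.zfs.write_limit_max value in the sysctl dump posted further down.)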

Furthermore, I have set the following:

(Autotune)
vfs.zfs.arc_max = 196137536808
vm.kmem_size = 217930596454
vm.kmem_size_max = 272413245568

( The all knowing.... )
vfs.zfs.vdev.cache.size = 16M
vfs.zfs.txg.synctime_ms = 200
vfs.zfs.l2arc_write_boost = 30 MB
vfs.zfs.l2arc_write_max = 100 MB
vfs.zfs.write_limit_shift = 9

Am I in the woods, or?

Thx again !
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
"The LSI 9207 should be good for about 700K IOPS" -- on my tests I was able to reach at most 450K IOPS per card. I am really curious how they measured these marketing 700K.

"vfs.zfs.arc_max = 196137536808" -- that is clearly less then 200GB of dataset. Depending on specific load pattern, having cache smaller then active data set size may be about the same as not having cache at all if cache is constantly purged. Walking on the edge can cause unexpected edge results if different appliance have different default tuning.
 
Joined
Jun 25, 2013
Messages
41
Hi mav@,

I am OK with the 450K on my system :smile: , but yes, I got the numbers from their marketing.

A lot better numbers with larger block sizes - feeling a bit stupid atm... I will post them as soon as they are gathered.

Still got some choppy writes to the system via NFS though. I have set the record size on the dataset to 128k - that made a huge difference in read speed.
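
(The dataset property involved is recordsize; a minimal way to check and set it from the shell, assuming the dataset is tank/sql:)

# show the current record size, then set it explicitly to 128K
zfs get recordsize tank/sql
zfs set recordsize=128K tank/sql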

I am suspecting some timing issues regarding the ZILs... flush / purge / commit.

Would you recommend that we use the ZILs as two separate devices, or should we mirror them?
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
Originally ZIL mirroring was insisted on because you could not remove a failed ZIL from the pool, so the whole pool was lost with it. Now that should be easier, but if you lose the ZIL you will still lose at least all transactions not yet committed to the main storage. For some environments that may be critical, so the reliability of the ZIL should not be lower than the reliability of the main storage.

Your 200GB ZILs are probably overkill in terms of size; you will hardly ever use more than several GB. But the latency and bandwidth of the ZIL are important. With NFS's love of synchronous operations, the ZIL should be able to handle a data stream at network rate while still staying within reasonable latencies. You could use `gstat -I 1s` during your benchmark to see I/O bandwidth and latencies for every block device in the system.

To check whether the ZIL is your bottleneck you may try disabling synchronous operations for a dataset with `zfs set sync=disabled dataset`. But beware that this is only for testing, because if the NAS crashes for any reason, any uncommitted data will be lost.
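
A minimal sketch of that test sequence, assuming the dataset in question is tank/sql:

# in a second session: per-device bandwidth and latency, averaged over 1 second
gstat -I 1s

# take the ZIL out of the picture for the benchmark, then restore it
zfs get sync tank/sql
zfs set sync=disabled tank/sql
# ... run the NFS / bonnie++ benchmark here ...
zfs set sync=standard tank/sql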
 
Joined
Jun 25, 2013
Messages
41
mav@ - brilliant - I will keep them mirrored then.

I only have 100 GB ZILs - I know they are still overkill in size, but it was hard to find reasonably sized SLC SAS-2 devices, so we opted for the smallest Seagate ones...
 
Joined
Jun 25, 2013
Messages
41
What do you get with cd /mnt/tank/sql/, dd if=/dev/zero of=ddfile bs=128k count=200000? Also try turning off compression to see what difference that might make.


[root@nas] ~# dd if=/dev/zero of=/mnt/tank/sql/ddfile bs=128k count=200000
200000+0 records in
200000+0 records out
26214400000 bytes transferred in 26.964567 secs (972179529 bytes/sec)

The above is with LZ4.
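
For the compression comparison suggested above, a minimal sketch, assuming the dataset is tank/sql and should go back to LZ4 afterwards:

# note the current setting, turn compression off, repeat the dd run, restore LZ4
zfs get compression tank/sql
zfs set compression=off tank/sql
dd if=/dev/zero of=/mnt/tank/sql/ddfile bs=128k count=200000
zfs set compression=lz4 tank/sql

Keep in mind that /dev/zero compresses extremely well, so the LZ4 number above largely measures how quickly zeros can be compressed rather than raw pool throughput.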
 
Joined
Jun 25, 2013
Messages
41
Just to let everyone know of my "progress"... I tested all the disks to be 100% sure that they perform on par.

Spindles
dd if=/dev/da[spindle] of=/dev/null bs=1M count=1000 ~ 160 MB/sec
dd if=/dev/zero of=/dev/da[spindle] bs=1M count=1000 ~ 165 MB/sec

Cache (MLC)
dd if=/dev/da[cache] of=/dev/null bs=1M count=1000 ~ 390 MB/sec
dd if=/dev/zero of=/dev/da[cache] bs=1M count=1000 ~ 175 MB/sec

ZIL (SLC)
dd if=/dev/da[zil] of=/dev/null bs=1M count=1000 ~ 65 MB/sec
dd if=/dev/zero of=/dev/da[zil] bs=1M count=1000 ~ 185 MB/sec

Much as expected, and none of the disks drifted by more than a few percent, so I do not foresee any glitches here.
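
For anyone wanting to repeat the sweep, a minimal sketch of the read-side loop, assuming the spindles are da0 through da3 (hypothetical names; the write variant overwrites the device and must only be run on disks that are not part of any pool):

#!/bin/sh
# sequential 1 GB read from each spindle, one at a time
for d in da0 da1 da2 da3; do
    echo "== ${d} =="
    dd if=/dev/${d} of=/dev/null bs=1M count=1000
done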

Here is my list of ZFS values. Most of them are defaults, but just to be sure that I am not doing something that is obviously wrong to everybody but me...

[root@nas] ~# sysctl -a | grep 'vfs.zfs.'
vfs.zfs.l2arc_write_boost: 8388608 -> 67108864
vfs.zfs.l2arc_write_max: 8388608 -> 67108864
vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 0
vfs.zfs.mfu_ghost_metadata_lsize: 0
vfs.zfs.mfu_ghost_size: 0
vfs.zfs.mfu_data_lsize: 0
vfs.zfs.mfu_metadata_lsize: 0
vfs.zfs.mfu_size: 0
vfs.zfs.mru_ghost_data_lsize: 0
vfs.zfs.mru_ghost_metadata_lsize: 0
vfs.zfs.mru_ghost_size: 0
vfs.zfs.mru_data_lsize: 0
vfs.zfs.mru_metadata_lsize: 0
vfs.zfs.mru_size: 0
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 0
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 67108864
vfs.zfs.l2arc_write_max: 134217728
vfs.zfs.arc_meta_limit: 49034384202
vfs.zfs.arc_meta_used: 0
vfs.zfs.arc_min: 24517192101
vfs.zfs.arc_max: 196137536808
vfs.zfs.dedup.prefetch: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.nopwrite_enabled: 1
vfs.zfs.write_limit_override: 0
vfs.zfs.write_limit_inflated: 12882690432
vfs.zfs.write_limit_max: 536778768
vfs.zfs.write_limit_min: 33554432
vfs.zfs.write_limit_shift: 9
vfs.zfs.no_write_throttle: 0
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.no_scrub_prefetch: 0
vfs.zfs.no_scrub_io: 0
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.free_min_time_ms: 1000
vfs.zfs.scan_min_time_ms: 1000
vfs.zfs.scan_idle: 50
vfs.zfs.scrub_delay: 4
vfs.zfs.resilver_delay: 2
vfs.zfs.top_maxinflight: 32
vfs.zfs.write_to_degraded: 0
vfs.zfs.mg_alloc_failures: 24
vfs.zfs.check_hostid: 1
vfs.zfs.deadman_enabled: 1
vfs.zfs.deadman_synctime: 1000
vfs.zfs.recover: 0
vfs.zfs.txg.synctime_ms: 200
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.cache.size: 16777216
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 29
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 10
vfs.zfs.vdev.larger_ashift_minimal: 1
vfs.zfs.vdev.larger_ashift_disable: 0
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_replay_disable: 0
vfs.zfs.sync_pass_rewrite: 2
vfs.zfs.sync_pass_dont_compress: 5
vfs.zfs.sync_pass_deferred_free: 2
vfs.zfs.zio.use_uma: 0
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.version.ioctl: 3
vfs.zfs.version.zpl: 5
vfs.zfs.version.spa: 5000
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32

Still looking for the equivalent of vfs.zfs.stop_doing_10_sec_drops_in_write_and_for_gods_sake_please_stay_below_10ms_latency=1 :tongue:
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
Looks like your main disk array is good based on the dd results. I should have mentioned monitoring the disk array during the test from a second connection using zpool iostat -v 10 to see the actual raw performance of the disks and what is going on; dd is subject to memory caching, which can greatly skew the results if you have enough RAM.

Assuming that RAM caching didn't skew your ~900 MB/s result too much, I'd say your spinning rust and LSI controller are in good shape. Next let's pull your log devices into the equation: do zfs set sync=always tank/sql and repeat the dd test to see how things perform once the log devices are brought into play. It would be great if you could catch the output from zpool iostat -v 10 at the peak of the test and post that.
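
A minimal sketch of that sequence, assuming the dataset is tank/sql and that sync was previously at the default:

# session 1: force every write through the log devices, then repeat the dd run
zfs set sync=always tank/sql
dd if=/dev/zero of=/mnt/tank/sql/ddfile bs=128k count=200000

# session 2: raw per-vdev activity while the test runs
zpool iostat -v tank 10

# afterwards: back to the default behaviour
zfs set sync=standard tank/sql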


It looks like the two of us are plowing the same field, pretty much; I'm also using NFS on FreeNAS as a datastore for ESXi, and I've been running it for a year now. The main difference is that I'm running FreeNAS as a virtual SAN, so I get a poor man's 10Gb network for my SAN traffic. I've grown tired of the 'use iSCSI' chant when it comes to ESXi and getting better performance, but we can save that discussion for offline or another thread.

Moving on to actually testing VM performance from ESXi, I do the following as a quick and dirty test to verify performance.

Take one of your Linux VMs and add a 2nd HD to it, about 40GB in size or so (just to give you some room if you want to try writing some really big sets of data). Don't bother to format the drive or do anything beyond just adding it to the VM. Then do "dd if=/dev/zero of=/dev/sdb bs=4k count=2000000" (assuming your 2nd drive is /dev/sdb). This removes any file system, guest VM memory caching, etc. from the equation. Also, ignore the 1st run of the test, because the disk will be thin provisioned and you will have the overhead of VMware growing the disk files, so the results will be poor for the 1st run. Result-wise, I'm seeing 115MB/s very consistently from the above test on my CentOS 6 VM; below is what my zpool iostat looks like in the middle of running the test. For this test I cut the bs setting back to 4k to make it a little more real-world-like; bs=128k plays to ZFS's native record size and is good for seeing what max throughput the disk array & pool can actually deliver.

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
san 3.46T 10.1T 0 2.89K 0 246M
raidz1 1.15T 3.38T 0 350 0 39.9M
gptid/02494665-14aa-11e2-8643-005056ac0e80 - - 0 91 0 9.99M
gptid/02a3aab2-14aa-11e2-8643-005056ac0e80 - - 0 91 0 9.99M
gptid/02fdf259-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.99M
gptid/035a380e-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.99M
gptid/03b7bab8-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.99M
raidz1 1.15T 3.38T 0 347 0 39.8M
gptid/9eec528c-14aa-11e2-8643-005056ac0e80 - - 0 110 0 9.94M
gptid/9f5912b4-14aa-11e2-8643-005056ac0e80 - - 0 110 0 9.94M
gptid/9fbf4dc5-14aa-11e2-8643-005056ac0e80 - - 0 110 0 9.94M
gptid/a0224c86-14aa-11e2-8643-005056ac0e80 - - 0 110 0 9.94M
gptid/a0912d86-14aa-11e2-8643-005056ac0e80 - - 0 110 0 9.94M
raidz1 1.15T 3.38T 0 349 0 39.8M
gptid/d5c825f3-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.94M
gptid/d627c990-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.94M
gptid/d68824be-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.94M
gptid/d6ec2b59-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.94M
gptid/d74c7d89-14aa-11e2-8643-005056ac0e80 - - 0 92 0 9.94M
logs - - - - - -
gptid/6a5637fd-a6c1-11e2-ae12-005056ac7786 2.32G 5.11G 0 1.87K 0 127M
-------------------------------------- ----- ----- ----- ----- ----- -----

I'm no expert on any of this, but dd tests, if you are careful not to let your results get skewed by some sort of cache, make for a quick and easy benchmark to try out various configs and settings. iozone or bonnie++ can probably give you better info, but they take much longer to set up, and when you want to tweak, test, and tweak again it's nice to have something simple that gives you meaningful results in a minute or two.
 
Joined
Jun 25, 2013
Messages
41
Hi pbucher et al,

First of all, thx to everyone contributing to this thread - brilliant!

The initial pool test with 'zfs set sync=always tank/sql'.

[root@nas] ~# dd if=/dev/zero of=/mnt/tank/sql/ddfile bs=128k count=200000
200000+0 records in
200000+0 records out
26214400000 bytes transferred in 1159.553446 secs (22607324 bytes/sec)

~ 22.5 MB/sec

And 'zpool iostat -v 10' while the test was running - note: the tank write speed shifted from 0 -> 65 MB/s during the test, with a lot of 'silence'.

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
tank 18.7G 2.70T 0 861 0 50.1M
mirror 3.76G 552G 0 17 0 83.2K
gptid/00829736-002a-11e3-ae0b-d4ae52e8db6b - - 0 15 0 89.6K
gptid/00e6f70f-002a-11e3-ae0b-d4ae52e8db6b - - 0 15 0 89.6K
mirror 3.75G 552G 0 17 0 83.2K
gptid/014d2e27-002a-11e3-ae0b-d4ae52e8db6b - - 0 17 0 94.4K
gptid/01b19e4b-002a-11e3-ae0b-d4ae52e8db6b - - 0 17 0 94.4K
mirror 3.73G 552G 0 36 0 191K
gptid/02181e7e-002a-11e3-ae0b-d4ae52e8db6b - - 0 33 0 200K
gptid/027e8f1b-002a-11e3-ae0b-d4ae52e8db6b - - 0 33 0 200K
mirror 3.73G 552G 0 24 0 128K
gptid/02e93fa7-002a-11e3-ae0b-d4ae52e8db6b - - 0 23 0 138K
gptid/034d92f0-002a-11e3-ae0b-d4ae52e8db6b - - 0 23 0 138K
mirror 3.73G 552G 0 19 0 109K
gptid/03d2acbd-002a-11e3-ae0b-d4ae52e8db6b - - 0 19 0 115K
gptid/0436fe53-002a-11e3-ae0b-d4ae52e8db6b - - 0 19 0 115K
logs - - - - - -
mirror 17.4M 91.0G 0 745 0 49.5M
gptid/04970518-002a-11e3-ae0b-d4ae52e8db6b - - 0 745 0 49.5M
gptid/04f28661-002a-11e3-ae0b-d4ae52e8db6b - - 0 745 0 49.5M
cache - - - - - -
gptid/053a3ddc-002a-11e3-ae0b-d4ae52e8db6b 456K 186G 0 0 0 7.60K
gptid/0574521e-002a-11e3-ae0b-d4ae52e8db6b 452K 186G 0 0 0 7.60K
gptid/05af4e8a-002a-11e3-ae0b-d4ae52e8db6b 392K 186G 0 0 0 6.80K
gptid/05eb41d6-002a-11e3-ae0b-d4ae52e8db6b 652K 186G 0 0 0 8.00K

-------------------------------------- ----- ----- ----- ----- ----- -----

BTW, I saw a lot of busy > 100% on the ZILs with gstat during the test.

I will start running the test from the *nix VM as you requested.

Thx again!
 
Joined
Jun 25, 2013
Messages
41
The test from the *nix VM gives the following :

1st run
root@bench-linux:/dev# dd if=/dev/zero of=/dev/sdc bs=4k count=2000000
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 457.97 s, 17.9 MB/s

2nd run
root@bench-linux:/dev# dd if=/dev/zero of=/dev/sdc bs=4k count=2000000
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 452.562 s, 18.1 MB/s
root@bench-linux:/dev#
Cheers!
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
The initial pool test with 'zfs set sync=always tank/sql'.

[root@nas] ~# dd if=/dev/zero of=/mnt/tank/sql/ddfile bs=128k count=200000
200000+0 records in
200000+0 records out
26214400000 bytes transferred in 1159.553446 secs (22607324 bytes/sec)

~ 22.5 MB/sec

logs - - - - - -
mirror 17.4M 91.0G 0 745 0 49.5M
gptid/04970518-002a-11e3-ae0b-d4ae52e8db6b - - 0 745 0 49.5M
gptid/04f28661-002a-11e3-ae0b-d4ae52e8db6b - - 0 745 0 49.5M

The test from the *nix VM gives the following :
2nd run
root@bench-linux:/dev# dd if=/dev/zero of=/dev/sdc bs=4k count=2000000
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 452.562 s, 18.1 MB/s
root@bench-linux:/dev#
Cheers!

Based on the above, the issue isn't NFS/ESXi/network related; as you can see, you can only get 22.5MB/s out of the array right on the box itself. My first reaction is to blame your log devices, because we've seen that your drives can push 900+MB/s when not going through the log devices. I've had this issue on my own array, where a fairly decent flash-based SSD brought me down to the 20-something MB/s mark. I'm not convinced that's your problem, though, because I then saw your log devices pushing almost 50MB/s, which indicates that the log devices are pushing far more than what we are seeing as an end result. My next suggestion is to remove the logs, do a sync test and see what you get, and then add the log devices back as a stripe instead of a mirror. With the current version of FreeNAS it's a waste of good money to mirror the SLOG devices; it was very important in the past, but in today's world you will only lose data if your SLOG device fails and is followed by a power failure/kernel crash within the next few seconds, before the ZIL contents get written out from RAM to the pool devices. Folks say a mirror helps system availability because a single failure won't degrade your performance, but I think a stripe will do much the same, with the bonus that you get even better performance while both devices are online. A final observation is that while the log devices were clocking 50MB/s, the pool devices were basically idle, which doesn't look normal; see my iostat above and you will see plenty of activity on the pool devices. I'm running factory settings, btw.
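
A minimal sketch of that log reshuffle, assuming the mirrored log shows up as mirror-5 in zpool status and the two SSDs are gptid/aaaa and gptid/bbbb (all names hypothetical):

# find the name of the log mirror, then remove it from the pool
zpool status tank
zpool remove tank mirror-5

# re-run the sync=always dd test with no SLOG at all for a baseline, then add
# both SSDs back as independent (striped) log devices
zpool add tank log gptid/aaaa gptid/bbbb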

In the end my gut says to look at the LSI card, SAS cable, and SAS enclosure as possible issues. I'm assuming you are using a 4-lane SAS cable to connect the controller to the drives. What brand and model is your JBOD enclosure? Are you using both ports on the LSI card or just one? I assume just one, since I don't see the devices listed as multipath devices, which is what I'm doing. My setup wasn't stable back in the early 8.3 beta days when doing multipath, and the performance was also slightly lower.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
One very important thing I overlooked, and was just reminded of in a very painful way: make sure your LSI firmware & BIOS are at level 14. Do a "dmesg | grep mps" and look for the line that reports the driver & firmware versions.

My lesson for the morning was that the combination of moving from firmware 13 & FreeNAS 8.3.1 to firmware 14 & FreeNAS 9.1 solved the issues with my flash-based SSD. Below is the zpool iostat from running my dd test locally against a dataset with sync=always. As you can see, both the pool and the log device are seeing fairly heavy I/O. I also went from a 20MB/s average to a 350MB/s average. Even when I was getting the 20MB/s, I was seeing the I/O load split between the pool and the log devices.

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
san 2.09T 2.44T 3 3.69K 6.39K 369M
raidz1 2.09T 2.44T 3 2.38K 6.39K 280M
gptid/bf8261de-b14d-11e2-b9a2-000c29a9f2bb - - 1 640 1.20K 70.4M
gptid/c04edb27-b14d-11e2-b9a2-000c29a9f2bb - - 1 636 1.25K 70.2M
gptid/c1192785-b14d-11e2-b9a2-000c29a9f2bb - - 2 635 1.45K 70.2M
gptid/c1e9357b-b14d-11e2-b9a2-000c29a9f2bb - - 2 634 1.40K 70.1M
gptid/c2bdb9a6-b14d-11e2-b9a2-000c29a9f2bb - - 1 637 1.10K 70.3M
logs - - - - - -
gptid/28111bb9-a77e-11e2-ae12-005056ac7786 268K 93.0G 0 1.30K 0 88.6M
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
:sigh: Taking the 2nd step and doing dd from the Linux VM took me back to the same bad performance. Below is once again the zpool iostat from the test. I need to double-check these results, because apparently my SSD doesn't like the lack of A/C in my basement data center. I'm gonna put it back into my big server next week and do some head-to-head comparisons against my ZeusRAM using FN 9.1. This reminded me that pure flash-based SSDs tend to perform much worse than their specs once the size of the writes falls down to real-world sizes. I don't know if it's possible, but if ZFS has some kind of tuning constant to queue the writes to the SLOG device for a second or two so it can write a bigger chunk of data, that might be what's needed to crank up the performance. Of course there is always the risk trade-off of not writing it to non-volatile memory right away; NexentaStor might have been doing this by default.

capacity operations bandwidth
pool alloc free read write read write
-------------------------------------- ----- ----- ----- ----- ----- -----
san 2.09T 2.44T 0 492 12.8K 38.0M
raidz1 2.09T 2.44T 0 224 12.8K 20.7M
gptid/bf8261de-b14d-11e2-b9a2-000c29a9f2bb - - 0 56 0 5.18M
gptid/c04edb27-b14d-11e2-b9a2-000c29a9f2bb - - 0 56 3.20K 5.18M
gptid/c1192785-b14d-11e2-b9a2-000c29a9f2bb - - 0 56 3.20K 5.18M
gptid/c1e9357b-b14d-11e2-b9a2-000c29a9f2bb - - 0 56 3.20K 5.18M
gptid/c2bdb9a6-b14d-11e2-b9a2-000c29a9f2bb - - 0 56 3.20K 5.18M
logs - - - - - -
gptid/28111bb9-a77e-11e2-ae12-005056ac7786 409M 92.6G 0 267 0 17.3M
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
BTW, I saw a lot of busy > 100% on the ZILs with gstat during the test.

That is not good. Busy above 100%, in my experience, usually means that some requests were running for more than a second. Not a good sign for a potential ZIL.

Also, is the "65 MB/sec" in the ZIL test a typo? From a modern SSD I would expect at least 5 times more!
 
Joined
Jun 25, 2013
Messages
41
pbucher, mav@ et al,

I will try to round up the answers from the previous posts in one.

Please forgive me if I have left some out...

The LSI 9207-8e *was* single-cabled to the JBOD, and yes, we are using a 4-lane SAS cable. As per your recommendations, we are connecting to the JBOD with dual cables today, to achieve a multipath environment.

The JBOD is the Dell MD1220. I have seen a few ZFS systems running with it without issues.

We have today moved the ZILs from the JBOD to the head and connected them to an internal backplane via an LSI 9207-8i. And we have inserted our spare ZIL to get 3 x SLOG devices.

Both LSI HBAs are running FW v16 - and I am using the LSI v16 driver from their website. I have tried the combination of FW v16 and driver v14 - as shipped with FreeNAS 9.1 - no difference.

I have *not* tried FW v14 and driver v14. Should I consider reverting the FW on the LSI HBAs to get a FW v14 / driver v14 system?

Also, we have installed new SD boot cards in the FreeNAS box to get faster reboots, so I will be rebuilding the system today. Hence, numbers will probably be coming your way tomorrow at the earliest.

I tried slicing my ZILs into 4 x 16G partitions each instead ( so 8 logical log devices added as stripes ) - and we got a much more responsive system under heavy writes. Still the occasional 'stall' for 5 seconds though :-(
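
For reference, a minimal sketch of that kind of slicing, assuming a blank SSD at da8 that is not part of any pool (names hypothetical):

# GPT scheme plus four 16 GB partitions on one SSD
gpart create -s gpt da8
gpart add -t freebsd-zfs -s 16G da8
gpart add -t freebsd-zfs -s 16G da8
gpart add -t freebsd-zfs -s 16G da8
gpart add -t freebsd-zfs -s 16G da8

# add the partitions as independent (striped) log devices
zpool add tank log da8p1 da8p2 da8p3 da8p4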

@mav@: I am afraid the 65 MB/sec was not a typo... but it is an SSD designed for writing, so I was not alarmed at first. I will check the specs from Seagate to see if that could be relevant.

BTW: During the pool write test, the gstat %busy spiked into the 2500 - 3000% range... and then silence :-(

Are my ZILs garbage? We will be moving them to a dedicated HBA and backplane today to see if that is the case. And we will bump the number to 3 pcs.

I will return with the numbers... have I forgotten to answer any questions?

Thx again everybody!
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
I have *not* tried FW v14 and driver v14. Should I consider reverting the FW on the LSI HBAs to get a FW v14 / driver v14 system?

Are my ZILs garbage? We will be moving them to a dedicated HBA and backplane today to see if that is the case. And we will bump the number to 3 pcs.

I would suggest using the built-in v14 driver and firmware. It's what most folks are using and has seen the most widespread use; I don't think it's the source of your problems, though.

For best performance I'd suggest dropping the mirror and striping across your SLOG devices. At the end of the day, though, if you want any kind of real write performance you need something like a Fusion-io board or a STEC ZeusRAM. Most flash-based SSDs are just not optimized for writing small chunks of data. Just remove the SLOG device from the pool, do a simple dd with bs=1k against the physical device, compare that to bs=1m, and you will see what I mean. I'll run some benchmarks on my STEC device here and post them later today.
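
A minimal sketch of that comparison, assuming the removed SLOG shows up as /dev/da9 (hypothetical name, and the test overwrites the device):

# small sequential writes, closer to what a ZIL actually sees
dd if=/dev/zero of=/dev/da9 bs=1k count=1000000

# large sequential writes, the kind of workload datasheet numbers are quoted for
dd if=/dev/zero of=/dev/da9 bs=1m count=1000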
 
Joined
Jun 25, 2013
Messages
41
Hi pbucher et al,

Did not have the skills to downgrade the LSI HBA FW to v14 - did downgrade the BIOS to v15 and ran with the v14 driver - no difference.

Still looking for a way of getting to v14 FW / v14 driver though ...

My SLOG / ZIL devices are individually able to sustain ~185 MB/sec of 128k blocks - so I would expect them to sustain at least the sum minus some overhead, in the area of 400 MB/sec. That is unfortunately not what I am seeing :-(

When I increase vfs.zfs.write_limit_shift to 10 ( ~250 MB write limit ) the ZILs are still stalling. I see data being written to my ZILs at ~25 MB/sec, then silence for 5 seconds, then a burst of 500 - 1000% busy ( according to gstat ), then silence, and then 5 seconds of 'normal' transfer again, and so on. Even if we opted for sTec ZIL SSDs, they are "only" about 50% better spec'ed than the ones we have, and I do not see that the ZILs we have are badly spec'ed compared to a lot of other 'successful' ZFS systems.

I am almost at my wits' end :-(

I have disabled sync / the ZIL on my pool and now we are getting numbers that are a total match for the Nexenta scores, which makes me agree with your earlier thought that Nexenta in 'performance' mode actually disables sync. Anyone?

I see some guys having problems with v16 FW / BIOS and the v16 driver... so my hope right now would be to wait for LSI phase 17 and see what that gives me... and I have requested a quote on 4 pcs of sTec ZIL SSDs just to be on the safe side :)

Thanks again to everyone involved in the commenting and suggestions - priceless!
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
Hi pbucher et al,
Still looking for a way of getting to v14 FW / v14 driver though ...

Here is a link to a prepacked LSI update I made for updating my FreeNAS servers. Just unzip it and run the doit.sh inside the lsi_update folder it creates.

Hi pbucher et al,
I have disabled sync / the ZIL on my pool and now we are getting numbers that are a total match for the Nexenta scores, which makes me agree with your earlier thought that Nexenta in 'performance' mode actually disables sync. Anyone?

This is good news; it shows that FreeNAS can stand toe to toe with NexentaStor. I've played around with NexentaStor a few times over the years and just haven't been able to get past some of its shortcomings. It also saves me the time of firing up another install of it to do a comparison; I've been sort of curious how it would perform on my current hardware, since it's been a few years since I last played with it.

Hi pbucher et al,
My SLOG / ZIL devices are individually able to sustain ~185 MB/sec of 128k blocks - so I would expect them to sustain at least the sum minus some overhead, in the area of 400 MB/sec. That is unfortunately not what I am seeing :-(

... and I have requested a quote on 4 pcs of sTec ZIL SSDs just to be on the safe side :)
These #s sound about right. I've gone back and done some fresh benchmarks with my STEC ZeusIOPS and it isn't really doing anything better than what you're posting. I'm in the middle of rebuilding my test/dev SAN, so I can't pull up a number for what I get with the ZeusRAM from a Linux VM, but I recall being able to clock about 120MB/s with sync=standard, the ZeusRAM, & FreeNAS 9.1 with default tuning (except the NFS server count set to 4 - I have 4 CPU cores for the SAN); try tweaking that to the number of CPU cores you have, or some variation of that. From what I've seen of commercial SAN rollouts, STEC SSDs are often used as L2ARC devices, with ZeusRAMs for low-end boxes and Fusion-io cards for mid- to high-end boxes. That said, it pretty much boils down to running sync=disabled or sinking some dollars into faster log hardware. If you can shake some budget loose, I'd suggest contacting the iXsystems guys and seeing if they will quote you a head unit running TrueNAS (FreeNAS's big brother) with a Fusion-io card in it and a small array that you could attach your existing JBOD to.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
Some #s to compare with:
When doing "dd if=/dev/zero of=/dev/sdb bs=4k count=2000000" from a Linux VM attached via NFS on the same box has my virtual SAN I'm seeing 110MB/s with sync=standard and a single ZeusRAM for the SLOG device. If I use sync=disabled I get 245MB/s. I think this is about the max you can get without either going to multiple striped SLOG devices are a Fusion IO board. I've tried putting the ZeusRAM on a separate path from the pool on a multipathed system and that did help some.

Also, I don't know if you have gotten around to multipathing your system yet, but I'd wait on that until you get this straightened out; multipathing FreeBSD still feels rather cutting-edge. Also, multipathing the system after the fact isn't straightforward - I ended up just nuking the pool and starting over to make sure everything was squeaky clean.
 
Joined
Jun 25, 2013
Messages
41
Hello all,

I just want to revisit this thread as we have now seen 3 weeks of very nice behaviour.

I also want to write down what we did to achieve it.

From my VMware ESXi hosts we now seldom see latency > 100 ms - mostly < 10 ms, and *never* > 200 ms. That is with load from at least 20 VMs doing heavy random read and write operations. Bear in mind we only have 5 vdevs.

I believe one of the three changes below did the trick - we did them all at once, which you should never do :) But time was of the essence.

1. We opted for some new SLOGs. sTec 18GB ZIL - 4 pcs. We added them as striped / JBOD.
2. We upgraded both the BSD-driver and FW/BIOS of the LSI 9207-8e to version 17.
3. We dropped the vdev.cache setting for the spindle drives.

My money is on the new SLOGs. When we were looking around for the best price on the sTecs, we got to talk with some system builders, who said that the Seagate SLCs we had initially chosen were too flaky / unstable on burst writes - some internal timing issue that makes them a very poor choice for a SLOG device. They said we could never go wrong with an sTec ZIL device, as they are custom-built for the job.

With what we are seeing, we have to agree on that one!

Furthermore - here are our tweaks to vanilla FreeNAS 9.1.1 on a 16-core box with 256 GB RAM (ARC), 4 x 200 GB L2ARC, 4 x 18 GB SLOG, and 5 mirrored vdevs of 600 GB drives ~ pool = 2.7 TB. NFS server only, on 10 Gb.

Sysctl
kern.ipc.maxsockbuf : 16777216
net.inet.tcp.recvbuf_inc : 524288
net.inet.tcp.recvbuf_max : 16777216
net.inet.tcp.sendbuf_auto : 1
net.inet.tcp.sendbuf_inc : 16384
net.inet.tcp.sendbuf_max : 16777216
vfs.zfs.l2arc_noprefetch : 1
vfs.zfs.l2arc_write_boost : 33554432
vfs.zfs.l2arc_write_max : 33554432

Tunables
vfs.zfs.arc_max : 196137707857
vfs.zfs.prefetch_disable : 1
vfs.zfs.scrub_delay : 3
vfs.zfs.txg.synctime_ms : 200
vfs.zfs.txg.timeout : 1
vfs.zfs.write_limit_shift : 9
vm.kmem_size : 217930786508
vm.kmem_size_max : 272413483136

NFS Number of servers : 14

The above is not meant as 'you should do it exactly like this' - but rather as a point of reference for your own testing.
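
If it helps, the runtime sysctls can usually be trialled from a shell before being made permanent; a minimal sketch using values from the lists above (boot-time tunables such as vfs.zfs.arc_max and vm.kmem_size still have to go in as Tunables and need a reboot):

# live changes, reverted by a reboot
sysctl vfs.zfs.l2arc_write_max=33554432
sysctl vfs.zfs.l2arc_write_boost=33554432
sysctl kern.ipc.maxsockbuf=16777216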

Thank you all again!!

Best regards
Bjarke
 