FreeNAS 9.1 - Performance

Joined
Jun 25, 2013
Messages
41
We have built a NAS with the following HW :

Dell PE R715
2 X 8-core AMD 3.2 GHz
256 GB Memory
2 X 10 Gb ( Storage head )
LSI 9207
SAS-2 JBOD with 24 slots.

One ZFS Pool consisting of 5 mirrored vdevs ( Seagate 10.5K Savvio 600 GB ) + 4 Striped 200 GB MLC ( L2ARC ) + 2 mirrored 100 GB ZIL / Log. All with SAS-2 interface.

We have been trying OpenIndiana ( with napp-it ) and Nexenta, which were pretty similar in performance, but they come with a price tag, and we just want to "play" with the storage for the moment.

As I have been using FreeNAS since 0.7 - and am pretty fond of it - I would like to take it for a spin on the above HW. ( FreeNAS installed on an internal SD module )

We installed 9.1 ( one of the earliest Alphas ), and we usually upgrade to the newest nightly build.

We use the NAS as a secondary storage for our vSphere - so we only configure and use it for NFS connectivity.

On the Nexenta we saw hot / accelerated data hit 480K OPS Read & 350K OPS Write with a maximum of 20 ms latency ( usually < 5 ms ) over a total period of 30 days - even with a lot of activity.

On the FreeNAS 9.1 we are seeing maximum 16K OPS and our latency peaks at > 35K ms :-(

A little less than expected - and on the same HW :-(

So what could I be missing ?

I have tried the following :

- Rebuilt the ZFS pool a number of times on BSD.
- Enabled / disabled autotune.

I suspect some bad behavior from a driver / HW-related issue, as the only difference is the OS.

All the HW is updated to the latest FW - including the spindles and SSDs.

Any hints as to where I should go looking for the holy grail ?

Brgds.
Jugger
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sorry, I'm probably one of the few people around here with insight into stuff like this, and I've been a bit busy lately. Didn't notice the thread.

For a system like that, there's probably a fair bit of tuning that's needed.

Your latency is probably peaking at 35 seconds (hey that's nothing, I've had it out to minutes) because transaction group sizing is by default driven by memory size (1/8th system memory size, or 32GB in your case), and you have a system where the memory size is substantially larger than the pool's capability.

I assume you probably meant "Seagate 10K.5 Savvio". I'm going to guess that those are only good for ~100-120MB/sec transfer each. If you have ten of them as five mirrored vdevs, the most optimistic your pool speed could be is 600MB/sec. Usually ZFS won't give you quite that, and it would be interesting and helpful to have an actual idea of what your pool average write speed is: do something like a "dd if=/dev/zero of=testfile bs=1048576" and let it run for an hour before stopping it and recording the speed. It should be less than 600MB/sec.
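
A minimal sketch of running and stopping that test on FreeBSD/FreeNAS ( the /mnt/tank path is only an assumption - point it at a dataset on your own pool ):

dd if=/dev/zero of=/mnt/tank/testfile bs=1048576 &
# check progress without stopping it; FreeBSD dd prints a status line on SIGINFO
kill -INFO $!
# ...after an hour or so, stop it; dd prints the bytes/sec summary on exit
kill -INT $!
rm /mnt/tank/testfile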

Now this is the important bit. Are you tuning for throughput or responsiveness? These two trade off against each other under the ZFS model. For something like iSCSI or NFS, I'd prefer a server remain responsive ("lower latency") even under high system stress, and you can do this through tuning it in ways that reduce "performance".

The problem you were hitting is that ZFS was queuing up as much as 32GB of traffic in a transaction group for a five second period to a pool that could only write less than 3GB in that same five second period.
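
The knobs involved here on FreeBSD 9.x-era ZFS ( the names changed in later releases ) can be inspected like this, for example:

sysctl vfs.zfs.write_limit_shift     # default 3 -> txg write limit = RAM / 8, the 32GB above
sysctl vfs.zfs.txg.timeout           # seconds between txg syncs, default 5
sysctl vfs.zfs.write_limit_override  # hard byte cap; 0 = disabled, let the shift decide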

So.
 

Hanakuso

Cadet
Joined
Jul 7, 2013
Messages
8
Hi Mr. Greco,

I've read a number of your posts and think I'm having a similar problem as the gentleman who started the thread. My system specs are as follows:

HP DL180 G6
12 Slot LFF Backplane
2x Intel L5520 Xeon 4-Core 2.26 GHz
6x Seagate 7.2K 3TB HDDs
48GB RAM
LSI 9211-8i (have also tried with an HP Smart Array P410 with 256 MB Flash-Backed Write Cache)

I have tried FreeNAS 8.3, FreeNAS 9.1, NAS4Free 9.1, with the disks always configured as one RAIDZ2 vdev containing all 6 disks. I'm trying to use it as an ESXi 5.1 iSCSI target to hold a few VMs for a home lab.

My problem is that I'm experiencing extremely high latency. As soon as I initiate some kind of file transfer, the average latencies start shooting up to ~300ms. I get many peaks to above 3000ms, to the point where VMs become unresponsive and the datastore itself sometimes disconnects. I am aware of the ZFS "breathing" issue, and the file transfer graphs resemble that to the extreme, with file transfers fluctuating between 110MB/s and simply stalling out (going down to 0 KB/s) for a few seconds.

In some of your other posts, you mention that ZFS needs to be tuned, and that sometimes too much RAM can cause issues. As a test, I took out half the RAM, leaving 24GB, and ran the same tests using FreeNAS 9.1. The results were much improved, with average latencies ~10ms and peaks at ~200ms. This leads me to believe that the problem could be fixed with tuning. As you've explained, ZFS can be tuned either for responsiveness or for throughput, and I am much more interested in the former. I ran the following as you specified:

[root@FREENAS] ~# dd if=/dev/zero of=/dev/zvol/zoo/monkey bs=1048576
882589+0 records in
882588+0 records out
925460594688 bytes transferred in 3701.002535 secs (250056731 bytes/sec)
Ideally, I would like to use all 48GB of RAM that I have while also minimizing latency. What do I need to tune to accomplish this? I'd really appreciate any insight you could offer.

Thank you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
See bug 1531 (not really a bug), and also see my recent (incomplete) message on the SLOG/ZIL in Performance, to help understand the ZFS write cache better.
 

Hanakuso

Cadet
Joined
Jul 7, 2013
Messages
8
Thank you so much for the direction. This was definitely the tuning I was looking for. Using gstat, the %busy for the vdev never goes above 50%, and I'm not experiencing latencies higher than 10ms from ESXi. I'm also not experiencing sustained transfers greater than 60MB/s, but that was to be expected. Thanks again!
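
For anyone repeating this, a sketch of one way to watch that with gstat ( the filter regex is just an assumption matching whole da* disks - adjust it to your device names ):

gstat -f 'da[0-9]+$'   # per-disk %busy and ms/w while the transfer runs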
 
Joined
Jun 25, 2013
Messages
41
Thx jgreco,

Excellent feedback !

I will get right on it - and you're right about the Savvio models :)

Actually the disks give ~ 165 MB/s sustained write and ~ 200 MB/s sustained read - so we are pretty satisfied with the HW itself.

And you're also right about the latency / IO vs. throughput tradeoff. We would prioritise low latency, as we will be using it as an NFS mount point for ESXi.

Brgds.
Jugger
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So the overall summary is, you can have "fast" or "responsive" but they're kind of either/or. If you have a large write cache that'll be a win for "fast" but will suck for "responsive" if the system gets too busy. If you have a too-small write cache, that will totally suck for "fast" and will also hurt "responsive" because the system will struggle unnecessarily under high loads.

So I suggest getting things as stressy as you expect them to be, increasing that by some safety margin, making sure the system is doing a scrub at the same time, and then experimenting to see what you need to do to guarantee responsiveness. My current procedure for that is basically documented in 1531.
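
A rough sketch of that kind of stress run, assuming a pool named tank mounted at /mnt/tank:

zpool scrub tank                               # scrub running in the background
dd if=/dev/zero of=/mnt/tank/stress1 bs=1m &   # local write load on top of the
dd if=/dev/zero of=/mnt/tank/stress2 bs=1m &   # NFS/iSCSI traffic from the clients
gstat                                          # watch %busy and ms/w on the data disks
# when done: stop the scrub, kill the dd writers, clean up
zpool scrub -s tank
kill %1 %2
rm /mnt/tank/stress1 /mnt/tank/stress2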

So when you take a system that is suitably tuned in that fashion and make it idle, and then compare that to a default system: I've been finding that a tuned system is definitely slower than default, never less than half as fast, usually maybe 20-30% slower. But that's for I/O involving the pool. ZFS is a total win if you have the data in ARC/L2ARC for reads, for example.
 

Hanakuso

Cadet
Joined
Jul 7, 2013
Messages
8
If I make a RAIDZ2 pool with 6x 3TB disks with 10.3TB of available space, is it okay to make a zvol using an even 10TB of space from that pool?

I've run into a situation where, no matter how much tuning I do, latencies jump up into the thousands as soon as any kind of large file transfer begins. I'm wondering if it's because there is not enough free space? The 10TB zvol is only about 10% full.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
No, you should have at least 20% capacity free, probably more for any SAN-style use.
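
A quick way to check how full the pool actually is, assuming a pool named tank:

zpool list tank          # the CAP column is the pool-level fill percentage
zfs list -o space tank   # free space and what the zvol reservation is taking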
 

Hanakuso

Cadet
Joined
Jul 7, 2013
Messages
8
Should I have 20% free in the pool, or in the zvol? I have plenty of free space in the zvol, but only a few hundred megabytes in the pool that are unclaimed by the zvol. Does that make sense? Sorry about the questions, and thank you for your time.
 
Joined
Jun 25, 2013
Messages
41
Hi jgreco et al,

Going for the "slow" and responsive with your recommendations in mind.

Still seeing some spikes of about 1000 ms though ( mostly writes ) - but the 30-second drops have totally disappeared - Thx !

On a System with :

256 GB Memory
4 X 200 GB SSD L2ARC
2 X 100 GB SSD ZIL
10 X 600 GB SAS-2 spindles in 5 mirrored vdevs

Configured only for NFS towards ESXi.

Via your recommendations above I have come to the following tunables :

vfs.zfs.write_limit_shift = 9 ( about 62% of the real max write capability of the pool spindles )
vfs.zfs.txg.synctime_ms = 200
vfs.zfs.txg.timeout = 1

While troubleshooting I also added :

kern.ipc.maxsockbuf = 8388608
net.inet.tcp.recvbuf_max = 8388608
net.inet.tcp.sendbuf_max = 8388608
vfs.zfs.write_limit_override = 432537600 ( about 50% of real max, which should leave room for reads and / or scrubs )

( Do not see / feel any difference after the above is active )
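
For reference, reading those numbers back ( my arithmetic, based on the old write throttle where the txg limit is physical memory shifted right by write_limit_shift ):

echo $(( 256 * 1024 / 512 ))      # shift 9 -> 2^9 = 512, so ~512 MiB allowed per txg
echo $(( 432537600 / 1048576 ))   # override is in bytes -> ~412 MiB hard cap, wins over the shift
# txg.timeout = 1 -> a transaction group is synced at least once per second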

Autotune is responsible for these :

vm.kmem_size = 217931704012
vm.kmem_size_max = 272414630016
vfs.zfs.arc_max = 196135180461

Any hints as to where I should hunt for the last spikes in latency ?

Brgds.
Bjarke
 
Joined
Jun 25, 2013
Messages
41
Hi all,

After having based my numbers on a theoretical max - I decided to do a real-life test, and my pool could only write ~ 160 MB/sec - WTF :-(

As the problem seems to lie elsewhere than ZFS, I have now tuned for the above numbers, and expect to see a really responsive pool.

Anyway, 10 X SAS-2 spindles in 5 mirrored vdevs should not max out at 160 MB/s, given that the number is very close / equal to what I would expect from a single disk... hmmm...

Here is my pool :

[root@nas] ~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 512 in 0h0m with 0 errors on Thu Jul 11 10:48:22 2013
config:

NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/ad4db362-ea03-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
    gptid/76ece55e-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/774c628a-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
    gptid/77ad6135-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/781546bc-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
    gptid/787757fc-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
  mirror-3                                      ONLINE       0     0     0
    gptid/78dd070c-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
    gptid/79452961-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
  mirror-4                                      ONLINE       0     0     0
    gptid/79aac0c8-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
    gptid/7a0ca2b6-ea01-11e2-804a-d4ae52e8db6b  ONLINE       0     0     0
logs
  gptid/ced8d1a8-ea02-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
  gptid/f46b3776-ea02-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
cache
  gptid/d119a2b6-ea01-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
  gptid/f423eb33-ea01-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
  gptid/17ede83c-ea02-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
  gptid/3ba19304-ea02-11e2-804a-d4ae52e8db6b    ONLINE       0     0     0
spares
  gptid/956f014d-ea03-11e2-804a-d4ae52e8db6b    AVAIL

errors: No known data errors

All disks are connected via an LSI 9207 to a 24-slot JBOD. The LSI HBA is in IT mode and presents all the disks raw to the OS.
As we were seeing that the exact same system above - on both Nexenta and OpenIndiana - could almost saturate a 10 Gb NIC on writes to the pool, I suspect some HW glitches ... somewhere...
Anyone ?
Brgds.
Jugger
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Do not use write_limit_override without a good reason; it is better to allow ZFS to manage the limit, but just force it to make better default choices with write_limit_shift.
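
A sketch of what that looks like as loader tunables ( /boot/loader.conf, or the FreeNAS System -> Tunables screen ), with the override disabled:

vfs.zfs.write_limit_shift="9"      # let ZFS derive the limit from RAM
vfs.zfs.txg.timeout="1"
vfs.zfs.write_limit_override="0"   # 0 = disabled, so the shift stays in charge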
 
Joined
Jun 25, 2013
Messages
41
Hello again,

Scratched the pool, sysctls and tunables, and started from scratch with the FreeNAS 9.1 release.

I began with testing the drives one by one, to see if the basic setup was flawed in any way. I think it is.

With a dd if=/dev/daX of=/dev/null I get :

SAS-2 spindles = 11.5 MB/s
SAS-2 eMLC = 8.9 MB/s

Writing to the pool with dd if=/dev/zero of=/mnt/tank/sql/testfile I get :

53 MB/s

Reading from the pool with dd if=/mnt/tank/sql/testfile of=/dev/null I get :

14.2 MB/s

That is off by almost a factor of 20-30 from what I was hoping for, and a factor of 10-15 from what I would expect.

My HBAs are a 9207-8e for the JBOD ( spindles and logs ) and a 9207-8i for the cache SSDs - everything connected with SAS-2 end-to-end.

Both my HBAs are updated with the newest FW from LSI. I have tried the newest driver as well ( v16 ) - no difference ( FreeNAS 9.1 ships with v14 ).

All disks ( both spindels, cache and logs ) are updated to newest FW.

My pool has LZ4 compression, atime=off, dedup=off. 5 X vdev mirrors + 4 X cache SSD 200 GB + 2 X log SSD 100 GB.

Where should I investigate further?

Bear in mind that the above pool could saturate a 10 Gb/s NIC on Nexenta Enterprise.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
Bjarke, when doing tests with dd make sure to specify a block size (at least bs=128k or even bs=1m). Otherwise 512-byte blocks are used by default and you will measure latency more than bandwidth.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
What do you get with cd /mnt/tank/sql/, dd if=/dev/zero of=ddfile bs=128k count=200000? Also try turning off compression to see what difference that might make.
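
For the compression part, a sketch assuming the dataset from the earlier tests is tank/sql:

zfs set compression=off tank/sql
cd /mnt/tank/sql && dd if=/dev/zero of=ddfile bs=128k count=200000
zfs inherit compression tank/sql   # restore the pool default ( LZ4 here )
rm /mnt/tank/sql/ddfile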
 
Joined
Jun 25, 2013
Messages
41
mav@ and pbucher - sounds proper - thx a bundle.

I will do the tests tomorrow.

You gentlemen wouldn't happen to know which ZFS tunings are set via sysctl and which are set as a 'tunable' ? I'm having trouble finding any documentation.

Thx again!
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
You gentlemen wouldn't happen to know which ZFS tunings are set via sysctl and which are set as a 'tunable' ? I'm having trouble finding any documentation.
Most sysctls also have matching tunables, while the opposite is not always true. But I would not bet too much on tuning, since in most cases the system should work well out of the box. The reason why sysctls/tunables are not very well documented is that most systems don't require them.
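
A practical way to tell them apart ( a sketch, not an exhaustive rule ):

sysctl vfs.zfs.txg.timeout=1   # if a knob accepts a runtime write, a sysctl is enough
# if it is read-only it will complain; such knobs go into /boot/loader.conf
# ( FreeNAS: System -> Tunables ) and take effect after a reboot, e.g.:
#   vfs.zfs.write_limit_shift="9"
sysctl -d vfs.zfs.txg.timeout  # -d prints the description, where one exists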

When results that should be close differ by two orders of magnitude, it usually means either an incorrect benchmark (e.g. caching is used in one case but not in the other) or something very wrong in the setup (like broken hardware). Please describe your benchmark and how you measured those 480K IOPS of NFS performance.
 
Joined
Jun 25, 2013
Messages
41
Hi mav@,

Thx again!

I was looking to make my system responsive with very low latency rather than going for high throughput. I got some pointers regarding tunables I should try - and it helped a lot. Still seeing 100+ ms spikes though, and a lot of "silence" / drops when copying.

The 480K was achieved with a benchmark that read and wrote exclusively to a 200 GB file - I believe I used bonnie++. It was hit with 4K random I/O for about 2 days from an Ubuntu 12.04 LTS VM. After that I ran the report on the Nexenta and got the max it had logged over that period.

The 480K is almost certainly ARC-only hits, but at least the head can deliver.

I will try with and without compression as well.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
200GB of data set size looks suspiciously close to the 256GB of your system's RAM. Considering that some RAM will be taken up by various service things, you may hit a situation that needs active I/O to the main storage, where even 2-3K IOPS may completely saturate your disks. L2ARC should compensate for that a bit, but that is still I/O, and several fast SSDs can easily saturate the LSI 9207's IOPS limit.

Also, I would trust some kind of average IOPS more than a maximum. And when comparing two completely different OSes, I would use metrics that are really equal for both systems, preferably measured at a point that remains the same in both tests.
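
If the comparison gets re-run, something along these lines would keep the working set well past what ARC can hold; a sketch only - the bonnie++ flags are from memory, so double-check the man page:

bonnie++ -d /mnt/tank/bench -s 524288 -r 262144 -n 0 -u root
# -s 524288: 512 GiB of test data ( ~2x RAM, so it cannot live in ARC )
# -r 262144: tell bonnie++ the box has 256 GiB of RAM
# -n 0: skip the small-file creation phase; -u root: allow running as root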
 