iSCSI Performance & File System Parameters (Record/Block size etc).

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
I'm somewhat new to FreeNAS and ZFS but have been configuring Hyper-V and iSCSI for several years. I'm using this newly constructed system:

FreeNAS 9.3.1: Supermicro 6048R-E1CR36L, 1x Xeon E5 2603v3, 64GB ECC DDR4
OS: 2x Kingston V300 60GB
Volume1: 8x WD Red Pro 3TB (4x2 Mirror), 2x 200GB Intel S3710 (ZIL/SLOG), 1x Intel 750 800GB PCI-E (L2ARC)
Volume2: 4x Samsung 850 EVO (2x2 Mirror)


Storage is accessed by two clustered Hyper-V 2012 R2 nodes using the Windows iSCSI initiator, through a Netgear XS712T 10Gb switch. The NICs are all Intel X540-AT2; I've installed the latest drivers on the Windows nodes, set jumbo frames to 9216 on the switch and in Windows, and "mtu 9000" on the FreeNAS box. I've got MPIO configured with 2 paths per target.

I'm currently using three iSCSI targets on ZVOL device extents: a 3TB data volume on Volume1, plus a 1GB cluster quorum and a 1TB volume for VM OS disks on Volume2. Apart from the quorum, these are presented to the Hyper-V nodes as Cluster Shared Volumes (CSVs).

The typical VM workload is MSSQL feeding our internal data analysis tools written in Java or Python. Ideally the working set of the data stays in memory, so SQL and OS traffic will be the main sources of disk usage.

I'm in search of the optimal filesystem and volume configuration and I'm looking at these parameters:
- Dataset record size
- ZVOL block size
- ZFS compression
- Logical block size in the iSCSI extent
- NTFS/CSVFS allocation unit size used by the Hyper-V hosts for the CSV
- NTFS allocation unit size used by the guest within the VHDX

What is the best relationship between these values? Are there optimal ratios, or values that should match?
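For reference, this is roughly where each of these knobs lives; the dataset/zvol names and drive letter below are just placeholders, so treat it as a sketch rather than exact commands for my setup:

Code:
# ZFS side (on the FreeNAS box); pool/dataset names are examples only
zfs get recordsize,compression Volume1/dataset          # dataset record size and compression
zfs get volblocksize,compression Volume1/data-zvol      # zvol block size (fixed at creation)
zfs create -V 1T -o volblocksize=16K -o compression=lz4 Volume2/vm-zvol   # volblocksize can only be set when the zvol is created

# Windows side; the iSCSI extent's logical block size is set per-extent in the FreeNAS GUI
format E: /FS:NTFS /A:64K /Q    # NTFS/CSVFS allocation unit size for the CSV or guest volume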

With the parameters currently set to (in the same order as above):
- Default/inherit
- 16KB
- lz4
- 4096
- 4KB
- 4KB for OS, 64KB for SQL data volume

I'm using CrystalDiskMark to benchmark inside an otherwise idle Windows Server 2012 VM. At the moment sequential performance looks OK, and I have sometimes hit nearly 2GB/s on the Seq Q32T1 test, so the iSCSI networking and MPIO seem to be working, but that result is sporadic and it's usually more like the screenshots below.

However, the 4K random results seem appalling (worse than a single basic SSD) for test file sizes that should fit in the ARC/L2ARC, and should therefore be served from RAM or SSD whichever volume I'm using.
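(For anyone wanting to check the same thing: I believe the raw ARC/L2ARC hit and miss counters can be sampled from the FreeNAS shell before and after a benchmark run with something like the following; the kstat names are from memory.)

Code:
# ARC and L2ARC hit/miss counters (FreeBSD kstats)
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses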

I just got the following from CrystalDiskMark on volume1:

[Screenshot: upload_2016-1-13_13-54-39.png]


And volume 2:

[Screenshot: upload_2016-1-13_13-48-29.png]



Any ideas where I can make some improvements?

Thanks!
 

hugovsky

Guru
Joined
Dec 12, 2011
Messages
567
Well, you should start by not using jumbo frames. Jumbo frames only work if they're set the same everywhere, as I understand it. You have 9000 on the FreeNAS and 9216 on Windows. I think your CPU is a bit slow per core, but I can't say if that's a problem with your setup. 64GB isn't much RAM for VMs, and MSSQL doesn't like CoW file systems. But I'm just giving my opinion.

p.s.: You should put your hardware in your post. Many of us can't view signatures with mobile devices.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
Thanks for the signature tip; specs moved into the post. The CPU doesn't appear to be hitting 100% usage. It has 6x 1.6GHz cores, but it is the most basic CPU in the E5 v3 range.

MSSQL is unaware of the underlying ZFS file system; it's looking at an NTFS-formatted VHDX inside a VM, so how would the underlying storage system cause issues?

The MTU setting on the switch should have no effect so long as the endpoints are under that value; it is simply a maximum, AFAIK, and the switch config indicates that 9216 is its maximum. When I set "mtu 9216" on FreeNAS there were issues: portal discovery worked, but the volumes were inaccessible. If I set 9000 on both ends, I could not ping the FreeNAS on the iSCSI networks (with do-not-fragment set) using a payload of 8972 bytes (9000 minus 28 bytes of headers) and had to drop to a payload of 8958 on the Hyper-V servers. However, if I set the MTU to 9216 on the Windows nodes and 9000 on FreeNAS, I can ping with a payload of 8972 as expected, and 8973+ fails.
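For reference, these are the sort of do-not-fragment pings I mean (the addresses are placeholders for the iSCSI network IPs):

Code:
# From a Hyper-V node (Windows): -f sets don't-fragment, -l is the payload size
ping -f -l 8972 10.0.10.10

# From the FreeNAS box (FreeBSD): -D sets don't-fragment, -s is the payload size
ping -D -s 8972 10.0.10.21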
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
A CoW filesystem is not the best solution for MSSQL. You will end up with fragmentation.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
The VMs are not long-lived; they typically exist for short projects, after which they will be purged. It's also mostly a data import with minor manipulation, and then the database is just used as a source for analysis, so the potential for fragmentation seems low to me?

In any case I'm not even running SQL workloads yet, as I'm still testing the deployment, and I'm getting the low 4K read/write results and IOPS below 2000, which seems very low.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
The VMs are not long-lived; they typically exist for short projects, after which they will be purged. It's also mostly a data import with minor manipulation, and then the database is just used as a source for analysis, so the potential for fragmentation seems low to me?

In any case I'm not even running SQL workloads yet, as I'm still testing the deployment, and I'm getting the low 4K read/write results and IOPS below 2000, which seems very low.

Your system has some shortcomings, one of which is a large amount of L2ARC (800GB) with a tiny amount of ARC (~50GB). The Intel X540 is known to sometimes be problematic, and jumbo frames are generally a bad idea.

You'd be better off if you bumped the RAM to 128GB, and got a Chelsio network card on the FreeBSD side.

The 2000 IOPS is totally reasonable. Your four individual mirror vdevs are capable of perhaps 150 IOPS each under stress; times four, that's 600. ZFS is already giving you some win there. Properly configured and kept at a low percentage of pool utilization, you might easily get 10K IOPS out of it if a lot of the data is read and is available in the ARC/L2ARC. ZFS speed is largely dependent on fragmentation, which is largely dependent on percent-pool-fill plus how much stuff has happened on the pool. There are a lot of nonobvious interactions.
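(A quick way to keep an eye on both of those factors is just the pool summary; on a reasonably recent feature set it includes capacity and free-space fragmentation columns:)

Code:
# CAP is percent-pool-fill, FRAG is free-space fragmentation
zpool list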

Anyways my assessment is that you're doing pretty well and that your primary crisis is a lack of RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
By the way, the 2603 is a horrible CPU. Generally speaking, the 16xx's are a better choice for NAS unless you absolutely need the lanes a multi-CPU system offers. A 1650 v3 is about $550 and is a totally awesome performer.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
The choice of E5-2603 v3 was based on it being a dual-socket motherboard and allowing for future expansion with a second CPU, which would probably be driven by needing additional PCI-E cards (for example, extra network cards). Perhaps an E5-2620 v3 would have been a better choice given the conditions.

How does a larger L2ARC compared to ARC inhibit performance? Is there an ideal ratio for that?

I was expecting that as CrystalDiskMark wrote the test files they would appear in the ARC, or in the L2ARC if the ARC was full, and that when subsequently read, IOPS would be high even for 4K random since the data would be coming from the FreeNAS memory or an SSD?

I'll have a go without the jumbo frames and see how that pans out.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
The choice of E5-2603 v3 was based on it being a dual-socket motherboard and allowing for future expansion with a second CPU, which would probably be driven by needing additional PCI-E cards (for example, extra network cards). Perhaps an E5-2620 v3 would have been a better choice given the conditions.

The most generally sensible CPU is actually the 2637. Being a quad-core, high-clock-rate 26xx CPU means you can get up to 8 cores on your platform. It isn't clear whether the CPU is actually hurting you here or whether an upgrade would be useful, but if you do end up needing a bump, look at the NAS-optimized options. What usually happens is that performance gets tied to clock. This is sometimes extremely important (as with CIFS) and sometimes only vaguely important (compressing/decompressing blocks). If your CPU is just puttering along at two-fifths speed, there's a lot of room for extra latency.

How does a larger L2ARC compared to ARC inhibit performance? Is there an ideal ratio for that?

It isn't just that. It's that there's not enough ARC to properly identify useful blocks prior to eviction.

https://forums.freenas.org/index.ph...res-more-resources-for-the-same-result.28178/

But the L2ARC also robs space from the ARC for the L2ARC headers, so having a huge L2ARC actually squeezes the space available in the ARC; space that you need to identify the truly valuable blocks that you'd ideally like to migrate on out to the L2ARC. See how that's a nasty thing?

We usually suggest a 4:1 or 5:1 L2ARC-to-ARC ratio until you can actually observe your production workload and see if maybe more is acceptable. I've got a 128GB RAM box here with 768GB of L2ARC. After 30 days it's stabilized at about 516GB of L2ARC used. That means that most of the working set of the workload is actually present in ARC or L2ARC. When I put more VMs on the box, my best guess is that I could probably go to 1TB of L2ARC, but it's possible I should stop at 768. The best indicator is to keep an eye on how much of the L2ARC actually gets used over time.
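(If you want to see how much ARC the L2ARC headers are eating on a given box, the arcstats kstats expose it directly; names below are from memory, so verify them on your own system:)

Code:
sysctl kstat.zfs.misc.arcstats.size          # total ARC size in bytes
sysctl kstat.zfs.misc.arcstats.l2_hdr_size   # ARC memory consumed by L2ARC headers
sysctl kstat.zfs.misc.arcstats.l2_size       # data currently resident in the L2ARC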

I was expecting that as CrystalDiskMark wrote the test files they would appear in the ARC, or in the L2ARC if the ARC was full, and that when subsequently read, IOPS would be high even for 4K random since the data would be coming from the FreeNAS memory or an SSD?

That's one possible thing that COULD happen. However, the system is only evicting things out to L2ARC at a given rate, and so if you suddenly demand a whole bunch of performance out of the system, it is actually far more likely that the data in ARC will simply be tossed in order to make more room, and very little of it will be evicted to L2ARC. The L2ARC mechanism is designed to function over a long period of time, slowly identifying the most useful data to cache. It isn't frantically shuffling stuff around from ARC to L2ARC based on instantaneous demand, and in production you really don't want it to. It hurts benchmarks, admittedly. Unless your benchmark is small enough to reside entirely in ARC.

I'll have a go without the jumbo frames and see how that pans out.

I'm guessing that's not actually a problem for you right now, but definitely something to be aware of. The gain you might get from jumbo is often not worth the headaches.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
The most generally sensible CPU is actually the 2637..... If your CPU is just puttering along at two-fifths speed, there's a lot of room for extra latency.

Would you prioritise MHz over cores then? I would have guessed it the other way round with many clients accessing the box, although in this case the 2603 was used for cost reasons. The 2623 is half the list price of the 2637, so that could be an option too? At the moment, though, the CPU didn't seem overloaded; it spiked, but not excessively.

However, the system is only evicting things out to L2ARC at a given rate, and so if you suddenly demand a whole bunch of performance out of the system, it is actually far more likely that the data in ARC will simply be tossed in order to make more room, and very little of it will be evicted to L2ARC. The L2ARC mechanism is designed to function over a long period of time, slowly identifying the most useful data to cache. It isn't frantically shuffling stuff around from ARC to L2ARC based on instantaneous demand, and in production you really don't want it to. It hurts benchmarks, admittedly. Unless your benchmark is small enough to reside entirely in ARC.

That makes a lot of sense, thanks. I will see about the possibility of additional RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I think you'll know when you're out of CPU. Since you have it already, there's no harm in seeing how it pans out. As for memory, it's difficult - not impossible, but difficult - to have too much. The more you have, the better the system can perform.

Hard drives will be inherently somewhat IOPS-limited. ZFS can go far, far beyond what your HDD's are capable of for raw IOPS, but the trick is that you're basically throwing resources at the issue. For example, when ZFS writes blocks, even though you might think you're writing noncontiguous blocks (think: random database writes), ZFS may be laying those down contiguously. This works best when there's gobs (and I mean *GOBS*) of free disk space, so it really wouldn't shock me to see a pool that was 10% full, with lots of memory and L2ARC, with your particular pool hardware, hit many thousands of random IOPS even though the underlying hardware is only capable of hundreds of random IOPS.

But it's important to understand that when you get that kind of result, it's not that ZFS is magic, but rather that coalescing random writes into a sequential allocation (which works best with relatively low-utilization pools) and serving reads from the ARC are both things that can happen lightning fast compared to directly talking to a hard drive array and seeking heads around to do the work.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
Yes, I wasn't expecting high IOPS for reads/writes from the hard drives. I thought that on a fresh system with only a small test VM, all of it would fit in ARC/L2ARC, and thus I expected greater performance numbers from the benchmark; understanding that the ARC/L2ARC doesn't automatically contain that data until it becomes "hot" enough changes that expectation. I didn't expect "magic" from ZFS; I'm always a bit of a cynic, and this FreeNAS install is something of an experiment to test storage technologies.

However, the second volume is 4x Samsung 850 EVO SSDs, and I didn't add an L2ARC or ZIL/SLOG drive when creating it. I expected that this volume would manage around the performance of a single SSD. The sequential numbers seem to be around that level, but the random 4K tests are way below expectations - they are barely faster than the hard-drive-backed volume. Is that purely an iSCSI limitation, a ZFS limitation, a CPU limitation, or something I can tune to improve?

Is adding a second SSD volume in this way counter to ZFS philosophy/sense? Is it better to just have a single volume with the L2ARC and ZIL/SLOG? I wanted to see the difference between the cached/HDD layout and pure SSDs, and at the moment the Samsung SSDs seem to be pointless.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Just to be clear, the ARC is absolutely expected to hold your VM, if it actually fits. What I was referring to was the system that evicts things from the ARC into the L2ARC. This works more slowly. It's also tunable, because it's very conservative by default. The ARC itself will get rid of data when a block is freed ("deleted file" etc) or under memory pressure, and in that case it'll scan the stuff that is becoming eligible for eviction to see if any of it could be sent to L2ARC.
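(If you ever do want to experiment with that feed rate, the knobs are sysctl tunables along these lines; names from memory, so check them on your own box before changing anything:)

Code:
sysctl vfs.zfs.l2arc_write_max     # max bytes fed to L2ARC per feed interval
sysctl vfs.zfs.l2arc_write_boost   # extra allowance while the ARC is still warming up
sysctl vfs.zfs.l2arc_feed_secs     # seconds between feed passes
sysctl vfs.zfs.l2arc_noprefetch    # whether prefetched (sequential) data is skipped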

SSD performance as a pool is a complicated issue. I've got a small SSD pool (mirrored 500G 850 EVO) but have been too busy to play with it much. First off, RAIDZ on SSD is a bad idea; you avoided that pitfall, yay.

But then it gets a bit more complex. One of the big factors here is that the way ZFS treats the pool makes some assumptions. The write path out to a SLOG device, for example, is highly optimized for that task, but if you omit a SLOG device on an SSD pool and require sync writes, it goes v...e...r...y...s...l...o...w... because the in-pool ZIL requires the writes to go through the normal pool write mechanism, which for HDD isn't painfully slow, but for SSD, you start to really see how slow that is. It has to commit the writes to both mirror devices, traversing all the pool write code (which means selecting a vdev, allocating space, etc). So the irony of it is that SSD doesn't necessarily make for a lightning fast SLOG-less pool. Also, we rapidly reach a point where ZFS hits its limits in terms of CPU performance. Instead of limited I/O speeds and seek times that give the CPU some opportunity to breathe, SSD can put the system under a lot of stress as it could be just 100% busy. One of the reasons I went with an E5-1650v3 for our VM filer here was because I knew sooner or later there'd be a situation that called for more CPU.

You actually got me curious so maybe I'll report back later some iozone testing on the SSD pool.
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
Just to be clear, the ARC is absolutely expected to hold your VM, if it actually fits.

This has me suspicious that I have an unsolved issue. I had a very small test VM well under the size of the ARC - it would have been at most 20GB of a Windows 2012 R2 install plus the test files. Right at the start I did see sequential results hit nearly 2GB/s, but the random 4K results have never been any better than my screenshots above, and I would have expected better from the ARC.

So disabling sync writes on the SSD pool could be worth trying?

Is there mileage in matching up values for the things I mentioned in the first post, for example using 4K/4096 for the ZVOL block size, the logical block size in the iSCSI extent (how about using smaller 4K jumbo frames?), and the NTFS allocation sizes?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
This has me suspicious that I have an unsolved issue. I had a very small test VM well under the size of the ARC - it would have been at most 20GB of a Windows 2012 R2 install plus the test files. Right at the start I did see sequential results hit nearly 2GB/s, but the random 4K results have never been any better than my screenshots above, and I would have expected better from the ARC.

Fine. However, what you're doing is like measuring the water flow out of the end of a garden hose and saying you'd expect more pressure since you can see a water tower on the horizon. The route water takes to get from its source to the end of your hose is a complicated multifactor problem. What you show in your screenshots isn't particularly meaningful or useful other than to suggest maybe improvements are possible.

What you actually need to do is to look at performance of each subsystem and then identify and eliminate problems. One at a time.

So disabling sync writes on the SSD pool could be worth trying?

Everything's worth trying. Some things are smarter to try than others.
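For what it's worth, sync behaviour on a zvol is just a property, so it's easy to flip and revert while testing (the zvol name below is a placeholder; be aware that sync=disabled trades away crash consistency for in-flight writes):

Code:
zfs get sync Volume2/vm-zvol             # current setting ("standard" by default)
zfs set sync=disabled Volume2/vm-zvol    # test with sync write requests treated as async
zfs set sync=standard Volume2/vm-zvol    # revert when done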

Is there mileage in matching up values for the things I mentioned in the first post, for example using 4K/4096 for the ZVOL block size, the logical block size in the iSCSI extent (how about using smaller 4K jumbo frames?), and the NTFS allocation sizes?

You probably don't want a 4K zvol block size. That's the underlying storage system allocation size, so something like compression will never do anything meaningful. As with so many things, there are counterexamples... for example, with a really slow CPU and a really hot SSD pool, you might actually *want* that if you disabled compression.
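(As a sanity check before changing anything, it's worth seeing what compression is actually buying you on the existing zvols; zvol names below are placeholders:)

Code:
# volblocksize is fixed at creation; compressratio shows what lz4 is achieving
zfs get volblocksize,compression,compressratio Volume1/data-zvol Volume2/vm-zvol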

You probably shouldn't be using jumbo frames, and certainly shouldn't be doing that before everything else is working well. As indicated in the previously linked article, jumbo introduces alternate code paths which may cause all sorts of interesting ... crap.

So much is dependent on so many different variables, though, it's hard to make universal statements. For example, the way you might design a pool to store incompressible data is probably different than how you might store other things.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Ah and the results are in.

Code:
[root@storage3] /mnt/storage3-ssd# iozone -a -s 256g -r 3072
[...bla bla bla...]
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
       268435456    3072  444458  439822   905381   914302  740853  342074  864115 11262547   959590   335193   385016  889304   900711

iozone test complete.
[root@storage3] /mnt/storage3-ssd#  iozone -a -s 32g -r 3072
[...bla bla bla...]
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
        33554432    3072  453647  456902  9209207  9265739 9161585  454317 8688336 11267027  9220991   480921   452800 5975958  6007309

iozone test complete.
[root@storage3] /mnt/storage3-ssd#


So. On a system with 128GB of RAM, you really need something like a 256G test. If you try a smaller test, like the second one, things are distorted by ARC.

The first test suggests that write speeds are ~440MB/sec, which is generally consistent with the speeds of the underlying devices on an HBA. Read speeds leverage the presence of two devices. Random read is somewhat slower, which makes sense because the rated speed of the 850 EVO is 98K IOPS random read, or about 400MB/sec. Two units cooperating together would only ever peak around 800MB/sec, so 741MB/sec is pretty good. Random write is good, but with spec being 88K IOPS on random writes, that'd be 360MB/sec, and it's managing 342MB/sec.

The second test shows distortion by ARC. Suddenly all the read speeds are an order of magnitude higher. That's totally awesome. :smile:
 

iamafish

Dabbler
Joined
Jan 13, 2016
Messages
10
I've already returned the MTU to 1500 everywhere, and it's had little effect on the benchmark results: some things are better on average, some slightly worse. At the moment I am testing the difference made by various parameters, but again I'm yet to find a combination that makes a significant difference; I'm treating this as a learning exercise and not necessarily a hunt for "production settings". I get what you're saying about expectations, but if I know the drives can perform up to two orders of magnitude better when accessed directly than they do from within my test VM, then it feels like something is wrong...

I repeated your iozone test on the FreeNAS SSD pool, using a test size of twice my RAM (128GB), and the results are quite consistent with yours (considering 4 SSDs vs. 2):

Code:
[root@san-b02-1] /mnt/solid# iozone -a -s 128g -r 3072
        ....
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
       134217728    3072 2288741 2412699  3006963  3111538 2218611 2369500 2191826  5177935  2268813  2495781  2359875 2192828  2219578
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
So then what you do is you configure up a basic zvol on a single network connection, and focus on making that work, and then focus on making that work well. Eventually as you add ingredients to the soup, slowly, carefully, one at a time, taste testing after each one, you'll run across something (possibly MPIO?) that's causing it to get all hosed.
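One ingredient that's easy to isolate first is the raw network path, before iSCSI and ZFS even enter the picture; assuming iperf is available on both ends, something like this over a single link gives you a baseline (addresses are placeholders):

Code:
# On the FreeNAS box
iperf -s

# On one Hyper-V node, against a single iSCSI-network address
iperf -c 10.0.10.10 -t 30        # single stream
iperf -c 10.0.10.10 -t 30 -P 4   # four parallel streams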
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
Just a quick question for jgreco: why is RAIDZ on an SSD pool a bad idea?
(I assume you mean all variants of RAIDZ.)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Just a quick question for jgreco: why is RAIDZ on an SSD pool a bad idea?
(I assume you mean all variants of RAIDZ.)

For the iSCSI model, RAIDZ would be tempting due to the relatively high cost of the media compared to conventional HDD. One might even be tempted to lean on the general guidance that a vdev adopts the general performance characteristics of a single one of the underlying devices. This is kind of true, but with SSD the performance warts become more readily apparent: block storage is mostly small random I/O, and a RAIDZ vdev still delivers roughly the random IOPS of a single member device, whereas a set of mirrors scales IOPS with the number of vdevs. If you actually want an SSD pool for iSCSI, then do it right with mirrors.
 