FreeNAS 9.10 iSCSI + ESXi 5.5 - Write Performance Issues


zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Good morning, community ... I'm new to the FreeNAS/ZFS world, so please take it easy on me as I ask for some advice. I'll preface this by saying I have spent the last several days doing all sorts of performance testing, along with digging into the forums here to try to gain some insight into my questions, but I'm coming up empty at this point, so I am reaching out.

First, a quick punch list of my FreeNAS build.

Dell C6100
2 x Intel E5672 Xeon Procs
96GB RAM
1015M controller card flashed with V20
XX2X2 daughter card flashed with V20
10 x Seagate ES.2 3TB 7200 SATA Enterprise Drives
- 8 Drives setup as RAIDZ2 (connected to 1015M controller card)
- 2 Drives setup as Hot Spares (connected to ports 1-2 of XX2X2 daughter card)
2 x Intel S3700 100GB SSD (connected to ports 3-4 of XX2X2 daughter card)
- 10GB Partition on Each, Mirrored ZFS Partition for SLOG
- 12GB Partition on Each, FreeBSD Swap Partitions for SWAP
- 78GB Partition on Each, ZFS Partitions for L2ARC
2 x Intel NIC Interfaces, setup with LACP, terminated to a Cisco switch

Okay, that should do it ... Now you have at least a very basic baseline of where I'm coming from.

Some other high-level info about my environment: I have about 10 ESXi 5.5 hosts, each of which has 2 NICs (active/passive) for guest network traffic, and then 2 NICs set up as active/active with MPIO configured.

For the past several years, my storage backends have been based on Windows Storage Server 2012, providing iSCSI to the hosts. The two storage servers I have both use LSI 9265 RAID controllers; the drive array is set up as RAID6 with a hot spare on one box, and RAID60 with no hot spare on the other. Both systems have 12 of the same drives (Seagate ES.2), so the drive type is apples to apples with the new FreeNAS build I have created.

On the switch I have a port-channel defined for both the Windows storage servers and the FreeNAS box, using LACP, and the servers themselves are configured for LACP as well.

Again, as a note, on the ESXi host side I have MPIO configured properly: Round Robin path selection, with the path-switch IOPS limit set to 1.
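For reference, this is roughly what that tuning looks like from the ESXi shell; the naa.* device ID below is just a placeholder for one of the iSCSI LUNs:

~ # esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
~ # esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1
~ # esxcli storage nmp device list

The last command should list the device with "Path Selection Policy: VMW_PSP_RR" and show the IOPS limit in its device config.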

Okay, sorry for being so long winded, but I wanted to make sure that all the details of my current configuration were out there!

Down to the problem ... I have a test guest VM (Windows 8.1, 6GB RAM, 4 vCPUs, 2 x 40GB hard disks) with CrystalDiskMark 5, ATTO Disk Benchmark, and HD Tune Pro installed for benchmarking. The guest's Ethernet connectivity is turned off, so we are seeing only benchmark activity on the machine.

(Side background: I have 3 ESXi hosts, all running 96GB of RAM, all with E5672 procs, all idle except for this VM build, again making sure we are apples to apples. The ONLY thing that isn't apples to apples is that the FreeNAS box has zero traffic hitting it, since it's a new build, whereas my other two Windows Storage Servers are serving VMs for about 25 guests, which clearly means their results could be degraded; I understand that going into testing.)

Okay, so here are the performance numbers I am seeing from ATTO. What you will see is that READ access to FreeNAS+iSCSI sustains about 235MB/sec, which is expected because the network transit to the box is only 2Gbit, but the write performance is only hitting between 60-80MB/sec.

**** FreeNAS + iSCSI Benchmark Results ****

[Screenshot: ATTO benchmark results, FreeNAS + iSCSI (upload_2016-3-28_5-42-35.png)]


For comparison purposes, here are the results from one of my two guests running on the Windows Storage Server iSCSI. I'm only including one of the two, since the results are similar.

***** Windows Storage Server + iSCSI Benchmark Results *****

[Screenshot: ATTO benchmark results, Windows Storage Server + iSCSI (upload_2016-3-28_5-45-52.png)]

As you can see, the write results are a boatload better on the Windows Storage Server. The reads are better on FreeNAS, but that is clearly expected given that the FreeNAS array is idle; I fully expected to cap out at the full bandwidth of the network interface, which is the result I see. On the write side, though, it is much slower. And based on these results, I don't see it being a network LACP issue, because if we were stuck transiting only a single interface, I would still expect to see writes consistently capped at about 110MB/sec, and that isn't what I'm seeing.

I have been doing forum digging, and honestly I can't seem to find anything that would indicate where my problem might reside, or whether I even "have" a problem. I can only base my suspicions on the results I'm seeing.

On a side note, I've also run CrystalDiskMark on both machines, and see some strange results.

**** FreeNAS 9.10 + iSCSI Results ****

[Screenshot: CrystalDiskMark results, FreeNAS 9.10 + iSCSI (upload_2016-3-28_5-56-29.png)]



**** Windows Storage Server + iSCSI Results ****

[Screenshot: CrystalDiskMark results, Windows Storage Server + iSCSI (upload_2016-3-28_5-56-51.png)]

These results are a little strange, because the read numbers for the Windows Storage Server setup are all over the map. That probably points to other issues on the Windows Storage Server platform, but I'm not asking for help on that here; I'm simply trying to isolate performance on FreeNAS. Looking at the first line, you can see we are getting write performance of 207MB/sec on Windows, but on FreeNAS I can only get 108MB/sec, and remember, that's an idle array doing nothing else.

And then, in trying to determine whether the underlying file system is the bottleneck, I tried running a basic test directly on the FreeNAS box itself. The results here might be showing me that the underlying file system is where the write bottleneck is.

[root@freenas] /mnt/VOL-01# iozone -M -e -+u -T -t 32 -r 128k -s 40960 -i 0 -i 1 -i 2 -i 8 -+p 70 -C
Iozone: Performance Test of File I/O
Version $Revision: 3.420 $
Compiled for 64 bit mode.
Build: freebsd

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
Vangel Bojaxhi, Ben England, Vikentsi Lapa.

Run began: Mon Mar 28 06:15:07 2016


Machine = FreeBSD freenas.onlinespamsolutions.com 10.3-RELEASE FreeBSD 10.3-RE
Include fsync in write timing
CPU utilization Resolution = 0.000 seconds.
CPU utilization Excel chart enabled
Record Size 128 KB
File size set to 40960 KB
Percent read in mix test is 70
Command line used: iozone -M -e -+u -T -t 32 -r 128k -s 40960 -i 0 -i 1 -i 2 -i 8 -+p 70 -C
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 32 threads
Each thread writes a 40960 Kbyte file in 128 Kbyte records

Children see throughput for 32 initial writers = 881455.50 KB/sec
Parent sees throughput for 32 initial writers = 175462.69 KB/sec
Min throughput per thread = 17009.57 KB/sec
Max throughput per thread = 58323.66 KB/sec
Avg throughput per thread = 27545.48 KB/sec
Min xfer = 11520.00 KB
CPU Utilization: Wall time 3.526 CPU time 37.525 CPU utilization 1064.15 %



Children see throughput for 32 rewriters = 481988.43 KB/sec
Parent sees throughput for 32 rewriters = 357642.57 KB/sec
Min throughput per thread = 11176.98 KB/sec
Max throughput per thread = 135210.06 KB/sec
Avg throughput per thread = 15062.14 KB/sec
Min xfer = 40960.00 KB
CPU utilization: Wall time 3.665 CPU time 53.569 CPU utilization 1461.76 %



Children see throughput for 32 readers = 20502240.03 KB/sec
Parent sees throughput for 32 readers = 19000155.28 KB/sec
Min throughput per thread = 1178280.88 KB/sec
Max throughput per thread = 1309782.25 KB/sec
Avg throughput per thread = 640695.00 KB/sec
Min xfer = 36608.00 KB
CPU utilization: Wall time 0.502 CPU time 13.718 CPU utilization 2732.44 %


Children see throughput for 32 re-readers = 20769397.02 KB/sec
Parent sees throughput for 32 re-readers = 19369242.03 KB/sec
Min throughput per thread = 1311409.00 KB/sec
Max throughput per thread = 1325271.75 KB/sec
Avg throughput per thread = 649043.66 KB/sec
Min xfer = 25216.00 KB
CPU utilization: Wall time 0.498 CPU time 7.064 CPU utilization 1419.41 %



Children see throughput for 32 random readers = 20194862.50 KB/sec
Parent sees throughput for 32 random readers = 16348396.82 KB/sec
Min throughput per thread = 0.00 KB/sec
Max throughput per thread = 2476420.25 KB/sec
Avg throughput per thread = 631089.45 KB/sec
Min xfer = 0.00 KB
CPU utilization: Wall time 0.252 CPU time 2.535 CPU utilization 1003.85 %



Children see throughput for 32 mixed workload = 2077540.47 KB/sec
Parent sees throughput for 32 mixed workload = 296201.30 KB/sec
Min throughput per thread = 94.94 KB/sec
Max throughput per thread = 1006437.50 KB/sec
Avg throughput per thread = 64923.14 KB/sec
Min xfer = 128.00 KB
CPU utilization: Wall time 2.258 CPU time 20.343 CPU utilization 900.91 %



Children see throughput for 32 random writers = 290536.01 KB/sec
Parent sees throughput for 32 random writers = 104919.13 KB/sec
Min throughput per thread = 2751.82 KB/sec
Max throughput per thread = 90470.99 KB/sec
Avg throughput per thread = 9079.25 KB/sec
Min xfer = 8960.00 KB
CPU utilization: Wall time 6.612 CPU time 51.601 CPU utilization 780.38 %



Based on this result, I'm almost thinking the write performance is actually a problem on the local file system. Can anyone shed some light on this?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
RAIDZ2 is very good at sequential writes, but horrible at the sort of random disk writes needed for VM data storage. I post about this kind of thing almost daily. Suggest searching the forums for posts where I discuss iSCSI, mirrors, and RAIDZ.
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
RAIDZ2 is very good at sequential writes, but horrible at the sort of random disk writes needed for VM data storage. I post about this kind of thing almost daily. Suggest searching the forums for posts where I discuss iSCSI, mirrors, and RAIDZ.


Hello jgreco! I have actually seen many of your posts. I appreciate you taking the time to respond ...

I have seen the posts regarding the performance of mirrors (RAID10) over RAIDZ2, and while I understand the difference in performance overall, I was kind of surprised to see RAIDZ2 underperform the LSI RAID6 controllers by such a large margin. I honestly didn't think I would be hitting a disk performance bottleneck, since I'm only running gigabit Ethernet, not 10GbE.

I am absolutely open to destroying the RAID configuration I have right now, and rebuilding as RAID10 in order to eliminate RAIDZ2 as the bottleneck. I will likely do that this morning.

Can you chime in on my usage of the two SSDs, with respect to how I have set up mirrored 10GB partitions for SLOG and used the other partitions for swap and L2ARC?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You'd probably discover RAIDZ2 outperforms the LSI RAID6 controller at sequential operations, while sucking at random. Mirrors are more of a mixed bag. RAIDZ2 can outperform ZFS mirrors at sequential, but won't at random. Mirrors are almost always a win at random I/O, and block storage should almost always be considered random I/O.

You don't want to put swap partitions on the SSDs. In normal operation they would never be used; they're there for crisis situations, so you are much better off letting FreeNAS build its default swap partitions the way it normally does, on the hard drives.

Using a single SSD for both SLOG and L2ARC gets you the worst of both worlds.

1) SLOG is important to reduce latency in sync writes. That won't happen if the SSD is busy retrieving content from the L2ARC.

2) L2ARC is best as cheap SSD. With 96GB of RAM, you absolutely have enough RAM to go get yourself a pair of cheap 240GB SSD's and utilize them as L2ARC. Once warmed up, these will do amazing things for your read performance on the pool. You could also get a single 480GB SSD and play a waiting game to see if maybe you have enough RAM to do two 480's. The problem there is that you can start robbing the system of ARC if you go too heavy on the L2ARC. This is very complicated and involves many variables, including the workload, the working set size, the pool configuration, etc., so for the average user the only way to know is to actually put it into production and watch what happens.
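For reference, the end state at the pool level looks roughly like this (in FreeNAS you'd normally do it through the GUI volume manager; the da* device names are only placeholders):

zpool add VOL-01 log mirror da8 da9
zpool add VOL-01 cache da10 da11
zpool status VOL-01

The first line adds a mirrored SLOG on its own small SSDs, the second adds L2ARC striped across the larger SSDs, and the last just verifies that the log and cache vdevs show up.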
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Also... pool fragmentation and occupancy. Keep occupancy as low as possible; write speeds are highly tied to how full your pool is. It is better to have stupidly large amounts of free space.
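A quick way to keep an eye on both (pool name is just an example):

zpool list -o name,size,allocated,free,capacity,fragmentation VOL-01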
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
As jgreco mentioned, split the L2ARC and SLOG onto separate SSDs. But in the meantime, disable sync writes and test to see if that has an effect.
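Something like this, substituting the dataset or zvol that backs your iSCSI extent (the name below is a placeholder), and only as a test, since running this way is unsafe for real data:

zfs set sync=disabled VOL-01/iscsi-extent    <- re-run the benchmark with this in place
zfs set sync=standard VOL-01/iscsi-extent    <- then put it back
zfs get sync VOL-01/iscsi-extent             <- confirm the current setting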

Also, the 'zilstat' and 'zpool iostat -v <poolname>' commands are helpful to see realtime performance of the storage system.
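For example, in a second SSH session while a benchmark is running (pool name is an example):

zpool iostat -v VOL-01 1    <- per-vdev throughput, refreshed every second
zilstat 1                   <- ZIL/SLOG activity per interval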

And as a quick primer: a ZFS RAIDZ vdev has roughly the max IOPS of a single drive, and the sequential performance of the sum of its data drives. Vdev performance is additive across vdevs, hence striped mirrors being better suited for handling random IO.
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
Thank you both! I've torn down the RAIDZ2 and am rebuilding as RAID10 (striped mirrors). I will let you know how it looks.

Having said that, I have to say going RAID10 just scares me, because you don't have two-drive resilience. I understand that you can survive a multi-drive failure, as long as it isn't the other drive in the same mirror that fails. Just makes me feel uneasy.

If RAID6 is such a drag, how do NetApp, EMC, etc., live in those types of setups with great success?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
rebuilding as RAID10
Not to be pedantic, but if you are using ZFS, there isn't RAID10. The right term is something like striped mirror vdevs.
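At the zpool level it's just a pool built from multiple two-disk mirror vdevs, something like this (disk names are placeholders):

zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 mirror da6 da7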

Just makes me feel uneasy.
If you want, you can have 3-way mirrors.

If RAID6 is such a drag, how do NetApp, EMC, etc., live in those types of setups with great success?
2 answers - Lots of memory/caching. And tiering (the highest tier is never RAID6, it's usually RAID10 for hot data - the lower tiers use RAID6 for warm/cold data).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you want, you can have 3-way mirrors.

And the beautiful thing about three-way mirrors is that you get a boatload of read IOPS, as ZFS can be reading each component separately to serve different requests.
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
You'd probably discover RAIDZ2 outperforms the LSI RAID6 controller at sequential operations, while sucking at random. Mirrors are more of a mixed bag. RAIDZ2 can outperform ZFS mirrors at sequential, but won't at random. Mirrors are almost always a win at random I/O, and block storage should almost always be considered random I/O.

Jgreco-

Okay, so I blew up the existing RAIDZ2 and made it 4 mirrors, striped together.

[root@freenas] ~# zpool status
  pool: VOL-01
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        VOL-01                                          ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e411d03e-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
            gptid/e4b07ecb-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/e54f3b40-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
            gptid/e5fcdae5-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/e6a0f0ce-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
            gptid/e7535a99-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/e7f2716e-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
            gptid/e890fdad-f4eb-11e5-a4f0-008cfa039484  ONLINE       0     0     0
        logs
          mirror-4                                      ONLINE       0     0     0
            gptid/e00706c0-f501-11e5-a4f0-008cfa039484  ONLINE       0     0     0
            gptid/e7f4238c-f501-11e5-a4f0-008cfa039484  ONLINE       0     0     0
        cache
          gptid/f0129db0-f501-11e5-a4f0-008cfa039484    ONLINE       0     0     0
          gptid/f16e8e40-f501-11e5-a4f0-008cfa039484    ONLINE       0     0     0
        spares
          gptid/1b985b34-f4ec-11e5-a4f0-008cfa039484    AVAIL
          gptid/3d921ac2-f4ec-11e5-a4f0-008cfa039484    AVAIL


That said, I just completed another ATTO test, and I'm still seeing the same write results ...

[Screenshot: ATTO benchmark results after rebuilding as striped mirrors (upload_2016-3-28_13-59-15.png)]




Thoughts?
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
THOUGHTS???? Well, let's just start out and say "I'm An Idiot!!!!!"

So, I continued to pound and pound at this, and I couldn't explain why it wasn't working. I started to wonder if the network "could" be at the root of the problem. So I started to really do some digging ...

As I said early on, this system is set up with 2 x 1Gb ports in an LACP config, the same config I use for my two boxes currently running Windows Storage Server 2012. Upon examination, however, I found that when writes were being sent to the FreeNAS box, only one of the two ports was receiving traffic; the other was idle. That stumped me, because this is exactly the way my Windows Storage Servers are configured. It was further compounded by the fact that reads were actually doing proper MPIO from ESXi, and I could see reads coming over both ports, giving 2Gb of read bandwidth.

So I started to do a deep dive on Google regarding MPIO and FreeNAS. It turns out that FreeNAS, with its ports in LACP, doesn't play well with MPIO when the two NICs on the remote host are in the same subnet. While this works just fine in the world of Windows Storage Server, it absolutely doesn't work with FreeBSD / FreeNAS.

I found several articles, but this was the "here's how you need to do it" article:

http://jungle-it.blogspot.com/2015/01/esxi-50-freenas-93-mpio-iscsi-setup.html
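In short, the working layout ends up something like this (addresses and interface names below are just examples, not my real ones):

FreeNAS: igb0 = 10.10.1.10/24 and igb1 = 10.10.2.10/24, no lagg/LACP, both added as portal IPs in the iSCSI portal group
ESXi: vmk1 = 10.10.1.21/24 and vmk2 = 10.10.2.21/24, each vmkernel port pinned to its own uplink

~ # esxcli storage core path list

After a rescan, that last command should show two active paths per LUN.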

Anyway, after tearing it down, doing the re-ip, and standing it up ... Guess what???

[Screenshot: ATTO benchmark results after MPIO reconfiguration (upload_2016-3-28_21-14-7.png)]


LOOKS MUCH BETTER!!! Now we are seeing the full 2Gb speed! :)

For good measure, I even ran a CrystalDiskMark test, and the results looked great!

[Screenshot: CrystalDiskMark results after MPIO reconfiguration (upload_2016-3-28_21-20-39.png)]


I am still set up with the 4 mirrors striped together.

Having said that, do these numbers look good? Again, we're talking about an idle volume. Clearly the numbers that are 200+ MB/sec are great, because we're talking about a 2Gb Ethernet backbone, so it can't get any faster. The 4K Q32T1 write speeds are slower, and the 4K write speeds are much slower, but in comparison to my Windows Storage Server boxes these are actually flying. Is this what's normally to be expected?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It turns out that FreeNAS, with its ports in LACP, doesn't play well with MPIO when the two NICs on the remote host are in the same subnet. While this works just fine in the world of Windows Storage Server, it absolutely doesn't work with FreeBSD / FreeNAS.

The only reason it works in Windows Storage Server is because your average Windows admin often can't even understand the networking involved in a single subnet.

We cover some of this in the networking stickies.

https://forums.freenas.org/index.php?threads/multiple-network-interfaces-on-a-single-subnet.20204/

https://forums.freenas.org/index.php?threads/lacp-friend-or-foe.30541/
 

zimmy6996

Explorer
Joined
Mar 7, 2016
Messages
50
jgreco-

Thank you again for all your insight! Love this community!

One last question, going back to the RAIDZ2 vs. mirror vdev pros and cons. I did some benchmark tests and found comparable read/write results from the VMware guest side regardless of whether I was on RAIDZ2 or mirror vdevs. So my question is, knowing that this FreeNAS server only has a maximum of 2Gb of network transit between it and the hosts, would you say there is a real difference between RAIDZ2 and mirror vdevs? Obviously, with 10GbE interfaces you would likely see RAIDZ2 become the bottleneck before mirrors. But if the network transit is really the bottleneck, I'm almost thinking I would be more comfortable with two-drive failure tolerance.

Any last thoughts?
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
this FreeNAS server only has a maximum of 2Gb of network transit between it and the hosts, would you say there is a real difference between RAIDZ2 and mirror vdevs?
Yes, the RAIDZ2 will max out at around ~150 IOPS, and the four striped mirrors will handle roughly 4x that. If random IO isn't needed, then RAIDZ2 should be fine.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jgreco-

Thank you again for all your insight! Love this community!

One last question, going back to the RAIDZ2 vs. mirror vdev pros and cons. I did some benchmark tests and found comparable read/write results from the VMware guest side regardless of whether I was on RAIDZ2 or mirror vdevs. So my question is, knowing that this FreeNAS server only has a maximum of 2Gb of network transit between it and the hosts, would you say there is a real difference between RAIDZ2 and mirror vdevs? Obviously, with 10GbE interfaces you would likely see RAIDZ2 become the bottleneck before mirrors. But if the network transit is really the bottleneck, I'm almost thinking I would be more comfortable with two-drive failure tolerance.

Any last thoughts?

It's not just transfer speed performance, it's also IOPS. If all you're ever doing with your VMware guests is trite things, RAIDZ2 might be fine... but you still have to remember it'll act like a single disk. When you get a dozen VM's all running off a RAIDZ2 with six disks, it'll give you the IOPS of a single disk, or 1/12th of a disk per VM. When that same dozen VM's are all running off three mirror vdevs with two disks in each, you'll get the read IOPS of between three and six disks, and the write IOPS of three disks. You won't "see" that until you get some parallelism going, though, which is usually the thing that causes VM admins pain. I'd rather have a datastore that's "meh" for a single VM but can do a few dozen VM's worth of that without significant impact, which is why I really never bother with things like CrystalDiskMark. It can never provide a useful measure for what I need to know.

Also, RAIDZ2 will become very inefficient spacewise with small block updates.
 
