iSCSI Performance / Memory question


MhaynesVCI

Cadet
Joined
Jun 25, 2013
Messages
4
Hi!

So we're in the process of deploying FreeNAS as commodity iSCSI storage for VMware. As part of my PoC I benchmarked the system to see what kind of results we could expect. I used IOmeter and the test template found in the Unofficial VMware Storage Performance Thread (http://communities.vmware.com/thread/197844?start=0&tstart=0).

During my testing, my results for the random 8K tests dropped drastically (from 10K+ IOPS to 100-200). I was messing around with some advanced iSCSI settings, which triggered an iSCSI service restart, and magically all the tests started to return good results again.

I took a look at the system resource reporting and noticed the "wired bytes" memory value dropped drastically after the iSCSI restart (screenshot attached). And, as I continue to test, it's rising again and will likely cause the same problem if unresolved.

Now, I assume this has been caused by me inadvertently enabling either compression or de-dup on the ZFSvols I'm using as my iSCSI targets, but I can't for the life of me find out where to disable these attributes, or even verify they are indeed enabled. The root volume (mnt/vol1) I used to create them has its compression set to "inherit" (from where, I don't know) and its de-dup disabled. If I edit the source volume, will that propagate down to the ZFSvols? I also assume I want to disable atime for a performance increase. I'm hoping I don't need to re-create the ZFSvols to edit these values...

Any help would be appreciated. Thanks!
 

Attachments

  • freenas.PNG
  • freenas2.PNG

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
From the command line, "zfs get dedup poolname", for example. This is the authoritative way to know what ZFS is actually running with. Make any changes necessary through the GUI.
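
For example, to check both the pool and a zvol under it (vol1 is your pool; vol1/iscsivol0 is just a made-up zvol name, substitute your own):

Code:
# Show the properties ZFS is actually using for the pool and for a zvol under it
zfs get dedup,compression vol1
zfs get dedup,compression vol1/iscsivol0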

atime should have no measurable effect on an iSCSI file extent, and none at all for a zvol.

wired memory includes the ZFS ARC, and this should normally be pretty large. On a newly booted system, causing the ARC to fill:

Code:
last pid: 15982;  load averages:  2.76,  2.04,  1.11    up 0+14:19:17  13:50:37
33 processes:  1 running, 32 sleeping
CPU:  0.0% user,  0.0% nice, 63.9% system, 19.6% interrupt, 16.5% idle
Mem: 51M Active, 103M Inact, 45G Wired, 1168K Cache, 138M Buf, 61G Free
ARC: 40G Total, 601M MFU, 36G MRU, 16K Anon, 1766M Header, 1012M Other
Swap: 24G Total, 24G Free


Now, as the box hasn't been tuned at all, it hits the max it is willing to allocate to the ARC:

Code:
last pid: 16063;  load averages:  3.00,  2.58,  1.65    up 0+14:24:51  13:56:11
33 processes:  2 running, 31 sleeping
CPU:  0.0% user,  0.0% nice, 65.5% system, 12.7% interrupt, 21.8% idle
Mem: 51M Active, 103M Inact, 91G Wired, 1168K Cache, 138M Buf, 15G Free
ARC: 84G Total, 341M MFU, 78G MRU, 16K Anon, 2934M Header, 2202M Other
Swap: 24G Total, 24G Free


Closing the vdev causes the ARC to release, and wired plummets too. The pathetically slow E5-2609 takes about half a minute to actually do that, haha. Here's where it ends up. There's some other minor stuff going on, so the ARC is expected to maintain some modest size.

Code:
last pid: 16086;  load averages:  1.46,  2.18,  1.62    up 0+14:26:53  13:58:13
31 processes:  1 running, 30 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 49M Active, 103M Inact, 12G Wired, 1168K Cache, 138M Buf, 93G Free
ARC: 2157M Total, 6K MFU, 2972K MRU, 16K Anon, 2154M Header, 163K Other
Swap: 24G Total, 24G Free


So restarting the iSCSI service on a zvol-backed extent will cause the ARC to flush and wired to plummet, but that's not really a good thing... the ZFS ARC is very good at accelerating your read workload. If your ARC is sufficiently large to hold your working set, it is likely that your iSCSI SAN will feel much faster than the underlying pool actually is.

Benchmarks may not properly reflect what you can expect out of a ZFS pool in production unless you've characterized the workload and are using representative benchmarks.

Without more information about your system and your tests, the solid-information-to-guesswork ratio becomes uncomfortable.
 

dlavigne

Guest
Now, I assume this has been caused by me inadvertently enabling either compression or de-dup on the ZFSvols I'm using as my iSCSI targets, but I can't for the life of me find out where to disable these attributes - or even verify they are indeed enabled.

If dedup is enabled, you have to set the dedup property to off via the CLI command 'zfs set dedup=off (datasetname)'. However, data already stored deduplicated will not be "un-deduplicated" by changing the property; only data stored after the property change will not be deduplicated.
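
For a hypothetical zvol named vol1/iscsivol0, that looks something like:

Code:
# Stop deduplicating new writes, then confirm the property actually changed
zfs set dedup=off vol1/iscsivol0
zfs get dedup vol1/iscsivol0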

To remove the deduplicated data, copy all the data out of the dataset(s), set the property to off (or destroy and recreate the dataset(s), ensuring dedup=off), then copy the data back in.
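
If you do the destroy/recreate from the CLI rather than the GUI, the rough shape is the sketch below (made-up zvol name and size; destroying the zvol deletes everything on it):

Code:
# WARNING: destroys the zvol and all data on it -- disconnect any initiators first
zfs destroy vol1/iscsivol0
# Recreate it with dedup explicitly off
zfs create -V 500G -o dedup=off vol1/iscsivol0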

If you have no important data on the testing system, it would be easier to just destroy the pool and recreate it from scratch. This ensures everything is clean.
 

MhaynesVCI

Cadet
Joined
Jun 25, 2013
Messages
4
Thanks for the replies.

OK, so I've verified dedup and compression are disabled for the parent volume. Those commands don't work for the paths to my zvols, returning "not a ZFS filesystem", but considering they were created with inheritance I can't imagine they're any different.

It's been approximately 16 hours since the iSCSI service was restarted and the memory growth has subsided. The system has left itself about 4GB of free memory, so it doesn't look like a leak. I will run more tests today to validate my previous results.

For reference, the system is running atop a Xeon E5-2603 with 48GB of memory. The underlying storage is 27 × 1TB 7200RPM SATA2 disks behind a 3Ware controller of some sort (sorry, I don't have the exact specs). Honestly, when I got the first results of 10K IOPS I was surprised, as they seemed quite high... but 100-200 seems veeeery low. I've been testing from a VM, and also from a physical machine connected directly to the array and mounting an RDM.

The two tests I'm focusing on are IOmeter tests designed to emulate VM workloads and just random small-block I/O. Screenshot with details attached.
 

Attachments

  • iometer.png

MhaynesVCI

Cadet
Joined
Jun 25, 2013
Messages
4
Well, this just got more interesting.
IOmeter creates a 4GB file for its tests. For isolation, I've got a laptop plugged directly into the FreeNAS box, and from that laptop I've got a small iSCSI target mounted as an RDM. I've been using this same target with multiple machines, some virtual, some physical.

Anyway, so I ran my tests and got some pretty abysmal results - attached as test1

While the tests ran I monitored FreeNAS via top and watched network utilization on the machine running the test. The only test that seemed to cause any activity in either was the "Max Throughput 100% Read" test, which incidentally was the only test that returned decent performance.

I thought about that for a second and decided to delete the test file and let IOmeter recreate it. Sure enough, the difference in results was dramatic - attached as test2.

So these results seem a little more in line with what I'd expect to see from this array.

So nothing changed except me re-creating my test file. Is there some sort of contention issue going on here? The other machines I had mounted to this RDM are no longer actively accessing it. If it were an OS-level block, you'd think all writes would be f'd, not just sporadically like this - and with added latency. Very strange.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK, I apologize, I just don't have the time for this today, but I want to give you the key bits to ponder.

ZFS is a Copy-on-Write filesystem. The first time you write a file, it (probably) gets written as contiguous blocks. Sequential reads - even of large sections - will involve a single seek and be fast. If you update a block, ZFS writes a *new* block in free space on the filesystem. You wind up with fragmentation. If you randomly update lots of blocks, you end up with a massively fragmented file. Reads will be really slow if they have to be fulfilled by the pool, because it will be seeking all over the place to retrieve a block here and a block there. Does that make sense? If not, re-read until it does, and if it still doesn't, please squeak and I'll try to explain it better.

But the point here is that you can cause massive fragmentation of an iSCSI extent. You must consider this in any design. In practice, certain things can make this better (or worse!):

1) Let's say you have an NTFS filesystem on top of the iSCSI extent. You go to write a file on it. NTFS pushes disk blocks to the "disk", which is ZFS, and ZFS aggregates these and will tend to write the new blocks contiguously if possible. So the *file* being stored is possibly contiguous and at least has strong locality to the rest of itself.

2) Having a large ARC (or L2ARC) dramatically decreases the IOPS to the pool for commonly accessed blocks.

3) Randomly written blocks to a file (think: database) where the file must be read sequentially will result in totally awful read performance, unless counteracted through ARC/L2ARC, which means that for database usage, your database PLUS the remainder of the pool's working set must be able to reside in the ARC/L2ARC - or suffer poor performance.

Think about this for a bit and I'll bet you will gain insight into why your abysmal results occurred after extensive testing and why they appeared to clear up somewhat when you created a new test file.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Just as jgreco said. ZFS + iSCSI and ZFS + databases (assuming lots of writes) don't provide good long-term performance without pre-planning and mitigating the issues. It's been discussed frequently and thoroughly in the forum, which is why jgreco didn't write the normal two-screen post he has written before on this topic.

Unless you plan to use LOTS of RAM (think >32GB) and add a ZIL and an L2ARC, you can expect poor performance long term.
 

MhaynesVCI

Cadet
Joined
Jun 25, 2013
Messages
4
Thanks guys. Shoulda RTFM'd.

The box does have 48GB of memory; if I'm understanding correctly, I can carve out some of that for L2ARC and mitigate the performance degradation? Can you point me in the right direction for how to accomplish that?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
How to add an L2ARC? The manual... section 6 of the manual covers creating and adding disks to the pool (and ZIL and cache disks). Keep in mind that the bigger the L2ARC, the more RAM you will need to manage it, so you should try to "right-size" your L2ARC. Somewhere there's a thread discussing the ratio of MB of RAM used per GB of L2ARC you have.
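
Under the hood this boils down to adding a cache vdev to the pool; a rough CLI sketch (example device name only; on FreeNAS use the GUI so its config database stays consistent) would be:

Code:
# Add an SSD as an L2ARC (cache) device, then confirm it appears in the pool layout
zpool add vol1 cache da6
zpool status vol1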

If you plan to add a ZIL, keep in mind that size doesn't matter as much. The ZIL typically won't hold more than 1-2GB of data, as it only stores data that isn't yet committed to the pool itself. For example, if your zpool "write cache" is only 5GB, then a ZIL that is larger than 5GB will have unused space. The ZIL is only a second copy of data that needs to be committed to the zpool and is also in RAM... at all times. ZIL mirrors are always recommended, and SSDs with high-grade SLC flash are also recommended.
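
A mirrored log device, if you go that route, is added the same way (again, example device names only):

Code:
# Add a mirrored SLOG ("log") pair to the pool
zpool add vol1 log mirror da7 da8
zpool status vol1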

Edit: Found it. Each 4KB block used in the L2ARC consumes 180 bytes of ARC memory. So if you had a 120GB SSD, about 5.4GB of RAM would be dedicated to maintaining the L2ARC.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
which is why jgreco didn't write the normal 2 screen post he has before on this topic.

You callin' me wordy?

ZIL mirrors are always recommended and SSDs with high grade SLC is also recommended.

Um, needs-better-explaining (sorry).

The ZIL is a ZFS feature built into the pool. An external SSD, called a SLOG device, can be used instead; it is tons faster than the in-pool version. This is often referred to as a "ZIL" in conversation, but that's wrong; it's actually a SLOG ("separate intent log").

For normal iSCSI, this is not significant, because no sync writes are happening. This is - in theory - bad. Companies like VMware take extreme pains to ensure that you are paying big bucks to their storage partners for battery-backed RAM cache SAN arrays. Oh, wait, look, I meant "take extreme pains to ensure that your VM storage is coherent and consistent." They do that by requiring sync disk writes. FreeNAS will honor sync NFS access requests by default, but iSCSI has no way to signal this, so by default FreeNAS treats iSCSI writes as async. This means that a crash or power failure with data acknowledged but not committed to the pool could ruin the consistency of one of your VM disks.

But the alternative is to turn on sync, which slows things down massively. A SLOG helps ease the pain, but does not usually eliminate it. Cyberjock kind of glossed over the point that you should probably use sync and therefore probably want a SLOG, and went right into the details.
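
If you do decide you want sync semantics for a zvol-backed extent, the knob is the ZFS sync property on the zvol (hypothetical name again):

Code:
# Treat every write to this zvol as synchronous
zfs set sync=always vol1/iscsivol0
# Or go back to the default of honoring whatever the client requests
zfs set sync=standard vol1/iscsivol0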

Also, SLOG mirroring is no longer of particular value with the v28 pool format. The loss of a SLOG device should not result in the pool being rendered unusable.

As for L2ARC: it is much less desirable than ARC. You should attempt to figure out the size of the working set you have. You can define this a few ways: the most aggressive is "the number of unique blocks read from the pool"; the next most aggressive is "the number of unique blocks read from the pool more than once"; then those same things over a given time period, and so on. If you find that you have a working set size of 40GB and your ARC max size is 38GB, you might want to see if you can reduce your working set size somehow, but you may actually lose out trying to use L2ARC there, as you will evict a certain amount of super-fast ARC to maintain the L2ARC. On the other hand, if you have a working set size of 100GB, then a 120GB L2ARC SSD is quite possibly in your future... but then you also need to learn about tuning the L2ARC, and remember that benchmarking with lots of write activity may not accurately represent real-world performance, unless that's actually the sort of workload you need to sustain.
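
To get a feel for how large the ARC is running and how well your working set fits, FreeBSD exports the counters via sysctl (values are obviously system-specific):

Code:
# Current ARC size and the configured maximum, in bytes
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
# Hit/miss counters give a rough idea of how well the working set fits in ARC
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses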
 