Some insights into SLOG/ZIL with ZFS on FreeNAS

David E

Contributor
Joined
Nov 1, 2013
Messages
119
Yes, and there's good reason for each choice made. Look, VMware's stuff has to actually WORK. ESXi is not a "bad actor" for having made pragmatic choices about being paranoid with VM data. They've made the safest reasonable choices that can be generally implemented across a variety of hardware - specifically including NON-SCSI hardware. VMware sits in between a VM that might-or-might-not have virtual hardware that vaguely resembles SCSI or might implement something like IDE, and then data storage through a variety of technologies including FC, iSCSI, NFS, SAS, and others. It has to all WORK. This is the sucky real world. Nothing prohibits an admin who dislikes VMware's pragmatic and conservative choices from overriding them. But I think we can at least respect VMware for trying to make sure that the storage system does the right thing.

I'm pointing out that it is clearly not working as it ought to. ESXi's iSCSI implementation is not issuing the appropriate SCSI commands to the target storage layer and is thus not respecting the semantics and guarantees that filesystems rely on - namely the ability to issue synchronous writes and be assured when they have reached NV storage. And because of this failure, downstream NAS/SAN systems have to massively overprovision by assuming every write is a worst-case sync write (via ESXi's NFS interface), when in reality (depending on the workload) sync writes are some tiny fraction of actual writes.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
You realize this is nonsense: if the driver issued a SYNCHRONIZE_CACHE call to the emulated hardware, then it should dutifully pass this along and follow the SCSI semantics for flushing the cache. Otherwise it is incorrectly emulating the guarantees.



This also makes no sense. Let's for a minute assume that ESXi is emulating a SCSI card and also mounting the disk image under the covers using iSCSI. In this case there is quite literally a 1:1 relationship between the commands the guest OS's driver issues to the emulated SCSI card and what should then be passed to the underlying iSCSI connection. Now granted, this assumes this VM is the only one on this iSCSI mount; otherwise SYNCHRONIZE_CACHE commands will have detrimental performance effects for other VMs - but in practice I would suspect that the FUA bit is used far more often than SYNCHRONIZE_CACHE, which then should not cause an issue.

It's not nonsense. You're missing the point. A VMware VM might be using an emulated SCSI controller, yes. But it might also NOT be. So just how do you expect to pass a SYNCHRONIZE_CACHE call via a VM with an IDE disk, anyways?

The point here is that ESXi has to work for ALL the cases.

And have you ever actually WORKED with large scale storage hardware? Issuing a SYNCHRONIZE_CACHE to some types of large SAN devices can cause serious problems, because it ACTUALLY DOES IT ... flushing gigs of already battery-backed write cache out to disk. Eugh.
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
ESXi's iSCSI implementation is not issuing the appropriate SCSI commands to the target storage layer and is thus not respecting the semantics and guarantees that filesystems rely on - namely the ability to issue synchronous writes and be assured when they have reached NV storage. And because of this failure, downstream NAS/SAN systems have to massively overprovision by assuming every write is a worst-case sync write (via ESXi's NFS interface), when in reality (depending on the workload) sync writes are some tiny fraction of actual writes.

I'll agree that ESXi's iSCSI setup is less than ideal. Following jgreco's logic, everything over iSCSI should have the FUA bit set, which would then mirror what they are doing on NFS. Then again, if the FUA bit isn't honored - as in istgt - it's a moot point.

I agree with jgreco that forcing sync on NFS is the correct thing to do. The real world is just too messed up with hardware/firmware/etc. combos that don't always do what should be done. Not to mention IT shops and individuals who do the wrong thing, be it out of ignorance or lack of funds.

Today's Tip: For VMs with heavy I/O loads (aka my Oracle servers) I simply bypass VMware completely and directly mount NFS shares on my Oracle servers to store their data & log files. That way I lose the overhead of the filesystem -> virtual hardware -> ESXi path, plus the app can control what's async & sync. I use a separate dataset from my ESXi NFS share so I can configure it differently if desired and do fun things like snapshot & replicate it every hour (see the sketch below). You could do something similar with Windows servers and iSCSI for I/O-heavy things like Exchange/SQL Server.
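
A rough sketch of what that separate dataset could look like on the FreeNAS side. The pool, dataset, server and mount point names (tank/oradata, filer, /u02) are made up for illustration, and the property values are just one plausible choice, not settings taken from this thread:

# On the FreeNAS box: a dedicated dataset for the database files,
# configured independently of the ESXi datastore dataset.
zfs create tank/oradata
zfs set recordsize=8K tank/oradata      # assumption: match the DB block size
zfs set sync=standard tank/oradata      # let the application decide what is sync
# ...then export it over NFS through the FreeNAS GUI as usual.

# Snapshot it on its own schedule, e.g. hourly from cron:
zfs snapshot tank/oradata@hourly-$(date "+%Y%m%d%H")

# Inside the guest OS (bypassing ESXi entirely): mount the share directly.
mount -t nfs filer:/mnt/tank/oradata /u02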
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
It's not nonsense. You're missing the point. A VMware VM might be using an emulated SCSI controller, yes. But it might also NOT be. So just how do you expect to pass a SYNCHRONIZE_CACHE call via a VM with an IDE disk, anyways?

The point here is that ESXi has to work for ALL the cases.

I assure you that SATA/IDE controllers have similar commands requiring confirmation that bits hit the disk; I can go dig them out if needed. My point is that this should be straightforward to emulate and to translate between layers.

And have you ever actually WORKED with large scale storage hardware? Issuing a SYNCHRONIZE_CACHE to some types of large SAN devices can cause serious problems, because it ACTUALLY DOES IT ... flushing gigs of already battery-backed write cache out to disk. Eugh.

Not with NetApp-size boxes, no. But consider this: if you want safety now, you have to make everything synchronous, in which case you are flushing your cache every time a bit hits it anyway. There is nothing to lose here (unless you are happy with a fully async system that ignores guest OS semantics), and only safety (and performance, if you are currently making everything synchronous) to gain.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
And I assure you, if you can actually make it work out in all cases, you're doing better than me and all the idiot hacks over at VMware. It isn't easy. It isn't trivial. It likely isn't even possible given all the use cases. This is a pointless argument. If VMware could do X, Y, and Z, then yes, it'd be possible. But the list of X Y and Z are daunting, not least of which because they include depending on operating systems to do certain things that they don't actually do, storage systems behaving differently than we'd all prefer, etc.

As for "flushing your cache every time a bit hits it", that makes no sense. A sync write is allowed to be committed to stable, nonvolatile storage. It doesn't have to be stored on the ultimate destination (and in many cases these days there isn't even a single something that's identifiable as the "ultimate destination", given storage tiering systems). The only important property is that the same block be returned even if the storage system loses power, crashes, etc., in the meantime. It is perfectly fine for it to sit in a write cache and be aggregated. Such write performance even on a low end LSI RAID controller with BBU can easily run into the many hundreds of MBytes/sec.

From my point of view, this is a pointless discussion. It'd be great if it worked differently. It'd be great if it COULD work differently. But it works the way it does for good reason. We've spent time trying to help you understand, and you don't seem to quite see the big picture, just smaller bits of it. I am not likely to reply further. You are encouraged to read all the standard materials on this thoroughly explored topic.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
And I assure you, if you can actually make it work out in all cases, you're doing better than me and all the idiot hacks over at VMware. It isn't easy. It isn't trivial. It likely isn't even possible given all the use cases. This is a pointless argument. If VMware could do X, Y, and Z, then yes, it'd be possible. But the list of X Y and Z are daunting, not least of which because they include depending on operating systems to do certain things that they don't actually do, storage systems behaving differently than we'd all prefer, etc.

I think you are making it sound overly complex. ESXi implements exactly four SCSI host controllers for guest storage emulation (BusLogic Parallel, LSI Logic Parallel, LSI Logic SAS, and VMware Paravirtual) - note that these all speak SCSI - and for actual storage pools there are only a few out-of-the-box options as well (Fibre Channel, iSCSI, local SCSI, NFS). The bulk of the combinations are SCSI to SCSI, with only two being SCSI -> FC or SCSI -> NFS. Even if it were not possible to map synchronous write metadata between all combinations, ANY combination that worked would be an improvement over stating that everything is synchronous, or that nothing is.

As for "flushing your cache every time a bit hits it", that makes no sense. A sync write is allowed to be committed to stable, nonvolatile storage. It doesn't have to be stored on the ultimate destination (and in many cases these days there isn't even a single something that's identifiable as the "ultimate destination", given storage tiering systems). The only important property is that the same block be returned even if the storage system loses power, crashes, etc., in the meantime. It is perfectly fine for it to sit in a write cache and be aggregated. Such write performance even on a low end LSI RAID controller with BBU can easily run into the many hundreds of MBytes/sec.

I meant flushing volatile cache to NV cache/storage. As you said, it is of course unreasonable to expect a flush all the way to the final destination.

From my point of view, this is a pointless discussion. It'd be great if it worked differently. It'd be great if it COULD work differently. But it works the way it does for good reason. We've spent time trying to help you understand, and you don't seem to quite see the big picture, just smaller bits of it. I am not likely to reply further. You are encouraged to read all the standard materials on this thoroughly explored topic.


I'm fairly sure that if there is anything we have garnered from this discussion, it is that this is not working appropriately. I'm not sure how you can believe that it works correctly when NFS is asking for completely opposite write semantics from iSCSI, and neither correctly reflects what the guest OSes are asking for.
 

David E

Contributor
Joined
Nov 1, 2013
Messages
119
And yes I agree with you that since this is happening a layer removed from FreeNAS there are only two options, but if everyone believes that VMware's implementation is 'working correctly' then the status quo will never change.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
So take your complaint to VMware. I asked you nicely in private to stop this off-topic critique of VMware's strategy. I am now telling you in public: MODERATOR SAYS, SUBJECT CLOSED. Do feel free to go over to the VMware forums and engage the folks there.
 

Ivo

Cadet
Joined
Aug 19, 2013
Messages
6
Hi, I have a question on how the SLOG actually works:

From what I understand, ZFS receives write requests every so often and unless they're marked as SYNC, it just queues them into the transaction block (in RAM) and ultimately writes them if one of the following happens:

- the transaction block is full
- the max timeout for writes expires
- (or of course, there's a SYNC write).

Now my question is, how does this write happen? My understanding is that it has to block the SYNC write request (if that's what caused the write) until it has effectively pushed the transaction block into the SLOG.

What happens after that?

Does ZFS write the contents of the transaction block (in RAM) to the pool, time permitting? If so, where do new write requests go until the transaction block has been effectively written to the pool and freed?

Or does ZFS write the SLOG copy of the transaction block into the pool, and thus immediately free up the transaction block for new write requests?

Can ZFS queue several writes in the SLOG?

How is the SLOG used? Sequentially, starting from the beginning of the device and always adding write requests one after the other, or does each write request overwrite previous ones (and thus only use the first few cylinders of the SLOG device)?

I'm trying to design the perfect hardware setup for optimal performance, but to do this, I need to understand how the software interacts with the hardware.

Thanks for your help,
--Ivo
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
No, you're screwed up a bit. :smile:

A sync write request does not necessarily force a transaction group flush.

The transaction group never gets written to the SLOG... it isn't a cache or a buffer. And let's not even use the term; let's just talk about the ZIL. A SLOG is a specific kind of ZIL. A transaction group is a write queue of things that are going to be written to the pool. Periodic flush and filled transaction group are the things that typically force it to flush to disk. And that's the only place it goes.

A sync write makes a logical split in the ZFS code. One part has the "sync" bit stripped and proceeds into the transaction group. This is the pool update. It will sit in the txg like any other nonsync data. The other part is pushed into the ZIL. Once that is completed, the data written "sync" has been committed to stable storage ... just not its ultimate resting place within the pool - that is what the FIRST copy (the one sitting in the transaction group) will take care of.

The ZIL is never read from - UNLESS something bad happens, like the server crashes, and the sync transactions must be replayed to rebuild the guaranteed-committed blocks. This is the ONLY time that anything might be pulled out of the ZIL (whether SLOG or in-pool), and I will reiterate that the ZIL is not a cache of any sort ... just a log that is used when something really bad happens. This simplistic technique allows ZFS to treat all pool-bound data as non-sync and simplifies the design of the transaction group processor.

As for what happens when ZFS is writing to the pool ... a transaction group fills. When txg #1 needs to flush to disk (fills up, time elapsed, whatever) then a new transaction group #2 is started, and the #1 txg begins flushing to disk. If the new #2 transaction group ALSO fills before the #1 is finished writing, then ZFS forcibly suspends I/O pending completion of that #1 txg. In theory that should never happen but if you have a very fast process writing to a slow pool, it can happen.
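
For reference, attaching a dedicated SLOG to an existing pool is a single command; "tank" and the gpt/slog* device names below are placeholders, and how you partition or underprovision the SSD is a separate decision:

# Add a dedicated log device (SLOG) to an existing pool:
zpool add tank log gpt/slog0

# Or mirror the SLOG across two devices for extra safety:
zpool add tank log mirror gpt/slog0 gpt/slog1

# It then shows up under its own "logs" section:
zpool status tank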
 

Ivo

Cadet
Joined
Aug 19, 2013
Messages
6
So it's theoretically possible to write a file to disk (contents are in the transaction group but not yet flushed to disk), and then to write the filesystem catalog info (SYNC write, which gets flushed to the ZIL), and then we crash. Upon reboot, ZFS discovers the ZIL, writes the catalog info on the disk, which then points to empty blocks because the file contents were not yet flushed to disk before the crash.

Doesn't it make much more sense to flush the whole transaction group when a SYNC write is received?

I guess what I'm struggling to understand is the meaning of a SYNC write - does it mean "make sure this data is written to disk" or "flush all buffers to disk"? If it's the former, is there any write command that means the latter?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Right. You crashed. That's not supposed to happen. You cannot write everything sync direct to the pool because that would absolutely suck. ZFS uses variable block sizes and a clever variation of RAID parity. Think of the complexity. Think of the horrible performance. You do not want every block flagged sync and then written to force a transaction group flush. There's a reason that ZFS reserves a large chunk of memory for handling transaction groups...

So if you actually want your file contents protected, you can tell ZFS to run that through the ZIL as well. Performance will suck if you do this on a pool ZIL. It will suck less if you have a SLOG device.

A sync write means "write this block to disk and do not return until that is done." As with most complex storage systems, ZFS kind-of lies by not necessarily doing that exact thing, but it honors the intent of the request, which is that future requests for the block in question will reliably return the block just written. I believe in FreeBSD that this is exposed to userland via O_DIRECT but quite frankly I haven't looked recently.

You may be confusing this with the related sync() syscall.

In any case, you can do one of three things:

You can disable sync writes, which is dangerous but man does everything fly.

You can use standard sync writes, which basically handles filesystem metadata and data specifically requested to be sync as sync.

You can sync write everything, which is stressy on your ZIL.
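
On FreeNAS these three behaviours map directly onto the ZFS "sync" dataset property; "tank/vmware" below is just a placeholder dataset/zvol name:

# Dangerous but fast: ignore sync requests entirely.
zfs set sync=disabled tank/vmware

# Default: honor sync only where the client or filesystem asks for it.
zfs set sync=standard tank/vmware

# Paranoid: treat every write as sync (this is where a SLOG really earns its keep).
zfs set sync=always tank/vmware

# Check the current setting:
zfs get sync tank/vmware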
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So it's theoretically possible to write a file to disk (contents are in the transaction group but not yet flushed to disk), and then to write the filesystem catalog info (SYNC write, which gets flushed to the ZIL), and then we crash. Upon reboot, ZFS discovers the ZIL, writes the catalog info on the disk, which then points to empty blocks because the file contents were not yet flushed to disk before the crash.

No. The write is complete as far as ZFS is concerned, either because the write is complete on the pool itself or because it's partly on the pool and partly on the SLOG. In any case, before the sync write is returned as "complete", the data for the sync write must be on a storage device that is non-volatile. At least, that's what ZFS expects. If you do things like use a write cache on your RAID controller to buffer writes, you can potentially break this if the RAID controller loses power and the write doesn't complete. This is one reason why we highly recommend HBAs.

There's a list of things that ZFS expects, requires, and demands to function properly. You break those rules and you potentially break ZFS. Most people don't know those rules, so they often blindly do exactly what they've always done with their servers: buy a RAID controller, enable read and write caches and everything else that normally gives a performance improvement. What they don't know is that they may be unwittingly hurting performance and breaking many of the things that ZFS assumes you'd never be idiotic enough to do. This is why we have people losing pools all the time. It's a weekly thing here. I think we had 4 this week. Not because FreeNAS is some amateur project coded by mouth breathers, but because people don't want a degree in ZFS and try to set up a server in 3 hours with spare hardware. It works all fine and dandy on day one, so they wrongly assume it must be working fine and that they made no mistakes. What they don't realize is that the write cache on their RAID controller just might screw them over that one day when they lose power or the system has a kernel panic. ZFS was engineered to work a certain way. If you don't know how it was engineered then you could be making horrible mistakes. That's why we have stickies like this thread in the forums. We see it all the time.

Doesn't it make much more sense to flush the whole transaction group when a SYNC write is received?

Nope. The sync write data is all you want to write. You don't want to write a 1GB transaction group just to sync a 4k write. Remember, whatever device has requested that sync write is probably sitting idle doing absolutely nothing else until it gets the acknowledgement back. Do you really want to lock up your VMs on ESXi every time a single write to the VM's disk is made and it takes 2 seconds? It would take days just to get to the login screen when you booted a VM.

I guess what I'm struggling to understand, is the meaning of a SYNC write - does it mean "make sure this data is written to disk" or "flush all buffers to disk". If it's the former, is there any write command that means the latter?

A sync write means "make sure this data is written to disk". That's it. You don't want it to mean "flush all buffers to disk" because of the latency involved with writing the additional data. The system compensates for the difference by allowing you to add a SLOG to deal with those potential problems efficiently. You write the stuff you need to write when you need to write it - not sooner or later.
 

Ivo

Cadet
Joined
Aug 19, 2013
Messages
6
Nope. The sync write data is all you want to write. You don't want to write a 1GB transaction group just to sync a 4k write (...)
You know what, you're right. I think I've been trying to optimize something that's not necessary. I've been trying to obtain the highest IOPS possible for random SYNC writes, which is not representative of real world usage. It's better to let ZFS optimize its writes with the use of the transaction group.
 

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
2) A small pool on a system with a lot of memory, such as one where a designer has included lots of ARC for maximum responsiveness, can counter-intuitively perform very poorly due to the increased default size for transaction groups. In particular, if you have a system with four disks, each of which is capable of writing at 150MB/sec, and the pool can actually sustain 600MB/sec, that still doesn't fit well with a system that has 32GB of RAM, because it allows up to 4GB per txg, which is greater than the 3GB per 5 seconds that the pool can manage.

As a result, tuning the size of a transaction group to be appropriate to a pool is advised, and since that maximum size is directly related to SLOG sizing, it is all tied together.


How would you go about identifying whether this is definitely an issue or not?

Current system is 32GB of RAM; RAIDZ2 6x4TB WD Red disk
top shows:
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.2% interrupt, 99.8% idle
Mem: 145M Active, 310M Inact, 23G Wired, 259M Buf, 8122M Free
ARC: 20G Total, 1438M MFU, 18G MRU, 2320K Anon, 121M Header, 455M Other
Swap: 12G Total, 12G Free

When I copy a 20TB file within a pool using rsync, I see speeds that vary greatly over time, from 70MB/s to 135MB/s, constantly changing in between.

zpool iostat 1 gives me as output
pool 11.0T 10.8T 906 402 106M 43.8M
pool 11.0T 10.8T 1.00K 338 127M 21.9M
pool 11.0T 10.8T 1.18K 344 149M 40.5M
pool 11.0T 10.8T 990 2.02K 120M 229M
pool 11.0T 10.8T 1.35K 0 168M 0
pool 11.0T 10.8T 416 2.50K 51.5M 293M
pool 11.0T 10.8T 1.19K 0 151M 0
pool 11.0T 10.8T 712 1.43K 87.9M 180M
pool 11.0T 10.8T 1.25K 1.16K 158M 128M
pool 11.0T 10.8T 1.26K 0 154M 0
pool 11.0T 10.8T 261 2.50K 31.5M 311M
pool 11.0T 10.8T 1.45K 273 183M 14.0M
pool 11.0T 10.8T 1.06K 74 134M 3.87M
pool 11.0T 10.8T 515 2.85K 63.5M 331M
pool 11.0T 10.8T 1.25K 0 144M 0
pool 11.0T 10.8T 1.00K 902 124M 111M
pool 11.0T 10.8T 771 1.99K 95.9M 217M
pool 11.0T 10.8T 1.19K 0 150M 0
pool 11.0T 10.8T 845 900 104M 110M
pool 11.0T 10.8T 772 2.10K 95.4M 244M
pool 11.0T 10.8T 1.26K 0 159M 0
pool 11.0T 10.8T 894 296 109M 36.7M
pool 11.0T 10.8T 631 2.74K 77.7M 324M
pool 11.0T 10.8T 1.38K 0 175M 0
pool 11.0T 10.8T 902 848 111M 104M
pool 11.0T 10.8T 699 2.22K 86.7M 262M
pool 11.0T 10.8T 1.33K 0 168M 0
pool 11.0T 10.8T 1001 604 124M 74.1M
pool 11.0T 10.8T 552 2.49K 67.3M 295M
pool 11.0T 10.8T 1.61K 0 206M 0
pool 11.0T 10.8T 774 697 95.9M 85.0M
pool 11.0T 10.8T 465 2.52K 57.1M 300M

As you can see, this varies *greatly* every second... I'm a tad unsure about what's going on...

Using plain cp, it varies slightly less, but the variation is still significant:
pool 11.0T 10.8T 971 2.03K 119M 234M
pool 11.0T 10.8T 1.66K 1.23K 208M 153M
pool 11.0T 10.8T 1.18K 1.92K 152M 214M
pool 11.0T 10.8T 1.75K 1.06K 222M 135M
pool 11.0T 10.8T 1.33K 2.07K 164M 239M
pool 11.0T 10.8T 1.66K 1.31K 210M 160M
pool 11.0T 10.8T 1.02K 1.96K 130M 215M
pool 11.0T 10.8T 1.93K 1.39K 232M 176M
pool 11.0T 10.8T 1.22K 1.84K 153M 186M
pool 11.0T 10.8T 1000 1.03K 124M 127M
pool 11.0T 10.8T 148 2.13K 18.6M 246M
pool 11.0T 10.8T 248 215 30.8M 26.7M
pool 11.0T 10.8T 39 2.95K 4.88M 349M
pool 11.0T 10.8T 8 2.89K 1.12M 364M
pool 11.0T 10.8T 16 2.62K 2.03M 308M
pool 11.0T 10.8T 126 2.18K 15.8M 243M
pool 11.0T 10.8T 18 2.34K 2.37M 264M
pool 11.0T 10.8T 33 3.13K 4.25M 364M
pool 11.0T 10.8T 16 3.16K 2.12M 392M

A few seconds later it was:
pool 11.0T 10.8T 29 1.85K 3.73M 213M
pool 11.0T 10.8T 0 3.17K 0 401M
pool 11.0T 10.8T 0 2.72K 0 346M
pool 11.0T 10.8T 0 3.82K 0 476M
pool 11.0T 10.8T 675 721 84.0M 32.8M
pool 11.0T 10.8T 875 0 109M 0
pool 11.0T 10.8T 113 131 14.0M 12.2M
pool 11.0T 10.8T 0 2.14K 0 271M
pool 11.0T 10.8T 0 2.55K 0 322M

Is this something a SLOG could make a difference with? Should the size of the cache or even the amount of RAM available be reduced?
 

Ivo

Cadet
Joined
Aug 19, 2013
Messages
6
Current system is 32GB of RAM; RAIDZ2 6x4TB WD Red disk


Hi, the problem is the WD Red disks - these are designed to reduce power consumption, heat and acoustic output. To do so, their spin speed varies which in turn affects performance. In other words, these were not designed with performance in mind.
 

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
Hi, the problem is the WD Red disks - these are designed to reduce power consumption, heat and acoustic output. To do so, their spin speed varies which in turn affects performance. In other words, these were not designed with performance in mind.

The Green/Red drives do not vary their spin speed. That myth was debunked a long time ago, when WD refused to state what speed they were running at and would only use the name "IntelliPower". They are 5400rpm drives.

They also won't go into power-saving mode in the middle of a write or read operation. Even the Green drives take close to a minute of idle time before they go into power-saving mode.

So that doesn't explain why there would be such great variation during continuous operation.

With a read/copy operation within the same pool the average speed is 180MB/s, which in itself isn't too bad.

It's the variation that concerns me. Jgreco's description of how a too-large write cache can affect speed seems to fit the bill. Which begs the question: what size should you set?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
How would you go about identifying whether this is definitely an issue or not?

Current system is 32GB of RAM; RAIDZ2 6x4TB WD Red disk
top shows:


When I copy a 20TB file within a pool using rsync, I see speeds that vary greatly over time, from 70MB/s to 135MB/s, constantly changing in between.

zpool iostat 1 gives me as output


As you can see, this varies *greatly* every second... I'm a tad unsure about what's going on...

Using plain cp, it varies slightly less, but the variation is still significant:


A few seconds later it was:


Is this something a SLOG could make a difference with? Should the size of the cache or even the amount of RAM available be reduced?

Large RAM, small pool, performance problem example:

With 32GB, you have potentially 4GB by default (the write shift value of 3 resulting in 1/8th of system memory) allocated to transaction group buffer. Now ZFS does have some code to try to feel out and adjust the size of a transaction group, but consider the scenario where you have four oldish drives in RAIDZ2 capable of writing a max of 100MB/sec to the pool and your transaction group flush period is 5 seconds. So realistically if you're trying to write more than 500MB to the pool you're in trouble, right? But if you have 4GB worth of stuff queued up in a way-too-big txg to write in that time, that could be bad. If you stopped writing, ZFS would actually be fine but it'd take "too long" to write that txg and it'd notice, which is kind of how it tries to adjust txg size if I remember correctly.

But if you keep writing, like let's say dd if=/dev/zero which can go real fast, and you fill another transaction group while the first one is only a fraction of the way complete, now ZFS has a problem. It cannot shift the current txg into flush mode because the old one isn't done. So instead ZFS blocks.

Now, to be very clear here, ZFS is still busily writing out your data about as fast as the hardware is capable of... but with all the stuff queued up, you're locked out. Until that first flush is done, ZFS is effectively paused.

If you look at the system from a great enough distance, it looks fine: over the period of a day, you'll see that it's writing at an average 100MB/sec. Which is great! Because the system can only go 100MB/sec. But working interactively with the system, it is "run run <bam> wait wait wait wait wait wait wait wait GO! run run <bam>" etc. which could be a serious problem for a NAS platform.

Now on one hand, gigE places a limit on the amount of stuff you might reasonably queue up into a txg ... assuming you don't do anything at the console, and you only have a single gigE, and the hardware is capable of sustaining at least 125MB/sec ... if all that's true, then you might not ever see "serious" problems. But I do all my major data work at the console. And I really want the system to remain responsive, even if that means less throughput more reliably. And that's what bug 1531 is all about. Including technique to manually size the write buffer more reasonably.
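
As a concrete illustration (not something taken from the bug report itself), on the FreeBSD/FreeNAS releases of this era the old ZFS write throttle could be sized by hand with tunables along these lines; the names come from the pre-rewrite write throttle code and the values are only placeholders, so check them against your own version before using them:

# /boot/loader.conf (or the FreeNAS "Tunables" screen):
vfs.zfs.txg.timeout="5"                    # seconds between forced txg flushes (default 5)
vfs.zfs.write_limit_shift="5"              # txg limited to 1/(2^n) of RAM; 5 -> 1/32 instead of the default 3 -> 1/8
vfs.zfs.write_limit_override="1073741824"  # or hard-cap a txg at ~1GB; 0 means use the shift-based limit

# write_limit_override can also be inspected/changed at runtime:
sysctl vfs.zfs.write_limit_override=1073741824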

But now something more generally applicable to you, I think:

Varying pool I/O speeds are not necessarily indicative of a problem. In the old days, we could fairly reliably predict the speeds we'd see because file systems were simple and in many cases the blocks were just coming off the disk, running up through some trite OS layers, and being fed into an application, so the speed very closely resembled the sequential block throughput of the disk adjusted for some overhead. That ought to seem like a very sensible observation, and I suspect you're even getting ready to say "but ... fragmentation, seeks, ...!" which is also correct, those things were difficult to predict and were the cause of poor performance.

ZFS is more complex because it involves lots more potential cache, multiple drives, and interactions that make performance analysis somewhat dismaying: it can be difficult to understand symptoms, or even just to repeat a simple test and get similar results. Because ZFS has been designed to be your RAID controller AND your filesystem, and because that's more integrated than a legacy filesystem-on-top-of-hardware-RAID, it is harder to understand. It is not just shoveling data out in easily predictable ways. It is caching data, aggregating data to be written, and the behaviours make per-second measurement and monitoring less useful. Imagine that you had per-millisecond reporting on the status of the cylinders of an internal combustion engine. It would be reporting wildly varying results in a way that you might or might not interpret usefully, but when taken as a whole the engine's performance is what it should be. ZFS is in some ways very much like that.
 

KTrain

Dabbler
Joined
Dec 29, 2013
Messages
36
Wow. Well, thanks for all the information. I'm pretty sure my head exploded near the end of page 2, but I feel like I may have learned some things here.
 

daimi

Dabbler
Joined
Nov 30, 2013
Messages
26
Large RAM, small pool, performance problem example:

With 32GB, you have potentially 4GB by default (the write shift value of 3 resulting in 1/8th of system memory) allocated to transaction group buffer. Now ZFS does have some code to try to feel out and adjust the size of a transaction group, but consider the scenario where you have four oldish drives in RAIDZ2 capable of writing a max of 100MB/sec to the pool and your transaction group flush period is 5 seconds. So realistically if you're trying to write more than 500MB to the pool you're in trouble, right? But if you have 4GB worth of stuff queued up in a way-too-big txg to write in that time, that could be bad. If you stopped writing, ZFS would actually be fine but it'd take "too long" to write that txg and it'd notice, which is kind of how it tries to adjust txg size if I remember correctly.

But if you keep writing, like let's say dd if=/dev/zero which can go real fast, and you fill another transaction group while the first one is only a fraction of the way complete, now ZFS has a problem. It cannot shift the current txg into flush mode because the old one isn't done. So instead ZFS blocks.

Now, to be very clear here, ZFS is still busily writing out your data about as fast as the hardware is capable of... but with all the stuff queued up, you're locked out. Until that first flush is done, ZFS is effectively paused.

If you look at the system from a great enough distance, it looks fine: over the period of a day, you'll see that it's writing at an average 100MB/sec. Which is great! Because the system can only go 100MB/sec. But working interactively with the system, it is "run run <bam> wait wait wait wait wait wait wait wait GO! run run <bam>" etc. which could be a serious problem for a NAS platform.

Now on one hand, gigE places a limit on the amount of stuff you might reasonably queue up into a txg ... assuming you don't do anything at the console, and you only have a single gigE, and the hardware is capable of sustaining at least 125MB/sec ... if all that's true, then you might not ever see "serious" problems. But I do all my major data work at the console. And I really want the system to remain responsive, even if that means less throughput more reliably. And that's what bug 1531 is all about. Including technique to manually size the write buffer more reasonably.

But now something more generally applicable to you, I think:

Varying pool I/O speeds are not necessarily indicative of a problem. In the old days, we could fairly reliably predict the speeds we'd see because file systems were simple and in many cases the blocks were just coming off the disk, running up through some trite OS layers, and being fed into an application, so the speed very closely resembled the sequential block throughput of the disk adjusted for some overhead. That ought to seem like a very sensible observation, and I suspect you're even getting ready to say "but ... fragmentation, seeks, ...!" which is also correct, those things were difficult to predict and were the cause of poor performance.

ZFS is more complex because it involves lots more potential cache, multiple drives, and interactions that make performance analysis somewhat dismaying: it can be difficult to understand symptoms, or even just to repeat a simple test and get similar results. Because ZFS has been designed to be your RAID controller AND your filesystem, and because that's more integrated than a legacy filesystem-on-top-of-hardware-RAID, it is harder to understand. It is not just shoveling data out in easily predictable ways. It is caching data, aggregating data to be written, and the behaviours make per-second measurement and monitoring less useful. Imagine that you had per-millisecond reporting on the status of the cylinders of an internal combustion engine. It would be reporting wildly varying results in a way that you might or might not interpret usefully, but when taken as a whole the engine's performance is what it should be. ZFS is in some ways very much like that.


Hi Jgreco, thanks for the above info.
Do you think the following calculation of system memory and ZIL size for a particular zpool is correct?

-- Assumptions --
Each HDD can write 150MB/s,
so a zpool of 4 HDDs can write 600MB/s.

-- Calculation --
1) The transaction group (txg) size will be 3GB
(600MB/s x 5 sec, since a txg by default flushes every 5 sec)
2) The system memory should be kept to no more than 24GB
(3GB x 8, since a txg by default uses up to 1/8 of system memory)
3) The SSD should be partitioned to 12GB (leave the rest of the SSD unallocated) for the ZIL/SLOG device
(the maximum useful size of a log device is approximately 1/2 the size of physical memory, because that is the maximum amount of potential in-play data that can be stored)
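
A quick sanity check of that arithmetic, expressed as a small sh snippet; the constants are just the assumptions stated above (150MB/s per disk, 4 disks, 5-second txg interval, 1/8-of-RAM default write limit, SLOG sized at half of RAM), so this is only the proposed sizing worked through, not a tuning recommendation:

# All sizes in MB.
per_disk=150                              # assumed per-HDD write speed (MB/s)
disks=4
txg_seconds=5                             # default txg flush interval
pool_write=$((per_disk * disks))          # 600 MB/s aggregate write speed
txg_size=$((pool_write * txg_seconds))    # 3000 MB, i.e. ~3GB per txg
ram_size=$((txg_size * 8))                # default limit is 1/8 of RAM -> ~24GB
slog_size=$((ram_size / 2))               # ~1/2 of RAM -> ~12GB SLOG partition
echo "txg ~${txg_size}MB  RAM ~${ram_size}MB  SLOG ~${slog_size}MB"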
 