How would you go about determining whether this is actually an issue or not?
Current system has 32GB of RAM and a RAIDZ2 pool of 6x4TB WD Red disks.
When I copy a 20TB file within a pool using rsync, I see speeds that vary greatly over time: anywhere from 70MB/s to 135MB/s, constantly changing in between.
zpool iostat 1 gives me as output
As you can see, this varies *greatly* every second... I'm a tad unsure what's going on....
Using plain cp, it varies slightly less, but still significantly:
A few seconds later it was:
Is this something a SLOG could make a difference with? Should the size of the cache or even the amount of RAM available be reduced?
Large RAM, small pool, performance problem example:
With 32GB, you have potentially 4GB by default (the write shift value of 3 resulting in 1/8th of system memory) allocated to the transaction group buffer. Now ZFS does have some code to try to feel out and adjust the size of a transaction group, but consider the scenario where you have four oldish drives in RAIDZ2 capable of writing a max of 100MB/sec to the pool and your transaction group flush period is 5 seconds. So realistically if you're trying to write more than 500MB to the pool in one flush period you're in trouble, right? But if you have 4GB worth of stuff queued up in a way-too-big txg to write in that time, that could be bad: at 100MB/sec that is roughly 40 seconds of writing, not 5. If you stopped writing, ZFS would actually be fine but it'd take "too long" to write that txg and it'd notice, which is kind of how it tries to adjust txg size if I remember correctly.
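To put rough numbers on that, here is a small Python sketch of the arithmetic. The function names are made up for illustration and are not ZFS tunables; the only inputs are the figures from the paragraph above (32GB RAM, a write shift of 3, 100MB/sec pool, 5 second flush period).

```python
# Back-of-the-envelope arithmetic only; nothing here talks to ZFS.
GIB = 1024 ** 3
MB = 1000 ** 2

def txg_buffer_bytes(ram_bytes: int, write_shift: int = 3) -> int:
    """Default buffer size: RAM shifted right by the write shift (1/8th for 3)."""
    return ram_bytes >> write_shift

def flush_seconds(txg_bytes: int, pool_bytes_per_sec: int) -> float:
    """How long the pool needs to write out a txg of the given size."""
    return txg_bytes / pool_bytes_per_sec

ram = 32 * GIB
pool_rate = 100 * MB          # assumed max sustained write speed of the pool
flush_period = 5              # seconds

buffer_size = txg_buffer_bytes(ram)
print("txg buffer:", buffer_size / GIB, "GiB")
print("writable per flush period:", pool_rate * flush_period / MB, "MB")
print("time to flush a full buffer:", round(flush_seconds(buffer_size, pool_rate), 1), "s")
```

The gap between the last two numbers is the whole problem: the buffer can hold several flush periods' worth of data.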
But if you keep writing, like let's say dd if=/dev/zero which can go real fast, and you fill another transaction group while the first one is only a fraction of the way complete, now ZFS has a problem. It cannot shift the current txg into flush mode because the old one isn't done. So instead ZFS blocks.
Now, to be very clear here, ZFS is still busily writing out your data about as fast as the hardware is capable of... but with all the stuff queued up, you're locked out. Until that first flush is done, ZFS is effectively paused.
If you look at the system from a great enough distance, it looks fine: over the period of a day, you'll see that it's writing at an average 100MB/sec. Which is great! Because the system can only go 100MB/sec. But working interactively with the system, it is "run run <bam> wait wait wait wait wait wait wait wait GO! run run <bam>" etc. which could be a serious problem for a NAS platform.
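The "run run <bam> wait" pattern can be reproduced with a toy model. To be clear, this is not how ZFS actually schedules transaction groups (it ignores the txg timeout, the write throttle and any size adjustment); it only assumes one open txg, one syncing txg, and a writer that blocks when the open one is full. The rates are in MB and are the hypothetical numbers from the example above.

```python
def simulate(seconds, writer_rate=1000, disk_rate=100, txg_limit=4000):
    """Return the MB accepted from the writer in each one-second tick."""
    open_txg = 0   # MB accumulating in the open transaction group
    syncing = 0    # MB of the previous group still waiting to hit disk
    accepted_per_tick = []
    for _ in range(seconds):
        # The disks drain the syncing group at their maximum rate.
        syncing -= min(syncing, disk_rate)
        # Only when the old group is fully written can the open one take its place.
        if syncing == 0 and open_txg > 0:
            syncing, open_txg = open_txg, 0
        # The writer is accepted only while the open group has room; otherwise it blocks.
        accepted = min(writer_rate, txg_limit - open_txg)
        open_txg += accepted
        accepted_per_tick.append(accepted)
    return accepted_per_tick

history = simulate(1000)
print("first 20 seconds:", history[:20])
print("seconds with zero progress:", history.count(0))
print("long-run average MB/s:", sum(history) / len(history))
```

Seen per second, the writer alternates between full speed and nothing at all; averaged over the whole run it settles near the disk rate, which is the "looks fine from a distance" effect.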
Now on one hand, gigE places a limit on the amount of stuff you might reasonably queue up into a txg ... assuming you don't do anything at the console, and you only have a single gigE, and the hardware is capable of sustaining at least 125MB/sec ... if all that's true, then you might not ever see "serious" problems. But I do all my major data work at the console. And I really want the system to remain responsive, even if that means less throughput, delivered more reliably. And that's what bug 1531 is all about, including a technique to manually size the write buffer more reasonably.
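The gigE ceiling is easy to check by hand. A quick sketch, assuming a single 1 Gbit/s link at theoretical line rate (real-world throughput will be somewhat lower) and the 5 second flush period used earlier; `max_ingest_bytes` is just an illustrative helper:

```python
def max_ingest_bytes(link_bits_per_sec: int, seconds: int) -> int:
    """Upper bound on data a network client can push in the given time."""
    return link_bits_per_sec // 8 * seconds

GIGE = 1_000_000_000  # bits per second, theoretical line rate

per_second = max_ingest_bytes(GIGE, 1)
per_flush = max_ingest_bytes(GIGE, 5)
print("per second:", per_second / 1_000_000, "MB")
print("per 5 s flush period:", per_flush / 1_000_000, "MB")
```

So a single network client can only queue a few hundred MB per flush period, far less than what a local dd at the console can generate.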
But now something more generally applicable to you, I think:
Varying pool I/O speeds are not necessarily indicative of a problem. In the old days, we could fairly reliably predict the speeds we'd see because file systems were simple and in many cases the blocks were just coming off the disk, running up through some trite OS layers, and being fed into an application, so the speed very closely resembled the sequential block throughput of the disk adjusted for some overhead. That ought to seem like a very sensible observation, and I suspect you're even getting ready to say "but ... fragmentation, seeks, ...!" which is also correct, those things were difficult to predict and were the cause of poor performance.
ZFS is more complex because it involves a lot more potential cache, multiple drives, and a complexity that makes performance analysis somewhat dismaying, because it can be difficult to understand symptoms or even just to repeat a simple test and get similar results. Because ZFS has been designed to be your RAID controller AND your filesystem, and because that's more integrated than a legacy filesystem-on-top-of-hardware-RAID, it is harder to understand. It is not just shoveling data out in easily predictable ways. It is caching data and aggregating data to be written, and those behaviours make per-second measurement and monitoring less useful. Imagine that you had per-millisecond reporting on the status of the cylinders of an internal combustion engine. It would be reporting wildly varying results in a way that you might or might not interpret usefully, but when taken as a whole the engine's performance is what it should be. ZFS is in some ways very much like that.
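One practical way to act on this is to look at averages over a longer window instead of individual one-second samples. A minimal sketch follows; the sample values are invented for illustration (they just bounce around inside the 70 to 135MB/s range mentioned in the question) and are not real measurements from any pool.

```python
from statistics import mean

def window_means(samples, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [
        mean(samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]

# Hypothetical per-second write speeds in MB/s, made up for this example.
per_second = [70, 135, 90, 120, 75, 130, 100, 110, 80, 125,
              72, 133, 95, 115, 78, 128, 105, 108, 85, 122]

print("per-second spread:", max(per_second) - min(per_second), "MB/s")
print("10-second averages:", window_means(per_second, 10))
```

If the longer-window averages are steady and in line with what the disks can do, the second-to-second jitter is most likely just the engine's cylinders firing.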