SLOG on 2x Optane - mirrored vs striped SLOG peculiarity

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Hi all,

First, let me thank everyone here and at iXsystems for a) fantastic software and b) a fantastic community. What I've learned here over the years is extremely useful - as is the software provided.

Now to my question: I recently purchased 2x 32GB 3D XPoint M.2 modules for use as log devices - I wanted to speed up my (exclusively sync) writes to the pool, and they have achieved exactly this. When adding the log vdev in the GUI I actually get the option to either stripe or mirror the two Optanes, so I did some testing. With either a single Optane or the two added as a mirrored log, I get ~300MiB/s write speed indefinitely, which is expected as that's roughly the write speed of one of the Optanes. Just to see what would happen, I removed the log and re-added the two Optanes as a stripe, and for the first ~3GiB of the 8.5GiB test file, write speed (remember - sync!) jumps up to ~600MiB/s, which is wow - I only have 2 striped mirrors (WD40EFRX) in the pool, so that tripled the write performance. However, after the first ~3.5GiB, write speed drops to the bare pool performance of just over 200MiB/s:

[Attached screenshot: Windows transfer graph showing the drop from ~600MiB/s to just over 200MiB/s]


Could this be a Windows caching thing that just doesn't kick in with the slower mirrored SLOG setup because it can keep up (the file comes from an SSD that should be able to keep up with the server)? Also, the test file should have been cached in memory (128GB RAM on the client)...

I'll be interested to hear opinions - also any discussion on mirrored vs striped SLOG - but keep in mind this is a home lab server, so even though I don't want to lose any data, a temporary loss of write performance due to a failed SLOG stripe is no problem.

TIA

Kai.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Hi Kai,

What you're seeing here is the ZFS write throttle in action - you're able to blast along at the speed of your SLOG for the first bit of traffic, but over time ZFS will be applying the brakes more and more in order to reduce the speed to that which your pool can handle so that you aren't left with a bunch of data in a "pending commit" stage.

It's interesting that the mirrored one flattens at the ~300MB/s number though, as that would seem to imply your pool can actually handle that kind of speed rather than the ~200MB/s it stops at with striped SLOG.

I'll have to come back to this from a keyboard - I've made a post previously with a script and instructions for finding the amount of "dirty data" and how long it takes to fully flush a transaction group to your data vdevs. But in general, a mirrored SLOG is preferred because of the data safety aspect. For a homelab you can get away with "stripe it and accept the performance hit if one device fails", but bear in mind that when you do reach the endurance limit of one drive, the other is likely to be following right behind it.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Hi @HoneyBadger,

ooh interesting - is there a way to tune the throttle?

I did some more testing with larger files, and there seems to be a similar throttling scheme going on with just one log disk (or the mirror); I just hadn't noticed it before because the drop-off was less drastic. It seems a shame that throttling kicks in so early while filling the log - these are comparatively "safe" devices which should hold data through a power-out. With the mirror (or a single disk) only ~10% of the log is used before bandwidth decreases, and with two separate log disks ("striped", but not really) it's even less (~5%)... It would be great to be able to jack this up and fill more of the log devices before throttling...

Cheers,

Kai.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,737
The SLOG will never hold more than 5 seconds worth of sustained writes, no matter how large the device - if I am not mistaken.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Hi @HoneyBadger,

ooh interesting - is there a way to tune the throttle?

I did some more testing with larger files, and there seems to be a similar throttling scheme going on with just one log disk (or the mirror); I just hadn't noticed it before because the drop-off was less drastic. It seems a shame that throttling kicks in so early while filling the log - these are comparatively "safe" devices which should hold data through a power-out. With the mirror (or a single disk) only ~10% of the log is used before bandwidth decreases, and with two separate log disks ("striped", but not really) it's even less (~5%)... It would be great to be able to jack this up and fill more of the log devices before throttling...

Cheers,

Kai.

It's absolutely tunable - your trip down the rabbit hole starts here:


The short version is that it will be able to ingest data at "full speed" or close to it until you hit 60% of the total allowed dirty data, and then it will throttle you. You can adjust the total amount of allowed dirty data, the percentage, the slope of the throttle - but make sure to do that reading first to know which knob you're tuning. A key takeaway is that dirty data lives in RAM - so it will take away from ARC if you increase that value, but that might be valuable based on your read/write workload. But don't think you can set it to 32GB (matching the mirrored Optanes) on a system with 32GB of RAM. It won't turn out well. :)
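For reference, here's a rough sketch of the knobs involved, assuming a TrueNAS CORE (FreeBSD) box - verify the sysctl names on your own build, and note the equivalent OpenZFS module parameters on Linux drop the vfs.zfs. prefix:

Code:
# Current write-throttle settings (read-only check; names as seen on FreeBSD/CORE):
sysctl vfs.zfs.dirty_data_max            # total dirty data allowed, in bytes
sysctl vfs.zfs.delay_min_dirty_percent   # throttling starts at this fill level (default 60%)
sysctl vfs.zfs.delay_scale               # controls the steepness of the delay curve
#
# Per the OpenZFS source comments, the injected delay grows roughly as:
#   delay = zfs_delay_scale * (dirty - min_dirty) / (zfs_dirty_data_max - dirty)
# where min_dirty is zfs_delay_min_dirty_percent of zfs_dirty_data_max.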

Here's the dtrace script (originally from Adam Leventhal) - paste this into a file called dirty.d using vi or another local text editor, then launch it from an SSH session with dtrace -s dirty.d yourpoolnamegoeshere

Code:
/* grab the pool pointer each time a transaction group starts syncing */
txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

/* only report on the pool named on the command line ($$1) */
txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}


What you'll see is a series of lines showing how much dirty data is outstanding relative to the allowed maximum. Play with the tunables carefully, watch for the correlation between transfer speed dropping and dirty data climbing, and measure your metrics from the client side as well to see whether you observe tangible results.

The SLOG will never hold more than 5 seconds worth of sustained writes, no matter how large the device - if I am not mistaken.

That's one of the many knobs that can be tuned; by default a transaction group is allowed to be no older than 5 seconds. Far more often, though, you'll hit the "minimum threshold to trigger a TXG flush", which defaults to 64MB.
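Both of those have corresponding tunables too - a sketch assuming FreeBSD/CORE sysctl names; note that on newer OpenZFS the fixed 64MB threshold has, if I'm not mistaken, been replaced by a percentage, which is the dirty_data_sync_percent value that shows up in the sysctl output further down this thread:

Code:
sysctl vfs.zfs.txg.timeout               # maximum age of an open transaction group, in seconds (default 5)
sysctl vfs.zfs.dirty_data_sync_percent   # dirty-data fill level that forces an early txg sync (default 20%)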

Ultimately, tuning the SLOG behavior can only get you so far if your vdevs can't keep up. If your workload is very bursty and would fit inside the SLOG boundaries you set (eg: adjusting to 8GB or 16GB) then you could potentially see a huge improvement with slow back-end vdevs - but you'll also lose the 8-16GB of ARC when you put the array under that burst-write pressure, which might cause those same slow vdevs to be hit with reads.

As with so many things "Speed costs money. How fast do you want to go?"
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
OMG wow - thank you! So for sure I'll be reading, tuning and using your script.

One question though: why is data that has already gone to non-volatile storage (the Optanes in this case) still considered dirty? It's safely stowed away, is it not, even if not yet on the pool vdevs...?

In terms of reads, I have given TrueNAS 32GB, but I have 128GB available in total, so I can increase that in case ARC runs short...


+++++ EDIT +++++
... unfortunately, as I'm currently burning in a new disk on this host, I can't start playing with this until the burn-in finishes ... will report back with results some time Monday ...
+++++/EDIT+++++
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
One question though: why is data that has already gone to non-volatile storage (the Optanes in this case) still considered dirty? It's safely stowed away, is it not, even if not yet on the pool vdevs...?

Yes, but the full transaction group hasn't landed on the data vdevs and received full protection yet. The "flush to disk" comes from the copy of the data in RAM - the ZIL is never read from in a healthy pool.

In terms of reads, I have given TrueNAS 32GB, but I have 128GB available in total, so I can increase that in case ARC runs short...
More RAM is almost never a bad thing, and especially if you plan to increase the dirty data threshold above the default. Stuff it to the gills!
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
OK, very interesting! So first I gave the TrueNAS VM a little more RAM to play with: 64GiB instead of 32GiB, so dirty_data_max is bigger (though not doubled as I'd hoped; before the RAM increase it was 3271MB). With a single log device I then get this:
Code:
  2  74566                 none:txg-syncing    0MB of 4096MB used
  2  74566                 none:txg-syncing    0MB of 4096MB used
  2  74566                 none:txg-syncing    9MB of 4096MB used
  1  74566                 none:txg-syncing  822MB of 4096MB used
  3  74566                 none:txg-syncing 1089MB of 4096MB used
  4  74566                 none:txg-syncing 1417MB of 4096MB used
  1  74566                 none:txg-syncing 1855MB of 4096MB used
  0  74566                 none:txg-syncing 2483MB of 4096MB used
  1  74566                 none:txg-syncing  599MB of 4096MB used
  0  74566                 none:txg-syncing    0MB of 4096MB used
  0  74566                 none:txg-syncing    0MB of 4096MB used

Which from Windows looked like this:
[Attached screenshot: Windows transfer graph with a single log device, sustained at or above ~300MiB/s]

The 8GiB transfer took ca. 25 s (sorry, no accurate stopwatch at hand). So sustained transfers stay at or above 300MiB/s (not great, but substantially better than my raw pool, which gives ca. 220MiB/s).

Adding a second Optane log device to the pool gives this:
Code:
  2  74566                 none:txg-syncing    0MB of 4096MB used
  5  74566                 none:txg-syncing  823MB of 4096MB used
  0  74566                 none:txg-syncing 2732MB of 4096MB used
  5  74566                 none:txg-syncing 3380MB of 4096MB used
  3  74566                 none:txg-syncing 1326MB of 4096MB used
  3  74566                 none:txg-syncing    0MB of 4096MB used

And:
[Attached screenshot: Windows transfer graph with two striped log devices, ~600MiB/s initially, then a considerable drop-off]

So even though the initial speed, until throttling starts, is literally doubled to just over 600MiB/s, the drop-off afterwards is considerable. Time taken... 22 s...

So this is with the following default values:

Code:
truenas# sysctl -a | grep zfs.dirty
vfs.zfs.dirty_data_sync_percent: 20
vfs.zfs.dirty_data_max_max: 4294967296
vfs.zfs.dirty_data_max: 4294967296
vfs.zfs.dirty_data_max_max_percent: 25
vfs.zfs.dirty_data_max_percent: 10


Those are clearly smaller than advertised ;-). I will now start playing with increasing dirty_data_max and report back.
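(For anyone following along: one way to do that - assuming TrueNAS CORE - is a sysctl-type Tunable in the web UI, or, for a quick non-persistent test, straight from a shell:)

Code:
# 10 GiB = 10 * 1024^3 bytes; this change is lost on reboot unless made a Tunable:
sysctl vfs.zfs.dirty_data_max=10737418240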
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
So now with the following values (dirty_data_max raised to 10GiB):

Code:
truenas#  sysctl -a | grep -i zfs.dirty
vfs.zfs.dirty_data_sync_percent: 20
vfs.zfs.dirty_data_max_max: 4294967296
vfs.zfs.dirty_data_max: 10737418240
vfs.zfs.dirty_data_max_max_percent: 25
vfs.zfs.dirty_data_max_percent: 10


Note how vfs.zfs.dirty_data_max_max seems to get ignored: neither the TrueNAS tunable (set to 16GiB, i.e. 25% of RAM) nor the smaller resulting sysctl value (4GiB, which is less than my new vfs.zfs.dirty_data_max) appears to be respected. Anyway...

Code:
truenas# dtrace -s dirty.d zpool
dtrace: script 'dirty.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  74566                 none:txg-syncing    0MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used
  4  74566                 none:txg-syncing    0MB of 10240MB used
  1  74566                 none:txg-syncing 1232MB of 10240MB used
  3  74566                 none:txg-syncing 3768MB of 10240MB used
  3  74566                 none:txg-syncing 3344MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used
  3  74566                 none:txg-syncing    0MB of 10240MB used

[Attached screenshot: Windows transfer graph with dirty_data_max at 10GiB, holding ~600MiB/s for the whole transfer]


Am I exposing myself to any danger by doing this? Again - this data is going straight to the two Optane log vdevs, which keep up with the ~600MiB/s, before it's committed to the (much slower) data vdevs. I'm a happy camper so far, though; I've nearly tripled my transfer rates for ~€90.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
You're not opening yourself up to any data safety issues of statistical significance (if both of your log devices die simultaneously you're risking slightly more data, but that risk existed before), and if you're specifically going to be dumping files no larger than 8GB at a time, you'll basically get full-throttle speed all the time now, as long as it's purely writes happening.

In a mixed-workload but still write-heavy situation you could use up to 10GB of your RAM for incoming dirty data (vs the previous 4GB) and this might put your ARC (read cache) under pressure. Less risk with 64GB of total RAM assigned though.
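If you want to keep an eye on that trade-off while testing, one rough approach (assuming CORE's FreeBSD kstat sysctls - names may differ on SCALE) is to watch the ARC size alongside the dirty data:

Code:
# Current ARC size and its target maximum, in bytes:
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max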
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
*Many* thanks for your input & the dtrace script. It's exactly this kind of forthcoming expertise I meant in #1 that makes this community and software so great. Top notch.
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
If you have an almost exclusively write-based pool (or indeed a whole TrueNAS server), is there a way to devote, say, 80% of your RAM to write cache (or dirty data, as I understand the stuff above)? So for this box, which has 384GiB of RAM onboard, is the kind of tune below usable, or will it cause a giant malfunction or otherwise not operate as I might hope (e.g. ignoring the values, as Kailee71 showed)?

vfs.zfs.dirty_data_sync_percent: 80
vfs.zfs.dirty_data_max_max: 320000000000
vfs.zfs.dirty_data_max: 320000000000
vfs.zfs.dirty_data_max_max_percent: 85
vfs.zfs.dirty_data_max_percent: 70
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
I think in that situation you're much better off setting your dataset to sync=always and using an appropriate SLOG device like an Optane or an RMS-200. By using striped Optane sticks or an RMS-200 (in different boxes), I'm now saturating 10GbE with only 4 disks (2 striped mirrors) that would never be able to sustain that kind of write performance on their own. Of course, that only holds up to the size of the SLOG devices...
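(For reference, a minimal sketch of the commands involved - the pool, dataset and device names here are made up, so adapt them to your system:)

Code:
zfs set sync=always tank/mydata        # force every write through the ZIL, and therefore the SLOG
zpool add tank log nvd0 nvd1           # two striped log devices (no redundancy); use "log mirror nvd0 nvd1" to mirror instead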

Remember, read cache is easy - you can lose it at any time without causing much more than a temporary inconvenience whilst users wait for your server to come back online. Lose write data, though, and that stuff's gone forever, which is why ZFS is so strict about flushing dirty data to disk. This is why SLOG devices are such a great idea: the writes quickly get committed to the SLOG device and are then "safe" - ZFS can accept more data whilst the "temporary" data on the SLOG device is dealt with safely later. Of course, only up to the *.dirty_data_max_* limits. So I do hit these limits at some stage, but in my use case that happens rather infrequently.

I recommend you try a fast SLOG and abandon trying to do this in RAM (if you can even get it to work, which I doubt - but I'm sure somebody can provide some technical data to back that intuition up). RMS-200s are now totally affordable used, even for home labs, and for a little less you can get 32GiB Optane sticks if you need a bigger SLOG device. And you'll sleep much better (and sooner, because your bandwidth has been increased ;-).
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
The use case means irregular but very bursty writes and no real reads. Any reads won't be in ARC anyway - large files with no short/medium-term repetition. And each file is much bigger than a 32GB Optane.

I never really understood the "data waiting to be written in RAM is dangerous" argument though - with good power supply protection the server is taken care of, and any data you're working on (like in that big video production) that isn't saved is in exactly the same position as data on a filer waiting to make it to disk.

In any case, I've got a filer that needs to accept approximately 300GB at a time off a 10Gb network, and then it gets 20 minutes or more of "rest time" before needing to do it again. That's mostly why I'm happy with a pair of Z2 vdevs and more space rather than a striped mirror set with less space. It would be helpful to know if TrueNAS could be tuned to accept those bursts being dumped to RAM, to let the sender get back to work - it saves wear on flash media as well.
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
I realise it's an unusual commercial use case, which is why I'm asking. I imagine a lot of home media servers fall into a similar category when media is updated; however, they're unlikely to try to manage it with a large RAM buffer and will just accept long writes at the mean performance of the pool. I'd rather avoid waiting for the file dump (or save) to make it to disk if possible, and wait only for it to make it off my local machine, after which the filer can take its time to make it rusty.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
OK, so a UPS will help decrease the probability of you losing your data in flight, or heaven forbid your pool, but *it will never be zero*. And the data that's on an SLOG device *is* safe. It will survive a cord pull, even without redundant power. It's on non-volatile storage, and if you do this right, it's written synchronously - no data is acknowledged as written before it really is, on non-volatile media. Well, as long as you use proper SLOG devices - those with power loss protection - Optane/RMS-200 et al. Optane if you need a large SLOG, RMS-200 if it's even more bursty with small transfers like my use case (that thing really rocks, but it's only 8GiB, so in reality less than 10 seconds' worth of 10GbE data).

There are big Optanes that will do what you need. But for that particular use case, you're much better off investing in more disks. Use striped mirrors, for performance, much faster resilvers, and the ability to grow individual mirrors with larger disks down the road. That volume of writes (nearly 1TiB/hour) deserves a properly designed and tested environment, and I think in this case you're looking at getting more but smaller disks, so the pool doesn't need a 20-minute break to keep up with the recurring storm. Trying to "tune" ZFS into doing what it's designed not to do is just asking for a world of pain.

Just my $0.02 - maybe someone else can chime in?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
with good power supply protection the server is taken care of
Have you ever heard of a kernel panic?

Granted it's less common for those to happen if you're using well tested hardware and drivers, but it's a risk you're not considering if you think power protection ensures your data in RAM can't be lost.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Have you ever heard of a kernel panic?

Granted it's less common for those to happen if you're using well tested hardware and drivers, but it's a risk you're not considering if you think power protection ensures your data in RAM can't be lost.
I've had genuine LSI HBA hardware decide to let the magic blue smoke out. No amount of stable drivers or dual-feed UPS in the rack mount solved that.

But an SLOG did. Swapped the card, powered back up, back in action.
 

bonox

Dabbler
Joined
May 2, 2021
Messages
17
Nothing in my scenario can't be recreated. The full story, seeing as no one seems to want to answer the question, is that I'm storing finite element and fluid dynamics calculation output files for short periods (a couple of days max) until users have a chance to analyse them and determine that they're correct.

The original source files are tiny (a few tens of megabytes) and stored elsewhere. The result files can always be recreated from source - it just takes many hours to do so. In other words, nothing here is sacred. I don't give a hoot about kernel panics in this application and have never seen one in a decade of FreeNAS/TrueNAS use either, but I have had PSUs die (thank you, dual supplies in my servers) and have leant on UPSes a few times when the lights went out, without any problems. My major goal is for users (i.e. me) to be able to go back to a recent run if needed for comparison purposes and save half a day of time when that's needed. And it's rare. For what it's worth, this server does have dual HBAs that I could connect to my dual-controller STL3 shelves, but I choose to use a single path because dual is not needed in my risk scenario. And if it did have a panic, you'd see it as a failed file copy, so I'd probably just restart the server and try the copy again - anything lost in the filer's RAM isn't really gone anyway.

But the target is a machine with a pool that can keep up on time average, though not in real time; it has loads of RAM and no real need to bury lots of cash in upgrading it with SSD SLOGs etc.

So, the question again: is it possible to devote 320GB of 384GB of RAM to dirty data to smooth out high-burst data dumps?
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Ah, here we go - I think we found the reason why you weren't getting the answer you'd like rather than the answer that describes best practice. Your tone of voice was too nice before ;-). Let's see if we can conjure up some bad (but *very* funny) ghosts from the past (whatever happened to Jock? I miss him! At least we still have jgreco...)

[/Irony] [ducking and running]

K.
 