SLOG and power loss protection

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Hi guys,
Looking for clarity on something...

So I understand that in the case of synchronous writes, the SLOG (ZIL) keeps an on-disk copy of the data that will be written to the pool in the next transaction group. As opposed to asynchronous writes, where, between the transaction group commits to the pool (roughly every 5s), ZFS keeps this data only in RAM, meaning data loss in the case of a failure.
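(To make "synchronous" concrete: from the application's point of view it just means blocking until the data is reported persistent, something like the minimal Python sketch below; the file path is made up.)

Code:
import os

# What a synchronous write looks like from the client side (illustrative;
# the path is made up - any file on a ZFS dataset behaves the same way).
fd = os.open("/mnt/tank/share/file.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"important payload")  # may still be only in RAM at this point
os.fsync(fd)                        # blocks until ZFS reports the data as stable
os.close(fd)
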

I have also read many times that
a) the SLOG needs to deliver low-latency writes in order not to slow down synchronous writes of clients
b) the SLOG needs power loss protection, otherwise it defeats the purpose of the SLOG in the first place (may just as well have used RAM)

Thinking about this a bit more, I don't see how these two can work together...? Specifically, does ZFS write to the SLOG and wait for the SLOG device to ack before responding back to the client? That would explain (a) above. HOWEVER, that being the case, why would power loss protection be required on the SLOG device? After all, the client won't get the positive ack until the data has been securely written.

Alternatively, there is some in-flight scenario where ZFS in effect responds back to the client before the SLOG has fully committed the write. That would explain (b) above. But in that case, why would the write latency of the SLOG be so important for performance? After all, there would be some degree of asynchronous writing to the SLOG anyhow.

I'm sure I'm missing something here so would appreciate some guidance...!
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
After a bit of googling I think I can answer my own question. So ZFS does indeed wait for the SLOG SSD to finish writing, and even issues a cache flush command to ensure nothing is left uncommitted. So if all SSDs behaved accordingly, there would indeed be no need for power loss protection in the SLOG. But the problem is that some SSDs actually lie to the controller and keep uncommitted data in volatile buffers even though they claim to have persisted it. Enterprise-y drives in general, and particularly those with power loss protection, are less likely to lie to ZFS, in that they guarantee to have persisted the data (or that they *will* persist it, through their reserve power capacitors in the worst case) when they claim to have done so, unlike many consumer-ish SSDs that lie about it to perform better in benchmarks.
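(In other words, the ordering looks something like this; a minimal Python sketch of the idea, not actual ZFS internals, and it only holds if the device tells the truth when it acknowledges the flush.)

Code:
import os

def commit_sync_write(slog_fd: int, record: bytes) -> None:
    """Illustrative ordering only, not the real ZIL code path."""
    os.write(slog_fd, record)  # 1. write the log record to the SLOG device
    os.fsync(slog_fd)          # 2. cache flush: device must report the data persistent
    # 3. only now does ZFS acknowledge the client's synchronous write
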

So there you go.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Accurate assessment. I meant to reply to this earlier, but "weekend happened."

All modern SSDs (circa 2014+ or so) properly honor a cache flush request, or at least attempt to do so. There's still the chance of "partial write failure" or similar firmware bugginess, and there are much better odds of that not happening when using a drive that specifically calls out PLP as supported.

However, it's important to differentiate between "PLP for data at rest" and "PLP for data in flight" - oftentimes consumer drives only have the former. This makes the data safe for general ZFS use, but not speedy for SLOG purposes. Enterprise drives trend toward the latter, which is both safe and fast.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
After a bit of googling I think I can answer my own question. So ZFS does indeed wait for the SLOG SSD to finish writing, and even issues a cache flush command to ensure nothing is left uncommitted. So if all SSDs behaved accordingly, there would indeed be no need for power loss protection in the SLOG. But the problem is that some SSDs actually lie to the controller and keep uncommitted data in volatile buffers even though they claim to have persisted it. Enterprise-y drives in general, and particularly those with power loss protection, are less likely to lie to ZFS, in that they guarantee to have persisted the data (or that they *will* persist it, through their reserve power capacitors in the worst case) when they claim to have done so, unlike many consumer-ish SSDs that lie about it to perform better in benchmarks.

So there you go.
A random comment for clarification: the SLOG is only read back if there is an actual power loss. The write goes into RAM and to the SLOG device, and then from RAM to disk; it does not go from SLOG to disk. The data in the SLOG never gets read unless there is a power loss and things need to be replayed.
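(A toy model of those paths, purely illustrative and not real ZFS code:)

Code:
# Toy model: the SLOG is written on sync writes but only ever *read* during
# crash recovery (log replay). Normal txg flushes go straight from RAM to disk.
ram_dirty = []      # in-memory copy of recent writes
slog_records = []   # log records persisted on the SLOG device
pool_blocks = []    # data on the main pool vdevs

def sync_write(data):
    ram_dirty.append(data)     # data lands in RAM...
    slog_records.append(data)  # ...and is logged to the SLOG before the ack

def txg_sync():
    pool_blocks.extend(ram_dirty)  # RAM -> pool; the SLOG is not involved
    ram_dirty.clear()
    slog_records.clear()           # log records can now be discarded

def crash_recovery():
    pool_blocks.extend(slog_records)  # the only time the SLOG is read back
    slog_records.clear()
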
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
All modern SSDs (circa 2014+ or so) properly honor a cache flush request, or at least attempt to do so. There's still the chance of "partial write failure" or similar firmware bugginess, and there are much better odds of that not happening when using a drive that specifically calls out PLP as supported.

Thanks for that added colour and I think this brings a lot of nuance to the “you must use a SLOG with PLP otherwise you’re a fool” argument that is sometimes made pretty strongly in various forums. ;-)
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Yep, it all comes down to how important it is to you that data in transit is preserved. The SLOG achieves that even if the PSU for the server goes up in smoke. Some SuperMicro servers feature 2+ redundant PSUs for that reason. I drew the line at a single PLP Optane SLOG, a single ATX PSU, etc.
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Agreed, but I think the main point here is that as long as the SLOG SSD correctly implements synchronous write semantics, including honouring cache flush, then PLP is actually not needed, even if the PSU suddenly dies. But it seems that outside of enterprise SSDs correct behaviour cannot be assumed, so the way to be reasonably sure is to pick an SSD from a reputable brand that makes specific promises.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Agreed, but I think the main point here is that as long as the SLOG SSD correctly implements synchronous write semantics, including honouring cache flush, then PLP is actually not needed, even if the PSU suddenly dies. But it seems that outside of enterprise SSDs correct behaviour cannot be assumed, so the way to be reasonably sure is to pick an SSD from a reputable brand that makes specific promises.
It might end up being too slow at that point if it doesn't use its cache. I have no clue though, never bothered to look into SLOG.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Agreed, but I think the main point here is that as long as the SLOG SSD correctly implements synchronous write semantics, including honouring cache flush, then PLP is actually not needed, even if the PSU suddenly dies. But it seems that outside of enterprise SSDs correct behaviour cannot be assumed, so the way to be reasonably sure is to pick an SSD from a reputable brand that makes specific promises.

It might end up being too slow at that point if it doesn't use its cache. I have no clue though, never bothered to look into SLOG.

Correct on both counts. PLP for data at rest is the minimum requirement in order for a device to serve as an SLOG. That means that data flushed from RAM to NAND won't be harmed by additional, unrelated programming of adjacent NAND chips, and will survive any internal garbage-collection or TRIM routines that might shuffle that data around. Effectively, it's ensuring that the drive doesn't self-corrode. However, for a drive that has PLP for data at rest only, the performance is not sufficient for most sync-write needs. A drive like this still needs to actually commit the data from its volatile RAM to the physical NAND when it receives the cache flush command from ZFS, and it won't reply with the "All clear" until those writes are finished.

A drive with PLP for data in flight on the other hand is capable of receiving that cache flush command and simply asking itself "Do I have sufficient power in my internal PLP circuitry to flush the data in my volatile RAM to NAND?" If it does, then it can immediately reply back with "Data is safe" and then the drive itself performs a more asynchronous write of its RAM to NAND internally. But even if the power to the drive is cut, the drive itself already knows "I need to flush my RAM to NAND" and will complete that on its own internal backup power.
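(A rough sketch of that ack-timing difference, if it helps; the behaviour and numbers are illustrative assumptions, not any vendor's actual firmware.)

Code:
import time

# Illustrative assumptions only, not real firmware behaviour or timings.
NAND_PROGRAM_TIME = 0.001  # pretend it takes ~1 ms to program a page to NAND

def flush_without_inflight_plp(volatile_cache: list) -> str:
    # Must physically program NAND before it can honestly ack the flush.
    time.sleep(NAND_PROGRAM_TIME * len(volatile_cache))
    volatile_cache.clear()
    return "flush complete"

def flush_with_inflight_plp(volatile_cache: list) -> str:
    # Capacitors guarantee the cached data will reach NAND even on power cut,
    # so the drive can ack immediately and destage in the background.
    return "flush complete"
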

The difference between these two is night and day when it comes to sync write workloads, especially at the small record sizes that are common with things like virtualization and block storage.

Drive A:
Code:
Synchronous random writes:
           4 kbytes:    994.3 usec/IO =      3.9 Mbytes/s
           8 kbytes:   1013.9 usec/IO =      7.7 Mbytes/s
          16 kbytes:   1014.0 usec/IO =     15.4 Mbytes/s


Drive B:
Code:
Synchronous random writes:
           4 kbytes:     15.0 usec/IO =    260.1 Mbytes/s
           8 kbytes:     16.7 usec/IO =    468.5 Mbytes/s
          16 kbytes:     23.0 usec/IO =    680.5 Mbytes/s


Guess which drive has PLP-in-flight data protection? ;)
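(For reference, the Mbytes/s column is just the transfer size divided by the per-IO latency; quick Python check below.)

Code:
# The Mbytes/s column is just transfer size / per-IO latency (in MiB/s):
for size_bytes, usec_per_io in [(4096, 994.3), (4096, 15.0)]:
    mib_per_s = size_bytes / (usec_per_io * 1e-6) / 2**20
    print(f"{size_bytes} bytes every {usec_per_io} usec -> {mib_per_s:.1f} Mbytes/s")
# -> roughly 3.9 and 260; tiny differences vs. the tables are rounding in the latency
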
 

rungekutta

Contributor
Joined
May 11, 2016
Messages
146
Very good info, thanks.
What tool produces benchmarks like the above?

Incidentally, I just bought a used DC S3700 quite cheap off eBay, with plenty of mileage left according to the description. It will be interesting to compare it against my current drive when it arrives. I understand the Optane 900p or P4800X is the gold standard, but that is just too much $$$ for me.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Very good info, thanks.
What tool produces benchmarks like the above?

The FreeBSD diskinfo command has a simple write-test built into it.

WARNING: THE COMMAND IS DESTRUCTIVE. It should prevent you from running it on an active device ("disk busy" or similar errors) but just in case it doesn't stop you, make darned sure you're targeting the right /dev/daX device.

The command is diskinfo -wS /dev/daX and will give you the tabulated results.

Incidentally, I just bought a used DC S3700 quite cheap off eBay, with plenty of mileage left according to the description. It will be interesting to compare it against my current drive when it arrives. I understand the Optane 900p or P4800X is the gold standard, but that is just too much $$$ for me.

If you're putting this into a homelab-style system with very light usage, the "Optane memory" M10 cards are solid performers. Even the smallest 16GB one punches way above its price tag, although the limited (~185TB) write endurance means it's not suited for any kind of heavy use. If you can find another PCIe slot to get more M.2 cards in, this would do a single GbE link just fine.

Code:
Synchronous random writes:
           4 kbytes:     32.6 usec/IO =    120.0 Mbytes/s
           8 kbytes:     59.7 usec/IO =    130.8 Mbytes/s
          16 kbytes:    114.0 usec/IO =    137.0 Mbytes/s


Failing that, the DC S3700/S3710 is about the best you can get for SATA. There are newer/faster SAS drives available, but in terms of "budget SLOG" you've already about hit the limit. Performance increases slightly with capacity; 200GB is often the best price/performance ratio.
 