Hardware Guide - NVMe and Fusion Pool Power Loss Protection

Ericloewe

Server Wrangler
Moderator
The only thing I still do not get is why a UPS is sufficient and I don't need an "in flight" PLP-protected fusion pool.
It needs it as much or as little as a traditional pool.
 

spuky

Explorer
Thank you @Ericloewe and sorry to you @spuky, I was on mobile :smile:

The only thing I still do not get is why a UPS is sufficient and I don't need an "in flight" PLP-protected fusion pool.
In my imagination, metadata gets to the SSD, the SSD says "yeah sure, I got this" while the data is still only in cache, the system has an unexpected shutdown, and poof, the metadata in cache is lost and never written to the fusion pool.

A sync write means it is on disk... the transaction with the storage system is done. That's a given when metadata and data are on disk. If only the metadata is on disk and the system is still waiting for the spinning rust to say "we got this"... the transaction is not done...
 

Jamberry

Contributor
Sorry to be so thick; I'd rather ask one time too many than assume wrong things. Also, my statements are not really statements, more like guesses that are probably wrong, so you can correct me.
It needs it as much or as little as a traditional pool.
A fusion pool and a traditional pool lose a write in the worst case, but if it is sync, at least the client knows it lost a write.
Like if I have a Proxmox hypervisor with NFS on TrueNAS and suddenly disconnect the LAN cable, there is just an ugly disconnect, but the VM would probably survive it because it is just a normal crash. Not a disaster for a normal Windows VM. A database, on the other hand, would know that it could not finish the write and, because of atomicity and all, roll back. So basically it behaves like an unexpected shutdown, not a huge deal.

If I use a SLOG and have a power outage or unexpected shutdown, without "in flight" PLP I risk losing the whole pool, because some things have not been written to disk: the controller basically lied to TrueNAS about having it all written to the SSD, while in reality it was only in the SSD cache. Another reason could be that even if the controller does not lie to TrueNAS, it will probably be slower, because unlike a PLP SSD, it cannot cache data if it is honest with TrueNAS.

So assuming that the HDDs or SSDs in my pool do not lie to TrueNAS about writes, there is no real risk of losing my pool. Which would answer our original question of why we don't need PLP on fusion pools: we don't because there is no risk of losing a pool by not using PLP.

Which brings me to the next thing: if I have no SLOG, I basically have the ZIL on my pool. If my pool is only HDDs or SSDs, can't they lie to TrueNAS?

Assuming they do lie to TrueNAS, it would be a safety gain to add a PLP SLOG drive, because that way I will always know that writes that were promised to be on disk are actually on disk.

I can take the risk of not mirroring the SLOG, because it only gets read from after a power loss. The only scenario where it would matter is if the drive dies and I also have a power outage. Pretty unlikely.

I cannot take the risk of not mirroring fusion pools, because I would lose my pool.

Does that sound more or less right to you guys? Thank you for reading all of this!
 

spuky

Explorer
A fusion pool and a traditional pool lose a write in the worst case, but if it is sync, at least the client knows it lost a write.
No, they lose (assuming default parameters) up to 5 seconds of writes, unless those are sync writes.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Sorry to be so thick; I'd rather ask one time too many than assume wrong things. Also, my statements are not really statements, more like guesses that are probably wrong, so you can correct me.

"... Every question is a cry to understand the world. There is no such thing as a dumb question." - Carl Sagan

A fusion pool and a traditional pool lose a write in the worst case, but if it is sync, at least the client knows it lost a write.
Like if I have a Proxmox hypervisor with NFS on TrueNAS and suddenly disconnect the LAN cable, there is just an ugly disconnect, but the VM would probably survive it because it is just a normal crash. Not a disaster for a normal Windows VM. A database, on the other hand, would know that it could not finish the write and, because of atomicity and all, roll back. So basically it behaves like an unexpected shutdown, not a huge deal.

A network disconnect is slightly different from a sudden power failure of the storage, so let's use the latter as an example (your TrueNAS system freezes, or loses power due to an internal PSU/motherboard/HBA failure, etc.).

In the "worst case", if you are writing async, the client incorrectly believes that the writes were safe. This is very bad, especially for things like VMs or databases, because from the client's perspective, the writes were safely committed - when they go to read them back, either "on demand" from a DB, or "on boot" from the VM, it won't find the data it expects. This could be a "whoops, your file is missing" or "whoops, your whole VM is corrupt."

If you have sync writes enabled, none of the above applies. Your client won't get the confirmation of "write safe" until it actually is safely on the ZIL - whether that resides on the pool, or on an SLOG. When your TrueNAS system powers back on, it sees there are pending uncommitted writes in the SLOG, and "completes the writes" to the main pool. (This is the only time an SLOG is read from.) Your VM will react as if it had crashed at the exact time your storage did. A client write or DB commit that was in progress may have been partially completed from the client's OS - but it will be "definitively incomplete" rather than "mistakenly complete" if that makes sense.
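The sync-write contract described above can be sketched in a few lines of Python (illustrative only, not ZFS internals): `os.fsync` stands in for the device-level cache flush that ZFS issues when logging to the ZIL/SLOG, and the caller only gets its "write safe" confirmation after it returns.

```python
import os
import tempfile

def async_write(path, data):
    # Async: write() returns once the data is in the OS page cache.
    # A crash before the next flush can silently lose it.
    with open(path, "wb") as f:
        f.write(data)

def sync_write(path, data):
    # Sync: fsync() blocks until the device reports the data durable.
    # Only then does the caller get its "write safe" confirmation -
    # the moment ZFS would have the record on the ZIL/SLOG.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

path = os.path.join(tempfile.gettempdir(), "demo_sync.bin")
sync_write(path, b"committed")
with open(path, "rb") as f:
    print(f.read())  # b'committed'
```

The difference to the client is exactly the one described: after `sync_write` returns, the data is "definitively complete"; after `async_write` returns, it is merely "probably complete."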

If I use a SLOG and have a power outage or unexpected shutdown, without "in flight" PLP I risk losing the whole pool, because some things have not been written to disk: the controller basically lied to TrueNAS about having it all written to the SSD, while in reality it was only in the SSD cache.

I'll talk about "in-flight PLP" in a moment - but the "controller lying" is one of the reasons why RAID controllers are discouraged - they will take a pending write into their cache, and then if that cache is lost for any reason (RAID card failed, power outage exceeds BBWC hold-up time, etc) then that write is also lost forever.

Another reason could be that even if the controller does not lie to TrueNAS, it will probably be slower, because unlike a PLP SSD, it cannot cache data if it is honest with TrueNAS.

Most post-2014 or so SSDs don't lie about their caching. You're exactly on the mark about a controller that's "honest" about its lack of in-flight PLP being significantly slower; devices that don't have it tend not to make suitably fast SLOGs.

So assuming that the HDDs or SSDs in my pool do not lie to TrueNAS about writes, there is no real risk of losing my pool. Which would answer our original question of why we don't need PLP on fusion pools: we don't because there is no risk of losing a pool by not using PLP.

Correct. The pending writes to your special ("Fusion Pool") devices are still being protected through the same path as your regular writes. They're async, async, async - until ZFS finally decides to close the transaction, at which point it issues that small burst of sync-writes to the vdevs and says "commit this metadata/uberblock update, and I'm not proceeding until you do" - but similarly to how an SLOG without PLP is slow to do this, so are pool devices. That means that the best special/meta devices share similar characteristics to SLOGs; they don't have quite the emphasis on endurance, but having a low write latency is helpful - or at least, "low relative to your regular pool devices."

Which brings me to the next thing: if I have no SLOG, I basically have the ZIL on my pool. If my pool is only HDDs or SSDs, can't they lie to TrueNAS?

They could, but see above regarding how that's not much of a thing anymore because of the negative repercussions.

Assuming they do lie to TrueNAS, it would be a safety gain to add a PLP SLOG drive, because that way I will always know that writes that were promised to be on disk are actually on disk.

Unfortunately not. If your pool vdevs lie about the data being safe when it isn't, you'd open up the same vulnerability when the transaction commits to disk - ZFS would think it's safe, clear out the SLOG, and then the crash happens and it turns out the disks lied. Whoops.

The only real guarantee that ZFS requires for data safety is "power loss protection for data at rest" - devices shouldn't lie about a cache flush, and shouldn't corrupt existing good data when writing new data. See footnote [1] for an old paper from 2013 about the behavior of some SSDs under power fault.

I can take the risk of not mirroring the SLOG, because it only gets read from after a power loss. The only scenario where it would matter is if the drive dies and I also have a power outage. Pretty unlikely.

Notably, the drive has to die at the same time as a power outage. A dead SLOG will cause your sync-writes to revert to the on-pool ZIL. They'll still be safe, just very slow. If you then have a power failure after the reversion completes, you're still safe, because upon power-up ZFS would replay the transaction from the on-pool ZIL.
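The replay behavior can be modeled with a toy sketch (plain Python, hypothetical names, nothing like real ZFS internals): sync writes land in an intent log first, the client is acknowledged, and anything still pending at "power-up" is replayed into the pool state.

```python
# Toy model: a durable intent log that survives a "crash" and is
# replayed on startup, mirroring how ZFS replays the ZIL/SLOG on import.
intent_log = []   # stands in for the on-pool ZIL or a SLOG device
pool_state = {}   # stands in for the committed pool contents

def sync_write(key, value):
    # Durable log entry first; the client is acknowledged here.
    # The actual transaction-group commit would happen later.
    intent_log.append((key, value))

def replay_on_import():
    # On power-up, ZFS sees pending uncommitted entries and
    # "completes the writes" to the main pool.
    while intent_log:
        key, value = intent_log.pop(0)
        pool_state[key] = value

sync_write("file.txt", b"hello")   # acknowledged, but txg never closed
# -- simulated power loss and reboot --
replay_on_import()
print(pool_state)   # {'file.txt': b'hello'}
```

The point of the sketch: the acknowledged write survives the crash because it lives in the log, which is exactly why the log (not the pool's eventual txg commit) is what the client's durability guarantee rests on.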

I cannot take the risk of not mirroring fusion pools, because I would lose my pool.

If by "Fusion Pools" you mean the special/meta vdevs - yes, they absolutely require redundancy. They are a top-level vdev, and their loss renders the pool unmountable. If you have mirrored vdevs, use mirrored special devices. If you have RAIDZ vdevs, use at least the same level of redundancy (RAIDZ1 = 2-way mirror, RAIDZ3 = 4-way mirror), and also strongly consider not using a special vdev at all, as it won't be removable, even in a controlled manner, without destroying the pool!

Does that sound more or less right to you guys? Thank you for reading all of this!

You're welcome, and sorry about the post length. :)

[1] https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf
 

Jamberry

Contributor
Hey @HoneyBadger, I can't thank you enough for taking the time to answer all of this in such an easy-to-understand way! You would be a great teacher!

A dead SLOG will cause your sync-writes to revert to the on-pool ZIL
Huh, I did not even think of that. But what if the SLOG does not completely crash but returns wrong data? Does TrueNAS recognize the wrong data and revert to the ZIL?

Most post-2014 or so SSDs don't lie about their caching
That is uplifting to hear, but at the same time scary because you also say

The only real guarantee that ZFS requires for data safety is "power loss protection for data at rest" - devices shouldn't lie about a cache flush, and shouldn't corrupt existing good data when writing new data.
So we have to trust the SSD vendors. You say nobody has messed it up since 2014, but reading your PDF it seems like these FTLs are basically magical black boxes. Is it paranoid to think that maybe some vendor will mess this up later on? Either on purpose, because they want to save costs, or not on purpose, just because an FTL is extremely complex and someone made a mistake? Something like the WD SMR debacle?

One question is still open for debate in my opinion: Why does the manual recommend a UPS for special vdevs? Because like @Ericloewe said:
It needs it as much or as little as a traditional pool.

Anyway, I also like having a UPS.

Thank you guys again for all this great information and insights!
This opens up a lot of ideas for my new build, but that is off topic. :smile: I will open up a separate post.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Hey @HoneyBadger, I can't thank you enough for taking the time to answer all of this in such an easy-to-understand way! You would be a great teacher!

Happy to help!

Huh, I did not even think of that. But what if the SLOG does not completely crash but returns wrong data? Does TrueNAS recognize the wrong data and revert to the ZIL?

That's where a mirrored SLOG is necessary to save you - records are checksummed before write, so if a single SLOG device is returning bad data, it will pull from the mirror pair. That's why it's recommended in the UI to have a mirrored log device.
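The checksum-then-fallback behavior can be sketched in Python (a toy model with hypothetical names, not the real ZFS record format): each record is stored with its checksum, and a read that fails verification on one mirror side falls back to the other.

```python
import hashlib

def store(record):
    # Records are checksummed before write, so corruption is detectable
    # on read-back rather than silently trusted.
    return {"data": record, "sum": hashlib.sha256(record).hexdigest()}

def read_mirrored(side_a, side_b):
    # Try each mirror side; return the first copy whose checksum matches.
    for side in (side_a, side_b):
        if hashlib.sha256(side["data"]).hexdigest() == side["sum"]:
            return side["data"]
    raise IOError("both mirror sides failed checksum verification")

a = store(b"log record")
b = store(b"log record")
a["data"] = b"garbage"        # one SLOG device returns bad data
print(read_mirrored(a, b))    # b'log record' (pulled from the mirror)
```

With only a single log device, the `raise` branch is all that's left when the checksum fails, which is the scenario the mirrored-SLOG recommendation protects against.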

That is uplifting to hear, but at the same time scary because you also say ("The only real guarantee that ZFS requires for data safety is "power loss protection for data at rest" - devices shouldn't lie about a cache flush, and shouldn't corrupt existing good data when writing new data.")

So we have to trust the SSD vendors. You say nobody has messed it up since 2014, but reading your PDF it seems like these FTLs are basically magical black boxes. Is it paranoid to think that maybe some vendor will mess this up later on? Either on purpose, because they want to save costs, or not on purpose, just because an FTL is extremely complex and someone made a mistake? Something like the WD SMR debacle?

Device firmware is largely a black box that we do have to trust - and unfortunately that trust has been misplaced in the past. A healthy amount of paranoia, when it comes to the safety of your data, is a good thing. Yes, it's possible we could see something like this return - but it's unlikely to slip past a decent vendor's quality control or "moral compass," as something with this level of potential data impact would likely go well past the level of backlash that the WD SMR issue saw. Drive vendors sell their spindles and NAND to all manner of companies, and the 800 lb gorillas of the corporate/social-media world would likely have a field day if their data was impacted. If you think the consumer NAS world was upset about the SMR mislabeling, imagine how upset the likes of Facebook or Twitter would be if they found out that their data wasn't as safe as a company had promised (on a contract) that it was ... :oops:

One question is still open for debate in my opinion: Why does the manual recommend a UPS for special vdevs? Because, like @Ericloewe said: "(Special vdevs) need (a UPS) as much or as little as a traditional pool."

Anyway, I also like having a UPS.

I assume you're talking about this line from the docs page [1]:

When using SSDs with an internal cache, add uninterruptible power supply (UPS) to the system to help minimize the risk from power loss.

It's good general advice, but ultimately not technically necessary.

If your SSDs are compliant with the ZFS requirements, that being the "doesn't lie about a cache flush, and doesn't corrupt existing good data when writing new data" that I mentioned previously, then the transaction group is committed to the pool and special vdev disks asynchronously for the bulk data and metadata - the final "switch" that changes the pool state to the new txg ID by incrementing the uberblock counter is done sync to the metadata devices. If it succeeds - the pool goes ahead and is now on "txg+1" - if there's a power failure at that exact moment and the final update fails, then ZFS considers the entire transaction as failed, and will revert the pool back to the "old txg." This of course means the loss of any data that was in the "pending update" and not protected by the ZIL.

Although, if you use meta SSDs that have power-loss-protection, that final sync-write update happens extra-quick, and is protected by the capacitors in the meta devices. Much like SLOG, the best candidates for special vdevs often have in-flight PLP by design.
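The "atomic flip" at the end of a transaction group can be sketched with a toy Python model (hypothetical names, not real ZFS on-disk structures): bulk data is staged first, then a single atomic rename stands in for the uberblock update that moves the pool to the new txg. If the flip never happens, readers still see the old txg.

```python
import json
import os
import tempfile

# Stand-in for the pool's current uberblock / active txg pointer.
STATE = os.path.join(tempfile.gettempdir(), "toy_uberblock.json")

def commit_txg(txg_id, data):
    # Stage the new state, flush it durably (the final sync write
    # to the vdevs), then atomically flip the pointer to the new txg.
    tmp = STATE + ".next"
    with open(tmp, "w") as f:
        json.dump({"txg": txg_id, "data": data}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, STATE)   # atomic: the pool is now on the new txg

def current_state():
    with open(STATE) as f:
        return json.load(f)

commit_txg(1, "old data")
# Simulated power loss: txg 2's data was staged, but the atomic
# pointer flip (os.replace) never ran.
with open(STATE + ".next", "w") as f:
    json.dump({"txg": 2, "data": "pending"}, f)
print(current_state()["txg"])   # 1  (readers still see the old txg)
```

The design choice the toy illustrates: because the flip is a single atomic operation, there is no window where the pool points at a half-written transaction - it is either entirely on the old txg or entirely on the new one.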

Thank you guys again for all this great information and insights!
This opens up a lot of ideas for my new build, but that is off topic. :smile: I will open up a separate post.

Feel free to @ tag me in it if you'd like.

[1] https://www.truenas.com/docs/core/coretutorials/storage/pools/fusionpool/
 