Shower Thoughts: ZFS and SSD?

Joined
Oct 22, 2019
Messages
3,641
A thought just occurred to me, and perhaps I'm approaching it the wrong way.

Does SSD "garbage collection" conflict with the principle of ZFS "copy-on-write"?


One of the reasons for data integrity with ZFS is the "copy-on-write" feature. Once a record is written, it stays as written: it is never modified in place and never overwritten; changes go to a new location instead. This is true for user data and metadata. Each record has a checksum assigned specifically to it.

However, SSD controllers do a lot of their own internal housekeeping, notably "garbage collection". From what I understand, they are always physically reading, rearranging, and re-copying pages and blocks behind-the-scenes, unknown to the operating system or filesystem.

Doesn't this negate the assurances of "copy-on-write"?

What if there is a mistake during this internal housekeeping? (I'd assume ZFS will detect a mismatch against the checksum in a future scrub or read.)

However, what if the SSD's controller runs its "garbage collection", yet it does it against areas of the disk where ZFS stores the actual checksums? In other words, the user data records are not touched, yet it messes up when trying to consolidate/rearrange/re-copy the pages that contain the checksums. Wouldn't ZFS interpret this as "data corruption", even though the user data is still intact and perfectly fine?
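As a rough illustration of the copy-on-write idea being asked about, here is a minimal Python sketch. The dict-based "disk", cow_write and verified_read are invented purely for illustration, not ZFS internals: an updated record always lands at a new address with its own checksum, and the old version is never touched.

```python
# Toy illustration of copy-on-write vs. overwrite-in-place.
# Simplified sketch only; the dict "storage", cow_write() and
# verified_read() are invented names, not ZFS's on-disk logic.
import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

storage = {}          # address -> raw bytes (stand-in for the disk)
next_free_addr = 0

def cow_write(data: bytes):
    """Write a record copy-on-write style: always to a fresh address,
    returning (address, checksum) for the parent to record."""
    global next_free_addr
    addr = next_free_addr
    next_free_addr += 1
    storage[addr] = data          # old versions at other addresses stay untouched
    return addr, checksum(data)

def verified_read(addr: int, expected_cksum: int) -> bytes:
    data = storage[addr]
    if checksum(data) != expected_cksum:
        raise IOError(f"checksum mismatch at address {addr}")
    return data

# "Modify" a record: the original block is never touched.
old_addr, old_ck = cow_write(b"version 1 of the record")
new_addr, new_ck = cow_write(b"version 2 of the record")
assert verified_read(old_addr, old_ck) == b"version 1 of the record"
```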
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
It's turtles all the way down. There are checksums for the blocks storing the checksums, and so on, all the way up to the uberblock, of which there are multiple copies.
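A minimal sketch of that "turtles all the way down" idea, assuming a toy Merkle tree rather than ZFS's real on-disk structures (the Node class and all names here are invented): because each parent's digest covers its children's digests, corrupting a block that holds checksums is still caught one level up, ultimately at the root.

```python
# Minimal sketch of "checksums all the way down" (a Merkle tree).
# Not ZFS's real structures; Node and its layout are invented purely
# to show why corrupting a block of checksums is still detectable.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Node:
    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []

    def digest(self) -> bytes:
        # A parent's digest covers its children's digests, so a corrupted
        # "block of checksums" changes its own digest and is caught above it.
        return h(self.data + b"".join(c.digest() for c in self.children))

leaves = [Node(b"record A"), Node(b"record B")]
indirect = Node(children=leaves)          # holds the leaves' checksums
root = Node(children=[indirect])          # stand-in for the topmost block

trusted_root_digest = root.digest()       # kept in multiple copies

# Silently corrupt the *metadata* level, not the user data:
indirect.data = b"bit flip in the checksum block"
assert root.digest() != trusted_root_digest   # mismatch is detected at the root
```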
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
However, what if the SSD's controller runs its "garbage collection"

The SSD's internal address remapping is opaque to ZFS. If you write X at address Y on the SSD, you read back X from address Y, regardless of whether garbage collection or any other internal maintenance has run in the meantime, unless you TRIM address Y.

That is, assuming the SSD is not faulty. If it is faulty, you get what you get, and then you detect it with checksums. If the fault affects some specific area, the effect on ZFS is the same as the equivalent error on a rotational drive.
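A toy sketch of why the remapping is opaque, assuming an invented flash translation layer (the ToyFTL class and its methods are made up for illustration, not real firmware): garbage collection moves data between physical pages and updates the logical-to-physical map, so a read of logical address Y still returns whatever was last written at Y, until Y is TRIMmed.

```python
# Toy flash translation layer (FTL); all names are invented for
# illustration, real SSD firmware is proprietary and far more complex.

class ToyFTL:
    def __init__(self, num_physical_pages=16):
        self.physical = [None] * num_physical_pages   # NAND pages
        self.l2p = {}                                 # logical addr -> physical page

    def write(self, logical, data):
        phys = self.physical.index(None)              # grab any erased page
        self.physical[phys] = data
        self.l2p[logical] = phys

    def read(self, logical):
        phys = self.l2p.get(logical)
        return None if phys is None else self.physical[phys]

    def trim(self, logical):
        phys = self.l2p.pop(logical, None)            # host declares the data dead
        if phys is not None:
            self.physical[phys] = None

    def garbage_collect(self):
        # Move every live page to a different physical location and update
        # the map. The host-visible contract (read(Y) returns what was last
        # written at Y) is unchanged -- which is why GC is opaque to ZFS.
        for logical, old_phys in list(self.l2p.items()):
            new_phys = self.physical.index(None)      # some erased page
            self.physical[new_phys] = self.physical[old_phys]
            self.l2p[logical] = new_phys
            self.physical[old_phys] = None            # old page can be erased

ftl = ToyFTL()
ftl.write(7, b"X")
ftl.garbage_collect()
assert ftl.read(7) == b"X"     # same logical address, same data
ftl.trim(7)
assert ftl.read(7) is None
```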
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
In some ways the garbage collection of an SSD is an annoyance to ZFS. I do expect ZFS to catch any and all errors from an SSD, as it was designed to do.

But SSDs do have to relocate blocks within their erasure domains. Let us say that the erasure domain (erase block) is 64 KB; it was on some earlier SSDs. When it needs to overwrite a used block, the SSD has to copy all the still-valid blocks to a previously erased 64 KB block from the spare list, update the indirection table, then schedule the old 64 KB block for erasure and add it to the spare list.

If ZFS knew about the 64 KB erasure domain, in theory it could treat 64 KB as its minimum allocation size (instead of the 512 bytes or 4096 bytes we have today). Then, after COW (copy-on-write), a TRIM command could be sent for the old 64 KB block. ZFS and the SSD would then work together, perhaps reducing extraneous internal SSD copy/move operations.
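A back-of-the-envelope sketch of that point, using made-up numbers (a 64 KB erase block holding 4 KB pages): without erase-block-aligned allocation and TRIM, overwriting a single page forces the SSD to copy the rest of the erase block; with aligned allocation plus TRIM, nothing would need to be relocated.

```python
# Rough arithmetic for the point above, with assumed sizes:
# a 64 KiB erase block holding 4 KiB pages, one page rewritten.

ERASE_BLOCK = 64 * 1024
PAGE = 4 * 1024
pages_per_block = ERASE_BLOCK // PAGE           # 16

# Case 1: 4 KiB allocations, no TRIM hints. Overwriting one logical page
# forces the SSD to copy the other still-valid pages before erasing.
copied_without_trim = pages_per_block - 1       # 15 pages of extra copying

# Case 2 (hypothetical): ZFS allocates/frees whole 64 KiB chunks and TRIMs
# them after copy-on-write, so every page in the old erase block is already
# known to be dead -- nothing needs relocating.
copied_with_aligned_trim = 0

print(f"pages per erase block: {pages_per_block}, "
      f"copied without TRIM: {copied_without_trim}, "
      f"copied with aligned TRIM: {copied_with_aligned_trim}")
```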
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Think of SSDs as little boxes of NAND running a much crappier filesystem that does much of what ZFS already does, but at the level of NAND ICs instead of disks. Also, deeply proprietary and opaque. Also, often very buggy.

You might say "but that's terrible", and you'll hardly find anyone outside Marvell, Silicon Motion, Mediatek, Samsung, Micron, Hynix, WD, Seagate and Kioxia who would disagree with that assessment... But it's what we have right now.
If you have pressure points available to you in any of these companies, do use them to apply pressure for more open SSD firmware.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I'm approaching it the wrong way.

This.

There's a lot of inefficiency in computing systems. We often simplify systems to manage the complexity. While it would be entirely possible to write a ZFS-like, flash-optimized storage system, it turns out to be much more convenient to abstract it out as a block storage problem that can be optimized towards either HDD or SSD (or possibly other things in the future); this allows, for example, ZFS to be used on both of those underlying technologies. Likewise, there is a lot of interpreted code running out there, very inefficiently, that could be running a lot faster if it were written in an optimized compiled language. We do these things because they let us dumb humans get our heads around the problem and develop code more quickly and/or more abstractly, without needing to fully understand a more complicated set of interrelationships and interactions. It also provides a layer where we can swap things out and not be locked in to old ways of doing things; flash block sizes might vary over time, and so might preferred development languages.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Doesn't this negate the assurances of "copy-on-write"?

If the SSD's embedded firmware isn't sufficiently well-coded to prevent corruption during these steps (e.g. it needs to do a similar atomic copy-on-write behaviour itself) then yes, this can negate that assurance. This is why the only true requirement for an SSD's use with ZFS is that the drive won't corrupt existing data or its internal metadata (such as the flash translation layer, "FTL") - even under a power-loss scenario, as described in the OpenZFS hardware documentation.


What if there is a mistake during this internal housekeeping?

Data goes missing. Hopefully, ZFS would be able to pick up on this, if it's only a small amount, and recover from it - and we'd very quickly see reports of data loss from not just the TrueNAS community but anyone using these devices in any context.

However, what if the SSD's controller runs its "garbage collection", yet it does it against areas of the disk where ZFS stores the actual checksums? In other words, the user data records are not touched, yet it messes up when trying to consolidate/rearrange/re-copy the pages that contain the checksums. Wouldn't ZFS interpret this as "data corruption", even though the user data is still intact and perfectly fine?

ZFS will see this as "data that doesn't match a checksum" and will overwrite it with a replica copy from the other vdev members - assuming that those devices haven't suffered a similar fate.
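A simplified sketch of that repair path, assuming a toy two-way mirror (the Mirror class and its layout are invented for illustration, not OpenZFS code): the copy whose checksum matches the value recorded by the parent wins, and the damaged copy is rewritten from it.

```python
# Simplified self-healing on a two-way mirror; Mirror and its dict-based
# "devices" are invented for illustration, not OpenZFS internals.
import zlib

def cksum(data: bytes) -> int:
    return zlib.crc32(data)

class Mirror:
    def __init__(self):
        self.disks = [{}, {}]          # two member "devices": addr -> bytes

    def write(self, addr, data):
        for disk in self.disks:
            disk[addr] = data
        return cksum(data)             # the parent block records this checksum

    def read(self, addr, expected):
        # Try each copy; repair any copy that fails the checksum
        # using a copy that passes.
        good = None
        bad_disks = []
        for disk in self.disks:
            if cksum(disk.get(addr, b"")) == expected:
                good = disk[addr]
            else:
                bad_disks.append(disk)
        if good is None:
            raise IOError("all copies failed the checksum")
        for disk in bad_disks:
            disk[addr] = good          # "self-heal" the damaged copy
        return good

m = Mirror()
expected = m.write(0, b"checksum-bearing metadata block")
m.disks[1][0] = b"garbage written by buggy firmware"        # silent corruption
assert m.read(0, expected) == b"checksum-bearing metadata block"
assert m.disks[1][0] == b"checksum-bearing metadata block"  # repaired copy
```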

Suffice it to say, any device that corrupts its own data like this would (hopefully) be caught in regular QA/QC testing by the vendor, and if not, in an extreme hurry by the consumer market.
 