SSD partitions for L2ARC, SLOG, Metadata and Boot

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One note: L2ARC cannot be mirrored (AFAIK), only striped. And of course, an L2ARC is never critical - any L2ARC failure simply causes ZFS to redirect the read(s) to the data pool.
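
For example (pool and device names are just placeholders), two cache devices simply stripe, and a cache device can be dropped at any time without risk to the pool:
Code:
# Two cache devices always stripe; there is no mirrored L2ARC
zpool add tank cache gptid/l2arc-a gptid/l2arc-b
# A cache device can be removed at any time; the pool keeps working
zpool remove tank gptid/l2arc-a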
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Howdy @naskit

Lots of questions here, so let's get cracking:

On the topic of 'over-provisioning', does TrueNAS have knowledge of and track 'spare cells' on the SSD, or is this only visible to and done by the SSD drive controller? (What I have read in SSD technical data sheets and specifications so far leads me to believe only the SSD drive controller 'sees' or 'knows' about spare cells and only the drive controller can manage them). It is my (perhaps naïve?) belief that the main reason why Enterprise SSDs are advertised with a lower capacity is *precisely because* a chunk of cells is carved out and hidden away for the express purpose of replacing bad cells whenever they are encountered and SMART mechanisms detect that those cells are no longer reliable.

What (if any) commands are available to the user for said 'over-provisioning' configurations?

The SSD controller is the only one that knows the true extent of the overprovisioning and the wear cycles on each particular page/block of NAND, along with the other stats stored in the FTL (Flash Translation Layer). TrueNAS doesn't have any knowledge of the "spare cells" beyond what the SSD reports back through SMART data or attributes - and some of those are vendor-specific, or use raw/hex encoding that makes them difficult or misleading to read at a glance.

Intel is one of the better ones, giving you an extended attribute page to poll:
Code:
admin@alderlake[~]$ sudo smartctl -x /dev/sdb
...
Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              58  ---  Percentage Used Endurance Indicator


Many "Enterprise" SSDs do indeed use a higher level of overprovisioning from the factory compared to consumer SSDs. Some of this comes from the fact that computers (and NAND) think in binary 2^X for their GiB and TiB - so most devices start with ~7.4% of spare area just from being built in binary and sold in decimal. But manufacturers can add extra overhead; for example, an SSD with 512GiB of raw NAND (549,755,813,888 bytes) might be sold as:
  • A 512GB "consumer SSD" with 512,000,000,000 bytes of addressable space, and 7.4% overhead from the binary -> decimal conversion
  • A 500GB "read-optimized enterprise SSD" with 500,000,000,000 bytes of space, and ~10% spare area
  • A 480GB "mixed-use enterprise SSD" with 14.5% spare area
  • A 400GB "write-intensive enterprise SSD" with 37% spare area
Spare area is used not only to replace failed or failing cells, but also to provide extra free pages that can receive writes without first needing to be erased, which typically leads to better sustained write performance.
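
If you want to sanity-check that math yourself, here's a quick sketch using the example capacities above (illustrative numbers only, not any particular drive's firmware):
Code:
# Spare area implied by selling 512GiB of raw NAND at various decimal capacities
raw_bytes=$((512 * 1024**3))            # 549,755,813,888 bytes of raw NAND
for adv_gb in 512 500 480 400; do
    awk -v raw="$raw_bytes" -v adv="$adv_gb" 'BEGIN {
        user = adv * 1000^3
        printf "%dGB advertised -> %.1f%% spare area\n", adv, (raw - user) / user * 100
    }'
done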

So, how can you mimic this? Both SCALE and CORE support SSD overprovisioning through the webUI or shell, following the instructions below:

SCALE: https://www.truenas.com/docs/scale/scaletutorials/storage/disks/slogoverprovisionscale/
CORE: https://www.truenas.com/docs/core/coretutorials/storage/pools/slogoverprovision/

Now of course, there are often other changes between an "enterprise" and a "consumer" SSD - the raw speed and endurance of the NAND used (eMLC or TLC vs QLC), bin quality, power-loss-protection for in-flight data (which you've seen referenced here for the SLOG, and with good reason) - but overprovisioning a consumer SSD can greatly increase both its endurance and speed. Don't expect it to turn a random SSD into a viable SLOG device though.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
SSD controllers use empty space for writes.
There's no need to under/over provision SLOG devices.
Just enable TRIM so the controller knows what's empty and can do its job.
Logical writes will be distributed physically across empty cells.
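
On ZFS that's a single pool property (pool name assumed):
Code:
# Have ZFS issue TRIMs for freed blocks continuously
zpool set autotrim=on tank
zpool get autotrim tank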
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Just enable TRIM so the controller knows what's empty and can do its job.
Mmmh, I prefer either trimming manually or setting a cron job/sysctl for periodic trims. To this day I haven't been able to work out how often the auto trim option actually trims the drives.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
autotrim tracks freed blocks and trims them on a thread, continuously. It's a little bit clever at trying to aggregate ranges.

Check out the perf chart here. Very meaningful if you happen to make a lot of Linux kernel tree copies!

The code comments: https://github.com/openzfs/zfs/blob/master/module/zfs/vdev_trim.c

Code:
 * 1) Automatic TRIM happens continuously in the background and operates
 *    solely on recently freed blocks (ms_trim not ms_allocatable).


You still need to run the occasional manual trim - autotrim can miss tiny ranges, and autotrim is paused during rebuilds/resilvers.
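
Something like this (pool name assumed), by hand or from a periodic job:
Code:
# Start a manual TRIM of every vdev that supports it
zpool trim tank
# Watch per-vdev TRIM progress
zpool status -t tank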
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I'm not suggesting this as the only way. The TrueNAS-documented "pretend it's a small disk" way will certainly work.

But if your hardware doesn't hork itself over TRIM, and especially if you're a "partition my drives" kind of person, autotrim can be good stuff indeed.

I didn't realize OP's drives were SATA, but they're good drives so it's worth testing. If they were NVMe I'd be even more confident.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
As far as I understand auto TRIM decreases the drives' lifespan.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
As far as I understand auto TRIM decreases the drives' lifespan.

Why would that be?

TrueNAS links to this Seagate article, which is pro-TRIM and refers to that space as "Dynamic Overprovisioning".

The benchmark chart in the ZFS change for trim & autotrim showed that enabling autotrim increased performance significantly. I take that as a longevity benefit too - distributed writes are good for longevity, and the amplification from leveling is avoided.

I can imagine a pathological controller performing unnecessary write load for each TRIM, but is that a thing?
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
L2ARC and metadata duties are somewhat redundant. High wear from L2ARC could potentially affect the pool-critical metadata part. I would personally NEVER mix these.
I was referring back to what Volts had said earlier here:
The goofy partition and L2ARC hack that I do endorse is a metadata L2ARC, secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1
Are you @Etorix disagreeing with @Volts, or are you referring to a 'metadata' type that is different from that which @Volts describes above?

If hosting a live database, you may well need a SLOG, which requires a drive with PLP. Is this the case here?
Yes, the SAMSUNG PM893 SSDs have PLP, as mentioned.

for SOHO use if you have L2ARC you rarely need a metadata VDEV
Yeah...I see your point.

I didn't realize OP's drives were SATA, but they're good drives so it's worth testing.
They oughta be, I paid a pretty penny for them! :P

Logical writes will be distributed physically across empty cells.
What I believe some NAND storage manufacturers refer to as 'wear leveling'. Given the finite number of writes to each cell, they spread the writes across all cells to reduce the average number of writes per cell over the life of the drive. That is how I understand it.

built in binary and sold in decimal
We have all heard horror stories of multiple drive manufacturers advertising one spec and selling another...

auto TRIM decreases the drives' lifespan
I would believe that to be true for the same reason they implement wear leveling - finite writes per cell.

...however
enabling autotrim increased performance significantly. I take that as a longevity benefit too - distributed writes are good for longevity, and the amplification from leveling is avoided.
…maybe this is a more correct view?

Thanks again all of you for your thoughts. I will post my final configuration here once I have done it.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Are you @Etorix disagreeing with @Volts, or are you referring to a 'metadata' type that is different from that which @Volts describes above?
I read your drawing as pointing to mirrored devices partitioned to host both an L2ARC and a metadata vdev, which I would personally never do.
If you meant "persistent metadata L2ARC" as per @Volts ' recipe, you're fine, provided that you have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM). But having two drives for L2ARC is overkill, whether as stripe or as mirror.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Why would that be? [...] I can imagine a pathological controller performing unnecessary write load for each TRIM, but is that a thing?
Apparently yes.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I would never bet against broken hardware existing :cool: but I also don't take much away from that post. In it, regular TRIM is slow, ESXi is in the stack, and he says they TRIM fast in Linux.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I read your drawing as pointing to mirrored devices partitioned to host both an L2ARC and a metadata vdev, which I would personally never do.
If you meant "persistent metadata L2ARC" as per @Volts ' recipe, you're fine, provided that you have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM). But having two drives for L2ARC is overkill, whether as stripe or as mirror.

I'm not suggesting a data L2ARC. A metadata-only L2ARC (especially a persistent one, l2arc.rebuild_enabled) can be a big win at a much lower RAM cost.

It's an alternative to the special metadata vdev. It's easier to live with, especially in weird situations. Much easier to import a shelf of disks with both data and metadata, instead of also figuring out how to attach the metadata devices.
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
you meant "persistent metadata L2ARC" as per @Volts ' recipe
Yep, that is what I meant - thanks for clarifying.

have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM)
Converting the formula above to my system: 320GB ≤ L2ARC ≤ 640GB.

From "FreeBSD Mastery: Advanced ZFS" (Lucas/Jude):
- Page 132: "The L2ARC catches items that fall off of the ARC."
- Page 133: "As a general rule of thumb, each gigabyte of L2ARC requires about 25 MB of ARC...assume that one terabyte of L2ARC, fully utilized, will devour about 25 GB of ARC."

Per the book: 960 GB of L2ARC x 25 MB/GB = 24,000 MB ≈ 24 GB of ARC.

This doesn't quite align with "5*RAM≤L2ARC≤10*RAM". I do appreciate, however, that this is highly variable due to varying file size: smaller files = more files = more index entries in the ARC metadata, and larger files = the opposite.
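
For my own sanity, the same back-of-the-envelope math in shell form (using the book's 25 MB per GB figure, which is only a rule of thumb):
Code:
# Rough ARC header cost of an L2ARC at ~25 MB of ARC per GB of L2ARC
l2arc_gb=960
awk -v l2="$l2arc_gb" 'BEGIN { printf "%d GB of L2ARC -> ~%.0f GB of ARC used for headers\n", l2, l2 * 25 / 1000 }'
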
secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1
I understand from page 180 of "FreeBSD Mastery: Advanced ZFS" that the "metadata" cache property tells ZFS not to cache actual files, but only metadata. So my interpretation of your 'hack' @Volts is that your L2ARC only contains metadata but no actual cached files for fast retrieval.
Would this be correct?

I found this a good read: Klara Systems - OpenZFS: All about the cache vdev or L2ARC
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
This doesn't quite align with "5*RAM≤L2ARC≤10*RAM". I do appreciate, however, that this is highly variable due to varying file size. Smaller files = more files = more index entries in the ARC metadata, and Larger files = opposite.
The best performance is obtained with a RAM to L2ARC ratio of 1:4. It's possible, however, to push the ratio up to 1:10 with increasingly (and non-linearly) diminishing returns: going beyond that would be a futile exercise, and most of us would frown upon anything greater than a 1:6 ratio.

This is, at the very least, the forum's consensus. Stick with 1:4.

I understand from page 180 of "FreeBSD Mastery: Advanced ZFS" that the "metadata" cache property tells ZFS not to cache actual files, but only metadata. So my interpretation of your 'hack' @Volts is that your L2ARC only contains metadata but no actual cached files for fast retrieval.
Would this be correct?
The L2ARC usually holds both data and metadata, but it can be set to accept only metadata (referred to as metadata-only) and it can also be made persistent across reboots.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
the "metadata" cache property tells ZFS not to cache actual files, but only metadata

Right. I don't think I'm adding anything new at this point - just responding with keywords for your searching pleasure.

Default is primarycache=all; secondarycache=all. ARC is used for both data and metadata - it's a rare system where that isn't optimal. By default if L2ARC is configured, it's used for both.

What I'm describing is leaving primarycache=all, and changing secondarycache=metadata. To have it persist across reboots, see the sysctl tunable vfs.zfs.l2arc.rebuild_enabled=1.
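
A minimal sketch of that recipe (pool and device names assumed; the Linux module parameter is the SCALE-side equivalent of the FreeBSD sysctl):
Code:
# Add the cache device (partition label is a placeholder)
zpool add tank cache gptid/l2arc-part
# ARC keeps caching everything; the L2ARC only accepts metadata
zfs set secondarycache=metadata tank
# Let the L2ARC contents survive a reboot
sysctl vfs.zfs.l2arc.rebuild_enabled=1                      # CORE (FreeBSD)
echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled   # SCALE (Linux)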
 