SSD partitions for L2ARC, SLOG, Metadata and Boot

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
One note: L2ARC cannot be mirrored (AFAIK), only striped. And of course, an L2ARC is never critical - any L2ARC failure simply causes ZFS to redirect the read(s) to the data pool.
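
For example (pool and device names are just placeholders), two cache devices simply stripe, and a cache device can be dropped at any time without risk to the pool:
Code:
# Two cache devices always stripe; there is no mirrored L2ARC
zpool add tank cache gptid/l2arc-a gptid/l2arc-b
# A cache device can be removed at any time; the pool keeps working
zpool remove tank gptid/l2arc-a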
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Howdy @naskit

Lots of questions here, so let's get cracking:

On the topic of 'over-provisioning', does TrueNAS have knowledge of and track 'spare cells' on the SSD, or is this only visible to and done by the SSD drive controller? (What I have read in SSD technical data sheets and specifications so far leads me to believe only the SSD drive controller 'sees' or 'knows' about spare cells and only the drive controller can manage them). It is my (perhaps naïve?) belief that the main reason why Enterprise SSDs are advertised with a lower capacity is *precisely because* a chunk of cells is carved out and hidden away for the express purpose of replacing bad cells whenever they are encountered and SMART mechanisms detect that those cells are no longer reliable.

What (if any) commands are available to the user for said 'over-provisioning' configurations?

The SSD controller is the only one that knows the true extent of the overprovisioning and the wear cycles on each particular page/block of NAND, along with the other stats stored in the FTL (Flash Translation Layer). TrueNAS doesn't have any knowledge of the "spare cells" beyond what the SSD reports back through SMART data or attributes - and some of those are vendor-specific, or use raw/hex encoding that makes them difficult or misleading to read at a glance.

Intel is one of the better ones, giving you an extended attribute page to poll:
Code:
admin@alderlake[~]$ sudo smartctl -x /dev/sdb
...
Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              58  ---  Percentage Used Endurance Indicator


Many "Enterprise" SSDs do indeed use a higher level of overprovisioning from the factory compared to consumer SSDs. Some of this comes from the fact that computers (and NAND) think in binary 2^X for their GiB and TiB - so most devices start with ~7.4% of spare area just from being built in binary and sold in decimal. But manufacturers can add extra overhead; for example, an SSD with 512GiB of raw NAND (549,755,813,888 bytes) might be sold as:
  • A 512GB "consumer SSD" with 512,000,000,000 bytes of addressable space, and 7.4% overhead from the binary -> decimal conversion
  • A 500GB "read-optimized enterprise SSD" with 500,000,000,000 bytes of space, and ~10% spare area
  • A 480GB "mixed-use enterprise SSD" with 14.5% spare area
  • A 400GB "write-intensive enterprise SSD" with 37% spare area
Spare area is used not only to replace failed or failing cells, but also to provide extra free pages that can receive writes without first needing to be erased, which typically leads to better sustained write performance.
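
If you want to sanity-check that math yourself, here's a quick sketch using the example capacities above (illustrative numbers only, not any particular drive's firmware):
Code:
# Spare area implied by selling 512GiB of raw NAND at various decimal capacities
raw_bytes=$((512 * 1024**3))            # 549,755,813,888 bytes of raw NAND
for adv_gb in 512 500 480 400; do
    awk -v raw="$raw_bytes" -v adv="$adv_gb" 'BEGIN {
        user = adv * 1000^3
        printf "%dGB advertised -> %.1f%% spare area\n", adv, (raw - user) / user * 100
    }'
done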

So, how can you mimic this? Both SCALE and CORE support SSD overprovisioning through the webUI or shell, following the instructions below:

SCALE: https://www.truenas.com/docs/scale/scaletutorials/storage/disks/slogoverprovisionscale/
CORE: https://www.truenas.com/docs/core/coretutorials/storage/pools/slogoverprovision/

Now of course, there are often other changes between an "enterprise" and a "consumer" SSD - the raw speed and endurance of the NAND used (eMLC or TLC vs QLC), bin quality, power-loss-protection for in-flight data (which you've seen referenced here for the SLOG, and with good reason) - but overprovisioning a consumer SSD can greatly increase both its endurance and speed. Don't expect it to turn a random SSD into a viable SLOG device though.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
SSD controllers use empty space for writes.
There's no need to under/over provision SLOG devices.
Just enable TRIM so the controller knows what's empty and can do its job.
Logical writes will be distributed physically across empty cells.
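
On ZFS that's a single pool property (pool name assumed):
Code:
# Have ZFS issue TRIMs for freed blocks continuously
zpool set autotrim=on tank
zpool get autotrim tank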
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Just enable TRIM so the controller knows what's empty and can do its job.
Mmmh, I prefer either trimming manually or setting a cron job/sysctl for periodic trims. To this day I haven't been able to work out how often the auto trim option actually trims the drives.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
autotrim tracks freed blocks and trims them on a thread, continuously. It's a little bit clever at trying to aggregate ranges.

Check out the perf chart here. Very meaningful if you happen to make a lot of Linux kernel tree copies!

The code comments: https://github.com/openzfs/zfs/blob/master/module/zfs/vdev_trim.c

Code:
 * 1) Automatic TRIM happens continuously in the background and operates
 *    solely on recently freed blocks (ms_trim not ms_allocatable).


You still need to run the occasional manual trim - autotrim can miss tiny ranges, and autotrim is paused during rebuilds/resilvers.
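
Something like this (pool name assumed), by hand or from a periodic job:
Code:
# Start a manual TRIM of every vdev that supports it
zpool trim tank
# Watch per-vdev TRIM progress
zpool status -t tank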
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I'm not suggesting this as the only way. The TrueNAS-documented "pretend it's a small disk" way will certainly work.

But if your hardware doesn't hork itself over TRIM, and especially if you're a "partition my drives" kind of person, autotrim can be good stuff indeed.

I didn't realize OP's drives were SATA, but they're good drives so it's worth testing. If they were NVMe I'd be even more confident.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
As far as I understand auto TRIM decreases the drives' lifespan.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
As far as I understand auto TRIM decreases the drives' lifespan.

Why would that be?

TrueNAS links to this Seagate article, which is pro-TRIM and refers to that space as "Dynamic Overprovisioning".

The benchmark chart in the ZFS change for trim & autotrim showed that enabling autotrim increased performance significantly. I take that as a longevity benefit too - distributed writes are good for longevity, and the amplification from leveling is avoided.

I can imagine a pathological controller performing unnecessary write load for each TRIM, but is that a thing?
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
L2ARC and metadata duties are somewhat redundant. High wear from L2ARC could potentially affect the pool-critical metadata part. I would personally NEVER mix these.
I was referring back to what Volts had said earlier here:
The goofy partition and L2ARC hack that I do endorse is a metadata L2ARC, secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1
Are you @Etorix disagreeing with @Volts, or are you referring to a 'metadata' type that is different from that which @Volts describes above?

If hosting a live database, you may well need a SLOG, which requires a drive with PLP. Is this the case here?
Yes, the SAMSUNG PM893 SSDs have PLP, as mentioned.

for SOHO use if you have L2ARC you rarely need a metadata VDEV
Yeah...I see your point.

I didn't realize OP's drives were SATA, but they're good drives so it's worth testing.
They oughta be, I paid a pretty penny for them! :P

Logical writes will be distributed physically across empty cells.
What I believe some NAND storage manufacturers refer to as 'wear leveling'. Given the finite number of writes to each cell, they spread the writes across all cells to reduce the average number of writes per cell over the life of the drive. That is how I understand it.

built in binary and sold in decimal
We have all heard horror stories of multiple drive manufacturers advertising one spec and selling another...

auto TRIM decreases the drives' lifespan
I would believe that to be true for the same reason they implement wear leveling - finite writes per cell.

...however
enabling autotrim increased performance significantly. I take that as a longevity benefit too - distributed writes are good for longevity, and the amplification from leveling is avoided.
…maybe this is a more correct view?

Thanks again all of you for your thoughts. I will post my final configuration here once I have done it.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Are you @Etorix disagreeing with @Volts, or are you referring to a 'metadata' type that is different from that which @Volts describes above?
I read your drawing as pointing to mirrored devices partitioned to host both an L2ARC and a metadata vdev, which I would personally never do.
If you meant "persistent metadata L2ARC" as per @Volts ' recipe, you're fine, provided that you have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM). But having two drives for L2ARC is overkill, whether as stripe or as mirror.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Why would that be? [...] I can imagine a pathological controller performing unnecessary write load for each TRIM, but is that a thing?
Apparently yes.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I would never bet against broken hardware existing :cool: but I also don't take much away from that post. In it, regular TRIM is slow, ESXi is in the stack, and he says they TRIM fast in Linux.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I read your drawing as pointing to mirrored devices partitioned to host both an L2ARC and a metadata vdev, which I would personally never do.
If you meant "persistent metadata L2ARC" as per @Volts ' recipe, you're fine, provided that you have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM). But having two drives for L2ARC is overkill, whether as stripe or as mirror.

I'm not suggesting a data L2ARC. A metadata-only L2ARC (especially a persistent one, l2arc.rebuild_enabled) can be a big win at a much lower RAM cost.

It's an alternative to the special metadata vdev. It's easier to live with, especially in weird situations. Much easier to import a shelf of disks with both data and metadata, instead of also figuring out how to attach the metadata devices.
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
you meant "persistent metadata L2ARC" as per @Volts ' recipe
Yep, that is what I meant - thanks for clarifying.

have enough RAM to support the size of your L2ARC (5*RAM≤L2ARC≤10*RAM)
Converting the formula above to my system: 320GB ≤ L2ARC ≤ 640GB.

From "FreeBSD Mastery: Advanced ZFS" (Lucas/Jude):
- Page 132: "The L2ARC catches items that fall off of the ARC."
- Page 133: "As a general rule of thumb, each gigabyte of L2ARC requires about 25 MB of ARC...assume that one terabyte of L2ARC, fully utilized, will devour about 25 GB of ARC."

Per the book: 960 GB of L2ARC x 25 MB/GB = 24,000 MB ≈ 24 GB of ARC.

This doesn't quite align with "5*RAM≤L2ARC≤10*RAM". I do appreciate, however, that this is highly variable due to varying file size: smaller files = more files = more index entries in the ARC metadata, and larger files = the opposite.
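
For my own sanity, the same back-of-the-envelope math in shell form (using the book's 25 MB per GB figure, which is only a rule of thumb):
Code:
# Rough ARC header cost of an L2ARC at ~25 MB of ARC per GB of L2ARC
l2arc_gb=960
awk -v l2="$l2arc_gb" 'BEGIN { printf "%d GB of L2ARC -> ~%.0f GB of ARC used for headers\n", l2, l2 * 25 / 1000 }'
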
secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1
I understand from page 180 of "FreeBSD Mastery: Advanced ZFS" that the "metadata" cache property tells ZFS not to cache actual files, but only metadata. So my interpretation of your 'hack' @Volts is that your L2ARC only contains metadata but no actual cached files for fast retrieval.
Would this be correct?

I found this a good read: Klara Systems - OpenZFS: All about the cache vdev or L2ARC
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
This doesn't quite align with "5*RAM≤L2ARC≤10*RAM". I do appreciate, however, that this is highly variable due to varying file size. Smaller files = more files = more index entries in the ARC metadata, and Larger files = opposite.
The best performance is obtained with a RAM to L2ARC ratio of 1:4. It's possible, however, to push the ratio up to 1:10 with increasingly (and non-linearly) diminishing returns: going beyond that would be a futile exercise, and most of us would frown upon anything greater than a 1:6 ratio.

This is, at the very least, the forum's consensus. Stick with 1:4.

I understand from page 180 of "FreeBSD Mastery: Advanced ZFS" that the "metadata" cache property tells ZFS not to cache actual files, but only metadata. So my interpretation of your 'hack' @Volts is that your L2ARC only contains metadata but no actual cached files for fast retrieval.
Would this be correct?
The L2ARC usually holds both data and metadata, but it can be set to accept only metadata (referred to as metadata-only) and it can also be made persistent across reboots.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
the "metadata" cache property tells ZFS not to cache actual files, but only metadata

Right. I don't think I'm adding anything new at this point - just responding with keywords for your searching pleasure.

Default is primarycache=all; secondarycache=all. ARC is used for both data and metadata - it's a rare system where that isn't optimal. By default if L2ARC is configured, it's used for both.

What I'm describing is leaving primarycache=all, and changing secondarycache=metadata. To have it persist across reboots, see the sysctl tunable vfs.zfs.l2arc.rebuild_enabled=1.
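
A minimal sketch of that recipe (pool and device names assumed; the Linux module parameter is the SCALE-side equivalent of the FreeBSD sysctl):
Code:
# Add the cache device (partition label is a placeholder)
zpool add tank cache gptid/l2arc-part
# ARC keeps caching everything; the L2ARC only accepts metadata
zfs set secondarycache=metadata tank
# Let the L2ARC contents survive a reboot
sysctl vfs.zfs.l2arc.rebuild_enabled=1                      # CORE (FreeBSD)
echo 1 > /sys/module/zfs/parameters/l2arc_rebuild_enabled   # SCALE (Linux)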
 