SSD partitions for L2ARC, SLOG, Metadata and Boot

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
Hey @naskit! I have been down that path.
Same thinking, same reasons as you (13 SATA ports), and so on.
The slightly frustrating result is that there seems to be no technical reason why partitions should not work; it is just a support thing.
This is more a missing feature than a technological barrier :smile:
You could even open a feature request, but given the enterprise focus of TrueNAS I would not expect iXsystems or OpenZFS to change this.
Thanks for your tuppence @Jamberry. I just looked at my drive list again after creating the data pool, and also checked what the books say (Jude/Lucas - FreeBSD Mastery: ZFS / Advanced ZFS), and lo and behold, when you create even just a basic pool, TrueNAS actually creates two partitions on every drive anyway!

I did not create these partitions, TrueNAS did when I created the pools:
Code:
root@truenas[~]# ll /dev/ada*
crw-r-----  1 root  operator  -  0x8a Dec 17 16:06 /dev/ada0
crw-r-----  1 root  operator  -  0x9b Dec 17 16:07 /dev/ada0p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x164 Dec 17 16:07 /dev/ada0p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x92 Dec 17 16:06 /dev/ada1
crw-r-----  1 root  operator  -  0xe6 Dec 17 16:06 /dev/ada10
crw-r-----  1 root  operator  -  0xc2 Dec 17 16:07 /dev/ada10p1  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15a Dec 17 16:07 /dev/ada10p2  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x10a Dec 17 16:06 /dev/ada11
crw-r-----  1 root  operator  -  0x9d Dec 17 16:07 /dev/ada11p1  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15e Dec 17 16:07 /dev/ada11p2  <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x8f Dec 17 16:07 /dev/ada1p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x14a Dec 17 16:07 /dev/ada1p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x8c Dec 17 16:06 /dev/ada2
crw-r-----  1 root  operator  -  0xa1 Dec 17 16:07 /dev/ada2p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15c Dec 17 16:07 /dev/ada2p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x94 Dec 17 16:06 /dev/ada3
crw-r-----  1 root  operator  - 0x14e Dec 17 16:07 /dev/ada3p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x17e Dec 17 16:07 /dev/ada3p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x96 Dec 17 16:05 /dev/ada4
crw-r-----  1 root  operator  - 0x136 Dec 17 16:05 /dev/ada4p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x13d Dec 17 16:05 /dev/ada4p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0x9e Dec 17 16:16 /dev/ada5
crw-r-----  1 root  operator  -  0xd0 Dec 17 16:16 /dev/ada5p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xa4 Dec  7 22:34 /dev/ada6
crw-r-----  1 root  operator  -  0xee Dec  7 11:33 /dev/ada6p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xf0 Dec  7 11:33 /dev/ada6p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xf2 Dec  7 11:33 /dev/ada6p3   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xb2 Dec  7 22:34 /dev/ada7
crw-r-----  1 root  operator  -  0xf4 Dec  7 11:33 /dev/ada7p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xf6 Dec  7 11:33 /dev/ada7p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xf8 Dec  7 11:33 /dev/ada7p3   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xe2 Dec 17 16:06 /dev/ada8
crw-r-----  1 root  operator  - 0x148 Dec 17 16:07 /dev/ada8p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x178 Dec 17 16:07 /dev/ada8p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  -  0xe4 Dec 17 16:06 /dev/ada9
crw-r-----  1 root  operator  -  0xca Dec 17 16:07 /dev/ada9p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x162 Dec 17 16:07 /dev/ada9p2   <-- TrueNAS, not me
root@truenas[~]#
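
For what it's worth, I believe the partition layout itself can also be inspected with gpart; this is just a sketch of what I would run (I expect p1 to be the small swap partition TrueNAS adds to data disks by default and p2 the ZFS data partition, but I haven't verified every drive):
Code:
root@truenas[~]# gpart show -l ada0
root@truenas[~]# gpart show -l ada6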

This is only a temporary config. I still need to burn in the two 480GB SSDs (which currently form the boot-pool).
I will probably just re-install TrueNAS again after that.
Code:
root@truenas[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Fri Dec 15 22:45:05 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada6p2  ONLINE       0     0     0  <-- (zfs::vdev)
            ada7p2  ONLINE       0     0     0  <-- (zfs::vdev)

errors: No known data errors

  pool: pool-boot  <-- (interim pool)
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        pool-boot                                     ONLINE       0     0     0
          gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada4p2 (zfs::vdev)

errors: No known data errors

  pool: pool-data
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool-data                                       ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada0p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada1p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada2p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada3p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada8p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada9p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada10p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada11p2 (zfs::vdev)
        cache
          gptid/631b0e86-9c9b-11ee-bf37-d05099dcdabb    ONLINE       0     0     0  <-- ada5p1 (zfs::vdev)

errors: No known data errors
root@truenas[~]#


From "FreeBSD Mastery: ZFS" (Michael Lucas/Allan Jude) page 24-25:
Many of the original Solaris ZFS administration guides recommended against using partitions...for performance reasons. In Solaris, using a partition for a filesystem disables the write cache. In FreeBSD, disabling the write cache is completely separate from disk partitioning or filesystems. FreeBSD gives full performance when using ZFS on a partition...The disadvantage to using partitions is that you might lose some portability that ZFS provides. If you move disks from one system to another, the target system must be able to recognize the disk partitions
Thanks for chiming in, too, @morganL. It is nice to see iX has an interest in this topic.
Thanks for the ideas and write-up. I think the logic is right, and it is something we (iX) have as a potential future roadmap item. Would love to have a real system as validation...Boot, SLOG, L2ARC as partitions make sense. I'm a little more skeptical that metadata vdevs should be added as partitions (especially as a pioneer) since that is hardest to recover. A larger persistent L2ARC reduces some of the need for a metadata vdev.
Thanks for your vote of confidence in the logic. Re: using a partition for metadata, that was only because of the lack of SSDs and SATA ports...it's all down to making the most out of my limited HW (yes, it's a SOHO system). It's not because I think metadata 'should' be on a partition rather than a disk. I see what you mean about a larger L2ARC alleviating some of the need for a metadata vdev.
I'd generally recommend 2-way mirrors and potentially a spare. It's less work for the system and a better-tested code path. It's not clear that the SLOG has to be mirrored... since there is a copy of the SLOG data in DRAM.
Hmmm...yeah, I suppose it is. It just means more work when resilvering post-failure. I am guessing that what is written to the SLOG from RAM is also written to the pool from RAM, rather than copied from the SLOG to the pool, thus making the SLOG nothing more than a copy 'just in case'?

Thanks @Arwen.
Going way too large simply means you are using RAM for L2ARC pointers and not ARC (aka the first-level read cache).
Right...that computes in my mind...with my original partition proposal the L2ARC would have only been 480GB (2-way partition mirror), but the second proposal uses a whole disk.

Thanks @Davvo.
Regarding metadata VDEVs, you want to have the same level of redundancy as the rest of your pool in order not to create a weak point in the system...​
Hence the original 4-way partition mirror proposal. I am fully aware that a metadata vdev is pool-critical.

Thanks @Etorix.
Which "obvious reason"? I indeed prefer to have new HDDs, but have no qualms about second-hand enterprise-grade SSDs, second-hand Optane (SLOG/L2ARC) and second-hand anything for a boot drive.
My research into NAND-flash-based drives (SSD/NVMe) indicates NAND cells have a finite write limit. I have no idea how many times a second-hand SSD has been written to, so I have no idea how long it would last. But you go for it :)
there's no "loading into L2ARC" at boot: It starts empty, and gets filled as time goes—or, if persistent, it's already filled.
L2ARC management uses RAM, so L2ARC can be detrimental: Too much L2ARC can degrade performance by evicting ARC.

...special vdevs are potentially risky because they are pool-critical.
If mostly read benefits (fast browsing) are sought, then a persistent L2ARC for metadata can safely provide the same result with a single drive...

Different uses have different requirements.
  • ...
  • L2ARC requires low read latency and endurance; PLP not needed, redundancy not needed.
  • Special vdev requires redundancy, redundancy and redundancy.
  • Boot has no special requirement at all.
...I don't think you have a strong use case for a SLOG.
Good points.
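(As an aside, I assume I can keep an eye on how much RAM the L2ARC headers are actually eating - versus the ARC itself - via the arcstats sysctls; something like this is my guess:)
Code:
root@truenas[~]# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.l2_hdr_size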

Thanks @Volts.
Change the data drive layout from Z3 to Stripe+Mirror. Much better actual performance w/ 8 drives.
I considered that, but the problem then is that whilst I could lose any one drive first, there is a risk that the second drive I lose could render the pool lost if it happens to be in the same VDEV. That is why I decided RAIDZ3 would be more robust. Mathematically, I can lose ANY 3 drives in ANY order before I lose the pool. In a 4x 2-way mirror layout, after losing the first drive, I am mathematically down to one SPOF - the remaining disk in the degraded mirror. True, I could still lose 3 other drives, but given that once a VDEV has been added to a pool it cannot be removed, I don't like the odds in that scenario. Yes, it will perform more slowly...but I don't think that will matter much to me. I would rather keep my data...I have gone to the trouble and expense of buying 8 disks for it...
Don't use slog unless you have critical sync writes. Backups don't need sync. And sync=disabled is faster still.
Don't use a meta vdev with a complex partitioning scheme because it's critical for the data. If you lose the meta vdev you lose the data.
Don't try to use a big data L2ARC, probably, unless you have huge amounts of RAM. It's not worth sacrificing primary ARC.

The goofy partition and L2ARC hack that I do endorse is a metadata L2ARC, secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1. It sure helps keep things snappy like a meta vdev, and especially after a reboot. But it can be disconnected without impact.
All good points. Your last point is interesting and I might go with that - a compromise between performance and resilience.
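If I understand the recipe right, for my pool it would boil down to something like this (my own sketch - please shout if the property or tunable names are off), plus adding the sysctl as a tunable in the GUI so it survives reboots:
Code:
root@truenas[~]# zfs set secondarycache=metadata pool-data
root@truenas[~]# sysctl vfs.zfs.l2arc.rebuild_enabled=1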
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I am guessing that what is written to the SLOG from RAM is also written to the pool from RAM, rather than copied from the SLOG to the pool, thus making the SLOG nothing more than a copy 'just in case'?
No guessing here: That's exactly how SLOG works—it is a "LOG", not a "CACHE". SLOG is never ever read back in normal operation.

My research into NAND-flash-based drives (SSD/NVMe) indicates NAND cells have a finite write limit. I have no idea how many times a second-hand SSD has been written to, so I have no idea how long it would last. But you go for it :)
There is a write limit, but it is standard practice for sellers of this class of hardware to indicate the write load and/or drive health report from SMART—and if not, it should be standard practice for the prospective buyer to ask for it. Then you set your threshold… ;-)
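(For reference, the relevant wear figures should also be visible locally, e.g. via smartctl - the exact attribute names vary by vendor, so this grep is only a rough sketch:)
Code:
root@truenas[~]# smartctl -a /dev/ada6 | grep -i -E 'wear|percent|written'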

All good points. Your last point is interesting and I might go with that - a compromise between performance and resilience.
Also known as a "persistent metadata L2ARC". @Volts spelled out the recipe for you.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
That is why I decided RAIDZ3 would be more robust

Yeah but with mirror-and-stripe you can lose FOUR drives, which is one louder. Oh what? Where's the Spinal Tap 2023 audition then, please? Thanks, sorry.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Yeah but with mirror-and-stripe you can lose FOUR drives, which is one louder.
Not really. You can lose two drives in a mirror and enjoy total data loss. Z3 allows up to three drives to go down before not enough parity data is left over to rebuild with.

Now as to the probability of which combination of drives will more likely fail in a given VDEV, that’s beyond my appetite ATM to compute.

But I’m happy with my Z3, 8-disk VDEV. It may not break any sound barriers re performance but for SOHO, it’s fast enough (and much faster than the cloud).
 

Volts

Patron
Joined
May 3, 2021
Messages
210
Sorry, my sense of humour is off or something. I hoped the Spinal Tap reference would make it clear it was a joke.

With 8 drives mirrored, assuming one has already failed, the chance of surviving a second failure is 6/7, 85%.
If you get lucky, the chance of surviving a third failure is 4/6. Total (6/7) * (4/6) = 57%.
If you get even luckier, the chance of surviving a fourth failure is 2/5. Total (6/7) * (4/6) * (2/5) = 22%.
The fifth failure is 0/4, so you're in trouble.

At least that's a curve an actuary could love!

1 failure: 100% survival
2 failures: 85%
3 failures: 57%
4 failures: 22%
5 failures: 0%

But with Z3 it's totally unpredictable (a joke):

1 failure: 100%
2 failures: 100%
3 failures: 100%
4 failures: 0%

But you still might compare performance. It can be surprising how much faster a pile of mirrors is, especially on mixed read/write/random/sequential workloads. All those vdevs make light the work.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
With 8 drives mirrored
Since it can get pretty confusing for those not familiar with the terminology, in this case he means 8 drives in 4 VDEVs of 2-way mirrors each. It's possible to make a single-vdev 8-way mirror, but hardly practical.

 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
Sorry, my sense of humour is off or something. I hoped the Spinal Tap reference would make it clear it was a joke.

With 8 drives mirrored, assuming one has already failed, the chance of surviving a second failure is 6/7, 85%.
If you get lucky, the chance of surviving a third failure is 4/6. Total (6/7) * (4/6) = 57%.
If you get even luckier, the chance of surviving a fourth failure is 2/5. Total (6/7) * (4/6) * (2/5) = 22%.
The fifth failure is 0/4, so you're in trouble.

At least that's a curve an actuary could love!

1 failure: 100% survival
2 failures: 85%
3 failures: 57%
4 failures: 22%
5 failures: 0%

But with Z3 it's totally unpredictable (a joke):

1 failure: 100%
2 failures: 100%
3 failures: 100%
4 failures: 0%

But you still might compare performance. It can be surprising how much faster a pile of mirrors is, especially on mixed read/write/random/sequential workloads. All those vdevs make light the work.
I appreciate the Spinal Tap reference :)
I watched it again literally a few weeks ago.
A pertinent quote for this forum:
Spinal Tap philosophy :grin:

Yes, the MTBF maths is interesting.
Stripe+Mirror IOPS are of course higher, but looking at it from a point-of-failure perspective, RAIDZ3 is more tolerant than 4x 2-way mirrors, since with the mirrors you are down to a single point of failure after the first drive loss. There is no knowing whether that drive will go next or a drive from another mirror. Surviving 4 failures without data loss is the best case, but I doubt it is the most probable case.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
I've seen people be really irrational about even RAIDZ2: "We don't need to worry about replacing that disk today, there's still redundancy." With RAIDZ3 I could imagine doing that myself. "Ehhh, it's an hour away. I'll do it next time I'm there."

I'm not arguing for a less-durable strategy, I just think the human element is interesting. We often want "the best" margin up front, but later we're willing to borrow casually from that margin. It's easy to lose sight of the availability goals & policy that were first implemented.

Do we have any statistics on the success rate of Remote Hands pulling a failed disk, and not a healthy one? (For this exercise, assume that you somehow, implausibly, communicated the correct slot.)

I wonder how Remote Hands' "pulled the right device" success rate compares with owner/operators'. I wonder if it's better.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I qualify cold spares and insert them as needed. Hot spares make sense for remote data centers / locations.
 

blademan

Dabbler
Joined
Jul 11, 2023
Messages
16
Super relevant conversation. As my signature shows, I'm configuring a server with use cases of: media storage for Plex, running Plex & *arr apps in containers, and SMB shares for home use. After watching this Level1Techs vid, I bought a pair of 118GB Optane P1600X NVMe drives that I thought to mirror and then carve into multiple mirrored partitions for: L2ARC, SLOG, and metadata. After reading this thread and these threads (1, 2, 3, 4, 5, 6, 7) and this post by @Davvo, I'm now planning not to partition the Optanes and not to mirror them: set up one Optane for L2ARC and the other for SLOG, as there would be some benefit with little risk. Thoughts? Is there a point to using the 118GB Optane for L2ARC if the system has 384GB of memory?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
SLOG contains critical data which might be lost in case of drive failure, so there is merit in mirroring it.
L2ARC has no reason to be mirrored, and an Optane might be overkill for it.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
Well ... just to clarify:

  1. During regular operation a complete SLOG loss (e.g. a dead SLOG drive) means that log data gets written to the pool (again). Nothing lost.
  2. If your system goes down (an unclean shutdown due to kernel panic, power loss, whatever) and on reboot the formerly existing SLOG can't be found (e.g. due to a dead SLOG drive), then there is no transaction log to replay. Nothing more. You lose that data, but the pool remains consistent. (And if not, then something has happened that a working SLOG wouldn't have prevented.)
I won't say there is no risk of losing something. But - while calculating the odds - mirroring SLOG drives for availability reasons seems a bit of overkill _to me_. (Bandwidth and I/O throughput would be better reasons to do it ... again, _for me_.)
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
You need to make sure the endurance is there and that there are enough spare cells. You may want to "under provision" partition capacity... some people call it "overprovision".
On the topic of 'over-provisioning', does TrueNAS have knowledge of and track 'spare cells' on the SSD, or is this only visible to and managed by the SSD's drive controller? (What I have read in SSD technical data sheets and specifications so far leads me to believe only the drive controller 'sees' or 'knows' about spare cells, and only the drive controller can manage them.) It is my (perhaps naïve?) belief that the main reason why enterprise SSDs are advertised with a lower capacity is *precisely because* a chunk of cells is carved out and hidden away for the express purpose of replacing bad cells whenever they are encountered and the SMART mechanisms detect that those cells are no longer reliable.

What (if any) commands are available to the user for said 'over-provisioning' configurations?
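My naïve guess is that, at the partition level, 'over-provisioning' simply means deliberately leaving a slice of the drive unpartitioned so the controller has extra spare area, i.e. something like the below on a hypothetical 480GB SSD (adaX and the sizes are just placeholders) - but please correct me if TrueNAS offers something more formal:
Code:
root@truenas[~]# gpart create -s gpt adaX
root@truenas[~]# gpart add -t freebsd-zfs -a 1m -s 400G -l l2arc0 adaX    # leaves ~80GB unpartitioned as spare area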
 

naskit

Dabbler
Joined
Apr 19, 2021
Messages
20
[Attached image: 1702774566077.png]


I would love to get any thoughts on this from those who are more experienced in this.
Ok, picking this up again, I bought another SSD - 240GB - to plug into the 13th SATA port on the Mobo and use as the boot drive, so this is now what I have:

[Attached image: 1705565396942.png]


I do have a machine with potentially large PostgreSQL DBs, so I am thinking that it now may be worth hosting a redundant DB on the NAS and using it as a bona-fide DB server in addition to storage for file/system backup and disk imaging.

I might use this for media streaming, but I think I will use my other (QNAP) NAS predominantly for that, and I will just backup media files here (not real-time transcoding or streaming).

Given this new info, if you guys were handed this, would you configure pools as per my new suggestion above, or do something different?
Note that I have 64GB of ECC RAM.

(I did take your advice on having a single and cheap(-ish) SSD for boot).
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
If hosting a live database, you may well need a SLOG, which requires a drive with PLP. Is that the case here? Mirroring the SLOG is optional.

L2ARC and metadata duties are somewhat redundant. High wear from L2ARC could potentially affect the pool-critical metadata part. I would personally NEVER mix these.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Agree with Etorix: for SOHO use, if you have an L2ARC you rarely need a metadata VDEV.
 
Top