@Jamberry wrote:
Hey @naskit! I have been down that path. Same thinking, same reasons as you (13 SATA), and so on. The slightly frustrating result is that there seems to be no technical reason why partitions should not work; it is just a support thing. It is more a missing feature request than a technological barrier. You could even open a feature request, but unfortunately, given the enterprise nature of TrueNAS, I would not expect iXsystems or OpenZFS to change this.

Thanks for your tuppence @Jamberry. I just looked at my drive list again after creating the data pool, and have also checked what the books say (Jude/Lucas, FreeBSD Mastery: ZFS / Advanced ZFS), and lo and behold, when you create even just a basic pool, TrueNAS actually creates two partitions on every drive anyway!
I did not create these partitions, TrueNAS did when I created the pools:
Code:
root@truenas[~]# ll /dev/ada*
crw-r-----  1 root  operator  - 0x8a   Dec 17 16:06 /dev/ada0
crw-r-----  1 root  operator  - 0x9b   Dec 17 16:07 /dev/ada0p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x164  Dec 17 16:07 /dev/ada0p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x92   Dec 17 16:06 /dev/ada1
crw-r-----  1 root  operator  - 0xe6   Dec 17 16:06 /dev/ada10
crw-r-----  1 root  operator  - 0xc2   Dec 17 16:07 /dev/ada10p1  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15a  Dec 17 16:07 /dev/ada10p2  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x10a  Dec 17 16:06 /dev/ada11
crw-r-----  1 root  operator  - 0x9d   Dec 17 16:07 /dev/ada11p1  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15e  Dec 17 16:07 /dev/ada11p2  <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x8f   Dec 17 16:07 /dev/ada1p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x14a  Dec 17 16:07 /dev/ada1p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x8c   Dec 17 16:06 /dev/ada2
crw-r-----  1 root  operator  - 0xa1   Dec 17 16:07 /dev/ada2p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x15c  Dec 17 16:07 /dev/ada2p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x94   Dec 17 16:06 /dev/ada3
crw-r-----  1 root  operator  - 0x14e  Dec 17 16:07 /dev/ada3p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x17e  Dec 17 16:07 /dev/ada3p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x96   Dec 17 16:05 /dev/ada4
crw-r-----  1 root  operator  - 0x136  Dec 17 16:05 /dev/ada4p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x13d  Dec 17 16:05 /dev/ada4p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x9e   Dec 17 16:16 /dev/ada5
crw-r-----  1 root  operator  - 0xd0   Dec 17 16:16 /dev/ada5p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xa4   Dec  7 22:34 /dev/ada6
crw-r-----  1 root  operator  - 0xee   Dec  7 11:33 /dev/ada6p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xf0   Dec  7 11:33 /dev/ada6p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xf2   Dec  7 11:33 /dev/ada6p3   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xb2   Dec  7 22:34 /dev/ada7
crw-r-----  1 root  operator  - 0xf4   Dec  7 11:33 /dev/ada7p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xf6   Dec  7 11:33 /dev/ada7p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xf8   Dec  7 11:33 /dev/ada7p3   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xe2   Dec 17 16:06 /dev/ada8
crw-r-----  1 root  operator  - 0x148  Dec 17 16:07 /dev/ada8p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x178  Dec 17 16:07 /dev/ada8p2   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0xe4   Dec 17 16:06 /dev/ada9
crw-r-----  1 root  operator  - 0xca   Dec 17 16:07 /dev/ada9p1   <-- TrueNAS, not me
crw-r-----  1 root  operator  - 0x162  Dec 17 16:07 /dev/ada9p2   <-- TrueNAS, not me
root@truenas[~]#
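For anyone wondering what those two partitions actually are, the layout is easy to inspect (ada0 here is just one of my data disks; on my drives p1 is a small swap partition and p2 is the freebsd-zfs partition):
Code:
root@truenas[~]# gpart show ada0                   # partition layout TrueNAS created on this disk
root@truenas[~]# gpart list ada0 | grep rawuuid    # the gptid labels that show up in zpool status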
This is only a temporary config, as the zpool status output below shows. I still need to burn in the two 480GB SSDs (which currently form the boot-pool).
I will probably just re-install TrueNAS again after that.
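For the burn-in itself I am only planning something simple, along these lines (smartctl is already on TrueNAS; ada6/ada7 are my two boot SSDs, adjust to taste):
Code:
root@truenas[~]# smartctl -t long /dev/ada6    # start a long SMART self-test (non-destructive)
root@truenas[~]# smartctl -t long /dev/ada7
root@truenas[~]# smartctl -a /dev/ada6         # check the result once the estimated test duration has passed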
Code:
root@truenas[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Fri Dec 15 22:45:05 2023
config:

        NAME          STATE     READ WRITE CKSUM
        boot-pool     ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada6p2    ONLINE       0     0     0  <-- (zfs::vdev)
            ada7p2    ONLINE       0     0     0  <-- (zfs::vdev)

errors: No known data errors

  pool: pool-boot  <-- (interim pool)
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        pool-boot                                     ONLINE       0     0     0
          gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada4p2 (zfs::vdev)

errors: No known data errors

  pool: pool-data
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool-data                                       ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada0p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada1p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada2p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada3p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada8p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada9p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada10p2 (zfs::vdev)
            gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0  <-- ada11p2 (zfs::vdev)
        cache
          gptid/631b0e86-9c9b-11ee-bf37-d05099dcdabb    ONLINE       0     0     0  <-- ada5p1 (zfs::vdev)

errors: No known data errors
root@truenas[~]#
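The adaXpY annotations next to the gptid entries are mine; the mapping can be read straight off glabel, e.g.:
Code:
root@truenas[~]# glabel status | grep gptid    # maps each gptid/... label to its adaXpY partition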
From "FreeBSD Mastery: ZFS" (Michael Lucas/Allan Jude) page 24-25:
Many of the original Solaris ZFS administration guides recommended against using partitions...for performance reasons. In Solaris, using a partition for a filesystem disables the write cache. In FreeBSD, disabling the write cache is completely separate from disk partitioning or filesystems. FreeBSD gives full performance when using ZFS on a partition...The disadvantage to using partitions is that you might lose some portability that ZFS provides. If you move disks from one system to another, the target system must be able to recognize the disk partitions.
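As an aside, the book's point is easy to confirm on FreeBSD, since the ada write cache is a CAM setting that has nothing to do with partitioning:
Code:
root@truenas[~]# sysctl kern.cam.ada.write_cache                   # 1 = write caching enabled, regardless of partitions
root@truenas[~]# camcontrol identify ada0 | grep -i "write cache"  # per-drive view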
@morganL wrote:
Thanks for the ideas and write-up. I think the logic is right and it is something we (iX) have as a potential future roadmap item. Would love to have a real system as validation...Boot, SLOG, L2ARC as partitions make sense. I'm a little more skeptical that metadata vdevs should be added as partitions (especially as a pioneer) since that is hardest to recover. A larger persistent L2ARC reduces some of the need for a metadata vdev.

Thanks for chiming in, too, @morganL. It is nice to see iX has an interest in this topic, and thank you for the vote of confidence in the logic. Re: using a partition for metadata, that was only because of the lack of SSDs and SATA ports; it's all down to making the most of my limited HW (yes, it's a SOHO system). It's not because I think metadata 'should' be on a partition rather than a disk. I see what you mean about a larger L2ARC alleviating some of the need for a metadata vdev.
Quote:
I'd generally recommend 2-way mirrors and potentially a spare. It's less work for the system and a better-tested code path. It's not clear that the SLOG has to be mirrored... since there is a copy of the SLOG data in DRAM.

Hmmm...yeah, I suppose it is. It just means more work when re-silvering post-failure. I am guessing that what is written to the SLOG from RAM is also written to the pool from RAM, rather than being copied from the SLOG to the pool, thus making the SLOG nothing more than a 'just-in-case' copy?
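For anyone else reading, my understanding (happy to be corrected) is exactly that: the SLOG is only ever read back after a crash, and during normal operation the pool is written from RAM at the next transaction group commit. A log vdev can also be added and removed at any time, roughly like this (partition names purely illustrative from my own listing):
Code:
root@truenas[~]# zpool add pool-data log mirror ada6p3 ada7p3   # hypothetical: add a mirrored SLOG
root@truenas[~]# zpool remove pool-data mirror-1                # remove it later, using the log vdev's name from 'zpool status'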
@Arwen wrote:
Going way too large simply means you are using RAM for L2ARC pointers and not ARC (aka the first-level read cache).

Thanks @Arwen. Right...that computes in my mind. With my original partition proposal the L2ARC would have only been 480GB (a 2-way partition mirror), but the second proposal uses a whole disk.
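Noted for my own monitoring: the RAM that the L2ARC headers consume shows up directly in the ARC stats, so over-sizing should be visible, e.g.:
Code:
root@truenas[~]# sysctl kstat.zfs.misc.arcstats.l2_hdr_size   # bytes of RAM used just to index the L2ARC
root@truenas[~]# sysctl kstat.zfs.misc.arcstats.l2_size       # current L2ARC size, for comparison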
@Davvo wrote:
Regarding metadata VDEVs, you want to have the same level of redundancy as the rest of your pool in order not to create a weak point in the system...

Thanks @Davvo. Hence the original 4-way partition mirror proposal. I am fully aware that a metadata vdev is pool-critical.
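If I ever do add one, it would be as a mirror matching the pool's fault tolerance, i.e. something along these lines (device names hypothetical; a 4-way mirror to match RAIDZ3's three-disk tolerance):
Code:
root@truenas[~]# zpool add pool-data special mirror da0 da1 da2 da3   # hypothetical 4-way mirrored special (metadata) vdev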
@Etorix wrote:
Which "obvious reason"? I indeed prefer to have new HDDs, but have no qualms about second-hand enterprise-grade SSDs, second-hand Optane (SLOG/L2ARC) and second-hand anything for a boot drive.

Thanks @Etorix. My research into NAND-flash-based drives (SSD/NVMe) indicates that NAND cells have a definite write limit. I have no idea how many times a second-hand SSD has been written to, so I have no idea how long it would last. But you go for it :)
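To be fair, the wear on a used drive is at least inspectable before trusting it; SMART usually reports total bytes written and a life/wear estimate (attribute names vary by vendor):
Code:
root@truenas[~]# smartctl -a /dev/ada5 | grep -i -E "wear|percent|written"   # vendor wear/life attributes, where exposed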
@Etorix also wrote:
There's no "loading into L2ARC" at boot: it starts empty, and gets filled as time goes—or, if persistent, it's already filled.
L2ARC management uses RAM, so L2ARC can be detrimental: too much L2ARC can degrade performance by evicting ARC.
...special vdevs are potentially risky because they are pool-critical.
If mostly read benefits (fast browsing) are sought, then a persistent L2ARC for metadata can safely provide the same result with a single drive...
Different uses have different requirements.
...I don't think you have a strong use case for a SLOG.
- ...
- L2ARC requires low read latency and endurance; PLP not needed, redundancy not needed.
- Special vdev requires redundancy, redundancy and redundancy.
- Boot has no special requirement at all.

Good points.
@Volts wrote:
Change the data drive layout from Z3 to Stripe+Mirror. Much better actual performance w/ 8 drives.

Thanks @Volts. I considered that, but the problem then is that whilst I could lose any one drive first, there is a risk that the second drive I lose could render the pool lost if it happens to be in the same vdev. That is why I decided RAIDZ3 would be more robust. Mathematically, I can lose ANY 3 drives in ANY order before I lose the pool. In a 4x 2-way mirror layout, after losing the first drive, I am mathematically down to one SPOF: the remaining disk in the degraded mirror. True, I could still lose 3 other drives, but given that once a vdev has been added to a pool it cannot be removed, I don't like the odds in that scenario. Yes, it will perform more slowly...but I don't think that will matter much to me. I would rather keep my data; I have gone to the trouble and expense of buying 8 disks for it.
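To put rough numbers on that reasoning (a back-of-the-envelope estimate, assuming independent failures and no resilver completing in between): with 4x 2-way mirrors, after the first disk dies, 1 of the 7 survivors is its partner, so a second random failure sinks the pool with probability 1/7 (~14%); if it misses, 2 of the remaining 6 disks are now critical, so a third failure sinks it with probability 2/6 (~33%). Overall, losing 3 random disks kills the striped-mirror pool about 1 - (6/7)x(4/6) = 43% of the time, whereas RAIDZ3 survives any 3 losses by design.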
@Volts also wrote:
Don't use SLOG unless you have critical sync writes. Backups don't need sync. And sync=disabled is faster still.
Don't use a meta vdev with a complex partitioning scheme, because it's critical for the data. If you lose the meta vdev, you lose the data.
Don't try to use a big data L2ARC, probably, unless you have huge amounts of RAM. It's not worth sacrificing primary ARC.
The goofy partition and L2ARC hack that I do endorse is a metadata L2ARC: secondarycache=metadata, vfs.zfs.l2arc.rebuild_enabled=1. It sure helps keep things snappy like a meta vdev, especially after a reboot. But it can be disconnected without impact.

All good points. Your last point is interesting and I might go with that - a compromise between performance and resilience.
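For my own notes, that last suggestion would look roughly like this with my current layout (ada5p1 is already the cache device; on TrueNAS the sysctl would be set as a tunable so it survives reboots):
Code:
root@truenas[~]# zfs set secondarycache=metadata pool-data   # cache only metadata on the existing L2ARC device
root@truenas[~]# sysctl vfs.zfs.l2arc.rebuild_enabled=1      # keep the L2ARC contents across reboots (persistent L2ARC)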