Highly Available Boot Pool Strategy

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
jgreco submitted a new resource:

Highly Available Boot Pool Strategy - When you really want the thing to boot.

I've pounded a few versions of this out over the years, but I hate explaining over and over.

Due to the design of the average PC BIOS, a boot device that has become corrupted but not failed entirely may prevent booting.

Some of us used to use "hardware" RAID 1 and IR mode controllers in the pre-ZFS days to make sure an SSD failure didn't impair boot. This is good but not perfect, since the "hardware" RAID controller cannot detect corruption. Additionally, this is no longer recommended...

 
Joined
Oct 22, 2019
Messages
3,641
This part I view differently.
Due to the design of the average PC BIOS, a boot device that has become corrupted but not failed entirely may prevent booting.

I'm not sure why that's a bad thing?

Wouldn't you want the boot process to be halted / interrupted / prevented if the configured (primary) boot drive has corruption or failure? This gives you a clear warning that something went wrong, and you can then go into the BIOS to change the boot order to use the other SSD in the meantime (until you address / replace the failing SSD.)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's a bad thing if you've got a twenty petabyte server. iXsystems quotes 20PB on a single head unit on this web page. My theory is that such a server is going to cost at least half a million dollars. It would be stupid for a filer that costs that much to fail to boot over a failed flash chip in a fifty dollar SSD.

There are many who use TrueNAS in roles more critical than a homelab or Plex media server setup. Can your Plex server be down for a day? Probably, maybe some negative WAF though. But if you're running VM storage for a hypervisor cluster, can you afford to have your entire business offline because storage is down?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
In that scenario, you're going to want/have redundancy at a higher level (eg: have ZFS pool failover between two controllers using HAST/CARP/export-import chaining) and as such the time it takes a single controller to come up (or need to be monitored/adjusted during that process) is less relevant.

With that said, the method of ZFS mirroring across hardware RAID devices is a solid way to prevent something like a corrupted or non-responsive boot device from hanging up a piece of the middleware. It's hugely overkill for most end-users, but if you need this it's not.

I'm content with regular mirrored boot in most scenarios. A rebooting storage system should, generally speaking, never be something left unattended.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I'm going to make use of an empty USB port on my motherboard to add a third boot drive to the pool. Not to boot from, but rather to have a fully-compatible backup to clone from in case one of the SATADOMs fails. I already have a qualified SATADOM in cold storage, but I doubt it will ever be needed. In a SOHO application where the hardware is readily accessible, this works well enough for me.

For those trying to figure out how to create a simple RAID1 boot pool via the GUI, see here for the TrueNAS GUI boot pool explainer. Boot pool disks can be included in SMART monitoring via the GUI as well. Just remember to modify/verify your SMART testing when you add new drives (such as additional redundant boot drives). Periodic scrubs of the boot drive pool appear to be automatic and set on a 35-day schedule.
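For anyone who prefers the command line over the GUI, the general shape of adding a third member to an existing boot mirror looks like the sketch below. This is a hedged example: the pool name `boot-pool` is the TrueNAS default, but the device names (`ada0p2`, `da0p2`) are placeholders for your actual boot partition and the new USB device's matching partition, and the GUI remains the supported way to do this on TrueNAS.

```shell
# Inspect the current boot pool layout first
# (assumes the TrueNAS default pool name "boot-pool")
zpool status boot-pool

# Attach a third device to the existing mirror, turning a 2-way
# mirror into a 3-way mirror. "ada0p2" is an existing mirror member
# and "da0p2" is the new device's partition -- both placeholders.
# The new device must be partitioned to match the existing members
# (including the EFI/boot partition) before attaching.
zpool attach boot-pool ada0p2 da0p2

# Let the resilver finish before trusting the new member
zpool status -v boot-pool
```

Note that `zpool attach` (as opposed to `zpool add`) is the key: it extends the mirror rather than striping a new top-level vdev into the pool, which would defeat the redundancy.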
 