Mirrored Boot Pool Failure


qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
I've been running my boot pool off of a pair of mirrored SATA DOMs, and today one of them started spewing CAM status errors. Eventually the entire system locked up, so I tried a reboot, only to have it report the same errors during startup; it never finished booting, just an endless stream of CAM status errors. Finally, I pulled the box apart and removed the offending DOM, and the secondary DOM boots just fine. So, obviously I have a bad boot device, but my question is, what good is a mirrored boot pool if it can't recover from a single device failure? I would literally have been better off with a cold spare, instead of putting all that wear and tear on a mirror that never kicked in when it should have. Has anybody else experienced this with a mirrored boot device failure?
 

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
I had a faulty boot mirror a couple of months ago.
One just died.
The other part of the mirror worked just fine. No reboot needed.

This was on FreeNAS 9.3.

Sent from iPhone using Tapatalk
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Wonder if it is a limitation of your motherboard, or a required BIOS setting/configuration?

Not sure though, since I don't use SATA DOMs, but I would think that they should work as designed...
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
[...] but my question is, what good is a mirrored boot pool if it can't recover from a single device failure?

I'm not surprised that there's a difference between checksum errors on one of the mirrored devices (which are handled by ZFS) on the one hand, and the CAM subsystem bailing out with status errors on the other.
 

philhu

Patron
Joined
May 17, 2016
Messages
258
It sounds like CAM errors are not hardware errors, so the boot BIOS did not see the bad drive as bad: it read the bootstrap successfully, assigned it for boot, and booted. When FreeNAS got control, the CAM errors appeared again.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
This is the same reason I am against mirrored boot drives hosted by FreeNAS rather than by a true RAID card.

Explanation: Your motherboard is set up to boot from a specific drive; for this purpose let's call it ada0, plugged into the SATA0 port. You have a second DOM plugged into the SATA1 port, which we will call ada1. If these were USB ports I'd call them da0 and da1; the actual letter assignments depend on the hardware, but you get the point.

So your motherboard is set up to look at SATA0 as the primary boot device and SATA1 as the secondary boot device.

Now you boot from SATA0, things start booting, and then the code crashes. That is where things stop working.

For the motherboard to boot from the SATA1 port, the device on the SATA0 port needs to be so dead that it cannot be recognized by the motherboard at all; only then will it try the SATA1 port. This is true for USB ports as well.

What does mirroring a boot device get you in FreeNAS? One thing only: the ability to remove the failed device, once you figure out which one it is, and boot your machine. There is nothing automatic about it.
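
Roughly, that manual step looks like this (a sketch only; freenas-boot is the default pool name on 9.3+, and ada0p2 is a placeholder for whatever zpool status reports as the failed member):

    # See which member of the boot mirror has failed
    zpool status freenas-boot

    # After pulling the dead device and booting from the survivor,
    # drop the missing member from the mirror
    # (ada0p2 is a placeholder; use the name status reported)
    zpool detach freenas-boot ada0p2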

Why do I not like it? Because it gives people the perception that a failed device is automatically overcome, but it isn't.

My favorite solution is to make a backup of your configuration file, like you should do anyway, and if the system craps out, then reinstall the FreeNAS software and restore the configuration file.
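
For reference, a rough sketch of grabbing that backup from the command line; /data/freenas-v1.db is the FreeNAS 9.x config database location, though the supported route is System -> General -> Save Config in the GUI, and the hostname below is illustrative:

    # Copy the configuration database off the box to a dated backup
    # (freenas.local is a placeholder for your server's hostname/IP)
    scp root@freenas.local:/data/freenas-v1.db ./freenas-config-$(date +%Y%m%d).db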

For my ESXi server I actually have mirrored boot devices, but they are attached to a true RAID card which will handle failures properly and will boot should a device failure occur. I have no idea if the RAID card will handle all types of failures, but that is what I'm given to understand. Now, I'm not promoting that people use a RAID card for FreeNAS; I'm promoting that they use a single boot device and back up the configuration file.
 

philhu

Patron
Joined
May 17, 2016
Messages
258
I actually did the RAID thing. My 2 SSD 32GB drives are plugged into the motherboard 'RAID' controller, with SATA0 and SATA1 set to RAID 1 (mirror). My data disks are JBOD behind an LSI IT-mode card seeing 24 disks.

It presents the boot mirror as 1 disk to FreeNAS, but the BIOS sees 2. Actually quite an elegant solution.

On my motherboard, an SM X8DTN+, turning on RAID for the 6 motherboard SATA ports means turning on AHCI, which means it cannot boot from USB. That's fine with me as a tradeoff, since I boot from SSD.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
It presents the boot mirror as 1 disk to FreeNAS, but the BIOS sees 2. Actually quite an elegant solution.
My only concern with that is it would prevent FreeNAS from getting SMART data from the OS drive(s). Also wondering about scrubs...
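
To illustrate the concern (device names and controller type are placeholders, not from this setup): with a directly attached drive, SMART is visible to the OS as usual, but behind a hardware RAID volume the OS only sees the logical disk, and you need controller-specific passthrough if the controller supports it at all:

    # Direct attach: SMART is visible to the OS
    smartctl -a /dev/ada0

    # Behind e.g. a MegaRAID volume, SMART needs passthrough syntax,
    # and many motherboard fake-RAID controllers offer no equivalent
    smartctl -a -d megaraid,0 /dev/mfid0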
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
My only concern with that is it would prevent FreeNAS from getting SMART data from the OS drive(s). Also wondering about scrubs...
I agree with this conclusion as well, but the odds of the SSDs going bad are so much lower than for other drive types that it's a risk, but a very small one. Scrubs I would expect to operate normally, since FreeNAS only sees one data drive. If one of the drives fails, the controller would need to notify the user of the failure; I'm not sure how that is done. Again, minimal risk. I'm still in favor of a single boot device for FreeNAS, preferably an SSD, and a backup of the configuration file.
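
For what it's worth, scrubbing the boot pool by hand is straightforward (sketch; pool name is the 9.3+ default):

    # Start a scrub of the boot pool, then check progress and any errors
    zpool scrub freenas-boot
    zpool status freenas-boot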
 

philhu

Patron
Joined
May 17, 2016
Messages
258
On my controller, scrubs worked fine on the boot RAID 1.

Interesting about the SSDs and smartctl. I am using Kingston 32GB SLC SSDs; before I went RAID for this, I did the FN mirror thing. smartctl showed SMART as enabled on these, but there were very few available counters. It did not even display temperature. I think it had 2-3 counters and "Status: PASSED", so I figured the drives would die before SMART told me anyway.

So the risk is very minimal, and it's a lot cleaner than the FN mirror solution for the boot device.
The SSDs I got are SLC devices, so I expect them to last a long time.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Understood. TBH, I have something similar in my ESXi (AiO) setup, but at a much cheaper level. As long as one keeps backups of their configurations (I do for both ESXi and FreeNAS), the risk is pretty minimal.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I still need to setup backups for my ESXi server configuration. I'll get around to it after I have a failure ;)
 

philhu

Patron
Joined
May 17, 2016
Messages
258
Just retired my ESXi server, a Dell 2950 box from 2011 using 1777 watts of power.
I can do fine with FN 9.10 and VirtualBox for the 2 VMs I still run, at 370 watts of power.
 

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
This is the same reason I am against mirrored boot drives hosted by FreeNAS rather than by a true RAID card.

Explanation: Your motherboard is set up to boot from a specific drive; for this purpose let's call it ada0, plugged into the SATA0 port. You have a second DOM plugged into the SATA1 port, which we will call ada1. If these were USB ports I'd call them da0 and da1; the actual letter assignments depend on the hardware, but you get the point.

So your motherboard is set up to look at SATA0 as the primary boot device and SATA1 as the secondary boot device.

Now you boot from SATA0, things start booting, and then the code crashes. That is where things stop working.

For the motherboard to boot from the SATA1 port, the device on the SATA0 port needs to be so dead that it cannot be recognized by the motherboard at all; only then will it try the SATA1 port. This is true for USB ports as well.

What I don't get is that GRUB worked, and it was well into the BSD boot process before the CAM errors started. At that point it should be running off of ZFS, and I would have thought the ZFS pool would have been able to recognize and recover from the failures, since it's active by then. If GRUB had failed, your explanation would make complete sense, but I would think the ZFS pool should have dropped the device, unless I'm misunderstanding at what point in the boot process the ZFS pool is actually being accessed, even though GRUB was loaded from the dead device.
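
(For context, "dropping the device" is something you can also force by hand once the pool is imported; a sketch, with ada0p2 standing in for the failing member:)

    # Tell ZFS to stop issuing I/O to a misbehaving mirror member
    zpool offline freenas-boot ada0p2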
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
This is the same reason I am against mirrored boot drives hosted by FreeNAS rather than by a true RAID card.

I understand your point of view, but I'd still recommend it for most of our users.

How many times a week do we have folks show up where their flash drive died and they never bothered to back up their configuration? If they can fall back to another flash drive in the mirror, I'd call it a win!
 

philhu

Patron
Joined
May 17, 2016
Messages
258
The CAM errors might have occurred during the zpool import of the boot volume. It wouldn't know to 'start over' that far along.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
what good is a mirrored boot pool if it can't recover from a single device failure?
I would argue that you did recover. All you had to do was remove the offending device and you were back up and running right away. No need to reinstall (download the ISO, dd it onto one device, install to a second) and restore a saved config. Not that this is a big deal, but still.
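
(For comparison, that reinstall path is roughly the following; the image file name is illustrative, and make absolutely sure da0 really is your target stick before running dd:)

    # Write the installer to a flash drive, boot from it,
    # then install to the second device
    dd if=FreeNAS-9.10-RELEASE.iso of=/dev/da0 bs=64k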

Now, if what you're looking for is automatic failover, that's going to require support from your mobo.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
How many times a week do we have folks show up where their flash drive died and they never bothered to back up their configuration?
It's okay that we disagree; it wouldn't be a normal day in my house if we all agreed on everything. So most people here will and do recommend a mirrored boot device, while I won't, not without explaining what people should realistically expect from it.

While I agree that the recovery process can be very quick with the mirrored device, and hopefully painless, it's still no replacement for good operating practices like backing up the configuration file.
 

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
Yes, I recovered, after 12 hours of downtime, pulling the box down, disassembling it, then hauling it back up again. A reinstall on a cold spare would have been barely any more work or downtime.

Considering that ZFS mirroring was such a huge selling point of switching away from the old read-only boot devices, it has turned out to be a really poor tradeoff: before ZFS boot disks I never had a single boot device failure, but ZFS has trashed 6 boot devices in a single year. I ran a mirrored pool in order to avoid exactly that, but honestly, I'd be better off with the older pre-9.2 boot disks. What benefit do ZFS boot disks give me to make up for chewing through my devices like a pitbull puppy? I think the choice to change how the boot devices were configured in 9.2+ was a poor one.

Sent from my m8wl using Tapatalk
 