Is this pool failure mode (following a temporary resolved issue without disk damage) possible?

Status
Not open for further replies.

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
There's a thread from a while back asking what happens if something temporarily renders multiple HDDs in a pool unavailable (but undamaged). Examples might be a non-redundant HBA/backplane failure, a pulled cable on federated or linked enclosures/pods/backends, some drives on a second PSU (non-shared rails) which dies, and so on.

According to the replies, it's not a problem: there's a difference between disk unavailability and disk failure, and if disks become unavailable such that the pool no longer has an accessible copy of all data, but the data itself is intact, then the pool will recover when the disks become readable again (after HBA replacement/repair/reconnection). So far so good.

My question looks at that a bit more closely: is there, in theory, a different failure mode that can follow from that scenario (HBA loss, for example), whereby a pool with mirrored vdevs continues in a degraded state, but desyncing between the connected and disconnected drives causes loss of the entire pool?

Scenario:
  1. Suppose the pool is made of drive 1a/b/c, 2a/b/c and 3a/b/c (123=vdevs and abc=mirrored HDDs).
  2. An HBA, or a port on an HBA, becomes faulty and takes 1b/c and 3a offline (but unharmed).
    (Alternatively, to show it doesn't have to be just HBAs, suppose the PSU is an ordinary, good quality, multiple rail design and one rail stays tripped due to a fault, taking some HDDs offline but not the baseboard and other HDDs)
  3. If at this point the pool halts, no harm is done. That's the scenario in the other thread, but it doesn't always happen. With enough redundancy, each vdev survives, so the pool is only degraded, not halted. The sysadmin isn't on site, or only sees the alert email later. As a result, normal file activity continues with no sign of any issue given to client users of the NAS, until, under the stress, 1a dies a while later, and only then does pool activity halt.
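For anyone who wants to see this sequence play out without risking real disks, it can be reproduced on a ZFS test box with throwaway file-backed vdevs (a sketch only, simplified to 2-way mirrors; pool and file names are invented, it needs root, and `zinject` is the OpenZFS fault-injection tool):

```shell
# Sketch: reproduce the scenario with throwaway file-backed vdevs.
# Names are invented; requires root on a ZFS system. Do NOT run on real pools.
truncate -s 256M /tmp/d1a /tmp/d1b /tmp/d2a /tmp/d2b /tmp/d3a /tmp/d3b
zpool create testpool mirror /tmp/d1a /tmp/d1b \
                      mirror /tmp/d2a /tmp/d2b \
                      mirror /tmp/d3a /tmp/d3b

# Step 2: the "HBA failure" takes one side of two mirrors offline.
zpool offline testpool /tmp/d1b
zpool offline testpool /tmp/d3a
zpool status testpool            # DEGRADED, but still serving I/O

# Step 3: normal activity continues on the surviving halves,
# so the offline disks fall further and further behind.
dd if=/dev/urandom of=/testpool/newdata bs=1M count=50

# Later, the last live member of vdev1 "dies" (fault injection, since
# zpool offline refuses to remove the last replica of a vdev).
zinject -d /tmp/d1a -A fault testpool
zpool status testpool            # vdev1 unavailable; pool I/O suspends
```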
Result:

When the HBA is fixed and the disks become accessible again, and the NAS reboots, the pool will be massively desynced. (It wouldn't take much activity between steps 2 and 3 above to do that.) vdev1 (1b/c) will contain what it held before the original HBA failure; vdev2 (2a/b/c) will contain the current data up to the point 1a failed; and vdev3 will have one drive (3a) in one state and two drives (3b/c) in the other. The ZIL is no help in identifying what changed, because TXGs going back to the original HBA failure are no longer on it - they were ditched long ago as the ZIL rolled around.
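The desync is visible on-disk: every vdev member carries ZFS labels recording the last transaction group (txg) it participated in, and `zdb -l` prints them. A sketch (device names invented):

```shell
# A disk that stayed connected: its labels show a recent txg.
zdb -l /dev/da1 | grep txg

# A disk that sat behind the failed HBA: its labels stop at the txg
# that was current when the HBA died.
zdb -l /dev/da4 | grep txg
```

Import requires a complete replica of the pool at some common txg; the old-txg disks would normally just be resilvered forward from the current side, which is exactly the step the loss of 1a makes impossible for vdev1.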

So it doesn't seem one could be sure whether a redundant copy of any specific pool data or metadata exists for a rebuild, or whether any rollbacks/snapshots are viable?

Discussion:

This isn't quite the same as a double failure killing redundancy. All other drives in vdev1 are still in perfect condition and unharmed, and the issue is not that we lost more drives than redundancy would allow. What's happened is that although only one drive was lost (redundancy of HDDs was adequate), the HBA loss desynced the pool into two "halves" (connected and disconnected), at which point the loss of one drive from the "active" half meant it could no longer resync, resilver or roll back to any consistent form, when the missing (intact) HDDs came back online.

In practice, many installs won't have 100% redundant hardware (most home users and even many businesses don't have dual-port SAS drives, dual HBAs, etc.). They will have redundant pools, because they were told to, but not redundant HDD power or redundant HDD connections. There should be backups, but one tends to assume ZFS just doesn't have issues of this kind when restoring a pool on good hardware, unless we suffer more actually-failed drives in a single vdev than its redundancy can absorb, or some extreme electrical event damages multiple drives.

This scenario suggests one could have very robust HDD redundancy but a single lost HBA/backplane plus a single lost HDD could still kill the pool, however many redundant HDDs are in each vdev, and even though all but one copy of the lost vdev's data is actually intact and all but one disk is in perfect condition.

Not at all "worried", but intrigued for sure, and I would like to ask for input on this scenario.
 
Last edited:

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Yes, the loss of 1a would be the end of the pool. It would have to be recovered. Any unrecovered blocks could be copied from 1b/c, and then you'd hope for the best.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I will add that scenarios like this are why I don't give any concern to double disk failure. You are better off with raidz1 and a backup than with raidz3 and no backup. Systemic failure or risk cannot be eliminated. www.taobackup.com
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I think your scenario is a good example of the need to be conscious about balancing high availability and long-term data reliability. In your build, you retain incredibly high availability: any one component can go down, and your system keeps on going. However, you've immediately gone to a situation with very low reliability. Unless you're doing synchronous backups somewhere, it might be better for a system to fail gracefully than to keep running in this situation. At least if it fails gracefully, you're not writing data to it (usually under the assumption that the system is extremely solid) that now has a high risk of loss.
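For reference, ZFS's closest knob here is the pool's `failmode` property, but it only governs what happens once a pool loses all paths to its data; there is no built-in "halt while merely degraded". A crude approximation would be a watchdog that exports the pool as soon as redundancy is lost (a sketch; the pool name is made up):

```shell
# failmode controls behaviour on total I/O failure, not on degradation:
#   wait (default) - block I/O until devices return
#   continue       - return EIO for new synchronous writes
#   panic          - crash the host
zpool set failmode=wait tank

# Sketch of a "fail instead of limping on" watchdog: if the pool is no
# longer healthy, forcibly export it so clients stop writing to it.
if ! zpool status -x tank | grep -q "is healthy"; then
    zpool export -f tank
fi
```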
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I already said I wasn't going to give you any information because you don't listen, but I will make this last point. The failure mode you have just illustrated is the reason that server power supplies are paralleled by a power-supply paralleling board instead of running two power supplies with one supply handling part of the drives and another supply handling the other part of the drives. The thing you want to do is a kluge and goes against the very nature of high availability. If you want a server, use server grade parts and stop trying the cheap route. You want high availability without being interested in spending the money that it costs to actually have it.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I already said I wasn't going to give you any information because you don't listen, but I will make this last point. The failure mode you have just illustrated is the reason that server power supplies are paralleled by a power-supply paralleling board instead of running two power supplies with one supply handling part of the drives and another supply handling the other part of the drives. The thing you want to do is a kluge and goes against the very nature of high availability. If you want a server, use server grade parts and stop trying the cheap route. You want high availability without being interested in spending the money that it costs to actually have it.
Chris, you've already done all your snarkiness elsewhere - adding nothing of merit by it - and got that thread closed and reopened by a mod. I wasn't asking about my own use-case, nor about PSUs, nor HA. But you've "gone drama" and assumed something explicitly not in the OP (and explicitly stated not of interest to me in the other thread either), that nobody else here saw a need to do. I left that thread there - and so should you have, instead of pursuing me across the forum, to allege things not done, not wanted, or explicitly stated not of interest.

The other thread prompted, as a matter of curiosity, a possible failure mode I'd never seen written about, for any *NAS system with a single point where disks might become inaccessible without sustaining damage - a failed HBA, a pulled fanout cable, a multi-rail PSU fuse, whatever. That covers the vast majority of home and non-enterprise users, I reckon. How can one assess a risk without understanding it :) This thread simply asks one question:

On ZFS, we often think of redundancy as disk redundancy only, assuming that with good hardware any other loss is a mere inconvenience. But is there a route by which the loss of even one HDD (even from a highly redundant vdev) can in fact lead to total pool loss, if it follows a more "innocent" prior loss that normally poses no threat to the pool, such as a mere HBA?

I think it's a worthwhile question, so let's keep personal emo out of it and carry on learning.
 
Last edited:

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
I think it's a worthwhile question, so let's keep personal emo out of it and carry on learning.

I fully agree. It's a failure mode I have never seen discussed, and the more understanding we all have, the better we can make assumptions to suit our individual needs. This is nothing 99.9% of us would likely worry about, as OP stated, but it's a quest for information and understanding about the underlying issues that could arise in the perfect-storm situation.

If this perfect storm were to happen to many of us, it would be an unlikely day, but a shitty one nonetheless. Most of us aren't in that position: most likely no one here has mission-critical data at home, on their server, with no offsite backup or tapes sent to an offsite location, and it's even less likely anyone has that no-offsite-backup situation at a workplace where they're the sysadmin. If you do, and you read this thread, go back and rethink your strategy and reassess just how mission-critical your data truly is.

That being said, this is interesting information nonetheless, and I for one would like to understand the possible outcomes of such an event; not because I will worry about it and build my server differently to bypass the risk (I won't... my data isn't worth enough of my money to care, and I have other proper means of backup, which is likely true of a high % of those following this thread), but simply because I'd like to know for my own personal bank of knowledge.


 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I will add that scenarios like this are why I don't give any concern to double disk failure. You are better off with raidz1 and a backup than with raidz3 and no backup. Systemic failure or risk cannot be eliminated. www.taobackup.com
Nothing can eliminate the possibility of an entire server or pool being wiped out by some event or other, however good its build. I used ordinary 2-way mirrors on Windows for many years, where a rebuild is easy, and initially adopted the same level of redundancy when the data moved to FreeNAS. One day I had a single failure, and given the lack of recovery tools compared to NTFS, I suddenly became much more aware than before that everything was hanging on a single HDD. Even though resilvering a mirror isn't too demanding and I had a spare in place ready, I can tell you I was never so relieved as when it finished and I was back to redundancy again. By the next day, I had an extra set of mirrors ordered so I wouldn't have to feel that way on a single disk loss again. I keep the most crucial stuff constantly backed up, and I've long intended to set up replication for exactly the reason you gave, but while I've got the 2nd machine ready, I haven't set it up yet because of other things going on.

you retain incredibly high availability: any one component can go down, and your system keeps on going. However, you've immediately gone to a situation with very low reliability. Unless you're doing synchronous backups somewhere, it might be better for a system to fail gracefully than to keep running in this situation. At least if it fails gracefully, you're not writing data to it (usually under the assumption that the system is extremely solid) that now has a high risk of loss.
I've never really thought an ordinary mirror was "incredibly high availability"; it's just the way to get a safe pool, as the guide says. From what you're saying, one might put entire vdevs on the same connection to force an immediate stop, rather than what feels intuitively safest of splitting them up. How peculiar and counter-intuitive!

I think I'll stick with replication in practice (redundancy within a server not being a safeguard for whole-server loss) but it's definitely not where I expected the question to lead!
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
...to force an immediate stop, rather than what feels intuitively safest of splitting them up. How peculiar and counter-intuitive!
It seems odd until you consider that availability and durability are two different things. It is a 'fail-safe' concept: halt availability in order to preserve durability. However, in the OP scenario, failure is probably guaranteed either way; the drive would probably not be able to finish a resilver. The durability of your data comes down to backups, not your pool design.
 