What are the chances of a RaidZ1 array failing during rebuild? (and more)

Status
Not open for further replies.

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
And after Something Bad Happens it becomes:
"Why did I waste 50% of my capacity when I could have just made a full backup?"
The question only arises if you don't have a full backup, which would be foolish, regardless of vdev layout.

When a system survives multiple disk failures, saving the trouble of restoring from backup, that's a huge win in many scenarios. If restoring from backup is not a significant burden, then a less reliable vdev layout becomes a much more attractive option.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
The question only arises if you don't have a full backup, which would be foolish, regardless of vdev layout.
When I see questions about raidz1 vs raidz2 vs raidz3, I don't see anybody mentioning their super-duper-must-be-always-online requirements, which would imply they are looking at it correctly: as availability. Instead, they are hemming or hawing about 1 extra drive or not, and how important their data is. Not how critical 100% uptime is. In other words, they are looking to have a justification to take the foolish route.

When a system survives multiple disk failures, saving the trouble of restoring from backup, that's a huge win in many scenarios. If restoring from backup is not a significant burden, then a less reliable vdev layout becomes a much more attractive option.

So take that full backup, which only a fool doesn't have, and store it in a standby continuously replicated server. If your raidz1 wins the lottery and fails during resilver, just switch right over. Plus, you'll be able to deal with all manner of disasters.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
they are looking to have a justification to take the foolish route.
Perhaps, but none of the regulars is advising people to go with a higher RAIDZ level instead of having a backup. Encouraging people to have a backup is a good thing. Discouraging them from having highly reliable storage is not. I feel like you're presenting a false dichotomy.
continuously replicated server
Someone "hemming or hawing about 1 extra drive" is unlikely to adopt this solution.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Perhaps, but none of the regulars is advising people to go with a higher RAIDZ level instead of having a backup. Encouraging people to have a backup is a good thing. Discouraging them from having highly reliable storage is not. I feel like you're presenting a false dichotomy.
I have rewritten this post a couple times, some responding to your points, but I think I can boil it down to this single statement:

When a person asks about raidz2, raidz3, etc., respond with: You already have full backups, you already have scrubs, so why would you possibly care about the small risk of a second drive failure?

Would there be anything wrong with that? Sometimes resiliency from failure is better than trying to minimize failure entirely.

Someone "hemming or hawing about 1 extra drive" is unlikely to adopt this solution.
Yes, that was partially my point, and is exactly who I have in mind in these scenarios. They throw in a second parity drive and do no backups. Better to have used that drive for partial backup, and if that makes somebody nervous, good, then do a full backup. Nobody is that worried about a failed resilver on raidz1 if they have a full backup. And if they are, they can explain it, such as a lights-out operation.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I have rewritten this post a couple times
I get it ;)
why would you possibly care about the small risk of a second drive failure?
I acknowledge the risk is relatively small for each individual, yet we see it happen repeatedly in the population of FreeNAS users showing up in these forums. Here's a typical example. Oh look, the user even has a replication server.

This is key. You seem to assume that restoring from backup is no big deal. I already suggested earlier that, in many cases, it is a big deal. It's a last-resort, disaster-recovery option, even assuming the backup is up to date. Of course, you should test your backups, but how often?
Sometimes resiliency from failure is better than trying to minimize failure entirely.
This is the false dichotomy you keep presenting.
They throw in a second parity drive and do no backups.
Which is foolish, and not what anyone here is recommending, so why present it as a counterpoint?
Nobody is that worried about a failed resilver on raidz1 if they have a full backup.
Nonsense.
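
For anyone who wants rough numbers on the "small for each individual, yet it happens repeatedly across the population" point, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the 5% annual failure rate, the 24-hour resilver, the 6-drive RAIDZ1 vdev, and the 10,000 resilvers across a user population are made-up inputs, not measurements from this thread. It also treats drive failures as independent and at a constant rate, and it ignores unrecoverable read errors and correlated failures (same batch, same heat, same age), so real-world risk is likely higher than what this prints.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_second_failure(remaining_drives, afr, resilver_hours):
    """Chance that at least one surviving drive dies during the resilver.

    Assumes independent drives with a constant failure rate (exponential
    model). afr is the assumed annual failure rate as a fraction, e.g. 0.05.
    """
    # Turn the annual failure probability into a per-hour hazard rate.
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    # Probability that a single drive survives the whole resilver window.
    p_one_survives = math.exp(-rate_per_hour * resilver_hours)
    # All remaining drives must survive for a RAIDZ1 resilver to complete.
    return 1.0 - p_one_survives ** remaining_drives


def expected_failed_resilvers(resilvers, p_loss):
    """Expected number of failed resilvers across many independent rebuilds."""
    return resilvers * p_loss


# Hypothetical inputs: 6-drive RAIDZ1 with one drive already dead.
p = p_second_failure(remaining_drives=5, afr=0.05, resilver_hours=24)
print(f"Per-resilver risk of a second failure: {p:.4%}")
print(f"Expected failures in 10,000 resilvers: "
      f"{expected_failed_resilvers(10_000, p):.1f}")
```

Under these assumptions the per-resilver risk is well under 1%, which is why any one user can reasonably call it small, while the same figure multiplied across thousands of rebuilds still produces a steady trickle of lost pools. Whether that trade is acceptable depends on how painful a restore from backup is for you.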
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Parity = increased system availability! Meaning you can access your data more of the time without restoring from backup. Parity is never a replacement for backups. Full stop. End of story. Dead horse beaten to a pulp.

A failed resilver on a Z1 system with a proper backup is still a massive PITA. Restoring data from your backups takes time and effort. I am lazy, so I run a Z2 production pool and a Z1 backup pool.

As for making your system immune to failure - you will never, ever, ever be able to remove failures completely. Again, backups are required IMHO but I, touch wood, have never had to use them even though I have had a number of HDDs fail in the last 3 years (damn 3TB Seagate shite). Much easier just to swap the caddy and let ZFS do the work. Gah, don't even want to think of the time a full restore would take.

Message of the day: Redundancy will never replace backups. Redundancy can greatly increase system availability.

Bottom line - back up your shite so you don't lose it, and pick the parity option that suits your availability needs. If you are happy with restoring from backups from time to time, then use Z1. Personally, I would use Z2 + backup of your choice.

Cheers,
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Also from the linked Google paper:
Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count.
:eek:
 