1-disk vdevs for backup of a Z3 server?

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Question: On a 1-disk-per-vdev pool, does a "nonrecoverable read error" mean we only lose a file, as opposed to the entire vdev/pool?

Context:

I'm running out of space, so I'm redoing my servers. The new configuration is an 11-disk (14 TB drives) RAIDZ3 plus 2 offline spares, with a second server as a backup via replication. I'm looking to minimize the number of disks (16 TB) on the second server. Since the primary is already a Z3, for the replication destination I'm thinking of using single-disk vdevs stitched together in a pool.

If I understood this correctly, the chance of the primary pool failing (227,000 hours MTBF), then the rebuild of the 11-disk Z3 failing (0.016%), and then the backup copy being lost to a drive failure on the second server (500,000 hours MTBF) is pretty much nil.

Details:

- MTBF of 1 disk is 2,500,000 hours
- MTBF of a 5-disk pool is 500,000 hours <== mtbf(1||2) = (mtbf1 x mtbf2) / (mtbf1 + mtbf2), applied pairwise; equivalently, disk MTBF / number of disks
- MTBF of an 11-disk pool is 227,000 hours
- Nonrecoverable read error rate is 1 per 10^16 bits read (1.12% for a 14 TB disk and 1.28% for a 16 TB disk)
- Z3 rebuild fail is 0.016% <== 1 - (1-df)^(nd-1) - (nd-1)*df*(1-df)^(nd-2) - [(nd-1)*(nd-2)/2]*df^2*(1-df)^(nd-3), where df = probability of a drive failing during the rebuild (1.12%) and nd = number of drives (11); sanity-checked in the sketch below
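
To double-check myself, here is a quick Python sketch of the arithmetic above (a rough sanity check only, assuming independent drives and constant failure rates; the function names and the linear URE approximation are mine):

from math import comb

DISK_MTBF_H = 2_500_000      # assumed per-disk MTBF, hours
URE_RATE = 1 / 1e16          # nonrecoverable read errors per bit read

def pool_mtbf(n_disks, disk_mtbf=DISK_MTBF_H):
    # Series combination: any one disk failing counts as a pool event.
    # Applying mtbf(1||2) = (m1*m2)/(m1+m2) repeatedly gives disk_mtbf/n.
    return disk_mtbf / n_disks

def ure_prob(disk_tb):
    # Chance of at least one URE when reading the whole disk once
    # (linear approximation: bits read x error rate).
    return disk_tb * 1e12 * 8 * URE_RATE

def z3_rebuild_fail(nd, df):
    # One disk already lost; the rebuild fails only if 3 or more of the
    # remaining nd-1 disks hit trouble (binomial tail).
    n = nd - 1
    ok = sum(comb(n, k) * df**k * (1 - df)**(n - k) for k in range(3))
    return 1 - ok

print(f"5-disk pool MTBF:  {pool_mtbf(5):,.0f} h")      # ~500,000
print(f"11-disk pool MTBF: {pool_mtbf(11):,.0f} h")     # ~227,000
print(f"URE, full 14 TB read: {ure_prob(14):.2%}")      # 1.12%
print(f"URE, full 16 TB read: {ure_prob(16):.2%}")      # 1.28%
print(f"Z3 rebuild failure:   {z3_rebuild_fail(11, ure_prob(14)):.3%}")  # ~0.016%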

Thoughts?

Thanks!
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey @Phase

Question: On a 1-disk-per-vdev pool, does a "nonrecoverable read error" mean we only lose a file, as opposed to the entire vdev/pool?

It depends on what ends up unreadable. If it is part of the critical ZFS metadata hierarchy, it can go that bad. The probability is low because these critical blocks are stored in multiple copies for that very reason, even on a single drive. Still, it would not be impossible to lose it all from a few read errors on a single-drive vdev.
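
For a rough feel of why that is low probability: ZFS keeps two or three "ditto" copies of important metadata even on a single-disk vdev (and the copies property can do the same for data). A minimal Python sketch, assuming each copy independently suffers a URE at the 1-per-10^16-bits rate and a 128 KiB block size (both assumptions are mine, and this ignores whole-drive failure, which is the bigger risk):

URE_RATE = 1 / 1e16              # errors per bit read
BLOCK_BITS = 128 * 1024 * 8      # one 128 KiB block

p_one_copy = BLOCK_BITS * URE_RATE        # ~1e-10 chance of losing one copy
for n_copies in (1, 2, 3):
    print(f"all {n_copies} copies unreadable: ~{p_one_copy ** n_copies:.1e}")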

I'm thinking of using single-disk vdevs stitched together in a pool

Bad idea. The moment any single one of these drives fails, the entire pool is lost.
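
To put a number on it (a sketch only, assuming an exponential failure model and the 2,500,000-hour per-disk MTBF you quoted; real-world rates will differ):

from math import exp

DISK_MTBF_H = 2_500_000      # assumed per-disk MTBF, hours
HOURS_PER_YEAR = 8766

def stripe_loss_prob(n_disks, hours=HOURS_PER_YEAR):
    # A stripe of single-disk vdevs is lost when ANY one disk is lost.
    per_disk = 1 - exp(-hours / DISK_MTBF_H)
    return 1 - (1 - per_disk) ** n_disks

for n in (1, 5, 11):
    print(f"{n:2d}-wide stripe: {stripe_loss_prob(n):.2%} chance of total loss per year")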

If I understood this correctly, the chance of the primary pool failing (227,000 hours MTBF)

Indeed, you misunderstood here. How many human errors can happen over those 227,000 hours? How many software updates can go wrong in that period? How probable is a fire or other physical incident destroying that entire server all at once over that period? ... HDD failure is but one factor in the overall safety of your data. Your backup must protect you against all of these.

See my signature for a complete backup solution that will provide full protection, and do not hesitate to ask if you have more questions.
 

Phase

Explorer
Joined
Sep 30, 2020
Messages
63
Yes, I agree. I have 3 copies of the data, but they are in different setups depending on the type of data, and they are all at the same street address. Some of the data I also have in the cloud, but that is a small amount, probably 1 or 2 TB. I’ll have to give it a bit more thought.

Based on your input I’m making the secondary server a Z2.

The task at hand is to:

1) migrate the primary server from old box 2 to new box 1, including the physical SSDs and HDDs in the current pools, but with a new mobo, CPU, PSU, etc.

2) migrate the secondary server's hardware from old box 3 to box 2, including the mobo and SSDs, but with all new HDDs in a Z2 configuration (box 3 does not fit all the disks we need)

3) replicate the primary server to the new secondary server (takes 4-5 days)

4) destroy the old pools on the primary server and recreate them as an 11-disk Z3 (9 disks from the current primary and 2 repurposed from the retired secondary server); all old disks are the same 14 TB model, and all new disks are the same 16 TB model

5) replicate the data from the secondary server back to the primary server (takes another 4-5 days)

6) have a party!
 