ZFS Basics: "Symmetric" Data/Backup setup

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
Hello all!

I am new to TrueNAS and ZFS and have spent quite some time over the last few days reading up on the technicalities.
This forum and the TrueNAS documentation have been tremendously helpful in that!
I think I have a decent basic understanding of ZFS now, though not to the extent that I could give a definite answer to my question (which is more a ZFS than a TrueNAS question):

A friend and I are planning on each setting up a TrueNAS server, let's call them A and B.
We are talking about home servers here, so rather small but properly built (server motherboard, ECC RAM, ...).
Let's say, we have 6 HDDs in both A and B.
We are not expecting a lot of action on these servers, maybe 2 people using file shares, something sync-y (Nextcloud, Syncthing, ...), maybe git.
(Data integrity > performance.)

Now the idea is to use each other's NAS as a backup, so B backs up A and A backs up B.
Let's say backup is performed via ZFS snapshots and replication (though I don't think it matters).
I'm pretty sure that this would be no problem at all if both A and B have two zpools, one for data and one for the other's backup (at least I can't think of a reason why that would be bad).
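
For reference, the replication I have in mind would be something along these lines (just a sketch; the hostname and snapshot name are placeholders):

Code:
# on A: snapshot the data dataset, then send the snapshot to B over SSH
zfs snapshot tank_a/data_a@backup-2021-08-14
zfs send tank_a/data_a@backup-2021-08-14 | ssh nas-b zfs recv tank_b/backup_a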

However, as said, there both homeservers and only equipped with a comparably small number of HDDs.
I dislike the idea of running 2 zpools with 2+1 raidz1 on both machines.
I'd much rather run one zpool using 4+2 raidz2 or even 3+3 raidz3, as it's much more fault tolerant.
(OK, a compromise would be 3+2 raidz2 for the data pool and a single plain vdev for the backup pool. It doesn't feel good without parity on the backup pool, though.)
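
For concreteness, the 4+2 raidz2 variant would look roughly like this on the command line (a sketch; device names are placeholders, and on TrueNAS one would normally do this through the web UI):

Code:
# one pool per machine, six disks, two-disk parity
zpool create tank_a raidz2 da0 da1 da2 da3 da4 da5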

So my question is, if both A and B each have one zpool with e.g. 4+2 raidz2 or 3+3 raidz3 and a structure something like this (ignoring possible sub-datasets):

on machine A:
Code:
- tank_a (zpool)
   |- data_a (dataset)
   |- backup_b (dataset)


on machine B:
Code:
- tank_b (zpool)
   |- data_b (dataset)
   |- backup_a (dataset)


what scenarios are there in which corruption of tank_a would lead to corruption of tank_b, and vice versa?

For example, if either A or B does not use ECC, I understand how that could lead to corruption of one of the pools and of its replicated backup, which would then affect the full pool that the backup resides on.

With that risk addressed by using ECC RAM, is the above layout still a good idea?
It feels to me that it's not, but I can't really say why, since I am not fully aware of all the possible corruptions that could destroy both a pool and its backup pool.

Looking forward to your insight!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I don't see an issue. It's just a pool of disks; there is no reason (barring hardware issues) why writing to one dataset should corrupt another dataset. Do it. You can, I believe, even put a quota on the backup dataset so it doesn't use too much space
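
Something like this, I believe (the size is just an example):

Code:
# cap the space the backup dataset may consume
zfs set quota=8T tank_a/backup_b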
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
What type of corruption are we talking about?

In general, ZFS takes a lot of precautions against things like bit rot. But of course there is always some risk, and backups reaching back far enough are the way to mitigate it. The replication between NASes, as you described it, is what I have been doing between my two FreeNAS boxes for a while. In addition, I also have an encrypted cloud backup to protect against the house burning down etc.
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
What type of corruption are we talking about?
Well that's kind of what I am trying to find out :)
As mentioned, I know that non-ECC RAM that has gone bad can lead to a corrupted pool and its replicated backup (so basically, all data is lost as most likely metadata is corrupted).
What other scenarios are there, assuming ECC RAM? E.g. can certain other hardware failures produce this?
I am just trying to understand the possibilities and their rough likelihood to be able to judge the risk.
In the end, I don't want to lose my data because I've overlooked something, only to be told "well, your setup was a bad idea to begin with" :)

But it's good to know that other people have been running the setup (or similar) that I've described.
Being new to ZFS, it's hard for me to grasp its pitfalls and make an educated decision.

You can, I believe, even put a quota on the backup dataset so it doesn't use too much space
Yes, using a quota on the data dataset or a reservation on the backup dataset is certainly a good idea.
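
E.g. something like this (the size is just a placeholder):

Code:
# guarantee space for the incoming backups
zfs set reservation=8T tank_a/backup_b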
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
what scenarios are there in which corruption of tank_a would lead to corruption of tank_b, and vice versa?
Well, anything that results in the corruption of data on tank_a/data_a would also result in corresponding corruption of the backup data from that point forward. This is pretty obvious if you think about it. But it wouldn't affect prior snapshots of that same dataset, whether on tank_a or replicated to tank_b.
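
You can always check which earlier versions are still available on either side, e.g.:

Code:
# list the retained snapshots of the replicated dataset
zfs list -t snapshot -r tank_b/backup_a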

While ECC is always a good idea, the risk of data loss from not using it is greatly exaggerated.
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
Well, anything that results in the corruption of data on tank_a/data_a would also result in corresponding corruption of the backup data from that point forward. This is pretty obvious if you think about it. But it wouldn't affect prior snapshots of that same dataset, whether on tank_a or replicated to tank_b.
While ECC is always a good idea, the risk of data loss from not using it is greatly exaggerated.
Yes, makes sense. I guess the tricky bit is that e.g. scrubbing a pool on a system with bad RAM can actually corrupt previous snapshots, which then also messes up replications to a system that's otherwise healthy.

Either way, I think I also missed that the datasets are independent filesystems.
So even if tank_a/data_a is corrupted and replicated to tank_b/backup_a, which is then corrupt starting with that snapshot, that would still leave the rest of tank_b unaffected, correct?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
In the end, it should be noted that the biggest risk is still the user. I seem to remember a study from the early 2000s finding that more than 70% of data loss in the corporate world was due to user error, not HW or SW issues.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey @Masch,

Congrats on planning your backup as an integral part of your design. In my signature, you have the details about the 3-copies rule. What you described will cover copies No. 1 and No. 2 for both you and your friend (provided your two places are not right next to each other, such that a fire at one could damage the other).

The risk you are talking about is the kind that is to be managed by copy No. 3. That one being offline, it will not be affected by any logical incident that could potentially affect both 1 and 2. So indeed, what you are looking at here is an actual risk, and you are right to look for a proper solution.

But do not worry too much here. The chance of a logical incident that would propagate to both instances at once is EXTREMELY low. The highest risk would be a human error while handling the servers, and even then, a single wrong command would not be enough. As such, with only copies No. 1 and No. 2, you already have a very good backup strategy that will protect you in almost every case. Should you wish to go for the complete rule, start looking for an offline solution, but again, no need to panic here.

Again, congrats on the effort you put into your backups,
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
@Heracles thanks for the feedback!
Yes, I guess a third copy is what it comes down to.
I agree; especially since I'm new to ZFS, the human error aspect is the most likely one for me!

One final question: I probably have a hard drive lying around that's large enough to hold most of the data that I absolutely do not want to lose and that I could put in an external enclosure.
For this final 3rd offline backup, would it still make sense to use ZFS and a plain vdev?
As far as I understand, the self-validation (i.e. checksums) should protect against bit rot (at least to the extent that one of the two metadata copies is still intact)?
So I can also use ZFS snapshots and replication here instead of doing a backup from scratch each time?
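
Roughly what I have in mind (just a sketch; the pool and device names are placeholders):

Code:
# single-disk pool for the offline copy; no redundancy, but checksums still detect bit rot
zpool create offline_backup da6
# optionally store each block twice so self-healing has a chance even on one disk
zfs set copies=2 offline_backup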
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
A benefit of using ZFS for that 3rd copy would be to have versioning on that last-resort copy as well. With snapshots, you can have multiple versions of the same data. Would that be of interest for the kind of data you will save that way?

For TrueNAS, external storage does not work well over USB. If you are talking about external storage over eSATA, no problem. If it is USB, be extra careful.

Another option for you would be to do that backup from a workstation instead of the server. That way, USB is not a problem. You will not do it with ZFS, so no snapshots, but it may still be good enough for you. With a tool like rsync, you can update your backup with only what is required instead of re-copying it all.
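
For example (paths are placeholders):

Code:
# mirror the share onto the USB disk; -a preserves attributes, --delete removes files gone from the source
rsync -a --delete /mnt/tank_a/data_a/ /media/usb_backup/data_a/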
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
With a tool like rsync, you can update your backup with only what is required instead of re-copying it all.
Yes, that's true; with all this ZFS reading lately, it's easy to forget about the obvious solutions...

Thanks for the hint about USB! I'm probably just going to put the drive in a workstation and back up the shares via rsync.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
With a tool like rsync, you can update your backup with only what is required instead of re-copying it all.
Incremental ZFS snapshot copies do the same. Just sayin'
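
E.g. (snapshot names are placeholders):

Code:
# only the blocks that changed between the two snapshots go over the wire
zfs send -i @monday tank_a/data_a@tuesday | ssh nas-b zfs recv tank_b/backup_a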
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
I guess, kind of?
The incremental comparison that rsync and ZFS have in common happens, in the case of ZFS, locally on the server side where the snapshot is created. Sending that snapshot, I assume, does not reevaluate the checksums on the target as rsync does?
(I am guessing a bit here; again, I've just started to get into ZFS.)

How about snapshot sends to a plain vdev with a single HDD as the target?
Assuming the target backup pool has degraded such that some snapshot in the past is affected.
While ZFS can detect corruption on that target via checksums, there is no redundant copy on the pool to repair it.
If I zfs send a snapshot to that target, would ZFS detect the checksum mismatch? I mean, it's just sending a snapshot, so I would assume not?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Sending that snapshot, I assume, does not reevaluate the checksums on the target as rsync does?
It assumes that the data on the target is correct. Local checksums on the target will take care of that. ZFS on both sides always knows if your data is corrupted or not.
Assuming the target backup pool has degraded such that some snapshot in the past is affected.
That is simply not possible. ZFS will know that the snapshot is invalid, period. So you will not be able to send a differential unless you fix the problem on the target first. All of these operations work on the block level, not on the file level. If you send an incremental from snapshot A to snapshot B to a destination, ZFS on the destination will ensure snapshot A is OK or abort with an error message.

Some snapshot being affected is just not a scenario in ZFS. Either you have sufficient redundancy, which I recommend you should have, or ZFS will tell you in very precise terms which files are corrupt and which ones aren't. Use zpool status -v in that case.

Edit: I might not be 100% correct about the details here. But what I do know: you must keep in mind that ZFS works on the block level, not on the file level, for send/receive. The send/receive pipe makes 100% sure that the snapshot on the destination is bit-for-bit identical with the snapshot on the source. If it cannot do that, because the previous snapshot on the destination got corrupted *somehow*, it will abort and tell you about the problem. Definitely.
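
To check the destination pool explicitly (the scrub runs in the background; status shows the result):

Code:
zpool scrub tank_b
zpool status -v tank_b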
 
Last edited:

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Incremental ZFS snapshot copies do the same. Just sayin'

Indeed. The major difference between the two here is that ZFS replication is block-level while rsync is file-level.

I talked about ZFS replication because it was the first solution considered, and it also has the benefit of versioning. The mention of rsync was more to show how it would mimic the original solution (ZFS) rather than do any better.
 

Masch

Cadet
Joined
Aug 14, 2021
Messages
8
Alright, thanks everyone! I think that cleared up most of the things for me. Looking forward to actually putting this into practice.
 