@f00b4r00 the bug you linked was for ZFS on Linux, which is fairly different from the ZFS implementation in FreeNAS.
The bug I linked is in the common OpenZFS code, as evidenced by its being in the "openzfs/zfs" repository. It affects all implementations - also as evidenced in the bug thread, which mentions e.g. illumos and openzfsonosx - and the bug includes a cross-platform reproducer. Have you read the bug thread?
It's not clear if it's even the same thing you encountered, based on the limited information given in this thread. Creating a ticket would help us get more information to debug the issue.
It is the same bug. Unless you mean to say there is another bug that has the exact same symptoms as this one (completely silent zeroing of replicated files), at which point I'd say it'd be safe to call this replication system a practical joke.
I certainly have zero interest in trying to reproduce the bug (i.e. help debug it), so I don't see what my input would be beyond opening a ticket, which you can do yourself. I don't even remember when I flicked the LARGEBLOCK switch (possibly when I started pulling from the 11.3 target instead of pushing from the 11.2 source), which leads to other dramatic consequences; see below.
Anyway, you have all the available information in the github issue, along with a testcase and a "fix". What else could you possibly need?
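For reference, the trigger is a replication chain where the -L (large-block) option changes between the original full send and a later incremental; the issue has the exact reproducer, but the general shape is roughly this (pool/dataset names invented for illustration):

    # dataset whose recordsize exceeds 128K, so large-block handling matters
    zfs create -o recordsize=1M tank/src
    dd if=/dev/urandom of=/tank/src/bigfile bs=1M count=8
    zfs snapshot tank/src@snap1
    zfs send tank/src@snap1 | zfs recv tank/dst          # full send, no -L

    # more writes, then an incremental after the LARGEBLOCK switch is flicked
    dd if=/dev/urandom of=/tank/src/bigfile bs=1M count=1 seek=2 conv=notrunc
    zfs snapshot tank/src@snap2
    zfs send -L -i @snap1 tank/src@snap2 | zfs recv tank/dst
    # on affected versions the receive completes without any error,
    # yet regions of /tank/dst/bigfile read back as zeros

And a scrub finds nothing, because the receiver's blocks checksum correctly; they were simply written full of zeros.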
And to be clear, this isn't really a disaster bug if it is the one described in the issue on github. That bug describes how an incremental send with incorrect options would populate files with zeros on the receiver. Since it's an incremental send, you still have the previous snapshot to roll back to, and you still have the data fully intact on the sender. Nothing actually gets overwritten on disk in ZFS; your data is still safe. It is inconvenient that it gave the appearance of working when it wasn't, instead of refusing to run with the incorrect flags. Roll back to the previous snapshot and do the incremental send again with the correct flags.
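(For concreteness, the recovery being suggested here amounts to something like the following, with made-up dataset and snapshot names; it presupposes a known-good snapshot that still exists on both sides.)

    # receiver: discard the corrupted receive and anything newer
    zfs rollback -r tank/dst@lastgood
    # sender: redo the incremental, this time with consistent flags
    zfs send -L -i @lastgood tank/src@latest | zfs recv tank/dst

Whether that known-good snapshot still exists is exactly the point disputed below.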
Wow. You completely failed to grasp the severity of this bug.
Silent data corruption in the replication system: can it get any worse?
Because the corruption is totally silent, and the scrub doesn't report anything wrong with the pool, and the subsequent incrementals keep happening happily ever after, if you have snapshot expiration enabled there comes a point where data CANNOT BE RECOVERED. Once the last good snapshot is gone, you're fscked. Guess what: I am, because I noticed too late.
Furthermore, and this is possibly the most important point: it is impossible to quickly identify the last good snapshot. Corruption can be spread across multiple snapshots as random parts of the underlying filesystem are updated. ahrens himself confirms there is NO RECOVERY PATH.
The fact that the data is intact on the sender is completely beside the point. One of the main reasons to use replication is to increase redundancy: if my sender dies, is destroyed in a fire or is stolen, and I thought my data was safe, then I'm in for a world of pain! And herein lies the catch: how often do you check data stored on a backup system? Usually when you need it. That's exactly why this bug is a total disaster: because the time you might eventually notice something went wrong is the time you least want to!
Besides, it may not be practical to replicate the complete dataset from scratch (otherwise incremental backups might not even be used in the first place). In my case I would have to schlep 6TB of data back between two sites several hundred km apart which can only manage 20Mbps between them...
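For scale: 6TB over a sustained 20Mbps link is roughly (6 × 8 × 10^6 Mbit) / (20 Mbit/s) ≈ 2.4 million seconds, i.e. about four weeks of continuous transfer before any protocol overhead, just to get back to square one.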
But maybe it'll take one of your enterprise-grade customers to lose hundreds of TB of data for sh*t to really hit the fan?
