Yes, I generally upgrade the pool features O(months) after deploying.

This is something that stands out: a very old zpool from the FreeNAS 9 era, replicating into TrueNAS SCALE with ZFS 2.1, and then back again to a different dataset on the old pool.
I don't believe ZSTD was a supported compression property back then? Did you upgrade the old pool's features prior to these replications?
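If it helps, one way to check is to query the pool's feature flags directly (the pool name `old_pool` is a placeholder here):

```sh
# Check whether the zstd_compress feature is enabled/active on the old pool;
# "disabled" would mean ZSTD was never enabled there.
zpool get feature@zstd_compress old_pool

# With no arguments, list pools that don't have all supported features enabled.
zpool upgrade
```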
I just followed my steps in comment #37 with the single-file dataset and was not able to reproduce the problem.
So it may be something to do with the large amount of data in the intermediate dataset and how it is read and pipelined to the destination dataset. Unfortunately, I wasn't able to reproduce this with a small test dataset.
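If data volume is the trigger, a larger synthetic dataset might reproduce it where the single-file one didn't. A rough sketch, with the dataset name, mountpoint, and file count all assumptions:

```sh
# Fill a throwaway dataset with many small files matching one of the affected
# name patterns, in case the corruption only appears past some data volume.
zfs create old_pool/repro_src
for i in $(seq 1 50000); do
    head -c 4096 /dev/urandom > "/mnt/old_pool/repro_src/file_$i.properties"
done
```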
The md5sum check is still running on the overnight replication that went old_pool -> old_pool, but so far there is no corruption.
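For reference, the check is roughly of this shape (the mountpoints are illustrative, not my exact paths):

```sh
# Hash every file on the source and on the replica, then diff the lists;
# any differing line points at a corrupted (or missing) file.
( cd /mnt/old_pool/src && find . -type f -print0 | sort -z | xargs -0 md5sum ) > /tmp/src.md5
( cd /mnt/old_pool/replica && find . -type f -print0 | sort -z | xargs -0 md5sum ) > /tmp/dst.md5
diff /tmp/src.md5 /tmp/dst.md5
```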
I am starting to run out of free space to keep some of these experiments around. I think next I will do as follows (sketched as commands after the list):
- On the intermediate dataset, delete everything except the known-to-corrupt file patterns `.cshrc` and `*.properties`
- Snapshot
- Replicate that snapshot to the original pool
- Check for corruption
- If corruption reproduces, replicate the snapshot into a new dataset on the same pool
- Replicate that snapshot to the original pool
- Check for corruption
- If corruption still reproduces, I will pipe the source snapshot into a file and share it
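Sketched as commands, with `new_pool/intermediate`, the snapshot name, and the target dataset names all placeholders:

```sh
# Snapshot the pruned intermediate dataset.
zfs snapshot new_pool/intermediate@repro

# Replicate back to the original pool, then re-run the md5sum comparison.
zfs send new_pool/intermediate@repro | zfs recv old_pool/repro_test

# If it reproduces: replicate into a new dataset on the same pool and re-check.
zfs send new_pool/intermediate@repro | zfs recv new_pool/repro_copy

# If it still reproduces: dump the raw send stream to a file to share.
zfs send new_pool/intermediate@repro > /tmp/repro.zstream
```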
The corrupted file content is, in fact, the LZ4-compressed data stored as-is. During replication, ZFS changes the compression mode in the block pointer as requested but does not actually recompress the data. That is why there is no problem when you copy to the LZ4-compressed dataset: no recompression is needed. I wonder if the problem persists when the target dataset uses LZJB compression.
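A minimal way to run that test, reusing the placeholder names from earlier in the thread:

```sh
# Receive the same snapshot into a dataset whose compression property is
# forced to LZJB at receive time via -o.
zfs send new_pool/intermediate@repro | \
    zfs recv -o compression=lzjb old_pool/repro_lzjb
```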
I will start this now.