transient data corruption with ECC

George Kyriazis · Jun 15, 2020

Hello,

Our FreeNAS server is a Xeon Gold 6140 with 256GB of memory and 2 raidz2 vdevs of 6 disks each, running 11.2-U8. The ARC is using about 200GB; there is no L2ARC. Uptime is 48 days.

One of our users ran into a peculiar problem:

Some files shared through NFS show data corruption. Parts of the files contained blocks of zeros. He also stores checksums separately, and his checksums did not match. He tried unmounting the NFS share in question and remounting it, but the error did not go away. Surprisingly, overnight the error fixed itself, and the data file(s) are now read correctly.

Logs do not show any ECC errors, no alarms are raised, and "zpool status" doesn't show any problems.

The fact that the problem "fixed itself" (besides being unnerving) shows that the data on disk is intact. Since parts of the files were zeros, this suggests that it is not an ECC problem but rather a problem in the ARC, maybe? However, I really doubt that a bug of that magnitude could still exist.

Has anybody seen anything like that?

Thanks,

George

Heracles · Jun 15, 2020

Hey George,

From what you described, I suspect the problem is client side. Some data did not make it and was zeroed by the client. When the file was accessed again, it was downloaded again.

ZFS is very robust when it comes to data integrity. From what you described, the copy on disk (the one managed by ZFS) is clean and has always been so.

George Kyriazis · Jun 15, 2020

Thank you Heracles.

Let me re-describe the problem; hopefully it will help. I don't think it's a client-side problem.

Here's what my user told me:

Let machine A be the "write" machine.
Let machine B be the "read" machine.

"A" wrote several files into the NFS share. Checksums computed on machine "A" are correct.
(let some time pass)
"B" reads bad data from NFS share
Unmount NFS share on "A". Remount.
Reread file from "A". File has wrong checksum
(following day)
Both "A" and "B" have correct checksum.

Hope that helps.
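
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the check could be scripted on both machines. It hashes a file and records the offsets of any all-zero blocks, so the output from "A" and "B" can be compared side by side. The function name, the SHA-256 choice, and the 4096-byte block size are assumptions for illustration only; they are not what the user actually ran, and the block size is unrelated to the ZFS recordsize.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed scan granularity, not the ZFS recordsize


def checksum_and_zero_blocks(path, block_size=BLOCK_SIZE):
    """Return (sha256 hex digest, list of byte offsets of all-zero blocks)."""
    digest = hashlib.sha256()
    zero_offsets = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digest.update(block)
            # A block counts as zeroed only if every byte in it is 0.
            if not any(block):
                zero_offsets.append(offset)
            offset += len(block)
    return digest.hexdigest(), zero_offsets
```

Running this on the same file from "A" and from "B" and comparing both the digest and the offset list would show whether the two clients see the same zeroed regions. Note that a file can legitimately contain zero blocks, so the offsets are only meaningful relative to a known-good copy.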

Heracles · Jun 15, 2020

Hey George,

Again, pretty sure the problem is client side. Here, that problem is probably SYNC.

Remember the old days of floppy drives? You could copy a file to a floppy and it was always lightning fast. As soon as you entered the "cp" command, the prompt was back and you could list the file in the directory. Read it back and no problem, it was there.

But should you eject the floppy right away and move it to a second computer, there was nothing on it. You had to either unmount or sync the drive before removing it.

The reason is that the client performs the operation only in RAM; it is not committed to disk. As long as the file is accessed from that same client, everything looks fine, because the client knows that the file has not been saved to the drive yet.

So in your case, client A did not force the sync to the share, and it did not happen by itself either. That is why the content was not there yet and read as zeros.

So yes, I am pretty sure the root cause is on the client side, which did not complete the original sync but managed to do it at a later time.
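
To rule the sync theory in or out, the writer could force its data out explicitly before anyone else reads the file. Below is a minimal sketch, assuming the writing program is (or can be wrapped in) Python; the helper name is hypothetical. On an NFS mount, fsync asks the client to commit its dirty data to the server, though the exact behaviour depends on the client and the mount options.

```python
import os


def write_and_sync(path, data):
    """Write bytes to path and ask the OS to commit them before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to commit the data to storage
    return len(data)
```

If files written this way still read back with zeroed regions on another machine, an unflushed write on the client becomes a much less likely explanation.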

George Kyriazis · Jun 15, 2020

Hello again,

I am well aware of the floppy disk scenario.

However, I don't think it's the same issue here.

If it were a "sync" issue, then unmounting the share on machine "A" (the "write" machine) would have flushed all the buffers. The subsequent mount forces the files to be read afresh from the server. So reading bad data after the remount means that the server provided that bad data.

Fredda · Jun 15, 2020

I've also run into the same situation in very rare cases. Most of the time, though, it was not the "write" machine that had the problem but the read machine. In that scenario, data was written on machine "A"; reading it on machine "B" gave incorrect data, while reading it on machine "C" gave the correct results.

If the machine in question runs Linux, you can drop its page cache (which holds the cached NFS data) by running the following command as root:
sync; echo 3 > /proc/sys/vm/drop_caches
