SOLVED: NFS Share I/O Errors after receiving snapshot replication

pflaugh

Cadet
Joined
Aug 10, 2018
Messages
6
I have a bit of a puzzle. I have done tons of searching, both here and on the subreddit, and have come up empty-handed so far.

Background and setup: Two TrueNAS boxes. One is the production NAS/"SAN" used for hypervisor storage, etc. The second box is a local replication target as part of my backup strategy. Every night at midnight, the prod box snapshots various datasets and replicates those snapshots to the second box.

I am running Duplicati to back up critical portions of this replicated data on the second box to cloud storage. Duplicati runs in a VM, and I have one of the replicated datasets mounted via NFS so Duplicati can access it as a source.

The Problem: Every night at midnight, the NFS mount on my Duplicati VM starts failing and throwing I/O errors. I have tried remounting the share, but the TrueNAS box denies the mount request. Restarting the NFS services fixes it until the next day; no configs change, just a service restart. (I am running 13.0-U5.1.) I suspect this is somehow connected to the snapshot replication, but I have never run into this issue before. To rule out my Duplicati VM, I have a second Alpine VM with the same share mounted, and it shows the same problems.

Things I have tried: I have made sure permissions are all correct, tried different combinations of maproot/mapall settings, used the IP instead of the hostname in the mount config, and confirmed the networking settings on the TrueNAS box are correct.

Please let me know if anyone has run into this before, or if you have any other ideas for troubleshooting. I have thought of mounting the data from the prod box instead of the local replica, but I want to avoid the additional load on my prod box from remote backup handling.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I have a pool that I export over NFS, and I update part of that pool from snapshots of another pool. I don't access that part over NFS often, and even then the access is automated and the share is usually unmounted, so I'm not surprised I haven't seen this issue.

I would try remounting the NFS share after the replication is complete. You might need to restart Duplicati as well. It could be a stale NFS file handle. Check the logs: application, kernel, and system.
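As a rough test, something like this run on the Duplicati VM once replication has finished might do it (untested sketch; the export, mountpoint, and service name are just placeholders for whatever you actually use):

Code:
#!/usr/bin/env python3
# Rough sketch: remount a possibly-stale NFS share, then restart Duplicati.
# The export, mountpoint, and service name are placeholders.
import subprocess

SHARE = "truenas-backup:/mnt/tank/replica"   # placeholder NFS export
MOUNTPOINT = "/mnt/replica"                  # placeholder local mountpoint

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=False)

# Force-unmount in case the old file handles are stale, then remount read-only.
run(["umount", "-f", MOUNTPOINT])
run(["mount", "-t", "nfs", "-o", "ro", SHARE, MOUNTPOINT])

# Restart Duplicati so it picks up the fresh mount (service name is a guess;
# adjust to however Duplicati runs on your VM).
run(["systemctl", "restart", "duplicati"])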

It may be something to do with internal NFS IDs changing after replication, affecting the NFS export or mount properties.

Try creating some small test datasets, snapshots, and NFS connections, and see if you can reproduce the issue at your leisure; check the system and kernel logs. You might even have to get Duplicati to try a backup. Check the NFS server logs too.
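If you want to script the reproduction, something along these lines could drive the loop (just a sketch; pool and dataset names are made up, and it assumes root on a throwaway test box):

Code:
#!/usr/bin/env python3
# Sketch of a reproduction loop: create a source dataset, snapshot it,
# replicate onto a target dataset that is exported over NFS, then check
# from an NFS client whether reads still work. Names are placeholders.
import subprocess
import time

SRC = "tank/nfstest-src"    # placeholder source dataset
DST = "tank/nfstest-dst"    # placeholder replication target (the NFS export)

def sh(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

sh(f"zfs create -p {SRC}")
sh(f"dd if=/dev/urandom of=/mnt/{SRC}/file1 bs=1M count=10")

for i in range(3):
    snap = f"{SRC}@test{i}"
    sh(f"zfs snapshot {snap}")
    # Incremental sends would be closer to real replication; a full
    # send into recv -F is enough to see whether the export misbehaves.
    sh(f"zfs send {snap} | zfs receive -F {DST}")
    time.sleep(5)
    # Meanwhile, on the NFS client: read files under the export and watch
    # for EIO / "stale file handle" in the client's dmesg.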

You might even use tcpdump and Wireshark to inspect the NFS traffic, looking at the various IDs and what changes after replication.
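For example, a capture around the replication window could look like this (interface name and output path are placeholders; open the file in Wireshark afterwards):

Code:
#!/usr/bin/env python3
# Sketch: capture NFS traffic (port 2049) during the replication window
# for later inspection in Wireshark. Interface and path are placeholders.
import subprocess

subprocess.run([
    "tcpdump",
    "-i", "vtnet0",                   # placeholder network interface
    "-s", "0",                        # capture full packets
    "-w", "/tmp/nfs-replication.pcap",
    "port", "2049",
])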

You might also try it between two non-TrueNAS VMs where you have full control of the ZFS and NFS activity and operation.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
NFS works on the inode/vnode level. These structures are emulated by ZFS for POSIX compatibility. If they change by replicating a younger snapshot on top of an existing dataset, that might well be the cause of your problem.
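One way to check this (a rough sketch; the mountpoint is a placeholder): record the inode numbers on the NFS mount before the nightly replication and compare afterwards. If they change, the file handles your client has cached no longer point at valid objects, which would explain the I/O errors.

Code:
#!/usr/bin/env python3
# Rough check: compare inode numbers on the NFS mount before and after the
# nightly replication. Changed inodes mean the client's cached file handles
# no longer refer to valid objects. MOUNT is a placeholder path.
import json
import os
import sys

MOUNT = "/mnt/replica"      # placeholder: NFS mountpoint on the client VM
STATE = "/tmp/inodes.json"  # where the "before" snapshot is stored

def collect_inodes():
    inodes = {}
    for root, _dirs, files in os.walk(MOUNT):
        for name in files:
            path = os.path.join(root, name)
            try:
                inodes[path] = os.stat(path).st_ino
            except OSError as err:
                print(f"stat failed for {path}: {err}")
    return inodes

if sys.argv[1:] == ["record"]:
    with open(STATE, "w") as fh:
        json.dump(collect_inodes(), fh)
elif sys.argv[1:] == ["compare"]:
    with open(STATE) as fh:
        before = json.load(fh)
    after = collect_inodes()
    for path, ino in before.items():
        if after.get(path) != ino:
            print(f"{path}: inode {ino} -> {after.get(path)}")
else:
    print("usage: inode_check.py record|compare")

Run it with "record" before midnight and "compare" the next morning.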
 

pflaugh

Cadet
Joined
Aug 10, 2018
Messages
6
NFS works on the inode/vnode level. These structures are emulated by ZFS for POSIX compatibility. If they change by replicating a younger snapshot on top of an existing dataset, that might well be the cause of your problem.
This makes sense. Is there a different file-sharing protocol that would be better for this use case? I might try out a few: SFTP, etc.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Sorry, no idea. I have come to never live-share or otherwise actively use a replication target. I use them as cold standby only and start/enable services, VMs, jails, etc. only if the production node fails.

I also had the impression that a replication target dataset must necessarily be kept in read-only state, because changes on the target would break replication. I might be wrong here.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
I also had the impression that a replication target dataset must necessarily be kept in read-only state, because changes on the target would break replication. I might be wrong here.
TNS advises using R/O when setting up replication, and IIRC there is an option to select it. It can also be set up to roll back any changes to the last common snapshot when replication starts.

@pflaugh Did you consider unmounting and remounting the file system, as a rough test, before Duplicati starts? Or even unmounting before replication starts. Maybe you could use another NFS mount as a signal vector, or just ssh to the Duplicati server and tell it to remount and start after the replication finishes. Even if you use SFTP, which would add the CPU cost of an encryption layer even with compression turned off, it seems you still need to synchronize the replication completion with the Duplicati start.
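As a sketch of the ssh idea (hostname, mountpoint, and the way you kick off the Duplicati job are all placeholders, and I haven't tested it), something like this could run on the backup box as a post-replication step:

Code:
#!/usr/bin/env python3
# Sketch of a post-replication hook: ssh to the Duplicati VM, remount the
# share, then kick off the backup job. Hostname, mountpoint, and the job
# trigger command are placeholders; adjust to your setup.
import subprocess

DUPLICATI_VM = "duplicati-vm"     # placeholder ssh host (key-based auth assumed)
MOUNTPOINT = "/mnt/replica"       # placeholder mountpoint on the VM
START_JOB = "systemctl start duplicati-backup.service"  # placeholder job trigger

remote_script = f"""
umount -f {MOUNTPOINT} || true
mount {MOUNTPOINT}   # relies on an fstab entry for the share
{START_JOB}
"""

subprocess.run(["ssh", DUPLICATI_VM, remote_script], check=True)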
 

pflaugh

Cadet
Joined
Aug 10, 2018
Messages
6
In this case I am actually already using the R/O option on the NFS share, on top of mounting it as R/O from the Duplicati VM.

I have started looking into automating the (un)mounting and the Duplicati job around the replication window, along the lines you mention.

I will have to test whether unmounting before the replication avoids the I/O errors. I'll post back here with the results.
 

pflaugh

Cadet
Joined
Aug 10, 2018
Messages
6
An interesting development: I set up a test TN SCALE box and attempted to reproduce the same errors. I did get "stale file handle" errors during the replication process, but once the replication was complete everything worked as expected. So, at least initially, it appears that TNS exhibits different behavior than TNC in this situation. I will test this further to confirm.
 

pflaugh

Cadet
Joined
Aug 10, 2018
Messages
6
Things got busy for me and I just got around to wrapping this up. I have migrated that backup box to TNS Bluefin and everything is working fine, so I suspect the way the Linux kernel serves NFS exports is more conducive to my niche use case.

Thank you all again for your input.
 