Copied from: https://forum.proxmox.com/threads/nfs-problems.100760/post-436869
I'm not sure whether this is a Proxmox or a TrueNAS/Ganesha issue, but I just switched my TrueNAS Core server over to TrueNAS Scale (22.02-RC.1-2, kernel 5.10.70+truenas) for testing and ran into a similar issue.
I mount an NFS share (NFSv4 with Kerberos) in Proxmox, and my VM backup job targets this storage. The share mounts successfully and I can see its content with "ls" and in the Proxmox Web GUI. Copying a file from the NFS share to local storage works fine. But my backups stop after "INFO: transferred XX GiB in xx seconds" and before getting to the "INFO: archive file size: XX GiB" line.
The zst archive is never created, but a .dat file is present on the share (slightly larger than the local zst backup). In this state, trying to "ls" the share hangs as well, and listing the contents in the Proxmox Web GUI does not work either.
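To see which processes are stuck while the share hangs, one option is to look for tasks in uninterruptible sleep (state D), which is the state the rsync task is reported in by the hung-task message further down in this post. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a Linux client where `/proc/<pid>/stat` has the usual `pid (comm) state ...` layout.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the content of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and on the last ')'.
    pid_part, rest = line.split("(", 1)
    comm, after = rest.rsplit(")", 1)
    return int(pid_part), comm, after.split()[0]

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every task currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling `blocked_tasks()` on the client while "ls" hangs should list the hanging commands, if they are in fact blocked in the kernel rather than just slow.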
Some log lines:
Code:
kernel: rpc_check_timeout: 35 callbacks suppressed
nfs: server truenas01.ipa.mydomain.com not responding, still trying
nfs: RPC call returned error 512
# this one could have been when I force unmounted the share:
pvedaemon[2189736]: Warning: unable to close filehandle GEN21432 properly: Bad file descriptor at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 764.
It also hangs if I create a backup on local storage and then try to rsync/move it to the NFS share manually. It seems to get stuck after the transfer, during validation: the file has the same size as the source but is still shown with a temporary filename, i.e. ".<filename>.XXXXX".
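When I was still on TrueNAS Core this ran fine, so it does not seem to be a network or hardware issue (a compatibility problem, maybe?). I can also push and pull files to and from the same NFS share, using the same mount options, without an issue from a current Arch install (kernel 5.15.7), but it also hangs on a current CentOS 8 Stream install (4.18.0-348.2.1.el8_5.x86_64, actually a VM on the Proxmox host).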
Two backups did finish while I was trying various things, but I could not reproduce that, and usually the share just hangs. I tried mounting with NFSv4, 4.1 and 4.2, and I tried adding the soft mount option, but nothing changed.
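It is as if the information that the transfer has finished never reaches the client (my uneducated guess). There have been many NFS-related kernel commits since 5.13.x, so maybe that is why it works with a kernel 5.15.7 client.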
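One way to narrow down where the client gets stuck is to time the write, fsync and close phases of a copy separately. This is a rough sketch rather than a proper test tool: `timed_copy` is a made-up helper name, and it prints the phase name before entering each phase so that the last line printed shows where a hang occurs.

```python
import os
import time

def timed_copy(src, dst, chunk_size=1 << 20):
    """Copy src to dst; return seconds spent in the write, fsync and close phases."""
    timings = {}
    fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        print("write ...", flush=True)
        start = time.monotonic()
        with open(src, "rb") as fin:
            while True:
                chunk = fin.read(chunk_size)
                if not chunk:
                    break
                view = memoryview(chunk)
                while view:
                    # os.write may write fewer bytes than requested
                    view = view[os.write(fd, view):]
        # includes the time spent reading the source file
        timings["write"] = time.monotonic() - start

        print("fsync ...", flush=True)
        start = time.monotonic()
        os.fsync(fd)  # ask the kernel to push dirty pages to the server
        timings["fsync"] = time.monotonic() - start
    except BaseException:
        os.close(fd)
        raise

    print("close ...", flush=True)
    start = time.monotonic()
    os.close(fd)
    timings["close"] = time.monotonic() - start
    return timings
```

Running it with a local source file and a destination path on the NFS mount (both paths depend on the setup) would show whether the data transfer itself completes and the hang only starts at fsync or close.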
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-4.2.6-1-pve: 4.2.6-36
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
edit:
I just ran another test with the kernel 5.15.7 client, and it got stuck as well when uploading.
First I downloaded a ~60 GB file from the share, which worked fine. I then re-uploaded the same file to the share (different path/name), and when the transfer was done (local size = size on the share) it got stuck. The client log shows:
Code:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: INFO: task rsync:7398 blocked for more than 122 seconds.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: Tainted: G OE 5.15.7 #1
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: task:rsync state:D stack: 0 pid: 7398 ppid: 7397 flags:0x00000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: Call Trace:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: <TASK>
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: __schedule+0x30f/0x930
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: schedule+0x59/0xc0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: io_schedule+0x42/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: wait_on_page_bit_common+0x10e/0x390
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: ? filemap_invalidate_unlock_two+0x40/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: wait_on_page_writeback+0x22/0x80
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: __filemap_fdatawait_range+0x8b/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: filemap_write_and_wait_range+0x85/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: nfs_wb_all+0x22/0x120 [nfs]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: nfs4_file_flush+0x6b/0xa0 [nfsv4]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: filp_close+0x2f/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: __x64_sys_close+0xd/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: do_syscall_64+0x38/0x90
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RIP: 0033:0x7f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RSP: 002b:00007ffe4b97c888 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RAX: ffffffffffffffda RBX: 00007f2d90d76fe8 RCX: 00007f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffe4b97c9a0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R13: 00000000ffffffff R14: 00007ffe4b97d9a0 R15: 0000000000000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: </TASK>
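Read from the bottom up, the trace above shows rsync inside the close() syscall: `filp_close` calls `nfs4_file_flush`, which calls `nfs_wb_all`, and the task then waits for page writeback to complete. To compare traces from several hangs, the frame names can be pulled out of the log text with a small helper. This is a sketch only: `reliable_frames` is a made-up name, and it assumes frames look like `name+0xOFF/0xSIZE`, with a leading "?" marking entries the kernel considers unreliable.

```python
import re

# Matches e.g. "nfs_wb_all+0x22/0x120", optionally preceded by "? ".
FRAME = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

def reliable_frames(log_text):
    """Return the function names in a kernel call trace, innermost first.

    Frames prefixed with "?" are skipped.
    """
    return [m.group(2) for m in FRAME.finditer(log_text) if m.group(1) is None]
```

If `nfs4_file_flush` and `nfs_wb_all` appear in every trace, that would suggest the hangs always happen while the client flushes dirty pages at close, though it does not show which side is at fault.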