RC 22.02-RC.1-2 NFS Problems

xenu

Dabbler
Joined
Nov 12, 2015
Messages
43
Copied from: https://forum.proxmox.com/threads/nfs-problems.100760/post-436869

Not sure if this is a Proxmox or TrueNAS/Ganesha issue, but I just switched my TrueNAS Core server over to TrueNAS Scale (22.02-RC.1-2, kernel 5.10.70+truenas) for testing and ran into a similar issue.
I mount an NFS share (NFSv4 with Kerberos) in Proxmox, and my VM backup job targets this storage. The share is mounted successfully and I can see content with "ls" and the Proxmox Web GUI. Copying a file from the NFS share to local storage works fine. But my backups stop after the "INFO: transferred XX GiB in xx seconds" line and before getting to the "INFO: archive file size: XX GiB" line.
The zst archive is never created, but a .dat file is present on the share (slightly larger than the local zst backup). In this state, running "ls" on the share hangs as well, and listing the contents in the Proxmox Web GUI does not work either.
Some log lines:
Code:
kernel: rpc_check_timeout: 35 callbacks suppressed
nfs: server truenas01.ipa.mydomain.com not responding, still trying
nfs: RPC call returned error 512
#  this one could have been when I force unmounted the share:
pvedaemon[2189736]: Warning: unable to close filehandle GEN21432 properly: Bad file descriptor at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 764.

It also hangs if I create a backup on local storage and then try to rsync/move it to the NFS share manually. It seems to get stuck after the transfer, during validation: the file has the same size as the source but is still shown with a temp filename, i.e. ".<filename>.XXXXX".
When I was still on TrueNAS Core this ran fine, so it does not seem to be a network or hardware issue (compatibility, maybe?). Also, I can push and pull files to/from the same NFS share using the same mount options without an issue from a current Arch install (5.15.7), but it also hangs on a current CentOS 8 Stream install (4.18.0-348.2.1.el8_5.x86_64, actually a VM on the Proxmox host).
I had two backups finish while trying various things but could not reproduce that, and usually the share just hangs. I tried mounting with NFSv4, 4.1 and 4.2, and tried adding the soft mount option, but no change.
It is as if the information that the transfer is finished never reaches the client (my uneducated guess).
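
A minimal sketch for testing that guess without wedging the shell you are working in: it writes a file from a child process and only polls the child, so the caller gets an answer even if the write or close never returns. The function name `probe_write`, the default size, and the deadline are all assumptions made for illustration. A small probe file may well not trigger the hang, since the reports here involve multi-GB transfers.

```python
import subprocess
import sys
import time

# Child script: write some data, flush it, and close the file.
CHILD = """
import os, sys
path, size = sys.argv[1], int(sys.argv[2])
with open(path, "wb") as f:
    f.write(b"\\0" * size)
    f.flush()
    os.fsync(f.fileno())
"""

def probe_write(path, size=1 << 20, deadline=30.0):
    """Return True if a write+close to `path` finishes within `deadline` seconds."""
    child = subprocess.Popen([sys.executable, "-c", CHILD, path, str(size)])
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        rc = child.poll()
        if rc is not None:
            return rc == 0
        time.sleep(0.1)
    # Deliberately no wait() here: a task stuck in uninterruptible sleep
    # (state D) may not exit until the server answers.
    child.kill()
    return False
```

Called as `probe_write("/path/on/the/share/probe.bin", size=1 << 30)` (the path is a placeholder), a `False` result would suggest that the write never completed from the client's point of view.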
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-4.2.6-1-pve: 4.2.6-36
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3


There have been many kernel commits regarding NFS since 5.13.x so maybe that is why it works with a kernel 5.15.7 client.

edit:
Just did another test with the kernel 5.15.7 client, and it got stuck as well when uploading.
First I downloaded a ~60GB file from the share, which worked fine. I then re-uploaded the same file to the share (different path/name), and when the transfer was done (size local = size on share) it got stuck. The client log shows:

Code:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: INFO: task rsync:7398 blocked for more than 122 seconds.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:       Tainted: G           OE     5.15.7 #1
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: task:rsync           state:D stack:    0 pid: 7398 ppid:  7397 flags:0x00000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: Call Trace:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  <TASK>
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __schedule+0x30f/0x930
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  schedule+0x59/0xc0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  io_schedule+0x42/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  wait_on_page_bit_common+0x10e/0x390
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  ? filemap_invalidate_unlock_two+0x40/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  wait_on_page_writeback+0x22/0x80
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __filemap_fdatawait_range+0x8b/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  filemap_write_and_wait_range+0x85/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  nfs_wb_all+0x22/0x120 [nfs]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  nfs4_file_flush+0x6b/0xa0 [nfsv4]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  filp_close+0x2f/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __x64_sys_close+0xd/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  do_syscall_64+0x38/0x90
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RIP: 0033:0x7f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RSP: 002b:00007ffe4b97c888 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RAX: ffffffffffffffda RBX: 00007f2d90d76fe8 RCX: 00007f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffe4b97c9a0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R13: 00000000ffffffff R14: 00007ffe4b97d9a0 R15: 0000000000000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  </TASK>
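
The trace above shows rsync blocked inside close(), in nfs4_file_flush and nfs_wb_all, waiting for page writeback. A rough sketch for pulling that summary out of a journal dump could look like the following. The regular expressions are written against the log lines quoted in this post only, and `summarize_hung_task` is a made-up helper name, so other kernels or log formats may need adjustments.

```python
import re

# "INFO: task rsync:7398 blocked for more than 122 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# " nfs_wb_all+0x22/0x120 [nfs]" (module name in brackets is optional)
FRAME_RE = re.compile(
    r"kernel:\s+(?:\? )?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)

def summarize_hung_task(lines):
    """Return task name, pid, blocked seconds and the NFS frames of the last report."""
    report = None
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            report = {
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "secs": int(m["secs"]),
                "nfs_frames": [],
            }
            continue
        f = FRAME_RE.search(line)
        # Keep only frames that belong to the nfs / nfsv4 kernel modules.
        if report is not None and f and f["mod"] in ("nfs", "nfsv4"):
            report["nfs_frames"].append(f["func"])
    return report
```

Feeding it the lines above should give a report for rsync with the two NFS frames, which makes it easier to compare hangs across clients.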
 

xenu

Dabbler
Joined
Nov 12, 2015
Messages
43

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Quick update: it appears to be caused by using sec=krb5p for the NFS share. krb5 and krb5i work fine. Possibly a Ganesha issue.
Can you report a bug... thanks.
 

xenu

Dabbler
Joined
Nov 12, 2015
Messages
43
Will do. Further testing showed that both krb5p and krb5i run into this issue. So far only sec=krb5 never got stuck.
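
For anyone wanting to check which of their mounts fall into the affected group, here is a small sketch that reads mount options in /proc/mounts format and flags NFS mounts using sec=krb5i or sec=krb5p. The helper names and the example paths are assumptions, and the "affected" set only reflects what was observed in this thread, not a confirmed root cause.

```python
NFS_TYPES = {"nfs", "nfs4"}
# Flavors that got stuck in the tests reported in this thread.
AFFECTED_SEC = {"krb5i", "krb5p"}

def nfs_security_flavors(mounts_text):
    """Map each NFS mount point to its sec= option (None if not listed)."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in NFS_TYPES:
            continue
        opts = dict(o.partition("=")[::2] for o in fields[3].split(","))
        result[fields[1]] = opts.get("sec")
    return result

def flagged_mounts(mounts_text):
    """Return mount points whose sec= flavor is in AFFECTED_SEC."""
    flavors = nfs_security_flavors(mounts_text)
    return [mp for mp, sec in flavors.items() if sec in AFFECTED_SEC]
```

On a client this could be run as `flagged_mounts(open("/proc/mounts").read())`. Remounting a flagged share with sec=krb5 would then be the workaround to try, based on the results so far.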
 