NFS4 broken after upgrade to TrueNAS 12

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
I have a TrueNAS Mini XL that is about two years old now. Last night I upgraded from the latest FreeNAS 11 (sorry, I don't remember the exact version) to TrueNAS 12.0-U1. Prior to the upgrade I had several NFS shares, all exported over NFS version 4 and all working; I was even using one of them for Kubernetes persistent volumes. After the upgrade, none of my shares will mount, either through Kubernetes or through autofs on my Ubuntu 20.04 clients (these are my only clients). So I started testing manually, but the mount just hangs with the following output:

# mount -v 10.12.22.60:/mnt/data/k8s/ /mnt
mount.nfs: timeout set for Sun Jan 3 13:22:21 2021
mount.nfs: trying text-based options 'vers=4.2,addr=10.12.22.60,clientaddr=10.12.22.39'
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'vers=4.1,addr=10.12.22.60,clientaddr=10.12.22.39'

Interestingly, if I specify the option vers=4, the mount succeeds immediately:

# mount -v -o vers=4 10.12.22.60:/mnt/data/k8s/ /mnt
mount.nfs: timeout set for Sun Jan 3 13:24:15 2021
mount.nfs: trying text-based options 'vers=4,addr=10.12.22.60,clientaddr=10.12.22.39'
root@node2:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
10.12.22.60:/mnt/data/k8s 13T 8.0G 13T 1% /mnt

I have tried using both the FQDN and the IP address to eliminate any potential issue with DNS, but this has worked for years, so I'm hesitant to go chasing problems that don't exist. I did notice the output of 'rpcinfo -p' does not show NFS version 4:

# rpcinfo -p 10.12.22.60
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100000 4 7 111 portmapper
100000 3 7 111 portmapper
100000 2 7 111 portmapper
100005 1 udp 717 mountd
100005 3 udp 717 mountd
100005 1 tcp 717 mountd
100005 3 tcp 717 mountd
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100024 1 udp 773 status
100024 1 tcp 773 status
100021 0 udp 744 nlockmgr
100021 0 tcp 874 nlockmgr
100021 1 udp 744 nlockmgr
100021 1 tcp 874 nlockmgr
100021 3 udp 744 nlockmgr
100021 3 tcp 874 nlockmgr
100021 4 udp 744 nlockmgr
100021 4 tcp 874 nlockmgr

Checking my exports on the NAS:

# cat /etc/exports
V4: / -sec=sys
/mnt/data/k8s -maproot="root":"wheel"

The service configuration on the NAS:

[screenshot: the NFS service settings from the TrueNAS UI]


I've been trying different settings here; the original number of servers was 4, and I don't normally allow UDP or do any logging, but I was hoping that changing these during testing would shed some light. Unfortunately, the logs aren't incredibly helpful. Here's a snippet:

Jan 3 12:47:26 freenas nfsd: can't register svc name
Jan 3 13:02:51 freenas 1 2021-01-03T13:02:51.037773-07:00 freenas.example.com mountd 22808 - - can't open /etc/zfs/exports
Jan 3 13:02:53 freenas 1 2021-01-03T13:02:53.184244-07:00 freenas.example.com mountd 23313 - - can't open /etc/zfs/exports
Jan 3 13:02:53 freenas nfsd: can't register svc name
Jan 3 13:02:53 freenas kernel: NLM: local NSM state is 21

I spent some time looking into these and mainly found old posts saying these are cosmetic issues. The "nfsd: can't register svc name" message concerns me; that seems like a thread to pull. For the "can't open /etc/zfs/exports" error, various posts suggest either touching the file or copying /etc/exports over it; I can confirm the file did not exist, and I have tried both of those workarounds. I do use LACP and jumbo frames in my configuration, but again, this has been working for some time, so I don't believe the issue is related to either. I'm more of a Linux admin and I've reached the end of my BSD knowledge, and I hate troubleshooting NFS (regardless of OS); it always seems like there are no helpful logs. If someone could point me in the right direction, or any direction for that matter, any help would be greatly appreciated.
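For reference, these are the two workarounds from those older posts that I tried on the NAS (neither changed anything for me):

Code:
# create the missing file mountd complains about
touch /etc/zfs/exports
# or seed it from the generated exports file instead
cp /etc/exports /etc/zfs/exports
# then restart the NFS service from the web UI so mountd re-reads its exports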
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
To clarify: specifying the vers=4 option allows me to mount the filesystem on some hosts, but not others. It also seems that if the filesystem is mounted and accessed for a brief period of time, it eventually disconnects and the mount goes stale, so this is not a valid workaround.
 

KrisBee

Wizard
Joined
Mar 20, 2017
Messages
1,288
Just wondering if you upgraded without unmounting the NFS shares on all your clients. Did you google that "NLM: local NSM state is 21" error?
Did you try unmounting all clients and restarting the FreeNAS NFS server?

The "nfsd: can't register svc name" is definitely cosmetic, a result of using NFSv4 without kerberos.
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
@KrisBee it was definitely not a clean reboot; I did not unmount all of the shares across all of my clients, partly because there are simply too many, but also because I expected the protocol to handle things more cleanly. After all, it is a well-tested protocol, and I would have thought it would handle an unclean disconnect better. I did not spend much time on that error; I didn't realize NLM/NSM were related to NFS at all (too many acronyms). Is there any way to reset the locks on the server side (at least that's what I gather NLM/NSM handle, from a cursory look) without having to turn off all of my clients? I have already rebooted most of the clients and the server, to no avail.
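From what I can tell, the moving pieces on the FreeBSD side are rpc.lockd (NLM) and rpc.statd (NSM), so resetting lock state would presumably mean bouncing those; a rough sketch only, since TrueNAS normally manages these via the middleware and restarting the NFS service from the web UI is the supported route:

Code:
# restart the lock/status daemons and nfsd on the NAS (sketch, not verified)
service lockd restart
service statd restart
service nfsd restart
# rpc.statd keeps its monitor state in /var/db/statd.status on FreeBSD; clearing
# that would presumably only be safe with all clients unmounted and nfsd stopped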
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
As a follow-up question, are there NFS mount options that would help the client side handle this more cleanly? I figured out how to inject the vers=4 option into my Kubernetes volumes (which didn't work, but was a good learning experience for k8s), so it would be simple to add another option. Another possibility would be to use autofs for my k8s volumes: configure each host to mount /net/... and point k8s at a host path that matches (roughly as sketched below). Then k8s wouldn't be handling the mounting of the filesystems at all; the host OS would, via autofs. Would that be cleaner?
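For the record, this is roughly what the autofs variant would look like on an Ubuntu client, using the built-in -hosts map so /net/<server>/<export> mounts on demand (package and file names as on Ubuntu 20.04; the timeout is just an example value):

Code:
apt install autofs
# enable the "hosts" map for on-demand /net/<server>/<export> mounts
echo "/net -hosts --timeout=60" >> /etc/auto.master
systemctl restart autofs
# first access triggers the automount
ls /net/10.12.22.60/mnt/data/k8s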
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
I'm getting further: I was able to disconnect all clients and reboot the server. The local NSM state is now 0 and I'm no longer seeing the NSM-related messages in the logs. I am now able to manually mount with version 4.1; however, rpcinfo still does not display version 4 for nfs. Should it? It does for my non-FreeNAS NASes. Unfortunately, Kubernetes is now mounting the volumes, but they are unresponsive to the pods and to the host OS, similar to a stale NFS connection.
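(As far as I can tell, NFSv4 clients connect straight to TCP 2049 and don't need rpcbind or mountd, so its absence from rpcinfo may not mean anything. A sketch of server-side checks instead, assuming the FreeBSD 12 sysctl names:)

Code:
# which NFS versions the server is willing to serve
sysctl vfs.nfsd.server_min_nfsvers vfs.nfsd.server_max_nfsvers
# extended server-side stats, including NFSv4 operation counts
nfsstat -e -s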
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
An interesting tidbit here: some of my Kubernetes nodes are Raspberry Pis. These nodes, running Ubuntu 20.10, are able to mount the shares just fine, and the mounts are responsive. Unfortunately, only a few of my workloads have containers compiled for the ARM architecture at this point, but it's an interesting data point. I am currently running Ubuntu 20.04 on my problematic amd64 nodes (kernel 5.4.0-58-generic #64-Ubuntu) while the Pis are on 20.10 arm64 running kernel 5.8.0-1010-raspi #13-Ubuntu. This could be any of several things at this point: architecture, kernel, or newer packages. I am going to methodically update the kernel and test, then upgrade packages and test, leaving the architecture as the only remaining difference. Hopefully I get lucky.
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
Upgrading the kernel didn't help; with Ubuntu 20.04 and the 5.8 kernel on amd64 I have the same symptoms. At least with 5.8 the kernel logs an error message, "NFS4: Couldn't follow remote path". Not really helpful, but it did lead to this bug report, which might be related. I am going to upgrade one node to 20.10; however, I'm not incredibly hopeful.
 

nniehoff

Cadet
Joined
Jan 3, 2021
Messages
9
Another piece of information: even though I am now able to mount filesystems from an amd64 system, it seems I am unable to perform any real writes to the filesystem (note the count in the following dd runs):

# dd if=/dev/random of=test.img bs=1024 count=1
0+1 records in
0+1 records out
71 bytes copied, 0.00468038 s, 15.2 kB/s
# dd if=/dev/random of=test.img bs=1024 count=2
0+2 records in
0+2 records out
12 bytes copied, 46.5145 s, 0.0 kB/s

I've seen the other posts about TrueNAS 12 being slow, but this isn't slow; this doesn't work. From an arm64 system, this doesn't seem to be an issue:

# dd if=/dev/random of=test.img bs=1024 count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0760373 s, 13.8 MB/s

I'm not going for speed here, but the time taken on the Pi was negligible for 512x the number of blocks.
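Given the LACP and jumbo frames I mentioned earlier, one thing I still want to try is capping the NFS transfer size on the client and repeating the write test (these are standard nfs(5) mount options; the sizes are just examples):

Code:
# remount with small transfer sizes to see whether large RPCs are what stall
mount -v -o vers=4.1,rsize=8192,wsize=8192 10.12.22.60:/mnt/data/k8s /mnt
dd if=/dev/zero of=/mnt/test.img bs=1M count=16 oflag=direct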
 

cscutcher

Cadet
Joined
Aug 29, 2021
Messages
6
So I just sank two days trying to debug an NFS issue. I found this thread early on but dismissed it as unrelated; that said, I kept running into it over and over, and now that I've finally "fixed" my issue I figured I'd share what happened to me. Partly because I suspect something similar is going on here, and since there seems to be precious little information around this stuff, every tidbit I've stumbled upon has been a precious artifact!

I've been running FreeNAS-12.0-U1.1 since 2021-01-24. I've been running the same installation for at least a year before that too, but that was the last time I did any updates. I have a bunch of NFS shares, including a lot of NFSv4 shares with kerberos.

My pain started when upgrading to (TrueNAS) 12.0-U5.1. My NFSv4 shares became inaccessible from all the systems where they had been working before. I investigated and eliminated a whole bunch of theories:

  • The keytab was correct. I had both host/host.domain and nfs/host.domain principals. I regenerated the keytab to be sure.
  • Time was in sync
  • For a while I also chased the lack of NFS version 4 in rpcinfo. As it turns out, even when I "fixed" the problem, NFS version 4 still doesn't appear there. In my googling I did find evidence that it sometimes does appear, but apparently not on BSD / TrueNAS.
  • Removed all NFS shares/exports
I too had the "nfsd: can't register svc name" error. Another symptom: when I started gssd on TrueNAS with gssd -d -v, there was zero output. I know from past trauma that gssd should be chatting away during krb5-secured mounts, so this was suspicious, although ultimately it led nowhere.

In the end, the only way I managed to "fix" anything was to roll back to (activate) the previous FreeNAS-12.0-U1.1 boot environment. I've seen plenty of posts describing "nfsd: can't register svc name" as purely cosmetic, and perhaps that's true for setups without Kerberos, but at least in my case this error was definitely present on 12.0-U5.1 when NFS was broken, and gone when I restored U1.1 and NFS worked.

I really should raise a proper support ticket, but frankly I'm burnt out by the last few days of futile debugging and straw-clutching. I just wanted to put this here in case someone else in my situation finds these notes useful. I'll also share some commands that at least got me more information, not that any of it actually helped.

---

Getting debug from gssd on TrueNAS:

Code:
# Stop gssd service
service gssd stop

# Start gssd in foreground with debug output
gssd -v -d

# Restart gssd when done
service gssd start 


Increase verbosity of nfsd logging on TrueNAS:

Code:
sysctl vfs.nfsd.debuglevel=4


Check timesync

Code:
ntpq -p


I found these logs had the most interesting noise in them:

Code:
tail -F /var/log/messages /var/log/middlewared.log /var/log/daemon.log


Increase verbosity of NFS logging on Linux Clients

Code:
rpcdebug -m rpc -s all
rpcdebug -m nfs -s all


Confirm Kerberos tickets can be obtained from the keytab

Code:
ktutil -k /etc/krb5.keytab list

kinit -k host/<hostname.domain>
kinit -k nfs/<hostname.domain>
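And the manual krb5-secured mount from a Linux client that I'd test with while the client-side gssd is in debug mode (a minimal sketch: the server name and export path are placeholders, the rpc-gssd service name is as on Ubuntu, and it should be stopped before running the daemon in the foreground):

Code:
# run the client-side gssd in the foreground with maximum verbosity
systemctl stop rpc-gssd
rpc.gssd -f -vvv

# in another shell, attempt a krb5-secured NFSv4 mount
mount -t nfs4 -o sec=krb5,vers=4.1 <server>:/export /mnt/test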
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
In the end, the only way I managed to "fix" anything was to roll back to (activate) the previous FreeNAS-12.0-U1.1 boot environment. I've seen plenty of posts describing "nfsd: can't register svc name" as purely cosmetic, and perhaps that's true for setups without Kerberos, but at least in my case this error was definitely present on 12.0-U5.1 when NFS was broken, and gone when I restored U1.1 and NFS worked.

I really should raise a proper support ticket, but frankly I'm burnt out by the last few days of futile debugging and straw-clutching. I just wanted to put this here in case someone else in my situation finds these notes useful. I'll also share some commands that at least got me more information, not that any of it actually helped.

It would be good to make a case out of it.... if the engineering team need some debugs or want to provide a fix, we need someone who can reproduce the issue.

If anyone else can confirm and report the same issue, it would be useful.
 

cscutcher

Cadet
Joined
Aug 29, 2021
Messages
6
if the engineering team need some debugs or want to provide a fix, we need someone who can reproduce the issue.

That's totally understandable. I'm just a bit hesitant to spend more time rolling forward to break things again and collect diagnostic info, even though I appreciate that's not much help to anyone. The ole' homelab hobby has eaten up a bit too much time over the last few days!

Pulling stuff from the current system is easy enough, so I don't mind doing that, and I will try to go back and reproduce in a more controlled fashion once I've recharged a bit. But in either case I'm just not sure what would be useful to you guys.

I'm aware there's the generated debug tarball, and I even have one of those from when I was on U5.1, but a casual scan of its contents makes me a bit cautious about sharing the whole thing, certainly publicly, since it does contain data that I'd consider somewhat private.

I'm happy to raise a bug too, but as I say, I won't have much time for extensive debugging for a couple of weeks, so I didn't want to waste anyone's time raising a formal bug without being able to follow through.
 

xenu

Dabbler
Joined
Nov 12, 2015
Messages
43
I ran into the same issue, @cscutcher. I opened a similar thread after my first upgrade from U4.1 to U5; U5.1 did not fix it either.
There is also this bug report, which was originally about a different NFS issue but has someone else reporting NFS not working after upgrading to U5.
Unfortunately I could not figure out what the issue was, and I have no idea how to troubleshoot further, so I rolled back to U4.1 for the time being.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I ran into the same issue, @cscutcher. I opened a similar thread after my first upgrade from U4.1 to U5; U5.1 did not fix it either.
There is also this bug report, which was originally about a different NFS issue but has someone else reporting NFS not working after upgrading to U5.
Unfortunately I could not figure out what the issue was, and I have no idea how to troubleshoot further, so I rolled back to U4.1 for the time being.
Please report a bug and help with getting the diagnostics.
 

Forza

Explorer
Joined
Apr 28, 2021
Messages
81
My pain started when upgrading to (TrueNAS) 12.0-U5.1. My NFSv4 shares became inaccessible from all the systems where they had been working before. I investigated and eliminated a whole bunch of theories:
I've had a similar issue happen to me too when upgrading from U5.0 to U5.1. I could mount but not access files.

The solution was to move entries in "maproot" to "mapall".

example share:
[screenshot: the share's NFS settings with the Mapall User and Mapall Group fields set]
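Roughly what that change looks like in the generated /etc/exports on the NAS (the dataset path, user, and group here are just examples):

Code:
cat /etc/exports
V4: / -sec=sys
/mnt/data/k8s -mapall="someuser":"somegroup"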
 

Alphahydro

Cadet
Joined
Sep 18, 2021
Messages
1
It may not be a bug; it may be related to the credentials you entered to access your NFS share. I always used the built-in root user on FreeNAS until I updated to TrueNAS, which broke my shares. Share access via root was eliminated in TrueNAS for security purposes. Following another post about broken SMB shares, I created a new user in TrueNAS, went into the app that was accessing my NFS share (in my case, Nextcloud), entered the new user's credentials, and voila.
 