NFS/SMB Lock up with no error and 32GB of RAM

Hello.

I have (2) FreeNAS boxes: one with 40 TB and the other with 80 TB. Performance, when it's working, is amazing.

Lately, NFS/SMB access on the main NAS (the 80 TB E5-2630 v2 box with 32 GB of RAM) will lock up, and the box cannot be reset or powered off. I can still log in to FreeNAS and browse all of my datasets... I do not understand what's going on or even how to troubleshoot it. Again, NFS, SMB, syslog, dmesg, and the terminal screen put out no log entries or even a hint of an issue. I cannot predict when the problem will start because it's random; it always happens during the night, usually when I'm pushing ~500 MB/s of traffic through it. The interface is a Mellanox ConnectX-3 with 40 Gb fiber. I tried upgrading the host to 64 GB of RAM and the problem still occurs.

- I tried switching the NIC out for another known-working ConnectX-3 40 Gb card. No dice.
- I tried switching the NIC to a quad-port Intel NIC, and that made things worse (the system was not stable unless all types of offloading were disabled, or else the driver would crash). This dropped my throughput from ~8 Gbps to about ~983 Mbps. I've also ordered a quad-port Chelsio to replace it just in case; the Mellanox adapter stays in until then.
- The troubled NAS is on a completely closed VLAN with jumbo MTUs enabled and verified working. This VLAN has no route out and is only meant as a storage backbone for servers to access the NASes.
- I completely wiped FreeNAS and started from a fresh install on fresh enterprise SSDs. The problem persists.
- I removed all of my 40 Gb-related sysctl tunings (roughly the ones sketched below). No dice.
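
For reference, they were along the lines of the usual FreeBSD large-buffer tweaks for 40 GbE. The exact values here are illustrative, not my saved config:

Code:
    # bigger socket buffers for 40 GbE (illustrative values)
    kern.ipc.maxsockbuf=16777216
    net.inet.tcp.sendbuf_max=16777216
    net.inet.tcp.recvbuf_max=16777216
    net.inet.tcp.sendspace=262144
    net.inet.tcp.recvspace=262144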

This all started happening when I selected "Upgrade ZFS pool", which I will never do again. The other box (whose pool was not upgraded) is behaving as expected. I'm at my wits' end with this.

Please help. Or, if you want to tell me to RTFM for this problem, please send me the URL and I'll look right away.
 

Yorick (Wizard) said:
Which version of FreeNAS?

Good morning and thank you.

My FreeNAS version: FreeNAS-11.3-U2.1

Also note that I upgraded this host to that version. When I upgraded, my pool was no longer recognized on any version of FreeNAS. Unfortunately, I had to destroy it and start from scratch. This is a little worrying, but I have ~90 TB worth of backup tapes, and the very crucial data is backed up to the other, smaller NAS. I also tried swapping the RAM for (2) new sticks of ECC DDR3. The problem still persists.

FYSA, I also want to reiterate that this host is on a closed network with no Layer 3 hop out, i.e., no routing. I use an Arista switch (48 × 10 Gb ports plus 4 × 40 Gb ports).

This NAS is now in a posture to be erased, the pool rebuilt, etc.: whatever I need to do to mitigate this error, and I'm open to any direction.
 

Yes. Both servers use the Supermicro X9DRH-7TF.

I'm using the onboard RAID controller (LSI 2208). The controller was flashed easily and successfully to IT mode. Both servers use the Mellanox ConnectX-3. I did replace a hard drive in the problem host with a brand-new HGST 10 TB. No drives report any issues. Both servers use the exact same Optane 900 SLOG.

No hardware errors are thrown even after NFS and SMB stop functioning. If I use sockstat, I can see NFS and SMB still listening on their expected ports while the symptoms are present. The symptom in this case is that all NFS/SMB access to this box is cut off immediately. (I'm laughing to stop from crying.) The network interfaces have ZERO errors.
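
For the curious, the check is nothing fancy, just standard FreeBSD sockstat (the pattern simply matches the NFS and Samba daemons):

Code:
    # list IPv4 listeners and filter for the NFS/SMB daemons
    sockstat -4 -l | egrep '(nfsd|mountd|smbd)'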
 

Also: there are (2) networks, where 172.16.2.0 is for non-jumbo routed traffic and 172.16.6.0 is for isolated jumbo-MTU traffic. NFS/SMB locks up across the board.
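
For clarity, the jumbo side is configured along these lines (interface name and address here are illustrative, not my exact setup):

Code:
    # isolated storage VLAN, jumbo frames
    ifconfig mlxen0 inet 172.16.6.10/24 mtu 9000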
 
Is the same dataset being shared via SMB and NFS simultaneously?

Great question!

My NAS is set up the same: I have it split into (4) 17 TB datasets (named set01 through set04). set01 is shared only through CIFS and uses Windows-style permissions. set02 through set04 all have general permissions and are shared out through CIFS and NFS simultaneously. This setup worked for years on my previous build.
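
Roughly, the layout looks like this (pool name is an example):

Code:
    tank/set01  ->  SMB only   (Windows-style ACLs)
    tank/set02  ->  SMB + NFS  (general permissions)
    tank/set03  ->  SMB + NFS  (general permissions)
    tank/set04  ->  SMB + NFS  (general permissions)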

Do you need the output of any commands?
 

anodos (Sambassador, iXsystems) replied:
Not really. One parameter that became the default for mixed-protocol SMB/NFS shares in 11.3 is strict locking = yes. This is applied on a per-share basis. You can override it by setting strict locking = no as a share-level auxiliary parameter, which will restore the pre-11.3 behavior. The specific parameters changed in this case are:
Code:
    "strict locking": "yes",
    "level2 oplocks": "no",
    "oplocks": "no"

Of that batch, the one most likely to cause your problem on mixed-protocol shares is "strict locking".
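
As a sketch, with the auxiliary parameter in place, the generated share section of smb.conf ends up looking something like this (share name and path are examples from this thread, not your actual config):

Code:
    [set02]
        path = /mnt/tank/set02
        read only = no
        # restore pre-11.3 locking behavior for mixed SMB/NFS access
        strict locking = no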
 

You're a rockstar. Configuring now; I'll let you know if this is the case. Now that you mention it... set01 (the only one that is strictly SMB) is still accessible right now while the other three shares are down...
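
While I monitor, I'll keep an eye on lock activity with standard Samba tooling, something like:

Code:
    # show current byte-range locks held by smbd
    smbstatus -L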
 
Just to let everyone know, this setting did seem to fix my issue. I don't know for certain that strict locking was responsible, but when I set it to "no" per your guidance, my SMB throughput increased dramatically. I'll continue to monitor.
 