Kernel Panic - possibly NFS and 10Gbps NIC

havefun!

Cadet
Joined
Sep 23, 2020
Messages
4
I've run into an issue with TrueNAS RC1 that I believe is related to NFS and 10Gbps network ports.

I had set up a temporary TrueNAS server from spare parts to serve as an NFS server for a 3-node Proxmox cluster, and as an iSCSI target for the virtual machines of a 2-node Microsoft 2019 Failover File Server cluster running on that Proxmox cluster. This was done while waiting for a refurbished Dell r720XD to be delivered.

It initially ran fine, then began crashing under sustained network traffic with a kernel panic and messages related to NFS. After that, it would panic again shortly after displaying the IP addresses in the console on every boot. Rebooting the Proxmox nodes stops the panics for a while, but they start again if network traffic over NFS increases. Disabling the network port on the switch allows the system to boot without a kernel panic; I can then disable NFS and iSCSI, and it stays up after I re-enable the switch port. The failover cluster virtual machines, configured to use iSCSI, have been disabled since this kernel panic issue started. As soon as I turn on the NFS service, kernel panic.

I assumed that I just had some sort of random hardware issue, so I waited for the Dell server to be delivered.

The Dell r720XD server showed up late this afternoon: dual Xeon E5-2667 @ 2.90GHz, PERC 710P Mini, 128GB ECC RAM. Stuck it in the rack, updated firmware, and flashed the 710P Mini to IT mode.

Did a fresh install of TrueNAS to a spare SSD and physically moved the pool drives from the "spare parts server" to the Dell r720XD. Uploaded the config saved from the spare parts server and rebooted. The system imported the config and updated the database without any errors that I could see, then rebooted again.
After the second boot, I had to reconfigure the network interfaces. As soon as the IP address assigned to the 10Gbps port became available, kernel panic within a few seconds.

Disabled switch port. After boot, disabled NFS and iSCSI, re-enabled switch port, and system stayed up.
Enabled NFS, kernel panic within 30 seconds.


The 10Gbps NIC in the r720XD is a new Intel-based X520-DA2 from Gtek, connected with Gtek 3-meter DAC cables for Intel cards. The spare parts server also had a new X520-DA2, but not the same card used in the Dell.
I have experimented with jumbo frames on and off, and with LAGG with both ports connected and with only a single port connected. I have also tried with just a single non-LAGG 10Gbps port configured and connected. NFSv3 and NFSv4 both seem to cause the kernel panic.
I have also tried the switch port set to auto-negotiate and fixed at 10Gbps full duplex. The switch is a Ubiquiti US-16-XG.

Tomorrow, I may reset the TrueNAS config and reconfigure from scratch. This leaves my pool data in place, so I don't have to re-copy the couple of terabytes that I managed to get into the pool before these kernel panics started.

Any thoughts on what might be causing this? I have attached debug from the r720XD server, and a screenshot of the kernel panic.
 

Attachments

  • debug-truenas-20201015020736.tgz
    2 MB · Views: 357
  • panic.png
    506.7 KB · Views: 362

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
I'd suspect it's not really NFS-related, but perhaps the network driver for that card. Have you tried simulating some load to the box with iperf or similar?

Also, you can open a ticket on jira.ixsystems.com and include that debug for the iX engineers to look a bit deeper. It would be helpful if, on the backtrace screen, you ran the "bt" command and screen-captured that output as well, since it should give more details as to where it crashed.
 

havefun!

Cadet
Joined
Sep 23, 2020
Messages
4
Thanks for the response.

How do I run the bt command?
The server reboots about 10 seconds after the kernel panic. I don't see anywhere to enter the bt command.

Ran a few 5-minute iperf3 tests. No kernel panic. Averaged 7 Gbits/sec.
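
For anyone wanting to repeat this kind of load test, here is a rough sketch of how 5-minute iperf3 client runs could be scripted. It is not the exact set of commands used above: the helper name build_iperf3_client_cmd and the address 192.0.2.10 are placeholders, and it assumes "iperf3 -s" is already running on the TrueNAS box. It only builds and prints the command lines, using the standard iperf3 client options -c (server), -t (duration in seconds), -P (parallel streams) and -R (reverse direction).

```python
import shlex


def build_iperf3_client_cmd(server, seconds=300, streams=1, reverse=False):
    """Return the iperf3 client argument list for one sustained-load run.

    server  : address of the box running "iperf3 -s" (placeholder here)
    seconds : test duration; 300 matches a 5-minute run
    streams : number of parallel TCP streams (-P), only added when > 1
    reverse : if True, the server sends and the client receives (-R)
    """
    cmd = ["iperf3", "-c", server, "-t", str(seconds)]
    if streams > 1:
        cmd += ["-P", str(streams)]
    if reverse:
        cmd.append("-R")
    return cmd


if __name__ == "__main__":
    nas = "192.0.2.10"  # hypothetical address, replace with the NAS IP
    # Print one command per direction so both transmit and receive
    # paths of the NIC get exercised.
    for reverse in (False, True):
        print(shlex.join(build_iperf3_client_cmd(nas, reverse=reverse)))
```

Running both directions matters here because a driver or card problem may only show up on transmit or only on receive.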

After running the iperf3 tests, I enabled NFS in TrueNAS and got a kernel panic.

I should also add that I have disabled the NFS connection in the Proxmox cluster. I still get a kernel panic after enabling NFS in TrueNAS.
I can only assume that the Proxmox cluster is still trying to flush data that was queued to be written to VM disk(s) hosted on the NFS share before the NFS connection is actually disabled.

Lastly, some additional info. I have another TrueNAS Core system built with the same model NIC, same switch, same DAC cables, and a similar setup, running NFS and iSCSI and serving the same Proxmox servers, and it is not exhibiting the kernel panic issue. That system started having many checksum errors on its disks, which was the reason for building the replacement server.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
havefun! said:
The 10Gbps NIC in the r720XD is a new Intel-based X520-DA2 from Gtek, connected with Gtek 3-meter DAC cables for Intel cards.

Be aware that there are a large number of knockoff X520 cards out there, and that these are expected to crash, not work, cause panics, etc.

If your card doesn't have an Intel Yottamark on it that passes validation, I'd suggest trying a different card. The words "new X520 from Gtek" are a big red flag as the card is a 2009-era PCIe 2.0 card; while you can still get the X520 new in the channel, it's considered a legacy card.
 

havefun!

Cadet
Joined
Sep 23, 2020
Messages
4
"new X520 from Gtek" are a big red flag

I visited a local server hardware refurbisher and picked up a used Dell OEM X520 card. Ran the Dell firmware update CD, which updated the firmware on this card. Booted into TrueNAS, and I am still getting the same kernel panics.

I have an HP 530SFP+ adapter, but don't have any DAC cables or transceivers that will work with that card. Ordered something from fs.com; it will be here next Tuesday.
 