TrueNAS locks up with "uhub_reattach_port: giving up port X reset"

benk87

Cadet
Joined
Jan 11, 2023
Messages
6
First, thank you for taking the time to read.

Background:
  • Home system for personal documents/photos
  • Currently not dependent on this build for any storage due to current issue
  • I'm new to FreeBSD, but generally comfortable on the command line (Windows developer)
  • Cribbed this build from this video, which since visiting these forums may have been a mistake. sigh
  • I have not been able to reproduce the problem consistently, which has made this hard to fix myself
  • I did the following to log all console output to a log file, as the messages below were not showing up anywhere in the /var/log files that I could find. They still weren't logged as expected.
Hardware (all purchased new):

I am happy to supply serial numbers upon request, just didn't want to dismantle the box to get to all of them:
Configuration:

I am happy to supply other configuration/log information as needed.
  • TrueNAS Version: TrueNAS-13.0-U3.1
  • Previously had a SyncThing jail installed but removed, no current jails installed

Description:

The problem will usually start within a 24 hour period of a reboot.

When the problem occurs, TrueNAS becomes unresponsive: I cannot see it on the network, and the web UI does not connect, so the login page is unreachable. Attaching a monitor to the physical box, I see the following cascade of error messages on the console, which no longer responds to the keyboard:

uhub_reattach_port: giving up port 13 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 14 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 1 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 2 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 3 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 4 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 5 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 6 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 7 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 8 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 9 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 10 reset - device vanished: change 0xfb status 0x7fb
Jan 11 13:04:05 vault 1 2023-01-11T13:04:05.447511-08:00 vault.local collectd 1457 - - plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.

Please see 'console-image.jpg' attached for more context. I would supply the full text but it's not captured in the /var/log files.
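
For anyone who does manage to capture a flood like this to a file, a minimal Python sketch for summarizing it follows. The regular expression is an assumption based only on the lines quoted above, and `summarize_uhub_resets` is a hypothetical helper name, not part of TrueNAS or FreeBSD.

```python
import re
from collections import Counter

# Pattern assumed from the console lines quoted in this post.
UHUB_RE = re.compile(
    r"uhub_reattach_port: giving up port (\d+) reset - device vanished: "
    r"change (0x[0-9a-f]+) status (0x[0-9a-f]+)"
)


def summarize_uhub_resets(lines):
    """Count 'giving up port N reset' messages per USB hub port number."""
    counts = Counter()
    for line in lines:
        match = UHUB_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few of the lines from the console output above, plus the collectd line,
# which the pattern should ignore.
sample = [
    "uhub_reattach_port: giving up port 13 reset - device vanished: change 0xfb status 0x7fb",
    "uhub_reattach_port: giving up port 14 reset - device vanished: change 0xfb status 0x7fb",
    "uhub_reattach_port: giving up port 1 reset - device vanished: change 0xfb status 0x7fb",
    "Jan 11 13:04:05 vault 1 2023-01-11T13:04:05.447511-08:00 vault.local collectd 1457 - - "
    "plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.",
]

counts = summarize_uhub_resets(sample)
for port in sorted(counts):
    print(f"port {port}: {counts[port]} message(s)")
```

A per-port count like this would show whether every port on the hub is affected (as the screenshot suggests) or only a few.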

  • I've found similar but not identical error messaging here on the forums
  • I have an intuition this is likely a hardware issue (or maybe a BIOS setting?) based on the problems people tend to have around incompatibility between their hardware and TrueNAS, but I'm interested in a professional's opinion.
I also don't entirely understand the error message, as this box is typically headless and has nothing in any of its USB ports. I did find this bug report on the FreeBSD forum.

Things I've Tried

  • Replacing the M.2 SATA expansion card (I previously used this one), though I believe I was still having the issue at that time (I've been debugging this on and off for a while)
  • Moving the Kingston OS drive off the ECS07 expansion card and connecting it directly to the onboard SATA connectors
  • Setting up logging of all console messages to try to capture the full log, since PGUP/PGDOWN stop working once this flood of errors starts; the logs did not contain the messages shown in the attached picture
  • Checking the HDDs' SMART results (they looked fine to me, but, again, amateur hour here; I can provide them)

Resolutions

Currently the only thing that 'resolves' this issue is a hard reboot.
 

Attachments

  • console-image.jpg (459.4 KB)

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Cribbed this build from this video, which since visiting these forums may have been a mistake. sigh
Yeah, LTT is not great for advice on what to do. On the flip side, if they tell you not to do something, you should probably listen, because they deemed it too crazy even for themselves.

I seem to recall that AMD processors were having trouble with USB, with various mitigation options. In your case, perhaps you can just disable the USB ports in the system setup.
 

benk87

Hey! Thanks for the quick response.

A few follow-up questions; respond at your leisure:

  • To learn more about the AMD processor/USB issues and the various mitigation options, I'm assuming it best to crawl the FreeBSD forums?
  • To be crystal clear about your suggestion, you're saying disable the USB ports via the BIOS? Or is there a separate 'system setup' place I should be looking at?
Thanks again.
 

Ericloewe

To learn more about the AMD processor/USB issues and the various mitigation options, I'm assuming it best to crawl the FreeBSD forums?
Possibly, but it's not an OS thing, rather a general problem.

To be crystal clear about your suggestion, you're saying disable the USB ports via the BIOS?
In the system firmware setup menu, often still called the BIOS setup menu, yes.
 

benk87

Thank you for the clarifications!

A quick update: I disabled all USB ports last night in the system firmware setup menu.

We're crossing about 15 hours since that change and we're not seeing issues.

I had the box continue to sync files to a cloud service to give it something to do in the meantime.

I will update again once we clear 48 hours for the next person who finds this thread for the same issue.
 

benk87

Good news, bad news.

Bad news: it bailed again. The good news is that I managed to catch it as it happened (or very shortly afterwards), which gave me access to some new error messages I hadn't caught before (please see the comprehensive console log in the screenshots):

Some entries of note:
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file

Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended

nvme0: Controller in fatal status, resetting
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: RECOVERY_WAITING
nvme0: resetting controller
nvme0: failing outstanding i/o

nvd0: detached

nvme0 = Intel Optane Memory H10 32GB with SSD Solid State Storage 512GB HBRPEKNX0202AC

nvd0 = apparently nvme0 by a different name?

boot-pool is on ada4

Attached screenshot of the notifications after hard restart.
Attached smartctl results for each storage disk, they are all marked passed.

For context, I wasn't putting any heavy load on the TrueNAS box when this occurred. It had been a while since I had done anything with it, and I had just created a few folders on my Samba share before it went belly up.

--

From what I can gather, it looks like both the boot-pool (ada4) and the cache (nvme0) are disconnecting and being put into a fatal/suspended status? A few thoughts:

  • I could potentially simplify the pool by yanking nvme0 to see whether the problem resides with that NVMe drive or the M.2 slot it's plugged into.
  • I saw some similar error messages in the forums searching for the xptioctl messages mentioning potentially a bad cable (though their issue looked more like a drive problem). When I moved the boot drive off the SilverStone Technology ECS07, I changed the cable at that time, so that seems unlikely.
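
To see at a glance which subsystems are complaining, the console entries quoted above can be grouped by the name before the first colon. This is only a rough sketch: `group_by_source` is a hypothetical helper, and splitting on the first colon is an assumption that happens to fit these particular lines rather than a general kernel log format.

```python
from collections import defaultdict


def group_by_source(lines):
    """Group console lines by the text before the first colon (e.g. 'nvme0')."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blanks and lines without a 'source:' prefix
        source, message = line.split(":", 1)
        groups[source.strip()].append(message.strip())
    return dict(groups)


# The entries of note quoted earlier in this post.
console_lines = [
    "xptioctl: pass driver is not in the kernel",
    'xptioctl: put "device pass" in your kernel config file',
    "Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended",
    "nvme0: Controller in fatal status, resetting",
    "nvme0: Resetting controller due to a timeout and possible hot unplug.",
    "nvme0: RECOVERY_WAITING",
    "nvme0: resetting controller",
    "nvme0: failing outstanding i/o",
    "nvd0: detached",
]

groups = group_by_source(console_lines)
for source, messages in groups.items():
    print(f"{source}: {len(messages)} message(s)")
```

Grouping this way makes it easier to compare captures from different lockups and check whether nvme0 and the boot-pool always fail together.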
 

Attachments

  • ada1.txt (6.1 KB)
  • ada2.txt (6.1 KB)
  • ada3.txt (10.7 KB)
  • ada4.txt (6.3 KB)
  • ada5.txt (6.1 KB)
  • nvme0.txt (2.7 KB)
  • Screenshot 2023-01-12 172917.png (98.4 KB)
  • 20230112_171242.jpg (508.8 KB)
  • 20230112_171224.jpg (552.3 KB)
  • ada0.txt (6.1 KB)

Ericloewe

The nvme driver deals with talking to the SSD. The nvd driver provides access to the SSD's NVMe namespaces in a similar way to other drivers wired up to GEOM, so that you can do the same things (software RAID without ZFS, etc.). The nda driver is also available as an alternative to nvd; I believe it exposes the namespaces through the CAM subsystem (like the SCSI and ATA disk drivers) instead.

An NVMe SSD will have one or more namespaces, think of them as partitions, SCSI LUNs, datasets, etc. Most consumer SSDs only support a single one, but enterprise SSDs often support multiple namespaces.
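
As a small illustration of the naming, FreeBSD device names are a driver name followed by a unit number, so nvme0 and nvd0 are two different drivers' views of the same hardware. The sketch below splits names that way. `split_device_name` and the `DRIVER_ROLES` table are hypothetical, the role descriptions paraphrase the explanation above, and the ada entry is an addition not taken from this post.

```python
import re

# Short role descriptions; the nvme/nvd/nda entries paraphrase the post above.
DRIVER_ROLES = {
    "nvme": "NVMe controller driver (talks to the SSD)",
    "nvd": "exposes the SSD's NVMe namespaces as disks",
    "nda": "alternative to nvd",
    "ada": "ATA/SATA disk driver (added for illustration, not from the post)",
}


def split_device_name(name):
    """Split a device name such as 'nvme0' into ('nvme', 0)."""
    match = re.fullmatch(r"([a-z]+)(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected device name: {name!r}")
    return match.group(1), int(match.group(2))


for device in ["nvme0", "nvd0", "ada4"]:
    driver, unit = split_device_name(device)
    print(f"{device}: unit {unit} of {driver} ({DRIVER_ROLES.get(driver, 'unknown')})")
```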
 

benk87

Thanks for the context!

For historical purposes:

I went further down the line of reasoning on the USB support issue and found the following article.

As you can see, it names both the motherboard and CPU combination that should be hit by this problem. The official workarounds are also listed in that article.

In a following article it names AGESA 1.2.0.2 as the update to be on the lookout for to solve this root cause USB issue.

Unfortunately, per the BIOS update page for the 550, I'm already at AGESA 1.2.0.3.

My next step is to update the BIOS to the latest release, as version 2803 appears to bump the AGESA version again, to 1.2.0.7.
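
For reference, the version comparison above can be written down as a short Python sketch. It assumes AGESA version strings are plain dot-separated integers, and `has_usb_fix` is a hypothetical helper. It only compares version numbers; as noted above, already being on 1.2.0.3 did not prevent the lockups on this box.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.2.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# The AGESA release the follow-up article names as carrying the USB fix.
FIX_VERSION = parse_version("1.2.0.2")


def has_usb_fix(installed):
    """True if the installed AGESA version is at or past the named fix."""
    return parse_version(installed) >= FIX_VERSION


for version in ["1.2.0.3", "1.2.0.7"]:
    print(version, "includes the named fix:", has_usb_fix(version))
```

Comparing tuples of integers avoids the trap of comparing the strings directly, which would sort a hypothetical "1.2.0.10" before "1.2.0.2".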

Following that, if I still experience the USB issues, I'll be attempting the workarounds outlined in the first article.

I will update when I'm able to do the BIOS update and, if needed, the workarounds.

Thanks again for your time!
 

benk87

Update:

Since updating the BIOS, I have reached roughly 72 hours of uptime since my last boot without running into the above issue.

Some other things of note:

  1. The documentation for this motherboard is poor with regard to how the BIOS update USB slot and button work. While the docs say the button will flash 3 times to let you know the update has started, it will in fact flash continually until the update completes, and then the board will boot the computer for you. It might be obvious to some readers, but for those who need to hear this: do not pull the flash drive while the BIOS update button continues to flash. You will interrupt the update and temporarily brick your motherboard, which will then show a status LED on boot saying there's something wrong with DRAM. This is not true; you have a borked BIOS. (This is reversible, but it's an easy way to give yourself another headache. Ask me how I know.)
  2. It's likely just my PC case, but I found that the BIOS flash button on the back of the motherboard was very easily kept permanently depressed by the IO shroud; you may need to do some surgery on the shroud to get it to play nicely.
 