TrueNAS locks up with "uhub_reattach_port: giving up port X reset"

benk87 · Jan 11, 2023

First, thank you for taking the time to read.

Background:

Home system for personal documents/photos
Currently not dependent on this build for any storage due to current issue
I'm new to FreeBSD, but generally comfortable on the command line (Windows developer)
Cribbed this build from this video, which since visiting these forums may have been a mistake. sigh
I have not been able to reproduce this consistently, which has made it hard to fix myself
I did the following to log all console output to a log file, as the messages below were not showing up in any of the /var/log files I could find. They still weren't logged as expected.

Hardware (all purchased new):

I am happy to supply serial numbers upon request, just didn't want to dismantle the box to get to all of them:

ASUS ROG Strix B550-I Gaming AMD AM4
AMD Ryzen 3 3100 4-Core
SilverStone Technology ECS07 (provides 2 more SATA ports; 2 of the WD Gold disks are on this)
G.Skill RipJaws V Series 16GB (2 x 8GB) 288-Pin SDRAM PC4-28800 DDR4 3600
Kingston 120GB A400 SATA 3 2.5" Internal SSD SA400S37/120G (OS SSD)
Western Digital 4TB WD Gold (x5, ZFS pool)
Intel Optane Memory H10 32GB with SSD Solid State Storage 512GB HBRPEKNX0202AC
Cooler Master V850 SFX Gold Full Modular, 850W
JONSBO N1 Mini-ITX NAS Chassis

Configuration:

I am happy to supply other configuration/log information as needed.

TrueNAS Version: TrueNAS-13.0-U3.1
Previously had a SyncThing jail installed but removed, no current jails installed

Description:

The problem will usually start within a 24 hour period of a reboot.

When the problem occurs, TrueNAS becomes unresponsive. I am unable to see it on the network, and the web UI does not connect, so I can't even reach the login page. Attaching a monitor to the physical box, I see the following cascade of error messages in the console, and the console is unresponsive to the keyboard:

uhub_reattach_port: giving up port 13 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 14 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 1 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 2 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 3 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 4 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 5 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 6 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 7 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 8 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 9 reset - device vanished: change 0xfb status 0x7fb
uhub_reattach_port: giving up port 10 reset - device vanished: change 0xfb status 0x7fb
Jan 11 13:04:05 vault 1 2023-01-11T13:04:05.447511-08:00 vault.local collectd 1457 - - plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.

Please see 'console-image.jpg' attached for more context. I would supply the full text but it's not captured in the /var/log files.
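
For anyone who does manage to capture this flood as text, here is a minimal sketch of how the messages could be summarized. The regular expression is an assumption based only on the sample lines quoted above, and `summarize_uhub_errors` is a hypothetical helper name, not part of any TrueNAS or FreeBSD tool.

```python
import re
from collections import Counter

# Pattern inferred from the console lines quoted in this post (an assumption,
# not an official description of the kernel message format).
UHUB_RE = re.compile(
    r"uhub_reattach_port: giving up port (\d+) reset - device vanished: "
    r"change (0x[0-9a-fA-F]+) status (0x[0-9a-fA-F]+)"
)


def summarize_uhub_errors(lines):
    """Return (sorted port numbers, Counter of (change, status) pairs)."""
    ports = set()
    codes = Counter()
    for line in lines:
        match = UHUB_RE.search(line)
        if match:
            ports.add(int(match.group(1)))
            codes[(match.group(2), match.group(3))] += 1
    return sorted(ports), codes


# A few of the lines from the console output above.
sample = [
    "uhub_reattach_port: giving up port 13 reset - device vanished: change 0xfb status 0x7fb",
    "uhub_reattach_port: giving up port 1 reset - device vanished: change 0xfb status 0x7fb",
    "uhub_reattach_port: giving up port 2 reset - device vanished: change 0xfb status 0x7fb",
]

ports, codes = summarize_uhub_errors(sample)
print("ports affected:", ports)
print("distinct change/status pairs:", dict(codes))
```

If every port reports the same change/status pair, as in the lines above, that would be consistent with the whole USB controller dropping out rather than a single device misbehaving, though that is only a guess from the pattern.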

I've found similar but not identical error messages here on the forums.
I have an intuition this is likely a hardware issue (or maybe a BIOS setting?), based on the problems people tend to have with incompatibility between their hardware and TrueNAS, but I'm interested in a professional's opinion.

I also don't entirely understand the error message, as this box is typically headless and has nothing in any of its USB ports. I did find this bug report on the FreeBSD forum.

Things I've Tried

Replacing the M.2 SATA expansion card. I previously used this one, but I believe I was still having the issue at that time (I've been debugging this on and off for a while)
Moving the Kingston OS drive off the ECS07 expansion card and onto the onboard SATA connectors
Setting up logging of all console messages to try to capture the full log, as PGUP/PGDOWN stops working once this flood of errors starts. The logs didn't seem to contain the messages in the attached picture
Checking the HDDs' SMART results (they looked fine to me but, again, amateur hour here; I can provide them)

Resolutions

Currently the only thing that 'resolves' this issue is a hard reboot.

Ericloewe · Jan 11, 2023

benk87 said:
Cribbed this build from this video, which since visiting these forums may have been a mistake. sigh

Yeah, LTT is not great for advice on what to do. On the flip side, if they tell you not to do something, you should probably listen because they deemed it too crazy even for themselves.

I seem to recall that AMD processors were having trouble with USB, with various mitigation options. In your case, perhaps you can just disable the USB ports in the system setup.

benk87 · Jan 11, 2023

Hey! Thanks for the quick response.

A few follow-up questions, respond at your leisure:

To learn more about the AMD processor/USB issues and the various mitigation options, I'm assuming it best to crawl the FreeBSD forums?
To be crystal clear about your suggestion, you're saying disable the USB ports via the BIOS? Or is there a separate 'system setup' place I should be looking at?

Thanks again.

Ericloewe · Jan 12, 2023

benk87 said:
To learn more about the AMD processor/USB issues and the various mitigation options, I'm assuming it best to crawl the FreeBSD forums?

Possibly, but it's not an OS thing, rather a general problem.

benk87 said:
To be crystal clear about your suggestion, you're saying disable the USB ports via the BIOS?

In the system firmware setup menu, often still called the BIOS setup menu, yes.

benk87 · Jan 12, 2023

Thank you for the clarifications!

A quick update: I disabled all USB ports in the system firmware setup menu last night.

We're crossing about 15 hours since that change and we're not seeing issues.

I had the box continue to sync files to a cloud service to give it something to do in the meantime.

I will update again once we clear 48 hours for the next person who finds this thread for the same issue.

benk87 · Jan 12, 2023

Good news bad news.

Bad news, it bailed again. Good news though is I managed to catch it as it happened (or very shortly afterwards) which gave me access to some new error messages I hadn't caught before (please see comprehensive console log in screenshots):

Some entries of note:

xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file

Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended

nvme0: Controller in fatal status, resetting
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: RECOVERY_WAITING
nvme0: resetting controller
nvme0: failing outstanding i/o

nvd0: detached

nvme0 = Intel Optane Memory H10 32GB with SSD Solid State Storage 512GB HBRPEKNX0202AC

nvd0 = apparently nvme0 by a different name?

boot-pool is on ada4
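
To see at a glance which device or subsystem each message came from, a rough sketch like the following could group the lines by the tag before the first colon. Treating that prefix as the "source" is an assumption that happens to fit the entries above; `group_by_source` is a hypothetical helper, not an existing tool.

```python
from collections import defaultdict


def group_by_source(lines):
    """Group console lines by the text before the first ':' (assumed source tag)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything without a tag
        source, message = line.split(":", 1)
        groups[source.strip()].append(message.strip())
    return dict(groups)


# The entries of note from the console, copied from above.
console = [
    "xptioctl: pass driver is not in the kernel",
    'xptioctl: put "device pass" in your kernel config file',
    "Solaris: WARNING: Pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended",
    "nvme0: Controller in fatal status, resetting",
    "nvme0: Resetting controller due to a timeout and possible hot unplug.",
    "nvme0: RECOVERY_WAITING",
    "nvme0: resetting controller",
    "nvme0: failing outstanding i/o",
    "nvd0: detached",
]

for source, messages in group_by_source(console).items():
    print(f"{source}: {len(messages)} message(s)")
```

Grouped this way, the excerpt shows messages from four sources (xptioctl, Solaris, nvme0, nvd0), which makes it easier to see that the NVMe device and the boot pool both appear in the same failure.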

Attached screenshot of the notifications after hard restart.
Attached smartctl results for each storage disk, they are all marked passed.

For context, I wasn't putting any heavy load on the TrueNAS box when this occurred. It had been a while since I had done anything with it, and I had just created a few folders on my Samba share before it went belly up.

--

From what I can gather, it looks like both the boot-pool (ada4) and the cache (nvme) are disconnecting and being put into a fatal/suspended status? A few thoughts:

I could potentially simplify the pool by yanking nvme0 to see whether the problem resides with that NVMe drive or the M.2 slot it's plugged into.
I saw some similar error messages in the forums searching for the xptioctl messages mentioning potentially a bad cable (though their issue looked more like a drive problem). When I moved the boot drive off the SilverStone Technology ECS07, I changed the cable at that time, so that seems unlikely.

Ericloewe · Jan 13, 2023

benk87 said:
nvme0 = Intel Optane Memory H10 32GB with SSD Solid State Storage 512GB HBRPEKNX0202AC

nvd0 = apparently nvme0 by a different name?

The nvme driver deals with talking to the SSD. The nvd driver provides access to the SSD's NVMe namespaces in a similar way to other drivers wired up to GEOM, so that you can do the same things (software RAID without ZFS, etc.). The nda driver is also available as an alternative to nvd, and I think it bypasses GEOM for extra performance with fewer features.

An NVMe SSD will have one or more namespaces, think of them as partitions, SCSI LUNs, datasets, etc. Most consumer SSDs only support a single one, but enterprise SSDs often support multiple namespaces.

benk87 · Jan 13, 2023

Thanks for the context!

For historical purposes:

I went further down the line of reasoning on the usb support issue and found the following article.

As you can see, it names both the motherboard and CPU combination that should be hit by this problem. The official workarounds are also listed in that article.

In a following article it names AGESA 1.2.0.2 as the update to be on the lookout for to solve this root cause USB issue.

Unfortunately, per the BIOS update page for the B550, I'm already at AGESA 1.2.0.3.

My next steps are to update the BIOS to latest as version 2803 seems to again iterate the AGESA version to 1.2.0.7.
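
For readers checking their own board against the fix, the version comparison above can be sketched as follows. This assumes the AGESA version is available as a plain dotted string such as "1.2.0.3"; vendor pages may add prefixes or patch letters that this sketch does not handle, and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version such as '1.2.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def includes_fix(installed, fix):
    """True if the installed version is at or beyond the version carrying the fix."""
    return parse_version(installed) >= parse_version(fix)


FIX_VERSION = "1.2.0.2"   # AGESA version named in the follow-up article
INSTALLED = "1.2.0.3"     # version already on this board
LATEST = "1.2.0.7"        # version listed for BIOS 2803

print("installed includes fix:", includes_fix(INSTALLED, FIX_VERSION))
print("update available:", parse_version(LATEST) > parse_version(INSTALLED))
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where "1.2.0.10" would sort before "1.2.0.2".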

Following that, if I still experience the USB issues, I'll be attempting the workarounds outlined in the first article.

I will update when I'm able to do the BIOS update and, if needed, the workarounds.

Thanks again for your time!

benk87 · Jan 31, 2023

Update:

Since updating the BIOS, I have reached roughly 72 hours since my last boot without running into the above issue.

Some other things of note:

There's bad documentation on this motherboard with regard to how the BIOS update USB slot and button work. While the docs say the button will flash 3 times to let you know the update has started, it in fact flashes continually until the update completes, and then boots the computer for you. It might be obvious to some readers, but for those who need to hear it: do not pull the flash drive while the BIOS update button is still flashing. You will interrupt the update and temporarily brick your motherboard, which will then show a status LED on boot saying there's something wrong with the DRAM. That is not true; you have a borked BIOS. (This is reversible, but it is an easy way to give yourself another headache. Ask me how I know.)
It's likely just my PC case, but I found that the BIOS flash button on the back of the motherboard was very easily kept permanently depressed by the IO shroud. You may need to do some surgery on the shroud to get it to play nicely.
