Waiting for Active TrueNAS controller to come up...

ptyork

Dabbler
Joined
Jun 23, 2021
Messages
32
There's another thread with this title, but it doesn't seem to cover the exact symptoms I'm encountering. Though it is with Bluefin (upgrade from Angelfish), and I don't recall seeing this issue before the upgrade.

My issue appears to be 100% related to the GUI as all services and such seem stable. Basically, at unpredictable times and intervals, the GUI just loses connection to the server. Usually I get the titular "Waiting for Active TrueNAS controller to come up..." message. Sometimes it comes back on its own. More often I have to manually refresh the page. Refreshing always fixes it. Sometimes the login screen flashes up, but it returns to the screen I was originally on very quickly after that without requiring a login. And things are stable for a few seconds or minutes or hours.

MOST of the time it seems to happen while I'm not on the tab. So I thought it might be related to tab sleeping or something. But it does occasionally happen while I'm interacting with the UI. So I don't think that's it. Plus, it can go hours without happening, and then happen a few times in a 15 minute span.

This obviously sounds like a network issue. But literally nothing else on the box is impacted. SSH connections (in my experience very susceptible to network issues) all remain solid. No other web UI shows similar behavior. Other VMs and containers are all up and solid. I'm seeing this with MS Edge, but it's happened across multiple versions and persisted through a cookie and history purge. And it's just Chromium, so that shouldn't really be a factor.

I assume there's a background worker thread that polls the connection and/or the current user's permissions. And perhaps this connection is failing and causing this issue. No clues are given in the GUI. Is there a log file that might capture failed logins to whatever service the UI is trying to ping? Or are there other potential explanations that I might be able to explore?

Thanks.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Can you specify the hardware?

There are almost 300 fixes/improvements scheduled for 22.12.1 in 3 weeks. Might be in there...
 

ptyork

Homelab rig that Craft Computing would be proud of. :) Running virtualized under Proxmox. VM has:
- 64 GB RAM (ECC)
- 2x8 cores (Ivy Bridge Xeons in "host" mode)
- Bridged VirtIO network
- Passthrough SATA (onboard host and HBA---total 12 drives)

If it matters, the Proxmox host network is a bond...onboard (Intel) 1G nic + 10G ConnectX-3. Supermicro MB. Not sure what else might be relevant.

I think the only other potentially "interesting" things about my config are that I have two IP aliases bound to my primary bridge (br0) over my virtualized nic (ens18). I also have a bridge (br20) over a VLAN (vlan20). The GUI is listening on one of the two br0 aliases as well as the IP address on the VLAN.

Thanks.
 

ptyork

More info if you're interested. I've gotten pretty good at narrowing this down. I've not been exhaustive here, but I have found a few things.

First, it seems to happen on Chromium-based browsers. Chrome and Edge on Windows and Chrome on Mac. Does NOT seem to happen on Firefox.

Second, right now the repeatable issue doesn't actually show the "Waiting for Active TrueNAS Controller" message. Rather, it jumps to a "/ui/sessions/signin" route but gets stuck showing JUST the logo with an "indeterminate" progress bar at the top:

1674959270187.png


Third, this seems not to happen if I leave the screen on the Dashboard, but DOES happen if I leave the Datasets tab open. Though I COULD have gotten impatient waiting on the Dashboard, it was pretty quick to happen on the Datasets tab (< 1 hour). So this may or may not be a relevant clue.

Fourth, loading that Datasets tab does incur a couple of JS errors in the developer console, though I think they might just be coincidental. The first message below happens on load for all browsers, but the second and third happen on all browsers except Firefox when the Datasets tab is activated:

1674959710123.png


Fifth, and I think demonstrating the result (if not the cause), is that the chromium browsers keep closing and respawning WebSockets. Firefox has no issue with the single WebSocket staying alive and thus does not attempt to log out/redirect, whatever.

1674960110312.png


The pattern is similar across browsers and platforms, but the exact timing is not. Basically there's a longish period first (46 minutes here in Chrome on Windows, and basically 1 hour on Edge and Mac/Chrome). That socket contains a LOT of ping/pong messages...just keep-alive stuff. It dies without a message. This is followed by a sub-second re-connection attempt that is unsuccessful, ending with what seems to be a request for "pool.dataset.details" and closing with this error message:
  1. error: {error: 13, type: null, reason: "Not authenticated", trace: null, extra: null}
  2. id: "d02327a4-1ff7-1dc4-b7a1-f04d7407afca"
  3. msg: "result"
All of the remaining socket connections are just "connect", "connected", "ping", "pong", "ping" and so forth until the socket drops and it is recreated.
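
For anyone who wants to check their own capture for the same failure, here is a rough Python sketch. It assumes the frames have been copied out of the browser's WebSocket inspector as one JSON string per message, and that they have the shape shown in the error above (`msg`, `id`, `error.reason`). The name `find_auth_failures` and the sample ping/pong frames are made up for illustration and are not part of TrueNAS.

```python
import json


def find_auth_failures(frames):
    """Return (index, id) for each 'result' frame that failed with 'Not authenticated'."""
    failures = []
    for index, raw in enumerate(frames):
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(message, dict):
            continue
        error = message.get("error")
        if message.get("msg") == "result" and isinstance(error, dict):
            if error.get("reason") == "Not authenticated":
                failures.append((index, message.get("id")))
    return failures


# Frames modeled on the capture described above (assumed shape);
# the middle one is the failing result from the post.
sample_frames = [
    '{"msg": "ping"}',
    '{"msg": "result", "id": "d02327a4-1ff7-1dc4-b7a1-f04d7407afca", '
    '"error": {"error": 13, "type": null, "reason": "Not authenticated", '
    '"trace": null, "extra": null}}',
    '{"msg": "pong"}',
]

print(find_auth_failures(sample_frames))
```

Running it over the frames from each socket in turn should show whether the "Not authenticated" result only ever appears on the first reconnect, as it did for me, or on later sockets as well.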

And again, Firefox has ONE socket that just plays ping pong every 20 seconds forever. No issues.

If others can't reproduce this behavior, then perhaps this is a "hiccup" in the network connection that is not being handled as gracefully by Chromium as it is by Firefox. But the regularity of the WebSocket closures following the first failure (same pattern but different timing in all browsers, yet almost always x.9 minutes) doesn't feel like a network-layer issue anymore.
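
One way to test that regularity hunch is to compute how long each socket lived and look at the fractional minutes. Below is a minimal Python sketch, assuming you have noted the open and close times (in seconds) of each socket from the Network tab. Apart from the first 46-minute lifetime, the timestamps are invented placeholders rather than my actual capture, and both helper names are hypothetical.

```python
def socket_lifetimes_minutes(intervals):
    """Convert (opened_s, closed_s) pairs into lifetimes in minutes."""
    lifetimes = []
    for opened_s, closed_s in intervals:
        if closed_s < opened_s:
            raise ValueError("socket closed before it opened")
        lifetimes.append((closed_s - opened_s) / 60.0)
    return lifetimes


def fractional_parts(lifetimes):
    """Return the fractional part of each lifetime, rounded to one decimal."""
    return [round(minutes - int(minutes), 1) for minutes in lifetimes]


# Placeholder timestamps in seconds (hypothetical, for illustration only).
# The first interval matches the 46-minute socket mentioned above.
intervals = [(0, 2760), (2761, 3115), (3116, 3650)]

lifetimes = socket_lifetimes_minutes(intervals)
print([round(minutes, 1) for minutes in lifetimes])
print(fractional_parts(lifetimes))
```

If the fractional parts cluster around the same value across browsers, that points to a timer somewhere in the stack rather than random network drops.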

Hope this helps.
 

morganL

Awesome detail... would you report a bug if you can. The engineering team would appreciate the confirmation of any fixes they have.
 

ptyork

would you report a bug if you can.
Can you please open a Jira ticket?

Sadly, no. When I click the "Report a Bug" link, I can see bugs. When I click the "Create" button I'm told to log in. When I log in, I'm redirected to a customer portal with no way to report bugs. Something is misconfigured in Jira, or perhaps accounts have to be added to a group in order to file bug reports.

--------------

EDIT: My personal Jira account doesn't work. Tried logging in using my Google id as well as directly using password auth. Either I've been blocked or perhaps since I used Google auth first, somehow it didn't kick off the automated account approval process. Is there someone at iXsystems to contact about this?

Logged in using my work Jira account and it worked.
 

ptyork

@Daisuke yeah, both were pre-existing Atlassian accounts. I think I'd used my personal one with the iX Jira instance before. Something odd presumably happened. Some kind of migration limbo or something. Never got asked to verify my email for the personal account (at least not recently) but did for my work account. Oh well.

Happy to share the ticket. There's no Daisuke user there. If you can't "watch" the issue directly (https://ixsystems.atlassian.net/browse/NAS-120095) then let me know whatever name you are registered with in Jira and I'll share it there.
 