SOLVED Re-joining Active Directory results in DNS timeout

INS4NIt

Cadet
Joined
May 1, 2023
Messages
8
After upgrading from TrueNAS Scale 22.12.1 to Scale 22.12.2 I encountered the bug described in this thread: https://www.truenas.com/community/threads/task-renew-kerberos-ticket-hangs-since-the-update.108037/

In trying to resolve it, I left the AD domain and attempted to rejoin it. Now, though, every single time I attempt to enable Active Directory, even with the same credentials that worked before, I get the following error:

Code:
Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 196, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/service.py", line 576, in update
    rv = await self.middleware._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1186, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1318, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/activedirectory.py", line 434, in do_update
    await self.middleware.call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1386, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/activedirectory_/dns.py", line 210, in check_nameservers
    resp = await self.middleware.call('dnsclient.forward_lookup', {
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1386, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1335, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1318, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1186, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/dns_client.py", line 108, in forward_lookup
    results = await asyncio.gather(*[
  File "/usr/lib/python3/dist-packages/middlewared/plugins/dns_client.py", line 40, in resolve_name
    ans = await r.resolve(
  File "/usr/lib/python3/dist-packages/dns/asyncresolver.py", line 114, in resolve
    timeout = self._compute_timeout(start, lifetime)
  File "/usr/lib/python3/dist-packages/dns/resolver.py", line 950, in _compute_timeout
    raise Timeout(timeout=duration)
dns.exception.Timeout: The DNS operation timed out after 12.405295133590698 seconds


Note the number of timeout seconds in the last line. It is always a very similar (if not identical) value, and is not affected whatsoever by the DNS Timeout setting when configuring Active Directory.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Hmm... that's a good point; we should probably pass that value to our dnsclient. We're more strict now about validating that we don't have any non-AD nameservers (because users were setting up non-AD ones as nameserver 2 or 3 and ending up with broken AD if TrueNAS switched to using one of them). That said, I have a feeling that if you have a nameserver that isn't generating a response within 12 seconds, then you should fix that problem.
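
For illustration only: the traceback shows the lookups being gathered with asyncio, so passing a caller-supplied timeout through could look roughly like the sketch below. This is not the middleware's code. resolve_all and fake_lookup are hypothetical names, and fake_lookup just sleeps instead of querying a nameserver.

```python
import asyncio


async def resolve_all(lookup, names, timeout):
    """Resolve every name concurrently, giving the whole batch `timeout` seconds."""
    batch = asyncio.gather(*(lookup(name) for name in names))
    try:
        return await asyncio.wait_for(batch, timeout=timeout)
    except asyncio.TimeoutError:
        # Surface the configured value so the error reflects the user's setting.
        raise TimeoutError(f"DNS lookups did not finish within {timeout} seconds")


async def fake_lookup(name, delay=0.05):
    """Hypothetical stand-in for a real resolver call: sleeps, then echoes."""
    await asyncio.sleep(delay)
    return f"answer for {name}"


if __name__ == "__main__":
    print(asyncio.run(resolve_all(fake_lookup, ["dc1.example.internal"], timeout=1)))
```

The point of the sketch is only that the timeout is a parameter of the call rather than a fixed value inside it.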
 

INS4NIt

I have three nameservers configured in Scale's network settings, the first of which is the domain controller. I'm able to ping all three of them from the shell, and the "dig" command returns the IP address of our domain controller as expected. For that matter, I'm also able to ping the hostname of the machine I'm typing on right now from the Scale shell, and it immediately resolves and returns with no issues. The Active Directory setup is the only place that I can tell is having any DNS issues. Is there something that I might be missing?
 

INS4NIt

Following up, this comment -- "We're more strict now about validating that we don't have any non-AD nameservers (because users were setting up non-AD ones as nameserver 2 or 3 and having broken AD if TrueNAS switched to using one of them)." -- led me to test removing the other two nameservers and trying again. That changes the error I get to the following:

Code:
 Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/plugins/activedirectory.py", line 590, in start
    dc_info = await self.lookup_dc(ad['domainname'])
  File "/usr/lib/python3/dist-packages/middlewared/plugins/activedirectory.py", line 975, in lookup_dc
    raise CallError("Failed to look up Domain Controller information: "
middlewared.service_exception.CallError: [EFAULT] Failed to look up Domain Controller information: ads_connect: No logon servers are currently available to service the logon request.
Didn't find the cldap server!


Just to be clear, the secondary and tertiary DNS entries are redundantly fed from our domain controller, and they're what DHCP assigns to new devices on our network. They're also what we used to set up our AD integration when we were first testing Scale, so they have worked in the past.
 

INS4NIt

Well, this is interesting... Enabling and disabling the Active Directory configuration does appear to be doing something in the background. The output of midclt call activedirectory.domain_info | jq changes depending on whether or not I've enabled AD in Directory Services (see the screenshots below):

[Screenshot: 1682979918569.png]

[Screenshot: 1682979937813.png]

But SCALE still isn't configuring the Kerberos realm or showing the AD status:

[Screenshot: 1682980106195.png]
 

INS4NIt

Aha! Going down several Google rabbit holes, I stumbled across this response you made to a much earlier post: https://www.truenas.com/community/t...mpossible-to-join-ad-domain.90983/post-631746

After running midclt call activedirectory.start via the shell, everything started up as expected.

For a future release of SCALE, would it be possible to implement a check that tries the above method of starting the AD instance if the default one returns an error, or would that mask underlying problems that should be resolved on the user's end?
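
For anyone who wants to script the midclt calls used in this thread (activedirectory.domain_info and activedirectory.start), here is a minimal sketch. It assumes midclt accepts JSON-encoded arguments and prints either JSON or a bare string; midclt_command, parse_output and midclt_call are hypothetical helper names, not part of TrueNAS.

```python
import json
import subprocess


def midclt_command(method, *args):
    """Build the argv for `midclt call`, JSON-encoding each argument."""
    return ["midclt", "call", method] + [json.dumps(arg) for arg in args]


def parse_output(text):
    """Decode midclt output as JSON, falling back to the raw string."""
    text = text.strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except ValueError:
        # Assumption: plain string results are printed without JSON quoting.
        return text


def midclt_call(method, *args):
    """Run one middleware method on the local system and return its result."""
    proc = subprocess.run(midclt_command(method, *args),
                          capture_output=True, text=True, check=True)
    return parse_output(proc.stdout)


if __name__ == "__main__":
    # Dry run: show the command lines instead of executing them.
    print(" ".join(midclt_command("activedirectory.domain_info")))
    print(" ".join(midclt_command("activedirectory.start")))
```

Calling midclt_call("activedirectory.domain_info") before and after enabling AD gives two values that can be compared directly instead of eyeballing jq output.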
 

anodos

INS4NIt said:
I'm able to ping all three of them from the shell, and the "dig" command returns the IP address of our domain controller as expected.

Right. That's only one aspect of what we need available. Are all relevant SRV records available through all of your nameservers? "activedirectory.start" just bypasses validation. It doesn't fix your underlying issues.
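
A rough way to check this from the shell is to ask each nameserver individually for the usual AD SRV records. The sketch below assumes dig is installed and uses commonly documented AD record names; exactly which records the middleware validates is an assumption, and the domain and IP addresses are placeholders.

```python
import subprocess

# Commonly documented AD SRV record prefixes. Which of these the
# middleware actually validates is an assumption.
SRV_PREFIXES = (
    "_ldap._tcp",
    "_kerberos._tcp",
    "_ldap._tcp.dc._msdcs",
    "_kerberos._tcp.dc._msdcs",
)


def srv_names(domain):
    """Return the SRV record names to query for an AD domain."""
    domain = domain.strip(".").lower()
    return [f"{prefix}.{domain}" for prefix in SRV_PREFIXES]


def dig_command(nameserver, name, timeout=5):
    """Build a dig command that asks one specific nameserver for one SRV record."""
    return ["dig", f"@{nameserver}", name, "SRV", "+short",
            f"+time={timeout}", "+tries=1"]


def answers_from(stdout):
    """Keep answer lines only; dig prefixes diagnostics with ';'."""
    return [line.strip() for line in stdout.splitlines()
            if line.strip() and not line.lstrip().startswith(";")]


def check_nameservers(domain, nameservers, timeout=5):
    """Map each nameserver to the SRV names it failed to answer."""
    missing = {}
    for ns in nameservers:
        failed = []
        for name in srv_names(domain):
            proc = subprocess.run(dig_command(ns, name, timeout),
                                  capture_output=True, text=True)
            if proc.returncode != 0 or not answers_from(proc.stdout):
                failed.append(name)
        missing[ns] = failed
    return missing


if __name__ == "__main__":
    # Dry run with placeholder values: print the queries without sending them.
    for ns in ("192.0.2.10", "192.0.2.11"):
        for name in srv_names("example.internal"):
            print(" ".join(dig_command(ns, name)))
```

Running check_nameservers with your real domain and every configured nameserver should return an empty list for each one; any nameserver with entries in its list is one that AD would break on if the system switched to it.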
 

INS4NIt

Understood, I included that mostly just to demonstrate that DNS was available, since that's the error I was getting. Just so you know where I'm at right now, I have the domain controller as my only DNS entry now so that there's no way for the other entries to potentially conflict.

Checking for SRV records via the SCALE shell returns the following output:

[Screenshot: 1683042916808.png]

Which, to my understanding, is expected. But with this configuration I was still getting the "Didn't find the cldap server!" error from comment #4.

This screenshot was taken after force-starting Active Directory, if that's worth anything, but I do remember the SRV records returning properly when checked before that step as well.

To be clear, users are able to authenticate to the SMB share they need access to just fine now. I'm just trying to figure out what the fault was in the first place at this point.
 

INS4NIt

anodos said:
libads error is probably stale info being cached. Cf. https://github.com/truenas/middleware/pull/11080 (fixed for next release). The point of validation is to catch cases where things are somewhat on a knife edge. If nameserver2 isn't able to resolve SRV records, then Kerberos (and AD) will break if for any reason we have to switch from nameserver1 to nameserver2.

Gotcha, glad to see y'all are aware of the issue and it's already being tracked. For future reference (for myself and anyone else who might stumble on this thread), what's the process for manually flushing the DNS(?) cache in SCALE?
 

anodos

The libads cache you can flush via net cache flush. If there's something cached in nscd, you have to restart that service as well. nscd is unfortunately a requirement these days on our side because of the sheer quantity of poorly designed apps out there that spam nameservers.
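
For anyone landing here later, the two steps can be wrapped in a small script. This is only a sketch: net cache flush is the command named above, while restarting nscd through systemctl is an assumption about how the service is managed on SCALE, and flush_caches is a hypothetical helper name.

```python
import subprocess

# `net cache flush` is from the post above. Restarting nscd via systemctl
# is an assumption about how the service is managed on SCALE.
FLUSH_STEPS = (
    ["net", "cache", "flush"],
    ["systemctl", "restart", "nscd"],
)


def flush_caches(dry_run=True):
    """Run (or, in dry-run mode, just list) the flush steps in order."""
    done = []
    for cmd in FLUSH_STEPS:
        if not dry_run:
            # check=True stops at the first step that fails.
            subprocess.run(cmd, check=True)
        done.append(" ".join(cmd))
    return done


if __name__ == "__main__":
    for line in flush_caches(dry_run=True):
        print(line)
```

It defaults to a dry run so nothing is restarted by accident; pass dry_run=False from a root shell to actually run the steps.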
 