Race condition: need workaround to delay LDAP startup

kur1j

Cadet
Joined
Aug 29, 2023
Messages
9
I found this issue: https://ixsystems.atlassian.net/jira/software/c/projects/NAS/issues/NAS-127452

In short, there is a race condition: the LDAP services start BEFORE the networking services have completely started, so kinit fails because it can't talk to the ticket authority server. The result is `kinit: Resource temporarily unavailable while getting initial credentials\n')` errors.

I have tried to delay the startup of middlewared.service by adding a delay, but middlewared itself seems to control the startup of the network, so that doesn't help.

Are there any tricks to delay the Kerberos startup by 5-10 seconds so that all the network services can start first?
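
To illustrate the kind of delay I mean, here is a minimal sketch that waits for the KDC to become reachable instead of sleeping for a fixed time. It is not a tested TrueNAS workaround: the function name `wait_for_tcp` and the host `kdc.example.com` are made up for the example, port 88 is just the usual Kerberos port, and how such a check could be hooked in ahead of middlewared's kerberos.start job is exactly the open question.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=10.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; not part of TrueNAS or middlewared.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the network path to the server is up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "network unreachable", "connection refused", timeouts, etc.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example use (placeholder host name, assumed for illustration):
#   if wait_for_tcp("kdc.example.com", 88, timeout=10.0):
#       ...  # now it should be safe to run kinit
```

Polling like this only waits as long as it has to, whereas a fixed 5-10 second sleep either wastes time or is still too short on a slow boot.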
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
If the IP address is statically assigned, is the issue resolved?
 

kur1j

No, it isn't. It just happens less frequently. Before the change, it failed 100% of the time. After the change to a static IP, it fails ~50% of the time after a reboot. The window is so tight that even with a static IP the network sometimes comes up fast enough and sometimes doesn't.

Basically, it comes down to the middlewared service kicking off the kerberos.start job before the network is completely up, as the logs in the ticket I filed show. Unfortunately, I can't control the dependency, as middlewared seems to control both the network and the Kerberos start.
 

morganL

I'm a bit surprised that no one else is reporting the same issue. Is there anything unusual in your network?
 

kur1j

Honestly, I am too. I suspect there aren't many people using Kerberos keytabs. The problem shows up in the kinit call with the keytab file, so if no one else is using keytabs, that might be why.

No, nothing special. Of the two systems exhibiting this issue, one is on a copper 1Gbps connection and the other is on a 10Gbps SFP+ connection. Both are connected to Cisco switches. The system that doesn't exhibit this issue is connected to the SAME exact switch via 10Gbps SFP+ as well.

When I had our DHCP server assigning the static IP, it wasn't a long negotiation either: < 5 seconds, not 1-2 minutes.
 

morganL

Thanks for the bug report. If you could narrow it down to keytabs, it would be useful.
 

kur1j

Oh, it is definitely caused by the use of keytabs. It’s all there in the bug report.

It’s the “kinit” call that fails when the network isn’t up. It throws the exception, which causes the directory services to become faulted, and they require a manual disable and re-enable to get working again.

I was posting here to see if there was an easy way I could get a workaround to delay the startup of the directory services “kerberos.start” job.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Is there an equivalent to CORE's "netwait" feature in SCALE?
 

kur1j

I'm not familiar with CORE, so I don't know what "netwait" is. I assume it is related to waiting for the network services to be up. If so, there is already ix-net.service, but that is already a dependency on the network.

For this problem, middlewared is handling the network AND kerberos.start, which means adding external service-level dependencies unfortunately doesn't do me any good.
 

Patrick M. Hausen

I see - sorry. In FreeBSD you can set these tunables (rc.conf):

netwait_enable="YES"
netwait_ip="1.2.3.4"
netwait_timeout="60"

and the system will wait up to 60 seconds for 1.2.3.4 to be "pingable" before starting any other services.

Now that you mention it, I wonder if this will work with TrueNAS at all. It depends on whether the network is started by the FreeBSD RC system or by the middleware.
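
To make the netwait idea concrete, here is a rough Python sketch of the same behaviour: keep pinging one address until it answers or a timeout expires. It is not the FreeBSD rc.d script itself. The names `ping_once` and `netwait` are made up for illustration, and the `ping` flags assume the Linux iputils version (where `-W` is in seconds), so it would need adjusting on FreeBSD.

```python
import subprocess
import time


def ping_once(ip):
    """Send a single ICMP echo request; True if a reply came back.

    Assumes Linux iputils ping: -c 1 = one packet, -W 1 = wait 1 second.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def netwait(ip, timeout=60.0, probe=ping_once, sleep=time.sleep):
    """Wait up to `timeout` seconds for `ip` to become pingable.

    `probe` and `sleep` are injectable so the loop can be tested
    without touching the network.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(ip):
            return True
        sleep(1)
    return False


# Mirrors the rc.conf example above: wait up to 60 seconds for 1.2.3.4.
#   netwait("1.2.3.4", timeout=60)
```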
 

kur1j

I believe that TrueNAS SCALE does allow the "system"-level network service to start, but the IPs and other network information are managed by middlewared.
 

morganL

The engineers are adding a retry process for Kerberos in Dragonfish (24.04.RC1).

I'd suggest using a static IP until you are ready to update.
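
For readers wondering what a retry around kinit looks like in general, here is a minimal sketch. It is only an illustration of the pattern, not the code that went into Dragonfish: the helper names `retry` and `kinit_with_keytab`, the attempt count, and the delay are all assumptions. The only real command used is `kinit -k -t <keytab> <principal>`.

```python
import subprocess
import time


def retry(action, attempts=5, delay=2.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            # Give the network a moment before trying again.
            sleep(delay)
    return False


def kinit_with_keytab(keytab, principal):
    """Run kinit against a keytab; True if a ticket was obtained."""
    result = subprocess.run(
        ["kinit", "-k", "-t", keytab, principal],
        capture_output=True,
    )
    return result.returncode == 0


# Example use (placeholder path and principal, assumed for illustration):
#   ok = retry(lambda: kinit_with_keytab("/path/to/krb5.keytab",
#                                        "host/nas.example.com@EXAMPLE.COM"))
```

With a retry in place, a kinit that fails because the network is not yet up gets another chance a few seconds later instead of leaving the directory service faulted.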
 