SOLVED Kubernetes service is not running - RC2 or Hardware Upgrade Issue

Antoshka

Dabbler
Joined
May 1, 2020
Messages
31
Hello guys,

I upgraded my NAS to new hardware and ran into an error during the migration: apps can't be installed and VMs run unstably. I'm getting "Kubernetes service is not running".

What happened before the issue (prerequisites):
- Hardware upgrade from old Intel v3 to new Ryzen 5800X:
MB: Supermicro X10SLL-F ---> ASRock Rack X570D4U
CPU: Xeon E3 1271 v3 ---> Ryzen 5800X
RAM: 32 GB ECC ---> 64 GB non-ECC
Added 1 TB NVMe

- Upgrade from RC1.2 to RC2 at the same time

Unfortunately, the OS upgrade happened at the same time as the hardware upgrade, so I still don't know what the root cause of my problem is...

Problem description:
I can't install any apps because I get a "Kubernetes service is not running" error.
Code:
Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 409, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 445, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1137, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1269, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/chart_release.py", line 478, in do_create
    await self.middleware.call('kubernetes.validate_k8s_setup')
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1324, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1281, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/kubernetes_linux/update.py", line 322, in validate_k8s_setup
    raise CallError('Kubernetes service is not running.')
middlewared.service_exception.CallError: [EFAULT] Kubernetes service is not running.

[EFAULT] Kubernetes service is not running.


Also, VMs run very unstably: a VM runs OK after initial configuration, but it gets stuck somewhere after a reboot (no VNC at all) and doesn't work properly. I think this may be connected to the Apps issue.

What I tried in order to solve the problem:
- Import the old config into the newly installed TrueNAS;
- Move the application pool assignment to another existing pool;
- Completely destroy the app pool and create a new one from scratch;
- Do everything from scratch (freshly installed TrueNAS).

I noticed a few similar threads, but "reboot + destroy and recreate app pool" doesn't work in my case.

Maybe someone has an idea or a solution? If anyone has had similar issues with AMD builds, that would be helpful as well.

P.S. I have properly configured "Route v4 Interface" and "Route v4 Gateway" in apps settings.

Thank you in advance!
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I think the radical hardware upgrade and the simultaneous software update make this very hard to diagnose. It's always best to do these separately... one step at a time.

Perhaps just go through the fresh install process carefully and document where the issues happen. Can you start with a new pool and no Kubernetes?
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
Make sure your system can fully reach the internet, i.e. can you ping google.com or similar from a shell prompt? When you enable Apps, Kubernetes has to download a lot of files during the initial setup, and if the internet cannot be reached, it'll fail like this.
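
A ping only shows that ICMP gets through. As a rough complement, the sketch below checks DNS resolution and an outbound TCP connection using Python's standard library. The choice of port 443 (HTTPS) is an assumption; the hosts Kubernetes actually downloads from are not named in this thread.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if DNS resolution of hostname succeeds."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Calling can_resolve("google.com") and can_connect("google.com", 443) from a Python prompt on the NAS would cover both cases. If both return True, plain connectivity is probably not the cause.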
 

Antoshka

Dabbler
Joined
May 1, 2020
Messages
31
Thanks for the ideas, guys.
morganL, I tried that already: completely new, fresh installation and new pool -> Kubernetes doesn't start/run.
BTW, I really didn't want to mix hardware and software upgrades, but building the NAS took 2 days. And when I finished, I couldn't find the RC1.1 build anywhere. I checked different links/posts on the forum, but I was only able to download the new RC2, so there is no way for an end user to download previous versions :(

Kris Moore, I can ping google.com from a shell.

Also, I updated the ticket with the outputs of:
k3s kubectl get pods -A
service k3s status
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
Great, looks like one of our devs is already interacting with you on the ticket. We'll keep investigating there and see if we can figure out what's going on here.
 

Antoshka

Dabbler
Joined
May 1, 2020
Messages
31
Hey everyone,
I have a small update from my end.

I spent 5 days trying to resolve the issues: re-installing TrueNAS and playing with pools, reconfiguring BIOS settings, switching disks, and other stuff. I provided all the details in the Jira ticket to get some feedback.
Unfortunately, nothing and nobody resolved my issue or helped me identify the root cause.

Fortunately, I gave up and decided to install the ESXi hypervisor to run my VMs.
I got a clue to my issue in the first 3 minutes of installation: "Page Fault Exception 14" (the error occurred within 10 seconds of the installation process starting). Googling this code points to a potential memory issue.
MemTest86 displayed thousands of memory errors:
[Attachment: Memory Both Modules - Errors.PNG]


I played a bit with the memory modules and figured out that one module is defective. The system doesn't boot at all with only the "bad" module installed, but it runs unstably with both modules.

So, in the end, the root cause is the defective RAM module. On the one hand, I'm impressed that TrueNAS Scale works at all in such conditions. On the other hand, I'm disappointed by how uninformative the system is. Compared with ESXi's checks, I don't know why TrueNAS doesn't check the hardware to at least warn users.
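
As a rough illustration of what a basic warning could look like, the sketch below scans kernel log lines for hardware-error reports. It is not something TrueNAS does, and the pattern list is an assumption rather than an exhaustive set. With non-ECC RAM like mine, bad memory usually corrupts data silently and logs nothing, so a clean result proves little and MemTest86 remains the real test.

```python
import re

# Substrings the Linux kernel commonly uses when reporting hardware faults.
# This list is an assumption for illustration, not an exhaustive set.
HARDWARE_ERROR_PATTERN = re.compile(
    r"hardware error|machine check|\bmce\b|\bedac\b", re.IGNORECASE
)


def find_hardware_errors(lines):
    """Return the log lines that match any hardware-error pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if HARDWARE_ERROR_PATTERN.search(line)
    ]
```

The function accepts any iterable of lines, so the output of dmesg can be fed to it line by line, for example by passing sys.stdin when the script is used at the end of a pipe.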

I had a pretty rare issue, but maybe this post will help someone, or it will make the case for developing internal hardware checks in TrueNAS Scale.

Happy holidays, guys!
 