I have that sinking feeling.....

jaywest · Oct 30, 2022

I've been running the latest 12.0 truenas core (not sure exactly which level cause I cant get into it, hence the sinking feeling but I think it was 12.0.8-U2) for some time now with no issues. I very often have the dashboard window open, so I keep close tabs daily on the system and have never seen a single hiccup. I'm using windows/smb shares for some local desktops, a handful of jails for plesk & friends, and 4 or 5 byhve vm's (1 windows, 1 freebsd, and a few debian). I keep it updated so I'm pretty sure I am on the last/latest 12x release. I was planning on going to rel 13 sometime soon since I noticed it showed up in the updater as general deployment ready, but did no start that yet.

I have four 4tb drives set as mirrored vdevs (8tb usable), and the system boots from a mirrored two usb DOM (disk on module) that plug directly into the mainboard. Not sure of the DOM size, maybe 32gb, and 64gb ram. System has been rock solid for a couple years.

Abruptly, a few days ago I could no longer get to the UI/login screen and the browser just sits and spins. I figured I could just pop into the cli via ssh, and it checks my credentials correctly but instead of giving me a shell, it just prints out the following:

---------------
Last login: Sat Oct 29 22:36:53 2022 from 172.30.30.177
FreeBSD 12.2-RELEASE-p14 325282c09a5(HEAD) TRUENAS

TrueNAS (c) 2009-2022, iXsystems, Inc.
All rights reserved.
TrueNAS code is released under the modified BSD license with some
files copyrighted by (c) iXsystems, Inc.

For more information, documentation, help or support, go here:
http://truenas.com
Welcome to TrueNAS
Traceback (most recent call last):
File "/usr/local/sbin/hactl", line 171, in <module>
main(args.command, args.q)
File "/usr/local/sbin/hactl", line 17, in main
client = Client()
File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 283, in __init__
^C^C^C^C
---------------

The ctrl-C's are my attempt to stop whatever is running and give me a shell prompt, but I never get anything else in the ssh session. I run the system headless, but will try to get a monitor/keyboard/mouse on it today to see if I can access the console. I'm worried that if I just reboot it, it may not come back up and at this point i have no way to get the system config off it or anything. I have popped into each jail and vm, as well as copied a few files on and off the shares, and everything virtualized has no visible errors or issues in their logs. It just seems to be the UI and ssh that are hosed.

Could someone perhaps point me into a direction to troubleshoot? Any pointers or advice is most appreciated!

J

awasb · Oct 30, 2022

The above mentioned errors usually stem from a middleware(d) service problem.

If I were you, I'd

[1] log in locally / via serial console.
[2] do some service middlewared stop and service middlewared start.
[3] Log in via (hopefully then working) GUI and update everything to 13.0-U2. It's been long since (if at all) I've seen such a stable release. On my hardware I'd do it any time again.

Etorix · Oct 30, 2022

With access to the console (IPMI?), you may well bring the middleware back online, and take this early Halloween scare as a reminder that it is good practice to always have a recent copy of the configuration file at hand.
If you can't reach the console at all, it should be safe to reboot (still make sure the shares aren't used for writes and cleanly shut down as many services/VMs as possible).

jaywest · Oct 30, 2022

I knew there was a facility to restart the gui/middleware service, but I couldn't get ssh to work either so will have to hook up monitor/keyboard/mouse and go that route.

This does raise a question in my mind, and I'm not looking so much for technical details as I'm just curious - why is some middleware component necessary to be running correctly *just to get to a shell prompt via ssh*? So they set the root shell to something other than /bin/sh or similar? I don't get why this middleware piece impacts getting a shell prompt via ssh.....

jaywest · Oct 30, 2022

Etorix said:
With access to the console (IPMI?), you may well bring the middleware back online, and take this early Halloween scare as a reminder that it is good practice to always have a recent copy of the configuration file at hand.
If you can't reach the console at all, it should be safe to reboot (still make sure the shares aren't used for writes and cleanly shut down as many services/VMs as possible).

Due to this scare, I will most definitely be setting up a period job to scp the config file offsite, backblaze etc. Will hook up monitor in a few minutes and see if I can talk to it. Much appreciated!

jaywest · Oct 30, 2022

awasb said:
The above mentioned errors usually stem from a middleware(d) service problem.

If I were you, I'd

[1] log in locally / via serial console.
[2] do some service middlewared stop and service middlewared start.
[3] Log in via (hopefully then working) GUI and update everything to 13.0-U2. It's been long since (if at all) I've seen such a stable release. On my hardware I'd do it any time again.

So once I hooked up a monitor and keyboard, I hit enter to get the menu. Something at the very top scrolled off, but I believe it said something to the effect of 'boot-pool suspended'. That's probably not good at all; but I did get the menu. I selected 9 for shell, and it does the same thing as if I ssh'd above.... prints the message about client.py error and no shell prompt.

Is it time to power cycle and pray? (I would likely take a day or two to copy file shares off, not sure how to back up the vm's from inside them....)
smh anything else I should try first?

awasb · Oct 30, 2022

There seems to be no alternative. What could get worse?

danb35 · Oct 30, 2022

jaywest said:
So they set the root shell to something other than /bin/sh or similar?

You can set the root shell to whatever you like; I think the current default is zsh. But if there's something in one of the login files that calls the middleware for something, that would explain its involvement. And that is what's happening; .zlogin contains this:

Code:

root@freenas2[~]# cat .zlogin
if [ -f /usr/local/sbin/hactl ]; then
    /usr/local/sbin/hactl status -q
fi

cat ~/.warning

...and hactl calls the middleware.

Your system seems to be showing classic symptoms of boot device failure, for which the prescription is to reinstall on a fresh device and upload a saved copy of your config file. If you don't have a saved copy of that, once you log in to the new installation and import your pool, you should be able to pull the last of the system's automatic backups off of your pool somewhere under /var/db/system/configs-(longhexnumber)/. Upload that through the GUI and you should be back up and running.

jaywest · Oct 30, 2022

awasb said:
There seems to be no alternative. What could get worse?

well, the vm's and fileservices are at least running at the moment (one of which is public facing). If I reboot and auto-fsck and such doesn't fix it, nothing will be running so it can get worse heh. At the moment, I am copying all my drive shares off to a standalone usb drive. Those just finished. Next, I need to somehow back up the few vm's on it. One of those vm's I can't back up, as I only turned it on when actively using it so no way to turn it on but that vm wasn't super critical anyways. Once I get backups of those vm's, I'll try a power cycle and see. If it comes up I can go from there to triage, but if not I'll load up a new truenas box and import the drives.

danb35 · Oct 30, 2022

The TrueNAS boot device is separate from your data, which is by design. Assuming your pool isn't encrypted, neither your data nor your VMs are at risk.

jaywest · Oct 30, 2022

danb35 said:
You can set the root shell to whatever you like; I think the current default is zsh. But if there's something in one of the login files that calls the middleware for something, that would explain its involvement. And that is what's happening; .zlogin contains this:

Code:
root@freenas2[~]# cat .zlogin if [ -f /usr/local/sbin/hactl ]; then /usr/local/sbin/hactl status -q fi cat ~/.warning

...and hactl calls the middleware.

Your system seems to be showing classic symptoms of boot device failure, for which the prescription is to reinstall on a fresh device and upload a saved copy of your config file. If you don't have a saved copy of that, once you log in to the new installation and import your pool, you should be able to pull the last of the system's automatic backups off of your pool somewhere under /var/db/system/configs-(longhexnumber)/. Upload that through the GUI and you should be back up and running.

I would think a back-door root (toor?) login should exist that does NOT call hactl, and thus can always be ssh'd in to even if the middleware is dead. In my case the middleware failing prevents it from going on to the shell I guess. Perhaps this is not viable depending on how this 'middleware' fits into the architecture. Good to know, thanks!

I have lived through many production failures of truenas due to boot drives being usb flash drives. For this installation, I went with these Disk-On-Module "flash drives", but I was under the impression these were specially made for high/long/boot activity unlike normal flash drives. It does not seem to be the case with these, perhaps on the new/replacement machine I just need to go buy a couple regular ssd drives and drop them in the chassis along with the data drives and see if the system will install to those. This is the "boot drive" (2 of them, set up in a mirror):

Fortunately, I have a spare (and identical) server sitting on top of this machine. All I have to do is move the hard drives and boot (after installing truenas to the usb dom's). I don't have a backup of the config, but I'm hoping the spot you mention above will appear and I can go the route you mention. But maybe I should look at ssd's instead of those dom things....

Thanks to all who replied, very sincerely appreciate the direction and advice!

J

jaywest · Oct 30, 2022

danb35 said:
The TrueNAS boot device is separate from your data, which is by design. Assuming your pool isn't encrypted, neither your data nor your VMs are at risk.

Totally understand - but I do want a backup in case I run into other trouble. Protecting myself from myself is good

But there is one vm I need up immediately, so the backup of that vm will get it running somewhere else while I monkey with the truenas box in a less white-knuckled fashion!

Really appreciate your concise and useful advice danb35!

jaywest · Nov 4, 2022

I'll be doing the os reload tomorrow I think, but would like to check my process first.

Currently system is booting from two usb-dom devices on the mainboard. I plan on removing those, and replacing them with two SSD drives.
I should be able to boot from the truenas installer on a usb stick and install to those two new ssd drives.
All my hard drives are present, but I assume that the above will not touch them in any way.
After the install completes, I should be able to get networking set up.
Next I should be able to import the existing pool from the hard drives that are still present via the gui.
Last, see if I can find a backup in /var/db/system/configs-.... and if so, import that via the gui and reboot.

Can anyone advise if there are some pitfalls or holes in my steps above? Thanks in advance!

Glorious1 · Nov 4, 2022

I'm not qualified to identify pitfalls, but it just seems hard to believe that two DOM devices would fail at the same time. Was your boot pool getting scrubs on some schedule?

EDIT:
Actually, reading this post by @jgreco, perhaps only one boot device has failed, and you just need to tell your BIOS to boot from the other one.
https://www.truenas.com/community/t...t-wont-boot-video-included.104920/post-722007

jaywest · Nov 5, 2022

I tried booting from the mirrored atp32 device, no luck. I don't think my boot-pool was in any scrub schedule, not that I created anyways.

I have my two new ssd's and install stick. On an identical machine, I installed the two ssd's and booted from the install stick, the install process finished completely and correctly. I then booted off the ssd and it did its initial boot/setup fine as well. One quick question before i proceed....

Is it ok to move the two ssd's to the machine that failed at this point? Or should I reinstall truenas core from stick and then move the ssd's before their first boot?

What I'm trying to ask is - does that first boot from ssd (some things are initialized, DH, keys, etc) lock it to that machine?

danb35 · Nov 5, 2022

jaywest said:
What I'm trying to ask is - does that first boot from ssd (some things are initialized, DH, keys, etc) lock it to that machine?

Nope, you should be good to go.

joeschmuck · Nov 5, 2022

So the issue with mirrored boot drives is if the main one fails, it does not mean the second one will automatically boot up. It should have a viable copy of the boot environment and you should be able to tell the BIOS to boot from the other device. The main device could be so damaged that it would need to be removed from the system. I hope a simple reboot fixes it all. You should know that all your configuration files are on the pool as well, it's well hidden but it's there so you can recover it. Here is a link that should help.

SOLVED - RESOLVED Recover Configuration.db From Failing Boot Device. Almost There. Help!

Hi all, I did a regular update to my FreeNAS server on Friday, and afterward I could not access it. A trip to the basement server rack and a reboot showed a Grub error on the boot USB stick, a Sandisk Cruzr. I found this thread in my search for a solution...

www.truenas.com

Patrick M. Hausen · Nov 5, 2022

Did you install on both DOMs at the same time? If you added the second drive to the boot pool after installation, you will have a mirrored pool allright, but the drive will not have UEFI or FreeBSD boot loader installed. You would need to copy those manually (just once) when adding the second drive.

jaywest · Nov 5, 2022

so it came up on the two new ssd drives (well, just one of them, haven't mirrored the boot pool yet).
It also imported my pool, and I can see all the data there. Thats a very great thing. Thanks for the handholding and reassurances :D

From the ui, I click on 'shell' and cd to /var/db/system. There are two config-{longhexnumber} folders. One is dated today and has a db file in it but I assume that is the new {effectively empty} current config. The other config- directory is dated a year ago, but there are no files inside that directory at all. I assume this means for some reason, my system wasn't autocopying config files into the pool?

I guess I need to start all the configuration manually, recreating vm's and such and pointing them at their disks....

Thanks so much folks, I'll move forward!

Glorious1 · Nov 5, 2022

jaywest said:
From the ui, I click on 'shell' and cd to /var/db/system. There are two config-{longhexnumber} folders. One is dated today and has a db file in it but I assume that is the new {effectively empty} current config. The other config- directory is dated a year ago, but there are no files inside that directory at all. I assume this means for some reason, my system wasn't autocopying config files into the pool?

Drill down inside the one dated today. If you're lucky, it will have a directory for every TrueNAS version you've used, and inside each a daily config backup. No idea why there would be an empty one.

Important Announcement for the TrueNAS Community.

I have that sinking feeling.....

Dabbler

Patron

Wizard

Dabbler

Dabbler

Dabbler

Patron

Hall of Famer

Dabbler

Hall of Famer

Dabbler

Dabbler

Dabbler

Guru

Dabbler

Hall of Famer

Old Man

Hall of Famer

Dabbler

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "I have that sinking feeling....."

Similar threads