"This is a FreeNAS data disk and can not boot system. System halted" after every update

danno

Dabbler
Joined
Sep 3, 2012
Messages
22
I searched but didn't see my particular flavor of this issue mentioned yet in the forum. Sorry if it duplicates another post.
For many years now, I have been having this problem and decided to finally post about it.
It just happened today when I updated from TrueNAS-12.0-U5 to TrueNAS-12.0-U6, but also happened when I was running FreeNAS (11).

Here are basic system details:
PC is Lenovo Thinkserver TS140
Boot disk is a single 120 GB SSD
Server has 3 data drives (2 TB each) configured in RAIDZ1
All 3 data drives are excluded from the boot sequence in the BIOS

That last statement above is key, because the problem is that after each update is applied using the
System - Update option in the GUI, the subsequent reboot gets hung up indefinitely with the subject error:
"This is a FreeNAS data disk and can not boot system. System halted"

The only way I can get the update to complete is to reboot the server (without changing any BIOS settings),
after which the update completes and the system boots successfully into the new version of TrueNAS.
A few times, I have gone into the BIOS right after the error appears to confirm that the 3 data drives are
still excluded from the boot sequence (they are), and the only drive in the boot sequence is the 120 GB SSD (as expected).

I would like to be able to run an update all the way to completion without having to manually intervene
each time. Hoping there is an easy fix or workaround that I just haven't discovered yet.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
For some reason, your BIOS has one of the data disks listed ahead of your boot disk. On my server, I exclude all the data disks from the boot order, and only list the boot disks.
 

danno

Thanks for your reply. As I stated in my original post (the underlined statement), all data disks are excluded from the boot sequence, and always have been.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
danno said: Thanks for your reply. As I stated in my original post (the underlined statement), all data disks are excluded from the boot sequence, and always have been.

So it sounds like it's a buggy BIOS, and for some reason it's trying to boot off one of those devices anyways. This isn't unheard-of in the PC world. I've seen platforms where this happens every other reboot. By the way, what happens when you reboot? Does it work correctly?
 

Samuel Tai

Also, it's worth replacing your motherboard battery, as it doesn't seem like it's able to hold the boot order properly.
 

danno

Sorry, but I have to disagree with the assessment that the BIOS is to blame. Here are the reasons I think the evidence points more toward a TrueNAS root cause:
  1. As I previously mentioned, I have checked the BIOS settings when the error happens, and they are set correctly: The TrueNAS SSD is the only device in the boot sequence, and the data disks are excluded.
  2. When the error occurs upon TrueNAS rebooting after an update, I only have to invoke a reboot using CTRL + ALT + DEL (with no other changes made) and that subsequent reboot allows TrueNAS to continue updating or boot into the new TrueNAS version successfully.
  3. Whenever I reboot the TrueNAS server using the "Restart" option in the GUI, it does so successfully and does not encounter this error.
  4. I have other Thinkserver TS140 servers with a similar configuration (single bootable SSD, several data drives) running Ubuntu and Linux Mint, and this has never happened with any of them. However, it happens every time with the TrueNAS server when an update occurs.
  5. If the BIOS were erroneously trying to boot from one of the data drives as is being claimed, then the error message would be the standard "No Boot OS Found" error I see with Lenovo devices. The error message in this case seems to be generated by the TrueNAS system, and there really shouldn't be any TrueNAS (FreeBSD) code on the data drives.
 

jgreco

Yet it's the BIOS that controls what drive gets booted from.

And by your own admission, hitting CTRL-ALT-DEL makes the system boot correctly.

So here's the problem you need to think about.

The big thing that FreeNAS *could* do is to corrupt the boot loader on its boot SSD. But if that happened, then it would seem like it would fail every time and not magically fix itself with a CTRL-ALT-DEL.

Now, the bit you're wrong about in #5 is that FreeNAS does install a trivial boot sector on all its data disks, which prints out "This is a FreeNAS data disk and can not boot system" or something like that. This isn't hard to do, and provides a clue to someone with a misconfiguration as to what is going on.

However, that code does not do anything other than print an error message. This code comes from /boot/pmbr-datadisk; you can inspect it yourself if you wish.

When you run this code, literally nothing happens except that the machine prints a message and halts. No settings are changed by the MBR code, therefore, really, the same exact boot sequence should happen AGAIN next time. But you're saying it doesn't.

That means that the BIOS is involved, because it is the only significant bit of code running, and is the only thing that is making choices about where to boot from.

And this is clearly a thing that is limited to your platform, because we're not hearing about it for every other system out there. This also points strongly in the direction of something your BIOS is doing.

So I can tell you that I have a general suspicion about the nature of the thing that's going on. Some enterprise systems have various types of hooks, such as Dell's DRAC and lifecycle manager, that allow things to happen within the OS, which in turn cause the BIOS to run some special hook during a reboot. This allows a server to do an "update" that reboots out to a virtual floppy, maybe does a firmware update on a critical system that can't be updated while an OS is running, and then reboots into the OS. Enterprise PCs are filled with this kind of crap, and even Windows itself plays along. Look at stuff like Absolute Persistence and Computrace. Anyways, a lot of these things work through nonstandard mechanisms, and it seems like something specific about your upgrade path is triggering the BIOS to temporarily boot from a drive that would not normally be eligible for boot.
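On the pmbr-datadisk point above: if you would rather check what is actually sitting in sector 0 of a data disk than take my word for it, here is a minimal Python sketch. It reads the first 512 bytes of a disk (or an image of one), then reports whether the sector ends in the standard 0x55 0xAA boot signature and whether it contains the halt message. The function names and the message text are assumptions for illustration (the text is taken from the thread title, and the real stub's string may differ slightly). It only reads, never writes.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # standard MBR signature at byte offsets 510-511
# Assumed text, copied from the error in the thread title; the actual
# string embedded in the stub may differ slightly.
HALT_MESSAGE = b"FreeNAS data disk"


def read_first_sector(path):
    """Return the first 512 bytes of a disk device or disk image file."""
    with open(path, "rb") as disk:
        return disk.read(SECTOR_SIZE)


def classify_sector(sector):
    """Report what a sector 0 dump looks like, without modifying anything."""
    return {
        "has_boot_signature": (
            len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE
        ),
        "has_halt_message": HALT_MESSAGE in sector,
    }
```

Usage would be something like `classify_sector(read_first_sector("/dev/ada0"))` run as root, where `/dev/ada0` is just an example device name. A data disk carrying the stub should report both values as true, which is consistent with the message coming from the disk's own boot sector rather than from a running TrueNAS.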
 

danno

I appreciate your lengthy reply, and as an IT professional with over 30 years of experience, I am well aware of how BIOS (and newer UEFI) technology normally works. However, the fact remains that a reboot initiated using the TrueNAS GUI does not encounter the error, nor does the 'soft' reboot using CTRL + ALT + DEL when the error-after-update occurs (nor does it occur if the update continues with additional processing, as it did with the most recent TrueNAS-12.0-U6 update, and then initiates yet another reboot...that one occurs successfully as well). So, there seems to be something different and unique (and problematic) about how the reboot is initiated during the FreeNAS update process which is causing the issue. If the root cause could be attributed to the system board BIOS, it should then occur with every reboot.

I don't know why this isn't more widely encountered, but the hardware in my situation may be unique enough to bring out a bug in the code that isn't encountered in other installations.
 

danno

One other thing I failed to mention: With the rainy summer here this year, there were a bunch of times the power went out and the TrueNAS box booted right back up successfully each time when power was restored (and sent me an E-mail alert as expected) since I don't have it connected to a UPS. That, again, seems to further eliminate the BIOS as suspect, including the possibility of a bad CMOS battery which would almost certainly have caused the bootup failure in those instances.
 

Samuel Tai

Check the partition layout of each of your data disks. If there's an EFI system partition on any of them, the updater can get confused and add boot entries on the wrong drive. You can do this via gpart show. Only your boot disk should have a partition labeled efi. Your pool disks should only have a swap partition and a ZFS partition.
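If you have a lot of disks and don't want to eyeball the listing, the check can be sketched in a few lines of Python. This is only a rough sketch that assumes the usual column layout of gpart show output (a "=>" header line per disk, then one line per partition with start, size, index and type). The function names and the sample text are made up for illustration.

```python
# Made-up sample in the usual `gpart show` layout: one pool disk, one boot disk.
SAMPLE_OUTPUT = """\
=>        40  3907029088  ada0  GPT  (1.8T)
          40          88        - free -  (44K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834696     2  freebsd-zfs  (1.8T)

=>       40  234441568  ada1  GPT  (112G)
         40       1024     1  freebsd-boot  (512K)
       1064  234440544     2  freebsd-zfs  (112G)
"""


def partition_types_by_disk(gpart_output):
    """Map each disk name to the list of partition types found on it."""
    disks = {}
    current = None
    for line in gpart_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "=>":
            # Header line: => start size name scheme (size)
            current = fields[3]
            disks[current] = []
        elif current is not None and len(fields) >= 4 and fields[2].isdigit():
            # Partition line: start size index type (size)
            disks[current].append(fields[3])
        # "- free -" lines have no numeric index and are skipped.
    return disks


def disks_with_type(disks, partition_type):
    """Return the names of disks that carry a partition of the given type."""
    return [name for name, types in disks.items() if partition_type in types]
```

Feed it the text of gpart show and look at `disks_with_type(disks, "efi")`. Any pool disk that shows up in that list is worth a closer look; an empty list means no disk has an efi partition at all.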
 

danno

Thanks for your reply. I'm not as familiar with FreeBSD partitioning as with Debian derivative systems, so am posting the command output for you:
Code:
root@freenas:~ # gpart show
=>        40  3907029088  ada0  GPT  (1.8T)
          40          88        - free -  (44K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834696     2  freebsd-zfs  (1.8T)

=>        40  3907029088  ada1  GPT  (1.8T)
          40          88        - free -  (44K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834696     2  freebsd-zfs  (1.8T)

=>        40  3907029088  ada2  GPT  (1.8T)
          40          88        - free -  (44K)
         128     4194304     1  freebsd-swap  (2.0G)
     4194432  3902834696     2  freebsd-zfs  (1.8T)

=>        40  234441568  ada3  GPT  (112G)
          40       1024     1  freebsd-boot  (512K)
        1064  234440544     2  freebsd-zfs  (112G)


I am using "legacy" BIOS instead of EFI, which I believe is reflected in the output (no EFI partitions present). Let me know if this is not correct, or you would like other output included.
 

Samuel Tai

Try moving your drives around so your boot drive is at ada0. Legacy BIOSes often default to the first disk for boot. This won't affect your pool, as ZFS uses drive GUIDs to track VDEV members, not physical ports.
 

danno

But I don't understand how that would happen since the 3 data drives are excluded from the boot sequence in the BIOS. I've confirmed this multiple times, and the only time the error seems to happen is when the reboot is invoked by an update process...never from a GUI-initiated reboot, power-outage forced reboot, or soft-reboot (using CTRL + ALT + DEL) used to get beyond this "system halted" error.

Every time I've replaced a failed drive within TrueNAS, the "ada" suffix (#) is auto-generated. How can that be dictated so the boot drive is made to be ada0?
 

Samuel Tai

Look on your motherboard. The lowest numbered port is usually ada0.
 

danno

I see - I thought FreeBSD might have a numbering scheme that was separate from the port number on the motherboard connector / port.
I can try to investigate this and try changing the connector used by the boot SSD, but I'm still interested in whether this might be a defect in the TrueNAS code worth pursuing (I don't believe the BIOS boot sequence configuration should ever be allowed to be overridden).

I think the evidence is strongest that it is a TrueNAS root cause as opposed to some sort of defect with the BIOS code, since my other ThinkServer TS140 machines are not suffering at all from this phenomenon (as I mentioned earlier in this thread), and there is nothing I have found in the Lenovo Forums indicating other users with the same motherboard are having anything similar happen. Whether I test it deliberately (reverting to an earlier version of TrueNAS and going through the update process again) or wait for the next actual update, next time I plan to be at the console to observe whether and how the update process performs a 'normal' reboot, instead of just invoking it from the GUI and then going to the server to do the 'soft' boot workaround.
 

Samuel Tai

Your hypothesis is faulty. TrueNAS isn't running when this happens. This is strictly a BIOS quirk and/or timing issue.
 

jgreco

Samuel Tai said: Your hypothesis is faulty. TrueNAS isn't running when this happens. This is strictly a BIOS quirk and/or timing issue.

Obviously, but I said that a bunch of posts back, including theorizing about what might be going on, and nevertheless it was dismissed. Sigh.
 

danno

I am just going by what the data / evidence appear to be showing, and there can definitely be something to how TrueNAS is invoking the reboot (presumably by invoking ACPI protocols) that could be at play. I am now determined more than before to find the root cause instead of trying to work around it. There is a bug here, whether it is with TrueNAS code or the Lenovo BIOS / hardware. If someone can offer information about where TrueNAS might offer diagnostic logging data specific to update processing, it should be very easy to investigate since the most recent entries will most likely have the most helpful data.

One possibility is that the update process is using a different protocol for invoking the reboot compared to the other instances I previously mentioned that do not have the issue (CTRL + ALT + DEL, power outage, "Restart" option in the TrueNAS GUI). The Thinkserver BIOS config has a "Primary Boot Sequence", which is supposed to be used when the machine is "powered on or rebooted" (according to the Lenovo manual), and it also has an "Automatic Boot Sequence", which is supposed to be used when "a communications device wakes up the system (for example, Wake on LAN)". I have to admit I don't remember if I checked the "Automatic Boot Sequence" settings, since that should NOT be involved here. Again, just a possible root cause, but either way would most likely be a bug in TrueNAS (I don't imagine a "wake" type protocol should be used with an update reboot) or with the Lenovo BIOS (if TrueNAS is using a protocol that should trigger the "Primary Boot Sequence", which again excludes ALL of the data drives).
 

danno

Update: I just went into the BIOS and updated the other 2 boot sequences, "Automatic" and "Error", to remove the data drives from those (and confirmed they are still excluded in the "Primary" entry). This is speculation until I have a chance to either test or go through another actual update, but if this helps avoid the "system halted" error, then something definitely changed in TrueNAS, as it had been working fine without my ever having to investigate or adjust those other BIOS settings (I sort of wish I had opened an issue at the time instead of working around it; I can't remember for certain, but I believe this behavior started sometime during the FreeNAS 9.x versions).

My plan now is to wait until I either have time to manually test with the new BIOS settings, or when the next normal update is available, and will post back with my findings.
 

Samuel Tai

Now that you mention having multiple boot sequences in your BIOS, I think I know what's going on. Your BIOS likely has a timer: a reboot that occurs shortly after the previous boot, before the timer expires, is treated as a failed boot and triggers the error boot sequence. TrueNAS updates usually complete within 60 seconds, which is likely shorter than your BIOS timer.
 