TrueNAS SCALE incorrectly reporting mixed-capacity VDEVs

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Import on the command line:

zpool import -o altroot=/mnt <poolname>

If that works post the output of

zfs list
 

Saoshen

Dabbler
Joined
Oct 13, 2023
Messages
47
This sounds eerily similar to my situation; I wonder if they are related?


 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Import on the command line:

zpool import -o altroot=/mnt <poolname>

If that works post the output of

zfs list
Tried this and it resulted in TrueNAS rebooting the entire system

And if that doesn't work, show the output of a plain zpool import first to see that it shows your Home pool correctly with all devices present, and then try zpool import -Fn Home
This also had the same result.

All drives appeared with zpool import and showed as online. Unfortunately, after using the -Fn options, the system became unbootable for that installation.

I was able to get back into the system with an old install, and below is what zpool import shows.

This is under Bluefin, so I assume that's why it reports damaged devices or data. It didn't say that under Cobia.
 

Attachments

  • 16999039233944235437003044136247.jpg (252.7 KB)

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
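From the live shell the sequence would be roughly this (assuming the pool is still named Home, as in the rest of this thread):

Code:
zpool import                        # list importable pools and their devices
zpool import -o altroot=/mnt Home   # then attempt the actual import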
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
Trying this now. Got the ISO booting to the installer; how do you get to the live CD mode?

Edit: should have waited 15 more seconds before hitting reply. Found it and trying now.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
Got FreeBSD up, performed a zpool import -f Home, and it forcibly rebooted the server after a few seconds.

Am I out of options here?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Got FreeBSD up, performed a zpool import -f Home, and it forcibly rebooted the server after a few seconds.

Am I out of options here?
The -n parameter means "do not actually import" - but a crash even with that means there's probably spacemap leakage or something ugly in the metadata. The "unsupported flags" means you can't import it on Bluefin's ZFS 2.1.x, only a ZFS 2.2 system as mentioned by Patrick.

Try importing with -FXn (X being "extreme rollback measures") and still with the "n" of "just attempt and report the result, don't actually mount it"
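In other words, something like this (dry run only, so nothing gets written):

Code:
# -F = recovery mode, -X = extreme rewind, -n = report what would happen, do not import
zpool import -FXn Home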
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
The -n parameter means "do not actually import" - but a crash even with that means there's probably spacemap leakage or something ugly in the metadata. The "unsupported flags" means you can't import it on Bluefin's ZFS 2.1.x, only a ZFS 2.2 system as mentioned by Patrick.

Try importing with -FXn (X being "extreme rollback measures") and still with the "n" of "just attempt and report the result, don't actually mount it"
Tried import with -FXn and it forced another reboot.

What are the chances this is just an HBA issue? I did get a couple random reboots prior to this during some pretty extensive use of the pool. I was rebalancing using the previously mentioned script and running badblocks on 3 new drives intended to be hotspares if they passed.

When this most recent, more serious reboot occurred, I had been doing the rebalance and was running badblocks on one drive at a time, adding one every 12-24 hours until all three were running. I know you may be thinking I ran badblocks on the wrong drives, but I assure you I checked, double-checked, and triple-checked serial numbers and drive letter assignments before each run. I also label the drives on the front of my disk shelf.
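For reference, the test on each spare was something like this, with /dev/sdX standing in for the drive under test (-w is a destructive write test, so only on drives with nothing on them):

Code:
badblocks -b 4096 -wsv /dev/sdX   # destructive write test: 4K blocks, show progress, verbose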

My next step is to move the single-drive pool that I know imports into the disk shelf, to see if it fails there as well.

Edit: I also forgot to mention that the system failed to boot after that command as well. I keep needing to boot from my Bluefin install and upgrade to Cobia to try the next option.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What are the chances this is just an HBA issue? I did get a couple random reboots prior to this during some pretty extensive use of the pool. I was rebalancing using the previously mentioned script and running badblocks on 3 new drives intended to be hotspares if they passed.
Is your HBA getting sufficient cooling? I assume from the disk count and Supermicro motherboard this is in a rackmount, but please correct me if it's not.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Is your HBA getting sufficient cooling? I assume from the disk count and Supermicro motherboard this is in a rackmount, but please correct me if it's not.
The server itself is in a Supermicro 1U chassis, and I actually recently installed 2 optional fans where the server normally ships with dummies. It's also in my basement and stays pretty cool; I've never had any heat issues.

Importing the single-disk pool from the disk shelf did not fail. This is a 45-bay Supermicro disk shelf, though, so I've swapped the newest HBA drive set to a new location to see if that helps. The slots where they were installed had mostly been used for hot spares until recently.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
I installed my HBA in a spare desktop computer, installed Cobia, and found I needed to attach at least one of my ZIL drives with an M.2 USB enclosure. This is what I get for the output of

zpool import -Fn Home

No crash this time. Should I try to import and then export the pool properly?

This is uncharted territory.
 

Attachments

  • 16999190699511664394509568064914.jpg (344.8 KB)

Brandito

Explorer
Joined
May 6, 2023
Messages
72
I tried -Fn a second time and it crashed again. I have a new 9207-8e on order, as well as some new SFF-8088 cables.

Is it likely that the HBA is failing, but is fine importing a single-drive pool and is just being pushed too hard by 24 drives?

I've swapped all the hardware I can. Would a corrupted pool cause a machine to just reboot on import like this? I'm hoping it's just the HBA
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Would a corrupted pool cause a machine to just reboot on import like this? I'm hoping it's just the HBA
It's possible that it's failing on import due to corrupted metadata; I would have hoped it would be polite enough to at least put the reason for failure into a kernel panic dump.

Thought - what if you disconnect the M.2 log vdev and import with only -m (for "missing log device")?
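Something along these lines, assuming the pool name is still Home and the log device is physically disconnected:

Code:
# -m allows import with a missing log vdev; altroot keeps any mounts under /mnt
zpool import -m -o altroot=/mnt Home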
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
It's possible that it's failing on import due to corrupted metadata; I would have hoped it would be polite enough to at least put the reason for failure into a kernel panic dump.

Thought - what if you disconnect the M.2 log vdev and import with only -m (for "missing log device")?
Tried to import with the log vdev disconnected using -m. I had the console up in iKVM, caught the quickest peek at an error, and managed to get a quick snip of it before the machine rebooted again.

One other thing I tried was to remove a drive from the latest vdev. The reason being: when attempting to fix the issue in my initial post, I had offlined this particular drive to see if the swap partition would be added upon resilver. That didn't happen, but the resilver seemed odd.

I monitored it with zpool status, and it was reporting a fairly quick resilver that would take about an hour. The WebUI was reporting a day to resilver. In the terminal, near the very end, I believe I saw it at 110% complete. I let it do its thing and received an email alert saying the resilver finished after only an hour or two. No errors reported.

I should have run a scrub immediately after this, but I didn't. I would think that if the resilver had failed, it would affect only that drive, and I'd have enough replicas to fix it now?
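For reference, the monitoring and the scrub I should have run would be something like this (pool name Home):

Code:
zpool status -v Home   # resilver progress and any per-device errors
zpool scrub Home       # full scrub to verify every replica after the resilver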

I also only recently realized there are bootleg LSI devices out there? Looking at mine, that could be the case. It came with a solid PCIe bracket instead of the perforated bracket, and it's pretty cheap metal. The heat sink also looked like a cheaper version of what I've seen in pictures. I'm hopeful that a replacement 9207-8e is the fix. When I had it slotted into my spare rig for testing, it got pretty hot, though that case isn't designed for cooling anything but the CPU.
 

Attachments

  • zfs-snip.png (127.7 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Tried to import with the log vdev disconnected using -m. I had the console up in iKVM, caught the quickest peek at an error, and managed to get a quick snip of it before the machine rebooted again.
Failure to read the log is expected when the log device is missing, so that error isn't surprising.

Keep us posted on the inbound replacement HBA - hopefully that resolves things, but if there are still issues with pool health we may have to dig further back in the pool history and/or enable some debug-level flags to disable metadata checks on import.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
First, I'd like to thank you guys for helping me try to figure this out!

Second, I've run all the drives through a long SMART test while killing time waiting for the replacement HBA and cables. All passed. I also monitor all of my drives with Scrutiny, which compares against Backblaze data, and Scrutiny doesn't flag any issues with the drives themselves.
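The per-disk checks were along these lines, with /dev/sdX standing in for each drive:

Code:
smartctl -t long /dev/sdX   # start the long self-test
smartctl -a /dev/sdX        # review attributes and test results once it completes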

I have two HBAs on the way, both 9207-8e; one should arrive today. Hoping to have better news to report back.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Less than great news. Got the new HBA and cables and I still have the same issue with TrueNAS force rebooting when trying to import the pool.

I tried another thing: I had three 4 TB drives that were a former ZFS pool on my Proxmox Backup Server, so I slotted them in, and TrueNAS imported that pool with no problem.

What are my next steps here guys?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
All right, let's see if we can take a stab at some pool necromancy then.

Let's start with
Code:
# read the ZFS label from partition 2 of each disk and print its name/txg lines
for n in {0..X}; do
    zdb -l "/dev/da${n}p2" | grep 'name\|txg'
done
where X is the highest disk number (so if you have 24 of them, it would be 24, and it will also catch your boot device da0).
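That should print the pool name and the transaction group (txg) recorded in each device's label, so we can check whether the labels on all the disks agree.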
 