TrueNAS SCALE incorrectly reporting mixed-capacity VDEVs

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Import on the command line:

zpool import -o altroot=/mnt <poolname>

If that works post the output of

zfs list
 

Saoshen

Dabbler
Joined
Oct 13, 2023
Messages
47
This sounds eerily similar to my situation; I wonder if they are related?


 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Import on the command line:

zpool import -o altroot=/mnt <poolname>

If that works post the output of

zfs list
Tried this and it resulted in TrueNAS rebooting the entire system

And if that doesn't work, show the output of a plain zpool import first to see that it shows your Home pool correctly with all devices present, and then try zpool import -Fn Home
This also had the same result.

All drives appeared with zpool import and showed as online. Unfortunately, after using the -Fn options, the system became unbootable for that installation.

I was able to get back into the system with an old install, and below is what zpool import shows.

This is under Bluefin, so I assume that's why it reports damaged devices or data. It didn't say that under Cobia.
 

Attachments

  • 16999039233944235437003044136247.jpg (252.7 KB)

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
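From the live shell the sequence would be roughly this (assuming the pool is still named Home, as in the rest of this thread):

Code:
zpool import                        # list importable pools and their devices
zpool import -o altroot=/mnt Home   # then attempt the actual import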
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
Trying this now. Got the ISO booting to the installer; how do you get to the live CD mode?

Edit: should have waited 15 more seconds before hitting reply. Found it and trying now.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
FreeBSD 14 has ZFS 2.2 - you could try a FreeBSD 14 RC4 ISO and the "live CD" feature to import and check the pool.
Got FreeBSD up, performed a zpool import -f Home, and it forcibly rebooted the server after a few seconds.

Am I out of options here?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Got FreeBSD up, performed a zpool import -f Home, and it forcibly rebooted the server after a few seconds.

Am I out of options here?
The -n parameter means "do not actually import" - but a crash even with that means there's probably spacemap leakage or something ugly in the metadata. The "unsupported flags" means you can't import it on Bluefin's ZFS 2.1.x, only a ZFS 2.2 system as mentioned by Patrick.

Try importing with -FXn (X being "extreme rollback measures") and still with the "n" of "just attempt and report the result, don't actually mount it"
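In other words, something like this (dry run only, so nothing gets written):

Code:
# -F = recovery mode, -X = extreme rewind, -n = report what would happen, do not import
zpool import -FXn Home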
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
The -n parameter means "do not actually import" - but a crash even with that means there's probably spacemap leakage or something ugly in the metadata. The "unsupported flags" means you can't import it on Bluefin's ZFS 2.1.x, only a ZFS 2.2 system as mentioned by Patrick.

Try importing with -FXn (X being "extreme rollback measures") and still with the "n" of "just attempt and report the result, don't actually mount it"
Tried import with -FXn and it forced another reboot.

What are the chances this is just an HBA issue? I did get a couple random reboots prior to this during some pretty extensive use of the pool. I was rebalancing using the previously mentioned script and running badblocks on 3 new drives intended to be hotspares if they passed.

When this most recent, more serious reboot occurred, I had been doing the rebalance and was running badblocks on one drive at a time, adding one every 12-24 hours until all three were running. I know you may be thinking I ran badblocks on the wrong drives, but I assure you I checked, double-checked, and triple-checked serial numbers and drive letter assignments before each run. I also label the drives on the front of my disk shelf.
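For reference, the test on each spare was something like this, with /dev/sdX standing in for the drive under test (-w is a destructive write test, so only on drives with nothing on them):

Code:
badblocks -b 4096 -wsv /dev/sdX   # destructive write test: 4K blocks, show progress, verbose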

My next step is to move the single-drive pool that I know imports into the disk shelf, to see if it fails there as well.

Edit: I also forgot to mention that the system failed to boot after that command as well. I keep needing to boot from my Bluefin install and upgrade to Cobia to try the next option.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
What are the chances this is just an HBA issue? I did get a couple random reboots prior to this during some pretty extensive use of the pool. I was rebalancing using the previously mentioned script and running badblocks on 3 new drives intended to be hotspares if they passed.
Is your HBA getting sufficient cooling? I assume from the disk count and Supermicro motherboard this is in a rackmount, but please correct me if it's not.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Is your HBA getting sufficient cooling? I assume from the disk count and Supermicro motherboard this is in a rackmount, but please correct me if it's not.
The server itself is in a Supermicro 1U chassis, and I actually recently installed 2 optional fans where the server normally ships with dummies. It's also in my basement and stays pretty cool; I've never had any heat issues.

Importing the single-disk pool from the disk shelf did not fail. This is a 45-bay Supermicro disk shelf, though, so I've swapped the newest HBA drive set to a new location to see if that helps. The slots where they were installed had mostly been used for hot spares until recently.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
I installed my HBA in a spare desktop computer, installed Cobia, and found I needed to attach at least one of my ZIL drives with an M.2 USB enclosure. This is what I get for the output of

zpool import -Fn Home

No crash this time. Should I try to import and then export the pool properly?

This is uncharted territory.
 

Attachments

  • 16999190699511664394509568064914.jpg (344.8 KB)

Brandito

Explorer
Joined
May 6, 2023
Messages
72
I tried -Fn a second time and it crashed again. I have a new 9207-8e on order, as well as some new SFF-8088 cables.

Is it likely that the HBA is failing, but is fine importing a single-drive pool and is just being pushed too hard by 24 drives?

I've swapped all the hardware I can. Would a corrupted pool cause a machine to just reboot on import like this? I'm hoping it's just the HBA
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Would a corrupted pool cause a machine to just reboot on import like this? I'm hoping it's just the HBA
It's possible that it's failing on import due to corrupted metadata; I would have hoped it would be polite enough to at least put the reason for failure into a kernel panic dump.

Thought - what if you disconnect the M.2 log vdev and import with only -m (for "missing log device")?
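Something along these lines, assuming the pool name is still Home and the log device is physically disconnected:

Code:
# -m allows import with a missing log vdev; altroot keeps any mounts under /mnt
zpool import -m -o altroot=/mnt Home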
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
It's possible that it's failing on import due to corrupted metadata; I would have hoped it would be polite enough to at least put the reason for failure into a kernel panic dump.

Thought - what if you disconnect the M.2 log vdev and import with only -m (for "missing log device")?
Tried to import with the log vdev disconnected using -m. I had the console up in iKVM, caught the quickest peek at an error, and managed to get a quick snip of it before the machine rebooted again.

One other thing I tried was to remove a drive from the latest vdev. The reason being: when attempting to fix the issue in my initial post, I had offlined this particular drive to see if the swap partition would be added upon resilver. That didn't happen, but the resilver seemed odd.

I monitored it with zpool status, and it was reporting a fairly quick resilver that would take about an hour. The WebUI was reporting a day to resilver. In the terminal, near the very end, I believe I saw it at 110% complete. I let it do its thing and received an email alert saying the resilver finished after only an hour or two. No errors reported.

I should have run a scrub immediately after this, but I didn't. I would think that if the resilver had failed, it would affect only that drive, and I'd have enough replicas to fix it now?
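For reference, the monitoring and the scrub I should have run would be something like this (pool name Home):

Code:
zpool status -v Home   # resilver progress and any per-device errors
zpool scrub Home       # full scrub to verify every replica after the resilver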

I also only recently realized there are bootleg LSI devices out there? Looking at mine, that could be the case. It came with a solid PCIe bracket instead of the perforated bracket, and it's pretty cheap metal. The heat sink also looked like a cheaper version of what I've seen in pictures. I'm hopeful that a replacement 9207-8e is the fix. When I had it slotted into my spare rig for testing, it got pretty hot, though that case isn't designed for cooling anything but the CPU.
 

Attachments

  • zfs-snip.png (127.7 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Tried to import with the log vdev disconnected using -m. I had the console up in iKVM, caught the quickest peek at an error, and managed to get a quick snip of it before the machine rebooted again.
Failure to read the log is expected when the log device is missing, so that error isn't surprising.

Keep us posted on the inbound replacement HBA - hopefully that resolves things, but if there are still issues with pool health we may have to dig further back in the pool history and/or enable some debug-level flags to disable metadata checks on import.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
First, I'd like to thank you guys for helping me try to figure this out!

Second, I've run all the drives through a long SMART test while killing time waiting for the replacement HBA and cables. All passed. I also monitor all of my drives with Scrutiny, which compares against Backblaze data, and Scrutiny doesn't flag any issues with the drives themselves.
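The per-disk checks were along these lines, with /dev/sdX standing in for each drive:

Code:
smartctl -t long /dev/sdX   # start the long self-test
smartctl -a /dev/sdX        # review attributes and test results once it completes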

I have two HBAs on the way, both 9207-8e; one should arrive today. Hoping to have better news to report back.
 

Brandito

Explorer
Joined
May 6, 2023
Messages
72
Less than great news. Got the new HBA and cables and I still have the same issue with TrueNAS force rebooting when trying to import the pool.

I tried another thing: I had three 4 TB drives that were a former ZFS pool on my Proxmox Backup Server, so I slotted them in, and TrueNAS imported that pool with no problem.

What are my next steps here guys?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
All right, let's see if we can take a stab at some pool necromancy then.

Let's start with
Code:
# read the ZFS label from partition 2 of each disk and print its name/txg lines
for n in {0..X}; do
    zdb -l "/dev/da${n}p2" | grep 'name\|txg'
done
where X is the highest disk number (so if you have 24 of them, it would be 24, and it will also catch your boot device da0).
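That should print the pool name and the transaction group (txg) recorded in each device's label, so we can check whether the labels on all the disks agree.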
 