Reboot produces GPT Table Corrupt or Invalid

Todd Marimon · Dec 31, 2013

I am currently in the "playing around with" mode for FreeNAS (and ESXi), so no data is at risk here, but I still ultimately want to put this setup into production. So far, I haven't gained confidence that it will work, though.

I have just installed FreeNAS 9.2.0 on ESXi with the following stats:

4 vCPU
LSI M1015 (IT Mode) Passthrough to VM
12GB of RAM (Host has 24GB)
8GB HDD
2x 750GB Samsung HD753LJ SATA HDDs

I did not have the following problem with FreeNAS 9.1.0 when I was playing with that prior to realizing 9.2.0 was out.

Every time I create a new zfs volume on the 750GB drives, everything works perfectly. I can create zvols and map them via iSCSI and I'm happy. But then.... I reboot the VM. Upon start-up, I see errors scroll by for the 2 drives:

Code:

Dec 31 15:35:22 freenas kernel: da1 at mps0 bus 0 scbus3 target 2 lun 0
Dec 31 15:35:22 freenas kernel: da1: <ATA SAMSUNG HD753LJ 1112> Fixed Direct Access SCSI-6 device 
Dec 31 15:35:22 freenas kernel: da1: 300.000MB/s transfers
Dec 31 15:35:22 freenas kernel: da1: Command Queueing enabled
Dec 31 15:35:22 freenas kernel: da1: 715404MB (1465149168 512 byte sectors: 255H 63S/T 91201C)
Dec 31 15:35:22 freenas kernel: da2 at mps0 bus 0 scbus3 target 3 lun 0
Dec 31 15:35:22 freenas kernel: da2: <ATA SAMSUNG HD753LJ 1112> Fixed Direct Access SCSI-6 device 
Dec 31 15:35:22 freenas kernel: da2: 300.000MB/s transfers
Dec 31 15:35:22 freenas kernel: da2: Command Queueing enabled
Dec 31 15:35:22 freenas kernel: da2: 715404MB (1465149168 512 byte sectors: 255H 63S/T 91201C)
Dec 31 15:35:22 freenas kernel: GEOM: da1: the secondary GPT table is corrupt or invalid.
Dec 31 15:35:22 freenas kernel: GEOM: da1: using the primary only -- recovery suggested.
Dec 31 15:35:22 freenas kernel: GEOM: da2: the secondary GPT table is corrupt or invalid.
Dec 31 15:35:22 freenas kernel: GEOM: da2: using the primary only -- recovery suggested.
Dec 31 15:35:22 freenas kernel: GEOM_MIRROR: Device mirror/system launched (2/2).
Dec 31 15:35:22 freenas kernel: GEOM: mirror/system: corrupt or invalid GPT detected.
Dec 31 15:35:22 freenas kernel: GEOM: mirror/system: GPT rejected -- may not be recoverable.
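
For readers wondering where the "secondary GPT table" in these messages lives: in the standard GPT layout the primary header sits at LBA 1 and the backup header occupies the last sector of the disk. The sketch below works that out from the sector count in the da1/da2 lines above. The function name and the returned keys are made up for illustration, and the 512-byte sector size is simply what the kernel reported.

```python
# Rough arithmetic sketch: where the GPT headers sit on a disk of this size.
# gpt_layout() and its dictionary keys are hypothetical names, not a FreeBSD API.

SECTOR_BYTES = 512            # sector size reported in the kernel log
TOTAL_SECTORS = 1465149168    # sector count reported for da1 and da2


def gpt_layout(total_sectors, sector_bytes=SECTOR_BYTES):
    """Return the size in MiB and the LBAs/offsets of both GPT headers."""
    backup_lba = total_sectors - 1  # backup header is the last sector on the disk
    return {
        "size_mib": total_sectors * sector_bytes // (1024 * 1024),
        "primary_header_lba": 1,    # LBA 0 holds the protective MBR
        "backup_header_lba": backup_lba,
        "backup_header_offset": backup_lba * sector_bytes,
    }


layout = gpt_layout(TOTAL_SECTORS)
```

The computed size matches the 715404MB in the log. Anything that overwrites the very end of the disk lands on the backup header.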

I have tried the following, without meaningful success:

Followed in http://forums.freenas.org/threads/gpt-table-is-corrupt-or-invalid-error-on-bootup.12171/ to set 'sysctl vfs.zfs.vdev.larger_ashift_minimal=0'
Disabling the default Swap of 2G on each drive
Zeroing out both drives entirely (had to set 'sysctl kern.geom.debugflags=16')
Not rebooting (obviously, this one is not a feasible solution, but it works!)

Every time I do these various things, as soon as I reboot, I get the same result-- no more ZFS volume.
I'm kind of at a loss with this one-- I have no idea what is wrong. I have a feeling it is something to do with the 512B vs 4KB sectors, but why is this suddenly an issue? And what can I do to troubleshoot it? I've tried running 'gpart recover' commands, but they do not work:

Code:

[root@freenas] ~# gpart recover /dev/da1
gpart: arg0 'da1': invalid argument

Any help on this would be greatly appreciated.

cyberjock · Dec 31, 2013

Here's a $1,000,000 question.

Are you virtualizing your disks or are you using PCI passthrough?

cyberjock · Dec 31, 2013

What motherboard are you using? Can you post your hardware makes/models?

One of the problems with VT-d is that it is very finicky and very temperamental. If you are using a board that isn't from one of the big companies that knows how to properly use VT-d technology, it can blow up in your face. I had a high-end Gigabyte motherboard that had VT-d. It didn't work right. Switched to a Supermicro motherboard and the problems went away.

cyberjock · Dec 31, 2013

Ok.. So I just deleted your post because 2 appeared. I deleted the extra, but now both are gone.

In short, the OP said that he did use PCIe passthrough.

Here's a paste:

Oh, I should have made that more clear...

The disks are attached to the LSI HBA card. So they are via the passthrough PCIe card.

Todd Marimon · Dec 31, 2013

cyberjock said:
What motherboard are you using? Can you post your hardware makes/models?

One of the problems with VT-d is that it is very finicky and very temperamental. If you are using a board that isn't from one of the big companies that knows how to properly use VT-d technology, it can blow up in your face. I had a high-end Gigabyte motherboard that had VT-d. It didn't work right. Switched to a Supermicro motherboard and the problems went away.

The VMware host is a Tyan S7012 with a single Xeon L5520 with 24GB of DDR3 ECC (Might pick up a second down the road)

cyberjock · Dec 31, 2013

Well, Tyan is a "well known" company. Some experienced people here have warned against using them. I used one for a desktop around 2005 and it had odd quirks. I used it for less than 3 months because I got tired of it acting up.

Honestly, I don't have a smoking gun for you to check.

I'd do the normal stuff: update your BIOS, check your BIOS settings, and make sure you are using p14 firmware on your M1015 (since the FreeNAS driver is p14). Other than that I have no other recommendations.

I can tell you that on my Supermicro system it has worked flawlessly and still is. :(

Todd Marimon · Dec 31, 2013

cyberjock said:
Well, Tyan is a "well known" company. Some experienced people here have warned against using them. I used one for a desktop around 2005 and it had odd quirks. I used it for less than 3 months because I got tired of it acting up.

Honestly, I don't have a smoking gun for you to check.

I'd do the normal stuff: update your BIOS, check your BIOS settings, and make sure you are using p14 firmware on your M1015 (since the FreeNAS driver is p14). Other than that I have no other recommendations.

I can tell you that on my Supermicro system it has worked flawlessly and still is. :(

BIOS is up to date.

Should I try installing FreeNAS on the raw hardware to eliminate possible VT-d/ESXi oddness? I'm going to also re-try 9.1.0 to verify it worked correctly.

cyberjock · Dec 31, 2013

That's what I would do. I'd be willing to bet good money the problem will magically go away when ESXi goes away. :(

jgreco · Dec 31, 2013

Is it just me or do we seem to see a lot of LGA1366 boards with what appear to be VT-d related problems? (your Gigabyte, someone else recently, now this)

Todd Marimon · Dec 31, 2013

jgreco said:
Is it just me or do we seem to see a lot of LGA1366 boards with what appear to be VT-d related problems? (your Gigabyte, someone else recently, now this)

Well, I wouldn't write it off just yet-- Honestly, I was using 9.1.0 for days in a VM and didn't have this problem (but I will admit, I had lots of other problems, so it's possible somehow I didn't realize it-- I don't know how I would miss it, though).

I reverted my VM to 9.1.0 to see if that would work... and I have to eat my hat because the same thing happened with that version, too. So, I'm currently doing as I said before and installing FreeNAS 9.2.0 on the raw hardware. I'll report back as soon as that is done.

It is just very odd to me that FreeNAS works on the very first boot after install, but then as soon as you reboot it, things hit the fan.

Todd Marimon · Dec 31, 2013

Well... Installed FreeNAS on the raw hardware, followed the same procedure for my initial configuration, ending in creating a volume, then a nested zvol on that. Reboot, then I still get:

Code:

The volume store1 (ZFS) status is UNKNOWN

So, I don't think this is a VMware issue.

How do I know what "P" my M1015 is running? I got the firmware from this post: http://forums.laptopvideo2go.com/topic/29059-sas2008-lsi92409211-firmware-files/

I'm also running the latest BIOS. Prior to installing ESXi I had to upgrade the BIOS in order to get the installer to even work. I might need to revert the BIOS (I'm not sure what was on it previously, though), or locate the one problematic setting. This is definitely not good in my mind, though.

One other thought, this was definitely working just fine for me before messing with ESXi-- so on the older BIOS on FreeNAS 9.1.0, I did not have any problems like this.

Suggestions? I almost want to fill the drives with random data, checksum, then reboot, re-checksum and compare. I also want to try other OSes to see if I can even blame FreeNAS.

jgreco · Dec 31, 2013

Run DBAN in drive zeroing mode to be safe...?

cyberjock · Dec 31, 2013

If you boot up FreeNAS and run 'dmesg | grep mps' you'll know what version you flashed. It should say something like "mps0: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd". In this example, this is a friend's server with firmware 15 and driver 14 on FreeNAS 9.1.1. You're supposed to keep the two on the same version at all times.
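
If you'd rather not eyeball it, here is a minimal sketch that pulls the firmware and driver major versions out of dmesg text and flags a mismatch. It assumes the line looks like the example quoted above; real output may be formatted differently, and the function names are invented for illustration.

```python
import re

# Pattern modelled on the sample line quoted in this post; an assumption,
# not a guaranteed format for every mps driver release.
MPS_LINE = re.compile(r"(mps\d+): Firmware: (\d+)\.[\d.]+, Driver: (\d+)\.[\d.]+")


def mps_versions(dmesg_text):
    """Return a list of (controller, firmware_major, driver_major) tuples."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in MPS_LINE.finditer(dmesg_text)
    ]


def mismatched(dmesg_text):
    """Return only the controllers whose firmware and driver majors differ."""
    return [entry for entry in mps_versions(dmesg_text) if entry[1] != entry[2]]


sample = "mps0: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd"
```

Feeding it the saved output of 'dmesg | grep mps' gives an empty list from mismatched() when the two versions line up.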

It is also possible you messed up the reflashing somehow??? Maybe find the p14 firmware and use that since we've been stuck there for over a year and despite my ticket that is 7 months old asking for the update, there's no rush to update drivers.

I'd keep the latest BIOS and try the correct firmware.

Todd Marimon · Jan 1, 2014

OK, I flashed the correct p14 firmware now. It took a little bit of figuring out how to actually downgrade the firmware on the M1015. (For future googlers, run 'sas2flash -o -e 6' to erase the firmware, followed by your normal flash command, in my case 'sas2flash -o -f 2118it.bin -b mptsas2.rom')

Anyway, I rebooted, and created my volumes, then rebooted again... and STILL, the same problem.

This is rather baffling to me. This really seems like a FreeBSD issue to me. I checksummed the GPT data (pri and sec) just after creating my volumes, then rebooted and rechecksummed. The checksums are as follows:

Code:

before reboot:
Primary (first 16K):  2395c60bf610ac948503a83fa3a65db3
Secondary (last 16K): 86ba0a48e81a88190d06b51cbe092b37
after reboot:
Primary (first 16K):  2395c60bf610ac948503a83fa3a65db3
Secondary (last 16K): 20be3600e03f8797501e7d1f06ccc591 (Different??)

(The secondary changed, which might be due to an attempted recovery? It's just a back-up, so this shouldn't affect the GPT)
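
For anyone who wants to repeat this before/after comparison, here is a rough sketch of the idea: hash the first and last 16K of a disk (or a disk image) and compare the pairs across a reboot. The function name is made up, and it assumes that seeking to the end of the file reports its size. That holds for an image file but is not guaranteed for a raw device node on every platform.

```python
import hashlib
import os

REGION_BYTES = 16 * 1024  # same 16K window used for the checksums above


def md5_head_and_tail(path, region_bytes=REGION_BYTES):
    """Return (md5 of first region, md5 of last region) as hex strings.

    Hypothetical helper: 'path' may be a disk image or, where the platform
    allows seeking to the end of a device, a device node.
    """
    with open(path, "rb") as dev:
        head = dev.read(region_bytes)
        size = dev.seek(0, os.SEEK_END)        # position at end == total size
        dev.seek(max(size - region_bytes, 0))  # back up one region from the end
        tail = dev.read(region_bytes)
    return hashlib.md5(head).hexdigest(), hashlib.md5(tail).hexdigest()
```

Run it once after creating the volume and once after the reboot. If only the second value changes, you are seeing the same thing as in the checksums above.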

This is rather damning in my mind that FreeBSD is somehow messing up either prior to reboot or after reboot in how it is writing or interpreting (respectively) the GPT. I can't come up with any other explanation for why a reboot could affect how the drive comes back as "invalid". Is it possible there is some sort of write protection somewhere not allowing FreeBSD to write out to certain regions of the drive, which is causing problems for the GPT? Or could this be a sector size issue? Either way, I need suggestions on how it could be fixed. Could MBR tables be used??

Oh yea, Happy New Year!

cyberjock · Jan 1, 2014

Any chance we could do a team viewer session so I could check this out? If so pm me and we can setup a time. This is scaring me tbh.

Sent from my DROID BIONIC using Tapatalk

Todd Marimon · Jan 1, 2014

One new interesting tidbit... I just discovered that if I remove the drives (hot-unplug), then reboot the box, then hot-plug the drives again... the md5's come back identical after this. Then I just have to "Auto Import" them from the webUI, and there is no problem-- my data is all fine.

So, something about the shutdown process is messing with the GPT tables maybe?? This is highly suspect at least.
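
A changed md5 only says the bytes differ, not whether the header is still a valid GPT header. A minimal sketch of a more direct check, assuming the standard 92-byte GPT header layout (signature "EFI PART", header size at offset 12, CRC32 at offset 16 computed with that field zeroed): read the 512-byte sector at LBA 1 or at the last LBA and pass it in. The function name and result keys are hypothetical.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_BYTES = 92  # size of the standard GPT header


def check_gpt_header(sector):
    """Check signature and header CRC32 of one GPT header sector (bytes)."""
    if len(sector) < MIN_HEADER_BYTES:
        raise ValueError("need at least 92 bytes of header data")
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    my_lba, alternate_lba = struct.unpack_from("<QQ", sector, 24)
    result = {
        "signature_ok": sector[0:8] == GPT_SIGNATURE,
        "crc_ok": False,
        "my_lba": my_lba,                # LBA this header claims to live at
        "alternate_lba": alternate_lba,  # LBA of the other copy
    }
    if result["signature_ok"] and MIN_HEADER_BYTES <= header_size <= len(sector):
        # The CRC covers header_size bytes with the CRC field itself zeroed.
        zeroed = sector[:16] + b"\x00" * 4 + sector[20:header_size]
        result["crc_ok"] = (zlib.crc32(zeroed) & 0xFFFFFFFF) == stored_crc
    return result
```

If the backup header fails the signature or CRC check after a reboot but passes before it, something really did overwrite the last sector. If it still passes, the change is elsewhere in that 16K window.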

Todd Marimon · Jan 1, 2014

Piece of good news (for FreeNAS)... I was unable to replicate this in a VM on my workstation (I replicated things as best as I could, only virtual drives).

Another piece of good news... I've tried downgrading the BIOS on my motherboard to V2.02, and I don't seem to have any problems now. However, I also upgraded back to V3.00 and also am no longer having the problem. (The engineer in me wanted to verify downgrading is what fixed it).

I am now curious, if the very act of me hot swapping the drives in and out reset something with the LSI card to make it stop causing problems.

I'm truly baffled, but I don't seem to be having a problem now. I'll be playing even more with this, however, to gain confidence that this issue is now in the past. I'm probably going to reinstall ESXi and start from there again. I will report back here if I have any issues like this again.

Wish me luck!

Also, if anyone has theories on what the problem could have been, please post.

jgreco · Jan 1, 2014

No theories but glad to see you are not just pretending it's fixed. Understanding what blows stuff up is a key to reliability.

Todd Marimon · Jan 5, 2014

This is happening again. I have not yet figured out how to actually "fix it". It is a highly frustrating problem. It makes me very nervous to even consider trusting FreeNAS/ZFS with my data. Something is seriously amiss here. I'm really glad I've wasted my money on enterprise hardware and I can't even keep a filesystem through a reboot. (please disregard my frustrated tone... I really do want help with this problem)

Todd Marimon · Jan 5, 2014

For what it is worth, I'm running FreeNAS baremetal-- ESXi is not in the equation.
