Please help! Started up FreeNAS, suddenly volume storage (ZFS) status unknown?!

Status
Not open for further replies.

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
Nope. It'll be done when it's done. :P

As I said above... if you believe in a God, you'd better be praying to him/her/it. If this doesn't work you are screwed. If it does work, back up your data, then do a scrub. In that order.

Check out someone else who had the same problem... he did pretty much exactly what we did:

http://blog.solori.net/2010/07/15/zfs-pool-import-fails-after-power-outage/

Notice his equations at the end of the article:

Redundant Power + UPS + Generator = Protected; Anything else = Risk
SAS/RAID Controller + Cache + BBU = Fast; SAS/RAID Controller + Cache – BBU = Train Wreck

Yeah sorry, I read the rest after I ran the command and posted it was running.

Well I don't believe in Gods...
I really find it stupid that FreeNAS / ZFS (which should be the "safest" filesystem) can still cause this. I had power outages with Windows and RAID 0 and I never experienced anything like this (as in complete data loss). Luck? Maybe.

I spent a lot of money on this NAS machine, and chose RAIDZ2 for maximum safety. Now I have this and probably everything is gone.
Oh well... Not only are movies, series, etc. gone, but also my websites, my Photoshop images, my PC backups, and scanned invoices from the past 5+ years.

Might as well shoot myself in the head if I don't get anything back.
Really starting to wonder if I shouldn't have stuck with Windows after this experience.

And I am still not sure if the power outage caused this!

Anyways, thank you for your help so far (and the rest of you guys). I will report back when something changes at the prompt.
As you mentioned, it could take a few hours.

//edit

Yeah, I found that article earlier this morning while doing my own searches on the I/O error. Wasn't very happy after reading it. :(
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Yes, but you didn't account for a loss of power. NTFS is horrible with power loss. I've seen complete data loss from power loss.

This isn't a ZFS problem; it's a problem with all file systems. File systems are living documents, and if you lose power during a partial write there is no way to determine what was and wasn't written. ZFS is lucky to have the transactional rollback you are using right now, but there are no promises with it.

The real (major) oversight was running a server with no UPS. That was a big mistake, and it's made regularly despite the manual recommending one, my FreeNAS guide for noobs mentioning it, it having been a common "lesson learned" for decades that servers should have a UPS at minimum, and the forum having plenty of people who have lost data because of no UPS.

Honestly, there is no doubt in my mind at all: only a power outage or hardware failure would cause this kind of corruption. Since no hardware failure was found, that rather limits the options.

With NTFS, on a loss of power it simply plays back its "log" and removes the corruption (read: removes the corrupted files... which may be ALL files). This happens in the background, and I consider it a very, very stupid way to design Windows. The system admin SHOULD be completely aware of anything going on. Any file corruption is silent, and you have no way to identify which files might be corrupted.

With ZFS, on any error significant enough that the file system can't determine the correct action with 100% certainty (i.e. potential data loss), the zpool will not mount. It leaves the admin to decide the best course of action (restore from backups, attempt recovery, etc.). Those options aren't options at all with NTFS: the OS has decided for you that it will "fix" the issue, and you just have to accept it.
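[Editor's note: the "attempt recovery" option described above maps onto `zpool import`'s rewind flags, which this thread uses later. This is only a sketch of the usual escalation, printed with echo for illustration rather than run; the pool name comes from this thread.]

```shell
# Sketch: the usual escalation of zpool import recovery attempts.
# -n makes -F a dry run (report only); -X (extreme rewind) is a last resort.
pool=storage
for flags in "-nfF" "-fF" "-X -fF"; do
  echo "zpool import $flags $pool"
done
```

Each step discards progressively more of the most recent transactions, which is why `-X` can recover a pool at the cost of the latest writes.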

So if you want the "idiot-proof" setup then stick with Windows, but it comes with the risk of data corruption you won't know about until you no longer have backups with the data. But if you want to be actively involved in the choices, then ZFS is far superior. I understand why you feel like ZFS is failing you, but really, you are used to letting Windows decide for you, and now that ZFS isn't, you are upset.

So no, don't shoot yourself for choosing FreeNAS over Windows. Shoot yourself in the foot for not buying a UPS.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176

I checked my router logs to see when the NAS was online, and it was indeed shut down hours BEFORE the power outage. So it wasn't caused by the power outage, and a UPS wouldn't have helped here.
So what could be the cause now...?

Anyways, let's hope for the best while FreeNAS is still busy running the import -X.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Honestly, since you did the RAM test and disk test, I'm convinced it was a power outage (edit: or improper shutdown). In fact, when I asked the question:

4. Did you have any issues with the machine before this? Did you do any upgrades or change anything?

I was fully expecting you to say you unplugged it or it lost power somehow.


I'm not really sure how your router proved the system was offline. My FN server only gets a new IP address every 12 hours, and in between the router won't tell me whether it's actually online unless I set up a script to ping the server. Your average store-bought router can't do that, but my pfSense box does.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176

I am running Tomato firmware from Toastman on my Asus RT-N66 router with some simple mods.
Anyways, I will wait for the result (if any) from the import. It's currently still busy. I don't care how long it takes, but I do hope my data is safe, or at least a big portion of it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I just went searching through the forums for other people who have had to run the same commands. The most common causes of failure are:

1. No UPS and a loss of power (or write cache enabled on a hardware RAID controller with no BBU, plus a loss of power)
2. RAM went bad
3. Improper shutdown/freezing/kernel panics
4. User error (for instance, pulling the wrong hard drive and losing all redundancy plus one more disk, or adding a single disk to a RAIDZ2 and having that single disk fail)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Anyways, I will wait for the result (if any) from the import. It's currently still busy. I don't care how long it takes, but I do hope my data is safe, or at least a big portion of it.

From what I've read, -X (assuming you do scrubs regularly) will typically recover all of your data minus some of the very latest writes.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
From what I've read, -X (assuming you do scrubs regularly) will typically recover all of your data minus some of the very latest writes.

Okay, that would be good news. Well, I don't do scrubs on a daily or weekly basis; the last scrub was 2 or 3 weeks ago, I guess...
But even if everything is only safe up to that date, it would mean the world to me.

Thanks for the heads up. You shed some positive light on it for me.

Still busy with -X on the NAS.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I do scrubs religiously every 14 days: the 1st and 15th of the month.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
If you know how long a regular scrub takes, you should expect this to take approximately the same amount of time.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
If you know how long a regular scrub takes, you should expect this to take approximately the same amount of time.

It's still busy at the moment... Scrubs never took this long.
But I am patient. I can wait all night, heck, I can even wait all of Sunday, if I get some stuff back.

I check the GUI from time to time to see if the CPU load changes. So far it's not hanging or anything. Guess it really takes a long time.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
Well after several hours I got this message:

Code:
[root@freenas] ~# zpool import -X -fF 17472259698871586545
cannot import 'storage': one or more devices is currently unavailable


Now what? Sigh.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
That's all; I have no other good ideas.

You could try removing one of the drives, since you have redundancy... if one disk is somehow responsible (and I don't think that's possible) you may be able to get the array online.
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I'm on my way out the door, but I wouldn't remove any disks, it already thinks one is missing.

Post another "camcontrol devlist".

If that looks ok, I'd run that last import command again.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
You can check whether the system thinks a disk is missing by running zpool import. It will list the current status of the drives. If they all still say ONLINE, then you definitely have an issue. It is possible you have multiple failing disks, exceeding what the redundancy can recover, but it's odd that they'd all show ONLINE as before yet still not work.

Remember that the first error you got was an I/O error...

- - - Updated - - -

ZFS has checksums and otherwise duplicated information that make it incredibly difficult to end up with corruption without one or more hardware failures. The fact that the failed import contradicts the reported disk status makes me really wonder how you could reach this level of corruption without a loss of power or a hardware failure. If you are 100% sure there was no loss of power, then I have to say there must have been a hardware failure, and I'd be very concerned about it happening again until you find and fix the issue.

I googled a little more, and removing a disk is a waste of time. My thought process was that if one disk is so screwed up that it's causing confusion, then removing that one disk might fix the issue. But I don't consider it a real solution at all, since every piece of data is backed up on either the same disk or another disk in the pool via redundancy or checksums. You'd need multiple disks with corrupted data, which has seemed to be the problem the whole time.

I might try zpool import -X -fF 17472259698871586545 once or twice more. If that doesn't work and you have another machine, then perhaps try running it on a different system temporarily to see if that matters. I'd only try that because I'm out of ideas. Hardware failure doesn't seem to be the problem, but then we'd be saying that your system just died for no reason whatsoever.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
Code:
[root@freenas] ~# camcontrol devlist
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus1 target 0 lun 0 (pass1,ada1)
<WDC WD20EARX-008FB0 51.0AB51>     at scbus2 target 0 lun 0 (pass2,ada2)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus3 target 0 lun 0 (pass3,ada3)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus4 target 0 lun 0 (pass4,ada4)
<WDC WD20EARX-00PASB0 51.0AB51>    at scbus5 target 0 lun 0 (pass5,ada5)
<OCZ RALLY2 0.00>                  at scbus9 target 0 lun 0 (pass6,da0)


Code:
[root@freenas] ~# zpool import
   pool: storage
     id: 17472259698871586545
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        storage                                         ONLINE
          raidz2-0                                      ONLINE
            gptid/19177fb9-25fa-11e2-9ab0-00151736994a  ONLINE
            gptid/19b5ec3a-25fa-11e2-9ab0-00151736994a  ONLINE
            gptid/3dc2f956-3de6-11e2-8af1-00151736994a  ONLINE
            gptid/1aefa3e9-25fa-11e2-9ab0-00151736994a  ONLINE
            gptid/1b8f2b64-25fa-11e2-9ab0-00151736994a  ONLINE
            gptid/1c2d6a74-25fa-11e2-9ab0-00151736994a  ONLINE


Seems all normal...? :(

Anyways, I am off to bed for now, it's been a very long and bad day...

/update

Code:
[root@freenas] ~# zpool import -fF 17472259698871586545
cannot import 'storage': I/O error
        Destroy and re-create the pool from
        a backup source.
[root@freenas] ~# zpool import -nfF 17472259698871586545
[root@freenas] ~# 
 

ProtoSD

MVP
Joined
Jul 1, 2011
Messages
3,348
I might try zpool import -X -fF 17472259698871586545 once or twice more. If that doesn't work and you have another machine, then perhaps try running it on a different system temporarily to see if that matters. I'd only try that because I'm out of ideas. Hardware failure doesn't seem to be the problem, but then we'd be saying that your system just died for no reason whatsoever.

I think it could still be a hardware problem, possibly the disk controller. It seems like that last import was working, and the load from the scrub it was doing may have caused it to fail or glitch again. It certainly seems like all the disks are online. I can't look back at the previous posts right now, but were any long SMART tests done? Maybe a drive is getting too hot, or maybe it's a bad cable/connection.

I think it *might* be worth getting a new controller, or putting the disks in another computer and trying that last import command again.

HHawk, I know it seems pretty discouraging, but there are still some options, depending on how much time you have and what you're willing to try. I know it's stressful, but don't give up yet; just keep being patient.

I'm hoping PaleoN will pop in here with one of his rabbit out of the hat ZFS tricks :D
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
I'm hoping PaleoN will pop in here with one of his rabbit out of the hat ZFS tricks :D
No rabbits this time.

I think it still could be a hardware problem, possibly the disk controller.
This was my exact thinking. Currently I view this as more likely than 3 or more disks being sufficiently messed up yet still appearing to have valid labels.

HHawk, I know it seems pretty discouraging, but there are still some options depending on how much time you have or what you're willing to try. I know it's stressful, but don't give up yet, just keep having patience.
There are some additional informational commands to run and a few other import commands to try before even moving the disks. It's not time to give up yet.

HHawk, I will do a more complete post later with the commands I want you to try.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
First, let's try a read-only import, which will likely fail:
Code:
zpool import -f -R /mnt -o rdonly=on storage



I can't look back at the previous posts right now, but was there any long SMART tests done? Maybe one is getting too hot, or maybe it's a bad cable/connection.
Yes, let's do this next:
Code:
smartctl -t long /dev/adaX
With 2TB drives it will take a few hours, but you can run all of them in parallel. When they are finished, post the updated results:
Code:
smartctl -q noserial -a /dev/adaX
Between the aborted import and the long tests, the disks should have had plenty of opportunity to show us any problems.
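[Editor's note: since there are six drives (ada0–ada5, per the camcontrol output earlier in the thread), a loop saves typing. `smartctl -t long` returns immediately after queuing the self-test, so the tests effectively run in parallel on the drives; the sketch below only prints the commands for illustration.]

```shell
# Sketch: generate the long SMART self-test commands for all six drives.
# Drive names ada0-ada5 assumed from the camcontrol devlist in this thread.
for d in ada0 ada1 ada2 ada3 ada4 ada5; do
  echo "smartctl -t long /dev/$d"
done
```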


See if this shows us anything different:
Code:
zdb -e storage


Then I would like to see the labels & uberblocks as they are on the actual disks:
Code:
mkdir /var/zfs
cd /var/zfs
zdb -lu /dev/ada0p2 > ada0.uber
zdb -lu /dev/ada1p2 > ada1.uber
zdb -lu /dev/ada2p2 > ada2.uber
zdb -lu /dev/ada3p2 > ada3.uber
zdb -lu /dev/ada4p2 > ada4.uber
zdb -lu /dev/ada5p2 > ada5.uber
Copy off all the *.uber files via scp. The output is quite lengthy, so I suggest you use e.g. pastebin.com or something similar.
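[Editor's note: the six zdb dumps above can be written as a loop; the sketch below only prints the commands for illustration, with the p2 partition suffix taken from the commands in this post.]

```shell
# Sketch: generate the zdb label/uberblock dump commands for ada0-ada5,
# each redirecting into /var/zfs/<drive>.uber as in the post above.
for d in ada0 ada1 ada2 ada3 ada4 ada5; do
  echo "zdb -lu /dev/${d}p2 > /var/zfs/${d}.uber"
done
```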


Last, fire up mfsBSD (USB memstick: amd64/9/mfsbsd-9.1-RELEASE-amd64.img) and try these two commands:
Code:
zpool import

zpool import storage
I don't actually expect any different behavior, but I've found mfsBSD useful in the past. The environment is different and I like the confirmation.
 

HHawk

Contributor
Joined
Jun 8, 2011
Messages
176
Thank you guys for responding.

It's currently still running the last import command (the one with -X), so I cannot do anything at the moment.
Replacing the controller will be difficult, as it's onboard (ATI controller via the South Bridge). The motherboard I use is a Gigabyte GA-880GA-UD3H.

Anyways, as soon as the current import -X is done, I will try those commands.

//update

Well, the import command hasn't done anything; it said it failed to import or something and that I should check zpool status.
So I am now going to run those commands you gave me.
 