Broadcast storm causes FreeNAS to reboot, ZFS volume now UNKNOWN

scurrier · Jul 25, 2015

While installing a new network switch today I accidentally created a loop. In the other room I hear my FreeNAS box reboot and it took me a moment to figure out what was going on. It might have rebooted more than once. So I fix the problem and then get into FreeNAS and what joys await me:

My volume "firstvol" is unavailable. I don't know what to do so for now I shut down the machine. Looking for any advice. I have a backup- replicating the important stuff on firstvol to a second ZFS volume, but not all of it. Hopefully that second volume is OK. I did have a third backup, but I keep it at work and so it is not as updated with pictures of my son, etc. I still hope to recover the un-backed-up stuff from firstvol if anyone can help me do it.

Full disclosure, da6 caused an email to me earlier today for its issues in this screenshot, seemingly unrelated to the network loop incident. da6 is not a part of firstvol though and wasn't even being used so I didn't worry about it right away. da0 is a part of firstvol so I am worried about it. It looks like all my disks still appear in the GUI "view disks".

Funny thing is, I never have caused a network loop in all my years of networking. Then, I buy my first smart switch which has features to prevent loops, but they don't come default enabled and then I go and cause a loop when I first plug it in. And then my monster FreeNAS box crashes and a volume gets damaged. Figures.

I didn't even know a broadcast storm could cause a machine to reboot. I guess experience really is what you get when you don't get what you want.

Any reason not to turn this machine back on and see if it magically sorts out its problems?

cyberjock · Jul 25, 2015

Well, if I were in your shoes I'd boot it up and see what the actual status is. The reboot could have created some problem that made things "not work right" when it booted.

So I'd power it on and see what it does. If you had RAIDZ1, this could be your sign that you are about to be yet another statistic in why RAIDZ1 is a terrible idea with regards to ZFS.

If the zpool doesn't mount, or you can't get a grasp of what is going on, I'd recommend you get a debug file (System -> Advanced -> Save Debug) and then shut the box back down until someone can tell you how to proceed.

cyberjock · Jul 25, 2015

I remember you had high values for UDMA_CRC and command timeouts on your server. Did you ever get those fixed? If not, that could be the source of your problems. UDMA_CRC and command timeouts are basically indicators that the disk isn't talking to the server particularly well. If that is the case, statistically you'll eventually be in a situation where multiple drives will have the same data corrupted, and so you'll have no way to correct it.

scurrier · Jul 25, 2015

Reboot didn't help. My volume is 3 stripes of 2 disks each, not RAIDZ1. I attached my debug info as requested.

I have a cold spare available, burnt in and everything so I can replace my bad one as soon as I figure out how to do that correctly.

About your comment of UDMA_CRC errors: I did have SMART attribute errors, so I replaced all of the (expensive) SATA cables with the ones that came with the supermicro stuff. At that time I took a snapshot of the SMART status and then later compared it and found that no further issues had occurred on all disks. It's something worth looking at again, though.

scurrier · Jul 25, 2015

Whoops, my volume is 2 stripes of 2 disks, not 3 stripes. The last other two disks are my backup mirror. Plus a 7th disk for cold spare.

cyberjock · Jul 25, 2015

So many things aren't making sense with this that I don't even know what the heck is going on with your server.

Can you post the output of "zpool import"?

Right now it looks like da4, da5, and da7 are unpartitioned. How they got that way totally escapes me right now.

Unless the zpool import provides some fruitful information, I don't even know what went wrong, but things are very fubared with your system.

I kind of get the idea that either:

1. You've suffered multiple failures simultaneously
2. You've suffered multiple undiagnosed failures over a period of time
3. You didn't do the initial setup of FreeNAS properly. For example, chose to create the zpool from the CLI instead of the WebGUI (but I know you didn't do that, its just an example).
4. You didn't setup FreeNAS properly to monitor and report problems, so you didn't know anything was wrong until it was too late.
5. Some combination of the above.

I can't even speculate on what it would be if you did #3 or #4, and what you might have done wrong because there's too many things broken to really understand what is going on. You had a good SMART testing schedule going. At least one disk got its schedule erased for some reason.

Without more info (and I don't even know what I'd ask for to be honest) there's no smoking gun I saw on what may have gone wrong.

fta · Jul 25, 2015

You da0 problems are probably because of this repeating error:

Code:

Jul 25 03:07:16 thumper (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 b5 5b f9 98 00 00 01 00 00 00
Jul 25 03:07:16 thumper (da0:mps0:0:0:0): CAM status: SCSI Status Error
Jul 25 03:07:16 thumper (da0:mps0:0:0:0): SCSI status: Check Condition
Jul 25 03:07:16 thumper (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul 25 03:07:16 thumper (da0:mps0:0:0:0): Info: 0x1b55bfa30
Jul 25 03:07:16 thumper (da0:mps0:0:0:0): Error 5, Unretryable error

fta · Jul 25, 2015

If I've read you correctly, da0/da1 are a mirror making up firstvol. If so, you might try shutting down and pulling da0. Then boot up and see if you can access firstvol.

scurrier · Jul 25, 2015

cyberjock said:
Can you post the output of "zpool import"?

Thanks for helping me with this!!!
Here's the output:

Code:

[root@thumper] ~# zpool import
   pool: firstvol
     id: 15330060918112936227
  state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        firstvol                                        ONLINE
          mirror-0                                      ONLINE
            gptid/59d65f8b-c136-11e3-b86a-002590f06808  ONLINE
            gptid/5a40f730-c136-11e3-b86a-002590f06808  ONLINE
          mirror-1                                      ONLINE
            gptid/94d0c36e-c136-11e3-b86a-002590f06808  ONLINE
            gptid/95397879-c136-11e3-b86a-002590f06808  ONLINE

cyberjock said:
Right now it looks like da4, da5, and da7 are unpartitioned.

da7 is my cold spare. Yeah, I kept it installed for no good reason. I'm really only concerned about the rest of the 4TB Seagate drives, whose serials all start with "Z". Any other data is not important much at all to me because it was just the last month of security cam recordings.

cyberjock said:
1. You've suffered multiple failures simultaneously
2. You've suffered multiple undiagnosed failures over a period of time
3. You didn't do the initial setup of FreeNAS properly. For example, chose to create the zpool from the CLI instead of the WebGUI (but I know you didn't do that, its just an example).
4. You didn't setup FreeNAS properly to monitor and report problems, so you didn't know anything was wrong until it was too late.
5. Some combination of the above.

My responses:
2. This is certainly possible esp. given the SMART history. I could have drawn some wrong conclusions from my previous investigation, fix, and follow-up on that.
3. It's a fair possibility to suggest, but I doubt that I made any major newbie mistakes. I read probably combined 70 hours of forums and guides. I watched the Ben Rockwell videos and others on youtube. I followed what I learned religiously. Not saying that mistakes aren't possible, just saying that I probably did not make a major newbie mistake. (It feels weird to be boasting this because obviously something majorly wrong happened here.)
Specifically addressing your example setup fail: I did not create the zpool from the CLI, I made it from the GUI. I have not done any messing around "behind FreeNAS's back" in the CLI, mostly because I don't want a non-standard config and because I can't afford the time to do that messing around confidently. I'm not in this for a hacking project, I'm in this for stable storage.
4. I did all the suggested monitoring/reporting practices that I found. I got emails reliably and had SMART testing enabled on disks I cared about.

Since my pool appears to be importable from above, should I try to import it? If so, what is the best way to do that?

scurrier · Jul 25, 2015

Here's the SMART attributes on da0, the disk that appears to have caused firstvol's issues.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 093 092 006 Pre-fail Always - 30169400
3 Spin_Up_Time 0x0003 091 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 29
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 72860196
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 11820
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 29
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 028 028 000 Old_age Always - 72
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 428
190 Airflow_Temperature_Cel 0x0022 077 050 045 Old_age Always - 23 (Min/Max 20/23)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 29
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 29
194 Temperature_Celsius 0x0022 023 050 000 Old_age Always - 23 (0 16 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

Please note that this was NOT one of the disks with the UDMA_CRC errors.

BigDave · Jul 25, 2015

scurrier said:
I got emails reliably and had SMART testing enabled on disks I cared about.

Care about them or not, you should set up smart tests (both long and short) for all your drives. Knowledge is power...

Since my pool appears to be importable from above, should I try to import it? If so, what is the best way to do that?

See this section of the 9.3 manual, or the manual for the version of FreeNAS you are running.

Ericloewe · Jul 25, 2015

scurrier said:
Thanks for helping me with this!!!
Here's the output:

Code:
[root@thumper] ~# zpool import pool: firstvol id: 15330060918112936227 state: ONLINE action: The pool can be imported using its name or numeric identifier. config: firstvol ONLINE mirror-0 ONLINE gptid/59d65f8b-c136-11e3-b86a-002590f06808 ONLINE gptid/5a40f730-c136-11e3-b86a-002590f06808 ONLINE mirror-1 ONLINE gptid/94d0c36e-c136-11e3-b86a-002590f06808 ONLINE gptid/95397879-c136-11e3-b86a-002590f06808 ONLINE

da7 is my cold spare. Yeah, I kept it installed for no good reason. I'm really only concerned about the rest of the 4TB Seagate drives, whose serials all start with "Z". Any other data is not important much at all to me because it was just the last month of security cam recordings.

My responses:
2. This is certainly possible esp. given the SMART history. I could have drawn some wrong conclusions from my previous investigation, fix, and follow-up on that.
3. It's a fair possibility to suggest, but I doubt that I made any major newbie mistakes. I read probably combined 70 hours of forums and guides. I watched the Ben Rockwell videos and others on youtube. I followed what I learned religiously. Not saying that mistakes aren't possible, just saying that I probably did not make a major newbie mistake. (It feels weird to be boasting this because obviously something majorly wrong happened here.)
Specifically addressing your example setup fail: I did not create the zpool from the CLI, I made it from the GUI. I have not done any messing around "behind FreeNAS's back" in the CLI, mostly because I don't want a non-standard config and because I can't afford the time to do that messing around confidently. I'm not in this for a hacking project, I'm in this for stable storage.
4. I did all the suggested monitoring/reporting practices that I found. I got emails reliably and had SMART testing enabled on disks I cared about.

Since my pool appears to be importable from above, should I try to import it? If so, what is the best way to do that?

And what does zpool import firstvol say?

scurrier · Jul 25, 2015

BigDave said:
Care about them or not, you should set up smart tests (both long and short) for all your drives. Knowledge is power...

See this section of the 9.3 manual, or the manual for the version of FreeNAS you are running.

I guess I meant since it is screwed up is there anything special I need to do? The manual is for a clean import.

BigDave · Jul 25, 2015

Ericloewe said:
And what does zpool import firstvol say?

+1 (Do this first for sure)

scurrier said:
I guess I meant since it is screwed up is there anything special I need to do? The manual is for a clean import.

Others should correct me on this if I'm wrong, but afaik, if your pool is offline, attempting an import will either suceed or fail, but shoud not result in
any damage to the data (beyond what has already occured, if any).
Keep in mind, if ada0 is failed or failing, and the pool imports sucessfully, it may be in a *degraded* state. At which point, you would refer to the directions
in the maual for replacing a failed drive.

scurrier · Jul 25, 2015

OK. I am copying some stuff off of the local replication backup first and then I will try zpool import firstvol.

scurrier · Jul 25, 2015

Code:

[root@thumper] ~# zpool import firstvol
cannot import 'firstvol': one or more devices is currently unavailable

Robert Trevellyan · Jul 25, 2015

cyberjock said:
You had a good SMART testing schedule going. At least one disk got its schedule erased for some reason.

I bet that disk was replaced. The SMART scheduling GUI is pretty non-intuitive in that regard (lists disks by device name but tracks them by ID).

cyberjock · Jul 25, 2015

BigDave said:
Others should correct me on this if I'm wrong, but afaik, if your pool is offline, attempting an import will either suceed or fail, but shoud not result in
any damage to the data (beyond what has already occured, if any).

You are 1/2 true. Attempting to import a zpool will either succeed or fail. But you can actually damage your zpool by importing it. Of course, for most people there's little choice, but it shouldn't be taken lightly when you know there are problems. Especially since people read up on various parameters of the zpool import command and then do things like try to force mount, try to rollback transactions, etc. Those are extremely risky and shouldn't be done "just because".

Anyway, back to the problem at hand. If zpool import shows an ONLINE zpool, but when you try to import it then it says disks are unavailable, that normally means that there is a disparity between the different disks. So any disks that are failing/failed should be removed, if possible, and then try to import with the remaining good disks.

scurrier · Jul 26, 2015

Are we sure that I should remove the failed disk? Seems like it would be better to install and rebuild with the spare first, if this is possible, so that any good data on the bad disk can be used to benefit the rebuild. I feel like removing the disk without involving zfs is like going behind its back.

scurrier · Jul 26, 2015

For example, this link is making me believe that since my pool state is "ONLINE" and all my disks are "ONLINE" that I should be able to replace the faulted disk without removing it.
http://docs.oracle.com/cd/E19253-01/819-5461/gbcfb/index.html

Important Announcement for the TrueNAS Community.

Broadcast storm causes FreeNAS to reboot, ZFS volume now UNKNOWN

Patron

Attachments

Inactive Account

Inactive Account

Patron

Attachments

Patron

Inactive Account

Contributor

Contributor

Patron

Patron

FreeNAS Enthusiast

Server Wrangler

Patron

FreeNAS Enthusiast

Patron

Patron

Pony Wrangler

Inactive Account

Patron

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Broadcast storm causes FreeNAS to reboot, ZFS volume now UNKNOWN"

Similar threads