FreeNAS 9.2 w/ZFS Crash

Status
Not open for further replies.

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14
Hello all,

I have an x64 Dell PowerEdge 2950 III Quad Core (PE2950 III) with:
dual Intel Xeon E5450 3.0GHz/12M/1333 4C 80W CPUs,
8 x 2GB PC2-5300 DDR2 ECC Fully Buffered DIMMs = 16GB,
6 x Western Digital Red NAS WD20EFRX 2TB IntelliPower 64MB cache SATA 6.0Gb/s 3.5" drives, and
an IBM ServeRAID M1015/LSI SAS9220-8i PCI-E SAS+SATA 46C8933 (half height/low profile) flashed to IT mode with 9211_8i_Package_P17_IR_IT_Firmware_BIOS_for_MSDOS.

I have two systems called NAS1 and NAS2
NAS1 is production and replicates snapshots to NAS2 over a wireless bridge at 6:00 AM every day. Both systems went online in September 2013 with version 9.1.1 and were upgraded to 9.2.0 on January 8th, 2014. Until yesterday there had been no problems. NAS1 crashed sometime Tuesday night into Wednesday morning; a simple reboot resolved the issue. It then spontaneously rebooted three more times on Wednesday. I decided to scrub the zpool to see if that would make it stable. It crashed again overnight, presumably while scrubbing, and now it hangs while trying to mount the zpool with the message:
Mounting local file systems :

At this point the console is unresponsive; input like Ctrl-T, F1, etc. does nothing. I can boot if I disconnect the drives from the HBA.
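From what I've read, one way to poke at the pool without the boot-time mount hanging would be from a FreeBSD live CD shell, roughly like this (untested on my part, flags as I understand them):

# list pools the system can see without importing anything
zpool import
# then attempt a read-only import so nothing gets written to the pool
zpool import -o readonly=on -f lun0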

The file system in question is a ZFS RAIDZ2 pool (raidz2-0) called lun0, made up of six 2TB drives.

The question is how to proceed: how do I get the box operational and discover the root cause of the crashes in the first place? All my data is available on NAS2 and elsewhere; it will just take a large amount of time to restore it to NAS1 if required. FYI, NAS2 appears to be functioning normally, but it never has any load placed directly on it the way NAS1 does.
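For context, the nightly replication is just the standard snapshot send/receive, so pulling data back from NAS2 should be the same thing in reverse, something along these lines (dataset and snapshot names here are placeholders):

# on NAS2: send the latest replicated snapshot back to NAS1
zfs send lun0/data@auto-20140115 | ssh nas1 zfs receive -F lun0/data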

Thanks for any help,

Tim
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Firstly: Your pool is probably done for good. I say probably because there are other problems on the server that could either have trashed ZFS or simply be making the pool look worse off than it is.

Second: Your server doesn't appear to officially support FreeBSD. That's usually a big red flag that things could go badly at any time. Not always, but almost always.

Third: You should have flashed it with the P14 firmware, since the driver in FreeNAS is P14. That's probably not your problem, but it's something I'd point out.
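You can check the mismatch yourself: the mps driver logs its version at boot, and LSI's flash utility will report what's on the card (this assumes the sas2flash utility is available; exact output varies by version):

# driver and firmware versions as FreeBSD sees them
dmesg | grep -i mps
# firmware/BIOS versions reported by the controller itself
sas2flash -listall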

Now, spontaneous reboots are usually caused by failing hardware or hardware that is incompatible with your OS. Considering your server doesn't appear to officially support FreeBSD, I think this is a given. You definitely need to determine whether or not your hardware is failing. We've had a lot of people with Dell hardware lately whose boxes won't even boot FreeNAS/FreeBSD. You could grab a FreeBSD boot disk and see if it even boots. If it doesn't boot up properly, that would also help prove that your hardware isn't a good fit.
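Beyond the boot disk test, the usual first passes are an overnight memtest86+ run and a SMART health check on each disk, something like this (device names will differ on your box):

# quick SMART health check across the six pool disks
for d in da0 da1 da2 da3 da4 da5; do
  smartctl -H /dev/$d
done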

Overall, if uptime is a big deal right now, I'd just build a white-box server and move your drives to the new system. If uptime isn't a big deal, then you can try to diagnose the problem on your own. But getting help with it is going to be difficult, because it's often hard or impossible to definitively prove bad hardware short of yanking everything out and slowly adding hardware back to the system.
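The nice part is that the pool metadata travels with the disks, so moving them to a new box is just an export and an import (pool name from your post):

# on the old system, if it will stay up long enough
zpool export lun0
# on the new system
zpool import lun0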
 

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14

Thanks for the quick reply!
I have some time since NAS2 is working properly, so I'd like to troubleshoot NAS1. I agree the hardware is not 100% compatible, but it has run fine for 4 1/2 months, so I'm leaning toward a failing piece of hardware or an issue between my hardware and 9.2.0. Any suggestions for getting the pool back online?

Thanks
 

Hexland

Contributor
Joined
Jan 17, 2012
Messages
110
I have almost exactly the same hardware setup, and 9.2 is showing very similar symptoms.
I also have 2 x 80GB Intel SSDs attached via the internal motherboard SATA, and 6 x 3TB WD drives via the IBM M1015 (IT mode).

9.1 works fine; 9.2 spontaneously reboots when trying to write to the zpool on the M1015 controller.
Writing to the SSDs, on the other hand, works just fine.
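The reboot is easy to reproduce with a plain sequential write; I've been testing with something like this (mount points are placeholders for my layout):

# ~1GB write to the pool on the M1015 -- this is what reboots the box on 9.2
dd if=/dev/zero of=/mnt/tank/testfile bs=1m count=1024
# the same write against the SSDs completes normally
dd if=/dev/zero of=/mnt/ssd/testfile bs=1m count=1024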

I'm going to try putting the Dell PERC 5 controller back in tonight and swapping the 6 x WD drives over, to confirm that it is an issue with the M1015 driver.

There's another thread in the Help/Storage forum that is also seeing problems in 9.2 with LSI-based controllers.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah... I can't really provide any advice for mounting the pool at this time without doing a TeamViewer session, and I'm currently taking a vacation from TeamViewing with people because it's a big time sink and, as a volunteer, it's not worth the time invested.
 

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14
I understand regarding TeamViewer... seems like I spend a large part of my off time on it helping family and friends. I set up and partitioned (with gparted) a 4TB external USB drive and am using cp to get a copy of the snapshot files off of NAS2. I am doing a fresh install of 9.1.1 on NAS1 and restoring the settings from the last 9.1.1 config file. I will then copy the files back over from the external drive and see how stable the system is. I should be up and running in about 14 more hours. I plan to wait a while on 9.2.0 to see if anything comes of bug 3882.
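The copy itself is nothing fancy, roughly this (device node and filesystem type are guesses for the USB drive):

# mount the freshly partitioned USB drive, then pull the data off NAS2
mount /dev/da6p1 /mnt/usb
cp -Rpv /mnt/lun0/data /mnt/usb/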

Thanks,

Tim
 

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14
After doing a fresh install of 9.1.1 and restoring the config file, the ZFS pool lun0 automatically mounted with no errors and all data appears to be there... Of course, I'm not sure of the data integrity, so I will continue to recover data from the snapshots on NAS2. I think I'll scrub lun0 on NAS1 tonight just to see what happens.
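For anyone following along, that's just:

zpool scrub lun0
# then watch progress with
zpool status -v lun0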
 

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14
Forgot to mention: I have turned off replication until I get all the data I need from NAS2.
 

tslewis99

Dabbler
Joined
Oct 6, 2013
Messages
14
I noticed a lot of disk activity in the rack so I decided to check the zpool status before scrubbing:
[root@nas1] ~# zpool status
  pool: lun0
 state: ONLINE
  scan: scrub in progress since Wed Jan 15 23:34:04 2014
        651G scanned out of 2.64T at 243M/s, 2h24m to go
        0 repaired, 24.11% done

It resumed the manual scrub I started last night under 9.2.0, with no user intervention.

Just FYI, this looks more and more like some sort of 9.2.0 issue.
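I'll let it finish, but for reference, a running scrub can be cancelled with:

zpool scrub -s lun0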
 