FreeNAS 9.3 (stable) kernel panic when disk fails

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
I have a Dell 2950 server: dual Xeon, 32GB FB-DIMM RAM, 8x 146GB 10K SAS disks connected via a PERC 6/i. I am running the latest system BIOS and PERC 6/i firmware. This is the second FreeNAS box I am building, and I am running into a weird problem.

I create eight single-disk RAID 0 volumes with mfiutil and correspondingly see eight /dev/mfidX devices. Then I create a RAIDZ2 pool from the eight disks. Next I create a dataset and share it out with CIFS. Then I start a large file transfer to the array and max out my gigabit link. Finally, I use mfiutil to fail a disk. The system KERNEL PANICS.
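For reference, roughly what I'm doing (drive IDs and the pool name are just examples, and the pool/dataset are actually created through the GUI, but this is the sequence):

    # one single-drive RAID 0 volume per physical disk (drive IDs 0-7 are examples)
    mfiutil create jbod 0 1 2 3 4 5 6 7
    mfiutil show volumes                  # should list eight volumes -> /dev/mfid0 .. /dev/mfid7

    # RAIDZ2 pool and dataset ("tank" is an example name)
    zpool create tank raidz2 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7
    zfs create tank/cifsshare             # dataset that gets shared out over CIFS

    # with a large CIFS transfer running, mark one drive as failed
    mfiutil fail 7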

I can repeat the above steps when no data transfer is occurring, and the system handles it perfectly fine. I can even replace the failed disk with a new one and rebuild the pool with no problems.

This is NO GOOD. My last FreeNAS build was a generic Supermicro box with some sort of LSI card that supported IT mode. I could physically pull two disks from a RAIDZ2 volume during a write. Granted, the system hung for a second, but otherwise it remained connected and usable with a degraded volume.

I plan on using this Dell box as iSCSI storage for VMware. If this box reboots, my ESX hosts connected to the iSCSI LUN will also hang and take my cluster down.

I have seen WAY TOO MANY posts about rebooting being normal when a disk fails... This is not an option.

Is there a bug in the mfi driver where high throughput + a disk failure == panic?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Did you read the forum posts about PERC controllers? When you created eight RAID 0 volumes you put ZFS on top of hardware RAID. This is a no-no, and I can't tell you how many people I've personally talked to on the phone who have lost their pools with that configuration.
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
Using the PERC 5/i and testing drive failures while sustaining gigabit network throughput to the RAIDZ2, I was able to pull disks during a write to simulate failure, replace them with two new disks, and rebuild the pool. I'm happy with the setup for now.
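For anyone following along, the replacement went roughly like this from the shell for each failed disk (device and drive numbers are examples; the FreeNAS GUI replace workflow does the equivalent):

    zpool status tank                  # pool reports DEGRADED, failed mfid device listed
    mfiutil create jbod 8              # new single-drive RAID 0 for the replacement disk (example ID)
    zpool replace tank mfid6 mfid8     # resilver the new device in place of the failed one
    zpool status tank                  # watch the resilver until the pool is ONLINE again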

I agree that the RAID 0 layer between the OS and the disks is bad business for SMART, but with Dell server hardware and its SAS/SATA backplane, my options for controllers are quite limited. The rest of the PCIe slots in the server are filled.

When I performed the testing, I was just happy that I could fail a ZFS disk and replace it without losing the pool or needing to reboot the server. For everyone else who thinks rebooting is part of a drive replacement... I guess we drink different Kool-Aid.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Rebooting or not isn't really the issue. My honest question is how do you plan to monitor the drives?
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
Rebooting or not isn't really the issue. My honest question is how do you plan to monitor the drives?

Hence the testing. I have FreeNAS tied to a Google Apps account for email alerts. Upon typing "mfiutil fail 7" at the CLI, I got a "degraded" email within 30 seconds. For good measure I also issued "mfiutil fail 6". No new email for the loss of the second drive... :(
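About the most monitoring the controller will give you is polling it directly, e.g. (exact output depends on the firmware):

    mfiutil show drives      # per-drive state: ONLINE, FAILED, REBUILD, ...
    mfiutil show volumes     # volume state: OPTIMAL vs DEGRADED
    mfiutil show events      # controller event log, including drive failures

That could be wrapped in a cron job that emails when anything isn't ONLINE/OPTIMAL.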

So, no SMART warnings for me... If I ever need to rebuild this box, I'll find a way to stuff a proper HBA in place of the PERC... It's a standard-size PCIe x8 card for the most part...
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
I'm more curious whether anyone has tried to use an LSI 1078 chipset with FreeNAS... Failing a disk should NOT cause a kernel panic.

(back on topic)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm more curious whether anyone has tried to use an LSI 1078 chipset with FreeNAS... Failing a disk should NOT cause a kernel panic.

(back on topic)
You are 100% right, it shouldn't. There are two problems (one of them is the insistence on using hardware RAID). I have no doubt the user will later be complaining about a corrupt pool without explanation, and the answer will be 'restore from backup; you can't access the pool anymore'. There's a reason we shun PERC cards, and there's a reason there are *tons* of stern warnings that hardware RAID is a terrible idea. Some people just don't want to listen.
 