FreeNAS 9.3 (stable) kernel panic when disk fails

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
I have a Dell 2950 server: dual Xeon, 32GB FB-DIMM RAM, 8x 146GB 10K SAS disks connected via a PERC 6/i. I am running the latest system BIOS and PERC 6/i firmware. This is the second FreeNAS box I am building, and I am running into a weird problem.

I create eight single-disk RAID 0 volumes with mfiutil and correspondingly see eight /dev/mfidX devices. Then I create a RAIDZ2 pool from the eight disks. Next I create a dataset and share it out with CIFS. Then I start a large file transfer to the array and max out my gigabit link. Finally, I use mfiutil to fail a disk. The system KERNEL PANICS.
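For reference, roughly what I'm doing (drive IDs and the pool name are just examples, and the pool/dataset are actually created through the GUI, but this is the sequence):

    # one single-drive RAID 0 volume per physical disk (drive IDs 0-7 are examples)
    mfiutil create jbod 0 1 2 3 4 5 6 7
    mfiutil show volumes                  # should list eight volumes -> /dev/mfid0 .. /dev/mfid7

    # RAIDZ2 pool and dataset ("tank" is an example name)
    zpool create tank raidz2 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7
    zfs create tank/cifsshare             # dataset that gets shared out over CIFS

    # with a large CIFS transfer running, mark one drive as failed
    mfiutil fail 7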

I can repeat the above steps when no data transfer is occurring, and the system handles it perfectly fine. I can even replace the failed disk with a new one and rebuild the pool with no problems.

This is NO GOOD. My last FreeNAS build was a generic Supermicro box with some sort of LSI card that supported IT mode. I could physically pull two disks from a RAIDZ2 volume during a write. Granted, the system hung for a second, but otherwise it remained connected and usable with a degraded volume.

I plan on using this Dell box as iSCSI storage for VMware. If this box reboots, my ESX hosts connected to the iSCSI LUN will also hang and take my cluster down.

I have seen WAY TOO MANY posts about rebooting being normal when a disk fails... This is not an option.

Is there a bug in the mfi driver where high throughput + a disk failure == panic?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Did you read the forum posts about PERC controllers? When you created eight RAID 0 volumes you put ZFS on top of hardware RAID. This is a no-no, and I can't tell you how many people I've personally talked to on the phone who have lost their pools with that configuration.
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
Using the PERC 5/i and testing drive failures while sustaining gigabit network throughput to the RAIDZ2, I was able to pull disks during a write to simulate failure, replace them with two new disks, and rebuild the pool. I'm happy with the setup for now.
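For anyone following along, the replacement went roughly like this from the shell for each failed disk (device and drive numbers are examples; the FreeNAS GUI replace workflow does the equivalent):

    zpool status tank                  # pool reports DEGRADED, failed mfid device listed
    mfiutil create jbod 8              # new single-drive RAID 0 for the replacement disk (example ID)
    zpool replace tank mfid6 mfid8     # resilver the new device in place of the failed one
    zpool status tank                  # watch the resilver until the pool is ONLINE again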

I agree that the RAID 0 layer between the OS and the disks is bad business for SMART, but with Dell server hardware and its SAS/SATA backplane, my options for controllers are quite limited. The rest of the PCIe slots in the server are filled.

When I performed the testing, I was just happy that I could fail a ZFS disk and replace it without losing the pool or needing to reboot the server. For everyone else who thinks rebooting is part of a drive replacement... I guess we drink different Kool-Aid.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Rebooting or not isn't really the issue. My honest question is how do you plan to monitor the drives?
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
Rebooting or not isn't really the issue. My honest question is how do you plan to monitor the drives?

Hence the testing. I have FreeNAS tied to a Google Apps account for email alerts. Upon typing "mfiutil fail 7" at the CLI, I got a "degraded" email within 30 seconds. For good measure I also issued "mfiutil fail 6". No new email for the loss of the second drive... :(
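About the most monitoring the controller will give you is polling it directly, e.g. (exact output depends on the firmware):

    mfiutil show drives      # per-drive state: ONLINE, FAILED, REBUILD, ...
    mfiutil show volumes     # volume state: OPTIMAL vs DEGRADED
    mfiutil show events      # controller event log, including drive failures

That could be wrapped in a cron job that emails when anything isn't ONLINE/OPTIMAL.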

So, no SMART warnings for me... If I ever need to rebuild this box, I'll find a way to stuff a proper HBA in place of the PERC... It's a standard-size PCIe x8 card for the most part...
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
I'm more curious whether anyone has tried to use an LSI 1078 chipset with FreeNAS... Failing a disk should NOT cause a kernel panic.

(back on topic)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm more curious whether anyone has tried to use an LSI 1078 chipset with FreeNAS... Failing a disk should NOT cause a kernel panic.

(back on topic)
You are 100% right, it shouldn't. There are two problems (one of them is the insistence on using hardware RAID). I have no doubt the user will later be complaining about a corrupt pool without explanation, and the answer will be 'restore from backup; you can't access the pool anymore'. There's a reason we shun PERC cards, and there's a reason there are *tons* of stern warnings that hardware RAID is a terrible idea. Some people just don't want to listen.
 