mfi0: COMMAND 0xfffffe0000ef TIMEOUT AFTER on Dell R730xd

Status
Not open for further replies.

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
Hi folks,

Back in November I bought 2 x Dell R730xd systems because they had a lot of drive bays and the disks could be operated in JBOD mode to maximize ZFS's abilities. I installed and deployed them with FreeNAS-9.3-STABLE-201509022158.

They have been running fine ever since, until last week when one of them rebooted itself a couple of times overnight, with a bunch of these mfi0: errors going to syslog.
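
For anyone wanting to gauge how often these errors show up, here is a minimal Python sketch that counts them per controller in a set of syslog lines. It assumes the message has the general shape "mfi0: COMMAND 0x... TIMEOUT AFTER ..." as in the thread title; the function name and the deliberately loose pattern are illustrative choices, not anything shipped with FreeNAS.

```python
import re
from collections import Counter

# Assumed message shape: "mfi<N>: COMMAND 0x<addr> TIMEOUT AFTER ...".
# The pattern stops after "TIMEOUT AFTER" because the rest of the line
# is not shown in the thread.
MFI_TIMEOUT = re.compile(
    r"\b(mfi\d+): COMMAND (0x[0-9a-fA-F]+) TIMEOUT AFTER", re.IGNORECASE
)


def count_mfi_timeouts(lines):
    """Return a Counter mapping controller name (e.g. 'mfi0') to timeout count."""
    counts = Counter()
    for line in lines:
        match = MFI_TIMEOUT.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

Passing it an open file, for example the lines of /var/log/messages if that is where your syslog goes, gives a quick tally to compare between the broken system and the healthy one.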

I don't see any of these errors on the other identical system.

At one point on the broken one I was able to re-run the installer and get FreeNAS installed again. But in going through the config it started to hang and have issues.

Now I cannot even re-run the installer.

I am pretty sure I have some kind of HW issue, but Dell keeps telling me that FreeNAS is not supported and they cannot help me. I already used the iDRAC to send them a HW diagnostics report and, unfortunately for me, it was clean.

The system is in BIOS mode, not UEFI. And, as mentioned, the installer ran fine a few months ago.

Can anyone help?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
JBOD mode is insufficient; see the LSI sticky.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
Thanks for the quick reply.

This unit has a PERC H330 Mini controller - are you saying that is actually an LSI?

I'm not sure what "LSI Sticky" is but I'll google it I guess.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Dell doesn't make their own silicon. They just build systems out of other companies' silicon.

The LSI sticky is available at the top of the hardware forum.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
OK found that thread (not a forums guy so did not catch the 'sticky' reference)

I went through each of the 17 pages of that thread searching for references to Dell and PERC and did not find anything useful.

Could you be a little more specific on what I am looking for and/or what I need to do?

17 pages of chatter is an awful lot to have to read through word for word hoping to find some gem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
First post: you don't want to be using the mfi driver. Fix your RAID controller by replacing it with an HBA. Note that doing so probably renders the pool inaccessible, but fortunately you've got a second system with an MFI based controller there. I don't tend to write long replies when on the cell phone.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
This does not explain why the system worked fine for 4 months with the Dell PERC H330 Mini in JBOD mode, and why the second system continues to work with it as well.

Replacing the PERC with an HBA is not an ideal option given that this has worked all along.

The problem is not with the ZFS array at this point (knock on wood).
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
Also, this is still under support from Dell, but they are telling me "we do not support FreeNAS". HW Diagnostics do not show a HW issue, but I am not convinced there is none. But I have no idea how to debug with FreeNAS.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
One final question: where can I find details on what this error means?

I did some googling and did not come up with much.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
One more question: does anyone offer paid support on this issue?

I have no problem at all paying someone to help me fix it. But the FreeNAS site seemed to indicate this was not an option.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Your controller is the problem. You can PayPal me money for that support if you want.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's very difficult to know what is going on behind the MFI firmware and driver. Part of the problem with putting a tiny little hardware RAID controller in between the big massive software ZFS disk management system and the disks is that ZFS, with its massive caches in your host system and a desire to be able to pump lots of data at all the disks, will tend to punish the heck out of the RAID controller. This is a bad configuration, and it also makes it much more difficult for ZFS to be able to interface directly with the disks and understand their status and whether they're operating correctly. It is very possible that one of the disks has gone bad and is causing flaky performance from the RAID controller. That'd be my offhand suspicion. The controller may not have a clear idea of what is wrong. The mfi man page documents the driver, and the mfiutil administrative utility is used to chat with it.
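
As a rough illustration of what "chatting with the driver" could look like, here is a minimal Python sketch that runs a handful of mfiutil reports and collects their output. The subcommands and the -u unit option are from the mfiutil man page; the function names and the choice of which reports to pull are assumptions for illustration only.

```python
import shutil
import subprocess

# Report subcommands documented in mfiutil(8). Which ones are worth
# pulling first is a judgment call, not something stated in this thread.
REPORTS = ("show adapter", "show volumes", "show drives", "show events")


def mfiutil_commands(unit=0):
    """Build one argument list per report for the controller at mfi<unit>."""
    return [["mfiutil", "-u", str(unit), *report.split()] for report in REPORTS]


def collect_reports(unit=0):
    """Run each report; return {command: output}, or {} if mfiutil is missing."""
    if shutil.which("mfiutil") is None:
        return {}
    results = {}
    for argv in mfiutil_commands(unit):
        done = subprocess.run(argv, capture_output=True, text=True, check=False)
        results[" ".join(argv)] = done.stdout + done.stderr
    return results
```

Running collect_reports() as root on the affected box and reading through the drive and event output is the sort of first pass being suggested here; what the controller chooses to report is up to its firmware.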

Your best option would be to see if you can get an HBA - not a RAID controller in RAID mode, not a RAID controller in JBOD mode, but an honest to goodness HBA - installed to handle the disks. This gives FreeNAS and ZFS the access to the disks that is best suited to providing better diagnostics, at which point there are a dozen guys here who'd provide interesting and useful commentary on the SMART data from the disks and other things that'd help you identify your problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
One final question: where can I find details on what this error means?

I did some googling and did not come up with much.

And, yes, some of us can provide paid support, but up to this point, the first thing you'll hear from me as a consultant will be the two paragraphs I just wrote, followed by "but I'm happy to bill some time trying to see if I can use mfiutil to find a problem. It'd be cheaper to get the HBA."

Almost everything a consultant ought to do for you is something you'll find people are willing to advise you on for free on the forum.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
This does not explain why the system worked fine for 4 months with the Dell PERC H330 Mini in JBOD mode, and why the second system continues to work with it as well.

You can choose to see this as evidence that the controller isn't the problem, despite contrary advice coming from one of the forum's most experienced members. Alternatively, you can take this as a wakeup call, i.e. that the 2nd system is on borrowed time.
 

Alan McKay

Dabbler
Joined
May 9, 2016
Messages
18
Hey sorry I did not mean it to sound like I was second-guessing the experience of this forum, but it is the obvious question that anyone would want to be able to answer. If the controller is the issue in and of itself, why did it work for so long? I still don't have a great answer on that. Like, has something gone wrong with the controller for example? Did it fail or break?

But I do hear what you are saying about the second one being on borrowed time.

From what I'm piecing together above, I lose my array if I swap out an HBA is that right? So I need a 3rd system to copy data into first? I mean, if I swap the PERC with an HBA, it won't be able to reassemble the array?

But here is a different point. I'm currently having issues just getting FreeNAS re-installed onto both the dual SDA in the Dell, as well as onto a USB stick we put in there for debugging. This is before anything even wants to try to find my ZFS array and reassemble and mount it.

What does that tell me? I think it tells me there is a basic problem with the controller, no? As in, something broke.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
If the controller is the issue in and of itself, why did it work for so long?

It's a fair question, but you might not ever get a satisfactory answer. It's important to keep in mind that FreeBSD and its drivers don't play nicely with every possible hardware combination, hence the very conservative approach of these forums in steering people towards thoroughly tried and tested combinations.

I lose my array if I swap out an HBA is that right?

Not necessarily; it depends on whether the controller got in the way when ZFS was formatting the disks. You could test this non-destructively by attempting to mount the pool read-only with an HBA (you'll have to use the command line for this test).

I'm currently having issues just getting FreeNAS re-installed ... I think it tells me there is a basic problem with the controller, no?

Could be - not really enough information to tell.

One thing you can do to try to work around this is to run the installation on a different system, then transfer the boot device to the target system.
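
For the read-only mount test mentioned above, a minimal Python sketch of the commands involved might look like this. The zpool import options used (-o readonly=on and -R for an alternate root) are standard; the pool name "tank", the /mnt altroot and the function names are hypothetical placeholders.

```python
import subprocess


def list_importable_pools_command():
    """`zpool import` with no pool name only lists pools available for import."""
    return ["zpool", "import"]


def readonly_import_command(pool, altroot="/mnt"):
    """Import `pool` read-only, mounted under `altroot` rather than its usual path."""
    return ["zpool", "import", "-o", "readonly=on", "-R", altroot, pool]


def try_readonly_import(pool, altroot="/mnt"):
    """Run the import; return (succeeded, combined output) for inspection.

    Raises FileNotFoundError if zpool is not on the PATH.
    """
    done = subprocess.run(
        readonly_import_command(pool, altroot),
        capture_output=True, text=True, check=False,
    )
    return done.returncode == 0, done.stdout + done.stderr
```

Something like try_readonly_import("tank") with the disks on the HBA would show whether ZFS can still find and open the pool. If the pool was never exported from the old system, the import may also need zpool's -f flag, which is left out here on purpose.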
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hey sorry I did not mean it to sound like I was second-guessing the experience of this forum, but it is the obvious question that anyone would want to be able to answer. If the controller is the issue in and of itself, why did it work for so long? I still don't have a great answer on that. Like, has something gone wrong with the controller for example? Did it fail or break?

Obviously something's gone wrong. Now you want an answer as to what. I already gave you a potential theory on that.

It's kinda like using the microwave to dry your clothes. Sure, it works. But then, one day, mysteriously, it stops working, so you go to the microwave support forums and want answers. They look at you a little funny and say "well we kinda expect that could work, but we don't actually know, because none of us do that. Magnetron may have overheated. Most of us dry clothes on the line or in a dryer, and use our microwaves for our dinners." And you don't care for that answer because it doesn't actually tell you what's wrong. Which we understand is frustrating.

I'm sorry that we don't know, but we don't really know. MFI-based arrays never seem to end well. The LSI sticky talks a bit about that and I've already pointed at mfiutil for you, so you've already been pointed in the right direction, as far as we're reasonably able, to figure out what's wrong. We've also suggested an HBA as remediation, which is the typical fix.

But I do hear what you are saying about the second one being on borrowed time.

From what I'm piecing together above, I lose my array if I swap out an HBA is that right? So I need a 3rd system to copy data into first? I mean, if I swap the PERC with an HBA, it won't be able to reassemble the array?

But here is a different point. I'm currently having issues just getting FreeNAS re-installed onto both the dual SDA in the Dell, as well as onto a USB stick we put in there for debugging. This is before anything even wants to try to find my ZFS array and reassemble and mount it.

What does that tell me? I think it tells me there is a basic problem with the controller, no? As in, something broke.

Well, that could certainly be. But FreeNAS should always install fine onto a USB stick; do the prep on a different machine if needed. Then you should be able to log in and poke at the thing with mfiutil.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The DELL PERC H330 is listed in the FreeBSD supported hardware for 9.3

http://www.freebsd.org/releases/9.3R/hardware.html#DISK

Sure, so are Realtek ethernets, Marvell SATA controllers, non-ECC CPUs, USB hard drives, and wireless network cards. All of which are crap grade hardware, none of which work well (some not at all) with FreeNAS. Or even with FreeBSD, if truth be told.

All the various permutations of hardware have made the "PC" (meaning, usually, "Windows PC") an ongoing pain in the ass to the average computer owner, who often finds things that don't work quite right or only work sorta well. FreeBSD has tried to make as much hardware as possible work with FreeBSD, but do not confuse that for "working well." In many cases, working well is just totally outside the scope of the silicon, because it was designed by a manufacturer who is trying to make money in the PC market, where getting silicon out the door with a Windows device driver that just barely works is the only way to profitability. So when you look at the hardware that's recommended for FreeNAS, what you'll often find is that it isn't the cheap consumer grade stuff. It's the stuff that's better quality. But even there, we've identified significant limitations. When you're trying to make a device that operates correctly, continuously, under high load and adverse conditions, for years at a time, you need virtually flawless hardware and drivers.

As for the LSI RAID stuff, it's probably the best stuff out there, but the fact that the LSI hides a lot of the low level stuff from FreeNAS and ZFS means that what you're getting is not a real accurate view of what's going on. Like trying to drive a car in a downpour. You get some idea as to what's going on but clarity is lacking. As you noted, it worked GREAT for several months - totally expected - and then suddenly it doesn't work great - not unexpected. When something goes wrong, you've got to chat with the controller to find out what's gone wrong, and often it won't really identify a specific problem, but you might get a hint that a certain disk might need to be replaced. By comparison, when you're using an HBA, you can actually talk to the hard drive's SMART subsystem and get reports on read errors, communications issues, and all sorts of other faults. Plus usually faults bubble up through the device driver layer.

So anyways the mfiutil advice I provided back up in #12 is still the primary way to identify what's gone wrong. Beyond that, you need to start removing drives and attaching them to a SAS HBA (guessing they're SAS disks, otherwise an Intel SATA port is absolutely dandy for talking to SATA drives), and then seeing if you can identify problem trends in the drive statistics. If after going through them all, you can identify a fail-y drive, then try replacing it. Your problem probably goes away, until the next drive starts failing, at which point you get to repeat this whole fun procedure.
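
To make "identify problem trends in the drive statistics" a bit more concrete, here is a minimal Python sketch. It assumes the drives are on an HBA and show up as plain devices, that you have already read the raw values out of smartctl -a output by hand or otherwise, and that the three ATA attribute names below are the ones worth watching. SAS drives report differently, so treat the attribute list, the zero threshold and the sample readings as illustrative assumptions.

```python
# ATA attribute names as smartctl prints them. Treating any nonzero raw
# value as suspicious is an assumption, not a rule from this thread.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def smartctl_command(device):
    """Argument list for a full SMART report on one device."""
    return ["smartctl", "-a", device]


def suspect_drives(raw_values):
    """Map each device to the watched attributes whose raw value is above zero.

    raw_values: {device: {attribute name: raw value}}
    """
    suspects = {}
    for device, attrs in raw_values.items():
        bad = [name for name in WATCHED if attrs.get(name, 0) > 0]
        if bad:
            suspects[device] = bad
    return suspects


# Hypothetical readings, for illustration only.
readings = {
    "/dev/da0": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
    "/dev/da3": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 8},
}
print(suspect_drives(readings))
```

With the hypothetical readings above, only /dev/da3 would be flagged, which is the kind of hint that points at a single drive to try replacing.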

If you think that debugging process sucks, you're right, and since FreeNAS is perfectly capable of talking to drives via an HBA and monitoring SMART for failure trends itself, I will once again refer you to the first reply in this thread. A properly designed FreeNAS system will let you identify problem drives through some basic queries and will even be able to do things like automatic replacement of failed drives. We're not left to guessing.
 