ZFS - more graceful failing in case of hardware defect?

Status
Not open for further replies.

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Hi,

About an hour ago, one of my disks died - FreeNAS noticed something was wrong (see attachment), but took over 30 minutes to finally remove it from the ZFS (RAID-Z3) pool. During that time, the machine was completely unresponsive, NFS didn't work, and the iSCSI target for the VMware cluster became unavailable.

This is a bit useless. Is there any way to make FreeNAS remove failing disks from the pool faster so I don't get downtime? That is kind of the point of having a RAID, isn't it....
 

Attachments

  • error.png (17.3 KB)

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Actually the errors you are looking at are hardware errors and not ZFS errors. One ZFS error ≠ One hardware error.

You can change the threshold for ZFS as well as your hardware (if supported) but I don't have that info handy.
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Yes, I know that they're hardware errors. Still, the entire pool became unresponsive until the kernel was fed up and disconnected the device. It could have been removed a lot earlier though - maybe an option to kick devices out of the pool on SMART errors, or after a settable threshold of I/O errors, would be a nice feature in the future?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
No, let's put things in perspective...

Your system does not monitor SMART errors live. It checks every 30 minutes or so (30 minutes is the WebGUI default). So you'd *still* be looking at 30 minutes in a best-case scenario, unless you want to be querying the living hell out of your disks. So clearly this isn't a good way to decide what to do.
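For reference, a manual check is just a smartctl call or two (the device name is only an example, use your own da/ada numbers):

    # one-off health check plus the drive's error log
    smartctl -H -l error /dev/da0

    # full SMART output, including attributes and the self-test log
    smartctl -a /dev/da0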

ZFS will kick drives after so many I/O errors, and I believe that threshold is adjustable with a tunable (I don't have the tunable handy). The problem is that if you make the threshold too low you'll be kicking out good drives, and kicking out good drives leads to unmountable pools. So that takes some serious planning, and accepting that disks might get kicked that would be completely fine in every other server except your FreeNAS server. If you want to go ultra-sensitive then you'd better have a good solid backup schedule, because you can expect to be kicking out disks frequently enough to inadvertently offline your pool and need a restore from backup as a result.
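I don't have the exact tunable handy, but the kind of knobs I mean are the FreeBSD CAM retry/timeout sysctls; names and defaults vary between FreeBSD/FreeNAS versions, so treat these as things to look up rather than values to copy:

    # print the descriptions of the retry/timeout knobs (drop -d for the values)
    sysctl -d kern.cam.da.retry_count
    sysctl -d kern.cam.da.default_timeout

    # a lower timeout would go in via the FreeNAS Tunables screen, e.g.
    #   kern.cam.da.default_timeout=30
    # (illustration only, not a recommendation)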

Your hardware will also kick drives off the system after so many problems or timeout periods. This depends on your hardware, as some controllers let you choose those settings and others don't. The M1015 had a robust list of settings if I remember correctly, while others give you zero options: you stick with what the controller does, and if you don't like it, tough sh*t. Clearly this isn't the most reliable solution, as your choice of hardware has the biggest impact on how much control you have.

The *best* way to handle this is to schedule the following (rough manual equivalents are sketched below):

-ZFS scrubs
-SMART monitoring
-SMART testing
-email alerts to something like a cell phone
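
In FreeNAS the WebGUI has schedulers for all of these; done by hand the equivalents look roughly like this (pool and device names are just examples, use your own):

    # start a scrub by hand (pool name "tank" is just an example)
    zpool scrub tank

    # run a short or long SMART self-test by hand
    smartctl -t short /dev/da0
    smartctl -t long /dev/da0

    # check the self-test results afterwards
    smartctl -l selftest /dev/da0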

Have spare disks ready and have the ability to offline a disk from anywhere on the planet at a moment's notice *if* this is a problem.
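Offlining and replacing by hand is only a couple of commands once you can reach the box (pool, disk and spare names here are made up; use whatever zpool status shows for your setup):

    # take the failing disk out of service
    zpool offline tank da5

    # swap in a spare that's already in the chassis, then watch the resilver
    zpool replace tank da5 da9
    zpool status tank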

If all of this is unacceptable you should be going to iX for a custom built setup with HA (high availability) as that is the only thing that is going to do what you are wanting/needing.

In my case, if my home server started having problems I'd know about it within 30 minutes, as I'd get a text message on my phone. Then, literally from my phone, I could VPN into my network and offline a disk if necessary. I've actually done this, so I know it can be done.

The real question I have for you is "how much planning did you do when you set up your box?" because it sounds like you didn't really plan ahead for this eventuality and are quite angry at the server for not being sensitive enough for your wants/needs/desires, but expect the server to comply. To be blunt, I don't trust any hardware or software to ever be better than what a little human hand-holding can do. You shouldn't either if uptime and other things are ultra-important. And if they are that important, it's probably better to pay a professional to deal with these problems for you. iX offers 24x7 2-hour call support if you want it. And believe me, many companies want it.
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Yes, you are right. I have hot spares, email notification on my smartphone and VPN access, of course. And yes, I'm scrubbing every Sunday morning. This is academia, so the budget for storage is well under €10,000; I don't see an HA solution there. No, I do not have a problem with offlining disks manually and replacing them with hot spares manually. That would be great, if it worked. However, the whole machine developing I/O latencies of several minutes because one disk in a RAID-Z3 hangs, that IS a problem. I couldn't offline the disk, because the entire pool didn't react anymore. I couldn't log in via SSH. I logged in on the local console via the RMM, 'zpool status' started and immediately hung, until the kernel declared the disk dead half an hour later.

So, basically, what you're saying is: my controller (LSI SAS 9201-16i) doesn't declare disks dead quickly enough and therefore I have to live with the entire machine freezing until it does. Huh. I'm wondering what the professional solutions do instead; I assume they're basically also just using SAS HBAs with a lot of disks.

Oh well :(
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Well, that depends.

Some "professional" solutions do everything wrong. One guy a few years ago paid like $100k for a server that had ZFS running on a hardware RAID5. The company just wanted to pocket money quickly and sold him a system that was virtually guaranteed to eat his data "someday". Sure enough, a year or so later it did! The really crappy part is that I built a server that was faster, cheaper, and more reliable than his system for like $3k.

Others do things correctly. iX does things correctly for ZFS and since they are under contract 99% of the time it's also in their best interest to do it right. They do it wrong and not only is their reputation ruined but they'll spend so much money in support calls that they won't be able to stay profitable. ;)

There's some risk to doing it yourself. My first FreeNAS box used non-ECC RAM. I had to set up individual disks as independent RAID arrays so I could do ZFS. I couldn't do SMART testing or monitoring. Basically about the worst possible ideas for ZFS you can imagine. If you want to do it yourself then you also need to be your own professional. The 9201-16i may or may not be a good card for ZFS. I can't keep track of the bajillion LSI cards, but if you can't run it in JBOD mode and do SMART monitoring and testing, then you shouldn't be using that card with ZFS at all. Additionally, if you aren't able to change important things like when the controller finally kicks a drive, then you may have to look elsewhere.
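A quick sanity check that a card is really doing pass-through: the disks should show up as plain devices and answer SMART queries directly (device name again is only an example):

    # list what CAM sees attached to the controller
    camcontrol devlist

    # if this returns the real drive model/serial and SMART capability,
    # the HBA isn't hiding the disk behind a RAID volume
    smartctl -i /dev/da0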

Also, there's a major difference between consumer-grade and enterprise-class drives... TLER. A 7-second auto-fail for a read or write command can be a make-or-break situation for some servers, and spending the money on drives that do TLER can be just as valuable. TLER helps prevent locking up your pool because the drive gives up on a failing command quickly instead of retrying indefinitely.
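On SATA drives that expose it, the knob is called SCT ERC and smartctl can read and set it (SAS drives handle this through a mode page instead). The device name is an example, and on many drives the setting doesn't survive a power cycle, so it has to be reapplied at boot:

    # query the current error recovery timeouts (SATA drives with SCT ERC)
    smartctl -l scterc /dev/ada0

    # set read and write recovery limits to 7.0 seconds (units are tenths
    # of a second)
    smartctl -l scterc,70,70 /dev/ada0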

There's even more to building a high-end file server than this, but this is just some important stuff "from the surface".
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Yes, the controller has been flashed with IT-mode firmware. I have direct access to the drives and can query them using smartctl. The only thing doing RAID there is ZFS. I've been through the whole "one RAID-0 logical volume per attached disk and then RAID-Z2 on top" thing with a RAID-only controller before, so I made sure that wouldn't happen again.

Also there's a major difference between consumer grade and enterprise class drives... TLER.

BINGO! That is precisely what I've been looking for. The drives are Seagate Constellation ES.3 SAS, they might support this. Can ZFS handle TLER (or ERC for Seagate)?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
It's not a "ZFS" thing. It's a hardware thing. You'd have to check with Seagate's support structure as to how to enable it or even change it if your disk supports it (which I believe it does but I'm not a Seagate guru).
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
It's not a "ZFS" thing. It's a hardware thing. You'd have to check with Seagate's support structure as to how to enable it or even change it if your disk supports it (which I believe it does but I'm not a Seagate guru).

Yes, I understand that. From what I've read, a disk that supports it can, after a set amount of time, ask the RAID controller to recover the data via RAID when it can't read a sector, instead of retrying again and again or failing completely. My question was whether ZFS on BSD supports that, or whether the "quick failing" is the only useful part of this feature set for ZFS. Which would be great as well, of course.

Anyway, I'll read up on how to enable it if my disks support it. If they do, my problem is probably solved. If not, I will buy a few disks that list the feature in their datasheet, see if I can get it working in my lab, and keep this in mind for the future. Thank you!
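If I'm reading the docs right, SAS drives expose this through the Read-Write Error Recovery mode page rather than SCT ERC, so something like the following should at least show what the drives report (sdparm isn't in the base system, it comes from ports/pkg, and whether the fields are changeable depends on the drive):

    # dump the Read-Write Error Recovery mode page (device name is an example)
    sdparm --page=rw /dev/da0

    # smartctl's extended output also includes SAS error-recovery details
    smartctl -x /dev/da0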
 

herr_tichy

Dabbler
Joined
Jul 14, 2014
Messages
12
Forget it - I had a moment of stupidity there. I now understand correctly what TLER/ERC/whatever provides: a quick return of an error if an I/O fails.
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
So, basically, what you're saying is: my controller (LSI SAS 9201-16i) doesn't declare disks dead quickly enough and therefore I have to live with the entire machine freezing until it does. Huh. I'm wondering what the professional solutions do instead; I assume they're basically also just using SAS HBAs with a lot of disks.

Just guessing here:

Is the card in IT (Initiator-Target) or IR (Integrated RAID) mode? Hopefully it's in IT mode and just acting as a "dumb" JBOD pass-through and not trying to be disastrously clever...
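If you have LSI's sas2flash utility around (it may need to be installed separately), it will tell you which firmware the card is running - "IT" in the firmware product ID means plain pass-through:

    # list the adapter(s) and firmware; look for IT vs IR in the output
    sas2flash -list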
 