Failed drive notification in ZFS + FreeNAS 8.x, what to do?


BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
In my searching I have discovered that current versions of FreeNAS (8.x) do not seem to properly detect when a drive has a problem in a RAIDZ array/configuration.

In my testing I put 4 drives in a Z1 array and pulled one. After 20 minutes the system had not reported any errors at all, and still showed the pool as HEALTHY.

As I said, my searching seems to reveal this is the state of things until zfsd makes it into FreeNAS 9, which is estimated for mid-summer 2013.

Now, in the meantime, it stands to reason people want to use Z1 arrays and have a workaround for this, but I have not been able to conclusively locate one. I am concerned that if a real error occurs I will never know about it without rebooting. What have you folks done about this? (As in, the people actually using FreeNAS.)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Re: failed drives in ZFS + freenas 8.x, what to do?

If I remember correctly, FreeNAS does eventually figure out something is wrong. I was one of the people who did the actual testing of this issue and how best to resolve it. FreeNAS does appear to check regularly for a failed disk, but it doesn't do it at a very high frequency. Maybe 3-4 times a day at most.

What I do is set up the SMART emails for nightly review and I check those out. Since I know I have 18 disks, I check to see that I have 18 drives' worth of data. The script I included in the thread lists 18 lines of Current_Pending_Sector and 18 lines of temps before listing all of the SMART info, so if I only see 17 lines I know a disk is missing. A detached disk immediately becomes unavailable for SMART polling, so my script will not get any drive data if the drive is unplugged even milliseconds before the SMART data is requested.
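For anyone who wants to build something similar, here is a minimal sketch of that kind of nightly check, not the exact script from the other thread; the disk list is a placeholder you would adjust for your own system:

Code:
#!/bin/sh
# Nightly SMART summary: one Current_Pending_Sector line and one
# temperature line per disk, so a missing disk shows up as a missing line.
DISKS="da0 da1 da2 da3"        # placeholder: list your own disks here

for d in $DISKS; do
    smartctl -A /dev/$d | grep Current_Pending_Sector
done

for d in $DISKS; do
    smartctl -A /dev/$d | grep Temperature_Celsius
done

# Full SMART output afterwards, for the archive
for d in $DISKS; do
    smartctl -a /dev/$d
done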

When I log into my computer every morning I check to make sure I have 18 lines of data, that all of the drives have a Current_Pending_Sector value of 0, and that all of the temps are acceptable (<40°C). Then I delete the email. The emails are also kept on my zpool, so if I ever need to examine historical information I have the option.

Edit: If you are using the zpool and pull a disk, you will see error messages if you do a zpool status. The applicable disk will start racking up lots of read and write errors. Of course, if the disk is not used at all you will see no increase in the error count, but I'd expect regular background operations to start incrementing that number above zero.
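For reference, this is roughly the kind of thing zpool status shows when a raidz1 member drops out; the pool and device names below are illustrative placeholders, not output from my system:

Code:
# zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            da0     ONLINE       0     0     0
            da1     UNAVAIL     21    87     0  cannot open
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0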
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

Did you do any testing to verify this method works? Also, if a drive is disconnected/bad, does the web GUI update for you the next day?

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Re: failed drives in ZFS + freenas 8.x, what to do?

The GUI does eventually update, but it's not instantaneous. It falls into that range I mentioned of probably 3-4 times a day at most. I'm not sure if it's on a timer or triggered by an event, but it will eventually update. For me, the longest it went without updating was about 4 hours. Personally I don't consider it a problem. Obviously I'd prefer the GUI to update immediately, but 4 hours isn't going to significantly change anything.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

I'll have to see what I can get done with this, thanks for the heads up ;D

 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

It's been 24 hours since I removed a drive and left it disconnected, and FreeNAS still has not picked up on any "problems" according to the storage page or the volume status page. This isn't working.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Re: failed drives in ZFS + freenas 8.x, what to do?

Have you done any reading/writing to the pool? I don't believe the system will know anything is wrong until errors are generated by reading and/or writing.
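If you want to force the issue rather than wait, something like this should generate enough I/O to surface errors; the pool name and file path are just examples:

Code:
# read an existing file back to generate pool I/O
dd if=/mnt/tank/some-large-file of=/dev/null bs=1m
# then see whether the pulled disk has started logging errors
zpool status -v tank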
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

In the 24 hours, no. I just did some copying around, and under Storage it reports HEALTHY, but when I go to the pool status a drive says NULL instead of ONLINE. Read and write performance look to be the same as in earlier testing when all drives were in.

It's a bit concerning that it takes activity to bring a problem to light, but it is certainly improbable that absolutely no activity would happen on a production array.

Why does it still say HEALTHY in storage?

 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

Oh, and the alert system says the dataset I made has status "UNKNOWN". I'm not sure how long it took to come up with the alert, as I was looking elsewhere. Looking better, but still a bit concerning.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Re: failed drives in ZFS + freenas 8.x, what to do?

Initiating a scrub seems to change Storage from "HEALTHY" to "DEGRADED", hmmm :S

- - - Updated - - -

When I re-seat the drive I removed and try to replace the "missing" drive with the one I just reconnected, it says it cannot join as it is already a member of the ZFS pool (the same pool I'm trying to connect it to). I assume this means it doesn't want to format it because it is worried the drive has data on it? It mentions a -f flag to force, but nothing in the GUI seems to let me override that. I'm trying to deal with this without rebooting.
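(From what I can tell, the -f it mentions would be the force flag on zpool replace at the command line; a rough sketch only, with the pool name and devices as placeholders, and with the caveat that the GUI procedure is presumably the supported way to do this:)

Code:
# find the GUID/name of the "missing" member
zpool status tank
# force the re-inserted disk back in, overriding the "already part of a pool" check
zpool replace -f tank <old-device-or-guid> da1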
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Re: failed drives in ZFS + freenas 8.x, what to do?

If your hardware fully supports hotswap then it should just be re-added to the zpool. This is good for reliability, but also bad. If you don't recognize that a disk was lost and reconnected (this isn't uncommon with disks that have bad sectors), then the drive may not be in sync with the zpool. The only fix is to perform a scrub (and make sure drives aren't disconnecting and reconnecting during the scrub).

Some hardware doesn't seem to support hot-swap very well, and weird things happen if you do try to hotswap a drive.
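That re-sync is just a scrub from the shell; minimal sketch, with the pool name as a placeholder:

Code:
zpool scrub tank
# watch progress and confirm the disk stays ONLINE for the whole scrub
zpool status tank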
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The indications you describe are normal. You typically will not see a failure indication until you try to read/write to the pool. As for joining, my friend CyberJock speaks the truth. When I did this sort of testing a few years ago I had to reformat the drive I pulled out, reinsert it as a new blank drive, and let it resilver. Also, there is a particular way to reinsert a drive into the pool when it has a swap partition. This stuff is in the user's guide, and I recommend that procedure be followed if you value your data. Yes, I know the data you have now is testing data and can be lost, but in the future that will not be the case.
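Roughly speaking, the manual version of that procedure is to recreate the same partition layout (swap plus ZFS) on the replacement disk and then do the replace. This is only a sketch with placeholder device names, swap size, and pool name; follow the user's guide procedure on a real pool:

Code:
# wipe the old label and recreate the FreeNAS-style layout on the new disk
gpart destroy -F da1
gpart create -s gpt da1
gpart add -t freebsd-swap -s 2G da1
gpart add -t freebsd-zfs da1
# then replace the missing member with the new ZFS partition
zpool replace tank <old-device-or-guid> da1p2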
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Would you mind pointing me to the particular part of the manual?

- - - Updated - - -

That definitely seems like a good reason not to want drives automatically re-attaching and hiding the original problem. For me, I have been concerned that the alert took too long to tell me, but I'll have to retest, as you both say it isn't apparent until activity happens on the zpool.

Thanks for the heads up so far!

If your hardware fully supports hotswap then it should just be "readded" to the zpool. This is good for reliability, but also bad. If you don't recognize that a disk was lost and reconnected(this isn't uncommon with disks that have bad sectors) then the drive may not be in sync with the zpool. The only fix is to perform a scrub(and make sure drives aren't disconnecting and reconnecting during the scrub).

Some hardware doesn't seem to support hot-swap very well, and weird things happen if you do try to hotswap a drive.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Keep in mind this is not really any different from your car in the driveway. When do you find out your car won't start? When you try to start it, of course. It's kind of the same thing with ZFS in FreeBSD 8.3: you won't really know something is bad until it's actually not talking (or talking trash).
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Okay, so I transferred about 15 GB into the Z1 pool and disconnected a drive while it was doing it. The volume status showed "null" on the relevant drive immediately. I'm not sure how long it took zpool to detect an error, but the alert button didn't report an error when I clicked on it until a couple of minutes later. However, the alert icon didn't change until I reloaded the page. A couple of minutes is a pretty reasonable time to detect an issue.

However, I'm stuck at one of the recovery steps.

I'm trying to re-insert the drive, then format it and resilver it. However, dmesg reports the drive was removed, but I do not see the drive being re-detected by the environment.

dmesg:

Code:
mps0: mpssas_alloc_tm freezing simq
mps0: mpssas_remove_complete on handle 0x0009, IOCStatus= 0x0
mps0: mpssas_free_tm releasing simq
(da0:mps0:0:2:0): lost device - 4 outstanding, 1 refs


Now, I'm not sure what firmware is on the controller. The system is using an LSI SAS2008 controller, but I can't remember whether it's IT or IR firmware on there. Still, does "mps" in the disconnect line mean that it's not acting through AHCI, or what?
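Something I'm assuming is worth trying before a reboot is asking CAM to rescan the buses and seeing if the disk comes back; these are just the standard FreeBSD commands, and I have no idea yet whether a rescan actually helps with this controller/driver combination:

Code:
# list what CAM currently sees
camcontrol devlist
# ask CAM to rescan all buses, then check whether the pulled disk reappears
camcontrol rescan all
camcontrol devlist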

Thanks for your insight so far, folks; this is helping reassure me about this :P
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sounds to me like you are in that category I mentioned earlier. Hotswap doesn't appear to be supported fully on your hardware. Try a reboot and see what happens.

Of course, once you reboot and the drive is re-detected you'll have to manually do a scrub, and that will put all the drives back in sync. If you haven't copied a lot of data onto the zpool you shouldn't expect many "errors" on the zpool status page.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
I'm pretty sure this controller should be able to do hot-swap, but I think it might be running the IR firmware, which from what I hear I should switch to the IT firmware for better ZFS performance/support. I'll look into that.
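(If I do reflash it, the usual route appears to be LSI's sas2flash utility; the firmware and BIOS file names below are placeholders for whatever ships in the SAS2008 IT package, so treat this as a sketch rather than a procedure:)

Code:
sas2flash -listall                            # confirm which controller/firmware is present
sas2flash -o -e 6                             # erase the existing (IR) flash region
sas2flash -o -f 2118it.bin -b mptsas2.rom     # write the IT firmware and option ROM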

I'm sure rebooting will initialize the drive, but that kind of defeats the purpose of this exercise ;D

Well, what is "a lot" of data? A percentage of the capacity, or what?

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not necessarily. While the hardware may support it, do the hardware and software support it in FreeBSD? I've seen a few controllers that work flawlessly with hotswap on Windows (Server and Desktop OS) but won't play nice in FreeBSD no matter what. Then, some controllers even do hotswap on Linux, but others don't. It's very hit-and-miss, and I just always tell people to assume hotswap doesn't work for this very reason. One controller I had would let you hotswap one drive, but if you didn't do a cold boot you could never hotswap any other drives.

I haven't seen IT or IR firmware work any better. It seems to be a combination of the driver and the firmware, and if hotswap is supported in Windows it's fairly obvious the firmware supports hotswap too. So the issue would almost certainly be in the drivers.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
I looked, and it seems ahci is not loading. I'm trying to get it into loader.conf, but we'll see how that goes.
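For anyone else hitting this, the line itself is just a loader tunable; on FreeNAS 8.x it should go in as a tunable through the GUI rather than by hand-editing /boot/loader.conf, since the boot image can be replaced on upgrade (that GUI detail is my assumption, so double-check it):

Code:
ahci_load="YES"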

I flashed it from IR to IT firmware and am not really seeing a change just yet, except that I am now having trouble booting from drives on the SAS2008 controller, but I might have set that up wrong.

I'm getting the drive removal error pretty consistently now, but I really need to figure out how to replace the drive without rebooting; there must be a way.

 