Replaced dead drive in RAIDZ1, now what?

Status
Not open for further replies.

purduephotog

Explorer
Joined
Jan 14, 2013
Messages
73
I once had a customer call me to complain that a piece of software I supported wasn't working. I wrote out a huge diagnostic fault tree, with loads of simple tests... and at the end I added a caveat that, while highly unlikely, it was possible to have two bad interface cables on the same machine, although I had never seen it happen.

It was two bent pins.

Soooo.... I believe in coincidences.
 

darrenbest

Dabbler
Joined
Apr 7, 2012
Messages
33
Alright, I haven't touched the SATA cable yet. After streaming a few videos off the NAS last night, the checksum errors have really spiked now.

Code:
[root@freenas8] ~# zpool status
  pool: FREENAS-8
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed after 0h0m with 0 errors on Sat Jul 20 14:56:03 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        FREENAS-8                                      ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            ada0p2                                      ONLINE      0    0    0
            ada1p2                                      ONLINE      0    0    0
            ada2p2                                      ONLINE      0    0    0
            gptid/772dfdc9-ef57-11e2-bbc4-001cc05665fb  ONLINE      0    0 1.71K  1.41M resilvered
            ada4p2                                      ONLINE      0    0    0
 
errors: No known data errors


From 104 errors yesterday afternoon to 1710 this morning.

I'm out today, but I'll replace the SATA cable, run some more streaming off the box, and see what happens. Thanks for the input.
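
For comparing the CKSUM count between checks (104 yesterday, 1.71K today), a small filter over the `zpool status` output can help. This is only a sketch: the function name `nonzero_cksum` is made up for illustration, and it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM).

```shell
#!/bin/sh
# Print "device count" for every row of the config table whose CKSUM
# column is not 0. Reads `zpool status` text on stdin, so it also works
# on saved output.
nonzero_cksum() {
    awk '
        $1 == "NAME"    { in_config = 1; next }  # header row of the config table
        $1 == "errors:" { in_config = 0 }        # first line after the table
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# On the live pool (pool name taken from this thread):
#   zpool status FREENAS-8 | nonzero_cksum
```

Running it from cron and saving the output with a timestamp would show whether the count is still climbing.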
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not buying that the SATA cable is bad. It doesn't hurt to replace it though.

I'd run another scrub. The only time I've seen a scrub or resilver finish in a second or less was on drives that were out of sync. In that case some repair takes place, but you need to issue a second scrub command to get a full scrub.

You really need to stop and give this some urgency. Those CKSUM errors are saying that data on your drive is missing or corrupt. In effect, you have all of your drives online, but one of them is missing data or has corrupted data. If another disk failed right now, your pool would be gone.
 

darrenbest

Dabbler
Joined
Apr 7, 2012
Messages
33
Alright, cyberjock, point taken. I'm glad I'm asking these questions: your responses indicate my fears are legitimate.

I've not replaced the cable, but I've logged in remotely and started "zpool scrub FREENAS-8". Now, 5 minutes later, I've got this:
Code:
[root@freenas8] ~# zpool status
  pool: FREENAS-8
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress for 0h5m, 0.53% done, 17h39m to go
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        FREENAS-8                                      ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            ada0p2                                      ONLINE      0    0    0
            ada1p2                                      ONLINE      0    0    0
            ada2p2                                      ONLINE      0    0    0
            gptid/772dfdc9-ef57-11e2-bbc4-001cc05665fb  ONLINE      0    0  243K  7.14G repaired
            ada4p2                                      ONLINE      0    0    0
 
errors: No known data errors


So it certainly looks like more is going on. I'll report back when the scrub is finished, and I look forward to your diagnosis. Thanks, everyone.
 

darrenbest

Dabbler
Joined
Apr 7, 2012
Messages
33
Ok, sorry. I was interrupted by a power failure two days ago, while I was out. The last time I had checked in with it remotely, the scrub was about 80% complete, and 1.18TB had been "repaired" on ada3.

After getting back yesterday, I started another scrub. When I checked back later, here's what I found:

Code:
[root@freenas8] ~# zpool status
  pool: FREENAS-8
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 0h28m with 0 errors on Tue Jul 23 14:51:15 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        FREENAS-8                                      ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            ada0p2                                      ONLINE      0    0    0
            ada1p2                                      ONLINE      0    0    0
            ada2p2                                      ONLINE      0    0    0
            gptid/772dfdc9-ef57-11e2-bbc4-001cc05665fb  ONLINE      0    0  629K  18.3G repaired
            ada4p2                                      ONLINE      0    0    0
 
errors: No known data errors


So it must have been pretty close to completion of the scrub when the power went off, as it only took 28min.

I initiated another scrub this morning, and here is what I've got so far:

Code:
[root@freenas8] ~# zpool status
  pool: FREENAS-8
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress for 3h9m, 39.91% done, 4h45m to go
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        FREENAS-8                                      ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            ada0p2                                      ONLINE      0    0    0
            ada1p2                                      ONLINE      0    0    0
            ada2p2                                      ONLINE      0    0    0
            gptid/772dfdc9-ef57-11e2-bbc4-001cc05665fb  ONLINE      0    0  629K
            ada4p2                                      ONLINE      0    0    0
 
errors: No known data errors


It does not appear to be "repairing" anymore, and I am not seeing an increase in checksum errors (I have been streaming video off the box throughout). All in all, it appears promising. Am I correct? Is the new drive functioning correctly now?

After the scrub finishes (assuming no more issues), should I issue the "zpool clear" command to delete the checksum errors?

Thanks for all your help.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not to make you spend more money on your server, but a UPS is pretty much mandatory for FreeNAS. All file servers have heavy warnings recommending a UPS and FreeNAS is no different. We've had quite a few users that lost their entire zpool because of an unplanned shutdown.

Yes, once the scrub completes you can do a zpool clear.

Not having a UPS can cause its own errors, like yours, due to partial writes. :(
 

darrenbest

Dabbler
Joined
Apr 7, 2012
Messages
33
Not to make you spend more money on your server, but a UPS is pretty much mandatory for FreeNAS. All file servers have heavy warnings recommending a UPS and FreeNAS is no different. We've had quite a few users that lost their entire zpool because of an unplanned shutdown.

I do have a UPS, actually, but it's plugged into two servers, and only the other one is configured for a graceful shutdown. The power outage lasted several hours, and I wasn't home. When I can afford a second UPS dedicated to the FreeNAS box, I'll be better protected.

In the meantime, here's the status from the completed scrub:

Code:
[root@freenas8] ~# zpool status
  pool: FREENAS-8
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 13h27m with 0 errors on Wed Jul 24 22:18:10 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        FREENAS-8                                      ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            ada0p2                                      ONLINE      0    0    0
            ada1p2                                      ONLINE      0    0    0
            ada2p2                                      ONLINE      0    0    0
            gptid/772dfdc9-ef57-11e2-bbc4-001cc05665fb  ONLINE      0    0  629K
            ada4p2                                      ONLINE      0    0    0
 
errors: No known data errors


As hoped, the checksum errors did not go up: they stayed at the 629K number. I've cleared the errors now. Here's to redundancy, and not losing any data. Thanks, everyone!
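
For anyone following the same scrub, check, clear sequence, here is a rough sketch of waiting for a scrub to finish before looking at the counters. The function names are made up for illustration, and the match string assumes the "scrub in progress" wording shown in the status output above; other ZFS versions may word that line differently.

```shell
#!/bin/sh
POOL="${POOL:-FREENAS-8}"   # pool name from this thread; override as needed

# Succeeds while the status text on stdin reports a running scrub.
scrub_in_progress() {
    grep -q "scrub in progress"
}

# Poll every 5 minutes until the scrub is no longer running,
# then print the final status for review.
wait_for_scrub() {
    while zpool status "$POOL" | scrub_in_progress; do
        sleep 300
    done
    zpool status "$POOL"
}

# Typical sequence:
#   zpool scrub "$POOL"
#   wait_for_scrub
#   zpool clear "$POOL"    # only after reviewing the status output
```

The clear step is left as a manual command on purpose, so the error counts get looked at before they are reset.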
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Just so you know, your redundancy may not save you from corruption due to an unplanned shutdown. Getting a UPS should be very high on your priority list. We've had several users who suffered total data loss on an idle server that lost power while the owner was at work. ;)

I'm not sure what your other system is, but you could set up an SSH script that logs into the FreeNAS server and executes "shutdown -p now", which will shut down your FreeNAS server. Of course, the network will still need to be up to connect to FreeNAS. Don't forget that little piece of info.
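
A minimal sketch of such a script, to be run on the other system from its UPS shutdown hook. The hostname `freenas`, the variable names, and the function name are assumptions for illustration, and it assumes key-based SSH login for root has already been set up on the FreeNAS box.

```shell
#!/bin/sh
NAS_HOST="${NAS_HOST:-freenas}"   # assumption: hostname or IP of the FreeNAS box
NAS_USER="${NAS_USER:-root}"
SSH="${SSH:-ssh}"                 # overridable so the command can be dry-run

# Ask the FreeNAS box to power off.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# ConnectTimeout keeps the script from hanging if the network is already down.
shutdown_nas() {
    "$SSH" -o BatchMode=yes -o ConnectTimeout=10 \
        "$NAS_USER@$NAS_HOST" "shutdown -p now"
}

# Call it before the local shutdown continues:
#   shutdown_nas || echo "could not reach $NAS_HOST" >&2
```

The timeout matters here: if the switch has already lost power, the local machine still needs to finish its own shutdown.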
 

vaibhavyagnik

Dabbler
Joined
Aug 26, 2011
Messages
38
I think the UPS service should be used for this, if the OP's router supports it.

 

darrenbest

Dabbler
Joined
Apr 7, 2012
Messages
33
I'm not sure what your other system is, but you could set up an SSH script that logs into the FreeNAS server and executes "shutdown -p now", which will shut down your FreeNAS server. Of course, the network will still need to be up to connect to FreeNAS. Don't forget that little piece of info.


That's a great idea! The network switch is also plugged into the UPS, so networking would be taken care of. I never thought of having the computer (CentOS) that receives the shutdown signal from the UPS send a shutdown to the other computers before shutting itself down. In retrospect, it's obvious, but a great idea nonetheless. I will set that up ASAP.

Thanks again for your help and expertise.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not sure if PuTTY exists for CentOS, but one of the programs that comes with PuTTY is called plink.exe. It's PERFECT for this kind of thing. You run it something like...

plink -ssh root@freenas -pw <mypassword> "shutdown -p now"

and that's all you have to do!
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I never thought of having the computer (CentOS) that is receiving a shutdown signal from the UPS, to have it send a shutdown to other computers first before shutting down!
If you're using NUT on CentOS, then once it is configured on the CentOS box you can simply enable UPS slave support via the GUI in the FreeNAS 9.x series.
 