SOLVED Cancel replacing?

Makki · Feb 20, 2017

I've been trying to cancel (or finish) a replace for 4 days now :(

We have: see footer
What happened?
1. Replaced /dev/ada2 (WD20) with a WD40 because it showed SMART errors, by shutting down and plugging the new one in (I know now that I should have attached the new disk separately, replaced, and then detached the old one..)
1.1 Replaced /dev/ada3 (WD20, not failed) with a new WD40
-> resilvering went fine so far for both.

2. Replaced /dev/ada1 WD20 (also no errors on this) with a WD40 because I wanted to replace all of them one-by-one to expand storage anyway.
The replacement WD40 failed completely during the resilver, up to being dead on the SATA port.
F***!

3. Then I maybe made a mistake: I replaced the failed WD40 on ada1 with another new WD40 without detaching the failed one first, so I had 3 disks in state "replacing".
This went on for over 2 days, resulting in some kind of "loop": resilvering succeeded, but "zpool status" never left "replacing" again. The pool was ONLINE, but after a reboot the resilver always started over.

4. detached the first WD40 which was prev. on ada1 (offline, failed)

5. again resilvering seemed to go fine but now: ada2 started throwing SMART-Errors: another failure. f***

zpool status -v says (currently running a scrub)

Code:

  pool: zvol-wd20
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Feb 20 18:57:20 2017
        914G scanned out of 5.12T at 249M/s, 4h56m to go
        218G repaired, 17.44% done
config:

        NAME                                                     STATE     READ WRITE CKSUM
        zvol-wd20                                                ONLINE       0     0     6
          raidz1-0                                               ONLINE       0     0    12
            (ada0) gptid/81d8ade0-ac29-11e6-8e79-941882377300    ONLINE       0     0     0
            replacing-1                                          ONLINE       0     0     1
              (ada1) gptid/82d21e89-ac29-11e6-8e79-941882377300  ONLINE       0     0     0  (repairing)
              (ada4) gptid/7474530a-f5d6-11e6-9b8a-941882377300  ONLINE       0     0     0  (repairing)
            (ada2) gptid/072c9083-f3c6-11e6-9dba-941882377300    ONLINE       6     0     0  (repairing)
            (ada3) gptid/93e2855f-f43b-11e6-b05b-941882377300    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x1350>:<0x13936>
        <0x1396>:<0x29da11>
        <0x1396>:<0x29d9f1>
        zvol-wd20/backup@auto-20170124.0300-1m:/v3750/duplicity-full.20161028T010314Z.vol2900.difftar.gpg

-> I don't want to replace ada2 now while ada1/ada4 (two disks without any errors!) are stuck in this "replacing" loop.
Any ideas? Should I detach the "old" ada1 WD20 (still alive, re-attached to a fifth SATA port) or the new one?
Will it help?
-> I don't care about the files (snapshots) that are reported as errors; these are unimportant and could be deleted or restored from backup..

I think replacing ada2 (again) makes little sense until ada1 has been replaced cleanly(?)

Hope that was somewhat explained correctly..

best regards, Michael

Edit: 2 / 4 brand new WD40 Red being nearly DOA is another sad part of the story..

joeschmuck · Feb 20, 2017

The best advice I can give is to let it finish resilvering before you do anything else. Then delete the files the scrub said were bad and replace them with new copies if you have them. Only once resilvering is done and you have deleted those files should you do anything else. If the scrub is taking too long you could cancel it, which would speed up the resilvering process. Of course, afterwards kick off a scrub and ensure things are good.

Makki · Feb 20, 2017

Thanks. The problem is: resilvering finishes (I remove any file/snapshot that shows errors; as I said, no big issue so far..)
But after the next reboot it starts over again, never leaving the "replacing-1" state for this disk.

I've been looping through this for 5 days now..
I have a backup, I just don't want to lose the pool with all the things set up on it like jails, snapshots and rsync, and reconfigure from scratch..
So the main question is how to get out of this "replacing loop" with two healthy disks (and another one, ada2, starting to fail)

Michael

Makki · Feb 20, 2017

To clarify: the current main question is, should I detach the "old" healthy WD20 on ada1, or the new, also healthy WD40 which should replace it, so that the replacement process (which is obviously stuck) can start over?
After approx. 5x resilver and repair they should both be fine.. I will delete the files reported as errors..
I just think that I currently don't have much room for a wrong command/click

Michael

joeschmuck · Feb 20, 2017

You have a very odd problem. Ensure you have a copy of the FreeNAS configuration as I think you might need to rebuild the pool.

Let me just reiterate what I think you have said and tell me if I have this correct.

1) You only have five hard drives in your machine (ada0 through ada4).
2) You do not have another hard drive in the machine.
3) The scrub has completed.
4) After a reboot your pool begins to resilver drive ada1.

Okay, you just responded, here is my comment... Once your ada1 drive has resilvered you need to detach the old drive. Don't reboot before that. Follow the user manual. And track your drives by the serial numbers.

Makki · Feb 20, 2017

joeschmuck said:
You have a very odd problem. Ensure you have a copy of the FreeNAS configuration as I think you might need to rebuild the pool.

Sure.. I have it anyway: daily, and during the last 5 days hourly :)
The FreeNAS boot device sits on another pool anyway..

joeschmuck said:
Let me just reiterate what I think you have said and tell me if I have this correct.
1) You only have five hard drives in your machine (ada0 through ada4).
2) You do not have another hard drive in the machine.

Correct, just the boot-mirror and a (temporary) 4TB USB3 disk for backup..

joeschmuck said:
3) The scrub has completed.

4h to go.. once again..

joeschmuck said:
4) After a reboot your pool begins to resilver drive ada1.

Correct, and also the "replacing-1" and ada2 (which has partially failed according to SMART)

joeschmuck said:
Okay, you just responded, here is my comment... Once your ada1 drive has resilvered you need to detach the old drive. Don't reboot before that. Follow the user manual. And track your drives by the serial numbers.

Will try in about 4hrs :(

Michael

Makki · Feb 20, 2017

The scrub/repair has finished without further errors, but I couldn't detach the original ada1 WD20..
Detaching the replacement disk, however, worked and ended this "replacing-1" state:

Code:

  pool: zvol-wd20
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.22T in 7h32m with 4 errors on Tue Feb 21 02:30:07 2017
config:

        NAME                                              STATE     READ WRITE CKSUM
        zvol-wd20                                         ONLINE       0     0     8
          raidz1-0                                        ONLINE       0     0    16
            gptid/81d8ade0-ac29-11e6-8e79-941882377300    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0     1
              gptid/82d21e89-ac29-11e6-8e79-941882377300  ONLINE       0     0     0
              gptid/7474530a-f5d6-11e6-9b8a-941882377300  ONLINE       0     0     0
            gptid/072c9083-f3c6-11e6-9dba-941882377300    ONLINE       8     0     0
            gptid/93e2855f-f43b-11e6-b05b-941882377300    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x13e0>:<0x29da11>
        <0x13e0>:<0x13936>
        <0x13e0>:<0x29d9f1>

[root@freenas] ~# zpool detach zvol-wd20 gptid/82d21e89-ac29-11e6-8e79-941882377300
cannot detach gptid/82d21e89-ac29-11e6-8e79-941882377300: no valid replicas

  pool: zvol-wd20
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb 21 05:58:32 2017
        80.4G scanned out of 5.08T at 241M/s, 6h3m to go
        18.8G resilvered, 1.54% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zvol-wd20                                       ONLINE       0     0     8
          raidz1-0                                      ONLINE       0     0    16
            gptid/81d8ade0-ac29-11e6-8e79-941882377300  ONLINE       0     0     0
            gptid/82d21e89-ac29-11e6-8e79-941882377300  ONLINE       0     0     0  (resilvering)
            gptid/072c9083-f3c6-11e6-9dba-941882377300  ONLINE       8     0     0
            gptid/93e2855f-f43b-11e6-b05b-941882377300  ONLINE       0     0     0  (resilvering)

errors: Permanent errors have been detected in the following files:

        <0x13e0>:<0x29da11>
        <0x13e0>:<0x13936>
        <0x13e0>:<0x29d9f1>
[root@freenas] ~#

I'll wait now for another resilver to complete first and then replace the failing ada2..
6h to go.
I don't understand why it now resilvers ada4 (was ada1, to be replaced) and the working ada3. I think I hit some strange bug with two disks failing consecutively (though not within the same resilver)..
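
For anyone landing here with the same "replacing-N" loop: a minimal Python sketch that reads `zpool status` text, finds the members of each `replacing-N` group, and prints the matching `zpool detach` candidates. The function names and the parsing by indentation are assumptions made for illustration, not anything FreeNAS ships. It only prints commands; which member to detach is still your decision (in this thread, detaching the old disk failed with "no valid replicas" and detaching the replacement disk worked).

```python
import re

# Config section copied from the status output in this thread.
SAMPLE = """\
config:

        NAME                                              STATE     READ WRITE CKSUM
        zvol-wd20                                         ONLINE       0     0     8
          raidz1-0                                        ONLINE       0     0    16
            gptid/81d8ade0-ac29-11e6-8e79-941882377300    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0     1
              gptid/82d21e89-ac29-11e6-8e79-941882377300  ONLINE       0     0     0
              gptid/7474530a-f5d6-11e6-9b8a-941882377300  ONLINE       0     0     0
            gptid/072c9083-f3c6-11e6-9dba-941882377300    ONLINE       8     0     0
            gptid/93e2855f-f43b-11e6-b05b-941882377300    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
"""


def replacing_children(status_text):
    """Map each replacing-N vdev to the device names indented below it."""
    groups = {}
    current = None          # replacing-N vdev we are currently inside, if any
    current_indent = -1
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            break
        if not in_config or not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        name = stripped.split()[0]
        if re.fullmatch(r"replacing-\d+", name):
            current, current_indent = name, indent
            groups[current] = []
        elif current is not None and indent > current_indent:
            groups[current].append(name)
        else:
            current = None  # back at the parent level, the group has ended
    return groups


def detach_commands(pool, status_text):
    """Build (but do not run) one `zpool detach` candidate per group member."""
    return [
        f"zpool detach {pool} {dev}"
        for devs in replacing_children(status_text).values()
        for dev in devs
    ]


if __name__ == "__main__":
    for cmd in detach_commands("zvol-wd20", SAMPLE):
        print(cmd)
```

On a live system the text would come from `zpool status <pool>` instead of the pasted sample. Match gptids to physical disks by serial number before running any of the printed commands.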

Let's see..

Michael

Stux · Feb 20, 2017

I think you removed the wrong disk originally. Did you match serial numbers?

Makki · Feb 21, 2017

Pretty sure I didn't.. the old disks were also different models..
I think the root cause was that the already replaced ada2 WD40 started to fail while I was still fiddling around with the total failure of the new ada1 replacement, so I didn't notice it immediately.

But it seems to get better now.. Resilver still running, 3h to go then replace the failing ada2

Code:

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       193
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
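
A small Python sketch for reading attribute rows like the two above, assuming the usual ten-column `smartctl -A` layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). The function names and the watch list are illustrative assumptions; which attributes you treat as warning signs is a judgment call.

```python
# The two attribute rows posted in this thread.
SAMPLE_SMART = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       193
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
"""

# Assumed watch list: attributes whose raw value is commonly expected to stay at 0.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows that look like attribute lines."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def suspicious(attrs, watch=WATCH):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: raw for name, raw in attrs.items() if name in watch and raw > 0}


if __name__ == "__main__":
    print(suspicious(parse_smart_attributes(SAMPLE_SMART)))
```

Raw_Read_Error_Rate is left off the watch list on purpose, since its raw value is vendor-specific.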

IT is a bitch, things always failing together and I'd never make WD Red great again though I'll go with them for now..
And yes, I know, it's pretty clear that if they fail, they fail under heavy load like replace/resilver/scrub..

Hoping the best, Michael

Makki · Feb 22, 2017

Marked this as solved now. Replacing the last WD20 disk is currently running fine, and the other two went through without further errors.
What I'll remember: always run smartctl -t long /dev/xxx before replacing a disk, and regarding the subject: if a replace goes wrong for whatever reason, detach the replacement disk(s) and start over from scratch..

I'd still be interested in other users' experience with the WD40EFRX (exactly: WD40EFRX-68WT0N0 82.00A82): did I just have very bad luck with 2/4 failing/DOA, or is this what to expect from new, modern disk drives?

Michael

Stux · Feb 22, 2017

A good reason to burn in replacements I guess

Makki · Feb 22, 2017

Stux said:
A good reason to burn in replacements I guess

Oh yes, because this caused me all this trouble for over a week now..

joeschmuck · Feb 23, 2017

Glad it's all working again.

Makki · Feb 24, 2017

joeschmuck said:
Glad it's all working again.

Me too :)

I see it in a positive way:
I learned a lot, and FreeNAS/ZFS has proven to be the right choice, even under nasty conditions (honestly, I wouldn't want to see an "enterprise" HW-RAID5 where a replaced disk fails during rebuild while another disk starts to throw errors: the RAID would surely have been completely lost!)
In the end I lost 4 unimportant (backup) files of a few MB out of 3.8TB, that's a pretty good ratio!
And what I liked most: zfs/zpool tells me exactly which files are affected, so an easy & confident small restore is possible.. (though I just deleted them, as they were really old backups of backups..)

What I'm disappointed about is not FreeNAS but the obvious lack of quality of these factory-new WD40EFRX; IMHO it is unacceptable to get 50% fail/DOA from Amazon!

Michael

IceBoosteR · Feb 24, 2017

Hi, glad you got your pool back to work! :)

I also have those
WDC WD40EFRX-68WT0N0
drives. I have had them running for 600 hours now, with no problems reported via smartctl, and hopefully there will be none.
I have heard of drive failures clustered in the same production batch, where sometimes more and sometimes fewer drives fail. It seems you were unlucky this time.

Ice

joeschmuck · Feb 24, 2017

My experience and the postings on this forum actually back up that the WD Red lineup is overall a very good drive. Unfortunately there will be infant mortality and you were a bit unlucky. I would only say that you should leave the drives running 24/7 and you are likely to get a nice long performance out of them. I have 37460 hours on one of my drives, the others are immediately behind that. That puts me at ~4.27 years. I'll bet I make it to 5 years for each of the drives.
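
The hours-to-years figure above can be checked with a few lines of Python. This is only a worked-arithmetic sketch: the 365.25-day year and the function names are assumptions, and the 37460 hours is the power-on count quoted in the post.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumed average year length, leap years included


def hours_to_years(power_on_hours):
    """Convert SMART power-on hours to years of continuous running."""
    return power_on_hours / HOURS_PER_YEAR


def hours_until(power_on_hours, target_years):
    """Hours of 24/7 operation still needed to reach target_years."""
    return target_years * HOURS_PER_YEAR - power_on_hours


if __name__ == "__main__":
    print(round(hours_to_years(37460), 2))   # years so far
    print(hours_until(37460, 5) / 24)        # days left until the 5-year mark
```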

Stux · Feb 24, 2017

I've at least made it to the warranty expiry on a good dozen reds. That's all of them.

Just had the last of my dozen Seagate 1.5TB drives die. Yes, they were getting old, but a 150% failure rate is impressively bad. (The only reason it wasn't higher is that I stopped bothering to RMA the drives.)

joeschmuck · Feb 24, 2017

I remember when my Seagate 5MB hard drives worked really well. They were super fast compared to a 5.25" floppy disk. They did eventually fall out of head alignment so I manually did the realignment myself and they lasted a little bit longer, I think a little over a year. Those were the days before voice coils and we used stepper motors.

Makki · Feb 28, 2017

There is an IT gag in German that has been around for 20 years: "Sie geht - oder sie geht nicht" (roughly "it works, or it doesn't work"; "Sie geht" sounds like "Seagate") :p

For now the Reds look fine; I stressed them with another scrub.. @joeschmuck: Maybe I really just had a bad batch.. They will be running 24/7 without powering down anyway..

Michael
