Errors while resilvering - best course?

riz94107 · Jun 17, 2018

I should begin by saying that yes, most of this data is backed up, but BOY HOWDY would it be a pain to restore. (the usual story, I bet)

I have a FreeNAS 9.10.2-U1 box which has/had a raidz1 with 4x3T drives. One of the drives started throwing errors, so I decided now would be a good time to upgrade to 4T drives (already did this once, 2T to 3T, a couple years ago).

I swapped in the first drive for the bad one, and got everything resilvered. What I *didn't* do was a full scrub. :( Then, I swapped in the second drive, and after a couple hours, THE NEW (FIRST) DRIVE STARTED THROWING ERRORS. Things appear to sort of, kind of be resilvering still:

Code:

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun 17 09:07:49 2018
		1.33T scanned out of 10.2T at 343M/s, 7h33m to go
		340G resilvered, 12.98% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											DEGRADED	 0	 0   722
	  raidz1-0									  DEGRADED	 0	 0 1.54K
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0  (resilvering)
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  DEGRADED   720	 0	 0  too many errors

errors: 152 data errors, use '-v' for a list

The "ONLINE" drive is the second one I swapped in, and the new drive with errors is gptid/35XXXXX .
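
For anyone who wants to pull the per-device counters out of output like the above instead of eyeballing it, here is a minimal Python sketch. It is not part of FreeNAS or ZFS: the names `parse_count` and `devices_with_errors` are made up for illustration, it assumes the NAME/STATE/READ/WRITE/CKSUM column layout shown in the status output, and it assumes a `K` suffix means 1024. The displayed values are rounded, so the parsed counts are approximate.

```python
import re

# Assumption: zpool abbreviates large counters with binary multipliers.
_SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_count(text):
    """Convert a counter such as '720' or '1.54K' to an (approximate) integer."""
    text = text.strip()
    if text and text[-1] in _SUFFIX:
        return int(round(float(text[:-1]) * _SUFFIX[text[-1]]))
    return int(text)


# One row of the config table: name, STATE, then READ / WRITE / CKSUM counters.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>[\d.]+[KMG]?)\s+(?P<write>[\d.]+[KMG]?)\s+(?P<cksum>[\d.]+[KMG]?)"
    r"(?P<note>.*)$"
)


def devices_with_errors(status_text):
    """Return (name, state, counts) for every row with a nonzero counter."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, scan lines, and other text are skipped
        counts = {k: parse_count(m.group(k)) for k in ("read", "write", "cksum")}
        if any(counts.values()):
            rows.append((m.group("name"), m.group("state"), counts))
    return rows


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): zpool status tank > status.txt
    #                                 python this_script.py < status.txt
    for name, state, counts in devices_with_errors(sys.stdin.read()):
        print(name, state, counts)
```

Run against the first status output, a parser like this would single out the pool, the raidz1-0 vdev, and the gptid/35b07078 disk, which is the only leaf device with nonzero READ errors.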

What's my best course of action here? I'm curious whether the resilver will ever (can ever?) finish, since there are effectively two failures in the RAIDZ1. The file system itself appears up and available - I've been able to use it while this is going on, though I haven't needed any of the files with errors. I would be perfectly OK with losing the files which are currently showing data errors (or, really, any chunk of the data up to a complete rebuild - and even a complete rebuild is not the end of the world, just REALLY ANNOYING).

My preference is for the outcome which requires the least interactive effort on my part - I'm happy to let this resilver run for another couple days if it might actually succeed (for some value of "succeed"), if it means less actual work on my part (as opposed to on the part of the computers).

Also- yes, I realize that raidz2 or raidz3 would have prevented this, and I did understand that this could happen - I just didn't know exactly what it would look like. I knew the risks! :)

Thanks in advance -
+j

kdragon75 · Jun 17, 2018

Did you do any kind of burn in on the new drive?

bollar · Jun 17, 2018

At this point you don't have any good options aside from wait for the resilver to complete. Let's see the status once it's done.

Jailer · Jun 17, 2018

Hardware list please per the forum rules.

riz94107 · Jun 29, 2018

So the "resilvering" has been going on for approaching two weeks now - seems pretty clear the increasing error count on the first disk is going to prevent it from ever finishing. At this point, I'm interested in learning whether there's anything I can do to avoid having to completely reconstruct the pool. (I suspect not, but thought I'd check).

Code:

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 29 03:14:33 2018
		1.50T scanned out of 10.0T at 123M/s, 20h4m to go
		383G resilvered, 14.96% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											DEGRADED	 0	 0 4.82K
	  raidz1-0									  DEGRADED	 0	 0 9.77K
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0  (resilvering)
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  DEGRADED 4.80K	 0	 0  too many errors  (resilvering)

errors: 304 data errors, use '-v' for a list
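
One detail in the `scan:` line is worth noting: the "in progress since" timestamp is now Jun 29 rather than Jun 17, so the resilver has started over at least once rather than running continuously. A rough Python sketch for comparing two saved status outputs follows. The function names `parse_scan` and `resilver_restarted` are hypothetical, sizes are assumed to use binary units, and the timestamp is assumed to be in the English format shown above.

```python
import re
from datetime import datetime

# Assumption: sizes and rates in the scan line use binary multipliers.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

SINCE = re.compile(r"resilver in progress since (.+)$", re.MULTILINE)
PROGRESS = re.compile(
    r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s"
)


def to_bytes(text):
    """Convert a size such as '1.50T' or '123M' to bytes."""
    return float(text[:-1]) * _UNITS[text[-1]]


def parse_scan(status_text):
    """Extract start time, fraction scanned, and a naive ETA from a status dump."""
    since = SINCE.search(status_text)
    prog = PROGRESS.search(status_text)
    if not since or not prog:
        return None  # no resilver in progress in this output
    started = datetime.strptime(since.group(1).strip(), "%a %b %d %H:%M:%S %Y")
    scanned, total, rate = (to_bytes(g) for g in prog.groups())
    return {
        "started": started,
        "fraction": scanned / total,
        # Remaining bytes at the current rate; meaningless if the scan restarts.
        "eta_hours": (total - scanned) / rate / 3600,
    }


def resilver_restarted(earlier, later):
    """True if the later snapshot reports a newer resilver start time."""
    return later["started"] > earlier["started"]
```

The ETA here is just remaining bytes divided by the current rate, which is why the "to go" figure says little when the scan keeps starting over.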

My hardware:

HP DL160 G6 (some Tyan motherboard, I believe. I don't have the model number handy)
2x Xeon L5639 (six-core) 2.13GHz procs
72GB RAM
Currently attempting to swap 3T Seagate drives with 4T WD RE drives, one of which has errors.

riz94107 · Jul 15, 2018

An update for any who might be interested (probably nobody, but perhaps this will help some future searcher?), here's what's been going on.

I left the resilver running, because the data was mostly still available (I was getting a few errors on a particular zfs, but everything else seemed fine, so I was making sure my backups were as complete as I could make them, etc, etc.) and it was easier than the complete rebuild I was sure I would have to do if I rebooted.

Well, after about two weeks of this, the machine crashed. (I got about 180K of log messages out of the crash, but it's not clear to me exactly what happened, because I didn't get everything). Unlike what I was expecting, however, the pool was imported successfully, and the resilvering continued:

Code:

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 12 21:23:18 2018
		5.56T scanned out of 9.45T at 116M/s, 9h47m to go
		1.39T resilvered, 58.88% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	 0
	  raidz1-0									  ONLINE	   0	 0	 0
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0  (resilvering)
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0

errors: 402 data errors, use '-v' for a list

Eventually, it even appeared to complete!

I didn't snapshot the status after it completed, so I can't show it here; it showed a couple hundred errors, but everything was ONLINE. So, not wanting to repeat the mistakes of the past, I decided to run a scrub, so I could repair or replace the first drive (`gptid/35b07078-6f6d-11e8-b75f-1cc1de023244`), which seems to be the source of my problems. The scrub ran for a while, showing that it did indeed see problems:

Code:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Jul 14 00:08:49 2018
		3.62T scanned out of 9.45T at 88.6M/s, 19h11m to go
		1.14M repaired, 38.27% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0   144
	  raidz1-0									  ONLINE	   0	 0   288
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  ONLINE	 144	 0	 0  (repairing)

errors: 345 data errors, use '-v' for a list

All seemed as it should be. A few hours later, though, and it seems to be resilvering again! (I didn't change anything manually).

Code:

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 14 12:28:35 2018
		8.07T scanned out of 9.46T at 65.9M/s, 6h9m to go
		2.02T resilvered, 85.29% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	 0
	  raidz1-0									  ONLINE	   0	 0	 0
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0  (resilvering)
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0

errors: 288 data errors, use '-v' for a list

At this point, I intend to let this run for a while - maybe the resilver will complete and so will a scrub? Who knows. I hope to get to a point where I can replace that disk without losing everything and having to rebuild.

riz94107 · Jul 15, 2018

Looks like the resilver completed again. I'm going to run another scrub, but here's what it looks like in the meantime:

Code:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 2.32T in 26h8m with 201 errors on Sun Jul 15 14:36:38 2018
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	57
	  raidz1-0									  ONLINE	   0	 0   114
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  ONLINE	   0	 0	 0
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  ONLINE	  56	 0	 0

errors: 201 data errors, use '-v' for a list

riz94107 · Jul 16, 2018

Code:

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jul 15 21:02:32 2018
		3.75T scanned out of 9.46T at 89.7M/s, 18h34m to go
		959G resilvered, 39.59% done
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											DEGRADED	 0	 0   258
	  raidz1-0									  DEGRADED	 0	 0   516
		gptid/2bfb3df4-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/2d126fc9-c3b3-11e5-ab04-1cc1de023244  DEGRADED	 0	 0	 0  too many errors
		gptid/a0eeff41-703d-11e8-b75f-1cc1de023244  ONLINE	   0	 0	 0  (resilvering)
		gptid/35b07078-6f6d-11e8-b75f-1cc1de023244  DEGRADED   256	 0	 0  too many errors  (resilvering)

errors: 201 data errors, use '-v' for a list

...and now we're back to "resilvering" drive 3 - though this time there's a resilvering message on drive 4, too. What's my best course of action, here? If resilvering completes this time, should I replace drive 4? Am I just completely out of luck and need to rebuild the entire pool?

If I do have to rebuild the pool, what's the best way to go about it?

Thanks,
+j

kdragon75 · Jul 16, 2018

kdragon75 said:
Did you do any kind of burn in on the new drive?

Did you do any testing on your new drives?

riz94107 · Jul 16, 2018

kdragon75 said:
Did you do any testing on your new drives?

Nope.

kdragon75 · Jul 16, 2018

That's where you went wrong. You NEED to test/burn-in new drives.

riz94107 · Jul 16, 2018

kdragon75 said:
That's where you went wrong. You NEED to test/burn-in new drives.

Thanks for your very helpful reply. Boy, was that helpful, and in no way might distract from anyone posting more-relevant information.

kdragon75 · Jul 16, 2018

riz94107 said:
Thanks for your very helpful reply. Boy, was that helpful, and in no way might distract from anyone posting more-relevant information.

It was not addressed anywhere else in this thread, and as such (especially given your answer) I thought it necessary to emphasize the importance of this process. I hope this has persuaded you to test the rest of your new drives.

riz94107 said:
If I do have to rebuild the pool, what's the best way to go about it?

Simply detach the pool in question under Storage: click the pool name, then click the icon with the red X at the bottom of the screen. Then, under View Disks, select each disk and click Wipe (again at the bottom). Once all of your disks are wiped, follow the burn-in guide, recreate your pool, and import your data from backups.

Sir.Robin · Jul 17, 2018

This is a classic raid5/raidz1 trap. You replace a faulty drive (which leaves you vulnerable while rebuilding) and then... BOOM! Another fails before the rebuild is complete.
Once everything is online and healthy, I would replace the 35 drive again.
Although I would also recreate/rebuild the array and go raidz2. Transfer what you can, and restore the rest.

Edit:
This reminds me of a time I messed up. I was replacing drives one by one for increased capacity. After offlining a disk... I disconnected the wrong drive.
Still... shutdown, reconnect, boot, and voila! The resilver went smoothly with no data loss. :) This was on one of my 6-drive raidz2 pools.
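
To put rough numbers on the raidz1 versus raidz2 trade-off for the four 4T drives in this thread, here is a small Python sketch. The function `raidz_summary` is hypothetical, and the capacity figure is a back-of-the-envelope estimate that ignores padding, metadata overhead, and the TB/TiB difference.

```python
def raidz_summary(disks, disk_tb, parity):
    """Estimate usable capacity and fault tolerance for a single raidz vdev.

    parity is 1, 2, or 3 for raidz1, raidz2, or raidz3. Usable space is
    approximated as the capacity of the non-parity disks.
    """
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2, or 3")
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return {
        "usable_tb": (disks - parity) * disk_tb,
        "tolerated_failures": parity,
    }


if __name__ == "__main__":
    for level in (1, 2):
        print(f"raidz{level}:", raidz_summary(disks=4, disk_tb=4, parity=level))
```

With four drives, going from raidz1 to raidz2 trades roughly one drive's worth of space for the ability to survive a second failure during a resilver, which is the situation described at the top of this thread.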
