Drives offline, replaced w/ data errors, and hung resilvering. Could use some help.


Jahava

Cadet
Joined
Feb 9, 2013
Messages
5
My ZFS pool is in a bad place, and I'm out of ideas on how to fix it. Short summary: I'm running RAID-Z2 across 6x3TB drives, encrypted through FreeNAS's geli layer. After moving, I unboxed my server and powered it on - all was well, and I went to my folks' for the holidays. Remotely, I noticed my pool reporting a degraded state and saw that a disk had gone offline. That was fine; I ordered a replacement on Amazon and continued in a degraded state. Then a second disk went offline, so I remotely shut down the server.

This motivated me to replace my server hardware, just in case it was actually a failing controller rather than the disks themselves. With the new motherboard in place, the disks still aren't great: the first to go offline (ada1) is DOA, and the second (ada4) shows a handful of bad sectors in SMART scans but seems otherwise functional.
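
For reference, this is roughly how I've been checking the suspect drive with smartmontools (the device name below is just my case; adjust as needed):

Code:
# Quick look at SMART health, error counters (Reallocated/Pending sectors), and past self-tests
smartctl -a /dev/ada4

# Kick off a long self-test; results show up in the output of the command above once it finishes
smartctl -t long /dev/ada4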

FreeNAS will not unlock the drives through the UI (it hangs indefinitely), so I manually iterate through each drive and run geli attach -k /data/geli/... /dev/gptid/drive on all five. They all attach, and the zpool can be imported, albeit missing ada1. I then use the FreeNAS UI to provision my replacement drive and replace ada1.
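
In case it helps anyone in a similar spot, the manual unlock looked roughly like this (the key file name and GPTIDs below are placeholders for my real ones):

Code:
# Attach each encrypted provider with the pool's geli key
KEY=/data/geli/pool_key.key   # placeholder; use your actual key file
for g in <gptid-ada0> <gptid-ada2> <gptid-ada3> <gptid-ada4> <gptid-ada5>; do
    geli attach -k "$KEY" /dev/gptid/"$g"
done

# Then import the pool under /mnt, where FreeNAS expects it
zpool import -R /mnt poolname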

Now, I've replaced a drive or two in the past, and what normally happens is that the new drive gets resilvered over time. However, presumably because of the staggered way my drives failed, I'm getting this instead (with the resolved drive names written beside each GPTID):

Code:
Geom name: gptid/7537576d-6d37-11e3-9def-0019db684008.eli - ada0
Geom name: gptid/770f6fe4-6d37-11e3-9def-0019db684008.eli - ada5
Geom name: gptid/78013b99-6d37-11e3-9def-0019db684008.eli - ada1
Geom name: gptid/7884d394-6d37-11e3-9def-0019db684008.eli - ada2
Geom name: gptid/78f3188b-6d37-11e3-9def-0019db684008.eli - ada3
/dev/gptid/61e5d38d-7fd3-11e4-86e0-0019db684008 - ada4
87023237-f8d4-11e7-b423-309c2342a812 - new ada1

  pool: poolname
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jan 19 19:03:46 2018
        947G scanned at 1.31G/s, 64.3G issued at 91.3M/s, 9.71T total
        10.7G resilvered, 0.65% done, 1 days 06:46:28 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        poolname                                              DEGRADED     0     0   404
          raidz2-0                                            DEGRADED     0     0 1.58K
            gptid/7537576d-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors   <-- ada0
            gptid/61e5d38d-7fd3-11e4-86e0-0019db684008.eli    ONLINE       0     0     0                    <-- ada4
            gptid/770f6fe4-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors   <-- ada5
            replacing-3                                       DEGRADED     0     0     0
              1216402105560805077                             UNAVAIL      0     0     0  was /dev/gptid/78013b99-6d37-11e3-9def-0019db684008.eli
              gptid/87023237-f8d4-11e7-b423-309c2342a812.eli  ONLINE       0     0     0  (resilvering)     <-- ada1
            gptid/7884d394-6d37-11e3-9def-0019db684008.eli    ONLINE       0     0     0                    <-- ada2
            gptid/78f3188b-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors   <-- ada3

errors: 7434 data errors, use '-v' for a list


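For what it's worth, I resolved the GPTID-to-device mapping with something along these lines (the grep is just an example; on my box the providers show up as ada0p2, ada1p2, and so on):

Code:
# glabel maps each gptid label to the partition that backs it,
# e.g. "gptid/7537576d-...  N/A  ada0p2"
glabel status | grep gptid
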
Data errors, drives running in a "degraded" state with too many errors, and the drive I've replaced just hanging there. That sucks, so I run `zpool clear`, and the command hangs uninterruptibly. Reboot, repeat; over the course of a few days and a few hangs I slowly make progress, and eventually a `zpool clear` completes without hanging. At the moment the drive seems to be resilvering, a first in the ~5 times I've rebooted and manually unlocked the disks, so maybe some forum magic is already rubbing off on me!
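
While it runs, I'm keeping an eye on the progress with a little loop like this:

Code:
# Print the resilver progress lines every minute
while true; do
    zpool status poolname | grep -E 'scan:|resilvered'
    sleep 60
done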

Anyway, to date I've hit a few things that scare me:
  • "pool I/O is currently suspended"
  • "too many errors"
  • Hung resilvering (stuck at 0% for days)
  • Hung `zpool clear`
My understanding of the current situation is:
  • The data should still be recoverable, since the fewest drives I've ever run with is 4 out of 6 and I'm on RAID-Z2.
  • I probably damaged some data during a shutdown or recovery attempt, or when the fifth drive went offline and was later brought back online.
  • ZFS knows that I want to replace my old ada1 with the new disk, and that replacement is currently resilvering.
  • I have a lot of data errors; I'll deal with them later (rough plan sketched below).
  • "too many errors" is because the checksums don't match.
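
The plan for the data errors, once the resilver finishes, is roughly this (paths will obviously vary):

Code:
# List the files/objects with permanent errors
zpool status -v poolname

# After restoring or deleting the affected files from backup,
# clear the error counters and scrub to re-verify everything
zpool clear poolname
zpool scrub poolname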

Can anyone weigh in and let me know whether I'm on the right track, or whether there's something better I could be doing here? I actually started writing this with the resilver hung at 0%, so I'm a little optimistic that maybe I've kicked it in the right direction this round, but I'd still appreciate any wisdom surrounding the issue, or things I should or should not do.

I'll update as things progress! Thanks in advance.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Make sure your SATA cables are attached correctly. But it looks like this pool is toast, with errors on all drives, especially with top-level checksum errors.
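
One quick way to check for bad cabling is the UDMA CRC counter in SMART, e.g. something like:

Code:
# A rising UDMA_CRC_Error_Count usually points at a cable/connector problem
smartctl -a /dev/ada0 | grep -i udma_crc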
 

Jahava

Cadet
Joined
Feb 9, 2013
Messages
5
Yeah, they're connected :( *Toast* toast or just "some-files-are-damaged" toast? At this point I'd settle for some data recovery.

Any idea where things went wrong? My admittedly untested impression of RAID-Z2 was that two drives could die without losing data, and that's (at most) what happened. Should I not have brought the second SMART-erroring drive back online? Was the staggered nature of their decline problematic?

My understanding of top-level checksum errors is that they happen when the parity across the drives doesn't match the data. If the data is striped across the entire pool, a few thousand errors seem like a drop in the bucket for a ~9T pool, which would suggest that most of the data is probably intact. Is this a sound understanding / line of reasoning?
 