Scrub removed both drives from mirrored freenas-boot

Status
Not open for further replies.

cmh

Explorer
Joined
Jan 7, 2013
Messages
75
I've been running 11 since RC1 on my backup NAS (a replication target), and it has been working well. I upgraded to 11 release when it became available. Yesterday morning an automatic scrub of freenas-boot started; soon after, I got errors emailed to me, and it pulled _both_ drives from the zpool, which is a problem when there are only two to begin with.

The system wouldn't boot, so I had to plug in a monitor. It started to boot 11 release, but after the GRUB screen it gave "error : compression algorithm inherit not supported" followed by "error: you need to load the kernel first" a couple of times.

I figured I was sunk, but I tried booting 11RC4 and it came up golden. I scrubbed freenas-boot (cuz I like to live dangerously) and it came back clean.

Any ideas what happened here? Suggestions for how to proceed? Delete the 11 release snapshot and then re-upgrade?
 
dlavigne

Guest
Sounds like the main boot device died. You could try booting from the other one. If that fails, do a fresh install of 11-release to a new boot device and upload your config.
 

cmh

dlavigne said:
Sounds like the main boot device died. You could try booting from the other one. If that fails, do a fresh install of 11-release to a new boot device and upload your config.

That's the fun part. Neither one died. In the scrub output neither drive has any errors:

Code:
CRITICAL:   pool: freenas-boot
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub in progress since Fri Jun 16 03:47:00 2017
        13.5M scanned out of 4.23G at 25.5K/s, 48h9m to go
        0 repaired, 0.31% done
config:

        NAME                     STATE     READ WRITE CKSUM
        freenas-boot             UNAVAIL     18    10     0
          mirror-0               UNAVAIL     89    23    34
            5625675823696067971  REMOVED      0     0     0  was /dev/ada0p2
            5448661123923233136  REMOVED      0     0     0  was /dev/ada1p2

errors: 32 data errors, use '-v' for a list

Luckily I check pool status in my monitoring system, which caught this before the entire system went derpy/unresponsive. After that it would answer pings and that was it.
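
For anyone who wants a similar check, here is a rough sketch in Python of how the status output could be scanned. It is not what my monitoring actually runs: the names `pool_problems` and `read_pool_status` are made up for illustration, and it assumes the `config:` section looks like the output above.

```python
import subprocess

# States treated as healthy. Hot spares report AVAIL and would be flagged
# by this simple sketch, so extend the set if the pool has spares.
HEALTHY_STATES = {"ONLINE"}


def read_pool_status(pool):
    """Return the text of `zpool status <pool>` (needs the zpool binary)."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=False
    )
    return result.stdout


def pool_problems(status_text):
    """Return (name, state) for every config entry that is not healthy."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            in_config = False
            continue
        if not in_config or not stripped:
            continue
        fields = stripped.split()
        # Skip the column header and one-word headings such as "logs".
        if fields[0] == "NAME" or len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if state not in HEALTHY_STATES:
            problems.append((name, state))
    return problems


# The output from the alert above, used as sample input.
SAMPLE_STATUS = """\
CRITICAL:   pool: freenas-boot
 state: UNAVAIL
config:

        NAME                     STATE     READ WRITE CKSUM
        freenas-boot             UNAVAIL     18    10     0
          mirror-0               UNAVAIL     89    23    34
            5625675823696067971  REMOVED      0     0     0  was /dev/ada0p2
            5448661123923233136  REMOVED      0     0     0  was /dev/ada1p2

errors: 32 data errors, use '-v' for a list
"""

for name, state in pool_problems(SAMPLE_STATUS):
    print(name, state)
```

On a live box the same function could be fed `read_pool_status("freenas-boot")` instead of the sample text, with an alert raised whenever the returned list is not empty.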

This morning's scrub back on 11RC4 came back 100% clean on both drives as well.
 

cmh

So it turns out it was a boot device issue after all. It crashed on 11RC4 again, and this time I got to see the console errors. I rebooted and checked the drives, and one was loaded up with SMART errors.

It's kinda bothersome that a single drive failure in a mirror causes the whole OS to drop out, but at least I had something to reboot with.
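
In hindsight, watching SMART attributes on the boot drives would probably have flagged this sooner than the scrub did. Below is a minimal sketch in Python, assuming the attribute table printed by `smartctl -A` for ATA drives. The function name `suspicious_attributes` and the choice of watched attributes are my own, and the sample table is invented for illustration; it is not the output from my failed drive.

```python
# Attributes whose non-zero raw value usually points at failing media.
# This short list is an assumption; other attributes may matter for other drives.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def suspicious_attributes(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found


# Invented sample in the layout of `smartctl -A /dev/ada0`.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21450
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(suspicious_attributes(SAMPLE_SMART))
```

Checking only the raw values keeps the sketch short; a real check would likely also look at the overall health result and the drive's error log.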
 