SOLVED Ran zpool replace, started a resilver, now my system has Panicked (so am I!)


Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
2017-12-31 noon UPDATE: Solved! When rolling back an OS to an earlier release (in my case to 11.0-U4 from 11.1) you CANNOT restore from the later release's backup; @Ericloewe pointed this out, and after doing a factory reset (on the 11.0-U4 install) I recovered from my 9.3 backup, let it do its reboot jiggery-pokery, and my pool came back online, happy and healthy, even if I'm still shaking a little :) Read through the whole mission below, if you have the stomach for it :)

2017-12-31 morning UPDATE: the panic issue was resolved by rolling back from 11.1 to 11.0-U4 (apparently a known issue), but things are not yet rosy: the pool imports but is not mounting correctly or being seen properly by the UI. Scroll down for all the gory details; I'll update here again when that's been solved.


Really hoping that @joeschmuck can jump in here, as he has some context in helping me earlier at
https://forums.freenas.org/index.ph...ot-volume-new-device-not-showing-gptid.59880/

I successfully upgraded to FreeNAS 11, and all was good. So I shut the system down, added the first of my new 8TB drives, fired it up, and left the system running for a few days to burn in the new drive and let it run through some smartctl tests etc.
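
For reference, the burn-in was along these lines; a sketch rather than my exact session (da7 is the new drive):
Code:
sudo smartctl -t short /dev/da7   # quick sanity test first
sudo smartctl -t long /dev/da7    # full surface read; roughly 18 hours on an 8TB drive
sudo smartctl -a /dev/da7         # afterwards: review SMART attributes and the self-test log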

I then started the first of the
Code:
zpool replace

commands to start the resilvering. I was remote over ssh at the time, but within seconds of starting the zpool replace I completely lost access to my machine. I only got home some days later, and was greeted with this message :(
Code:
panic: solaris assert: spa->spa_scrub_inflight > 0

full screenshot below
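
For context, the replace command takes the general form below; the gptids here are placeholders, not the exact ones I used:
Code:
# replace the old pool member (by gptid) with the new disk's partition
sudo zpool replace tank01 gptid/<old-member> gptid/<new-partition>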

I followed the steps at
https://forums.freenas.org/index.php?threads/kernel-panic-importing-a-pool.7095/
but even after booting into single-user mode and running a
Code:
zpool import -f -R /mnt tank01

it panics again.

I'm obviously pretty desperate, as I have >15TB of, well, everything just sitting there inaccessible (but hopefully still intact), so pretty much any help is appreciated.

I did a reboot and grabbed some more screenshots as it comes up; you can see some references to a resilver starting, but right after that it panics again. The 2nd screenshot below is from just before the panic.

As for the inevitable hardware specs question (note, all was working fine; all I've added is 1 new drive):

  • MoBo: ASRock C216 WS ATX Server Motherboard LGA 1155
  • CPU: Intel Xeon E3-1220L V2 @ 2.30GHz (derated to 2 cores / 4 threads for heat/power savings)
  • RAM: 32GB ECC (Crucial 240-pin DDR3 SDRAM ECC Unbuffered DDR3L 1600 (PC3L 12800) Server Memory)
  • SAS Controller: LSI SAS9211-8i
Drives:
Code:
[kuyper@knox ~]$ sudo camcontrol devlist
<ATA HGST HMS5C4040AL A580>		at scbus0 target 2 lun 0 (pass0,da0)
<ATA HGST HMS5C4040AL A580>		at scbus0 target 3 lun 0 (pass1,da1)
<ATA HGST HMS5C4040AL A580>		at scbus0 target 4 lun 0 (pass2,da2)
<ATA HGST HMS5C4040AL A580>		at scbus0 target 5 lun 0 (pass3,da3)
<ATA HGST HMS5C4040AL A580>		at scbus0 target 6 lun 0 (pass4,da4)
<STEC ZeusIOPs G3 E12B>			at scbus0 target 8 lun 0 (pass5,da5)
<STEC ZeusIOPs G3 E12B>			at scbus0 target 9 lun 0 (pass6,da6)
<ATA HGST HUH728080AL T7JF>		at scbus0 target 15 lun 0 (pass7,da7)
<Samsung Flash Drive FIT 1100>	 at scbus9 target 0 lun 0 (pass8,da8)


The HUH728080AL is the new drive. One of these
https://www.hgst.com/products/hard-drives/ultrastar-he10
in 8TB trim



Screenshot:
20171229_210708-FreeNAS-Panic-scaled.jpg


Just before it panics, I get this:
20171229_223946-FreeNAS-RestartingResilver-scaled.jpg

You can see it restarting the resilver, and that restart itself throws a stack backtrace.

Is my best bet just to pull the new drive and let it mount the old (even if degraded) pool?


Thanks
Kuyper
 

cobrakiller58

Joined
Jan 18, 2017
Messages
524

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
Thanks for the reply, @cobrakiller58.

Update was from 9.3 (FreeNAS-9.3-STABLE-201604150515-20171218151004 to be precise); I technically did a fresh install with a restore to 11.1 (FreeNAS-11.1-RELEASE (dc7d195f4)-20171218165422), per @joeschmuck's suggestion. The "2017" timestamps are just my backup times; I pulled the release numbers straight from those files.

I've retained my 9.3 boot USBs (phew), so do you suggest I boot back to those and see if the (old) pool will import? Again, I want to do no harm here, as there is no (I know, I know) data backup; I simply can't keep 2 complete copies (though it looks like I will have to plan to, even if the other backup is a semi-cold RAID0 offsite).

I see several references to reverting to 11.0-U4 in https://forums.freenas.org/index.php?threads/freenas-11-1-locking-up-roughly-every-24-hours.59815; I could install that instead of 9.3 to at least get me into the 11s; I assume I can then restore from my 11.1 backup, or should I just let it do the 9.3->11 conversion again?

Thanks
Kuyper
 
cobrakiller58

Joined
Jan 18, 2017
Messages
524
TBH I do not know if an 11.1 config would import properly, but I would personally install 11.0-U4 and then try the 11.1 config file; even if it doesn't work, you still have your original 9.3 to fall back to, as well as the 9.3 config, which should work as desired. You may want to wait for someone better versed in the ways of FreeNAS to chime in, though. If you disconnect your data drives before switching OSes, you won't have to worry about oopsies with the data.
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
TBH I do not know if an 11.1 config would import properly, but I would personally install 11.0-U4 and then try the 11.1 config file; even if it doesn't work, you still have your original 9.3 to fall back to, as well as the 9.3 config, which should work as desired. You may want to wait for someone better versed in the ways of FreeNAS to chime in, though. If you disconnect your data drives before switching OSes, you won't have to worry about oopsies with the data.

Thanks again - I am concerned about rolling all the way back to 9.3 because I noticed that all sorts of migrations happened when my 11.1 system first booted (several reboots during the migration) - I'll grab 11.0-U4 from https://download.freenas.org/11/11.0-U4/x64/ and see if that helps, but I will wait a while for more advice too; I've been offline while I was remote for a week (who starts a resilver remotely anyway! :) ) so a little longer delay isn't a biggie.

Thanks again for the advice and pointers thus far!

Kuyper
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
LONG Update - still have some questions at the bottom:

I've booted a fresh 11.0-U4 install; it went so-so: there were some errors on the reboot passes (I have screengrabs and will paste them if they seem useful), but the system did come up.

I then recovered my most recent backup from right after I installed 11.1.

However, my main pool was not immediately available, and my home directory was therefore not available on first ssh login:
Code:
$ ssh kuyper@192.168.10.10
kuyper@192.168.10.10's password:
FreeBSD 11.0-STABLE (FreeNAS.amd64) #0 r321665+25fe8ba8d06(freenas/11.0-stable): Mon Sep 25 06:24:11 UTC 2017

		FreeNAS (c) 2009-2017, The FreeNAS Development Team
		All rights reserved.
		FreeNAS is released under the modified BSD license.

		For more information, documentation, help or support, go here:
		http://freenas.org
Welcome to Knox; Kuyper's FreeNAS
Could not chdir to home directory /mnt/tank01/kuyper: No such file or directory


The pool was not listed:
Code:
[kuyper@knox /]$ zpool status
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da8p2	 ONLINE	   0	 0	 0

errors: No known data errors


so I tried a simple import:
Code:
[kuyper@knox /]$ sudo zpool import

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

	#1) Respect the privacy of others.
	#2) Think before you type.
	#3) With great power comes great responsibility.

Password:
   pool: tank01
	 id: 14285396135785859431
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
		the '-f' flag.
   see: http://illumos.org/msg/ZFS-8000-EY
 config:

		tank01											ONLINE
		  raidz1-0										ONLINE
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			replacing-4								   ONLINE
			  gptid/09e9d535-7293-11e4-921a-bc5ff4c7d2fc  ONLINE
			  gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE
		logs
		  da5											 ONLINE
		  da6											 ONLINE


So I ran it with the -f option:
Code:
[kuyper@knox /]$ sudo zpool import -f
   pool: tank01
	 id: 14285396135785859431
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
		the '-f' flag.
   see: http://illumos.org/msg/ZFS-8000-EY
 config:

		tank01											ONLINE
		  raidz1-0										ONLINE
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc	ONLINE
			replacing-4								   ONLINE
			  gptid/09e9d535-7293-11e4-921a-bc5ff4c7d2fc  ONLINE
			  gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE
		logs
		  da5											 ONLINE
		  da6											 ONLINE

but it wanted me to name the pool:
Code:
[kuyper@knox /]$ sudo zpool import -f tank01
[kuyper@knox /]$


which completed silently. (In hindsight that makes sense: zpool import, even with -f, only lists importable pools until you give it a name or id, which is why the first two runs just printed the pool config.)
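
Per the listing's own hint ("can be imported using its name or numeric identifier"), importing by the id would have been equivalent:
Code:
sudo zpool import -f 14285396135785859431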

Thankfully the pool is now listing again, and currently the resilvering has just started:
Code:
[kuyper@knox /]$
[kuyper@knox /]$ zpool status
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da8p2	 ONLINE	   0	 0	 0

errors: No known data errors

  pool: tank01
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Dec 30 11:56:35 2017
		91.1M scanned out of 17.4T at 11.4M/s, 445h52m to go
		16.5M resilvered, 0.00% done
config:

		NAME											  STATE	 READ WRITE CKSUM
		tank01											ONLINE	   0	 0	 0
		  raidz1-0										ONLINE	   0	 0	 0
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			replacing-4								   ONLINE	   0	 0	 0
			  gptid/09e9d535-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			  gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE	   0	 0	 0  (resilvering)
		logs
		  da5											 ONLINE	   0	 0	 0
		  da6											 ONLINE	   0	 0	 0

errors: No known data errors
[kuyper@knox /]$


I guess the next 19 hours will confirm whether it works!
Code:
[kuyper@knox /]$ zpool status tank01
  pool: tank01
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Dec 30 11:56:35 2017
		105G scanned out of 17.4T at 267M/s, 18h54m to go
		21.0G resilvered, 0.59% done
config:

		NAME											  STATE	 READ WRITE CKSUM
		tank01											ONLINE	   0	 0	 0
		  raidz1-0										ONLINE	   0	 0	 0
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			replacing-4								   ONLINE	   0	 0	 0
			  gptid/09e9d535-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			  gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE	   0	 0	 0  (resilvering)
		logs
		  da5											 ONLINE	   0	 0	 0
		  da6											 ONLINE	   0	 0	 0

errors: No known data errors



Yikes! This was nerve-wracking!

I'll update here later.

My home directory is still not showing up after I logged out and back in:
Code:
$ ssh kuyper@192.168.10.10
kuyper@192.168.10.10's password:
Last login: Sat Dec 30 11:55:07 2017 from 192.168.10.104
FreeBSD 11.0-STABLE (FreeNAS.amd64) #0 r321665+25fe8ba8d06(freenas/11.0-stable): Mon Sep 25 06:24:11 UTC 2017

		FreeNAS (c) 2009-2017, The FreeNAS Development Team
		All rights reserved.
		FreeNAS is released under the modified BSD license.

		For more information, documentation, help or support, go here:
		http://freenas.org
Welcome to Knox; Kuyper's FreeNAS
Could not chdir to home directory /mnt/tank01/kuyper: No such file or directory
[kuyper@knox /]$


which is odd because my volumes are all there:
Code:
[kuyper@knox /]$ mount
freenas-boot/ROOT/default on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
tmpfs on /etc (tmpfs, local)
tmpfs on /mnt (tmpfs, local)
tmpfs on /var (tmpfs, local)
freenas-boot/grub on /boot/grub (zfs, local, noatime, nfsv4acls)
fdescfs on /dev/fd (fdescfs)
tank01 on /tank01 (zfs, local, nfsv4acls)
tank01/Mac on /tank01/Mac (zfs, local, nfsv4acls)
tank01/Mac/TimeSet on /tank01/Mac/TimeSet (zfs, local, nfsv4acls)
tank01/NFS on /tank01/NFS (zfs, local, nfsv4acls)
tank01/WinDuhz on /tank01/WinDuhz (zfs, local, nfsv4acls)
tank01/jails on /tank01/jails (zfs, local, nfsv4acls)
tank01/jails/.warden-template-pluginjail--x64 on /tank01/jails/.warden-template-pluginjail--x64 (zfs, local, nfsv4acls)
tank01/jails/.warden-template-pluginjail-9.3-x64 on /tank01/jails/.warden-template-pluginjail-9.3-x64 (zfs, local, nfsv4acls)
tank01/jails/plexmediaserver_1 on /tank01/jails/plexmediaserver_1 (zfs, local, nfsv4acls)
tank01/jails/plexmediaserver_2 on /tank01/jails/plexmediaserver_2 (zfs, local, nfsv4acls)
tank01/media on /tank01/media (zfs, local, nfsv4acls)


Argh! Just spotted the issue: I imported into /tank01 instead of /mnt/tank01. Everything is actually there:
Code:
[kuyper@knox /]$ ls -l /tank01
total 119
drwxr-xr-x  10 root	wheel  10 Dec 18 16:52 jails
drwxr-xr-x   3 kuyper  users  14 Dec 22 21:13 kuyper
drwxrwxr-x+  4 root	users   6 Dec 31  2015 Mac
drwxrwxr-x  12 root	users  23 May 18  2017 media
drwxrwxr-x   4 root	users   4 Jun 16  2015 NFS
drwxrwxr-x+ 11 root	users  19 Dec 18 16:55 WinDuhz
[kuyper@knox /]$ ls -l /tank01/kuyper
total 490430
-rwxr-----  1 kuyper  users  502136832 Mar 16  2017 20170122_092804.mp4


I'm too petrified to interrupt the resilvering, so I guess I'd better just let it complete, then reboot and try to import into the correct mount point.

What's the process there? Do I need to "export" before importing?

Or do I just re-import (with a -f I assume) thus:
Code:
zpool import -f -R /mnt tank01

even while tank01 is mounted at /tank01?

Thanks
Kuyper
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
OK, so the resilver happened overnight:
Code:
[kuyper@knox /]$ zpool status tank01
  pool: tank01
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: resilvered 3.48T in 15h33m with 0 errors on Sun Dec 31 03:29:47 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		tank01										  ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE	   0	 0	 0
		logs
		  da5										   ONLINE	   0	 0	 0
		  da6										   ONLINE	   0	 0	 0

errors: No known data errors

and now drive da4 (09e9d535-7293-11e4-921a-bc5ff4c7d2fc) is out of the picture. Now I just need to get tank01 mounted back on /mnt/tank01.

Hoping to get a quick, safe way explained; really not sure what the best process is.

Is it as simple as a "zpool export tank01" and then "zpool import -R /mnt tank01"? That's what https://docs.oracle.com/cd/E19253-01/819-5461/gbchy/index.html and https://www.freebsd.org/doc/handbook/zfs-zpool.html seem to imply.
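
Spelled out, the sequence I have in mind (a sketch; it assumes nothing is holding the pool busy):
Code:
sudo zpool export tank01
sudo zpool import -R /mnt tank01

with -R setting the altroot so the datasets land under /mnt/tank01 instead of /tank01.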
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Use the GUI to import your pool... Why do you keep using the CLI when it's specifically called out not to do that?
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
Use the GUI to import your pool... Why do you keep using the CLI when it's specifically called out not to do that?

Thanks @SweetAndLow but, for one:
2017-12-30-FreeNAS-Import-snafu.jpg


But generally I'm just more comfortable with CLIs; remember, I ended up here because 11.1 horked the "zpool replace" and kept panicking, dropping me at the shell anyway... So now I have exported and rebooted, but I still can't import the volume via the UI. If I try the CLI (even in single-user mode), it complains of missing mount points (which should be there):
2017-12-31-FreeNAS-import-pool-missing-mounts.jpg


so, getting a tad nervous again :(
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
Well, don't I feel daft; took a walk to calm down and looked more carefully at the error message, then took a look (freshly booted) at /mnt to confirm my suspicions:
Code:
[kuyper@knox /]$ ls -l /mnt
total 4
-rw-r--r--  1 root  wheel  5 Dec 30 11:34 md_size

There's no directory for tank01!! It seems the "restore" didn't pre-create the old directory...

Created it:
Code:
[kuyper@knox /]$ sudo mkdir /mnt/tank01

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

	#1) Respect the privacy of others.
	#2) Think before you type.
	#3) With great power comes great responsibility.

Password:
[kuyper@knox /]$ zpool status
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da8p2	 ONLINE	   0	 0	 0

errors: No known data errors
[kuyper@knox /]$ zpool list
NAME		   SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  59.5G   745M  58.8G		 -	  -	 1%  1.00x  ONLINE  -
[kuyper@knox /]$ sudo zpool import -R /mnt tank01
[kuyper@knox /]$ df -h
Filesystem										  Size	Used   Avail Capacity  Mounted on
freenas-boot/ROOT/default							58G	731M	 57G	 1%	/
devfs											   1.0K	1.0K	  0B   100%	/dev
tmpfs												32M	9.4M	 23M	29%	/etc
tmpfs											   4.0M	8.0K	4.0M	 0%	/mnt
tmpfs												11G	126M	 11G	 1%	/var
freenas-boot/grub									57G	6.3M	 57G	 0%	/boot/grub
fdescfs											 1.0K	1.0K	  0B   100%	/dev/fd
tank01											  111G	891M	111G	 1%	/mnt/tank01
tank01/Mac										  111G	409K	111G	 0%	/mnt/tank01/Mac
tank01/Mac/TimeSet								  835G	725G	111G	87%	/mnt/tank01/Mac/TimeSet
tank01/NFS										  3.1T	3.0T	111G	97%	/mnt/tank01/NFS
tank01/WinDuhz									  3.5T	3.4T	111G	97%	/mnt/tank01/WinDuhz
tank01/jails										111G	211K	111G	 0%	/mnt/tank01/jails
tank01/jails/.warden-template-pluginjail--x64	   111G	817M	111G	 1%	/mnt/tank01/jails/.warden-template-pluginjail--x64
tank01/jails/.warden-template-pluginjail-9.3-x64	111G	526M	111G	 0%	/mnt/tank01/jails/.warden-template-pluginjail-9.3-x64
tank01/jails/plexmediaserver_1					  159G	 49G	111G	31%	/mnt/tank01/jails/plexmediaserver_1
tank01/jails/plexmediaserver_2					  138G	 27G	111G	20%	/mnt/tank01/jails/plexmediaserver_2
tank01/media										6.8T	6.7T	111G	98%	/mnt/tank01/media

and voila!

The UI is still rather unhappy, despite that:
2017-12-31-FreeNAS-storage-still-not-happy-before-reboot.jpg

so another reboot is in order...

Humph, still not quite there; the pool did not auto-import:
Code:
$ ssh kuyper@192.168.10.10
kuyper@192.168.10.10's password:
FreeBSD 11.0-STABLE (FreeNAS.amd64-DEBUG) #0 r321665+25fe8ba8d06(freenas/11.0-stable): Mon Sep 25 06:25:17 UTC 2017

		FreeNAS (c) 2009-2017, The FreeNAS Development Team
		All rights reserved.
		FreeNAS is released under the modified BSD license.

		For more information, documentation, help or support, go here:
		http://freenas.org
Welcome to Knox; Kuyper's FreeNAS
Could not chdir to home directory /mnt/tank01/kuyper: No such file or directory
[kuyper@knox /]$
[kuyper@knox /]$
[kuyper@knox /]$
[kuyper@knox /]$ mount
freenas-boot/ROOT/default on / (zfs, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
tmpfs on /etc (tmpfs, local)
tmpfs on /mnt (tmpfs, local)
tmpfs on /var (tmpfs, local)
freenas-boot/grub on /boot/grub (zfs, local, noatime, nfsv4acls)
fdescfs on /dev/fd (fdescfs)
[kuyper@knox /]$ zpool list
NAME		   SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  59.5G   746M  58.8G		 -	  -	 1%  1.00x  ONLINE  -
[kuyper@knox /]$

and the directory I created under /mnt is gone again:
Code:
[kuyper@knox /]$ ls -l /mnt
total 4
-rw-r--r--  1 root  wheel  5 Dec 30 11:34 md_size
[kuyper@knox /]$


I also notice /var/log/messages starts at the current boot time (previous entries are all gone). The log part may simply be /var living on tmpfs, as the mount output above shows, but something is also resetting me to fresh on each boot (messages and debug.log attached).

Reminder, this system was restored from the backup I took after updating to 11.1, but *before* I added the new drive causing the panic. Should I roll back to my 9.3 backup?
 

Attachments

  • freenas-knox-messages-2017-12-31T1037PST.txt
  • freenas-debug-log-2017-12-31T1037PST.txt

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
No need to roll back, but you'll want to reset to factory defaults and upload your 9.3 config.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Also, do me a favor and file a bug report containing those files and a link to this thread. I think it's the first actual panic caused by whatever is going on with the new scrub/resilver stuff.
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
No need to roll back, but you'll want to reset to factory defaults and upload your 9.3 config.
Thanks @Ericloewe - factory reset, I assume, as with all else, leaves my pools intact? To be clear, I'm running 11.0-U4.

So you think that 11.1 backup (which I notice is a tar file, not a "db" as in 9.3) is part of the problem? I guess I'm keen to understand what's wrong (in addition to, not instead of, just blindly following steps :) )
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
factory reset, I assume, as with all else, leaves my pools intact?
Yeah. To be safe, you can unplug your pool before proceeding.

So you think that 11.1 backup (which I notice is a tar file, not a "db" as in 9.3) is part of the problem? I guess I'm keen to understand what's wrong (in addition to, not instead of, just blindly following steps :) )
No, it's simpler than that. You can't import configs for newer versions than the one that is installed. It causes everyone's favorite: undefined behavior

It may seem silly and lazy; that's probably true to some extent and should see some improvements, but there are bigger fish to fry at the moment.
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
\o/

@Ericloewe, you bloody superstar! Where do I send the beer? <wipes days of sweat from brow>

That was a nerve-wracking journey! Normal service has been resumed!

Now on to the ordeal of resilvering the next 4 drives. Deep breath, and be methodical about identifying each drive as I pull them (no activity lights, so I can't identify them by anything other than serial number; new drives will have their numbers taped to the back for easier ID).
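
A quick sketch of how I plan to map device nodes to serials before pulling anything (device names per the camcontrol list above; adjust as needed):
Code:
# print each data drive's serial; smartctl -i includes a "Serial Number" line
for d in da0 da1 da2 da3 da4 da7; do
  printf '%s: ' "$d"
  sudo smartctl -i "/dev/$d" | grep 'Serial Number'
done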

Thank you a million times, again!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
It's too bad something's broken in 11.1, because scrubs and resilvers were supposed to see massive improvements in speed - exactly what you'd like right now.
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
It's too bad something's broken in 11.1, because scrubs and resilvers were supposed to see massive improvements in speed - exactly what you'd like right now.

Yeah, that was a bit of a harsh welcome to the 11.x release train :)

Well, my smartctl:
Code:
[kuyper@knox ~]$ sudo smartctl -t long /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1108 minutes for test to complete.
Test will complete after Mon Jan  1 08:27:27 2018

Use smartctl -X to abort test.
[kuyper@knox ~]$



is now running on the 2nd 8TB drive, having removed the previously replaced drive; and this time, to appease the UI-gods, I'll use the UI thusly:

upload_2017-12-31_14-5-17.png


instead of the CLI :)

But that will only be in the morning, as the long test runs for 1,100 minutes or so! I will repeat this process for each new drive until all 5 have been replaced.
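
In the meantime, a quick way to peek at the test's progress without waiting out the full run (same device node assumed):
Code:
sudo smartctl -a /dev/da7 | grep -A 2 'Self-test execution status'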

Obrigado, again, @Ericloewe, for your patient explanations!

Cheers
Kuyper
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
But that will only be in the morning, as the long test runs for 1,100 minutes or so!
Probably longer, if they're anything like my 3TB WD Reds.
 

Kuyper

Dabbler
Joined
Nov 30, 2014
Messages
20
Probably longer, if they're anything like my 3TB WD Reds.
:)

Nah, done:
Code:
[kuyper@knox ~]$ date
Mon Jan  1 08:46:25 PST 2018
[kuyper@knox ~]$ sudo smartctl -a /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 HGST Ultrastar He8
Device Model:	 HGST HUH728080ALE600
Serial Number:	VKHA1RWX
LU WWN Device Id: 5 000cca 254d2a9bd
Firmware Version: A4MST7JF
User Capacity:	8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Jan  1 08:46:29 2018 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(  101) seconds.
Offline data collection
capabilities:					(0x5b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		(1108) minutes.
SCT capabilities:			  (0x003d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   100   100   054	Pre-fail  Offline	  -	   0
  3 Spin_Up_Time			0x0007   100   100   024	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   1
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   100   100   020	Pre-fail  Offline	  -	   0
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   19
 10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   1
 22 Helium_Level			0x0023   100   100   025	Pre-fail  Always	   -	   100
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   2
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   2
194 Temperature_Celsius	 0x0002   187   187   000	Old_age   Always	   -	   32 (Min/Max 24/35)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%		17		 -
# 2  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[kuyper@knox ~]$


For others: go to the Storage->Volumes tab; select the Volume (pool), then look for "Volume Status", which is a little hidden, right at the bottom:

upload_2018-1-1_9-13-24.png

Select the disk and confirm the serial number of the disk to be replaced with "Edit Disk":
upload_2018-1-1_9-13-24.png


then run "Replace":
upload_2018-1-1_9-13-24.png


Final check on zpool before I hit "go"
Code:
[kuyper@knox ~]$ zpool status tank01
  pool: tank01
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
		still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
		the pool may no longer be accessible by software that does not support
		the features. See zpool-features(7) for details.
  scan: resilvered 3.48T in 15h33m with 0 errors on Sun Dec 31 03:29:47 2017
config:

		NAME											STATE	 READ WRITE CKSUM
		tank01										  ONLINE	   0	 0	 0
		  raidz1-0									  ONLINE	   0	 0	 0
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d  ONLINE	   0	 0	 0
		logs
		  da4										   ONLINE	   0	 0	 0
		  da5										   ONLINE	   0	 0	 0

errors: No known data errors


Ready and running:
upload_2018-1-1_9-13-24.png

Code:
[kuyper@knox ~]$ zpool status tank01
  pool: tank01
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
	scan: resilver in progress since Mon Jan  1 09:01:18 2018
		166G scanned out of 17.4T at 294M/s, 17h6m to go
		33.1G resilvered, 0.93% done
config:

		NAME											  STATE	 READ WRITE CKSUM
		tank01											ONLINE	   0	 0	 0
		  raidz1-0										ONLINE	   0	 0	 0
			gptid/07582aa3-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/07fdef2c-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			replacing-2								   ONLINE	   0	 0	 0
			  gptid/089c3601-7293-11e4-921a-bc5ff4c7d2fc  ONLINE	   0	 0	 0
			  gptid/6210a51d-ef15-11e7-9003-002655ec9c7d  ONLINE	   0	 0	 0  (resilvering)
			gptid/09414ac2-7293-11e4-921a-bc5ff4c7d2fc	ONLINE	   0	 0	 0
			gptid/0687c78d-e844-11e7-8ed0-002655ec9c7d	ONLINE	   0	 0	 0
		logs
		  da4											 ONLINE	   0	 0	 0
		  da5											 ONLINE	   0	 0	 0
errors: No known data errors


Now I just sit and cool my heels for another 17 hours :)
 