Volume still DEGRADED with UNAVAIL drive after replacement (user error)

Status
Not open for further replies.

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
Hello; first post here. I have FreeNAS-9.10.2 (a476f16) and am new to it; it’s for home use. It’s installed on a single USB key. I’m reasonably competent, software- and operations-wise, but this is my first foray into storage, FreeBSD, and ZFS. I chose ZFS for some learning experience; it hasn't been disappointing.

The motherboard is an ASRock C2750D4I, which has a Marvell SE9230 and a Marvell SE9172 controller. As far as I know, the motherboard firmware is the latest version; I updated it late last year.

I created a single volume from 4x3TB physical drives, set up as a pair of mirrored vdevs based on a sensible-sounding post by Jim Salter. I encrypted the volume following the instructions in the manual.

One of the drives has developed a fault, the volume is in a DEGRADED state, and I think I made a mistake when I tried to fix it.

I bought an extra drive to replace the failed one, and another to be a spare (cold, disconnected). I followed the instructions in section 8 about replacing a drive in an encrypted volume. I replaced what I thought {1} was the failed drive, but I think this is where I made an error - I think I selected the wrong adaN in the dropdown that was presented. Frankly, I should have paid more attention and double-checked first.

So, after resilvering, the volume is still in a DEGRADED state, and there is still one entry that shows UNAVAIL. So I added the spare drive and resilvered again - still DEGRADED, still the same UNAVAIL entry.
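For reference, I did all of this via the GUI as the manual instructs for encrypted volumes; my understanding (an assumption on my part, not something I ran) is that under the hood the replacement step amounts to roughly this:

```shell
# Assumption: the GUI's replace step roughly corresponds to zpool replace,
# with the old and new partitions' .eli gptid labels (examples from my pool).
# I did NOT run this by hand - shown only to illustrate my mental model.
zpool replace vol1 \
  gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli \
  gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli
zpool status vol1   # watch the resilver progress
```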

To be honest, I'm now not sure a) how I got to this point, or b) how to get out of it - first to an array that is ONLINE rather than DEGRADED, and ideally, whether there's a way to repair the corrupt files. I'm prepared to lose the Plex Media Server jail. I'd be obliged if someone could advise me.

Also - in the UI, the console log (the black area at the bottom) no longer shows me any data. I assume that's because it tails one of the files that is corrupted (it sounds like `/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/messages` et al. are important, but I'm guessing at what that console is tailing).

Here are the diagnostics that some searching suggested I run (I don't seem to be able to proceed successfully via the GUI, but I have made no changes via the command line so far).

Code:
[root@orion ~]# zpool status -v vol1
  pool: vol1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 445G in 1h27m with 161 errors on Wed Jan 11 02:44:34 2017
config:

        NAME                                                  STATE     READ WRITE CKSUM
        vol1                                                  DEGRADED     0     0 97.6K
          mirror-0                                            ONLINE       0     0     0
            gptid/fdaefe2c-d137-11e6-a886-d05099c0adce.eli    ONLINE       0     0     0
            gptid/1efbebd9-88a9-11e6-9446-d05099c0adce.eli    ONLINE       0     0     0
          mirror-1                                            DEGRADED     0     0  222K
            replacing-0                                       DEGRADED  222K     0     0
              10734777483274518261                            UNAVAIL      0     0     0  was /dev/gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli
              gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli  ONLINE       0     0  222K
              gptid/c352b547-d467-11e6-90ee-d05099c0adce.eli  ONLINE       0     0  222K
            gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce.eli    ONLINE       0     0  222K

errors: Permanent errors have been detected in the following files:																
																																 
		vol1/.system/cores:<0x0>																									
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/mdnsresponder.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/userlog														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/utx.log														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/middlewared.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/nginx/access.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.nmbd												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.winbindd											
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.wb-ORION											
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/auth.log														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/cron															
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/daemon.log													
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/messages														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/debug.log														
		vol1/jails/plexmediaserver_1:<0x0>																						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.4.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Server.1.log							
		/mnt/vol1/jails/plexmediaserver_1/var/log/cron																			
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).3.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.3.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.2.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Cache/CloudAccount.dat								
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.1.log			
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Server.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Plug-in Support/Databases/com.plexapp.dlna.db-shm	
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).2.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db-wal																																
		/mnt/vol1/jails/plexmediaserver_1/var/log/messages

Code:
[root@orion ~]# glabel status																									
									  Name  Status  Components																	
gptid/c352b547-d467-11e6-90ee-d05099c0adce	 N/A  ada0p2																		
gptid/fdaefe2c-d137-11e6-a886-d05099c0adce	 N/A  ada1p2																		
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce	 N/A  ada2p2																		
gptid/1efbebd9-88a9-11e6-9446-d05099c0adce	 N/A  ada3p2																		
gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce	 N/A  ada4p2																		
gptid/157bddc1-4b5d-11e6-b367-d05099c0adce	 N/A  da0p1																		
gptid/158de726-4b5d-11e6-b367-d05099c0adce	 N/A  da0p2

Code:
[root@orion ~]# camcontrol devlist																								
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus0 target 0 lun 0 (pass0,ada0)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus1 target 0 lun 0 (pass1,ada1)															
<TOSHIBA DT01ACA300 MX6OABB0>	  at scbus2 target 0 lun 0 (pass2,ada2)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus4 target 0 lun 0 (pass3,ada3)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus5 target 0 lun 0 (pass4,ada4)															
<Marvell Console 1.01>			 at scbus9 target 0 lun 0 (pass5)																
<Samsung Flash Drive FIT 1100>	 at scbus17 target 0 lun 0 (pass6,da0)

Code:
[root@orion ~]# camcontrol identify ada0 | grep ^serial																			
serial number		 WD-WCC4N3KUL6DL																							
[root@orion ~]# camcontrol identify ada1 | grep ^serial																			
serial number		 WD-WCC4N7PCS8UN																							
[root@orion ~]# camcontrol identify ada2 | grep ^serial																			
serial number		 X5DAN58GS																									
[root@orion ~]# camcontrol identify ada3 | grep ^serial																			
serial number		 WD-WCC4N6XKYX6V																							
[root@orion ~]# camcontrol identify ada4 | grep ^serial																			
serial number		 WD-WCC4N4HRULNN		


{1} - I would have found it very useful to be able to use the storage UI's view-disks screen and have it show which disk is in the degraded state. After reading around in the forums before building the unit, I labelled the front of each disk with its serial number, but it sounded like the device IDs (adaN) were mutable, so I didn't add those to the labels. It seems like a great feature to add: when one loses a drive, one doesn't want to have to first google how to find out which one it was, because it's not obvious in the UI.
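In the meantime, cross-referencing gptid, adaN, and serial number by hand is error-prone, so here's a small sketch I put together (my own awk, an assumption rather than anything from the manual) to join the `glabel status` columns:

```shell
# My own sketch (not a FreeNAS tool): map each gptid label to its adaN device
# by stripping the partition suffix from glabel's Components column. Shown
# against sample lines from my output above, since the live command needs the NAS.
glabel_out='gptid/c352b547-d467-11e6-90ee-d05099c0adce  N/A  ada0p2
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce  N/A  ada2p2'
printf '%s\n' "$glabel_out" |
  awk '{ dev = $3; sub(/p[0-9]+$/, "", dev); print $1, "->", dev }'
# From there, `camcontrol identify adaN | grep ^serial` gives the serial
# number to match against the label stuck on the front of the drive.
```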
 
Last edited:

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
Let's wait for the gurus to give some ideas, but:

I would have found it very useful to be able to use the storage UI for view-disks and have that show which volume is in the degraded state.

Going to storage, selecting the volume, and clicking volume status, displays:

upload_2017-1-20_21-59-53.png


Going to storage, clicking view disks, it shows all disks and serial numbers:

upload_2017-1-20_22-1-6.png


Won't that be enough to identify the failed drive? And you did label your drives with the s/n (got to start doing that!)
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
Won't that be enough to identify the failed drive? And you did label your drives with the s/n (got to start doing that!)

Unfortunately, in a failure state, the volume status page shows, for the failed disk, an integer that I don't know how to relate back to the failed disk, instead of the device ID.

This shot is what mine shows now, which is similar to (but not the same as) the initial failure state.

freenas-failure-state.png


It would be nice for the operator to just need to go to the view disks screen and see a status column showing state of each disk.
 


saurav

Contributor
Joined
Jul 29, 2012
Messages
139
I think the gurus will ask you for the output of "camcontrol devlist" and "glabel status" in code tags.
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
It's already there; it's a single long code block. I've since edited it to make them separate blocks.
 
Last edited:

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'm confused. You said you had 4 drives, 2 mirrors of 2, but mirror-1 looks like it has 4 drives?

What is that Marvell thing in the camcontrol devlist output? One hears disparaging remarks around here about Marvell controllers.

By the way, I am not the guru you're looking for <Jedi mind control hand gesture>
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
I'm confused. You said you had 4 drives, 2 mirrors of 2, but mirror-1 looks like it has 4 drives?

It has 3 drives and one UNAVAIL ghost, if I'm interpreting it correctly. I bought 1 replacement and 1 cold spare, then used the cold spare in addition. I lament that when I built it originally I went with Toshiba drives to save some money. I have at this point replaced all 4 original drives with WD Reds. They appear to run much colder than the Toshibas too.

What is that Marvell thing in the camcontrol devlist output? One hears disparaging remarks around here about Marvell controllers.
I just edited the OP to answer that. When I built the NAS I was inspired by Brian Moses' FreeNAS series.
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
So mirror-1 somehow turned into a 4-way mirror?

I'm really curious to know what those 3 entries under "replacing-0" - one of which is UNAVAIL, and which are slightly indented - mean:

Code:
replacing-0                                       DEGRADED  222K     0     0
  10734777483274518261                            UNAVAIL      0     0     0  was /dev/gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli
  gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli  ONLINE       0     0  222K
  gptid/c352b547-d467-11e6-90ee-d05099c0adce.eli  ONLINE       0     0  222K
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce.eli    ONLINE       0     0  222K
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
No - I think the UNAVAIL one is FreeNAS's memory of a device. What I'd like to find out is how to clear that up (as well as, next, how to either recover or clean up the corruption).

I think that happened when I selected the incorrect device ID during the replace-a-disk procedure. The first time I did it, the UNAVAIL device also couldn't be cleared.
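From my reading of the zpool(8) man page (so an assumption on my part - I haven't run this, and I'd like a guru to confirm first): a member of a replacing vdev can be detached by the GUID that zpool status prints for it, which might clear the stale entry:

```shell
# Assumption from zpool(8), NOT yet run on my pool: detach the UNAVAIL member
# of replacing-0 by the numeric GUID shown in zpool status.
zpool detach vol1 10734777483274518261
# Afterwards, a scrub would show whether the remaining mirror is healthy:
zpool scrub vol1
zpool status -v vol1
```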
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
Yes, that I understand. But what does the "replacing-0" entity mean? Is it something temporary, like when a drive is being resilvered? Because going by just the looks (indentation), it _seems_ mirror-1 is a mirror of "replacing-0" and ada2, while "replacing-0" is a stripe of 3 disks (ada0 & ada4 and UNAVAIL). Once again, it looks that way, but I have no real idea what it actually is. Which is why I'm curious.

Whenever a pool configuration like this happens, it's a good idea to ensure you have a backup of all the data you care about. It may be possible to restore the pool structure to something sensible, but not in a trivial way (meaning you'd likely make a mistake and end up re-creating the pool). Whatever you do, don't try out shell commands that you read on other websites (especially those relating to Solaris and Linux ZFS, or even stock FreeBSD) - they may not apply verbatim to FreeNAS. Wait for a ZFS/FreeNAS guru to respond.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'm not an expert on replacing drives; I haven't had to do that yet. But it looks to me like mirror-1 thinks it should have 4 drives, one of which is UNAVAIL. So maybe when you thought you replaced a drive and "used the cold spare" (not sure what that means), FreeNAS thought you were adding them to the vdev as additional mirrors rather than replacing a bad drive?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not exactly sure what the sequence of operations was, but the pool is essentially toast. Back up the data, restore whatever has been corrupted from a backup (though it looks like only logs and the like were corrupted, so the data should be fine - the Plex jail may need to be rebuilt), and start over with a fresh pool.

Next time, carefully follow the manual and don't improvise disk replacements.

I suspect you may have hot-swapped drives during this process, without using a proper hot-swap bay. If that is indeed the case, it neatly explains the small amount of corruption.
 