Volume still DEGRADED with UNAVAIL drive after replacement (user error)

Status
Not open for further replies.

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
Hello; first post here. I have FreeNAS-9.10.2 (a476f16) and am new to it; it’s for home use. It’s installed on a single USB key. I’m reasonably competent, software- and operations-wise, but this is my first foray into storage, FreeBSD, and ZFS. I chose ZFS for some learning experience; it hasn't been disappointing.

The motherboard is an ASRock C2750D4I, which has a Marvell SE9230 and a Marvell SE9172 controller. As far as I know, the motherboard firmware is the latest version; I updated it late last year.

I created a single volume from 4x3TB physical drives, set up as a pair of mirrored vdevs based on a sensible-sounding post by Jim Salter. I encrypted the volume following the instructions in the manual.

One of the drives has developed a fault, the volume is in a DEGRADED state, and I think I made a mistake when I tried to fix it.

I bought an extra drive to replace the failed one, and another to be a spare (cold, disconnected). I followed the instructions in section 8 about replacing a drive in an encrypted volume. I replaced what I thought {1} was the failed drive, but I think this is where I made an error - I think I selected the wrong adaN in the dropdown that was presented. Frankly, I should have paid more attention and double-checked first.

So, after resilvering, the volume is still in a DEGRADED state, and there is still one entry that shows UNAVAIL. So I added the spare drive and resilvered again - still DEGRADED, still the same UNAVAIL entry.
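For reference, I did all of this via the GUI as the manual instructs for encrypted volumes; my understanding (an assumption on my part, not something I ran) is that under the hood the replacement step amounts to roughly this:

```shell
# Assumption: the GUI's replace step roughly corresponds to zpool replace,
# with the old and new partitions' .eli gptid labels (examples from my pool).
# I did NOT run this by hand - shown only to illustrate my mental model.
zpool replace vol1 \
  gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli \
  gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli
zpool status vol1   # watch the resilver progress
```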

To be honest, I'm now not sure a) how I got to this point, or b) how to get out of it - first to an array that is ONLINE rather than DEGRADED, and ideally, whether there's a way to repair the corrupt files. I'm prepared to lose the Plex Media Server jail. I'd be obliged if someone could advise me.

Also - in the UI, the console log (the black area at the bottom) no longer shows me any data. I assume that's because it tails one of the files that is corrupted (it sounds like `/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/messages` et al. are important, but I'm guessing at what that console is tailing).

Here are the diagnostics that some searching suggested I run (I don't seem to be able to proceed successfully via the GUI, but I have made no changes via the command line so far).

Code:
[root@orion ~]# zpool status -v vol1
  pool: vol1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 445G in 1h27m with 161 errors on Wed Jan 11 02:44:34 2017
config:

        NAME                                                  STATE     READ WRITE CKSUM
        vol1                                                  DEGRADED     0     0 97.6K
          mirror-0                                            ONLINE       0     0     0
            gptid/fdaefe2c-d137-11e6-a886-d05099c0adce.eli    ONLINE       0     0     0
            gptid/1efbebd9-88a9-11e6-9446-d05099c0adce.eli    ONLINE       0     0     0
          mirror-1                                            DEGRADED     0     0  222K
            replacing-0                                       DEGRADED  222K     0     0
              10734777483274518261                            UNAVAIL      0     0     0  was /dev/gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli
              gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli  ONLINE       0     0  222K
              gptid/c352b547-d467-11e6-90ee-d05099c0adce.eli  ONLINE       0     0  222K
            gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce.eli    ONLINE       0     0  222K

errors: Permanent errors have been detected in the following files:																
																																 
		vol1/.system/cores:<0x0>																									
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/mdnsresponder.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/userlog														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/utx.log														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/middlewared.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/nginx/access.log												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.nmbd												
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.winbindd											
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/samba4/log.wb-ORION											
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/auth.log														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/cron															
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/daemon.log													
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/messages														
		/var/db/system/syslog-85defbf1587e4780a923a9999cad858e/log/debug.log														
		vol1/jails/plexmediaserver_1:<0x0>																						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.4.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Server.1.log							
		/mnt/vol1/jails/plexmediaserver_1/var/log/cron																			
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).3.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.3.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.2.log						
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Cache/CloudAccount.dat								
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.1.log			
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Scanner.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex Media Server.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Plug-in Support/Databases/com.plexapp.dlna.db-shm	
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Logs/Plex (anonymous).2.log							
		/mnt/vol1/jails/plexmediaserver_1/var/db/plexdata/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db-wal																																
		/mnt/vol1/jails/plexmediaserver_1/var/log/messages

Code:
[root@orion ~]# glabel status																									
									  Name  Status  Components																	
gptid/c352b547-d467-11e6-90ee-d05099c0adce	 N/A  ada0p2																		
gptid/fdaefe2c-d137-11e6-a886-d05099c0adce	 N/A  ada1p2																		
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce	 N/A  ada2p2																		
gptid/1efbebd9-88a9-11e6-9446-d05099c0adce	 N/A  ada3p2																		
gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce	 N/A  ada4p2																		
gptid/157bddc1-4b5d-11e6-b367-d05099c0adce	 N/A  da0p1																		
gptid/158de726-4b5d-11e6-b367-d05099c0adce	 N/A  da0p2

Code:
[root@orion ~]# camcontrol devlist																								
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus0 target 0 lun 0 (pass0,ada0)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus1 target 0 lun 0 (pass1,ada1)															
<TOSHIBA DT01ACA300 MX6OABB0>	  at scbus2 target 0 lun 0 (pass2,ada2)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus4 target 0 lun 0 (pass3,ada3)															
<WDC WD30EFRX-68EUZN0 82.00A82>	at scbus5 target 0 lun 0 (pass4,ada4)															
<Marvell Console 1.01>			 at scbus9 target 0 lun 0 (pass5)																
<Samsung Flash Drive FIT 1100>	 at scbus17 target 0 lun 0 (pass6,da0)

Code:
[root@orion ~]# camcontrol identify ada0 | grep ^serial																			
serial number		 WD-WCC4N3KUL6DL																							
[root@orion ~]# camcontrol identify ada1 | grep ^serial																			
serial number		 WD-WCC4N7PCS8UN																							
[root@orion ~]# camcontrol identify ada2 | grep ^serial																			
serial number		 X5DAN58GS																									
[root@orion ~]# camcontrol identify ada3 | grep ^serial																			
serial number		 WD-WCC4N6XKYX6V																							
[root@orion ~]# camcontrol identify ada4 | grep ^serial																			
serial number		 WD-WCC4N4HRULNN		


{1} - I would have found it very useful to be able to use the storage UI's view-disks screen and have it show which disk is in the degraded state. After reading around in the forums before building the unit, I labelled the front of each disk with its serial number, but it sounded like the device IDs (adaN) were mutable, so I didn't add those to the labels. It seems like a great feature to add: when one loses a drive, one doesn't want to have to first google how to find out which one it was, because it's not obvious in the UI.
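In the meantime, cross-referencing gptid, adaN, and serial number by hand is error-prone, so here's a small sketch I put together (my own awk, an assumption rather than anything from the manual) to join the `glabel status` columns:

```shell
# My own sketch (not a FreeNAS tool): map each gptid label to its adaN device
# by stripping the partition suffix from glabel's Components column. Shown
# against sample lines from my output above, since the live command needs the NAS.
glabel_out='gptid/c352b547-d467-11e6-90ee-d05099c0adce  N/A  ada0p2
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce  N/A  ada2p2'
printf '%s\n' "$glabel_out" |
  awk '{ dev = $3; sub(/p[0-9]+$/, "", dev); print $1, "->", dev }'
# From there, `camcontrol identify adaN | grep ^serial` gives the serial
# number to match against the label stuck on the front of the drive.
```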
 
Last edited:

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
Let's wait for the gurus to give some ideas, but:

I would have found it very useful to be able to use the storage UI for view-disks and have that show which volume is in the degraded state.

Going to storage, selecting the volume, and clicking volume status, displays:

upload_2017-1-20_21-59-53.png


Going to storage, clicking view disks, it shows all disks and serial numbers:

upload_2017-1-20_22-1-6.png


Won't that be enough to identify the failed drive? And you did label your drives with the s/n (got to start doing that!)
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
Won't that be enough to identify the failed drive? And you did label your drives with the s/n (got to start doing that!)

Unfortunately, in a failure state, the volume status page shows, for the failed disk, an integer that I don't know how to relate back to the failed disk, instead of the device ID.

This shot is what mine shows now, which is similar to (but not the same as) the initial failure state.

freenas-failure-state.png


It would be nice for the operator to just need to go to the view disks screen and see a status column showing state of each disk.
 


saurav

Contributor
Joined
Jul 29, 2012
Messages
139
I think the gurus will ask you for the output of "camcontrol devlist" and "glabel status" in code tags.
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
It's already there; it's a single long code block. I've since edited it to make them separate blocks.
 
Last edited:

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'm confused. You said you had 4 drives, 2 mirrors of 2, but mirror-1 looks like it has 4 drives?

What is that Marvell thing in the camcontrol devlist output? One hears disparaging remarks around here about Marvell controllers.

By the way, I am not the guru you're looking for <Jedi mind control hand gesture>
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
I'm confused. You said you had 4 drives, 2 mirrors of 2, but mirror-1 looks like it has 4 drives?

It has 3 drives and one UNAVAIL ghost, if I'm interpreting it correctly. I bought 1 replacement and 1 cold spare, then used the cold spare in addition. I lament that when I built it originally I went with Toshiba drives to save some money. I have at this point replaced all 4 original drives with WD Reds. They appear to run much colder than the Toshibas too.

What is that Marvell thing in the camcontrol devlist output? One hears disparaging remarks around here about Marvell controllers.
I just edited the OP to answer that. When I built the NAS I was inspired by Brian Moses' FreeNAS series.
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
So mirror-1 somehow turned into a 4-way mirror?

I'm really curious to know what those 3 entries under "replacing-0" - one of which is UNAVAIL, and which are slightly indented - mean:

Code:
replacing-0                                       DEGRADED  222K     0     0
  10734777483274518261                            UNAVAIL      0     0     0  was /dev/gptid/abedf192-4d3e-11e6-bbbe-d05099c0adce.eli
  gptid/6078220f-d2cc-11e6-8cc5-d05099c0adce.eli  ONLINE       0     0  222K
  gptid/c352b547-d467-11e6-90ee-d05099c0adce.eli  ONLINE       0     0  222K
gptid/adcd673b-4d3e-11e6-bbbe-d05099c0adce.eli    ONLINE       0     0  222K
 

petemounce

Cadet
Joined
Jul 17, 2016
Messages
8
No - I think the UNAVAIL one is FreeNAS's memory of a device. What I'd like to find out is how to clear that up (as well as, next, how to either recover or clean up the corruption).

I think that happened when I selected the incorrect device ID during the replace-a-disk procedure. The first time I did it, the UNAVAIL device also couldn't be cleared.
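From my reading of the zpool(8) man page (so an assumption on my part - I haven't run this, and I'd like a guru to confirm first): a member of a replacing vdev can be detached by the GUID that zpool status prints for it, which might clear the stale entry:

```shell
# Assumption from zpool(8), NOT yet run on my pool: detach the UNAVAIL member
# of replacing-0 by the numeric GUID shown in zpool status.
zpool detach vol1 10734777483274518261
# Afterwards, a scrub would show whether the remaining mirror is healthy:
zpool scrub vol1
zpool status -v vol1
```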
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
Yes, that I understand. But what does the "replacing-0" entity mean? Is it something temporary, like when a drive is being resilvered? Because going by just the looks (indentation), it _seems_ mirror-1 is a mirror of "replacing-0" and ada2, while "replacing-0" is a stripe of 3 disks (ada0 & ada4 and UNAVAIL). Once again, it looks that way, but I have no real idea what it actually is. Which is why I'm curious.

Whenever a pool configuration like this happens, it's a good idea to ensure you have a backup of all the data you care about. It may be possible to restore the pool structure to something sensible, but not in a trivial way (meaning you'd likely make a mistake and end up re-creating the pool). Whatever you do, don't try out shell commands that you read on other websites (especially those relating to Solaris and Linux ZFS, or even stock FreeBSD) - they may not apply verbatim to FreeNAS. Wait for a ZFS/FreeNAS guru to respond.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'm not an expert on replacing drives; I haven't had to do that yet. But it looks to me like mirror-1 thinks it should have 4 drives, one of which is UNAVAIL. So maybe when you thought you replaced a drive and "used the cold spare" (not sure what that means), FreeNAS thought you were adding them to the vdev as additional mirrors rather than replacing a bad drive?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Not exactly sure what the sequence of operations was, but the pool is essentially toast. Back up the data, restore whatever has been corrupted from a backup (though it looks like only logs and the like were corrupted, so the data should be fine - the Plex jail may need to be rebuilt), and start over with a fresh pool.

Next time, carefully follow the manual and don't improvise disk replacements.

I suspect you may have hot-swapped drives during this process, without using a proper hot-swap bay. If that is indeed the case, it neatly explains the small amount of corruption.
 