Drive failure prevents boot up; replace drive loses pool

wcage03 · Jan 16, 2017

I think I am hosed. I am not smart enough to know for sure, but I can see a disaster here.

This started with a power blink. The FreeNAS box was not on a UPS. I have had blinks before and occasionally a reboot was required to get everything back online. This time a reboot did not solve the problem. I would get to a repeating error that was primarily comprised of:
CAM status: ATA Status Error
ATA status: 51 (DRDY SERV ERR), error 40 (UNC)
Some digging pointed me to a disk error. I thought "that is why I have a redundant system"... replace the disk and let it rebuild. Looking through the log files and doing some SMART runs allowed me to figure out which drive was the culprit. I got a new disk and replaced it.

The boot went past the repeating error, but when it tried to import the pool it gave the error message:
Cannot import 'newraid': i/o error
Destroy and recreate the pool from a backup source.
Where 'newraid' is the name of my pool. The boot continued to the menu and the web interface for FreeNAS was enabled.

I can go to the web interface and see the pool definition in the volumes section. The Alert System has a critical item stating "The volume 'newraid' (ZFS) state is UNKNOWN:". I searched for instructions on replacing a disk and found instruction to go to "View Disks" and select "replace" for the old disk. I can see the new disk there, but no option to replace. The old disk is not there since it had to be removed to be able to get past the boot error.

I am stuck and I am not smart enough to get past this. Hopefully someone can help. I have tried zpool commands from various posts; "zpool status -v" only shows my boot drive pool, not newraid. Most of the rest are non-starters since I can't import the pool and the zpool commands mostly seem to require that first.

Some specifics about my setup: I am running FreeNAS 9.3 using an Intel box with (initially) 4 drives; one drive for the OS and 3 - 3TB drives for the NAS setup as RaidZ. One of the 3TB drives is the one that failed and I have tried to replace it without luck.

Can anyone provide some pointers on how to get through this or (I hope not) show me how to confirm that the NAS is lost for good?

Spearfoot · Jan 16, 2017

wcage03 said:
I think I am hosed. I am not smart enough to know for sure, but I can see a disaster here.

This started with a power blink. The FreeNAS box was not on a UPS. I have had blinks before and occasionally a reboot was required to get everything back online. This time a reboot did not solve the problem. I would get to a repeating error that was primarily comprised of:
CAM status: ATA Status Error
ATA status: 51 (DRDY SERV ERR), error 40 (UNC)
Some digging pointed me to a disk error. I thought "that is why I have a redundant system"... replace the disk and let it rebuild. Looking through the log files and doing some SMART runs allowed me to figure out which drive was the culprit. I got a new disk and replaced it.

The boot went past the repeating error, but when it tried to import the pool it gave the error message:
Cannot import 'newraid': i/o error
Destroy and recreate the pool from a backup source.
Where 'newraid' is the name of my pool. The boot continued to the menu and the web interface for FreeNAS was enabled.

I can go to the web interface and see the pool definition in the volumes section. The Alert System has a critical item stating "The volume 'newraid' (ZFS) state is UNKNOWN:". I searched for instructions on replacing a disk and found instruction to go to "View Disks" and select "replace" for the old disk. I can see the new disk there, but no option to replace. The old disk is not there since it had to be removed to be able to get past the boot error.

I am stuck and I am not smart enough to get past this. Hopefully someone can help. I have tried zpool commands from various posts; "zpool status -v" only shows my boot drive pool, not newraid. Most of the rest are non-starters since I can't import the pool and the zpool commands mostly seem to require that first.

Some specifics about my setup: I am running FreeNAS 9.3 using an Intel box with (initially) 4 drives; one drive for the OS and 3 - 3TB drives for the NAS setup as RaidZ. One of the 3TB drives is the one that failed and I have tried to replace it without luck.

Can anyone provide some pointers on how to get through this or (I hope not) show me how to confirm that the NAS is lost for good?

Well, I'll tell ya... it don't look good. It wouldn't do any harm to re-seat all of the SATA and power cables, but I think your pool is toast. When FreeNAS shows you this message, it's pretty emphatic:

Code:

 Destroy and recreate the pool from a backup source.

Do you have a backup of your data?

And this would be a good time to buy a UPS, and to re-think using RAIDZ1, which isn't recommended for 'large' drives more than ~1TB in size.

Good luck!

SweetAndLow · Jan 16, 2017

Output of zpool import, smartctl -a /dev/adaX for all disks and camcontrol devlist.

I don't think you will get your pool back though. You had multiple disks fail.

Sent from my Nexus 5X using Tapatalk

wcage03 · Jan 17, 2017

Interesting, I was about to reply back regarding running zpool import that it was useless since it falls into an (apparently) endless loop if the bad drive is in place and it says that the pool doesn't exist if the drive is not there, but, lo and behold, before I responded, I ran the command...

 

[root@freenas] ~# zpool import



  pool: newraid



	id: 12116166501612787503



  state: FAULTED



status: One or more devices are missing from the system.



action: The pool cannot be imported. Attach the missing



	devices and try again.



	The pool may be active on another system, but can be imported using



	the '-f' flag.



  see: http://illumos.org/msg/ZFS-8000-3C



config:





	newraid										 FAULTED  corrupted data



	  raidz1-0									  DEGRADED



		gptid/23e506a2-24cd-11e4-bbdb-0019d129bc15  ONLINE



		14289297479639926827						UNAVAIL  cannot open



		gptid/25d6bb74-24cd-11e4-bbdb-0019d129bc15  ONLINE



[root@freenas] ~#

This at least seems to open up the possibility that I can replace the drive... maybe...

The smartctl -a for each of the drives came back with clean statuses even for the failed drive except when I ran an extended check on the drive that failed. At that point it came back with a read failure. The one drive is clearly toast. The others seem to be clean. I can post those outputs if they are needed. The bad drive is no longer in the system since it will not allow me to boot when it is attached. If you need the output, I can put it back in and boot to single user mode and run the command, but it would only prove that the drive has failed.

cam control devlist output...

 

[root@freenas] ~# camcontrol devlist



<ST3000DM008-2DM166 CC26>		  at scbus0 target 0 lun 0 (ada0,pass0)



<WDC WD30EZRX-00D8PB0 80.00A80>	at scbus1 target 0 lun 0 (ada1,pass1)



<WDC WD30EZRX-00D8PB0 80.00A80>	at scbus3 target 0 lun 0 (ada2,pass2)



<LITE-ON DVDRW SOHW-1673S JS01>	at scbus5 target 0 lun 0 (pass3,cd0)



<Maxtor 6Y160P0 YAR41BW0>		  at scbus5 target 1 lun 0 (ada3,pass4)



[root@freenas] ~#

The first drive is the new (replacement hopefully) drive. The next two are the remaining drives from the problem pool. The last one is the disk holding the OS.

Thanks for your help on this.

wcage03 · Jan 17, 2017

So, I tried to get ahead of SweetAndLow and tried a zpool replace command using the GUID of the missing drive and got the following...

 

[root@freenas] ~# zpool replace newraid 14289297479639926827 /dev/ada0



cannot open 'newraid': no such pool



[root@freenas] ~#

mjt5282 · Jan 17, 2017

take a deep breath. I suspect that you did the replace wrong when the power blip happened. Unless you are unix neckbeard guru, use the GUI. read the docs. If you 'added' instead of replace, that might explain why zpool import is failing. normally raidz1 tolerates 1 disk loss. did you already try 'zpool import -f newraid' ?
I personally wouldn't try adding or deleting disks until I could mount the pool, even in a degraded fashion.

Robert Trevellyan · Jan 17, 2017

wcage03 said:
newraid FAULTED corrupted data

You might be able to mount the pool read-only and recover data from it before rebuilding. I doubt you'll be able to do much better than that.

wcage03 · Jan 18, 2017

Yeah, I don't think I clear the "neckbeard guru" bar, but I am comfortable on either side of the GUI boundary. The thing that I am grappling with is that the error on the bad disk prevents the pool from being imported. The error that it throws causes it to get into an (apparently) infinite loop. That is the "CAM Status: ATA Error" that I mentioned in the original post. So, if I have the bad drive connected, I can't get the system to boot except in single user mode since it hangs when it tries to do the import of the pool. If I take the drive out, I can boot, but it can't import the pool since the drive is missing.

That is the quandary. It seems like I need to be able to boot with the drive in place so that the pool is recognized, then I could do the replace of the drive. I will look at the import options to see if there is a way to get it to go past the error. Getting it to mount read-only would be acceptable.

SweetAndLow · Jan 18, 2017

Try to import using the GUI. If that doesn't work try zpool import -f newraid. Report back what happens.

Sent from my Nexus 5X using Tapatalk

wcage03 · Jan 18, 2017

SweetAndLow: you are poking straight into my quandary. To import using the GUI, I have to have the system booted in multiuser mode with the bad drive attached. The startup process in multiuser mode tries to automatically import the pool and gets into the infinite failure loop. => I can't boot into multiuser mode (and get the GUI) with the bad drive attached. If I boot into single user mode and try to do zpool -f newraid it puts me into the same infinite loop and never mounts.

If I disconnect the bad drive, I can boot into multiuser mode, but it then doesn't recognize that the pool exists (at least when I try to replace the drive using zpool replace). That is why I named this thread as I did. It seems like I need a way to boot into single user mode and have the pool import ignoring errors ( -f is not doing it) so that I can have the pool recognized and then replace the drive and resilver the new drive and then toss the old one into the trash.

What happens when a drive fails completely? My guess is that the import would fail gracefully and it would move past the import and mount the pool in a degraded status and then I could do the drive replacement via the GUI (or cmd line). My problem is the nature of the drive failure puts the import into an infinite loop. I will put the drive back in and try to capture the status and post it up here so that people can understand what is occurring. I can probably boot single user and do the import and push the stderr out to a file for a period of time. Usually I ssh in I can capture the output by cutting and pasting, but if I can't get to multiuser mode...

Thanks for everyone's help and input on this.

SweetAndLow · Jan 18, 2017

wcage03 said:
SweetAndLow: you are poking straight into my quandary. To import using the GUI, I have to have the system booted in multiuser mode with the bad drive attached. The startup process in multiuser mode tries to automatically import the pool and gets into the infinite failure loop. => I can't boot into multiuser mode (and get the GUI) with the bad drive attached. If I boot into single user mode and try to do zpool -f newraid it puts me into the same infinite loop and never mounts.

If I disconnect the bad drive, I can boot into multiuser mode, but it then doesn't recognize that the pool exists (at least when I try to replace the drive using zpool replace). That is why I named this thread as I did. It seems like I need a way to boot into single user mode and have the pool import ignoring errors ( -f is not doing it) so that I can have the pool recognized and then replace the drive and resilver the new drive and then toss the old one into the trash.

What happens when a drive fails completely? My guess is that the import would fail gracefully and it would move past the import and mount the pool in a degraded status and then I could do the drive replacement via the GUI (or cmd line). My problem is the nature of the drive failure puts the import into an infinite loop. I will put the drive back in and try to capture the status and post it up here so that people can understand what is occurring. I can probably boot single user and do the import and push the stderr out to a file for a period of time. Usually I ssh in I can capture the output by cutting and pasting, but if I can't get to multiuser mode...

Thanks for everyone's help and input on this.

Stop trying to replace stuff! Your going to break it if it's not already broken. But I think it's already fully broken. You had a disk failure and some other kind of failure so your raid z1 can't recover. Try importing the pool with a -F flag and cross your fingers. Then backup everything.

Sent from my Nexus 5X using Tapatalk

wcage03 · Jan 18, 2017

SweetAndLow: I appreciate your persistence in helping, but as I have said several times, zpool import -f newraid causes an infinite loop error condition if the bad drive is in; if the bad drive is out zpool import -f newraid complains that the drive is missing.

wcage03 · Jan 18, 2017

OK, I think I am throwing in the towel. I am coming around to what a couple of you have suggested that maybe more than one drive has been corrupted. At any rate, I can't get out of this hole. Just curious though if any of you have an opinion on what is the best configuration for a robust backup device. I was under the impression that FreeNAS using ZFS with RaidZ would allow me to not encounter what I am going through. Luckily, I think the most important things that have gone down with the ship exist in other places for me, but I have been blindly using this configuration for years thinking I was being proactive and safe (within reason).

At any rate, thanks for the suggestions and assistance. It has been much appreciated.

mjt5282 · Jan 18, 2017

whoa. a raidz1 with originally 3 disks, one dead, I think it should be able to be mounted up (read only maybe?). Sometimes people run Solaris or Linux ZFS to try and mount the disks. Send them to me and I will try and mount them on my rig(s).

wcage03 · Jan 18, 2017

mjt5282: I am of a like mind... one dead, the promise is that the other two live on to repopulate the world... something else is going on here. I am more and more convinced that smart on the disks is almost useless. I clearly have one drive that won't even allow the system to boot and smartctl -a says all is well unless I do a deep scan; 5 hours later it tells me that there is a read error. I need to deep scan the other two to see if there is something else there that is a problem.

I have considered sticking them into another linux box to see if I can get the pool back, but, more importantly, I have been going down memory lane to remember what all was on the NAS; time machine backups of the various family computers, about 30k photos, about 15k songs, miscellaneous documents and lots of crap from the many years the thing has been in service. The backups are clearly not a problem (unless one of those computers goes down soon). Luckily I got a new Mac a year ago and imported all those photos into the new Mac Photos app (lucky save there) and I was considering moving onto Google Music a couple of years ago and uploaded all of that music there... turns out, it is still there. So other than the miscellaneous stuff, I think I am going to clean up everything, check the remaining drives and rebuild the NAS. I have to admit though, my trust in RAID and NAS is broken. Better to scatter everything everywhere and keep a treasure map! Hello cloud backup.

I do appreciate the offer though.

SweetAndLow · Jan 18, 2017

wcage03 said:
mjt5282: I am of a like mind... one dead, the promise is that the other two live on to repopulate the world... something else is going on here. I am more and more convinced that smart on the disks is almost useless. I clearly have one drive that won't even allow the system to boot and smartctl -a says all is well unless I do a deep scan; 5 hours later it tells me that there is a read error. I need to deep scan the other two to see if there is something else there that is a problem.

I have considered sticking them into another linux box to see if I can get the pool back, but, more importantly, I have been going down memory lane to remember what all was on the NAS; time machine backups of the various family computers, about 30k photos, about 15k songs, miscellaneous documents and lots of crap from the many years the thing has been in service. The backups are clearly not a problem (unless one of those computers goes down soon). Luckily I got a new Mac a year ago and imported all those photos into the new Mac Photos app (lucky save there) and I was considering moving onto Google Music a couple of years ago and uploaded all of that music there... turns out, it is still there. So other than the miscellaneous stuff, I think I am going to clean up everything, check the remaining drives and rebuild the NAS. I have to admit though, my trust in RAID and NAS is broken. Better to scatter everything everywhere and keep a treasure map! Hello cloud backup.

I do appreciate the offer though.

You are terrible at following directions. You need to provide all that information in the thread otherwise people are just guessing at stuff. Run smart long tests provide output of smartctl -a for all disks, output of zpool import -F newraid. Try with all disks and with missing disk, provide output. These are thing things i have been saying for a couple posts now but you haven't provided the information. I suspect your data is gone, good luck.

wcage03 · Jan 18, 2017

You are terrible at following directions. You need to provide all that information in the thread otherwise people are just guessing at stuff. Run smart long tests provide output of smartctl -a for all disks, output of zpool import -F newraid. Try with all disks and with missing disk, provide output. These are thing things i have been saying for a couple posts now but you haven't provided the information. I suspect your data is gone, good luck.

Hmmm... enjoy your evening.

Important Announcement for the TrueNAS Community.

Drive failure prevents boot up; replace drive loses pool

wcage03

Dabbler

Spearfoot

He of the long foot

SweetAndLow

Sweet'NASty

wcage03

Dabbler

wcage03

Dabbler

mjt5282

Contributor

Robert Trevellyan

Pony Wrangler

wcage03

Dabbler

SweetAndLow

Sweet'NASty

wcage03

Dabbler

SweetAndLow

Sweet'NASty

wcage03

Dabbler

wcage03

Dabbler

mjt5282

Contributor

wcage03

Dabbler

SweetAndLow

Sweet'NASty

wcage03

Dabbler

Similar threads

Important Announcement for the TrueNAS Community.

Drive failure prevents boot up; replace drive loses pool

Dabbler

He of the long foot

Sweet'NASty

Dabbler

Dabbler

Contributor

Pony Wrangler

Dabbler

Sweet'NASty

Dabbler

Sweet'NASty

Dabbler

Dabbler

Contributor

Dabbler

Sweet'NASty

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Drive failure prevents boot up; replace drive loses pool"

Similar threads