Cannot shut down after failed syslog/jail pool

Status
Not open for further replies.

indy

Patron
Joined
Dec 28, 2013
Messages
287
I got an unfriendly email today:

Code:
This message was generated by the smartd daemon running on:

   host name:  freenas
   DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/ada0, unable to open device

Device info:
STT_FTM64GX25H, S/N:P612102-MIBY-208A016, FW:1916, 64.0 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.


This is a single SSD hosting the pool that holds the syslog and the jails.
The SSD had been generating errors during scrubs for ages, but they were always repairable since copies=2 was set.
I always wanted to migrate the pool to a different SSD, but unfortunately I was too lazy to do it in time.
The drive seems to have failed completely now, but the loss of that pool is not a problem in itself:
I don't care about the contents of this pool, as my data is on another one.
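
For reference, copies=2 is an ordinary ZFS dataset property; a minimal sketch of how it would have been set (pool name "tank" taken from the status output below; note it only protects data written after the property is set):

Code:
# copies=2 stores two copies of every block; on a single disk this survives
# bad sectors but not whole-device failure (exactly what happened here)
zfs set copies=2 tank
zfs get copies tank    # verify the property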

However there are other problems:
Logging in via SSH fails at the login prompt; the web interface, however, still somewhat works.

What I have done so far:

1) Checked the zpool status via the web interface shell:
Code:
[root@freenas ~]# zpool status tank                                                                                                 
  pool: tank                                                                                                                        
 state: UNAVAIL                                                                                                                     
status: One or more devices are faulted in response to IO failures.                                                                 
action: Make sure the affected devices are connected, then run 'zpool clear'.                                                       
   see: http://illumos.org/msg/ZFS-8000-JQ                                                                                          
  scan: scrub repaired 12K in 0h0m with 0 errors on Fri Nov 28 01:00:54 2014                                                        
config:                                                                                                                             
                                                                                                                                    
        NAME               STATE     READ WRITE CKSUM                                                                               
        tank               UNAVAIL     15   102     0                                                                               
          102399081654048  REMOVED      0     0     0  was /dev/gptid/3fd35755-a16f-11e3-bbce-002590f062ca
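
For completeness: the suggested 'zpool clear' is unlikely to help with a completely dead device, but the command from the action line above would simply be:

Code:
# only meaningful if the device ever becomes reachable again
zpool clear tank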


2) Tried to move the syslog/system dataset to the functional pool.
Not sure whether that went through, since the web interface got stuck.
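
A way to check from a shell whether the move actually took effect (assuming the FreeNAS convention of a .system dataset at the root of whichever pool currently holds the system dataset):

Code:
# the pool name in front of "/.system" is the active system dataset pool
zfs list -o name | grep '\.system'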

3) Tried to reboot / shut down.
Nothing seems to happen after the shutdown message; I can still open up the web interface.
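
For reference, the console-level equivalents I would expect to behave the same way (standard FreeBSD commands; with processes stuck in disk wait they may hang at exactly the point the GUI does):

Code:
shutdown -p now    # graceful power-off via the normal shutdown sequence
# last resort: reboot quickly and ungracefully, skipping the shutdown
# scripts; this risks losing unsynced writes on still-imported pools
reboot -q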

4) Tried to lock the (encrypted) pool with my data on it, to prevent any faulty actions against that file system.
Again nothing seems to happen; the web interface gets stuck on the popup message.

5) Tried to shut down via the IPMI KVM console.
Again the system does not shut down as commanded.
Additionally, the console keeps throwing this error, which means the kernel failed to page part of a running process (tail) back in from disk:

Code:
vm_fault: pager read error, pid ##### (tail)



I would really appreciate help with shutting down the system gracefully.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Are you using RAID? Why does your pool show only one device?
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
The failed pool consisted of only one device, the failed SSD.
This pool contained the jails and the system dataset.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
I would not be surprised if something hangs while trying to parse the corrupt .system data for the GUI, or just waits for it forever.

Either way, it *should* have been handled by exceptions and whatnot, to make sure the system as a whole never becomes unreliable/unstable. In my opinion, of course.

Hope it works out.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Another thing I tried was shutting down the processes that were trying to access the failed pool.

[Screenshot: freenas.png, process list from the web interface]


For example I entered
Code:
/etc/rc.d/syslogd stop    # stop syslogd through its rc script
kill -KILL 7178           # then SIGKILL the stuck syslogd PID

into the web shell, but neither command does anything and the process refuses to die.
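
My guess as to why the kill has no effect: the process is stuck in uninterruptible disk wait on the dead pool, and a process in that state cannot take signals until its I/O returns. A quick check with standard FreeBSD ps (the PID is the one from the screenshot above):

Code:
# a "D" in the STAT column means disk wait; WCHAN shows what it sleeps on
ps -o pid,stat,wchan,command -p 7178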

Another idea of mine is to export the (encrypted) pool with my data on it via the console and redo the whole FreeNAS installation, but I would really appreciate some input on this.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I think redoing your configuration would solve the issue. It might be a fair amount of work, depending on how complicated your setup is. And since you have an encrypted pool, make sure you follow all the proper steps in the manual that relate to encryption.

Maybe just removing the failed pool and then rebooting would also fix the issue.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Do you have any advice on how I would safely unmount the healthy pool?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421

indy

Patron
Joined
Dec 28, 2013
Messages
287
The problem is that the system does not shut down and the web interface does not react.
The only thing still working is the console.

Can I safely use
Code:
zpool export vol1

from the console to export the healthy (encrypted) pool?
I have both the key and the recovery key saved, as per the manual.
Afterwards I would redo the whole installation and re-import the pool.
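
A sketch of the full sequence I have in mind, assuming the pool sits on GELI providers (the provider name below is a placeholder; the real ones show up in "geli status"):

Code:
zpool export vol1       # flushes and unmounts the pool
# optionally detach the GELI provider(s) afterwards; "da0p2.eli" is a
# placeholder name, not necessarily the actual device here
geli detach da0p2.eli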
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I think you can, but I have never bothered to export a pool.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You can export it that way, but you're kind of in a "weird place" because your .system dataset is fubared. You probably shouldn't have let the .system dataset get to that kind of "broke".

If worse comes to worst, just turn off the box, unplug the bad SSD, then boot the system back up. All should be straightened out.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
So, I ran "zpool export -f vol1" from the KVM console, which went through without error.
After that I tried to do the same for the failed pool (in the hope of unfreezing the system), and that locked up the last functioning console.
Since there were no options left anyway, I just switched the system off.
I redid the whole installation and setup on a new stick, since I did not really trust what was left from that debacle.
Anyhow, my data pool seems to have imported just fine!

The new pool for the system dataset, with fewer errors, more redundancy, and Intel SSDs this time :)
Code:
[root@freenas] ~# zpool status vol0
  pool: vol0
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Dec  3 19:30:53 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/69ed7a0f-7b16-11e4-b34b-002590f062ca  ONLINE       0     0     0
            gptid/6a11a547-7b16-11e4-b34b-002590f062ca  ONLINE       0     0     0

errors: No known data errors


Thank you guys for helping me out!
 