Melvil Dui
Cadet
- Joined: Jun 21, 2018
- Messages: 8
I ran into this a short while ago, and thought I should share the story in case anyone else has a similar issue.
An alert reported that accessing the FreeNAS device over NFS was not working. Basically the whole NFS mount had become a black hole.
Attempting to access the web interface on the FreeNAS was fruitless.
But ssh was enabled, and (more specifically) root login over ssh was allowed. The ssh service was responsive, but after accepting the password the session just hung.
Using my years of Unix experience, I guessed that this ssh symptom was a classic case of trying to access a bad filesystem during the login process. So I tried the age-old backdoor for logging in without going through the normal login process: ftp. Well, specifically sftp. That worked.
I quickly downloaded /etc/profile and root's .profile / .bash* files to check what they were doing. Nothing seemed odd there, so next I turned my attention to /var/log/wtmp. It turns out that FreeNAS doesn't use wtmp, it uses utx, which is also in /var/log/. And it was the /var/log/ mount that was causing the problems: any cd into /var/log hung, and I had to kill the sftp connection and start over.
While I was doing this, someone physically closer to the machine logged in on the console and, out of habit, did "cd /var/log/". That hung the console too.
Inspection (sftp's "ls -l") showed /var/log was a symlink into a different pool, and cd into that pool also hung (kill the sftp connection and start over again). Critically, because I had a root sftp connection, I was able to "cd /var", "rename log log.BAD", "mkdir log".
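Every hang above came from touching a path on the broken pool directly from the session I cared about. As a rough sketch (not something I ran during the incident), the same check can be done from a throwaway child process with a timeout, so a wedged filesystem blocks the child instead of your shell. The function name `probe_path` and the 5 second default are my own choices for illustration.

```python
import subprocess
import sys

# Child program: a single stat() call on the path passed as argv[1].
_CHILD = "import os, sys; os.stat(sys.argv[1])"


def probe_path(path, timeout=5.0):
    """Return "ok", "error", or "hung" for a stat() of path.

    The stat runs in a child process, so a filesystem that never
    answers blocks the child rather than the caller.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Request a kill, but do not wait for the child afterwards:
        # a process stuck in uninterruptible I/O may never exit.
        proc.kill()
        return "hung"
    # Non-zero exit means stat() raised (missing path, permissions, ...).
    return "ok" if code == 0 else "error"


if __name__ == "__main__":
    for candidate in sys.argv[1:]:
        print(candidate, probe_path(candidate))
```

Run with something like "python probe.py /var /var/log", a "hung" result points at the mount to avoid. This assumes a Python interpreter is available on a filesystem that still works.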
After this, regular ssh worked, creating a new /var/log/utx file. And then I could run "zpool status" and find that 21 disks of 55 in the main pool were all marked as "OFFLINE". That specific number of disks was the big clue that the commonality was everything on one side of the disk shelf.
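Counting 21 OFFLINE rows by eye in a 55-disk listing is tedious, so here is a minimal sketch of tallying device states from saved "zpool status" text. The `SAMPLE` text is made up for illustration (it is not the output from this incident), the function name is hypothetical, and the parser assumes the usual indented NAME/STATE table for a single pool, treating a row as a disk when no deeper-indented row follows it.

```python
from collections import Counter

# Made-up sample in the general shape of "zpool status" output.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            da0     ONLINE       0     0     0
            da1     OFFLINE      0     0     0
            da2     OFFLINE      0     0     0
            da3     ONLINE       0     0     0

errors: No known data errors
"""


def leaf_device_states(status_text):
    """Count STATE values for leaf rows (disks) of the first pool listed."""
    rows = []  # (indent, state) for each row of the config table
    in_config = False
    for raw in status_text.splitlines():
        line = raw.expandtabs(8)
        fields = line.split()
        if not in_config:
            # The table starts right after the NAME STATE header row.
            in_config = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the table
        if len(fields) >= 2:
            indent = len(line) - len(line.lstrip())
            rows.append((indent, fields[1]))
    counts = Counter()
    for i, (indent, state) in enumerate(rows):
        # A row is a leaf when the next row is not nested beneath it.
        if i + 1 == len(rows) or rows[i + 1][0] <= indent:
            counts[state] += 1
    return counts


if __name__ == "__main__":
    print(dict(leaf_device_states(SAMPLE)))
```

Pool and vdev rows are skipped because a deeper-indented row follows them, so only the disks themselves are counted.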
The system was shut down, the shelf was swapped, and everything has been working fine for more than a week. After the restore, I checked the log entries just prior to the issue, and there was nothing out of the ordinary. When it happened, it was a sudden and catastrophic loss of access to more than a third of the pool.
Key takeaways:
* Forwarding logs to another system is very useful
* sftp works differently from ssh (it skips the normal login process), which makes it useful when ssh hangs
* root ssh is a good thing to have in an emergency (but has other risks)
* Replacing /var/log/ while a system is running works almost surprisingly well