Console dead, SSH dead but VMs and Jails still work

Status
Not open for further replies.

Thomas NAS

Cadet
Joined
Jun 29, 2017
Messages
3
Hi -

Using the FreeNAS-11-STABLE train and up to date, a box which had been working for years has suddenly started behaving very oddly.

Some time after boot, the console becomes non-responsive. Pressing 9 to get a shell will just display the nine ... and then nothing else happens. Around the same time, the box stops responding to SSH and to SMB requests.

Oddly, a Plex jail (installed with the current plug-in) continues to work just fine, and a CentOS VM also continues to work just fine. FWIW, the VM has a disk mounted via NFS from the underlying machine, and that is still readable/writable. It's almost as if the FreeNAS instance can no longer start new processes while existing ones are unaffected.

Any ideas here?


The box has been working with FreeNAS since ~2013, starting with 8 or so and continuing with Corral and earlier versions of FreeNAS 11. The only recent change was to put a few additional hard discs and a controller card for them back into the box. These discs and the controller card had been in the box previously (circa 2015/early 2016) and were working perfectly under the production version of FreeNAS available at that time.

This has now happened twice. The first time, after it had been in this state for a few days, I shut down the VM (either via SSH or VNC, I can't recall) and then hit the power button on the box a few minutes later. All ZFS pools came up on reboot without complaint. I chalked it up to a cosmic ray or some similar one-off and hoped that was that. A few days later (after it had been up and running since that initial forced power off & reboot), the same symptoms have reappeared.

The only hint of a hardware problem has been sporadic complaints about one disc or another, but they're on the order of 1 per day per disc and this particular hardware has always displayed such errors. (Cheap SATA cables maybe? -- but it wasn't any different in the days and weeks before these symptoms started than it is now in the "working" period after a power off & reboot.)

Does anyone have advice on how I can get the system to provide useful info to diagnose the problem?
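A rough sketch of one way to collect that kind of information, offered as an assumption rather than a known FreeNAS tool: a small Python probe, left running from a session started while the box is still healthy, that periodically tries to start a new process and to read a file, and appends the result to a log on a data pool. The function names and both paths in the usage comment are hypothetical. One caveat: if a read from a failing device hangs inside the kernel, the probe itself may block, in which case the last line in the log shows roughly when things went wrong.

```python
import subprocess
import sys
import time


def probe_once(read_path, timeout=10.0):
    """Check whether a new process can start and whether read_path is readable."""
    result = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "spawn_ok": False,
        "read_ok": False,
    }
    # Step 1: try to start a trivial child process (a Python that does nothing).
    try:
        done = subprocess.run([sys.executable, "-c", "pass"], timeout=timeout)
        result["spawn_ok"] = done.returncode == 0
    except (OSError, subprocess.TimeoutExpired) as exc:
        result["spawn_error"] = repr(exc)
    # Step 2: try to read a few KB from a file (ideally one on the boot device).
    try:
        with open(read_path, "rb") as fh:
            fh.read(4096)
        result["read_ok"] = True
    except OSError as exc:
        result["read_error"] = repr(exc)
    return result


def run_probe(log_path, read_path, interval=60.0, count=None):
    """Append one probe result per interval to log_path; count=None runs forever."""
    done = 0
    while count is None or done < count:
        with open(log_path, "a") as log:
            log.write(repr(probe_once(read_path)) + "\n")
        done += 1
        if count is None or done < count:
            time.sleep(interval)


# Hypothetical usage (both paths are placeholders, not taken from this system):
# run_probe("/mnt/tank/probe.log", "/etc/rc.conf", interval=60.0)
```

Writing the log to a data pool rather than the boot device is deliberate: it keeps the evidence readable even if the boot device is the part that is failing.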

Thanks
Thomas
(Mods, please feel free to move if this isn't in the best place)
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Do you have your gateway address entered on the Network > Global Configuration page in the GUI? Is the same address also entered in the Nameserver fields? (I recall this was recommended some months ago, I think as a potential remedy for a situation like the one you are reporting.) If the box's address is in your router's DHCP range, is it reserved?
 

Thomas NAS

Cadet
Joined
Jun 29, 2017
Messages
3
Redcoat said: "Do you have your gateway address entered on the Network, Global Configuration, page in the GUI? Also same in the Nameserver fields (I recall this was a recommendation some months ago, I think for a potential remedy to a situation such as you are reporting). If it's in your router's DHCP address range, is it reserved?"

Thanks for the quick response. It's a static address, outside of my router's DHCP range.

Without crashing the box again, I can't get to the GUI to be _absolutely_ certain, but I'm 99% certain that I do have the gateway address entered on the Network screen, as that's not something I would have left blank. I used to earn my living doing network administration, so I'm sensitive to network fields. :) As for the nameserver field, I might have either the Google DNS servers or the OpenDNS servers or one of each instead of or in addition to 192.168.0.1. (My ISP is CenturyLink and their DSL modem/gateway forces unresolved queries off to their advertising servers regardless of the settings that I enter in the gateway, which is why I avoid having some of my boxes use it.)

Any ideas about what I can do or try now, before crashing the box? As it's generally working, I'm tempted to leave it as-is and just live with the VM's services and the Plex jail for an indefinite period. Alternatively, is there a minimum time (30 seconds? 5 minutes? 1 hour?) that anyone wants to recommend to let the ZFS pools quiesce after shutting down the VM and disconnecting from all Plex clients before killing the power to the box?

Thanks
Thomas
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
First guess would be a failing boot device. But alas, that would only be a guess, because you haven't shared the hardware you are running on...
 

Thomas NAS

Cadet
Joined
Jun 29, 2017
Messages
3
@Jailer I think you got it. While it's impossible to prove a negative, it hasn't gone catatonic since I shut the box down, cloned the boot media (via dd, on a nearby Linux box) to a known-good USB stick, and rebooted with the clone.
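For anyone repeating the clone step, here is a minimal Python sketch of a dd-style copy with a verification pass added. It is an illustration only, not the exact procedure used above: the function name and the device paths in the usage comment are hypothetical, and it assumes the destination is at least as large as the source. The read-back may be served from the OS cache rather than the stick itself, so treat a match as a sanity check rather than proof.

```python
import hashlib
import os


def clone_and_verify(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block, then re-read dst and compare SHA-256 digests.

    Returns (bytes_copied, digests_match).
    """
    src_hash = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            src_hash.update(block)
            copied += len(block)
        # Push buffered data out to the device before verifying.
        dst.flush()
        os.fsync(dst.fileno())

    # Re-read only as many bytes as were written; the target may be larger.
    dst_hash = hashlib.sha256()
    remaining = copied
    with open(dst_path, "rb") as dst:
        while remaining > 0:
            block = dst.read(min(block_size, remaining))
            if not block:
                break
            dst_hash.update(block)
            remaining -= len(block)

    return copied, src_hash.hexdigest() == dst_hash.hexdigest()


# Hypothetical usage on a Linux box (device names are placeholders; writing to
# the wrong device destroys its contents, so double-check them first):
# copied, ok = clone_and_verify("/dev/sdX", "/dev/sdY")
```

A read error raised partway through the source pass would itself be a useful hint that the old boot media is failing.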

FWIW, the hardware is an AMD FX-6300 with 16 GB of RAM hosting (6) 4TB discs in RAID-Z2, (4) 2TB discs in a striped, mirrored pool, and a pair of SSDs in a mirrored pool.

The media should, in retrospect, have been suspect. When I upgraded to FreeNAS 11, I'd swapped boot media. At the time, the flash media closest to hand was a 16 GB SD card and a media reader. It seems that looking farther afield for a real USB stick would have been a better idea. :)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Thomas NAS said: "At the time, the flash media closest to hand was a 16 GB SD card and a media reader. It seems that looking farther afield for a real USB stick would have been a better idea. :)"
Yeah, SD media doesn't seem to fare too well for a FreeNAS installation, at least not from what others have posted. Also, next time reboot before upgrading to rule out a bad USB boot drive. If it survives a reboot unscathed, it will likely survive an update.
 