Something is killing my FreeNAS - partially

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
Hi,
Hopefully details of my setup are in my sig. But the short version is that I have a FreeNAS and two ESX hosts (one normally turned off).
VMs are Domain Controllers, Firewalls (3), and a couple of application servers. Nothing strenuous.
The NAS provides iSCSI, NFS and SMB services.

Symptom:
I lose all SMB access. iSCSI keeps working, as the VMs all continue running; I am unsure about NFS, but I think it's good.
Restarting SMB does not work; only a complete reboot does, and that has its own issues. The server (NAS) won't reboot cleanly: it just hangs, as per the screenshot attached below.
When I log into the FreeNAS GUI, the Dashboard comes up blank and none of the other menu options do much. Triggering a reboot from the GUI does work, but the reboot complains about processes not dying and then just stops.

What I notice: all the shared folders just vanish.

After the NAS has been rebooted, further reboots are fine until the condition occurs again. The timing is irregular: sometimes it happens every day, and sometimes a few weeks go by before it happens.

I am not sure how to approach this

1602316317575.png



N.B. the 172.16.16.0/24 network in the screenshot is the dedicated iSCSI network
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
This is beyond my knowledge level for this type of problem (I have had no direct experience), but one thing you left out is the version of FreeNAS you are running. Please add that information, as it will likely help someone give you a good answer.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
Good spot
Version: FreeNAS-11.3-U4.1
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
This may not fix anything, but please upgrade to FreeNAS-11.3-U5. It was a fairly significant bug-fix release and also fixed some memory leaks. As usual, before any upgrade make a backup of your configuration file so that, in the event you must manually reinstall, you have the data.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
OK, upgrade done. I'll see if the issue re-occurs. I wasn't aware there was an update.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
Issue definitely reoccurs.
The only error I can see is at 05:52, where the NAS complains that the ESXi host is not responding to a ping and drops the connection. Except that the connection still works.
 

tripodal

Dabbler
Joined
Oct 8, 2020
Messages
19
Does this coincide with your system removing large snapshots?
What do you see with zpool list -o freeing, and gstat?
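
For anyone following along, here is a minimal Python sketch of how that check could be scripted. It assumes the text comes from `zpool list -H -p -o name,freeing` (no header, exact byte counts, one pool per line). The helper name `pools_still_freeing` is hypothetical, and the pool names and numbers in the sample are made up, not taken from the system in this thread.

```python
def pools_still_freeing(zpool_output):
    """Return {pool: bytes} for pools whose 'freeing' value is non-zero.

    Expects text shaped like the output of
    `zpool list -H -p -o name,freeing` (assumption: name and byte count
    separated by whitespace, one pool per line).
    """
    busy = {}
    for line in zpool_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, freeing = fields
        # Non-numeric values (for example "-") are ignored.
        if freeing.isdigit() and int(freeing) > 0:
            busy[name] = int(freeing)
    return busy


# Made-up sample data for illustration only.
sample = "tank\t0\nvmpool\t1073741824\n"
print(pools_still_freeing(sample))
```

An empty result would suggest that no pool is still releasing space from destroyed snapshots at the moment the command was run.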
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
My gut feeling is no; however, that doesn't mean I am right. I do use snapshots, but most of the snapshots are small due to a low volume of deltas. I have all the VMs off the iSCSI pools as it's a lot easier to reboot without them.

Now:
gstat shows all 0 with occasional twitches (I do not have a busy system).
zpool list -o freeing shows all 0.

It only happens overnight (that I have spotted), which is when various backups / snapshots / maintenance tasks occur.
Backups from FreeNAS to a Synology using Duplicati (in a jail) occur between 01:00 and 02:00 and succeed, but they don't use SMB.

I have disabled the snapshots and deleted all existing snapshots, which took about 30 seconds for 183 snapshots, during which I had full access to SMB. Let's see how that runs and whether the problem re-occurs over the next few days, after which I'll turn stuff back on again. Unfortunately the problem is irregular: it can occur every night, or not occur for a few weeks.

What I have noticed is that when I reboot the system while everything is working, there is a long list of PIDs that terminate, up to the point where the buffers sync.
When I reboot a system that is not working well, a bunch of stuff refuses to terminate, but I have no idea what. There is probably a log that tells me what - any ideas?

I do not know FreeBSD / Linux at all well, I am afraid.
 

tripodal

Dabbler
Joined
Oct 8, 2020
Messages
19
I've found some useful error messages in the files located in /var/log. Have you checked those out?
I'm not sure which would be relevant, if any.
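
One way to narrow those files down is to keep only the lines from a short window around the time of the failure. The sketch below is a rough illustration in Python, not a FreeNAS tool: it assumes traditional syslog-style lines that start with a "Mon DD HH:MM:SS" stamp and no year (so the year has to be supplied), and it assumes English month names. The helper name `lines_in_window` and the sample lines are made up.

```python
from datetime import datetime


def lines_in_window(log_lines, start, end, year):
    """Yield log lines whose timestamp falls within [start, end].

    Assumption: each line starts with a 15-character syslog stamp
    such as "Oct  9 05:52:13". Lines that do not parse are skipped.
    """
    for line in log_lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if start <= stamp <= end:
            yield line


# Made-up sample lines; a real run would read a file under /var/log.
sample = [
    "Oct  9 05:40:00 nas example: message before the window",
    "Oct  9 05:52:13 nas example: message inside the window",
    "Oct  9 07:00:00 nas example: message after the window",
]
window = (datetime(2020, 10, 9, 5, 45), datetime(2020, 10, 9, 6, 0))
for hit in lines_in_window(sample, window[0], window[1], 2020):
    print(hit)
```

Running the same window over several log files in turn makes it easier to see which of them recorded anything at all around the failure.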
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
I just need it to happen again, and since deleting the snapshots it hasn't gone wrong (yet). I'll give it a week and then re-enable the snapshots, but they are on a 2 week cycle, so maybe in three weeks' time.
 

tripodal

Dabbler
Joined
Oct 8, 2020
Messages
19
Are you using encryption? My issue, which sounded similar, seems to have hinged on pool encryption.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
I am not. It's all fairly vanilla.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
OK - happened again. SMB completely unresponsive, (iSCSI & NFS fine)
Reboot complains about a load of PIDs not stopping. Gets to buffers sync at which point I have to reset.
Well, the logs I am looking at seem less than useful. Is there any guidance as to what I should be looking at? There seems to be a very big gap around the moment I spotted the issue. I made a copy of /var/log after I rebooted - maybe I should have done that beforehand?
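
A gap like that can at least be pinned down precisely. The following Python sketch lists the places where consecutive log lines are further apart than a chosen threshold. It assumes syslog-style lines beginning with "Mon DD HH:MM:SS" and no year, takes the year as a parameter, and assumes English month names. `find_gaps`, the 30-minute default and the sample lines are all hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(log_lines, year, min_gap=timedelta(minutes=30)):
    """Return (previous_stamp, next_stamp) pairs at least min_gap apart.

    Assumption: each line starts with a 15-character syslog stamp.
    Lines that do not parse are skipped.
    """
    gaps = []
    previous = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Made-up sample lines for illustration only.
sample = [
    "Oct 20 18:05:00 nas example: message",
    "Oct 20 18:10:00 nas example: message",
    "Oct 20 21:02:00 nas example: first message after a long silence",
]
for before, after in find_gaps(sample, 2020):
    print(f"no entries between {before} and {after}")
```

The last line before a gap and the first line after it are usually the ones worth reading closely, and a copy of /var/log taken before the reboot would show whether anything was written in between.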
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
I did turn the snapshots back on again. However, they are on a 2 week cycle (and had been running for 2 days) and happen around 4:00 AM. My NAS "crashed" between 6:00 PM and 9:00 PM. I don't think it's snapshots.

I'll change one thing tonight and see if that has an effect.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
Still happening from time to time (most recently last night) - this morning all the SMB shares had vanished and the NAS needed rebooting. Restarting the services does not work. Also, the logs I have looked at don't seem useful.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
Hmm, I found a duff SFP+ on the network. It was generating a boatload of Rx FCS errors. I swapped the unit out and haven't had an issue since, but I am keeping an eye on things. It was on the link between the garage and the house, so this still doesn't entirely make sense. There are still a few errors on the link, so I shall swap the unit on the other end tomorrow, and if that doesn't help I'll swap the fibre over.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,943
I think it would be better to say I bypassed it. I found a set of circumstances that seemed to cause the issue, and I avoid them like the plague.
Swapping the SFP+ seems to have helped, although the location of the SFP+ didn't lend itself to causing the issues I was finding.

I have also upgraded to TrueNAS since, so I don't know if the issue would re-occur if I tried again. (It's on my list to do.)

BTW - the circumstances / conditions I found were definitely weird. I have a rack with 2 ESXi servers and a TrueNAS. The TrueNAS supplies NFS, iSCSI and SMB services. Note that iSCSI is on a separate NIC/VLAN on ESXi & FreeNAS/TrueNAS. That VLAN only exists between the three boxes.
If I attempted to attach an external (across the iffy fibre link) iSCSI target to the ESXi boxes, not using the iSCSI VLAN, then FreeNAS threw a hissy fit and SMB died.

Go on - explain that one

I only wanted some swing storage for VMs in case I wanted to do some maintenance work on the NAS, so I stuffed some cheap storage into the working ESXi box and can run the critical VMs off that if I need to whilst I reboot the NAS. I might try again with NFS - I have an SSD store on the main LAN for swing storage that is now unused - maybe NFS?
 