Sudden loss of network at 3am?

Status
Not open for further replies.

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
I'm hoping it's just a fluke, since our previous switch stopped working a few days ago and the new switch was only just put in place.

For unknown reasons, at 3:00 am - pretty much on the dot - all network connectivity, save the IPMI (thank you, IPMI!), totally stopped on our FreeNAS box. Both our Chelsio NICs and the onboard NIC serving as the management connection just stopped moving traffic. I could see iSCSI connection errors via the remote terminal window. No additional work was being done on the newly installed switch - it had been in place for 10 hours without any problems - and no other configuration changes were made on either the FreeNAS box or the ESXi hosts.

Rebooting, of course, fixed the problem - and unlike the reboots we've done in the past with the ctl issues prior to the 11/28 fixes, we could scroll back through the reporting and see all the activity up to 3:00 am, when it just went empty until our reboot.

No scrubs, cron tasks (that I'm aware of), or snapshots were scheduled for that time. We didn't have any power events that I could discern. Since the outage was across all network interfaces (save IPMI), it's not the Chelsio NICs themselves.

I'll be looking into our switch error logs later today, but figured I'd put this out there as a preliminary "anyone else seen this problem" sort of deal.

System details:
Supermicro X9DRD-7LNF-JBOD
192 GB RAM (yes, ECC)
two Chelsio 10GbE NICs - each running separate VLANs for iSCSI targets to ESXi hosts
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
There's some housekeeping going on at 3 am every day even if there isn't any scrub or the like scheduled. However, I don't know why you have this problem - I've never had it and I don't see what the cause could be, sorry.
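For anyone wondering what that housekeeping is: stock FreeBSD drives its nightly maintenance from the system crontab, and the default entry fires at 3:01 am, which lines up suspiciously well with the timing here. A quick sketch (the crontab line below is the FreeBSD default; on the box itself you'd just run `grep periodic /etc/crontab`):

```shell
# FreeBSD's nightly housekeeping is driven by /etc/crontab.
# The stock entry below is the FreeBSD default: minute 1, hour 3.
entry='1 3 * * * root periodic daily'
min=$(printf '%s\n' "$entry" | awk '{print $1}')    # crontab minute field
hour=$(printf '%s\n' "$entry" | awk '{print $2}')   # crontab hour field
echo "periodic daily fires at ${hour}:0${min} every day"
```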
 

dlavigne

Guest
It's interesting that that is happening on that hardware... Anything useful in /var/log/messages?
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Nothing useful in the logs - just that it loses connection to the iSCSI hosts, and then nothing else. I'm starting to wonder if this is a continuation of that bug with iSCSI ctl - I had another freeze/reboot yesterday evening at 6 pm after doing a lot of file transfers to volumes on that system (about 1 TB). Aside from one L2 error, the system had been stable for the 4 weeks since the Nov. 28th updates.

The only thing is that since the reboots from the hung state don't leave any log information behind, it's a bit difficult to figure out what exactly the issue could be. The 11/28 update did stabilize the system significantly - only one reboot in a month compared to one a week - so I'm wondering if things are only worked around instead of fixed.
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
I know this is now hella old, but I came across something in the forums that gives me hope - I'm thinking it could be my Chelsio T420-CRs that are the problem.

At first I thought about nmbclusters, but seeing it set to 16,276,940, running out is clearly not the problem (at least I hope not). So maybe it's just a bad card. I'm going to try replacing the CPUs and updating to 9.10 this weekend to see if that helps - then replace the T420s with T520s in a week or so and hope.
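For scale, that max is worth converting into memory. A back-of-envelope sketch, assuming the standard 2 KB mbuf cluster size, shows why a 16,276,940-cluster ceiling looks alarming even if it's never reached:

```shell
# Worst-case wired kernel memory if the mbuf cluster pool ever fills.
# Assumes the standard 2048-byte mbuf cluster size.
nmbclusters=16276940   # the autotuned max quoted above
cluster_bytes=2048
max_gib=$(awk -v n="$nmbclusters" -v b="$cluster_bytes" \
    'BEGIN { printf "%.1f", n * b / (1024 * 1024 * 1024) }')
echo "worst-case mbuf cluster memory: ${max_gib} GiB"
```

That's roughly 31 GiB of kernel memory reserved as a ceiling on a storage box, which is the kind of number that makes autotuning worth double-checking.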
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That seems like way too many mbuf clusters - what does netstat -m report?
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Under normal usage - i.e. our backup server not trying to do a 2.24TB base image - this is what it looks like:

Code:
24790/20345/45135 mbufs in use (current/cache/total)
50000/13152/63152/16276940 mbuf clusters in use (current/cache/total/max)
24560/3690 mbuf+clusters out of packet secondary zone in use (current/cache)
1/12780/12781/8138470 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/2411398 9k jumbo clusters in use (current/cache/total/max)
256/0/256/1356411 16k jumbo clusters in use (current/cache/total/max)
110297K/82510K/192807K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
22 requests for I/O initiated by sendfile
0 calls to protocol drain routines
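The lines worth pulling out of that wall of numbers are the denied/delayed counters - nonzero values there mean the kernel actually failed or stalled mbuf allocations. A small filter sketch (fed here with the counters pasted above so it can be tried anywhere; on the live box you'd pipe `netstat -m` straight into the awk):

```shell
# Flag any nonzero denied/delayed counters in netstat -m output.
# Live usage (assumes the same output format as above):
#   netstat -m | awk '/denied|delayed/ { n = split($1, c, "/");
#       for (i = 1; i <= n; i++) if (c[i] + 0 > 0) { print; next } }'
sample='0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed'
problems=$(printf '%s\n' "$sample" | awk '/denied|delayed/ {
    n = split($1, c, "/")
    for (i = 1; i <= n; i++) if (c[i] + 0 > 0) { print; next } }')
if [ -z "$problems" ]; then
  echo "no mbuf allocation failures recorded"
else
  printf '%s\n' "$problems"
fi
```

With the output above, every counter is zero, so at the moment this snapshot was taken the system was not actually running out of clusters - consistent with the "under normal usage" caveat.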
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Our freezing looks sort of like this:

* Everything's working fine, then suddenly the terminal becomes unresponsive and we lose network connectivity to the unit. Watchdog, thankfully, reboots the system after it's been stalled for 15 minutes if no one's around to use IPMI to reset it.

-- it's been hard to find an external smoking gun for this, but it usually happens after a lot of network read activity - like our backup system running a base image of the file store, or the time we transferred 4 TB of our archive data onto a CIFS share on that unit.

* We get about a 3-5 minute delay during the boot process when we see "KDB: debugger backends: ddb" / "KDB: current backend: ddb".

-- when we were test-building this system with minimal storage and memory, this stage went pretty fast. My guess is that this part of boot can take longer for larger disk arrays / memory configurations (we've got 262 GB).
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Okay, so I think it's the motherboard. We swapped out the Chelsio T420s for T520s, and during that time...

* When we connected one T520 to PCI slot 2, an old T420 to slot 3, and the other T520 to slot 4, the NIC in slot 2 didn't power up. No indicator lights.
-- added note: we do see cxl0 and cxl1 show up as unassigned network interfaces, so a new T520 is at least loading when it's got power.

* We swapped hardware around to see if it was just a dead T520, but it looks like if we move the cards to just slots 3 and 4 (leaving slot 2 empty), everything powers up normally.

* However, when we do that, neither of the two new T520s shows up during boot - checking dmesg, it shows the motherboard's PCI slots being scanned and drivers being loaded, but nothing is found in either of the slots the T520s are plugged into. They're powered, they've got green indicator lights on the hardware, but the OS doesn't even see them.
-- when the Chelsio T420s were in, the Chelsio BIOS message would come up during boot. Not so much with the T520s.

So I'm going back to the T420s, putting them into slots 3 & 4 - avoiding slot 2 - and letting that sit for a while to see if the freezing continues. If it does, I'll swap in a new mobo and... honestly, I hope that does it.
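One cheap way to separate "board/slot problem" from "driver problem" in a case like this is to check whether PCI bus enumeration sees the card at all, independent of the driver. A sketch (the pciconf/dmesg commands are what you'd run on the live box; the sample text below is a hypothetical pciconf fragment, not output from this system, just so the grep logic can be exercised anywhere):

```shell
# On the live FreeBSD system, compare bus-level enumeration vs. driver attach:
#   pciconf -lv | grep -B3 -i chelsio    # does the PCI scan see the card?
#   dmesg | grep -i cxgb                 # did the driver try to attach?
# Hypothetical pciconf -lv fragment to exercise the check:
sample="t5nex0@pci0:3:0:4: class=0x020000 chip=0x54011425
    vendor = 'Chelsio Communications Inc'
    device = 'T520-CR Unified Wire Ethernet Controller'"
if printf '%s\n' "$sample" | grep -qi chelsio; then
  echo "card enumerated on the PCI bus"
else
  echo "card absent from the PCI scan (points at slot/board, not driver)"
fi
```

If pciconf shows nothing in a powered slot, the OS never got the chance to load a driver, which fits the motherboard theory better than a bad card.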
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Quick follow-up - the hang is only happening about once every 20-30 days, and all signs - as jgreco mentioned - point to nmbclusters, so I'm setting it to a much lower value and seeing where it goes from there.
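For anyone landing here later: kern.ipc.nmbclusters is a boot-time tunable, so it can't just be changed on the fly with sysctl. A sketch of the change, assuming a FreeNAS 9.x-era GUI (the 262144 value is only an illustrative example, not a recommendation from this thread):

```shell
# kern.ipc.nmbclusters is a loader tunable: set it in the FreeNAS GUI under
# System -> Tunables (type "Loader"), which is equivalent to this
# /boot/loader.conf line:
#
#   kern.ipc.nmbclusters="262144"    # example value: ~512 MiB at 2 KB/cluster
#
# Then reboot and confirm the new ceiling took effect:
#   sysctl kern.ipc.nmbclusters
```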
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Holy crap, that's a lotta mbuf clusters. I'm more used to seeing mbuf underallocation issues (imagine massive numbers of broadband-speed TCP connections), but the network going deaf and basic userland wedging are both symptoms that can result from that. On one hand I'm just a wee bit skeptical that overallocation is actually a problem, and it strikes me as odd that the symptoms are the same. On the other hand, it just suggests to me that tuning is still an arcane art.

I apologize for missing your previous replies, don't know what happened.
 