System unresponsive

JayG30 · Jun 19, 2017

Hello,

Came in today to find that over the weekend my server was unresponsive and I'm not sure what is going on. No SSH, web GUI, CIFS, NFS...not even a response to pings. A reboot brought it back.

Equipment is server grade.

FreeNAS-9.10.1-U2 (f045a8b)
Intel R2312GL4GS barebones server
128GB of RAM
2 x E5-2670 CPUs
6 x 4TB HDD on an LSI 2308 (IT firmware)
Intel S3710 SLOG on SATA port
32GB USB 3.0 Sandisk Ultra Fit

I also noticed that when I rebooted the server and ran zpool status, I got this:

Code:

[root@freenas] ~# zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed May 17 03:45:46 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/954aa8f3-02a9-11e7-b39b-001e67d50a3e  ONLINE       0     0     0

errors: No known data errors

  pool: store
state: ONLINE
  scan: scrub repaired 0 in 55h13m with 0 errors on Mon Jun 19 09:14:09 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        store                                         ONLINE       0     0     0
          raidz2-0                                    ONLINE       0     0     0
            da5p2                                     ONLINE       0     0     0
            da4p2                                     ONLINE       0     0     0
            da3p2                                     ONLINE       0     0     0
            da2p2                                     ONLINE       0     0     0
            da1p2                                     ONLINE       0     0     0
            da0p2                                     ONLINE       0     0     0
        logs
          gptid/86bc1495-0ab8-11e7-b9df-001e67d50a3e  ONLINE       0     0     0

errors: No known data errors

Notice the "scrub repaired 0 in 55h13m with 0 errors on Mon Jun 19 09:14:09 2017". The timestamp is when I rebooted the server, and 55h13m seems very long compared with how fast I remember scrubs being on this machine. It doesn't really hold much data.

Code:

[root@freenas] ~# zpool list
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  28.8G   658M  28.1G         -      -     2%  1.00x  ONLINE  -
store         21.8T  1.06T  20.7T         -     1%     4%  1.00x  ONLINE  /mnt
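
As a rough sanity check on that scrub time, here is a minimal sketch of the average rate implied by the two figures above. It assumes the "1.06T" ALLOC value is in TiB and that the scrub read all of it exactly once; the function name is made up for illustration. If the machine was hung for part of those 55 hours, the real rate while the scrub was actually running would have been higher.

```python
# Back-of-the-envelope: average scrub rate implied by the posted figures.
# Assumptions: "1.06T" means TiB, and the whole ALLOC amount was read once.

def scrub_rate_mib_per_s(alloc_tib: float, hours: int, minutes: int) -> float:
    """Average throughput in MiB/s for scrubbing alloc_tib in the given time."""
    seconds = hours * 3600 + minutes * 60
    alloc_mib = alloc_tib * 1024 * 1024  # TiB -> MiB
    return alloc_mib / seconds

rate = scrub_rate_mib_per_s(1.06, 55, 13)
print(f"{rate:.1f} MiB/s")
```

That works out to roughly 5.6 MiB/s across six disks, which is far too slow for a healthy scrub. More likely the scrub was stalled for most of that time.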

I did just alter the SMART and SCRUB schedules. This weekend a Short SMART test should have run, but NOT a SCRUB. I really don't know why a SCRUB is reported as having run, and I'm not sure whether ZFS thought one was scheduled. I think the SCRUB might have run because things got out of whack after I updated the schedule: it last ran on the old schedule and is now trying to switch to the new one (I have to look into this, because I think it will prevent the SCRUB from running when I want). Still, I wouldn't expect a SCRUB to lock the system up. However, I also noticed that the Short SMART test doesn't show as having run when I check with smartctl. It is supposed to run at 5am on Sunday, and I'm guessing the machine was already locked up at that point.

This is the only thing I could see on the screen (over IPMI).

This is what I see in dmesg.yesterday (which I guess is the only relevant dmesg)

Code:

[root@freenas] ~# vi /var/log/dmesg.yesterday
MCA: Bank 5, Status 0x8c00004000010091
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
MCA: CPU 0 COR (1) RD channel 1 memory error
MCA: Address 0x6578cfc40
MCA: Misc 0x40666686
MCA: Bank 11, Status 0x8c000046000800c3
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 46
MCA: CPU 30 COR (1) MS channel 3 memory error
MCA: Address 0x1b101545c0
MCA: Misc 0x908420002000c8c
Limiting closed port RST response from 250 to 200 packets/sec

dmesg.today is showing:

Code:

MCA: Bank 11, Status 0x8c000046000800c3
MCA: Global Cap 0x0000000001000c14, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 46
MCA: CPU 30 COR (1) MS channel 3 memory error
MCA: Address 0x1b101545c0
MCA: Misc 0x908420002000c8c
Limiting closed port RST response from 250 to 200 packets/sec
nfsd: can't register svc name
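
To keep track of which CPU and channel the corrected errors land on across several log files, a small script can help. This is only a sketch: the regular expression is written against the summary lines exactly as they appear in the log above, the sample text is copied from dmesg.yesterday, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Lines copied from the dmesg.yesterday output above.
DMESG = """\
MCA: Bank 5, Status 0x8c00004000010091
MCA: CPU 0 COR (1) RD channel 1 memory error
MCA: Address 0x6578cfc40
MCA: Bank 11, Status 0x8c000046000800c3
MCA: CPU 30 COR (1) MS channel 3 memory error
MCA: Address 0x1b101545c0
"""

# Matches the corrected (COR) memory error summary line as seen in this log.
PATTERN = re.compile(
    r"MCA: CPU (?P<cpu>\d+) COR \((?P<count>\d+)\) "
    r"(?P<kind>\w+) channel (?P<channel>\d+) memory error"
)

def tally_corrected(text: str) -> Counter:
    """Sum the reported corrected-error counts per (cpu, channel) pair."""
    totals = Counter()
    for m in PATTERN.finditer(text):
        key = (int(m["cpu"]), int(m["channel"]))
        totals[key] += int(m["count"])
    return totals

totals = tally_corrected(DMESG)
for (cpu, channel), n in sorted(totals.items()):
    print(f"CPU {cpu}, channel {channel}: {n} corrected error(s)")
```

Feeding it the contents of each dmesg file in turn would show whether the same CPU and channel keep coming up, which would point at a specific DIMM or slot.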

And here is a link to /var/log/messages since it is a bit long to post here (I don't think the .login_conf errors are related);
https://pastebin.com/pYqCSfzJ

Also, in the Intel BMC Web Console I saw this;

Code:

Event ID  Time Stamp  Sensor Name  Sensor Type  Description
442 05/14/2017 13:00:39 Mmry ECC Sensor Memory Correctable ECC. CPU: 2, DIMM: H1. - Asserted

Is there anywhere else I can look for clues? Anyone have an idea?

Thanks.

SweetAndLow · Jun 19, 2017

Looks like you built your pool incorrectly for FreeNAS. That is problem 1, and it leads me to believe you use the command line, and who knows what else you changed thinking it was better when really you just broke things.

A kernel panic usually means a hardware problem or a failing USB device. Since it survived a reboot, that eliminates a failing USB device, so I would lean towards hardware or some change you made on the CLI that broke things.

What modifications have you made?

Sent from my Nexus 5X using Tapatalk

JayG30 · Jun 19, 2017

I assume you are referring to the lack of gptid on the pool. Yes, I'm aware of that. But no, I did NOT build it through the CLI. I also never made a change to the system through the CLI except in 1 instance where I HAD to due to limited functionality in the GUI, and that wasn't even on THIS SERVER; it was on my other remote FreeNAS server (https://forums.freenas.org/index.php?threads/9-3-offline-drive-system-seems-to-get-confused.28363/). Anything else I've done on the CLI is to query status or read logs.

I could never figure out why, when I built the server using an old version of FreeNAS many MANY moons ago, it built itself like that. But I know for a FACT that when I built it I did so directly in the FreeNAS GUI and this was the result. Could have been a bug at the time, no idea, but please don't go blaming me for things I didn't do. It has been like that forever and has never been a problem. Even moving the disks around between slots didn't cause issues (which, from my understanding, is the whole purpose of the gptid values anyway). Unfortunately I couldn't just "rebuild" everything, but eventually I will migrate to a newly constructed pool, since the server has room for another 6 disks.
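
For reference, the members that are not referenced by gptid can be picked out of the zpool status config section with a few lines of script. This is a heuristic sketch only: the function and pattern names are made up, the sample lines are copied from the output in the first post, and it only knows about the vdev grouping names listed in the pattern.

```python
import re

# Config lines copied from the `zpool status` output for pool "store".
CONFIG = """\
NAME STATE READ WRITE CKSUM
store ONLINE 0 0 0
  raidz2-0 ONLINE 0 0 0
    da5p2 ONLINE 0 0 0
    da4p2 ONLINE 0 0 0
    da3p2 ONLINE 0 0 0
    da2p2 ONLINE 0 0 0
    da1p2 ONLINE 0 0 0
    da0p2 ONLINE 0 0 0
logs
  gptid/86bc1495-0ab8-11e7-b9df-001e67d50a3e ONLINE 0 0 0
"""

# Grouping rows (not leaf devices); an assumption covering common vdev types.
GROUPING = re.compile(r"^(raidz\d*|mirror|logs|cache|spares)(-\d+)?$")

def non_gptid_devices(config: str, pool: str) -> list:
    """Return leaf device names that are not referenced by gptid."""
    devices = []
    for line in config.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("NAME", pool):
            continue  # skip blanks, the header row, and the pool row itself
        name = fields[0]
        if GROUPING.match(name):
            continue  # skip raidz/mirror/logs grouping rows
        if not name.startswith("gptid/"):
            devices.append(name)
    return devices

print(non_gptid_devices(CONFIG, "store"))
```

On the output posted here it lists the six da*p2 partitions and leaves out the SLOG, which is already referenced by gptid.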

This system has been running for something like 5+ years. The pool was also moved from an old consumer grade PC/chassis to this server (about 2 years ago). I've had 2 "instances" in that time. First was a bad USB drive that happened probably 6 months ago, which I replaced with this one. That one also rebooted fine the first 1 or 2 times and then failed to boot. Would be very strange to have 2 USB drives fail so quickly after all that time without issue.

Jailer · Jun 19, 2017

JayG30 said:
Would be very strange to have 2 USB drives fail so quickly after all that time without issue.

Not really.

JayG30 · Jun 19, 2017

Jailer said:
Not really.

Seems strange to have 2 failures in like 6 months. Doesn't freenas run primarily in RAM once loaded?

SweetAndLow · Jun 19, 2017

JayG30 said:
Seems strange to have 2 failures in like 6 months. Doesn't freenas run primarily in RAM once loaded?

Yes but most people leave their system dataset and/or rrd data on the USB stick which kills them fast.

Sent from my Nexus 5X using Tapatalk

JayG30 · Jun 19, 2017

SweetAndLow said:
Yes but most people leave their system dataset and/or rrd data on the USB stick which kills them fast.


You're referring to the setting under System -> System Dataset -> System dataset pool correct?

Mine is NOT on the USB drive, it is on the pool.
Also, syslog and reporting database (RRD) are checked and should thus be stored on the system dataset pool. So I'd expect writes to be very very very limited.

JayG30 · Jun 19, 2017

I'm starting to think it is an error with RAM and/or DIMM slot on the board. The dmesg errors and IPMI logs seem to indicate a problem. And while I did some light testing of this hardware back when I put it into use I wasn't able to do a burn-in process like I normally would as it "had to be put into production ASAP".

I'm pretty sure I have 3 other servers, all with 128GB of the same RAM, that I can currently run extensive memtest on. If they test good, I'll swap that RAM into this server and then test the current modules as well.

SweetAndLow · Jun 19, 2017

JayG30 said:
You're referring to the setting under System -> System Dataset -> System dataset pool correct?

Mine is NOT on the USB drive, it is on the pool.
Also, syslog and reporting database (RRD) are checked and should thus be stored on the system dataset pool. So I'd expect writes to be very very very limited.

Yep! If those are checked then you probably don't have crazy writes happening.

Sent from my Nexus 5X using Tapatalk

Robert Trevellyan · Jun 19, 2017

JayG30 said:
MCA: CPU 0 COR (1) RD channel 1 memory error
...
MCA: CPU 30 COR (1) MS channel 3 memory error
...
442 05/14/2017 13:00:39 Mmry ECC Sensor Memory Correctable ECC. CPU: 2, DIMM: H1. - Asserted

Looks to me like you have some failing RAM.
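
The raw Status values in the first post say the same thing as the decoded lines. Below is a minimal sketch that pulls out a few fields, assuming the architectural IA32_MCi_STATUS layout from Intel's machine-check documentation (valid bit 63, uncorrected bit 61, corrected error count in bits 52:38, MCA error code in bits 15:0, memory controller errors encoded as 0000 0000 1MMM CCCC). The function name and dictionary keys are made up for illustration, and model-specific bits are ignored.

```python
def decode_mci_status(status: int) -> dict:
    """Decode a few architectural fields of an IA32_MCi_STATUS value."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_error_code": code,
    }
    # Memory controller errors use the compound form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        requests = {0: "generic", 1: "read", 2: "write",
                    3: "address/command", 4: "scrub"}
        info["memory_request"] = requests.get((code >> 4) & 0x7, "reserved")
        info["channel"] = code & 0xF  # 0xF would mean "channel not specified"
    return info

# The two Status values quoted from dmesg in the first post.
for raw in (0x8C00004000010091, 0x8C000046000800C3):
    print(hex(raw), decode_mci_status(raw))
```

Both values decode as valid, corrected (not uncorrected) memory errors with a count of 1: a read on channel 1 for bank 5 and a scrub on channel 3 for bank 11, matching the "RD channel 1" and "MS channel 3" lines.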
