FreeNAS components crashing

Status
Not open for further replies.

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
Hi,

I recently upgraded the hardware of my FreeNAS system to proper server components.
My current setup is:
- Supermicro A1SRi-2758F Board with onboard C2758-CPU
- 16GB ECC-RAM
- FreeNAS-9.3-STABLE-201508250051

Since the upgrade, which I did a few days ago, system services seem to crash randomly. The effects so far include:
- Web UI becoming unavailable
- Error message via mail: The volume Tank (ZFS) state is UNKNOWN
- Server becoming unavailable via SSH

/var/log/messages looks like this:
Code:
Aug 27 00:22:51 freenas kernel: pid 14761 (sh), uid 0: exited on signal 11
Aug 27 00:56:59 freenas kernel: pid 15535 (ls), uid 0: exited on signal 11
Aug 27 01:37:08 freenas kernel: pid 16497 (python2.7), uid 0: exited on signal 11
Aug 27 03:07:50 freenas kernel: pid 20143 (sendmail), uid 25: exited on signal 11
Aug 27 03:14:32 freenas kernel: pid 20289 (sh), uid 0: exited on signal 11
Aug 27 03:31:01 freenas kernel: pid 20772 (python2.7), uid 0: exited on signal 11
Aug 27 03:50:47 freenas kernel: pid 7112 (btsync), uid 817: exited on signal 11
Aug 27 06:32:21 freenas kernel: pid 24963 (sh), uid 0: exited on signal 11
Aug 27 07:52:42 freenas kernel: pid 26759 (md5), uid 0: exited on signal 11
Aug 27 10:21:20 freenas kernel: pid 30436 (sh), uid 0: exited on signal 11
Aug 27 11:58:44 freenas kernel: pid 32626 (ps), uid 0: exited on signal 11
Aug 27 13:35:08 freenas kernel: pid 34846 (sh), uid 0: exited on signal 11
Aug 27 13:39:09 freenas kernel: pid 34921 (ps), uid 0: exited on signal 11
Aug 27 13:57:09 freenas kernel: pid 5496 (btsync), uid 817: exited on signal 11
Aug 27 14:50:27 freenas kernel: pid 36532 (sh), uid 0: exited on signal 11
Aug 27 15:05:36 freenas kernel: pid 3445 (collectd), uid 0: exited on signal 11 (core dumped)
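
To see at a glance which programs are affected, the kernel lines above can be tallied by process name. This is only a minimal sketch in Python, assuming the log lines look exactly like the excerpt above; `count_signal11` and `count_in_file` are names made up for this example, not part of FreeNAS.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Aug 27 00:22:51 freenas kernel: pid 14761 (sh), uid 0: exited on signal 11
CRASH_RE = re.compile(
    r"kernel: pid \d+ \((?P<name>[^)]+)\), uid \d+: exited on signal 11\b"
)


def count_signal11(lines):
    """Return a Counter mapping process name -> number of signal 11 exits."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def count_in_file(path="/var/log/messages"):
    """Tally signal 11 exits in a log file (the path is the one used in this thread)."""
    with open(path) as handle:
        return count_signal11(handle)
```

If the tally is spread over unrelated programs (sh, ls, ps, md5, python2.7 and so on, as in the excerpt), that points away from a bug in any single program.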


I couldn't find anyone else experiencing this problem. The last error I noticed was the "volume state is UNKNOWN" one, at 14:50.

Could this be a problem with the new hardware components? Should I replace them?
Is there any other information that would help solve the problem?

I would greatly appreciate some help.
 

dlavigne

Guest
Signal 11 is usually hardware, often RAM. Have you done a memtest? Checked that everything is seated firmly?
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
I ran memtest the whole night and it didn't report any errors.

1.jpg


Is there anything else I can do to find the cause of the crashing programs?
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
Memtest has been running for 18 hours now, still no errors.
Could it still be the memory or is something else more likely?
 
Joined
Oct 2, 2014
Messages
925
How many drives do you have, and what size is your pool? Those were left out of your HW specs.
 
Joined
Oct 2, 2014
Messages
925
Stefan Bonfert said:
I am using 4x 2TB WD Red HDDs, configured as a RAIDZ1 volume.

Hmm, and were these drives thoroughly tested before being put into use? What kind of case/chassis and cooling do you have?
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
I have been using these drives for 1.5 years and did not replace them during the upgrade of my system. I never had any problems with them.
As a chassis I use a Fractal Design Node 304 with the cooling that came with it.
That is a large fan at the back of the case and two smaller ones at the front.
The CPU is cooled passively with the preinstalled heat sink.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You need to get some active cooling on that CPU; it's likely overheating. The passive heat sink was designed for the high-airflow environment of a server chassis.
 
Joined
Oct 2, 2014
Messages
925
I would get some kind of cooler for that passive heatsink. As @Jailer said, those heatsinks are passive and designed to be cooled by fans that blow over/through them.
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
I just ran 8 instances of
Code:
yes > /dev/null
for several hours, fully utilizing all available CPU cores.
The CPU temperature went up to about 90°C, which seems a bit high. Therefore I am going to upgrade the cooling in the coming days.

However, there were no errors during this time. So I don't think the errors were caused by the CPU overheating, because during normal operation the load is below 10%.
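
For anyone repeating this kind of test, the temperature can be logged from a script instead of being read by hand. A rough sketch, assuming FreeBSD's coretemp(4) driver is loaded so that the `dev.cpu.N.temperature` sysctl exists and prints a value like `45.0C`; the function names are invented for this example.

```python
import re
import subprocess


def parse_temperature(text):
    """Convert sysctl output such as '45.0C' to a float in degrees Celsius."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)C\s*$", text)
    if match is None:
        raise ValueError("unexpected temperature format: %r" % text)
    return float(match.group(1))


def read_cpu_temperature(core=0):
    """Query one core's temperature via sysctl (assumes coretemp is loaded)."""
    output = subprocess.check_output(
        ["sysctl", "-n", "dev.cpu.%d.temperature" % core]
    )
    return parse_temperature(output.decode("ascii"))
```

Calling `read_cpu_temperature()` in a loop during a load test gives a record of how quickly the CPU heats up, which a single reading does not show.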
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Signal 11 is pretty much always hardware related. According to the FreeBSD documentation, the main causes for this problem are:
Code:

- The hard disks might be overheating: Check that the fans are still working, as the disk and other hardware might be overheating.

- The processor running is overheating: This might be because the processor has been overclocked, or the fan on the processor might have died. In either case, ensure that the hardware is running at what it is specified to run at, at least while trying to solve this problem. If it is not, clock it back to the default settings.
  Regarding overclocking, it is far cheaper to have a slow system than a fried system that needs replacing! Also the community is not sympathetic to problems on overclocked systems.

- Dodgy memory: if multiple memory SIMMS/DIMMS are installed, pull them all out and try running the machine with each SIMM or DIMM individually to narrow the problem down to either the problematic DIMM/SIMM or perhaps even a combination.

- Over-optimistic motherboard settings: the BIOS settings, and some motherboard jumpers, provide options to set various timings. The defaults are often sufficient, but sometimes setting the wait states on RAM too low, or setting the “RAM Speed: Turbo” option will cause strange behavior. A possible idea is to set to BIOS defaults, after noting the current settings first.

- Unclean or insufficient power to the motherboard. Remove any unused I/O boards, hard disks, or CD-ROMs, or disconnect the power cable from them, to see if the power supply can manage a smaller load. Or try another power supply, preferably one with a little more power. For instance, if the current power supply is rated at 250 Watts, try one rated at 300 Watts.


I would very strongly suggest that your problem is almost certainly an overheating CPU, an overheating chipset, or bad memory sticks (it is common for memory to be bad even though memtest fails to find anything wrong).

Also, as for "consuming all of the CPU cores" by yes'ing out to /dev/null, come on bro. If you want to max out your cores, have them compute Fourier Transforms or something. Just because a core is 100% busy doesn't mean it's 100% maxed out in terms of power.
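
To illustrate the difference, here is a rough sketch of a load that keeps every core busy with floating-point work rather than copying bytes. It is pure Python, so interpreter overhead dominates and it will not heat the CPU the way a dedicated stress tool does; treat it as an illustration of the idea only. All function names are made up for this example.

```python
import cmath
import multiprocessing
import time


def naive_dft(samples):
    """Compute the discrete Fourier transform directly in O(n^2) time."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def burn(seconds, size=256):
    """Run DFTs back to back for roughly `seconds`; return how many completed."""
    samples = [float(i % 7) for i in range(size)]
    deadline = time.time() + seconds
    rounds = 0
    while time.time() < deadline:
        naive_dft(samples)
        rounds += 1
    return rounds


def burn_all_cores(seconds):
    """Start one worker process per CPU core and wait for all of them."""
    workers = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Something like `burn_all_cores(600)`, called from under an `if __name__ == "__main__":` guard, would load all cores for about ten minutes; watch the temperature while it runs.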

Most people trying to run this motherboard simply lay (or screw, or glue) a case fan onto the heat sink, and IMMEDIATELY experience a huge thermal performance improvement.
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
Thanks for this comment. I wasn't aware that the FreeNAS manual mentions this error.
I will upgrade the cooling next week and report the effects. I really hope that resolves the errors.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
For any curious onlookers, or others running into this problem, the following contains a pretty exhaustive list of signal 11 triggers:

http://www.bitwizard.nl/sig11/
 

pirateghost

Unintelligible Geek
Joined
Feb 29, 2012
Messages
4,219
It's not the FreeNAS manual. It's the FreeBSD documentation that he quoted.
 


Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
I installed a fan directly on the heatsink of the processor two days ago. This brought the temperatures on the board down A LOT! This is really awesome.
Under heavy load I am not able to get the CPU temperature higher than about 40°C.

The "Signal 11" errors didn't go away, although I have the feeling that they happen much less frequently.
Here are the log files for reference:
Code:
Sep  1 00:00:00 freenas newsyslog[23102]: logfile turned over due to size>100K
Sep  1 00:00:00 freenas syslog-ng[2260]: Configuration reload request received, reloading configuration;
Sep  1 00:23:53 freenas afpd[23719]: Login by timemachinembpr (AFP3.4)
Sep  1 00:30:35 freenas afpd[23719]: AFP logout by timemachinembpr
Sep  1 00:30:35 freenas afpd[23719]: AFP statistics: 315557.36 KB read, 204981.61 KB written
Sep  1 00:30:35 freenas afpd[23719]: done
Sep  1 01:31:03 freenas afpd[25305]: Login by timemachinembpr (AFP3.4)
Sep  1 01:35:38 freenas afpd[25305]: AFP logout by timemachinembpr
Sep  1 01:35:38 freenas afpd[25305]: AFP statistics: 211883.37 KB read, 162697.64 KB written
Sep  1 01:35:38 freenas afpd[25305]: done
Sep  1 02:35:58 freenas afpd[26798]: Login by timemachinembpr (AFP3.4)
Sep  1 02:39:35 freenas afpd[26798]: AFP logout by timemachinembpr
Sep  1 02:39:35 freenas afpd[26798]: AFP statistics: 59986.97 KB read, 148584.01 KB written
Sep  1 02:39:35 freenas afpd[26798]: done
Sep  1 03:39:57 freenas afpd[29899]: Login by timemachinembpr (AFP3.4)
Sep  1 03:43:48 freenas afpd[29899]: AFP logout by timemachinembpr
Sep  1 03:43:48 freenas afpd[29899]: AFP statistics: 70414.44 KB read, 155517.93 KB written
Sep  1 03:43:48 freenas afpd[29899]: done
Sep  1 04:44:10 freenas afpd[31405]: Login by timemachinembpr (AFP3.4)
Sep  1 04:47:28 freenas afpd[31405]: AFP logout by timemachinembpr
Sep  1 04:47:28 freenas afpd[31405]: AFP statistics: 62785.94 KB read, 150129.71 KB written
Sep  1 04:47:28 freenas afpd[31405]: done
Sep  1 05:47:56 freenas afpd[32993]: Login by timemachinembpr (AFP3.4)
Sep  1 05:51:26 freenas afpd[32993]: AFP logout by timemachinembpr
Sep  1 05:51:26 freenas afpd[32993]: AFP statistics: 78438.71 KB read, 162471.66 KB written
Sep  1 05:51:26 freenas afpd[32993]: done
Sep  1 06:51:49 freenas afpd[34451]: Login by timemachinembpr (AFP3.4)
Sep  1 06:56:39 freenas afpd[34451]: AFP logout by timemachinembpr
Sep  1 06:56:40 freenas afpd[34451]: AFP statistics: 309969.45 KB read, 159652.97 KB written
Sep  1 06:56:40 freenas afpd[34451]: done
Sep  1 07:57:08 freenas afpd[35935]: Login by timemachinembpr (AFP3.4)
Sep  1 08:05:46 freenas afpd[35935]: AFP logout by timemachinembpr
Sep  1 08:05:46 freenas afpd[35935]: AFP statistics: 187907.71 KB read, 180044.25 KB written
Sep  1 08:05:46 freenas afpd[35935]: done
Sep  1 16:31:00 freenas kernel: pid 47658 (python2.7), uid 0: exited on signal 11
Sep  1 16:52:24 freenas afpd[48146]: Login by timemachinembpr (AFP3.4)
Sep  1 16:56:11 freenas afpd[48238]: Login by stefan (AFP3.4)
Sep  1 16:57:38 freenas afpd[48238]: AFP logout by stefan
Sep  1 16:57:38 freenas afpd[48238]: AFP statistics: 88.72 KB read, 181180.19 KB written
Sep  1 16:57:38 freenas afpd[48238]: done
Sep  1 17:11:52 freenas afpd[48146]: AFP logout by timemachinembpr
Sep  1 17:11:52 freenas afpd[48146]: AFP statistics: 1545938.71 KB read, 297138.72 KB written
Sep  1 17:11:52 freenas afpd[48146]: done
Sep  1 17:52:29 freenas afpd[49594]: Login by timemachinembpr (AFP3.4)
Sep  1 17:59:01 freenas afpd[49594]: AFP logout by timemachinembpr
Sep  1 17:59:01 freenas afpd[49594]: AFP statistics: 267817.85 KB read, 218886.72 KB written
Sep  1 17:59:01 freenas afpd[49594]: done
Sep  1 18:59:28 freenas afpd[51183]: Login by timemachinembpr (AFP3.4)
Sep  1 19:02:59 freenas afpd[51183]: AFP logout by timemachinembpr
Sep  1 19:02:59 freenas afpd[51183]: AFP statistics: 77841.79 KB read, 155651.89 KB written
Sep  1 19:02:59 freenas afpd[51183]: done
Sep  1 20:03:27 freenas afpd[52706]: Login by timemachinembpr (AFP3.4)
Sep  1 20:10:50 freenas afpd[52706]: AFP logout by timemachinembpr
Sep  1 20:10:50 freenas afpd[52706]: AFP statistics: 152129.98 KB read, 207904.68 KB written
Sep  1 20:10:50 freenas afpd[52706]: done
Sep  1 21:11:08 freenas afpd[54331]: Login by timemachinembpr (AFP3.4)
Sep  1 21:17:55 freenas afpd[54331]: AFP logout by timemachinembpr
Sep  1 21:17:55 freenas afpd[54331]: AFP statistics: 214868.12 KB read, 216973.50 KB written
Sep  1 21:17:55 freenas afpd[54331]: done
Sep  1 22:18:21 freenas afpd[55953]: Login by timemachinembpr (AFP3.4)
Sep  1 22:24:37 freenas afpd[55953]: AFP logout by timemachinembpr
Sep  1 22:24:37 freenas afpd[55953]: AFP statistics: 139196.19 KB read, 208383.75 KB written
Sep  1 22:24:37 freenas afpd[55953]: done
Sep  1 23:24:54 freenas afpd[57538]: Login by timemachinembpr (AFP3.4)
Sep  1 23:31:49 freenas afpd[57538]: AFP logout by timemachinembpr
Sep  1 23:31:49 freenas afpd[57538]: AFP statistics: 222165.71 KB read, 205670.84 KB written
Sep  1 23:31:49 freenas afpd[57538]: done
Sep  2 00:00:00 freenas syslog-ng[2260]: Configuration reload request received, reloading configuration;
Sep  2 00:37:35 freenas afpd[59287]: Login by timemachinembpr (AFP3.4)
Sep  2 00:37:36 freenas afpd[59287]: afp_zzz: entering normal sleep
Sep  2 00:49:35 freenas afpd[59287]: afp_alarm: child timed out, entering disconnected state
Sep  2 00:49:35 freenas afpd[59287]: dsi_disconnect: entering disconnected state
Sep  2 00:49:35 freenas afpd[59287]: dsi_disconnect: entering disconnected state
Sep  2 02:25:58 freenas afpd[61843]: Login by timemachinembpr (AFP3.4)
Sep  2 02:25:58 freenas afpd[61843]: afp_disconnect: trying primary reconnect
Sep  2 02:25:58 freenas afpd[3289]: Reconnect: transfering session to child[59287]
Sep  2 02:25:58 freenas afpd[3289]: Reconnect: killing new session child[61843] after transfer
Sep  2 02:25:58 freenas afpd[59287]: afp_dsi_transfer_session: succesfull primary reconnect
Sep  2 02:25:58 freenas afpd[59287]: AFP Replay Cache match: id: 2284 / cmd: AFP_FLUSHFORK
Sep  2 02:25:58 freenas afpd[59287]: afp_zzz: entering extended sleep
Sep  2 02:26:00 freenas afpd[61843]: afp_disconnect: primary reconnect succeeded
Sep  2 02:26:08 freenas afpd[59287]: AFP logout by timemachinembpr
Sep  2 02:26:08 freenas afpd[59287]: AFP statistics: 385008.04 KB read, 241750.18 KB written
Sep  2 02:26:08 freenas afpd[59287]: done
Sep  2 09:08:59 freenas afpd[73154]: Login by timemachinembpr (AFP3.4)
Sep  2 09:09:56 freenas afpd[73154]: afp_zzz: entering extended sleep
Sep  2 09:11:49 freenas afpd[73154]: read: Operation timed out
Sep  2 09:11:49 freenas afpd[73154]: dsi_stream_read: len:-1, Operation timed out
Sep  2 09:11:49 freenas afpd[73154]: dsi_disconnect: entering disconnected state
Sep  2 16:50:40 freenas kernel: pid 83676 (sh), uid 0: exited on signal 11
Sep  2 19:55:59 freenas afpd[88015]: Login by timemachinembpr (AFP3.4)
Sep  2 19:55:59 freenas afpd[73154]: Disconnected session terminating
Sep  2 19:55:59 freenas afpd[3289]: Terminated disconnected child[73154], client rebooted.
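
The impression of "much less frequently" can be checked by counting signal 11 exits per calendar day. A minimal sketch, assuming the syslog line format shown in this thread; `crashes_per_day` is a name invented for this example.

```python
import re
from collections import Counter

# Matches e.g. "Sep  1 16:31:00 freenas kernel: pid 47658 (python2.7), uid 0: exited on signal 11"
DAY_CRASH_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"kernel: pid \d+ \([^)]+\), uid \d+: exited on signal 11\b"
)


def crashes_per_day(lines):
    """Return a Counter mapping (month, day) -> number of signal 11 exits."""
    counts = Counter()
    for line in lines:
        match = DAY_CRASH_RE.match(line)
        if match:
            counts[(match.group("month"), int(match.group("day")))] += 1
    return counts
```

On the excerpts posted in this thread that gives 16 crashes on Aug 27 against one each on Sep 1 and Sep 2, although the Sep 2 log stops before the day is over.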
Nothing interesting in debug.log at the time of the last crash, I think:
Code:
Sep  2 16:50:39 freenas alert.py: [middleware.notifier:212] Popen()ing: /sbin/zpool status -x freenas-boot
Sep  2 16:50:39 freenas alert.py: [middleware.notifier:212] Popen()ing: zpool list -H -o health IcyBox
Sep  2 16:50:39 freenas alert.py: [middleware.notifier:212] Popen()ing: /sbin/zpool status -x IcyBox
Sep  2 16:51:39 freenas alert.py: [middleware.notifier:212] Popen()ing: /sbin/zpool status -x freenas-boot
Sep  2 16:51:40 freenas alert.py: [middleware.notifier:212] Popen()ing: zpool list -H -o health IcyBox
Sep  2 16:51:40 freenas alert.py: [middleware.notifier:212] Popen()ing: /sbin/zpool status -x IcyBox

Next, I will remove the RAM modules one at a time and check whether the crashes still occur with only one module installed.
 

Stefan Bonfert

Dabbler
Joined
Aug 27, 2015
Messages
10
OK, so here is a short conclusion.
I tried to run FreeNAS using only one of the two memory modules. I got signal 11 errors.
I tried running it using only the other module. Also signal 11.

On Friday I did two things simultaneously:
1. Exchanged the SATA cables for the ones that came with the mainboard
2. Reinstalled FreeNAS and set it up again

I really had great success with this. There have been no crashes yet. I reinstalled FreeNAS 5 days ago, so I don't expect any more crashes, since they used to happen multiple times per day.
Since I changed two things at the same time (yes, I know... Shame on me.) I don't know what actually solved my problem.

On a related note: I was surprised how easy it was to reinstall FreeNAS and set everything up again. It only took me about an hour.

Thanks a lot for your help and your suggestions. I really hope my server will run smoothly from now on.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Thank you for the follow-up. Signal 11 (SIGSEGV) is usually an illegal memory access, I believe, so the only way that is happening is if the system install is corrupt (which we didn't rule out in your case, and which the reinstall would have fixed) or there is a problem with the hardware. I can't imagine the SATA cabling causing that problem, unless a bad or semi-shorted cable somehow causes an electrical disturbance on the bus, which then triggers the memory access violation.

Anyway, I am glad it is working. Signal 11 on a FreeNAS system is almost always caused by misbehaving hardware: bad memory, a thermal condition putting things into error states, etc. It could, however, be caused by a corrupt system install. Under System -> Boot, I believe, there is a "Verify Install" button, which should report either no errors or a possible error on "resolv.conf", which can be ignored and is a known bug.
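
For the curious, the general idea behind an install check of this kind is to compare each installed file against a known-good checksum. The sketch below shows that idea only. It is not how FreeNAS implements its button, and the manifest format (a dict of relative path to SHA-256 hex digest) is an assumption made for this example.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root="/"):
    """Return the sorted relative paths that are missing or whose digest differs.

    `manifest` maps relative path -> expected SHA-256 hex digest
    (a hypothetical format chosen for this sketch).
    """
    problems = []
    for relative_path, expected in manifest.items():
        full_path = os.path.join(root, relative_path)
        if not os.path.isfile(full_path) or sha256_of(full_path) != expected:
            problems.append(relative_path)
    return sorted(problems)
```

An empty result means every listed file matched. A file that is legitimately rewritten at runtime would show up as a mismatch without anything being wrong, which may be why a "resolv.conf" report can be ignored.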
 