save_rrds.sh error

Status
Not open for further replies.

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Last night at 12:00 am sharp I got an email from my FreeNAS server (bigbigrack). Here is the body of the email:

Subject: Cron <root@bigbigrack> /bin/sh /root/save_rrds.sh

bzip2/libbzip2: internal error number 1007.
This is a bug in bzip2/libbzip2, 1.0.5, 10-Dec-2007.
Please report it to me at: jseward@bzip.org. If this happened
when you were using some program which uses libbzip2 as a
component, you should also report this bug to the author(s)
of that program. Please make an effort to report this bug;
timely and accurate bug reports eventually lead to higher
quality software. Thanks. Julian Seward, 10 December 2007.


*** A special note about internal error number 1007 ***

Experience suggests that a common cause of i.e. 1007
is unreliable memory or other hardware. The 1007 assertion
just happens to cross-check the results of huge numbers of
memory reads/writes, and so acts (unintendedly) as a stress
test of your memory system.

I suggest the following: try compressing the file again,
possibly monitoring progress in detail with the -vv flag.

* If the error cannot be reproduced, and/or happens at different
points in compression, you may have a flaky memory system.
Try a memory-test program. I have used Memtest86
(www.memtest86.com). At the time of writing it is free (GPLd).
Memtest86 tests memory much more thorougly than your BIOSs
power-on test, and may find failures that the BIOS doesn't.

* If the error can be repeatably reproduced, this is a bug in
bzip2, and I would very much like to hear about it. Please
let me know, and, ideally, save a copy of the file causing the
problem -- without which I will be unable to investigate it.


The server has been working without issues since April (aside from random power-downs we never explained). Any idea what this means? Could it really mean we have a RAM issue? I tried googling save_rrds.sh and couldn't find any clue as to what was being compressed with bzip2.
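The bzip2 message boils down to "compress the same data again and see whether the failure is reproducible." Here is a minimal sketch of that check in Python, whose standard bz2 module is built on libbzip2. It's only an illustration: the sample data and the function names (repeat_compress, is_consistent) are made up for the example, and since I don't know which files save_rrds.sh actually compresses, you would have to feed it the real input yourself.

```python
import bz2
import hashlib
import os
from typing import List


def repeat_compress(data: bytes, rounds: int = 5) -> List[str]:
    """Compress the same bytes several times; return one SHA-256 digest per round."""
    digests = []
    for _ in range(rounds):
        compressed = bz2.compress(data, compresslevel=9)
        # Round-trip check: decompressing must give back the original bytes.
        if bz2.decompress(compressed) != data:
            raise RuntimeError("round-trip mismatch")
        digests.append(hashlib.sha256(compressed).hexdigest())
    return digests


def is_consistent(digests: List[str]) -> bool:
    """bzip2 output is deterministic, so every round should hash the same."""
    return len(set(digests)) == 1


if __name__ == "__main__":
    # Hypothetical sample input: 1 MiB of random bytes plus repetitive filler.
    sample = os.urandom(1 << 20) + b"rrd" * 100_000
    results = repeat_compress(sample)
    if is_consistent(results):
        print("all rounds matched")
    else:
        print("rounds differ: suspect memory or other hardware")
```

If identical input ever produces different output or a round-trip mismatch, or the failure shows up at different points from run to run, that points toward hardware, per the first bullet in the message. A failure that repeats on the same file every time points toward the bzip2 bug case instead. A clean run proves nothing on its own, since intermittent errors can easily hide for a few rounds.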

When we first built the system we ran RAM tests and got bizarre results. We ran Memtest86 on the system for several days (I was gone for the weekend, so I let it run). It would run fine for 18+ cycles, then at random we'd get tons of errors. By the end of the weekend we had 10k+ errors after 37 passes. We then found that adding a Kingston RAM cooler to blow air over the RAM sticks made the errors go away. Should we revisit the idea that the RAM may be bad?

Thanks!
 

cyberjock

DSC01682.jpg

Here's a picture of the errors we got back in April. The fact that it took so many passes to get an error led us to think the problem was related to something other than failed RAM. The computer wasn't left to overheat; it was kept cool with the cover open, and I wouldn't have expected overheating RAM to be the cause, but adding the RAM fan fixed the issue. We also found that the error never occurred with just 2 sticks of RAM (which could mean that sticks sitting right next to each other cause localized heating and overheat). I still feel that overheating is very unlikely, but we had no better theory and the fan fixed it.

As you can see, we did 25 passes and at pass 13 we had some errors. So basically we had errors before and during pass 13, but none for the last 11 full passes. Quite bizarre to say the least! Every time I've seen bad RAM the errors were much more frequent and obvious.

Anyway, the errors seemed to come and go at random. How do you get errors and then none for 11 full passes? The general rule is that if you can do 3 complete passes without an error, your RAM is fine. In this case I would never have let it run as long as I did if I hadn't been gone for the weekend.

So yeah, once we added the RAM fan and the errors stopped, we decided to keep going. Due to time constraints it wasn't practical to RMA the RAM or order replacements.
 