Pool problems

BlueScreenTT

Explorer
Joined
Mar 26, 2018
Messages
69
now we are getting somewhere
i will do more testing tomorrow

[Worker #26 Jan 26 23:56] FATAL ERROR: Rounding was 0.5, expected less than 0.4
[Worker #26 Jan 26 23:56] Hardware failure detected, consult stress.txt file.
[Worker #26 Jan 26 23:56] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #26 Jan 26 23:56] Worker stopped.
[Worker #1 Jan 26 23:56] FATAL ERROR: Rounding was 0.5, expected less than 0.4
[Worker #1 Jan 26 23:56] Hardware failure detected, consult stress.txt file.
[Worker #1 Jan 26 23:56] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #1 Jan 26 23:56] Worker stopped.
 

BlueScreenTT

ok i have run Prime95 all day and it works when i run the torture test (15) with option 1, but if i use torture test (15) option 2 or 3 it crashes after 10 seconds with

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.

so i just have to find out now if it is the RAM or CPU / Motherboard
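For readers wondering what the "Rounding was 0.5, expected less than 0.4" message means: Prime95 multiplies very large numbers with floating-point FFTs, and as far as I understand it, it checks that every result lands close to a whole number. The toy sketch below is not Prime95's code; the function names and the digit-list representation are made up for illustration. It only shows the idea of measuring the worst distance from the nearest integer after an FFT multiply. On healthy hardware that distance stays tiny, so a value like 0.5 suggests a wrong bit somewhere in the CPU or memory path.

```python
import cmath

ROUNDOFF_LIMIT = 0.4  # threshold quoted in the Prime95 log above


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def multiply_with_roundoff(a_digits, b_digits):
    """Multiply two little-endian digit lists via FFT.

    Returns (coefficients before carry propagation, worst round-off seen).
    """
    size = 1
    while size < len(a_digits) + len(b_digits):
        size *= 2
    fa = fft([complex(d) for d in a_digits] + [0j] * (size - len(a_digits)))
    fb = fft([complex(d) for d in b_digits] + [0j] * (size - len(b_digits)))
    raw = fft([x * y for x, y in zip(fa, fb)], invert=True)
    coeffs, worst = [], 0.0
    for c in raw:
        value = c.real / size          # should be (almost) an integer
        nearest = round(value)
        worst = max(worst, abs(value - nearest))
        coeffs.append(nearest)
    return coeffs, worst


def checked_multiply(a_digits, b_digits):
    """Same multiply, but fail loudly if the round-off is suspicious."""
    coeffs, worst = multiply_with_roundoff(a_digits, b_digits)
    if worst >= ROUNDOFF_LIMIT:
        raise RuntimeError(
            f"Rounding was {worst}, expected less than {ROUNDOFF_LIMIT}"
        )
    return sum(c * 10**i for i, c in enumerate(coeffs))


if __name__ == "__main__":
    # 123 * 456, digits stored least significant first
    print(checked_multiply([3, 2, 1], [6, 5, 4]))
```

The check cannot tell RAM, CPU and motherboard apart. It only says that a calculation came back wrong.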
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Was the CPU overheating? What happens when you run a memory test?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
You should also run something like "CoreTemp" while running Prime95 to watch your CPU temps, but wait, you can do that via your motherboard, so watch your CPU temps and the Voltage Regulator temps. Watch the voltage levels too, maybe your power supply is faulty. Maybe your heatsink is not installed properly.
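On FreeBSD itself, per-core temperatures are normally exposed as `dev.cpu.N.temperature` sysctls once the coretemp (Intel) or amdtemp (AMD) kernel module is loaded. A minimal sketch for watching them from a script is below. The sample readings, the 75 C limit and the function names are assumptions for illustration, not values from this thread.

```python
import re
import subprocess

# Sample text in the shape the temperature sysctls are reported in;
# the readings are made up for illustration.
SAMPLE = """\
dev.cpu.0.temperature: 41.0C
dev.cpu.1.temperature: 43.0C
dev.cpu.2.temperature: 78.0C
"""

TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C", re.MULTILINE)


def parse_temps(text):
    """Return {core number: temperature in Celsius} from sysctl output."""
    return {int(m.group(1)): float(m.group(2)) for m in TEMP_LINE.finditer(text)}


def hot_cores(temps, limit_c=75.0):
    """List the cores at or above limit_c (hypothetical threshold)."""
    return sorted(core for core, temp in temps.items() if temp >= limit_c)


def read_live():
    """Read the live values; only meaningful on a FreeBSD box with the module loaded."""
    result = subprocess.run(
        ["sysctl", "dev.cpu"], capture_output=True, text=True, check=True
    )
    return parse_temps(result.stdout)


if __name__ == "__main__":
    temps = parse_temps(SAMPLE)
    print(temps, "hot:", hot_cores(temps))
```

Running something like this in a loop while Prime95 is loading the CPU gives a rough log of how the temperatures behave just before a failure.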
 

BlueScreenTT

ok so i have now tested my machine and i get errors in memtest :-(
i tested the ram in a different machine and no problems
so i have a defective motherboard or 1, possibly 2, defective CPUs :-(
 

joeschmuck

Maybe, but it could also be that the RAM timing is off and you need to slow it down just a bit. Is your RAM on the QVL (qualified vendor list) for your motherboard?
 

BlueScreenTT

So i took the machine apart and checked the CPU sockets, DIMM sockets, processors and DIMMs
no problems found
i installed new ECC memory and one CPU
it crashed again
this time it was one of my new ECC sticks that did not work

so now i have tested everything several times and i have 2 stable ECC sticks and 2 stable processors
i have yet to test dual processors but i need one more working ECC stick to be able to do that

my freebsd has now been running for 2 days without any reboots and the pools do not go offline and online all the time like before


:smile:
 

joeschmuck

BlueScreenTT said: "this time it was one of my new ECC sticks that did not work"
You have some seriously bad luck. Are you using an ESD wrist strap when handling your components?
BlueScreenTT said: "my freebsd is running for now 2 days without any reboots and the pools do not go offline and online all the time like before"
Well, you were due for some good luck, let's hope it continues.
 

BlueScreenTT

yes i am

yes i am on the right track now and paranoid so i will do a lot of testing in the future

there should be a perfect freebsd server setup guide so all noobs have a chance of getting it almost right the first time :smile:
like how to test hardware first
setup network, shares, NTP, jails, all services needed and HDD maintenance
 

BlueScreenTT

ok
so i have run memtest many times now
i found broken ECC sticks and got them changed so now i don't get any errors when running the parallel or sequential test
but Round Robin crashes / freezes the server so i have to power cycle it

what is wrong?
i have tested dual cpu and single cpu and it is the same problem no matter how i combine the CPUs and DIMMs
and it freezes in the same place every time
 

Attachment: 20190303_200136.jpg

joeschmuck

BlueScreenTT said: "i found broken ECC and got then changed so now i don't get any errors when running paralell or sequentian test"
How did you make these pass when they used to fail?

I will cut to the chase here: this is not the proper forum for troubleshooting your motherboard hardware. I'm fine helping you, but there are websites that deal specifically with hardware issues, in particular overclocker websites. I used to do a lot of this when CPUs ran in the MHz range and RAM was much slower, just to squeeze a few more MHz of speed out of the system. My first change was from 4.7 MHz to 12 MHz, which was over double the speed! In the day that was a big thing.

1. You may have faulty RAM.
2. You may have a faulty CPU.
3. You may have a faulty Motherboard.
4. You may have dirty power.
5. Your system components may not be able to run at the speed/voltage you are providing.

How do you troubleshoot something like this? Painfully slowly. You can do this any way you desire, but here is how I'd do it:

Try each item below, then run the test that failed. If the test passes, keep running it for at least 2 more days non-stop. You must prove that you fixed it and that it wasn't a lucky pass.

1. If you have a spare power supply, swap it in.
2. Find out what speeds and voltages your RAM supports and what your motherboard is running it at. According to the screenshot you are running at 1866 MHz; manually set the BIOS to run the RAM at 1600 MHz and run your testing again. Typically the RAM voltage will automatically drop a little as well, though it depends on the RAM. Do not manually adjust the voltage settings unless you know the risks!!!
3. Underclock your CPU or change the CPU voltages in BIOS. This is not for the faint of heart!!! You can easily destroy your CPU or motherboard voltage regulators. Make changes in 0.01 VDC increments (depending on your motherboard). Do a Google search for "overclocker RAM failure CPU" or something similar; they can give you advice, but you will need to do a lot of reading as well.

So if you do not have a spare power supply, you can safely do #2 and underclock the RAM to 1600 MHz.
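The procedure above (one change at a time, re-run the failing test, then a long soak run before believing the fix) can be sketched as a small loop. This is only an illustration of the order of operations. The function name, the callables and the example changes are hypothetical stand-ins for what you would actually do by hand in the BIOS and with MemTest86 or Prime95.

```python
def troubleshoot(changes, run_failing_test, soak_test):
    """Try one change at a time.

    changes          -- list of (description, apply_change) pairs, in order
    run_failing_test -- returns True if the test that used to fail now passes
    soak_test        -- returns True if the long (2+ day) run also passes

    Returns the description of the first change that survives both tests,
    or None if nothing helped.
    """
    for description, apply_change in changes:
        apply_change()
        if not run_failing_test():
            continue  # still failing, move on to the next idea
        if soak_test():
            return description  # passed the quick test and the soak run
        # a quick pass followed by a soak failure was just a lucky pass
    return None


if __name__ == "__main__":
    # Simulated machine: it only becomes stable once the RAM runs at 1600 MHz.
    machine = {"psu": "original", "ram_mhz": 1866}
    changes = [
        ("swap in spare power supply", lambda: machine.update(psu="spare")),
        ("set RAM to 1600 MHz", lambda: machine.update(ram_mhz=1600)),
    ]

    def stable():
        return machine["ram_mhz"] <= 1600

    print(troubleshoot(changes, stable, stable))
```

The point of the separate soak step is the same as in the text: a single pass does not prove the fix.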

Good Luck!
 

joeschmuck

No problem. I hope you find a good site to help you and I hope it's as simple as lowering your RAM speed to 1600 MHz.
 

BlueScreenTT

So
after talking to PassMark (MemTest86)
apparently there is a UEFI bug in the 3.3 firmware for my motherboard and that is why it keeps crashing during testing
MemTest86 works just fine in sequential and parallel testing but round robin is a no go.

Supermicro doesn't reply to my ticket
but then again
what are the chances they will make a new firmware

i guess i will just have to trust that it is a UEFI problem, use legacy BIOS and hope for the best
 

joeschmuck

So you are booting using UEFI? Normal Legacy BIOS would be fine enough to boot with. Glad you figured out there was a bug.
 

BlueScreenTT

Are the temp sensors only working with UEFI boot?
before i could see the temperatures, but now that i installed via BIOS i don't get any readings?
 

joeschmuck

I'll be honest with you, I would not have expected that result.
 