APEI Fatal Hardware Error

reina

Cadet
Joined
Oct 14, 2023
Messages
2
Hi! I'm looking for advice with some crashes I've been getting.

I just rebuilt my TrueNAS machine from various parts I had lying around, but the system crashes and reboots every time I try to move data onto it.

The console outputs:

Code:
 panic: APEI Fatal Hardware Error! 




Looking at the textdump info, I find:

Code:
 
root@truenas[~]# cat /data/crash/info.last
Dump header from device: /dev/ada6p1
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 401408
  Blocksize: 512
  Compression: none
  Dumptime: 2023-10-14 07:24:39 -0700
  Hostname: truenas.localdomain
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 13.1-RELEASE-p2 n245412-484f039b1d0 TRUENAS
  Panic String: APEI Fatal Hardware Error!
  Dump Parity: 3924496761
  Bounds: 2
  Dump Status: good


I don't think the problem is specific to ada6: I've crashed the system repeatedly, and different crashes name different SATA drives in this header.
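To check that observation across several crashes, the header can be parsed and tallied by device. This is only a minimal sketch: `parse_dump_header` and `tally_dump_devices` are hypothetical helper names, and `HEADER` is an abbreviated copy of the output above, not a new capture. Note that this line names the device the dump was read from, so a changing device name may say more about where the dump was written than about where the fault is.

```python
from collections import Counter

# Abbreviated copy of the info.last output shown above.
HEADER = """\
Dump header from device: /dev/ada6p1
  Dumptime: 2023-10-14 07:24:39 -0700
  Panic String: APEI Fatal Hardware Error!
  Dump Status: good
"""


def parse_dump_header(text):
    """Parse 'Key: value' lines from a crash info file into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def tally_dump_devices(headers):
    """Count (dump device, panic string) pairs across several headers."""
    counts = Counter()
    for text in headers:
        fields = parse_dump_header(text)
        key = (fields.get("Dump header from device"), fields.get("Panic String"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    for (device, panic), n in tally_dump_devices([HEADER]).items():
        print(f"{device}: {n} x {panic}")
```

Passing the text of each saved header to `tally_dump_devices` shows at a glance whether the panic string stays the same while the device changes.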

The full textdump is here: https://pastebin.com/YH2JT5Qz

I'm not sure how to continue troubleshooting this, as none of the logs make any sense to me. All the hardware in this system is known good and was pulled from various other systems I've tested (and have been using).

General hardware list:
Intel i3-7320
Intel S1200SPL motherboard
2x32GB DDR4 ECC UDIMM
Intel SSD boot drive
6x 14TB WD HDDs in Z2
Samsung SM953 480GB for cache

The drives were used in a previous TrueNAS setup, have tested good, and show no signs of failing.

CPU & motherboard were also previously used in a different build and have tested stable.

The RAM is the only component I can think of that might be causing errors. These sticks were known good in another motherboard, and this CPU/RAM/motherboard combination passes memtest86 on a warm boot, but it is susceptible to the ECC cold boot bug. See this link for what I'm referring to: https://forums.passmark.com/memtest...r-only-after-cold-boot-not-after-repeat-reset

I'm unsure how to proceed here. I would swap the RAM, but this is the only kit of ECC UDIMM I have lying around. Is there a configuration issue or some obvious mistake I'm missing?
 

reina

Cadet
Joined
Oct 14, 2023
Messages
2
Update:

It definitely looks like some interaction between the CPU, board, RAM, and TrueNAS is creating memory errors. It's interesting because I know the CPU and board are good on their own, and the memory was a working pull from a different server as well.

With both sticks (or either individual stick), I just get a few hundred ECC corrected errors in my IPMI log, until an uncorrectable error initiates a reboot.

Code:
3777    10/15/2023 01:57:05    Mmry ECC Sensor    Memory    Uncorrectable ECC. CPU: 1, DIMM: B2. - Asserted
3776    10/15/2023 01:57:05    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: B2. - Asserted
3775    10/15/2023 01:57:05    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: B2. - Asserted
...


If I run memtester under TrueNAS, I get the same thing: a few thousand correctable errors per minute until an uncorrectable error initiates a reboot.

Code:
21179    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Uncorrectable ECC. CPU: 1, DIMM: B2. - Asserted
21178    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: A2. - Asserted
21177    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: A2. - Asserted
...


Yet if I boot memtest86 from a USB, I only get behavior that looks exactly like the ECC cold boot bug. A dozen ECC errors in the console, a few ECC errors in IPMI on the very first (or second) test, but it will subsequently pass hours of memtest86. If I then restart the memory test without restarting the whole PC, no errors at all.
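To compare the IPMI logs between runs, the entries can be tallied per DIMM and error type. This is a rough sketch that assumes the log has been exported as text in the same layout as the excerpts above; `SEL_LINE` and `tally_ecc` are hypothetical names, and `SAMPLE` is copied from the memtester excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
# 21179    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Uncorrectable ECC. CPU: 1, DIMM: B2. - Asserted
SEL_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+"
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r".*?(?P<kind>Uncorrectable|Correctable) ECC\. "
    r"CPU: (?P<cpu>\d+), DIMM: (?P<dimm>\w+)\."
)

SAMPLE = """\
21179    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Uncorrectable ECC. CPU: 1, DIMM: B2. - Asserted
21178    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: A2. - Asserted
21177    10/15/2023 08:13:35    Mmry ECC Sensor    Memory    Correctable ECC. CPU: 1, DIMM: A2. - Asserted
"""


def tally_ecc(log_text):
    """Count ECC events per (DIMM, kind); lines that do not match are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        m = SEL_LINE.match(line)
        if m:
            counts[(m["dimm"], m["kind"])] += 1
    return counts


if __name__ == "__main__":
    for (dimm, kind), n in sorted(tally_ecc(SAMPLE).items()):
        print(f"DIMM {dimm}: {n} {kind}")
```

Running the same tally on a log from a TrueNAS run and on one from a memtest86 run would show whether the errors follow one slot or are spread across both.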

I'll probably try swapping in some non-ECC memory to see what happens, as that's all I have on hand right now. But I don't think it's worth my time to chase this any further on such old hardware. I'll just buy something else and call it a day.
 