TrueNAS Core degrading - several pools with Error "One or more devices has experienced an error..." + APEI Corrected Memory Errors

Jay Gardner · Apr 19, 2022

TrueNAS Core 12.0-U8
32 GB RAM originally (16 GB Crucial + 16 GB Kingston, both 1333). Now running on 16 GB (the Kingston) after the system wouldn't boot, having run fine for 7 years.
Supermicro X10SL7-F
Xeon E3-1265L V4
8 x 3TB drives configured as 4 mirrored pairs (all Seagate NAS drives)
1 x 60 GB OCZ Vertex SSD boot drive

System has been running for 7 years. Originally the CPU was a Core i3-4160; about 1.5 years ago, I bought the Xeon, used. The boot SSD has also been in use for about 1.5 years; prior to that the system was running on flash. Drives have changed over time - smartctl seems to show that the health of all drives is fine.

A few days ago, I found the system off. When I tried to reboot, it wouldn't start and showed boot code 55 on the screen, which I found indicates a memory issue. I began removing memory sticks until I found it would boot with memory installed only in channel A. I originally thought it was a board problem (& it may still be). But after some more work, I was able to get it to boot with all 4 sticks of RAM - channel B seems to be more temperamental about which sticks will work in it - the Kingston memory didn't want to work in channel B. I don't remember which memory was in which channel prior to these issues.

But after a few hours, the system was off again. So I removed the memory from channel B and used the Crucial memory in channel A, and this time, while it booted, it began displaying memory errors on the console:

APEI Corrected Memory Error
Node: 0
Device: 0
Memory Error Type 2

I also had the same error for Device: 1 - so if I understand this correctly, this suggests both sticks of the Crucial memory are bad - correct?
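
To see how the errors are spread across devices, the console text can be tallied per Node/Device. The following is a minimal sketch, assuming each error appears as the four-line block shown above and that the console output has been saved to a text file; the function name `tally_apei` is made up for illustration. A tally like this only shows where errors are reported, not what causes them.

```shell
#!/bin/sh
# Count "APEI Corrected Memory Error" reports per Node/Device.
# Assumption: each report is logged as the block shown above:
#   APEI Corrected Memory Error / Node: N / Device: D / Memory Error Type T
# Reads the files given as arguments, or stdin if none are given.
tally_apei() {
    awk '
        /APEI Corrected Memory Error/ { in_err = 1; node = "?"; next }
        in_err && /Node:/   { node = $NF; next }
        in_err && /Device:/ { count["node " node " device " $NF]++; in_err = 0 }
        END { for (k in count) print k ": " count[k] }
    ' "$@" | sort
}
```

It could be run as `tally_apei console.txt`, where `console.txt` is a hypothetical saved copy of the console output.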

So I then swapped out the Crucial memory for the Kingston memory, still in channel A. Now the system stays up, but it is generating data corruption errors on both the boot and data pools:

root@freenas[~]# zpool status -v
  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:01:07 with 0 errors on Thu Apr 14 03:46:07 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada2p2      ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        /var/db/system/rrd-f62aea877c404489b2f9c112843f66c1/localhost/disk-ada0/disk_ops.rrd

  pool: zfs3t
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 03:56:50 with 0 errors on Sun Apr 17 03:56:50 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs3t                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/f6094f20-078a-11e4-90d0-6805ca1a8b59  ONLINE       0     0     0
            gptid/c8edf3e5-ad98-11ea-a1a1-0cc47aaa6584  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/6052474d-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
            gptid/60ee4b8b-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/9c0f128f-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
            gptid/9cbd6a3c-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/cbd30b4b-b042-11ea-b3f9-0cc47aaa6584  ONLINE       0     0     1
            gptid/ccb738ed-b042-11ea-b3f9-0cc47aaa6584  ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        zfs3t/jftp@auto-2022-04-19_15-04:/jcam/LR/2022Y04M19D08H236764-video.mp4

On the data pool, I've swapped out the referenced file.

QUESTION - with the error on the boot pool, do I essentially need to reinstall TrueNAS from scratch and import my data pools?
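
For keeping track of which files are affected on which pool, the "Permanent errors" lists can be pulled out of the `zpool status -v` text. This is a rough sketch, assuming the output layout shown above; the function name `list_permanent_errors` is hypothetical.

```shell
#!/bin/sh
# Print "pool<TAB>path" for every entry listed under
# "errors: Permanent errors ..." in `zpool status -v` output.
# Reads the files given as arguments, or stdin if none are given.
list_permanent_errors() {
    awk '
        # A new pool section starts: remember its name, leave any file list.
        $1 == "pool:" { pool = $2; in_list = 0; next }
        # The header line that precedes the list of damaged files.
        $1 == "errors:" && /Permanent errors/ { in_list = 1; next }
        # Non-blank lines inside the list are file (or snapshot:file) paths.
        in_list && NF { sub(/^[ \t]+/, ""); print pool "\t" $0 }
    ' "$@"
}
```

On the live system it could be used as `zpool status -v | list_permanent_errors`.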

There was another error indicating a ZFS replication task had failed - it completed partially. (I have a TrueNAS Scale system on another machine that I was replicating certain datasets to.) The message indicates "A resuming stream can be generated on the sending system by running: zfs send -t a-really-long-string". But I can't figure out how to use this command. I tried using (from the TrueNAS Core system):

zfs send -t "the really long string" | ssh root@mytruenas-scale-system -s name-of-dataset-that-receives-replication-from-core

but it didn't work:

warning: cannot send 'zfs3t/jftp@auto-2022-04-19_15-04': Input/output error
cannot receive resume stream: checksum mismatch or incomplete stream.

QUESTION - am I using this command correctly?
If I can't get this to work, I suppose I can just rerun the snapshot, ignore/delete the original snapshot and rerun the replication, which is fine.
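
For reference, a resumed send is normally piped into `zfs receive -s` on the receiving side, and the command tried above has no `zfs receive` after the ssh target. Below is a minimal sketch of the usual shape, with placeholder arguments; the function name `resume_send` and the `DRY_RUN` switch are made up for illustration, and by default it only prints the pipeline. Note that the Input/output error above names the same snapshot that `zpool status -v` lists as containing a corrupted file, so a resume may well fail regardless of syntax.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or run (DRY_RUN=0) a resumable-send pipeline.
# Arguments are placeholders:
#   $1 = resume token (as printed in the failure message)
#   $2 = ssh target of the receiving system, e.g. root@hostname
#   $3 = dataset on the receiving system that holds the partial stream
resume_send() {
    token=$1; remote=$2; dataset=$3
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'zfs send -t %s | ssh %s zfs receive -s %s\n' \
            "$token" "$remote" "$dataset"
    else
        zfs send -t "$token" | ssh "$remote" zfs receive -s "$dataset"
    fi
}
```

The token can also be read on the receiving system from the dataset's `receive_resume_token` property, for example with `zfs get -H -o value receive_resume_token name-of-dataset`.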

Finally, I'm trying to determine if I have a board problem or a CPU problem in addition to what now looks like a memory problem. To me the Crucial memory looks like it's gone bad, given the above memory errors. I've ordered a replacement X10 board and I still have the i3-4160 CPU...

QUESTIONS
- Does it make sense to first swap out the Xeon for the Core i3 and see if I can get the system restarted with memory in both channel A and B before I receive the new memory?
- If I still have problems using channel B on the X10 board with the Core i3 CPU, does it follow that the X10 board is likely problematic?
- Any other suggestions/guidance on what to do from here?

Thanks in advance.

Jay Gardner · Apr 27, 2022

I've now chewed through 2 pairs of Crucial memory. Any time there's memory in channel B, the system at some point crashes (I can't find anything in a log that says why, only a message that the system has recovered from an unscheduled reboot), and after a few restarts I get APEI memory errors and then the system won't boot with that RAM. I tested with Memtest86 and Memtest86+ and did a system stress test (from UBCD). Neither version of Memtest86 finds a problem with the memory (that's AFTER the problems show up), which seems a bit odd, but from what I've been able to find, isn't all that uncommon either. I ran the stress test for about 5 hours and it didn't find any issues.

I've ordered replacement memory (the Crucial equivalent of the Micron memory that is shown in the forum's resources section as qualified memory for this series of boards) and have a replacement X10 board on hand. I'm not putting more memory into the current board, though it seems to chug along just fine with the Kingston memory in channel A.

My biggest question now is whether it's the CPU or the motherboard. My gut feeling is that it's the motherboard. I plan to put my old Core i3 back in the current motherboard and see if I get the same symptom of being unable to boot with memory in channel B. I don't recall if I ever tried the Kingston memory in channel B when the system was in good working order, though I have to believe I did once upon a time: when I first bought the X10 board, I only had 16GB of RAM (Kingston), and I'm sure I would have installed 8GB in channel A and 8GB in channel B in order to get interleaved memory.

Assuming I did run it that way, if it doesn't boot with the i3 using both memory channels, I'm going to conclude that it is indeed the motherboard and reinstall the current Xeon CPU in the new motherboard.

But if it boots fine w/ the i3 with memory in channel B, then I'm tempted to go get a new CPU too (or just continue to use the i3).

Would be interested to hear other suggestions of how to approach or ideas on this approach. I really want to get back to a working system of 32GB without burning up more memory.

Jay Gardner · Apr 27, 2022

Well, it booted with the Core i3 and Kingston memory - 8GB in channel A and 8GB in channel B. It did boot with new Crucial memory in that same config a few days ago, but it didn't last more than 24 hours. So at this point, I'm going to wait and see...

If it runs for a few days, I'll conclude the CPU was bad, but if it goes down again and the Kingston memory fails with APEI Corrected Memory errors when trying to boot TrueNAS, I'll conclude it was the motherboard. I had zero issues w/ the Core i3 CPU - I just wanted an upgrade to the Xeon when I bought it (used).

Again, any suggestions would be welcome.

Thanks.

Jay Gardner · May 22, 2022

Recording here for posterity... the above was all wrong, except that I did ruin the new memory I bought. I had neglected to mention that, coincident with swapping out the CPU, I also swapped out the power supply, as I had a spare available. The system has been running fine for the last 4 weeks. I swapped the Xeon CPU back in and also swapped in the newer memory (Crucial CT102472BD160B, as per the hardware recommendation), so the motherboard and CPU were fine. The problem was a bad power supply.