System behaviour with Uncorrectable ECC error

Morpheus187 · Aug 11, 2023

Hello

System board: Supermicro X11SSH-CTF-O
Ram: 64 GB ECC
CPU: E3-1245v5
16x8 TB HDD

After physically moving the system, it crashed once and I observed Uncorrectable ECC errors in the IPMI event log:

492 2023/08/11 22:35:17 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
493 2023/08/11 22:35:17 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
494 2023/08/11 22:35:17 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
495 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
496 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
497 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
498 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
499 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
500 2023/08/11 22:35:18 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
501 2023/08/11 22:35:19 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
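
For anyone who wants to tally such events per DIMM from an exported event log, here is a minimal Python sketch. It assumes the exact line layout shown above; the regular expression and the function name `count_ecc_events` are my own illustration, not part of any IPMI tool.

```python
import re
from collections import Counter

# Three sample lines copied from the event log in this post.
SAMPLE_LOG = """\
492 2023/08/11 22:35:17 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
493 2023/08/11 22:35:17 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
501 2023/08/11 22:35:19 OEM Memory Uncorrectable ECC @ DIMMB1 - Assertion
"""

# Assumption: every event line looks like "<id> <date> <time> ... <kind> ECC @ <dimm> ...".
EVENT_RE = re.compile(
    r"^(?P<id>\d+)\s+(?P<date>\S+)\s+(?P<time>\S+)\s+.*?"
    r"(?P<kind>Uncorrectable|Correctable) ECC @ (?P<dimm>\S+)"
)


def count_ecc_events(log_text):
    """Count ECC events per (DIMM, kind); lines that do not match are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            counts[(match["dimm"], match["kind"])] += 1
    return counts


if __name__ == "__main__":
    for (dimm, kind), n in count_ecc_events(SAMPLE_LOG).items():
        print(f"{dimm}: {n} {kind} ECC event(s)")
```

A burst of many events on one slot within a couple of seconds, as seen here, points at that DIMM or its slot rather than at a random one-off bit flip.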

Code:

zpool status
  pool: TANK
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 11:42:52 with 0 errors on Sun Jul 23 11:42:53 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        TANK                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3bfa2165-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4423cee1-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4c3bf716-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/54809430-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/5c9b9aa0-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/64ba6a24-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/75c846ed-40e4-11ea-bf7c-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/751465c9-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/7d5bcb86-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/859ae1b4-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/8de09aa6-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/961d1a78-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/9e44d8a3-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/a670711d-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/aeac39ab-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4a859d1e-f16c-11e9-9f4e-ac1f6b6ab88e  ONLINE       0     0     0

errors: No known data errors
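
A small Python sketch of how the device table above could be checked automatically for non-zero READ/WRITE/CKSUM counters. The function name `devices_with_errors` is made up for illustration, and the sample row with a checksum count of 3 is hypothetical; the real output in this thread has zeros everywhere.

```python
# Shortened copy of the output above, plus ONE hypothetical failing row
# ("gptid/hypothetical-disk") added only to show what a hit looks like.
SAMPLE_STATUS = """\
  pool: TANK
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        TANK                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3bfa2165-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/hypothetical-disk                     ONLINE       0     0     3

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for table rows with a non-zero counter."""
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # header row: the device table starts here
            continue
        if line.startswith("errors:"):
            break  # summary line: the device table is over
        if in_table and len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            # Compare as strings so abbreviated counts such as "1.2K" also count.
            if (read, write, cksum) != ("0", "0", "0"):
                flagged.append((name, read, write, cksum))
    return flagged


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE_STATUS))
```

Note that these counters only reflect problems ZFS itself detects on reads and writes to the disks, so a clean table does not rule out corruption elsewhere in the system.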

What seems strange is that neither dmesg nor any other tool on the server shows any error; the system behaves absolutely normally. zpool status also shows NO read or write errors.

But when I copy a large file to the system and copy it back, the checksum is always wrong, so it corrupts everything it gets.
I've now run a scrub... no, just a joke. I have shut the system down and will investigate the issue, starting by re-seating the RAM to see if that fixes it.

Luckily the system is only a secondary system.

I just think it's a bit strange that the system doesn't show any sign of data corruption.

I can download a large file from TrueNAS 5 times and 5 times I get a file with a completely different md5 hash. But when I do file operations on TrueNAS itself, without copying the file over the network, the hash stays the same. It seems that the memory error is mainly located in an area that is used for network transfer / Samba, as it toasts every file that is transferred over the network. The ZFS subsystem doesn't seem to be affected right now.
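
The repeated-download comparison can be sketched in Python roughly like this. It is only an illustration: `stream_md5` and `copies_agree` are names I made up, and the in-memory byte streams stand in for the downloaded copies (in practice you would pass files opened with `open(path, "rb")`).

```python
import hashlib
import io


def stream_md5(handle, chunk_size=1 << 20):
    """Hash a binary file-like object in chunks so large files fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copies_agree(handles):
    """Return (all_equal, set_of_digests) for several copies of one file."""
    digests = {stream_md5(handle) for handle in handles}
    return len(digests) == 1, digests


if __name__ == "__main__":
    # Stand-ins for five downloads: four identical, one with a single changed byte.
    payload = b"x" * 4096
    downloads = [io.BytesIO(payload) for _ in range(4)]
    downloads.append(io.BytesIO(payload[:-1] + b"y"))
    all_equal, digests = copies_agree(downloads)
    print("copies identical:", all_equal)
    print("distinct digests:", len(digests))
```

If every download yields a different digest while a local hash on the server stays stable, the corruption is happening somewhere between the read from the pool and the client, which is what I observed.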

I thought I'd share my findings because it's pretty rare to actually have an Uncorrectable ECC error with ECC RAM.

UPDATE: Memtest agrees

artlessknave · Aug 11, 2023

ZFS will NOT return corrupted data unless there is no good copy. it is designed from the disks up to assume the disks aren't reliable, and it checksums the checksums of the checksums.

the network stack, on the other hand, just resends data until the requesting system is happy. while there are checksums, they are generally nothing like what ZFS does.
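
to illustrate how weak a network-style checksum can be, here is a rough Python sketch of the 16-bit ones'-complement Internet checksum (as used by TCP/UDP/IP) next to SHA-256. SHA-256 is only a stand-in for a stronger, position-sensitive checksum: ZFS can be configured to use sha256, but its default is fletcher4. The payloads are made-up examples.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Made-up payloads: the second one has its first two 16-bit words swapped.
original = b"ABCDEFGH"
swapped = b"CDABEFGH"

if __name__ == "__main__":
    # Addition is order-independent, so the simple checksum cannot see the swap.
    print(internet_checksum(original) == internet_checksum(swapped))
    # A cryptographic hash depends on position and does see it.
    print(hashlib.sha256(original).digest() == hashlib.sha256(swapped).digest())
```

also keep in mind that any checksum only protects data from the moment it is computed. if a buffer is corrupted in RAM before the checksum is calculated over it, the checksum will happily match the corrupted bytes.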

repeated ECC events mean you have repeated errors that need correcting; you need to replace your RAM or mobo, because something is really wrong.
ECC alerts will trigger for any detected flipped bit; I don't think the affected memory even needs to have any data in it.
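
to show the difference between a correctable and an uncorrectable error, here is a toy SECDED (single-error-correct, double-error-detect) code in Python. it is an extended Hamming (8,4) code over 4 data bits, purely for illustration; real ECC DIMMs commonly protect 64 data bits with 8 check bits, and this is not the memory controller's actual algorithm.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    word = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # overall parity over positions 1..7
    return word


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one. Syndrome 0 means bit 0 itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


if __name__ == "__main__":
    codeword = encode(0b1011)
    one_flip = list(codeword)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(codeword), decode(one_flip), decode(two_flips))
```

with one flipped bit the original value comes back and the event is logged as corrected; with two flipped bits the code can only report that the word is bad, which is the situation an "Uncorrectable ECC" entry describes.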

NickF · Aug 11, 2023

Morpheus187 said:
After physically moving a system

Take the RAM out and pop it back in again

Morpheus187 · Sep 9, 2023

Just a little update.

Got replacement RAM and the system tested OK in memtest.

The system then ran for a few days, syncing data, and last night it crashed unrecoverably (without any data being written to it).

The system is in a boot loop and hard crashes when it tries to load the pool. Strange that it happened after a few days. Maybe the damaged ECC RAM caused some problems with the pool that have only now shown up.

Is there a way to boot into TrueNAS without mounting the pool? I guess the first option would be to boot in single user mode and delete the pool.

But first I will look for some options to diagnose the issue a bit more; maybe there is a way to repair the pool, or at least something to learn from it.

Mini Update:
I managed to boot an Ubuntu live CD and could mount the pool. It looks like it was in the middle of a scrub that completely fried the pool. Scans with zdb -c and zdb -b all fail with hard crashes of the software (Ubuntu seems to manage to contain the panics without crashing the whole kernel).
It looks like the ECC error caused some major pool corruption which didn't show up at first, but after the system started a scrub (at a time when the RAM had already been replaced and tested GOOD), it ran into issues.

ChrisRJ · Sep 10, 2023

Morpheus187 said:
I can download a large file from TrueNAS 5 times and 5 times I get a file with a completely different md5 hash. But when I do file operations on TrueNAS itself, without copying the file over the network, the hash stays the same. It seems that the memory error is mainly located in an area that is used for network transfer / Samba, as it toasts every file that is transferred over the network. The ZFS subsystem doesn't seem to be affected right now.

What about looking into the network a bit deeper? I can think of the following things:

Electromagnetic compatibility: Is the cabling close to something that could interfere?
Has any of the cables been bent too much?
NIC: Does using another one change things?
Client: Did you use different client systems and they all show the same behavior?
Different protocol: What about using SFTP instead of SMB (or whatever you used so far)?

Morpheus187 · Sep 10, 2023

I've now done several experiments with the damaged pool, including zdb scans, which all lead to panics. A scrub under Ubuntu also starts and then panics, always at the same position.

Under Ubuntu I can still mount the pool and access files because it doesn't immediately crash the whole system (as TrueNAS does).

As this is a backup system, I will now destroy the pool and create a new one because I consider my pool as dead for good.

Findings about ECC errors (caution: these are my experiences and need not hold true for every system or version)

- TrueNAS Core neither recognizes ECC errors nor shows any warning or hint that something is going wrong. "zpool status" doesn't show any anomalies.
- Files got corrupted during SMB file transfers, and ZFS didn't notice anything going wrong: no checksum errors, no read or write errors.
- In the background the ZFS filesystem got seriously damaged, beyond my ability to repair it. I could still use it and copy 30 TB of data onto it without any issue, but the damage was detected when an automatic scrub ran.
- Despite the damage, it's still possible to access the pool with an Ubuntu live CD, as it seems to be more resilient to system "panics".

My learnings (I knew all of this beforehand, but experiencing it myself is still more valuable):

- ZFS is extremely reliant on functioning system memory. Even ECC does not guarantee the safety of the pool, as ECC RAM can still have errors that are uncorrectable.
- ZFS is NOT a backup and is not infallible; always have a second copy of the data available on a different pool or system. In my case the impact is minor because this is already the third copy of the data I need to back up.
- Always verify system RAM with memtest before putting a ZFS system into production use.

Maybe these findings will help someone who encounters a similar problem in the future.

indy · Sep 15, 2023

Good to know that uncorrectable memory errors will hose the entire system.
I thought the default behavior was to halt the system upon detecting the first uncorrectable error... apparently not.
