System behaviour with Uncorrectable ECC error

Morpheus187

Explorer
Joined
Mar 11, 2016
Messages
61
Hello

System board: Supermicro X11SSH-CTF-O
RAM: 64 GB ECC
CPU: E3-1245v5
16x8 TB HDD

After I physically moved the system, it crashed once, and I then observed Uncorrectable ECC errors in the IPMI event log:

492  2023/08/11 22:35:17  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
493  2023/08/11 22:35:17  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
494  2023/08/11 22:35:17  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
495  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
496  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
497  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
498  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
499  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
500  2023/08/11 22:35:18  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion
501  2023/08/11 22:35:19  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion


Code:
zpool status
  pool: TANK
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 11:42:52 with 0 errors on Sun Jul 23 11:42:53 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        TANK                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3bfa2165-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4423cee1-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4c3bf716-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/54809430-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/5c9b9aa0-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/64ba6a24-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/75c846ed-40e4-11ea-bf7c-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/751465c9-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/7d5bcb86-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/859ae1b4-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/8de09aa6-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/961d1a78-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/9e44d8a3-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/a670711d-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/aeac39ab-7ff9-11e9-adcd-ac1f6b6ab88e  ONLINE       0     0     0
            gptid/4a859d1e-f16c-11e9-9f4e-ac1f6b6ab88e  ONLINE       0     0     0

errors: No known data errors




What seems strange is that neither dmesg nor any other tool on the server shows an error; the system behaves absolutely normally. zpool status also shows NO read or write errors.



But when I copy a large file to the system and copy it back, the checksum is always wrong, so it corrupts everything it receives.
I've now run a scrub... no, just a joke. I have shut the system down and will investigate and re-seat the RAM to see if that fixes the issue.

Luckily the system is only a secondary system.

I just think it's a bit strange that the system doesn't show any sign of data corruption.

I can download a large file from TrueNAS 5 times and get a file with a completely different md5 hash each time. But when I do file operations on TrueNAS itself, without copying over the network, the hash stays the same. It seems that the memory error is mainly located in an area that is used for network transfers / Samba, as it toasts every file that is transferred over the network. The ZFS subsystem doesn't seem to be affected right now.
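
The repeated-download check described above can be scripted. This is only a minimal sketch: the function names are made up for illustration, and it assumes the downloaded copies already sit on the client as separate files.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(paths):
    """Hash every copy; return (all_identical, {path: digest})."""
    digests = {str(path): file_md5(path) for path in paths}
    all_identical = len(set(digests.values())) == 1
    return all_identical, digests
```

Calling `compare_copies` on five downloads of the same file should report identical digests on a healthy system; differing digests point at corruption somewhere between the pool and the client, without saying where.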

I thought I'd share my findings because it's pretty rare to actually see an Uncorrectable ECC error with ECC RAM.


UPDATE: Memtest agrees
memtest.png
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
ZFS will NOT return corrupted data unless there is no good copy. It is designed from the disks up to assume the disks aren't reliable, and it checksums the checksums of the checksums.

The network stack, on the other hand, just resends data until the requesting system is happy. While there are checksums, they are generally nothing like what ZFS does.

Repeated ECC alerts mean you have repeated errors that need correcting; you need to replace your RAM or mobo, because something is really wrong.
ECC alerts will trigger for any detected flipped bit; I don't think the affected memory needs to actually have any data in it.
 

Morpheus187

Just a little update.

Got replacement RAM, and the system tested OK in memtest.

The system ran for a few days, syncing data, and last night it crashed unrecoverably (without any data being written to it).

The system is in a boot loop and hard-crashes when it tries to load the pool. Strange that it happened after a few days. Maybe the damaged ECC RAM caused some problems with the pool that only now showed up.

Is there a way to boot into TrueNAS without mounting the pool? I guess the first option would be to boot into single-user mode and delete the pool.

But first I will look for options to diagnose the issue a bit more; maybe there is a way to repair the pool, or at least something to learn from it.

Mini Update:
I managed to boot an Ubuntu live CD and could mount the pool. It looks like it was in the middle of a scrub that completely fried the pool. Scans with zdb -c and zdb -b all fail with hard crashes of the software (Ubuntu seems to manage to contain the panics without crashing the whole kernel).
It looks like the ECC error caused some major pool corruption which didn't show up at first, but once the system started a scrub (at a time when the RAM had already been replaced and tested GOOD), it ran into problems.
 

Attachments: panic.png (216.2 KB), ubunut.png (1 MB)

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Morpheus187 said: "I can download a large file from TrueNAS 5 times and get a file with a completely different md5 hash each time. But when I do file operations on TrueNAS itself, without copying over the network, the hash stays the same. It seems that the memory error is mainly located in an area that is used for network transfers / Samba, as it toasts every file that is transferred over the network. The ZFS subsystem doesn't seem to be affected right now."

What about looking into the network a bit deeper? I can think of the following things:
  • Electromagnetic compatibility: Is the cabling close to something that could interfere?
  • Have any of the cables been bent too much?
  • NIC: Does using another one change things?
  • Client: Did you use different client systems, and do they all show the same behavior?
  • Different protocol: What about using SFTP instead of SMB (or whatever you used so far)?
 

Morpheus187

I've now done several experiments with the damaged pool, including zdb scans, which all lead to panics. A scrub under Ubuntu also starts and then panics, always at the same position.

Under Ubuntu I can still mount the pool and access files, because a panic there doesn't immediately crash the whole system (as it does on TrueNAS).

As this is a backup system, I will now destroy the pool and create a new one, because I consider the pool dead for good.

Findings about ECC errors (caution: these are my experiences and need not hold for every system or version):

- TrueNAS Core doesn't recognize ECC errors, nor does it throw any warning or hint that something is going wrong. "zpool status" doesn't show any anomalies.
- Files got corrupted during SMB file transfers; ZFS didn't notice anything going wrong: no checksum errors, no read or write errors.
- In the background, the ZFS filesystem was damaged beyond my ability to repair it. I could still use it and copy 30 TB of data onto it without any issue, but when it ran an automatic scrub it detected the damage.
- Despite the damage, it's still possible to access the pool with an Ubuntu live CD, as it seems to be more resilient to system "panics".
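
Since the operating system gave no warning here, one option is to check the IPMI event log separately. The following is only a sketch: it assumes the log has been exported as plain text with lines like the ones in the first post, and the function name is hypothetical. Other tools print different formats, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Matches the description column seen in the first post,
# e.g. "Uncorrectable ECC @ DIMMB1 - Assertion".
UNCORRECTABLE = re.compile(r"Uncorrectable ECC @ (?P<dimm>\w+)")


def count_uncorrectable(lines):
    """Count uncorrectable ECC events per DIMM slot in exported log lines."""
    counts = Counter()
    for line in lines:
        match = UNCORRECTABLE.search(line)
        if match:
            counts[match.group("dimm")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "492  2023/08/11 22:35:17  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion",
        "493  2023/08/11 22:35:17  OEM  Memory  Uncorrectable ECC @ DIMMB1 - Assertion",
    ]
    for dimm, events in count_uncorrectable(sample).items():
        print(f"{dimm}: {events} uncorrectable ECC event(s)")
```

Any non-zero count is a reason to stop writing to the pool and test the memory, as this thread shows.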


My learnings (I knew all of this beforehand, but experiencing it myself is still more valuable):

- ZFS is extremely reliant on functioning system memory. Even ECC does not guarantee the safety of the pool, as ECC memory can still have errors that are uncorrectable.
- ZFS is NOT a backup and not infallible; always have a second copy of the data available on a different pool or system. In my case the impact is minor because this is already the third copy of the data I back up.
- Always verify system RAM with memtest before putting a ZFS system into production use.
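
To illustrate the first point, here is a toy model of checksum-on-write and verify-on-read. It is not ZFS code: SHA-256 stands in for whatever checksum the pool uses, and all function names are made up. It only shows the general principle that a checksum catches damage occurring after it was computed, but not damage that was already present in RAM when it was computed.

```python
import hashlib


def write_block(data: bytes) -> dict:
    """Store a block together with a checksum computed at write time."""
    return {"data": bytes(data), "checksum": hashlib.sha256(data).hexdigest()}


def read_block(block: dict) -> bytes:
    """Return the data only if it still matches the stored checksum."""
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise ValueError("checksum mismatch")
    return block["data"]


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


if __name__ == "__main__":
    original = b"important payload"

    # Case 1: the bit flips after the checksum was stored (e.g. on disk).
    on_disk = write_block(original)
    on_disk["data"] = flip_bit(on_disk["data"], 5)
    try:
        read_block(on_disk)
    except ValueError as err:
        print("damage after write is detected:", err)

    # Case 2: the bit flips in RAM before the checksum is computed.
    in_ram = flip_bit(original, 5)
    stored = write_block(in_ram)
    print("damage before write passes verification:",
          read_block(stored) != original)
```

In the second case the stored checksum is valid for the already-damaged data, so verification on read has nothing to complain about.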

Maybe these findings will help someone who encounters a similar problem in the future.
 

indy

Patron
Joined
Dec 28, 2013
Messages
287
Good to know that uncorrectable memory errors will hose the entire system.
I thought the default behavior was to halt the system upon detecting the first uncorrectable error... apparently not.
 