File Corruption Adventure

swoosh

Cadet
Joined
Oct 20, 2023
Messages
4
Initially I wanted to ask for troubleshooting advice but then I managed to mostly resolve the issue. Not sure why I'm still posting this, maybe roast me for not using a cluster of certified servers or whatever.

Hello!

For the second time in several months, I woke up to a notification about data corruption in my home NAS. In both cases the corrupted file is in /var/db/system/rrd, which is part of the system dataset, which lives on the data pool.

I found two other threads with similar issues, but they don't go into much detail:

There are zero errors on any disk in zpool status or in the SMART error logs. zpool status reports this:
Code:
errors: Permanent errors have been detected in the following files:
        datapool/.system/rrd-d968e280827d4008b5065e3ef738a9cd:/localhost/cpu-1/cpu-softirq.rrd

After moving the system dataset to the boot disks and destroying the old .system dataset, zpool status reports:
Code:
errors: Permanent errors have been detected in the following files:
        <0x437e>:<0x305>

Then, running a scrub on the pool clears the error completely; at least that was the case the first time. (The hex tuple presumably shows up because the dataset it referenced has been destroyed, so ZFS can no longer resolve a path for it.)
Now, the second time around, the scrub reveals additional errors in another two dozen of my files, plus checksum errors on all data drives. Okay, now things are getting bad.
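For anyone following along at home, these are roughly the commands behind the checks above; the pool name is from my setup and the device name is just a placeholder:
Code:
# per-device READ/WRITE/CKSUM counters plus the list of affected files
zpool status -v datapool
# cross-check each member disk's SMART error log
smartctl -l error /dev/sda
# start a scrub, then re-check the status once it completes
zpool scrub datapool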
I'm fully aware that my setup is highly unorthodox by iXsystems standards, but I refuse to believe that ZFS is so fragile that it fails on anything other than top-shelf components. My hardware probably decided to finally kick the bucket.

TrueNAS-SCALE-22.12.3.1
Motherboard: ASUS P8Z77-V LX
CPU: Intel Core i7 3770
RAM: 4x4GiB DDR3 non-ECC (two slightly different kits of Kingston HyperX)
Boot disks: a pair of 2.5" 240GB SSDs attached to a USB 3.0 internal header via USB-SATA adapters.
Pool layout: (SATA everywhere)
VDEV: 2 x MIRROR | 2 wide | 10.91 TiB - two IronWolfs, one WD Red, and one WD Red Plus
SLOG: 1 x MIRROR | 2 wide | 223.57 GiB - two SSDs
L2ARC: 1 x 447.13 GiB - one SSD

Let's go by the process of elimination:
  • It's not the drives: it's unlikely that all of them are failing at the same time with zero SMART errors, and it's also odd that the drives in each vdev have the exact same number of checksum errors.
  • It's not the cables or the chipset: half of the drives are connected directly to the motherboard, and the other half go through an HP H220 HBA.
  • It's probably not the PSU: it's relatively new, it's a SilverStone unit, there isn't much load on it, and the whole thing sits behind a UPS.
  • So it must be something common to the entire system, which means either the CPU or the RAM.
For what it's worth, I'm running echo 16 > /sys/module/zfs/parameters/zfs_flags as an init command to ameliorate the RAM situation.
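In case anyone wants to replicate that: as far as I understand, 16 is the ZFS_DEBUG_MODIFY bit, which makes ZFS checksum its in-flight buffers so that in-memory modification gets caught. A minimal sketch (the setting doesn't survive a reboot, hence the init command):
Code:
# show the currently enabled ZFS debug flags
cat /sys/module/zfs/parameters/zfs_flags
# enable ZFS_DEBUG_MODIFY (0x10) until the next reboot
echo 16 > /sys/module/zfs/parameters/zfs_flags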
But let's check it. memtest86 (v10.2) did 4 passes in 3.5 hours and found one error. I opened up the case, reseated the CPU and all sticks of RAM, cleaned it up a bit, and also rearranged the sticks to put matching pairs in each channel.
Another go at memtest86 - and almost a dozen errors. Interestingly, memtest86 indicated that all errors (during both runs so far) were caught by core #4, so I disabled half the cores and tried again, with the same result, just on a different core this time. In hindsight, I have no idea why I thought that would matter, as there's only one memory controller anyway.
Okay, let's try lowering the RAM speed to 1333 MT/s while bumping the voltage up to 1.65 V. No dice, still plenty of errors.
Let's try taking out one of the kits (2 sticks) of RAM. Four memtest passes - no errors.
As a sanity check, let's try the other kit alone, aaand it's spewing out errors almost immediately. Ladies and gentlemen, we got him!
Another scrub later, the <0x437e>:<0x305> error related to .system is cleared, but the errors in the other files remain. I guess they were permanently damaged during a recent send|recv.
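If anyone ends up in the same spot: as far as I know, the usual way to deal with the leftover entries is to restore (or delete) each listed file from backup and scrub again; the entries should drop off once the damaged blocks are no longer referenced. Roughly:
Code:
# list the remaining damaged files
zpool status -v datapool
# after restoring or deleting them, scrub again
zpool scrub datapool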

Anyway, hopefully removing the bad sticks of RAM bought me enough time to find new hardware. I'm looking at:
  • "Ryzen 5 PRO 4650G" or "Ryzen 7 PRO 5750G".
  • "ASRock B450M Pro4 R2.0" or "ASRock A520M Phantom Gaming 4" or "ASRock B550M Phantom Gaming 4".
  • 4x8GB Micron DDR4 ECC UDIMMs, either "1Rx8 PC4-2666V" or "2Rx8 PC4-2133P".
What do you think about these options?
 

swoosh

Cadet
Joined
Oct 20, 2023
Messages
4
I found two instances of people using almost the same hardware and software as I was considering, which gave me some confidence in my selection.

I ended up with:
  • Ryzen 5 PRO 4650G
  • ASRock B550M Pro4
  • Two sticks of KSM26ED8/16HD

The first boot with both sticks of RAM either took way too long or something was wrong, because the "CPU" and "DRAM" debug LEDs on the motherboard stayed on. I removed one stick of memory and the boot succeeded. Next, I updated the BIOS to v3.20, put the second stick of memory back, and from then on all the RAM was detected every time.

The only quirk so far is with the boot drives, which are connected via two USB adapters to the same internal header: only one drive is consistently detected. I'm not sure whether the header isn't supplying enough power for two SATA SSDs or there's some communication error between the adapters and the chipset. Either way, I plugged one of the drives into one of the rear ports, and that should be fine for now.

Other than that, everything seems to be working fine. ECC is at least reported to be enabled.
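In case it helps anyone else verify theirs, here are a couple of generic ways to check on Linux whether ECC is actually active (assuming the appropriate EDAC driver loads on your platform):
Code:
# what the DMI tables claim about error correction
dmidecode -t memory | grep -i 'error correction'
# if an EDAC driver is loaded, corrected-error counters show up here
grep . /sys/devices/system/edac/mc/mc*/ce_count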
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Thank you for clearly documenting that memory errors can, on occasion, corrupt pool files. With the tons of SOHO users now on ZFS, whether TrueNAS SCALE, Core, or something else (Proxmox?), I figured some of the data loss they experienced was due to memory errors.

Your complete diagnosis was able to confirm that, in your case, it was memory errors, and that ECC memory probably would have helped.

Good luck.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Initially I wanted to ask for troubleshooting advice but then I managed to mostly resolve the issue. Not sure why I'm still posting this, maybe roast me for not using a cluster of certified servers or whatever.
Congratulations on your diagnostic work, and thanks for posting this educational report, which will certainly be useful to other users. Incidentally, it indicates that the flag for checksumming in RAM is not a full substitute for ECC RAM.

Wish you the best for your new ECC build.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Thank you for clearly documenting that memory errors can, on occasion, corrupt pool files. With the tons of SOHO users now on ZFS, whether TrueNAS SCALE, Core, or something else (Proxmox?), I figured some of the data loss they experienced was due to memory errors.

Your complete diagnosis was able to confirm that, in your case, it was memory errors, and that ECC memory probably would have helped.

Good luck.
Maybe. But keeping your machine running with such broken memory is not the intended use case of ECC.
It's there to correct bit flips caused by electromagnetic radiation and the like.
If your memory starts failing this badly, it's only a question of time before you get uncorrectable multi-bit errors.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Anyway, hopefully removing the bad sticks of RAM bought me enough time to find new hardware. I'm looking at:
  • "Ryzen 5 PRO 4650G" or "Ryzen 7 PRO 5750G".
  • "ASRock B450M Pro4 R2.0" or "ASRock A520M Phantom Gaming 4" or "ASRock B550M Phantom Gaming 4".
  • 4x8GB Micron DDR4 ECC UDIMMs, either "1Rx8 PC4-2666V" or "2Rx8 PC4-2133P".
What do you think about these options?
If you can get a Ryzen PRO, then neither is a bad option. I'd go with 2 x 16GB so you can upgrade later without having to sell the old memory.
Also, 2666 works fine with 4 DIMMs, so there's no need to go slower (check the mobo's supported memory list, though).
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Maybe. But keeping your machine running with such broken memory is not the intended use case of ECC.
It's there to correct bit flips caused by electromagnetic radiation and the like.
If your memory starts failing this badly, it's only a question of time before you get uncorrectable multi-bit errors.
The original poster's old server was not using ECC:
Motherboard: ASUS P8Z77-V LX
CPU: Intel Core i7 3770
RAM: 4x4GiB DDR3 non-ECC (two slightly different kits of Kingston HyperX)
It was listed under the "Show : System info" collapsed button.


But yes, ECC memory that is failing either needs to be replaced or removed.
 

swoosh

Cadet
Joined
Oct 20, 2023
Messages
4
Thanks everyone!

Yeah, ECC would at least have alerted the system to the issue and prevented it from running in a broken state. But I'd side with @asap2go and argue that the real problem was that the hardware had degraded to the point of being actually broken. It had been in use for 10+ years though :cool: first as a personal machine, then as a 24/7 NAS.

And I think that more people using ZFS in various ways, even if not entirely canonical, is still a net positive for everyone's data. My case in point: ZFS was operating on top of dying memory, and the outcome is merely a handful of broken blocks/files, i.e. the filesystem as a whole is alive and well. Or did I just get lucky that no superblock or other metadata happened to land on the dead bits?...
Another example: I've been using ZFS on my personal PC for 2+ years now, with non-ECC DDR4 and a flavour of Linux, and not only has it been 99% reliable*, it has already saved me from a failed disk in a mirror.

*I got a ZFS kernel module panic once, most likely this one, related to recordsize: https://github.com/openzfs/zfs/issues/12078
 

swoosh

Cadet
Joined
Oct 20, 2023
Messages
4
Congratulations on your diagnostic work, and thanks for posting this educational report, which will certainly be useful to other users.
The troves of information in all corners of the internet regularly help me research and debug all kinds of problems, and I've always wondered who writes it all. Why don't I do it this time? :smile:
 