swoosh
Cadet | Joined: Oct 20, 2023 | Messages: 4
Initially I wanted to ask for troubleshooting advice but then I managed to mostly resolve the issue. Not sure why I'm still posting this, maybe roast me for not using a cluster of certified servers or whatever.
Hello!
For the second time in several months, I'm waking up to a notification about data corruption in my home NAS. In both cases the corrupted file is in /var/db/system/rrd, which is part of the system dataset, which lives on the data pool.
I found two other threads with similar issues, but they don't go into much detail:
- "corrupted rrdcached files": Hi. The wrong disk got pulled in a hot swap and caused a crash. Now everything in this directory /var/db/collectd/rrd/truenas.local/zfs_arc_v2/ seems to be corrupted, coming up on terminal with either Input/output error or Integrity check failed. I have tried to remove the files and stop the...
- "Permanent errors have been detected in the following files: /var/db/system/rrd-cadb1ce": Hi All, So after a recent JBOD issue which is now resolved I've got a couple of files that ZFS is reporting as permanent errors. It doesn't appear to be a big issue as all is working fine and I think its referencing files in the system dataset that was on my main pool but I have subsequently...
There are zero errors on all disks, both in zpool status and in the SMART error logs.
zpool status reports this:
Code:
errors: Permanent errors have been detected in the following files:

        datapool/.system/rrd-d968e280827d4008b5065e3ef738a9cd:/localhost/cpu-1/cpu-softirq.rrd
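For reference, this is roughly how I checked things (a minimal sketch; the pool name is real, but the /dev/sd? device names are placeholders, not my actual devices):
Code:
# Per-device READ/WRITE/CKSUM counters plus the list of affected files.
zpool status -v datapool
# SMART error-log sweep across all SATA disks; device names are placeholders.
for d in /dev/sd?; do
    echo "== $d =="
    smartctl -l error "$d"
done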
After moving the system dataset to the boot disks and destroying the old .system dataset, zpool status reports:
Code:
errors: Permanent errors have been detected in the following files:

        <0x437e>:<0x305>
(The hex pair is ZFS's internal <dataset>:<object> reference; it shows up like this because the dataset it points to has been destroyed, so the path can no longer be resolved.)
Then, running a scrub on the pool clears the error completely. At least that was the case last time.
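For anyone following along, the scrub-and-recheck cycle looks roughly like this (a sketch; 'datapool' is the pool from the outputs above):
Code:
zpool scrub datapool      # kick off the scrub
zpool status datapool     # shows scrub progress while it runs
zpool status -v datapool  # after completion: any remaining permanent errors
# Note: ZFS keeps entries from the last couple of scrubs, so a stale error
# can take a second clean scrub to disappear from the list.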
Now, the second time, the scrub reveals additional errors in another two dozen of my files, plus checksum errors on all the data drives. Okay, now things are getting bad.
I'm fully aware that my setup is highly unorthodox by iXsystems standards, but I refuse to believe that ZFS is so fragile that it fails on anything other than top-shelf components. My hardware probably decided to finally kick the bucket.
TrueNAS-SCALE-22.12.3.1
Motherboard: ASUS P8Z77-V LX
CPU: Intel Core i7 3770
RAM: 4x4GiB DDR3 non-ECC (two slightly different kits of Kingston HyperX)
Boot disks: a pair of 2.5" 240GB SSDs attached to USB3 internal header via USB-SATA adapters.
Pool layout: (SATA everywhere)
VDEV: 2 x MIRROR | 2 wide | 10.91 TiB - two IronWolfs, one WD Red, and one WD Red Plus
SLOG: 1 x MIRROR | 2 wide | 223.57 GiB - two SSDs
L2ARC: 1 x 447.13 GiB - one SSD
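In zpool terms, that layout corresponds roughly to this (a sketch only; every device name below is a placeholder, not my actual disks):
Code:
# Two 2-wide data mirrors, a mirrored SLOG, and a single L2ARC device.
# All device names are placeholders.
zpool create datapool \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    log mirror /dev/sde /dev/sdf \
    cache /dev/sdg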
Let's go by the process of elimination:
- It's not the drives, because it's unlikely that all of them are failing at the same time with no SMART errors. It's also odd that the drives in each vdev have the exact same number of checksum errors (see the sketch after this list).
- It's not the cables or the chipset, because half of the drives are connected directly to the motherboard, and the other half via an HP H220 HBA.
- It's probably not the PSU, because it's relatively new, there's not a lot of load on it, it's a SilverStone unit, and the whole thing is behind a UPS.
- So it must be something common to the entire system, which leaves either the CPU or the RAM.
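About those identical per-vdev checksum counters: here's a way to watch them climb while the pool is under load (a sketch; the 10-second interval is arbitrary):
Code:
# Poll the per-device READ/WRITE/CKSUM counters every 10 seconds.
watch -n 10 'zpool status -v datapool'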
I already have an
Code:
echo 16 > /sys/module/zfs/parameters/zfs_flags
init command in place to ameliorate the RAM situation. But let's check it. memtest86 (v10.2) did 4 passes in 3.5 hours and found one error. I opened up the case, reseated the CPU and all the sticks of RAM, cleaned it up a bit, and also rearranged the sticks to put matching pairs in each channel.
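As far as I understand it, 16 here is the ZFS_DEBUG_MODIFY flag, which makes ZFS re-verify checksums of ARC buffers to catch in-memory corruption. A quick sanity check that the setting actually took (sketch):
Code:
# Should print 16 (ZFS_DEBUG_MODIFY) if the init command ran.
cat /sys/module/zfs/parameters/zfs_flags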
Another go at memtest86 - and almost a dozen errors. Interestingly, memtest86 indicated that all errors (during both runs so far) were caught by core #4, so I disabled half the cores and tried again, with the same result, just on a different core this time. In hindsight, no idea why I thought it would matter, as there's only one memory controller anyway.
Okay, let's try lowering the RAM speed to 1333 MT/s while bumping the voltage up to 1.65 V. No dice, still plenty of errors.
Let's try taking out one of the kits (2 sticks) of RAM. 4 memtest passes - no errors.
As a sanity check, let's try the other kit alone aaand it's spewing out errors almost immediately. Ladies and gentlemen, we got him!
Another scrub later, and the <0x437e>:<0x305> error related to .system is cleared, but the errors in the other files remain. I guess they got permanently damaged during a recent send|recv.
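If the damaged files still exist in a snapshot taken before the corruption, they can be fished out without a full restore (a sketch; the dataset path and snapshot name are made up):
Code:
# Every ZFS dataset exposes its snapshots under a hidden .zfs directory.
ls /mnt/datapool/mydata/.zfs/snapshot/
# Copy a clean pre-corruption version over the damaged file (names made up).
cp /mnt/datapool/mydata/.zfs/snapshot/auto-weekly-01/somefile \
   /mnt/datapool/mydata/somefile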
Anyway, hopefully the removal of the bad sticks of RAM bought me enough time to find new hardware. I'm looking at:
- "Ryzen 5 PRO 4650G" or "Ryzen 7 Pro 5750G".
- "ASRock B450M Pro4 R2.0" or "ASRock A520M Phantom Gaming 4" or "ASRock B550M Phantom Gaming 4".
- 4x8GB Micron DDR4 ECC UDIMM, either "1Rx8 PC4-2666V" or "2Rx8 PC4-2133P".