ZFS Checksum Errors

simplextech

Cadet
Joined
Sep 19, 2020
Messages
7
Hey guys, really weird thing here that I'm trying to figure out.

memtest86 is running again. I'm letting it run longer this time just to be sure.

Hardware
CPU: AMD FX-8350
RAM: 32 GB, non-ECC (I know the recommendation is for ECC)
MB: ASRock 970 Pro3 R2
Controller: LSI 9211-8i (flashed to IT mode)
Drives: 6 x WD Red (one is a known SMR drive and was left out of testing)

Test Data Transfer Info
The test data is a mix of small and large files. The small files are random text and git repo files, and the large ones are MKVs of local DVDs and Blu-rays. The test data set ranges from 50 GB to 100 GB.
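For anyone wanting to cross-check a copy independently of a ZFS scrub, here is a minimal Python sketch that hashes every file in the source tree and in the copy, then lists the files that differ. It is not the procedure used in this thread, just one way to do it. The function names (`sha256_of`, `build_manifest`, `compare_trees`) and the paths in the usage note are made up for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large MKV files need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root, POSIX style) to its SHA-256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source: Path, copy: Path) -> List[str]:
    """Return relative paths that are missing from the copy or differ in content."""
    src = build_manifest(source)
    dst = build_manifest(copy)
    return sorted(name for name, digest in src.items() if dst.get(name) != digest)
```

Usage would look something like `compare_trees(Path("/data/testset"), Path("/mnt/tank/testset"))`, where both paths are hypothetical. An empty list means every source file has a byte-identical copy. Keep in mind that if the CPU or RAM is faulty, the hashing itself may give inconsistent results between runs, which is a useful clue in its own right.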

ZFS Configuration
Pool created with the ZFS defaults
2 drives in a mirror for the test

FreeNAS 11.3-U4.1
ZFS checksum errors and corrupt files show up on all drives when I scrub after the test data transfers. If I remove the corrupt files and copy them again, the problem is resolved. If I delete everything and perform the same copy, different files show as corrupt.

TrueNAS 12 (RC1)
Same results as FreeNAS.

XigmaNAS 12.1.0.4773
Same testing. No Errors

ProxMox VE 6.2-1
Same testing. No Errors

Tomorrow I may run the same tests on bare Linux or OmniOS and see what results I get, but I'm stumped. I've checked cables and re-seated components. I even replaced the memory: I had 2x8G DIMMs installed when I started this journey a few days ago, and swapping them for new 4x8G DIMMs did nothing. I swapped cables, checked the LSI controller, moved the disks to the motherboard SATA ports, and pulled the drives from the drive cage and ran them directly. The results were the same under every permutation. I can't explain it.
 

simplextech

Wow. Nothing. Interesting.

So here's the outcome so far. After swapping hardware numerous times and running memtest86 multiple times with no errors, I started to question my sanity. What I then noticed and learned is that memtest86 does NOT test all cores/CPUs in the system by default. You must explicitly tell it to test all CPUs and which method to use (parallel, round robin, etc.). Well, now we have more testing to do.

I set memtest86 to run the CPUs in parallel and finally the system failed! Finally. Bad CPU.

For anyone else banging their head against a wall where everything "seems" to be working and memtest86 checks out "clean": make sure to change the CPU testing method so that all CPUs/cores are actually being tested.
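To illustrate the idea of loading every core at once rather than one at a time, here is a rough Python sketch. It starts one worker process per logical CPU; each worker repeatedly copies and hashes a known buffer and counts any hash that does not match the expected value. It is no substitute for memtest86, since it runs under the OS and touches only a small slice of RAM, and the names `hash_rounds` and `stress_all_cores` and the default sizes are assumptions made for the example.

```python
import hashlib
import multiprocessing as mp
import os


def hash_rounds(task):
    """Hash the same buffer repeatedly and count results that disagree."""
    seed, rounds, size = task
    block = hashlib.sha256(str(seed).encode()).digest()  # 32-byte seed pattern
    pattern = block * (size // len(block))
    reference = hashlib.sha256(pattern).hexdigest()
    mismatches = 0
    for _ in range(rounds):
        scratch = bytearray(pattern)  # fresh copy, so memory is rewritten each round
        if hashlib.sha256(scratch).hexdigest() != reference:
            mismatches += 1
    return mismatches


def stress_all_cores(rounds=20, size=4 * 1024 * 1024, workers=None):
    """Run hash_rounds in parallel, one worker process per logical CPU by default."""
    workers = workers or os.cpu_count() or 1
    tasks = [(seed, rounds, size) for seed in range(workers)]
    with mp.Pool(processes=workers) as pool:
        return pool.map(hash_rounds, tasks)


if __name__ == "__main__":
    results = stress_all_cores()
    for index, count in enumerate(results):
        print(f"worker {index}: {count} mismatched hashes")
```

On healthy hardware every worker should report zero mismatches. A nonzero count, or a worker that crashes outright, hints at a CPU or memory fault that only shows up under parallel load. The OS decides which core each worker runs on, so this does not guarantee that every core is exercised the way memtest86's explicit CPU selection does.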
 