Data corruption detected, but files are not corrupted

Status
Not open for further replies.

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
I have about "352 data errors" in zpool status, and (with -v) apparently "Permanent errors have been detected in the following files:" except the files listed are completely fine.

Over the past couple of weeks I have had literally tens of thousands of CKSUM errors across every drive in the NAS (9 drives spread between two distinct pools), but very surprisingly, after running a binary comparison of every file in the zpool against the master copy stored on a separate machine, no files show any binary differences at all. Basically, I can't find a single example of actual data corruption detectable from applications, despite ZFS throwing a constant checksum hissy fit.

So my question is, are there any known reasons that ZFS could incorrectly report uncorrectable/permanent data corruption?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are you using ECC RAM? If not, my first bet is seriously dying RAM.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Yeah, dying non-ECC RAM. That's also my diagnosis.

Would like to see the user's complete set of hardware specifications.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If our hypothesis is correct, this would also be a fascinating data point, showing that ZFS won't just "correct" everything it reads with garbage in the event of a catastrophic RAM error. That would clearly support the "ZFS is no less safe than traditional file systems when used with non-ECC RAM" point of view (apart from the lack of recovery tools for when the fan becomes a fecal delivery system).

Fascinating, but ultimately irrelevant, since any data worth storing is worthy of ECC RAM.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
@MoltenMoose, can you tell us your hardware specifications, and run a rigorous RAM test on your DIMMs?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Just so we're clear, all data that I care about is backed up elsewhere! It's this backup that is allowing me to test whether the data has been corrupted.

Now with that out of the way, I'm definitely not running what you guys would consider a recommended specification!
  • I'm not running ECC RAM
  • I'm virtualising FreeNAS over Proxmox (with VT-d passthrough of the onboard SATA controller)
  • All disks making up my arrays have around 5 years (!) of continuous runtime on them
I haven't run a memory test yet, although I'll get a long one done over this weekend.

Full specifications:
  • Gigabyte GA-H170M-DS3H motherboard
  • Intel Core i7 6700 (Skylake)
  • 2 x 16GB Corsair Vengeance BLK DDR4 2133MHz memory
  • Seasonic Platinum 400W passive power supply
The machine that hosts the disks is only about 2 weeks old, but the disks themselves are 5400 rpm Samsung Spinpoints (model HD204UI) from 2011 with over 40,000 hours on them; all pass extended SMART tests. They've been in a Synology NAS until now.
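For reference, the extended tests mentioned above come from smartmontools; a hedged sketch of how one might kick them off on each disk (the `run_long_tests` helper and the device names are examples, not anything from this thread):

```shell
# run_long_tests: start a SMART extended (long) self-test on each named
# disk. The test runs in the drive firmware in the background; results
# are read back later with `smartctl -l selftest /dev/<dev>`.
run_long_tests() {
  for dev in "$@"; do
    smartctl -t long "/dev/$dev"
  done
}

# e.g. run_long_tests ada0 ada1 ada2 ada3 ada4
# later: smartctl -l selftest /dev/ada0
```

An extended test takes several hours per drive, but the drives test themselves independently, so they can all be started at once.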

I have another set of new WD Red 3TB drives arriving today to replace this old set, but given the number of CKSUM errors and the fact that they affect every disk, I have a feeling the aging drives aren't my problem - yet. I do know they're a ticking time bomb, however.

But actually what is my problem? My data isn't corrupted, at least not yet anyway; it's just being reported as such. Regardless of virtualisation, non-ECC RAM and old hard drives, I would have thought it was fundamental that when ZFS reports "Permanent errors have been detected", end-to-end data corruption has actually happened.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I haven't run a memory test yet, although I'll get a long one done over this weekend.
Please do so and report back.
But actually what is my problem?
If we're correct, your RAM is.
My data isn't corrupted, at least not yet anyway; it's just being reported as such. Regardless of virtualisation, non-ECC RAM and old hard drives, I would have thought it was fundamental that when ZFS reports "Permanent errors have been detected", end-to-end data corruption has actually happened.
Well, yes, assuming the system is reliable. All bets are off if your RAM is suffering from constant errors. ZFS does what it normally does, but the results are garbage because they were changed in RAM. Garbage doesn't match the stored checksum for the blocks -> error is raised.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
How did you test that they are fine? md5? diff? ...?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
but very surprisingly, after running binary comparison over all files in the zpool against the master copy stored on a separate machine, no files show any binary differences at all.
If it was silent corruption from the source, ZFS wouldn't be complaining.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Could it also be the snapshots which are corrupted? Or do they show up for a different file?

If data only referenced in a snapshot was flagged as corrupted, would ZFS say the original file had corruption?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Thanks, everyone, for your responses so far. I have to admit I was bracing myself for an absolute flaming for the trifecta of virtualising, not using ECC memory and running really old drives, but your responses have been really constructive :)

How did you test that they are fine?
md5? diff? ...?

I'm running a comparison, using the Windows tool Beyond Compare (in binary compare mode), of the files stored on the zpool against the copy that was used to originally populate the zpool. This copy is held on a different machine running Windows, on a single hard drive formatted with NTFS. I even deliberately changed a single byte in one file on the zpool to ensure that the comparison tool correctly picked this up, which it did; it was the only difference it found.
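For anyone wanting to do the same check without Beyond Compare, it can be done from the command line. A minimal sketch, assuming GNU coreutils' sha256sum is available (FreeBSD ships `sha256` instead, or coreutils can be installed); the `verify_tree` helper and the mount points in the example are mine, not anything standard:

```shell
# verify_tree <master_dir> <copy_dir>: hash every file under the master
# copy into a manifest, then verify the other copy against it.
# Exits 0 if every file matches; lists any FAILED files otherwise.
verify_tree() {
  master=$1; copy=$2
  ( cd "$master" && find . -type f -exec sha256sum {} + ) > /tmp/master.sums
  ( cd "$copy" && sha256sum --check --quiet /tmp/master.sums )
}

# e.g. verify_tree /mnt/master-copy /mnt/vault/videos  (example paths)
```

The `--quiet` flag suppresses the per-file OK lines, so only mismatches are printed.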

Could it also be the snapshots which are corrupted? Or do they show up for a different file?

If data only referenced in a snapshot was flagged as corrupted, would zfs say the original file had corruption?

I'm at work so I can't get in via SSH to post any logs at the moment, but from what I remember, the list of files under the "Permanent errors have been detected in the following files:" heading contains both snapshot versions and active versions.

The physical machine that is virtualising FreeNAS also runs a number of other virtual machines, including a Windows box, a pfSense firewall and Plex Media Server. I will run an exhaustive memory test in due course, but for what it's worth I haven't noticed any of these other machines exhibiting strange behaviour that could point toward faulty memory.

I'm wondering if there's any configuration of the virtual machine that could affect the behaviour of FreeNAS in some way? Perhaps the type of virtual CPU being exposed to FreeNAS? I'm using the "Qemu64" CPU type and have allocated 12GB (of the available 32GB) exclusively to the FreeNAS VM. I tried to follow this guide to avoid any massive faux pas, but could easily have missed something. Does anyone have experience hosting FreeNAS on a KVM/QEMU-based hypervisor?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Just to keep this updated, I completed a pass through memtest86 last night.

 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Today I'm going to replace one of the arrays (five drives with 40,000 hours on them) with five brand new 3TB WD Reds and create a new zpool on them using RAID-Z2. I have a feeling, however, that it will suffer just the same from these false corruption reports. At least this time I'll have double parity to protect me, rather than the single parity I have now.
I obviously won't be putting any trust in this new array by putting important data on it just yet.

As another data point, I've noticed that the CKSUM fail numbers seem to spike the most when doing one very specific kind of thing with the arrays. They don't increase much when reading or writing large chunks of sequential data, such as copying 1GB video files or playing them back with VLC. They don't move much when scrubbing the volume either. What really causes them to spike is indexing the volume with an external service.

For example, it was pointing Plex Media Server at the root video folder that originally revealed the problem: the CKSUM numbers quickly spiked into the thousands across all drives and caused the array to go DEGRADED. Prior to this they were all zero. It seems that lots of random, quick-seeking access is the most effective way to make ZFS misreport corruption.
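To put numbers on these spikes while an indexing run is in progress, one could poll `zpool status` and pull out just the per-device CKSUM column. A sketch (the `cksum_counts` helper is mine; it assumes the standard NAME/STATE/READ/WRITE/CKSUM column layout and ada*/da*/gptid device names):

```shell
# cksum_counts: filter `zpool status` output down to "<device> <cksum>"
# pairs. Field 5 is the CKSUM column in the standard status table
# (NAME STATE READ WRITE CKSUM); device lines are indented and start
# with ada*, da* or gptid/*.
cksum_counts() {
  awk '/^[[:space:]]+(ada|da|gptid)/ { print $1, $5 }'
}

# e.g. poll once a minute while Plex indexes:
#   while sleep 60; do date; zpool status vault | cksum_counts; done
```

Watching the counters tick up in near real time would show whether they really track the random-access workload or just appear in bursts.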

Anyway, I'll get to installing that new array now and we'll see how that performs.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I'm running a comparison, using the Windows tool Beyond Compare (in binary compare mode), of the files stored on the zpool against the copy that was used to originally populate the zpool. This copy is held on a different machine running Windows, on a single hard drive formatted with NTFS. I even deliberately changed a single byte in one file on the zpool to ensure that the comparison tool correctly picked this up, which it did; it was the only difference it found.

Ok, perfect. So we can be pretty sure that the files are indeed the same.

As another data point, I've noticed that the CKSUM fail numbers seem to spike the most when doing one very specific kind of thing with the arrays. They don't increase much when reading or writing large chunks of sequential data, such as copying 1GB video files or playing them back with VLC. They don't move much when scrubbing the volume either. What really causes them to spike is indexing the volume with an external service.

For example, it was pointing Plex Media Server at the root video folder that originally revealed the problem: the CKSUM numbers quickly spiked into the thousands across all drives and caused the array to go DEGRADED. Prior to this they were all zero. It seems that lots of random, quick-seeking access is the most effective way to make ZFS misreport corruption.

This makes me think it's the drives (random seeks vs. sequential access), so the SMART reports would be very useful.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
How are the drives connected to the system? HBA? Motherboard?

Could be a PSU issue.

Could be a VM issue.
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Okay - another update.

I removed all the old hard drives, put in 5 brand new WD Red 3TB drives, formed a RAID-Z2 pool with them (7.8TB) and filled it to around 80%.
So far, 0 errors.

I then ran a scrub overnight, and this morning I was greeted with this:

Code:
  pool: vault
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Nov  8 23:26:33 2016
        8.18T scanned out of 10.2T at 288M/s, 2h1m to go
        232G repaired, 80.37% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vault                                           DEGRADED     0     0 7.97K
          raidz2-0                                      DEGRADED     0     0 66.1K
            ada1p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            gptid/db78c633-a34f-11e6-8d02-8314b1a62e3d  DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada0p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada3p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada2p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)

errors: 8144 data errors, use '-v' for a list


I haven't had time to check, but I'm predicting that, as before, all the files will actually be fine.
SATA cables are my next target. I've got some new ones arriving today, so I'll fit them later and report back.
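Once the cables are swapped, the plan for a clean retest (a sketch; the `retest_pool` helper is mine, and the pool name comes from the status output above) is to zero the counters and scrub again, so any new errors can't be confused with old ones:

```shell
# retest_pool: clear the pool's READ/WRITE/CKSUM counters and error
# list, then start a fresh scrub that re-reads and verifies every block.
retest_pool() {
  pool=$1
  zpool clear "$pool"
  zpool scrub "$pool"
}

# e.g. retest_pool vault; then watch progress with: zpool status -v vault
```

If the counters stay at zero through a full scrub with the new cables, that points the finger firmly at the old cabling.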
 