Data corruption detected, but files are not corrupted

Status
Not open for further replies.

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
I have about "352 data errors" in zpool status, and (with -v) apparently "Permanent errors have been detected in the following files:" except the files listed are completely fine.

Over the past couple of weeks I have had literally tens of thousands of CKSUM errors across every drive in the NAS (9 drives spread between two distinct pools), but very surprisingly, after running a binary comparison of every file in the zpool against the master copy stored on a separate machine, no files show any binary differences at all. Basically, I can't find a single example of actual data corruption detectable from applications, despite ZFS throwing a constant checksum hissy fit.

So my question is, are there any known reasons that ZFS could incorrectly report uncorrectable/permanent data corruption?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are you using ECC RAM? If not, my first bet is seriously dying RAM.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Yeah, dying non-ECC RAM. That's also my diagnosis.

Would like to see the user's complete set of hardware specifications.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If our hypothesis is correct, this would also be a fascinating data point, showing that ZFS won't just "correct" everything it reads with garbage in the event of a catastrophic RAM error. That would clearly support the "ZFS is no less safe than traditional file systems when used with non-ECC RAM" point of view (apart from the lack of recovery tools for when the fan becomes a fecal delivery system).

Fascinating, but ultimately irrelevant, since any data worth storing is worthy of ECC RAM.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
@MoltenMoose, can you tell us your hardware specifications, and run a rigorous RAM test on your DIMMs?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Just so we're clear, all data that I care about is backed up elsewhere! It's this backup that is allowing me to test whether the data has been corrupted.

Now with that out of the way, I'm definitely not running what you guys would consider a recommended specification!
  • I'm not running ECC RAM
  • I'm virtualising FreeNAS over Proxmox (with VT-d passthrough of the onboard SATA controller)
  • All disks making up my arrays have around 5 years (!) of continuous runtime on them
I haven't run a memory test yet, although I'll get a long one done over this weekend.

Full specifications:
  • Gigabyte GA-H170M-DS3H motherboard
  • Intel Core i7 6700 (Skylake)
  • 2 x 16GB Corsair Vengeance BLK DDR4 2133MHz memory
  • Seasonic Platinum 400W passive power supply
The machine that hosts the disks is only about 2 weeks old, but the disks themselves are 5400 rpm Samsung Spinpoints (model HD204UI) from 2011 with over 40,000 hours on them; all pass extended SMART tests. They've been in a Synology NAS until now.
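For reference, the extended tests mentioned above come from smartmontools; a hedged sketch of how one might kick them off on each disk (the `run_long_tests` helper and the device names are examples, not anything from this thread):

```shell
# run_long_tests: start a SMART extended (long) self-test on each named
# disk. The test runs in the drive firmware in the background; results
# are read back later with `smartctl -l selftest /dev/<dev>`.
run_long_tests() {
  for dev in "$@"; do
    smartctl -t long "/dev/$dev"
  done
}

# e.g. run_long_tests ada0 ada1 ada2 ada3 ada4
# later: smartctl -l selftest /dev/ada0
```

An extended test takes several hours per drive, but the drives test themselves independently, so they can all be started at once.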

I have another set of new WD Red 3TB drives arriving today to replace this old set, but given the number of CKSUM errors and the fact that they affect every disk, I have a feeling the aging drives aren't my problem - yet. I do know they're a ticking time bomb, however.

But actually what is my problem? My data isn't corrupted, at least not yet anyway; it's just being reported as such. Regardless of virtualisation, non-ECC RAM and old hard drives, I would have thought it was fundamental that when ZFS reports "Permanent errors have been detected", end-to-end data corruption has actually happened.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I haven't run a memory test yet, although I'll get a long one done over this weekend.
Please do so and report back.
But actually what is my problem?
If we're correct, your RAM is.
My data isn't corrupted, at least not yet anyway; it's just being reported as such. Regardless of virtualisation, non-ECC RAM and old hard drives, I would have thought it was fundamental that when ZFS reports "Permanent errors have been detected", end-to-end data corruption has actually happened.
Well, yes, assuming the system is reliable. All bets are off if your RAM is suffering from constant errors. ZFS does what it normally does, but the results are garbage because they were changed in RAM. Garbage doesn't match the stored checksum for the blocks -> error is raised.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
How did you test that they are fine? md5? diff? ...?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
but very surprisingly, after running binary comparison over all files in the zpool against the master copy stored on a separate machine, no files show any binary differences at all.
If it was silent corruption from the source, ZFS wouldn't be complaining.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Could it also be the snapshots which are corrupted? Or do they show up for a different file?

If data only referenced in a snapshot was flagged as corrupted, would ZFS say the original file had corruption?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Thanks, everyone, for your responses so far. I have to admit I was bracing myself for an absolute flaming for the trifecta of virtualising, not using ECC memory and running really old drives, but your responses have been really constructive :)

How did you test that they are fine?
md5? diff? ...?

I'm running a comparison, using the Windows tool Beyond Compare (in binary compare mode), of the files stored on the zpool against the copy that was used to originally populate the zpool. This copy is held on a different machine running Windows, on a single hard drive formatted with NTFS. I even deliberately changed a single byte in one file on the zpool to ensure that the comparison tool correctly picked this up, which it did; it was the only difference it found.
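For anyone wanting to do the same check without Beyond Compare, it can be done from the command line. A minimal sketch, assuming GNU coreutils' sha256sum is available (FreeBSD ships `sha256` instead, or coreutils can be installed); the `verify_tree` helper and the mount points in the example are mine, not anything standard:

```shell
# verify_tree <master_dir> <copy_dir>: hash every file under the master
# copy into a manifest, then verify the other copy against it.
# Exits 0 if every file matches; lists any FAILED files otherwise.
verify_tree() {
  master=$1; copy=$2
  ( cd "$master" && find . -type f -exec sha256sum {} + ) > /tmp/master.sums
  ( cd "$copy" && sha256sum --check --quiet /tmp/master.sums )
}

# e.g. verify_tree /mnt/master-copy /mnt/vault/videos  (example paths)
```

The `--quiet` flag suppresses the per-file OK lines, so only mismatches are printed.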

Could it also be the snapshots which are corrupted? Or do they show up for a different file?

If data only referenced in a snapshot was flagged as corrupted, would zfs say the original file had corruption?

I'm at work so I can't get in via SSH to post any logs at the moment, but from what I remember, the list of files under the "Permanent errors have been detected in the following files:" heading contains both snapshot versions and active versions.

The physical machine that is virtualising FreeNAS also runs a number of other virtual machines, including a Windows box, a pfSense firewall and Plex Media Server. I will run an exhaustive memory test in due course, but for what it's worth I haven't noticed any of these other machines exhibiting strange behaviour that could point toward faulty memory.

I'm wondering if there's any configuration of the virtual machine that could affect the behaviour of FreeNAS in some way? Perhaps the type of virtual CPU being exposed to FreeNAS? I'm using the "Qemu64" CPU type and have allocated 12GB (of the available 32GB) exclusively to the FreeNAS VM. I tried to follow this guide to avoid any massive faux pas, but could easily have missed something. Does anyone have experience hosting FreeNAS on a KVM/QEMU-based hypervisor?
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Just to keep this updated, I completed a pass through memtest86 last night.

 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Today I'm going to replace one of the arrays (five drives with 40,000 hours on them) with five brand new 3TB WD Reds and create a new zpool on them using RAID-Z2. I have a feeling, however, that it will suffer just the same from these false corruption reports. At least this time I'll have double parity to protect me, rather than the single parity I have now.
I obviously won't be putting any trust in this new array by putting important data on it just yet.

As another data point, I've noticed that the CKSUM fail numbers seem to spike the most when doing one very specific kind of thing with the arrays. They don't increase much when reading or writing large chunks of sequential data, such as copying 1GB video files or playing them back with VLC. They don't move much when scrubbing the volume either. What really causes them to spike is indexing the volume with an external service.

For example, it was pointing Plex Media Server at the root video folder that originally revealed the problem: the CKSUM numbers quickly spiked into the thousands across all drives and caused the array to go DEGRADED. Prior to this they were all zero. It seems that lots of random, quick-seeking access is the most effective way to make ZFS misreport corruption.
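To put numbers on these spikes while an indexing run is in progress, one could poll `zpool status` and pull out just the per-device CKSUM column. A sketch (the `cksum_counts` helper is mine; it assumes the standard NAME/STATE/READ/WRITE/CKSUM column layout and ada*/da*/gptid device names):

```shell
# cksum_counts: filter `zpool status` output down to "<device> <cksum>"
# pairs. Field 5 is the CKSUM column in the standard status table
# (NAME STATE READ WRITE CKSUM); device lines are indented and start
# with ada*, da* or gptid/*.
cksum_counts() {
  awk '/^[[:space:]]+(ada|da|gptid)/ { print $1, $5 }'
}

# e.g. poll once a minute while Plex indexes:
#   while sleep 60; do date; zpool status vault | cksum_counts; done
```

Watching the counters tick up in near real time would show whether they really track the random-access workload or just appear in bursts.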

Anyway, I'll get to installing that new array now and we'll see how that performs.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I'm running a comparison, using the Windows tool Beyond Compare (in binary compare mode), of the files stored on the zpool against the copy that was used to originally populate the zpool. This copy is held on a different machine running Windows, on a single hard drive formatted with NTFS. I even deliberately changed a single byte in one file on the zpool to ensure that the comparison tool correctly picked this up, which it did; it was the only difference it found.

Ok, perfect. So we can be pretty sure that the files are indeed the same.

As another data point, I've noticed that the CKSUM fail numbers seem to spike the most when doing one very specific kind of thing with the arrays. They don't increase much when reading or writing large chunks of sequential data, such as copying 1GB video files or playing them back with VLC. They don't move much when scrubbing the volume either. What really causes them to spike is indexing the volume with an external service.

For example, it was pointing Plex Media Server at the root video folder that originally revealed the problem: the CKSUM numbers quickly spiked into the thousands across all drives and caused the array to go DEGRADED. Prior to this they were all zero. It seems that lots of random, quick-seeking access is the most effective way to make ZFS misreport corruption.

This makes me think it's the drives (random seeks vs. sequential access), so the SMART reports would be very useful.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
How are the drives connected to the system? HBA? Motherboard?

Could be a PSU issue.

Could be a VM issue.
 

MoltenMoose

Cadet
Joined
Oct 25, 2016
Messages
8
Okay - another update.

I removed all the old hard drives, put in 5 brand new WD Red 3TB drives, formed a RAID-Z2 pool with them (7.8TB) and filled it to around 80%.
So far, 0 errors.

I then ran a scrub overnight, and this morning I was greeted with this:

Code:
  pool: vault
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Nov  8 23:26:33 2016
        8.18T scanned out of 10.2T at 288M/s, 2h1m to go
        232G repaired, 80.37% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vault                                           DEGRADED     0     0 7.97K
          raidz2-0                                      DEGRADED     0     0 66.1K
            ada1p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            gptid/db78c633-a34f-11e6-8d02-8314b1a62e3d  DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada0p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada3p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)
            ada2p2                                      DEGRADED     0     0 1.07M  too many errors  (repairing)

errors: 8144 data errors, use '-v' for a list


I haven't had time to check, but I'm predicting that, as before, all the files will actually be fine.
SATA cables are my next target. I've got some new ones arriving today, so I'll fit them later and report back.
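Once the cables are swapped, the plan for a clean retest (a sketch; the `retest_pool` helper is mine, and the pool name comes from the status output above) is to zero the counters and scrub again, so any new errors can't be confused with old ones:

```shell
# retest_pool: clear the pool's READ/WRITE/CKSUM counters and error
# list, then start a fresh scrub that re-reads and verifies every block.
retest_pool() {
  pool=$1
  zpool clear "$pool"
  zpool scrub "$pool"
}

# e.g. retest_pool vault; then watch progress with: zpool status -v vault
```

If the counters stay at zero through a full scrub with the new cables, that points the finger firmly at the old cabling.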
 