Corrupt data on pool with single checksum failure in RAIDZ2

Status
Not open for further replies.

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
I have the following pool, made up of 36 4TB hard drives. It's a backup pool, so it receives ZFS streams. It is getting close to full (73%), but this morning I received messages that the backups had failed (most likely because the pool couldn't receive the massive data stream), and then there is this error.

There seem to be only 4 checksum errors, all on a single drive, and the vdevs are RAIDZ2, so I don't understand why it would report massive data corruption. Does ZFS not handle the drives running full? If so, that is really, really bad.

Code:
 pool: Backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 0 in 0h25m with 0 errors on Sun Jun 11 03:27:43 2017
config:
	NAME											STATE	 READ WRITE CKSUM
	Backup										  ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/4a03c5a7-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4a902559-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4b1a5dc7-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4b9e58e2-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4c20c17b-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/63a6b1b9-b582-11e6-a19e-003048f15c54  ONLINE	   0	 0	 0
		gptid/05a36a2c-c934-11e6-b962-003048f15c54  ONLINE	   0	 0	 0
		gptid/4dc3255f-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4e53bf06-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4ee405e5-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4f730710-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/500440ad-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/50ee8494-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/51790b55-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5202efc1-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/528ee10a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/531f8b1a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/53ace5a0-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/544886ba-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 4
		gptid/54e25596-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/557ea40a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5616cb43-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/56af5efc-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/574b5f60-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	  raidz2-2									  ONLINE	   0	 0	 0
		gptid/583af6e1-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/58d02102-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/59698b70-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/59fbeaf8-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5a8f22f5-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5b21f5d8-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5bb3b29d-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5c4abf1e-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5cd6d8ab-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5d616609-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5de5c744-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5e6f1c39-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	logs
	  gptid/60814fd1-1e8f-11e6-ad7f-003048f15c54	ONLINE	   0	 0	 0
	cache
	  gptid/5f72a8b4-1e8f-11e6-ad7f-003048f15c54	ONLINE	   0	 0	 0

errors: Permanent errors have been detected in the following files:
		<metadata>:<0x0>
		/var/db/system/update/base-os-9.10.2-U5-3c693eb3ffa689371074d2ddaab93d59.tgz
		<0x9f0e>:<0x0>
		<0x9f0e>:<0x59f217>
		Backup/.system/syslog-5d784362f850449eb03d556b65045a45:<0x55>
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-6/cpu-interrupt.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da20.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-ada1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da7p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-ada1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da7p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da21p1.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-10/cpu-interrupt.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da12/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da8p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_cp-c.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da13/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da8p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da9p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da0p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da0p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da16/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-7/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-8/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da1p1.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_latency-da10p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_size-data_size.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da10p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-13/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_l2-l2_misses.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_latency-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da21/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_l2-l2_feeds.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da22/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da23/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da13p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_queue-da14p1.eli.rrd
...
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da6p1.eli.rrd
		<0xffffffffffffffff>:<0x1725a5c>
		<0xffffffffffffffff>:<0x1725a5f>

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:
	NAME		STATE	 READ WRITE CKSUM
	freenas-boot  ONLINE	   0	 0	 0
	  mirror-0  ONLINE	   0	 0	 0
		ada0p2  ONLINE	   0	 0	 0
		ada1p2  ONLINE	   0	 0	 0
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
The 4 checksum errors on a single drive might be a clue, but they are not the cause of the damaged data. Being full would also not cause damaged data.

Are there regular scrubs? Is the RAM ECC?

You may want to save the error list to a file, run zpool clear Backup, and then start a new scrub to see whether the same errors are reported when it is done, and what the error counts are on each device.
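
Something along these lines would do it; the output file path is just an example:

Code:
# save the current error list before clearing it
zpool status -v Backup > /root/backup-pool-errors.txt

# reset the error counters, then start a fresh scrub
zpool clear Backup
zpool scrub Backup

# check progress, and compare the reported errors once it finishes
zpool status -v Backup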

There is a good chance there is metadata corruption in the pool, and the pool would have to be rebuilt to clear it. But having an idea of the cause is important, so you should probably post system specs to see if anything raises concerns.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I'm curious to see how this unfolds.
I've never seen anything like it before.

Was the system dataset relocated to the pool?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I'm curious to see how this unfolds.
I've never seen anything like it before.

Was the system dataset relocated to the pool?
This is my question as well: why is it all system dataset stuff?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
scan: resilvered 0 in 0h25m with 0 errors on Sun Jun 11 03:27:43 2017
Something very bad is going on here.

ZFS felt the need to resilver, but didn't actually correct anything and stopped after a mere 25m, on a very large, very full pool.
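
If anyone wants to dig into what kicked that off, the pool history should show it. A rough sketch, not output from this system:

Code:
# administrative commands (replace, online, clear, ...) logged against the pool
zpool history Backup | tail -n 20

# include internally logged events such as scrub and resilver starts/finishes
zpool history -i Backup | tail -n 40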
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
<metadata>:<0x0>
<0x9f0e>:<0x0>
<0x9f0e>:<0x59f217>

Looks like pool corruption. Luckily, the only data that looks corrupt is from the system dataset.

Time to back up and restore.

As to why this happened, I imagine there might be a bug where the metadata couldn't be written in triplicate once the HDs were full.
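
A rough sketch of such a rebuild, assuming a second pool is available to receive the data (NewBackup is a made-up name):

Code:
# take a recursive snapshot of everything on the damaged pool
zfs snapshot -r Backup@migrate

# replicate the whole pool, including child datasets and their properties,
# into the replacement pool
zfs send -R Backup@migrate | zfs recv -duF NewBackup

# optionally, check how metadata redundancy is configured (property
# availability may vary with the ZFS version in use)
zfs get redundant_metadata,copies Backup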
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I imagine there might be a bug where the metadata couldn't be written in triplicate once the HDs were full.
Yet this should not be happening anywhere close to a 73% full pool. If it had been 98% full, things would look a lot different.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
The pool was full at some point because it receives ZFS sends, but I deleted a bunch of stuff. Either way, I don't think a full pool would cause corruption. Everything is ECC, and the hardware has been running for close to a decade now without any major issues, although the drives have been replaced. The controllers are Areca ARC-1680. The system only has 12GB of RAM (ECC).
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You have 36x 4TB drives in your pool, 12GB of RAM, and the pool at some point reached 100% or near-100% capacity? Did I read that right?

If that's correct, sir, you are so far into edge-case territory that I'm not sure how much conclusive help we can give you. We'd be half-guessing with anything we said.

However, none of the explanations advanced so far make sense. What "mode" do you have that Areca controller in?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You're running a 144TB pool on 12GB of RAM. :eek:

This has to be the most unbalanced setup I've seen since I've been a member of this forum. And you still haven't answered this:
Hardware specs?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Motherboard: Supermicro X8DT3
Enclosure: 8U, 50-drive SuperMicro enclosure w/ 4x 750W power supplies
CPUs: 2x Intel Xeon E5520 @ 2.27GHz
RAM: 6x Kingston 2GB modules, PC3-10600E unbuffered ECC DDR3-1333 @ 1066MHz
Drives: HGST HDS724040ALE640
SLOG: 2x Intel X25-E @ 32GB (SLC)
Cache: 2x Intel X25-M @ 160GB (MLC)
Boot: 2x Innodisk 16GB SATA-DOM
Controllers: 2x Areca ARC-1680 + 1x onboard LSI (Broadcom) SAS1068E, 1GB of cache, although it's set up as passthrough. One disk is giving SMART errors, and it's the same drive showing the checksum errors.

The problem with upgrading RAM is that it's nearly impossible to find a module that matches this system; apparently you can't mix different brands or types, because the system won't boot, so I would have to replace all 12GB. This system was purchased in late 2009, when it was "the latest and greatest" (from a SuperMicro vendor) and 12GB was considered a lot. It has run Solaris 11, Nexenta, and FreeNAS without an issue for years.

There is a scrub every month. The issue seems to have happened after I started deleting some major ZFS trees. I've never had issues with ZFS either; ZFS was designed to handle multiple failures and to flag faulty hard drives and even memory.

Even so, I've never had an issue with the system being unstable; the ARC hit rate is close to 90%, and the system is only used for tertiary backups (not a media center or anything, just ZFS sends to it).

The hard drives did get full for a short period of time: there was ~75% free, and a large ZFS send (a single snapshot with ~20TB of new data) was in progress and failed. After I got that alert, I started freeing space by deleting some old ZFS datasets, and after deleting those, a scheduled scrub caught these errors. Right now I have 90.1T used and 6.18T free; my largest dataset is only 54.1T.
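
For reference, a rough way to check how full and fragmented the pool actually is (the column selection below is just an example):

Code:
# overall pool size, allocation and fragmentation
zpool list -o name,size,allocated,free,capacity,fragmentation Backup

# per-dataset space accounting, including snapshot usage
zfs list -o space -r Backup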

The funniest thing happened: I went back to check, and there is a scrub running, but all my error counts are 0 again. I did play around with FreeNAS Corral, and the system is now running FreeNAS 11. I think I replaced a drive during that time as well, but all of that went off without a hitch or any failures or bad files.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
So after removing the bad drive (the one with SMART errors) and running a scrub again, all the errors are gone (except the warning that a drive is missing), and the permanent data errors have disappeared too. Not sure what was going on. I sent the drive back to the manufacturer for replacement.
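
For anyone who runs into something similar, a rough way to confirm a suspect drive's SMART state before pulling it (the device name below is made up):

Code:
# full SMART report for one drive
smartctl -a /dev/da21

# quick overall health verdict and the drive's logged errors
smartctl -H /dev/da21
smartctl -l error /dev/da21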
 