Corrupt data on pool with single checksum failure in RAIDZ2

Status
Not open for further replies.

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
I have the following pool, made up of 36 4TB hard drives. It's a backup pool, so it receives ZFS streams. It is getting close to full (73%), but this morning I received messages that the backups had failed (most likely because the pool couldn't receive the massive data stream), and then there is this error.

There seem to be only 4 checksum errors, all on a single drive, and the vdevs are RAIDZ2, so I don't understand why it would report massive data corruption. Does ZFS not handle the drives running full? If so, that is really, really bad.

Code:
 pool: Backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 0 in 0h25m with 0 errors on Sun Jun 11 03:27:43 2017
config:
	NAME											STATE	 READ WRITE CKSUM
	Backup										  ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/4a03c5a7-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4a902559-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4b1a5dc7-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4b9e58e2-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4c20c17b-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/63a6b1b9-b582-11e6-a19e-003048f15c54  ONLINE	   0	 0	 0
		gptid/05a36a2c-c934-11e6-b962-003048f15c54  ONLINE	   0	 0	 0
		gptid/4dc3255f-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4e53bf06-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4ee405e5-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/4f730710-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/500440ad-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/50ee8494-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/51790b55-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5202efc1-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/528ee10a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/531f8b1a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/53ace5a0-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/544886ba-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 4
		gptid/54e25596-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/557ea40a-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5616cb43-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/56af5efc-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/574b5f60-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	  raidz2-2									  ONLINE	   0	 0	 0
		gptid/583af6e1-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/58d02102-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/59698b70-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/59fbeaf8-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5a8f22f5-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5b21f5d8-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5bb3b29d-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5c4abf1e-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5cd6d8ab-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5d616609-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5de5c744-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
		gptid/5e6f1c39-1e8f-11e6-ad7f-003048f15c54  ONLINE	   0	 0	 0
	logs
	  gptid/60814fd1-1e8f-11e6-ad7f-003048f15c54	ONLINE	   0	 0	 0
	cache
	  gptid/5f72a8b4-1e8f-11e6-ad7f-003048f15c54	ONLINE	   0	 0	 0

errors: Permanent errors have been detected in the following files:
		<metadata>:<0x0>
		/var/db/system/update/base-os-9.10.2-U5-3c693eb3ffa689371074d2ddaab93d59.tgz
		<0x9f0e>:<0x0>
		<0x9f0e>:<0x59f217>
		Backup/.system/syslog-5d784362f850449eb03d556b65045a45:<0x55>
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-6/cpu-interrupt.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da20.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-ada1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da7p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-ada1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da7p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da21p1.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-10/cpu-interrupt.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da12/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da8p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_cp-c.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da13/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da8p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da9p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da0p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da0p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da16/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-7/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da21.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-8/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da1p1.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_latency-da10p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_size-data_size.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da10p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/cpu-13/cpu-idle.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops-da1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_bw-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_l2-l2_misses.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_ops_rwd-da1p2.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_latency-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da11p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da21/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/zfs_arc_v2/gauge_arcstats_raw_l2-l2_feeds.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da22/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/disk-da23/disk_octets.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da13p1.eli.rrd
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_queue-da14p1.eli.rrd
...
		/var/db/system/rrd-5d784362f850449eb03d556b65045a45/zombie.rcbi.rochester.edu/geom_stat/geom_busy_percent-da6p1.eli.rrd
		<0xffffffffffffffff>:<0x1725a5c>
		<0xffffffffffffffff>:<0x1725a5f>

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:
	NAME		STATE	 READ WRITE CKSUM
	freenas-boot  ONLINE	   0	 0	 0
	  mirror-0  ONLINE	   0	 0	 0
		ada0p2  ONLINE	   0	 0	 0
		ada1p2  ONLINE	   0	 0	 0
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
The 4 checksum errors on a single drive might be a clue, but they are not the cause of the damaged data. Being full would also not cause damaged data.

Are there regular scrubs? Is the RAM ECC?

You may want to save the error list to a file, run zpool clear Backup, and then start a new scrub to see whether the same errors are reported when it is done, and what the error counts are on each device.
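
Something along these lines would do it; the output file path is just an example:

Code:
# save the current error list before clearing it
zpool status -v Backup > /root/backup-pool-errors.txt

# reset the error counters, then start a fresh scrub
zpool clear Backup
zpool scrub Backup

# check progress, and compare the reported errors once it finishes
zpool status -v Backup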

There is a good chance there is metadata corruption in the pool, and the pool would have to be rebuilt to clear it. But having an idea of the cause is important, so you should probably post system specs to see if anything raises concerns.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I'm curious to see how this unfolds.
I've never seen anything like it before.

Was the system dataset relocated to the pool?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I'm curious to see how this unfolds.
I've never seen anything like it before.

Was the system dataset relocated to the pool?
This is my question as well: why is it all system dataset stuff?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
scan: resilvered 0 in 0h25m with 0 errors on Sun Jun 11 03:27:43 2017
Something very bad is going on here.

ZFS felt the need to resilver, but didn't actually correct anything and stopped after a mere 25m, on a very large, very full pool.
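
If anyone wants to dig into what kicked that off, the pool history should show it. A rough sketch, not output from this system:

Code:
# administrative commands (replace, online, clear, ...) logged against the pool
zpool history Backup | tail -n 20

# include internally logged events such as scrub and resilver starts/finishes
zpool history -i Backup | tail -n 40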
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
<metadata>:<0x0>
<0x9f0e>:<0x0>
<0x9f0e>:<0x59f217>

Looks like pool corruption. Luckily, the only data that looks corrupt is from the system dataset.

Time to back up and restore.

As to why this happened, I imagine there might be a bug where the metadata couldn't be written in triplicate once the HDs were full.
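
A rough sketch of such a rebuild, assuming a second pool is available to receive the data (NewBackup is a made-up name):

Code:
# take a recursive snapshot of everything on the damaged pool
zfs snapshot -r Backup@migrate

# replicate the whole pool, including child datasets and their properties,
# into the replacement pool
zfs send -R Backup@migrate | zfs recv -duF NewBackup

# optionally, check how metadata redundancy is configured (property
# availability may vary with the ZFS version in use)
zfs get redundant_metadata,copies Backup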
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I imagine there might be a bug where the metadata couldn't be written in triplicate once the HDs were full.
Yet this should not be happening anywhere close to a 73% full pool. If it had been 98% full, things would look a lot different.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
The pool was full at some point because it receives ZFS sends, but I deleted a bunch of stuff. Either way, I don't think a full pool would cause corruption. Everything is ECC, and the hardware has been running for close to a decade now without any major issues, although the drives have been replaced. The controllers are Areca ARC-1680. The system only has 12GB of RAM (ECC).
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You have 36x 4TB drives in your pool, 12GB of RAM, and the pool at some point reached 100% or near-100% capacity? Did I read that right?

If that's correct, sir, you are so far into edge-case territory that I'm not sure how much conclusive help we can give you. We'd be half-guessing with anything we said.

However, none of the explanations advanced so far make sense. What "mode" do you have that Areca controller in?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You're running a 144TB pool on 12GB of RAM. :eek:

This has to be the most unbalanced setup I've seen since I've been a member of this forum. And you still haven't answered this:
Hardware specs?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Motherboard: Supermicro X8DT3
Enclosure: 8U, 50-drive SuperMicro enclosure w/ 4x 750W power supplies
CPUs: 2x Intel Xeon E5520 @ 2.27GHz
RAM: 6x Kingston 2GB modules, PC3-10600E unbuffered ECC DDR3-1333 @ 1066MHz
Drives: HGST HDS724040ALE640
SLOG: 2x Intel X25-E @ 32GB (SLC)
Cache: 2x Intel X25-M @ 160GB (MLC)
Boot: 2x Innodisk 16GB SATA-DOM
Controllers: 2x Areca ARC-1680 + 1x onboard LSI (Broadcom) SAS1068E, 1GB of cache, although it's set up as passthrough. One disk is giving SMART errors, and it's the same drive showing the checksum errors.

The problem with upgrading RAM is that it's nearly impossible to find a module that matches this system; apparently you can't mix different brands or types, because the system won't boot, so I would have to replace all 12GB. This system was purchased in late 2009, when it was "the latest and greatest" (from a SuperMicro vendor) and 12GB was considered a lot. It has run Solaris 11, Nexenta, and FreeNAS without an issue for years.

There is a scrub every month. The issue seems to have happened after I started deleting some major ZFS trees. I've never had issues with ZFS either; ZFS was designed to handle multiple failures and to flag faulty hard drives and even memory.

Even so, I've never had an issue with the system being unstable; the ARC hit rate is close to 90%, and the system is only used for tertiary backups (not a media center or anything, just ZFS sends to it).

The hard drives did get full for a short period of time: there was ~75% free, and a large ZFS send (a single snapshot with ~20TB of new data) was in progress and failed. After I got that alert, I started freeing space by deleting some old ZFS datasets, and after deleting those, a scheduled scrub caught these errors. Right now I have 90.1T used and 6.18T free; my largest dataset is only 54.1T.
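
For reference, a rough way to check how full and fragmented the pool actually is (the column selection below is just an example):

Code:
# overall pool size, allocation and fragmentation
zpool list -o name,size,allocated,free,capacity,fragmentation Backup

# per-dataset space accounting, including snapshot usage
zfs list -o space -r Backup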

The funniest thing happened: I went back to check, and there is a scrub running, but all my error counts are 0 again. I did play around with FreeNAS Corral, and the system is now running FreeNAS 11. I think I replaced a drive during that time as well, but all of that went off without a hitch or any failures or bad files.
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
So after removing the bad drive (the one with SMART errors) and running a scrub again, all the errors are gone (except the warning that a drive is missing), and the permanent data errors have disappeared too. Not sure what was going on. I sent the drive back to the manufacturer for replacement.
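
For anyone who runs into something similar, a rough way to confirm a suspect drive's SMART state before pulling it (the device name below is made up):

Code:
# full SMART report for one drive
smartctl -a /dev/da21

# quick overall health verdict and the drive's logged errors
smartctl -H /dev/da21
smartctl -l error /dev/da21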
 