Data Corruption in ZFS

mfrabotta · Apr 16, 2013

I have a FreeNAS box that has been running for about a year.

The box hasn't shut off, frozen, been hard reset, or suffered any other failure that I know of.

I ran a scrub yesterday, and was greeted by a "Scrub Repaired 0 in 12h48m with 0 errors."
All well and good, right?
The next morning I see that I have a new message "Errors: 1 data errors, use '-v' for a list."
And up above I see reference to "ZFS-8000-8A".
So I Google this, and I see a lot of people face issues when replacing drives, or dealing with drive failures.
My problem is that I have had zero drive failures reported, nor have I attempted to replace one recently.

So, what can I do?
I have no idea what my next steps are. I currently have no backup of the data, though I cannot stress its importance enough.
That is really backwards, I know. I have already ordered the needed hardware to create the second NAS and start replicating snapshots.
Unfortunately this doesn't help me now, and I only realize my folly through my pain now.

Any help would be appreciated.
Thank you guys so much in advance!

The box is:
AMD FX-8120 Eight Core Processor
32 GB of RAM.
FreeNAS installed on 8 GB flash drive.
2 - 2 port Intel NICs
Running: "FreeNAS-8.3.0-BETA1-x64 (r12054)"

My zpool status output:

pool: zp1
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scan: scrub repaired 0 in 12h48m with 0 errors on Fri Apr 12 23:45:56 2013
config:

        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     4
          raidz1-0  ONLINE       0     0     9
            ada1    ONLINE       0     0     2
            ada3    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada4    ONLINE       0     0     2
            ada5    ONLINE       0     0     1
        logs
          ada6      ONLINE       0     0     0
        cache
          ada0      ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

My zpool status -v output:

errors: Permanent errors have been detected in the following files:
zp1/VMRaid02:<0X1>
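
For anyone reading along, the per-device CKSUM column in the status output above can be pulled out mechanically. The following is only a rough sketch: the function name `devices_with_cksum_errors` and the `STATUS_TEXT` constant are made up for illustration, the rows are copied from the output above, and it assumes plain integer counters (abbreviated counts such as "1.2K" are not handled).

```python
import re

# Config section copied from the `zpool status` output in this post.
STATUS_TEXT = """\
NAME STATE READ WRITE CKSUM
zp1 ONLINE 0 0 4
raidz1-0 ONLINE 0 0 9
ada1 ONLINE 0 0 2
ada3 ONLINE 0 0 0
ada2 ONLINE 0 0 0
ada4 ONLINE 0 0 2
ada5 ONLINE 0 0 1
logs
ada6 ONLINE 0 0 0
cache
ada0 ONLINE 0 0 0
"""

# name, state, then three integer counters (READ, WRITE, CKSUM)
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_cksum_errors(status_text):
    """Return {name: cksum_count} for every row whose CKSUM counter is nonzero."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header row, "logs"/"cache" labels, blank lines
        name, _state, _read, _write, cksum = match.groups()
        if int(cksum) > 0:
            flagged[name] = int(cksum)
    return flagged


if __name__ == "__main__":
    for name, count in devices_with_cksum_errors(STATUS_TEXT).items():
        print(f"{name}: {count} checksum errors")
```

Run against the output above, this flags the pool, the raidz1 vdev, and three of the five data disks (ada1, ada4, ada5). Errors spread across several disks at once may point to something the disks share rather than to a single failing drive, but that is an inference, not something the counters prove.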

My gpart show output:
=>      63  15435713  da0  MBR  (7.4G)
        63   1930257    1  freebsd  [active]  (942M)
   1930320        63       - free -  (31k)
   1930383   1930257    2  freebsd  (942M)
   3860640      3024    3  freebsd  (1.5M)
   3863664     41328    4  freebsd  (20M)
   3904992  11530784       - free -  (5.5G)

=>      0  1930257  da0s1  BSD  (942M)
        0       16         - free -  (8.0k)
       16  1930241      1  !0  (942M)

Thank you again guys.

cyberjock · Apr 16, 2013

I'd do a RAM test if I were you.

9C1 Newbee · Apr 16, 2013

Just curious, is this why ECC RAM is good to have?

cyberjock · Apr 16, 2013

Yep. If you read the technical stuff on ZFS, ZFS relies on the RAM being trustworthy for ZFS to work. This reliance on RAM allows you to deal with corruption at virtually all other places. If your RAM goes bad (and you don't have ECC to correct it) then all the building blocks for ZFS come crashing down. :(

Now the real question is "How trustworthy is non-ECC RAM?" I've seen 6 sticks of RAM that were bad in more than 20 years of computing, 4 in the last 4 months! If you are confident that your RAM won't go bad, non-ECC is fine. But you are taking that small gamble.

The first FreeNAS machine I built was non-ECC because that's what I already had. My second one had ECC because that's what it had. If I were building one from scratch, I'd always go ECC.

Edit: Don't assume your RAM is bad. CKSUM (checksum) errors are something that isn't well explained in the documents I've found, so it's only an educated guess. Your problem may not be RAM related, but that's where I'd look first.
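
To make the point about RAM concrete, here is a toy illustration of why a checksum catches corruption that happens after the checksum was computed, but not corruption that was already in memory beforehand. This is only a sketch, not how ZFS is implemented: SHA-256 from Python's `hashlib` stands in for whichever checksum the pool uses (ZFS defaults to fletcher4), and the helper names `checksum` and `flip_bit` are invented for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in checksum for a data block (SHA-256 here, purely for illustration)."""
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with a single bit inverted."""
    data = bytearray(block)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


original = b"important VM data" * 4

# Case 1: the block is checksummed while still good, then one bit flips
# somewhere later (disk, cable, controller). Re-checking catches it.
stored_sum = checksum(original)
read_back = flip_bit(original, 5)
detected = checksum(read_back) != stored_sum

# Case 2: the bit flips in RAM *before* the checksum is computed.
# The checksum now describes the damaged block, so re-checking passes.
in_ram = flip_bit(original, 5)
stored_sum_of_bad_block = checksum(in_ram)
undetected = checksum(in_ram) == stored_sum_of_bad_block

if __name__ == "__main__":
    print("corruption after checksumming detected:", detected)
    print("corruption before checksumming goes unnoticed:", undetected)
```

Both flags come out true: the first case is the kind of damage checksumming is designed to catch, while the second shows why a checksum cannot vouch for data that was already wrong in memory.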

9C1 Newbee · Apr 16, 2013

Thank you for breaking it down for us. RAM is probably a very easy thing to test and a great place to start troubleshooting, I would think.

mfrabotta · Apr 17, 2013

Cyberjock,
Thank you for your reply! And the info on ECC RAM, which I have put in our new VMware server.

I had rerun the scrub last night, and the errors persisted.
When I got into work this morning I had everything shut down and restarted.
Upon restart all errors were gone.

Odd, right?
I want to run the Memory Check Friday night, as I believe it will take a while.
That being said, how do I do it? Would you suggest something like http://www.memtest86.com/?

Also, I have run smartctl short tests against each disk. Only one returned a value for the Raw_Read_Error_Rate.
I have been unable to find much data on what the expected value from WD is. Does there appear to be a problem with this printout to anyone?

~# smartctl -a /dev/ada1 -t short
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Black
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW33960928
LU WWN Device Id: 5 0014ee 2b1dbbd3a
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Apr 17 11:45:07 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (16680) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 172) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       12
  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       4325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       5844
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   108   103   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5844 -
# 2 Short offline Completed without error 00% 5844 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Can't start self-test without aborting current test (10% remaining),
add '-t force' option to override, or run 'smartctl -X' to abort test.

Thank you all so much again.
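
On the Raw_Read_Error_Rate question, one way to read the attribute table is to compare each normalized VALUE against its THRESH, since smartctl treats an attribute as failed when VALUE drops to or below THRESH. The sketch below is only an illustration: the function names `parse_attributes` and `below_threshold` are invented, and the rows are a subset copied from the printout above.

```python
# A few rows copied from the smartctl attribute table in this post.
SMART_ROWS = """\
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 12
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Parse attribute rows into {name: {"value", "worst", "thresh", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        }
    return attrs


def below_threshold(attrs):
    """Names of attributes whose normalized value is at or below a nonzero threshold."""
    return [
        name
        for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    ]


if __name__ == "__main__":
    attrs = parse_attributes(SMART_ROWS)
    print("failed attributes:", below_threshold(attrs))
    print("Raw_Read_Error_Rate:", attrs["Raw_Read_Error_Rate"])
```

For this disk nothing is at or below its threshold: Raw_Read_Error_Rate has a normalized value of 200 against a threshold of 51. How to interpret the raw count of 12 is vendor-specific, so this comparison only says the drive's own failure criterion has not been tripped.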
