Permanent Error in ZFS pool, need help!!!

danields32 · Aug 7, 2015

I need HELP, I can't figure this out. I'm new to FreeNAS and Linux.

I built a FreeNAS server a few months ago using three new 3 TB HDDs and an older 3 TB drive I had lying around.
After a few weeks, I got the alert "ZFS state is online: One or more devices has experienced an error resulting in data corruption. Applications may be affected."
The errors just so happened to be on the old HDD, so I bought a new one and replaced it. After the resilvering process everything was great.
ONE week later, I got the same problem (see my zpool status below), but it doesn't show which hard drive is having the problem, or the path to the file in question.
I am using ECC RAM.
All of the hardware in my server is now new:
Motherboard: SuperMicro x10sla-f
Processor: Intel Xeon E3-1231 v3
Memory: 2 x Crucial 16 GB kits, DDR3-1600 1.35 V ECC UDIMM
HDD: 3 x 3 TB HGST Deskstar NAS 7200 rpm 64 MB cache
1 x 3 TB WD Red Pro NAS WD3001FFSX 64 MB cache

Any help would be greatly appreciated, thank you.

[root@FREENAS] ~# zpool status -v
  pool: HOMESERVER
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 256K in 10h22m with 1 errors on Fri Jul 24 06:32:35 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        HOMESERVER                                      ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/c348d25d-1da5-11e5-9d8c-001e2a495414  ONLINE       0     0     0
            gptid/7fdd465d-cfd5-11e4-8e5e-1c6f65307ba6  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/80298e41-cfd5-11e4-8e5e-1c6f65307ba6  ONLINE       0     0     0
            gptid/80873a39-cfd5-11e4-8e5e-1c6f65307ba6  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x447c>:<0x16133>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 03:45:58 2015
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/f484000b-d979-11e4-bbfa-001e2a495414  ONLINE       0     0     0

errors: No known data errors

BigDave · Aug 9, 2015

danields32 said:
I built a FreeNAS server a few months ago using three new 3 TB HDDs and an older 3 TB drive I had lying around.
After a few weeks, I got the alert "ZFS state is online: One or more devices has experienced an error resulting in data corruption. Applications may be affected."
The errors just so happened to be on the old HDD, so I bought a new one and replaced it. After the resilvering process everything was great.
ONE week later, I got the same problem (see my zpool status below), but it doesn't show which hard drive is having the problem, or the path to the file in question.

Have you looked at each drive's recent SMART output?
Are the drive temps exceeding 40 degrees Celsius?
Are you running regular short and long SMART tests on all the drives?
Could your server afford the downtime to test your memory (3 or 4 passes of memtest)?
Do you have proper backups of the data?

A quick Google search found this forum post that may help... it's a bit old but may have relevance.

danields32 · Aug 10, 2015

Thank you for replying, BigDave, it's really appreciated.
The drive temps are good, I've run 4 passes of memtest with no issues, and I am running regular SMART tests.
Sorry, no backup solution yet, but I will be setting that up in the next month or so.

I ran smartctl -a on all the HDDs and found an error on the first HDD (ada0); see the results below. I'm not sure what the issue is. It looks like a corrupt file, maybe? But I don't know how to find it.

Thanks for your help.

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red Pro
Device Model: WDC WD3001FFSX-68JNUN0
Serial Number: WD-WMC130D4SHU4
LU WWN Device Id: 5 0014ee 0ae9e59be
Firmware Version: 81.00A81
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Aug 11 00:05:23 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
Total time to complete Offline
data collection: (31920) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 347) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 170 140 021 Pre-fail Always - 10483
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 143
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3379
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 140
16 Unknown_Attribute 0x0022 006 194 000 Old_age Always - 118602038433
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 118
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 43
194 Temperature_Celsius 0x0022 117 107 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 196 000 Old_age Always - 4064424
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 2572 hours (107 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 a8 1a df e0 Error: UNC at LBA = 0x00df1aa8 = 14621352

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 a8 1a df e0 00 5d+14:26:22.843 READ DMA
c8 00 00 a8 19 df e0 00 5d+14:26:22.842 READ DMA
c8 00 00 a8 18 df e0 00 5d+14:26:22.841 READ DMA
c8 00 00 a8 17 df e0 00 5d+14:26:22.840 READ DMA
c8 00 00 a8 16 df e0 00 5d+14:26:22.840 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2433 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
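
For anyone reading the error log entry above by hand: the LBA that smartctl prints can be rebuilt from the register bytes, and the age of the error follows from the power-on hours. The sketch below assumes a 28-bit LBA command (READ DMA, opcode c8) where the low nibble of DH holds the top four address bits; the function name is made up for illustration.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the "Error 1" line of the SMART error log.
lba = lba28_from_registers(sn=0xA8, cl=0x1A, ch=0xDF, dh=0xE0)
print(hex(lba), lba)  # 0xdf1aa8 14621352

# Age of the error in power-on hours: attribute 9 (Power_On_Hours)
# minus the lifetime recorded with the error.
hours_since_error = 3379 - 2572
print(hours_since_error)  # 807
```

The result matches the "UNC at LBA = 0x00df1aa8 = 14621352" line, and the error is roughly 800 power-on hours old.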

danb35 · Aug 11, 2015

Your post would be easier to read if you put the SMART output in [ code ] [ /code ] tags. That said, the results don't appear to show any serious problems, except that you aren't running SMART tests on this disk. You ran a short test 1000 hours ago, and that's it. You should schedule regular SMART tests on all your disks: short tests every few days to a week at most, long tests every week or two.

There is an error in the log, but it's 800 hours ago, and didn't result in updating any of the SMART attributes to show a failing sector. A long test on all of your disks would help narrow down the problem, but so far there's nothing obvious.

The error noted in zpool status is somewhere in the metadata. That means there isn't a particular file you can delete and restore. I don't know if there's a way to fix it without rebuilding the pool.
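
For reference, the `<0x447c>:<0x16133>` entry in the first zpool status is, as far as I understand it, a pair of hexadecimal IDs (dataset ID, then object ID) that ZFS prints when it can't resolve the object to a file path. Tools such as zdb report IDs in decimal, so converting is the first step if you want to dig further. A minimal Python sketch; the function name is only for illustration:

```python
import re


def parse_unresolved_error(entry: str) -> tuple[int, int]:
    """Return (dataset_id, object_id) from a '<0x..>:<0x..>' error entry."""
    match = re.fullmatch(
        r"<0x([0-9a-fA-F]+)>:<0x([0-9a-fA-F]+)>", entry.strip()
    )
    if match is None:
        raise ValueError(f"not an unresolved error entry: {entry!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


# Entry copied from the "errors:" section of the zpool status output.
dataset_id, object_id = parse_unresolved_error("<0x447c>:<0x16133>")
print(dataset_id, object_id)  # 17532 90419
```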

danields32 · Aug 11, 2015

Thanks for replying, danb35. I have the short test running every 2 days and the long test twice per month, but upon further investigation, they're running on all the HDDs except this one; for some reason I forgot to select it in the list.

I'm going to run them now, and will post when done, thanks again.

SweetAndLow · Aug 11, 2015

Don't run them that often. Run long tests every 2 weeks and short tests once or twice a week.

danb35 · Aug 11, 2015

Every day is definitely excessive for a long test; I think it's on the low end of reasonable for a short test. I run short tests daily and long tests weekly. I don't see any reason to run either test more often than that, and I expect you could easily double the interval without a problem.

cyberjock · Aug 11, 2015

Your problem is here...

199 UDMA_CRC_Error_Count 0x0032 200 196 000 Old_age Always - 4064424

4 million CRC errors. For comparison, Windows disables UDMA if you have 5 errors.

No doubt the CRC errors are your problem. You also couldn't have had corruption without both disks in a vdev having problems, so you have some combination of failures that led to bad data on both disks simultaneously.

Frankly, if I were you I'd start digging deep into your setup to figure out what actually *is* broken and replacing it before you lose more than a little metadata.
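
If you want to spot this kind of thing without reading the whole table by eye, the attribute rows are easy to parse. A rough Python sketch, assuming the ten-column layout shown in the smartctl output earlier in the thread; the list of watched IDs (5, 197, 198, 199) is my own pick for illustration, not an official set, and the function name is invented.

```python
def parse_smart_attributes(text: str) -> dict[int, tuple[str, int]]:
    """Map attribute ID -> (name, raw value) for rows of a SMART attribute table."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns, starts with a numeric ID,
        # and has a numeric raw value in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[int(fields[0])] = (fields[1], int(fields[9]))
    return attributes


# Rows copied from the SMART output posted for ada0.
sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 196 000 Old_age Always - 4064424
"""

WATCHED = (5, 197, 198, 199)  # assumption: attributes worth flagging when non-zero
attrs = parse_smart_attributes(sample)
flagged = {i: attrs[i] for i in WATCHED if i in attrs and attrs[i][1] > 0}
print(flagged)  # {199: ('UDMA_CRC_Error_Count', 4064424)}
```

Only attribute 199 is flagged here, which lines up with the sector counters (5, 197, 198) all reading zero on this disk.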

danields32 · Aug 11, 2015

SweetAndLow said:
Don't run them that often. Run long tests every 2 weeks and short tests once or twice a week.

Sorry, I wrote that wrong, short is every 2 days, and long is twice a month.

danields32 · Aug 14, 2015

cyberjock said:
Your problem is here...

199 UDMA_CRC_Error_Count 0x0032 200 196 000 Old_age Always - 4064424

4 million CRC errors. For comparison, Windows disables UDMA if you have 5 errors.

No doubt the CRC errors are your problem. You also couldn't have had corruption without both disks in a vdev having problems, so you have some combination of failures that led to bad data on both disks simultaneously.

Frankly, if I were you I'd start digging deep into your setup to figure out what actually *is* broken and replacing it before you lose more than a little metadata.

When I replaced the previously failed hard drive, I noticed that one of the SATA cables was broken. Could that be the cause of these errors? I replaced the cable at the time, so that shouldn't be an issue anymore.
Here is the result of the smartctl self-test:

Code:

smartctl -l selftest /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3428         -
# 2  Extended offline    Completed without error       00%      3395         -
# 3  Short offline       Completed without error       00%      2433         -

Ericloewe · Aug 14, 2015

danields32 said:
sata cables was broken

Textbook cause of UDMA CRC errors. These are transmission errors detected by the SATA physical layer.

cyberjock · Aug 14, 2015

What Ericloewe said. I'd do a scrub and see if you get any uncorrectable ZFS problems. ;)

danields32 · Aug 16, 2015

cyberjock said:
What Ericloewe said. I'd do a scrub and see if you get any uncorrectable ZFS problems. ;)

This is what I got after a scrub:

Code:

 
[root@FREENAS] ~# zpool status -v
  pool: HOMESERVER
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.19M in 11h24m with 1 errors on Sat Aug 15 22:42:32 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        HOMESERVER                                      DEGRADED     0     0     2
          mirror-0                                      DEGRADED     0     0     0
            gptid/c348d25d-1da5-11e5-9d8c-001e2a495414  ONLINE       0     0     0
            gptid/7fdd465d-cfd5-11e4-8e5e-1c6f65307ba6  DEGRADED     0     0   366  too many errors
          mirror-1                                      ONLINE       0     0     4
            gptid/80298e41-cfd5-11e4-8e5e-1c6f65307ba6  ONLINE       0     0    21
            gptid/80873a39-cfd5-11e4-8e5e-1c6f65307ba6  ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        HOMESERVER/HomeMedia@manual-20150628-recursive:/Miscellaneous/The American Revolution, Documentary, History/PBS.The.American.Revolution.3of6.XviD-AC3.avi

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 19 03:45:58 2015
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/f484000b-d979-11e4-bbfa-001e2a495414  ONLINE       0     0     0

errors: No known data errors

danields32 · Aug 16, 2015

P.S. - I'm moving out of state tomorrow, so I will be shutting the server down. It won't be back up for about a week.

cyberjock · Aug 17, 2015

I would delete the snapshot protecting that file. Of course, if the file still exists right now, you'd have to delete all of the snapshots up to the current snapshot and delete the file itself.

Other than that, you look like you might have lucked out and not have corruption of the metadata.
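
To see which snapshot is holding the damaged file, the error line from zpool status can be split into dataset, snapshot, and path. A small Python sketch, assuming the `dataset@snapshot:/path` form shown in the scrub output and that the first `:/` marks the start of the path; the function name is hypothetical. It only prints a `zfs destroy` command line for review and does not run anything.

```python
def split_error_path(entry: str) -> tuple[str, str, str]:
    """Split 'dataset@snapshot:/path' into (dataset, snapshot, path).

    The snapshot part is empty when the entry refers to the live filesystem.
    """
    target, sep, rest = entry.strip().partition(":/")
    if not sep:
        raise ValueError(f"no ':/' separator in {entry!r}")
    dataset, _, snapshot = target.partition("@")
    return dataset, snapshot, "/" + rest


# Entry copied from the "errors:" section after the scrub.
entry = (
    "HOMESERVER/HomeMedia@manual-20150628-recursive:"
    "/Miscellaneous/The American Revolution, Documentary, History/"
    "PBS.The.American.Revolution.3of6.XviD-AC3.avi"
)
dataset, snapshot, path = split_error_path(entry)
print(dataset)   # HOMESERVER/HomeMedia
print(snapshot)  # manual-20150628-recursive

if snapshot:
    # Printed for review only; destroying a snapshot cannot be undone.
    print(f"zfs destroy {dataset}@{snapshot}")
```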
