The volume xxxx (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error

mattlach · May 16, 2014

Hey,

I got this error today, and I'm not quite sure what to make of it. I googled but didn't find much. I'd appreciate some input.

At first I thought it meant one of my drives had failed, but I can't find anything in the GUI to suggest which one.

A zpool status command is equally unhelpful.

Code:

  pool: RAIDz2-01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 7h45m with 0 errors on Sun May 11 12:45:39 2014
config:
 
        NAME                                            STATE     READ WRITE CKSUM
        RAIDz2-01                                       ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6e7ee1e5-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/6f7e6a37-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/70634851-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/715dc2c3-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/72786009-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/73773c11-264a-11e2-a6f8-000c29733b59  ONLINE       0     0   663
 
errors: No known data errors

Any thoughts?

Does the checksum figure of 663 indicate that the last drive is the problem? How do I "determine if the device needs to be replaced"?

Any input would be helpful.

Thanks,
Matt
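
For larger pools it can help to filter the status output rather than read every row. The following is only a sketch: `zpool_error_devices` is a made-up helper name, and it assumes the counters are printed as plain integers (abbreviated counts such as "1.2K" are not handled).

```shell
# Hypothetical helper: print pool members whose READ, WRITE or CKSUM
# counter is non-zero. Reads `zpool status` output on stdin.
zpool_error_devices() {
    awk '
        # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
        # Header and status lines are skipped because fields 3-5 are not numeric.
        NF >= 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
            if ($3 + $4 + $5 > 0)
                printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Typical use on the NAS itself:
#   zpool status RAIDz2-01 | zpool_error_devices
```

Against the output above this would single out the last gptid, the only row with a non-zero CKSUM column.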

mattlach · May 16, 2014

I was actually planning on replacing the two 2TB drives with 3TB ones in the near future, so if it needs to be replaced it's not a big deal, but how do I know which drive gptid/73773c11-264a-11e2-a6f8-000c29733b59 corresponds to? The list in the gui has the serial numbers, and that is what would be useful to determine which physical disk is the problem... Even knowing which da device that corresponds to would be helpful.

Much obliged,
Matt

mattlach · May 16, 2014

Never mind, I figured this part out, there is a volume status button on the bottom I had missed. So now I know it is indeed my da6 drive that is the problem, and I can locate and physically remove it in my server.
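
For anyone who prefers the shell to the volume status button: on FreeBSD, `glabel status` lists each label next to the provider it lives on, so the lookup can be scripted. This is a rough sketch; `gptid_to_provider` is a hypothetical helper name, and the exact partition suffix it returns (something like `da6p2`) depends on how the disk was partitioned.

```shell
# Hypothetical helper: given a gptid label, print the provider it maps to.
# Reads `glabel status` output (columns: Name, Status, Components) on stdin.
gptid_to_provider() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# Typical use on the NAS itself:
#   glabel status | gptid_to_provider gptid/73773c11-264a-11e2-a6f8-000c29733b59
```

Dropping the partition suffix from the result gives the da device to look up in the GUI's disk list, which shows the serial numbers.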

So now what I need to know is: is the number 663 under checksum a clear indication that the drive is bad? What does it even mean? Is it a count, as in 663 checksum errors? That sounds kind of high. Does it make sense to replace the drive ASAP, or should I monitor it and see if the number goes up?

Appreciate any help!

--Matt

alexg · May 16, 2014

Run a long SMART test on it. It could be a bad SATA cable or a memory issue. Do you have ECC memory?

mattlach · May 16, 2014

Alright, so there is definitely something funky with the drive. The SMART checksum is off, and SMART thinks the drive is 1.12 PB (it is a 2 TB drive).

So I ran a short test:

Code:

[root@freenas] ~# smartctl -t short /dev/da6
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
Warning! Drive Identity Structure error: invalid SMART checksum.
Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri May 16 22:13:02 2014
 
Use smartctl -X to abort test.

Followed by a report:

Code:

smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-02MVWB0  "   "   "   "   "
Serial Number:    "   "WD-WMAZC0182178
LU WWN Device Id: 5 2014ee 6578e34b9
Firmware Version: 50.0AB70
User Capacity:    1,127,900,306,038,784 bytes [1.12 PB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Fri May 16 22:15:27 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x85)Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:(36000) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:(   2) minutes.
Extended self-test routine
recommended polling time:( 253) minutes.
Conveyance self-test routine
recommended polling time:(   5) minutes.
SCT capabilities:      (0x3035)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   163   021    Pre-fail  Always       -       5833
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4553
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31451
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       119
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   073   075   000    Old_age   Always       -       377498
194 Temperature_Celsius     0x0022   117   094   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6
 
SMART Error Log Version: 1
No Errors Logged
 
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     30939         -
# 2  Short offline       Completed without error       00%     30938         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It says no errors logged, but I don't know if I can trust SMART on this disk, since it gives me a SMART checksum error and is clearly corrupted: it thinks it's sitting on a 1.12 petabyte drive.
Any thoughts? Would a long smart test make sense?
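
When comparing several disks, pulling a handful of raw values out of the attribute table is quicker than reading the full report. A minimal sketch follows; `smart_key_attrs` is a made-up name and the choice of attribute IDs (5, 193, 197, 198, 199) is just an illustrative selection. On a disk that reports invalid SMART checksums, as this one does, the numbers themselves may not be reliable.

```shell
# Hypothetical helper: print ID, name and raw value for a few SMART
# attributes. Reads `smartctl -A` (or `smartctl -a`) output on stdin.
smart_key_attrs() {
    awk '
        # Attribute rows have at least 10 columns and a hex FLAG in column 3;
        # this keeps out the selective self-test table, whose rows also
        # start with small integers.
        NF >= 10 && $3 ~ /^0x/ && $1 ~ /^(5|193|197|198|199)$/ {
            print $1, $2, $10
        }
    '
}

# Typical use on the NAS itself:
#   smartctl -A /dev/da6 | smart_key_attrs
```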

alexg · May 16, 2014

That disk size doesn't make sense. That is really strange. However, the bigger issue is that you have a load cycle count in the 300K range. I assume you had not been reading these forums when you decided to use Green drives.

mattlach · May 16, 2014

alexg said:
That disk size doesn't make sense. That is really strange. However, the bigger issue is that you have a load cycle count in the 300K range. I assume you had not been reading these forums when you decided to use Green drives.

Yeah, I know. Not the best drives for this. I had a bunch of greens already when I decided to build my FreeNAS box. Rather than go out and buy all new drives I figured I'd just use what I had, and if any failed, I'd replace them. It took 3 years, but that time just arrived :p

That's what redundancy is for, right? :p

All in all not TOO terrible IMHO. I got 3+ years out of a drive not recommended for this purpose. After 3-4 years I'm usually ready for an upgrade anyway...

As the drives fail, the replacements are going to be WD Reds.

mattlach · May 16, 2014

Oh and also, I tried booting to DOS and using the wdidle3 command on the drives a while back, but it never detected the drives, presumably because I didn't have the DOS drivers for my LSI controller.

I also tried to compile that wdidle3-tools package, but it's distributed in source only, and I couldn't for the life of me get gcc to install in FreeNAS, and couldn't figure out how to get gcc on my linux desktop to compile a BSD binary.

I had this vague plan in my head to take the drives out of the server, put them in my desktop and try to run the utility from there, but it was a major hassle, and a few years later, I still haven't gotten around to it :/

cyberjock · May 16, 2014

I've seen PB drives before.. your firmware is trashed. Just do an RMA and/or replace it.

mattlach · May 17, 2014

cyberjock said:
I've seen PB drives before.. your firmware is trashed. Just do an RMA and/or replace it.

I'm afraid an RMA is out of the question. This drive is about 2 years out of warranty. I do have a replacement red drive on the way though, and not a minute too soon, as the array degraded today:

Code:

[root@freenas] ~# zpool status
  pool: RAIDz2-01
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 7h45m with 0 errors on Sun May 11 12:45:39 2014
config:
 
        NAME                                            STATE     READ WRITE CKSUM
        RAIDz2-01                                       DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/6e7ee1e5-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/6f7e6a37-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/70634851-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/715dc2c3-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            gptid/72786009-264a-11e2-a6f8-000c29733b59  ONLINE       0     0     0
            9092068756242309220                         UNAVAIL      0     0     0  was /dev/gptid/73773c11-264a-11e2-a6f8-000c29733b59
 
errors: No known data errors

This is why I used RAIDz2. I still have a drive worth of redundancy while I wait for the replacement! :)

cyberjock · May 17, 2014

Yeah.. good choice with RAIDZ2.

mattlach · May 21, 2014

Replacement drive just arrived!

It's a WD Red, and just to be sure, I decided to run the drive update utility from my Linux desktop to make sure I wouldn't have the head parking issue:

Code:

$ sudo ./wd5741x64 -D1
WD5741 Version 1
Update Drive
Copyright (C) 2013 Western Digital Corporation
 
 
WDC WD40EFRX-68WT0N0   80.00A80   Drive update not needed

So apparently the drive I got ships with the firmware update already installed!

mattlach · May 21, 2014

A word to the wise.

Label your disks!

I just spent way too much time trying to find which disk had the serial number that was listed as failed.

WD in their wisdom put a serial number label on the back side of the drive, but not on the front where the connectors are...

mattlach · May 28, 2014

Exporting, reinstalling and importing again did the trick!

I have a question though. In my volumes in the GUI I now have three extra lines right under the volume, all listed as volume/.system/something.

Did these appear in error, or are they supposed to be there? I don't remember seeing them previously, but all this also coincided with me upgrading from 9.2.0 to 9.2.1.5, so I don't know what is supposed to be different.

Thanks,
Matt
