Failing Hard Drive? How do I know which one?

Status
Not open for further replies.

TheBlueDalek

Cadet
Joined
Jun 28, 2013
Messages
8
Hi all

I've been happily using FreeNAS for about 2yrs now, and never had a problem.
Last weekend, however, I logged into the admin GUI and it was complaining about an unknown error. Google returned a couple of results, and I ended up with this:

Code:
[root@freenas] ~# zpool status -x
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          raidz1                                        ONLINE       0     0     0
            gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0    96

errors: No known data errors


I spoke to one of my co-workers, who has been using FreeNAS for years, and he suggested scrubbing the array.
After the scrub, I ended up with the following:

Code:
[root@freenas] ~# zpool status storage
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 6h22m with 0 errors on Fri Jun 28 02:16:46 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          raidz1                                        ONLINE       0     0     0
            gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0     0
            gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE       0     0  180K  7.61G repaired

errors: No known data errors


It looks like one of my drives is failing. Question is, how do I know which one? I'm Linux competent, but have very limited BSD knowledge. In Linux, there is a command that shows all the drives including serial & model numbers - lshw.

Is there a similar command or utility that will show me what gptid/256c0c11-4b27-11e2-b9e0-50e54952401a is in terms of make / model / SN?

Many thanks in advance!
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Have you checked "volume status" under "active volumes" in the 'storage' section? It should list which /dev/[a]daX device it is.

Then 'view disks' will let you match that to a disk serial number.

How often is your scheduled scrub set up? 180,000 checksum errors is a lot. Either the drive is returning huge amounts of bad data, or something else weird is going on.

Do all the drives pass 'long' smart tests?

Can you paste a "smartctl -a -q noserial /dev/adaX" for whatever the 'problem' drive is?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
gpart list will match your gptid to device. Then look at the device in the FreeNAS GUI to get the serial number.
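
For anyone who would rather not scan the gpart output by eye, here is a minimal sketch of that lookup. It assumes the usual gpart list layout, where each provider's "N. Name:" line comes before its "rawuuid:" line. The function name gptid_to_dev is made up for illustration; it reads gpart list output on stdin and prints the matching partition name.

```shell
#!/bin/sh
# Hypothetical helper: map a gptid (the part after "gptid/") to its partition.
# Reads `gpart list` output on stdin; prints the partition name, e.g. ada4p2.
gptid_to_dev() {
    awk -v id="$1" '
        # Provider lines look like "2. Name: ada4p2"; remember the latest one.
        $2 == "Name:" { name = $3 }
        # The first rawuuid that matches belongs to the remembered provider.
        $1 == "rawuuid:" && $2 == id { print name; exit }
    '
}

# Usage on the FreeNAS box (left commented out so the sketch has no side effects):
#   gpart list | gptid_to_dev 256c0c11-4b27-11e2-b9e0-50e54952401a
```

The partition name minus its pN suffix (ada4p2 becomes ada4) is the disk to look up in the GUI or pass to smartctl.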
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403

TheBlueDalek

Cadet
Joined
Jun 28, 2013
Messages
8
Thanks for the help guys.

gpart list gives me:

Code:
2. Name: ada4p2
  Mediasize: 2998445415936 (2.7T)
  Sectorsize: 512
  Stripesize: 4096
  Stripeoffset: 0
  Mode: r1w1e2
  rawuuid: 256c0c11-4b27-11e2-b9e0-50e54952401a
  rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
  label: (null)
  length: 2998445415936
  offset: 2147549184
  type: freebsd-zfs
  index: 2
  end: 5860533134
  start: 4194432


This is a brand new (4mo) Seagate Barracuda.

Code:
[root@freenas] ~# smartctl -a -q noserial /dev/ada4
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Device Model:    ST3000DM001-9YN166
Firmware Version: CC4H
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jun 29 14:28:51 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (  600) seconds.
Offline data collection
capabilities:              (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  1) minutes.
Extended self-test routine
recommended polling time:      ( 255) minutes.
Conveyance self-test routine
recommended polling time:      (  2) minutes.
SCT capabilities:            (0x3085)    SCT Status supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      127495160
  3 Spin_Up_Time            0x0003  092  092  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      398
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  038  036  030    Pre-fail  Always      -      9938568106768
  9 Power_On_Hours          0x0032  096  096  000    Old_age  Always      -      3662
10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      16
183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0
189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0
190 Airflow_Temperature_Cel 0x0022  067  056  045    Old_age  Always      -      33 (Min/Max 29/44)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      7
193 Load_Cycle_Count        0x0032  098  098  000    Old_age  Always      -      4585
194 Temperature_Celsius    0x0022  033  044  000    Old_age  Always      -      33 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  123  000    Old_age  Always      -      2268097
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      227972569105024
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      5891963953900
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      4241206544186
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I've told it to do a full smart test and can post results later if needed.
If this is not a drive problem, the rest of my hardware is as follows:

Code:
Hostname    freenas.local
Build    FreeNAS-8.2.0-RELEASE-p1-x64 (r11950)
Platform    AMD A4-3400 APU with Radeon(tm) HD Graphics
Memory    7663MB
System Time    Sat Jun 29 14:30:19 EDT 2013
Uptime    2:30PM up 16:06, 1 user
Load Average    0.08, 0.03, 0.00
Connected through    192.168.1.122


I don't have the specific motherboard model number, but it is a Gigabyte full ATX.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Code:
199 UDMA_CRC_Error_Count    0x003e  200  123  000    Old_age  Always      -      2268097


You're sure it's not a bad cable or something?

I thought UDMA CRC errors usually pointed to the controller, cable, etc. rather than the drive itself?

That's very possibly where your checksum errors are coming from.

Have you tried bypassing the 5 bay enclosure thing? I had a 3 bay enclosure that would occasionally give me UDMA errors until I moved its connections over to SATA2 ports. When it was connected to SATA3 ports, I got intermittent errors on the drives. I wrote it off as the enclosure not being rated for SATA3.
 

TheBlueDalek

Cadet
Joined
Jun 28, 2013
Messages
8
It's not an 'enclosure', but rather a purpose built PC.
It could be a cable I guess. I'll try swapping it and see if that clears the issue.

I picked up a couple WD Red drives just in case it was a bad drive.. maybe this means I can take 'em back for a refund! :)
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Never mind. Must have been thinking of a different post.

You should really have scheduled scrubs. It will 'prove' that everything is good every time it does a scrub instead of relying on whenever you happen to read through the old data.

Definitely swap everything. Try a different cable, different sata port, different disk. I know you've done some of this switching around of things already.

Do any of the other drives show udma crc errors?
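
One rough way to check that from the shell is sketched below. It assumes the raw value is the tenth column of the attribute table, as in the smartctl output posted earlier in the thread, and that the pool's disks are ada0 through ada4 (only ada4 is confirmed here, so adjust the list). The function name crc_errors is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the raw UDMA_CRC_Error_Count (attribute 199).
# Reads smartctl attribute output on stdin.
crc_errors() {
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    awk '$1 == "199" { print $10 }'
}

# Usage on the FreeNAS box (device names are assumptions; commented out on purpose):
#   for d in ada0 ada1 ada2 ada3 ada4; do
#       printf '%s: ' "$d"
#       smartctl -A "/dev/$d" | crc_errors
#   done
```

A count that keeps climbing on one drive only, while the others stay at zero, would fit the cable or port theory for that drive's connection.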
 

TheBlueDalek

Cadet
Joined
Jun 28, 2013
Messages
8
How odd..

I simply restarted the system...

Code:
[root@freenas] ~# zpool status storage
  pool: storage
state: ONLINE
scrub: scrub completed after 3h3m with 0 errors on Sat Jun 29 01:29:13 2013
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        storage                                        ONLINE      0    0    0
          raidz1                                        ONLINE      0    0    0
            gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
            gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
            gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
            gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
 
errors: No known data errors


I picked up the WD Red drives at a very good price, so I may just hold on to them for now... y'know... just in case.
I'll also be keeping an eye out for any further errors.

Thanks again all!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I believe a reboot resets the error counters shown by zpool status to zero.

I'd do a RAM test if I were you. Not that there is ever much evidence that RAM is bad, but it's an easy and cheap test to do.
 

TheBlueDalek

Cadet
Joined
Jun 28, 2013
Messages
8
Alright, so I happened to log into the GUI and the green indicator was now flashing yellow once again...

Code:
[root@freenas] ~# zpool status storage
  pool: storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 3h3m with 0 errors on Sat Jun 29 01:29:13 2013
config:
 
    NAME                                            STATE    READ WRITE CKSUM
    storage                                        ONLINE      0    0    0
      raidz1                                        ONLINE      0    0    0
        gptid/242960b3-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
        gptid/2488f665-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
        gptid/24e5ce4f-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    0
        gptid/256c0c11-4b27-11e2-b9e0-50e54952401a  ONLINE      0    0    2
 
errors: No known data errors


I'll open the case tomorrow and swap out the cable. I don't have another SATA port available. Is FreeNAS smart enough to detect a drive change if I move drives around, i.e. change the port that they are connected to?

I don't believe it is a RAM issue as there are no errors being reported on the other drives. Does FreeNAS have a way to test the RAM? Normally I'd put an Ubuntu install disc in the optical drive and run the memory checker that is included, but this system does not have an optical drive.

Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I agree that I don't "believe" it is a RAM issue either, but with how easy it is to do the test, it's something that you can't go wrong with.

One of the most frustrating things I have ever had to do is troubleshoot a computer with bad RAM. You'll get very oddball errors and messages that don't make any sense. You'll rack your brain for days trying to figure out what is going on until you make the silly choice to run a RAM test. Each time you think you've "narrowed down" the problem something will point you in a different direction.

www.memtest.org can make a bootable USB stick to test RAM. 3 full passes (typically leave it on overnight) will generally prove your RAM is good.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
You need to determine what's causing the problem. Port, Cable, or Drive.

As a first step, swap just one of them. For example, exchange drive A with drive B, but keep the ports and cables where they are. So you'd have port A and cable A with drive B, and port B and cable B with drive A. If the checksum errors continue to be reported on the same drive, you know it's the drive.

If not, do similar tests with the other components. Switch just the cable, or just the port.

FreeNAS doesn't care which port the drive(s) are connected to. As long as it sees the drive(s) on any port, it'll 'do the right thing'.

As cyberjock says, memtest is pretty much the standard in proving memory. Maybe the Ubuntu memory checker is memtest. I wouldn't use anything but. All you need is a spare USB flash drive to boot memtest. Good peace of mind to know the RAM passes memtest overnight.
 