FAILED SMART self-check

kam270 · Dec 15, 2014

I’ve been getting some serious errors and warnings.

I have the data backed up and am considering replacing the faulty hard drive. Do you guys think I need to?

Are there more things I can do to repair it, or must I do a zpool replace?

Dec 15 15:20:25 nakham smartd[2405]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Dec 15 15:20:26 nakham smartd[2405]: Device: /dev/ada1, 322 Currently unreadable (pending) sectors
Dec 15 15:20:26 nakham smartd[2405]: Device: /dev/ada1, Failed SMART usage Attribute: 1 Raw_Read_Error_Rate.

Checking status of zfs pools:
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
raid_zfs 928G 699G 229G 75% 1.00x ONLINE /mnt

pool: raid_zfs
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 6.49M in 5h11m with 0 errors on Sun Dec 7 05:11:11 2014
config:

NAME STATE READ WRITE CKSUM
raid_zfs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397 ONLINE 0 0 0
ada1 ONLINE 0 0 73

errors: No known data errors

-- End of daily output --

# zpool status -x
pool: raid_zfs
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 6.49M in 5h11m with 0 errors on Sun Dec 7 05:11:11 2014
config:

NAME STATE READ WRITE CKSUM
raid_zfs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397 ONLINE 0 0 0
ada1 ONLINE 0 0 73

errors: No known data errors

danb35 · Dec 15, 2014

Post (in code tags) the output of 'smartctl -a /dev/ada1'. But pending that, yes, it looks like the disk is failing and needs to be replaced. Follow the manual's instructions to the letter to replace it.

kam270 · Dec 15, 2014

[root@nakham] ~# smartctl -a /dev/ada1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F3
Device Model: SAMSUNG HD103SJ
Serial Number: S246J90B919163
LU WWN Device Id: 5 0024e9 206103421
Firmware Version: 1AJ10001
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Tue Dec 16 08:15:43 2014 ICT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 137) The previous self-test completed having
a test element that failed and the
device is suspected of having handling
damage.
Total time to complete Offline
data collection: ( 9420) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 157) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 34361
2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
3 Spin_Up_Time 0x0023 071 071 025 Pre-fail Always - 8836
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 170
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 20964
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 298
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 1
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 052 034 000 Old_age Always - 48 (Min/Max 28/66)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 097 097 000 Old_age Always - 325
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 123
223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 298

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: handling damage?? 90% 20820 0
# 2 Short offline Completed: handling damage?? 90% 20806 0
# 3 Short offline Completed: handling damage?? 90% 20806 0
# 4 Extended offline Completed: handling damage?? 90% 20800 0

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed_handling_damage?? [90% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

danb35 · Dec 15, 2014

Yep, it's dying. It's also too hot--you want the temp under 40 C; it's 48 now, and it's seen 66. Time to replace, and work on cooling the replacement and remaining disks.
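
The diagnosis above rests on three readings in the attribute table: the FAILING_NOW flag on Raw_Read_Error_Rate, the non-zero Current_Pending_Sector raw value, and the temperature. A minimal sketch of pulling those out of `smartctl -a` text follows. It assumes the row layout shown in this thread, the sample rows are copied from the output above, and the function names and the 40 C limit (taken from the advice above) are illustrative rather than part of smartmontools.

```python
import re

# Rows copied from the smartctl -a output posted earlier in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 34361
  5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
194 Temperature_Celsius 0x0002 052 034 000 Old_age Always - 48 (Min/Max 28/66)
197 Current_Pending_Sector 0x0032 097 097 000 Old_age Always - 325
"""

# ID, name, flag, value, worst, threshold, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: {...}} for every row that matches ROW."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group("name")] = {
                "value": int(m.group("value")),
                "thresh": int(m.group("thresh")),
                "when_failed": m.group("when_failed"),
                "raw": int(m.group("raw")),
            }
    return attrs


def find_problems(attrs, max_temp_c=40):
    """List human-readable warnings; max_temp_c is an assumed limit."""
    problems = []
    for name, a in attrs.items():
        if a["when_failed"] != "-":
            problems.append(f"{name} is {a['when_failed']}")
    pending = attrs.get("Current_Pending_Sector")
    if pending and pending["raw"] > 0:
        problems.append(f"{pending['raw']} pending sectors")
    temp = attrs.get("Temperature_Celsius")
    if temp and temp["raw"] > max_temp_c:
        problems.append(f"temperature {temp['raw']} C exceeds {max_temp_c} C")
    return problems


if __name__ == "__main__":
    for problem in find_problems(parse_attributes(SAMPLE)):
        print(problem)
```

This only reads text that smartctl has already produced. It does not replace the drive's own overall-health verdict, which had already reported FAILED here.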

kam270 · Dec 15, 2014

OK, I’m going to replace it... gulp...

I’ve read the manual on how to replace. As a new guy at this place on this new system, I have a couple more questions:

How can I determine if the hardware is AHCI capable? (I don’t really want to shut down.)

How can I tell what RAID I have? RAID 0 would be a pain, as I don’t want to restore data from backups.

Thanks for the advice.

danb35 · Dec 15, 2014

Your zpool status output shows you have a mirror. As to the AHCI capability, the motherboard manual should say, but there's always some risk involved with hot-swapping disks.
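
The "mirror" answer comes from the `mirror-0` row in the config section of zpool status: redundant groups appear as rows named `mirror-N` or `raidzN-N`, and a pool whose disks sit directly under the pool name has no redundancy. A minimal sketch of that check, using the config rows posted earlier in the thread as sample input (the function names are illustrative, and the pattern only covers mirror and raidz groups):

```python
import re

# Config rows copied from the first zpool status output in this thread.
SAMPLE_CONFIG = """\
NAME STATE READ WRITE CKSUM
raid_zfs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397 ONLINE 0 0 0
ada1 ONLINE 0 0 73
"""

# Matches group rows such as mirror-0, raidz1-0, raidz2-1.
VDEV = re.compile(r"^\s*(mirror|raidz[123]?)-\d+\b")


def redundancy_groups(config_text):
    """Return the type of each mirror/raidz group found, in order."""
    groups = []
    for line in config_text.splitlines():
        m = VDEV.match(line)
        if m:
            groups.append(m.group(1))
    return groups


def describe(config_text):
    groups = redundancy_groups(config_text)
    if not groups:
        return "no mirror or raidz groups found; the pool may be a plain stripe"
    return "redundant groups: " + ", ".join(groups)


if __name__ == "__main__":
    print(describe(SAMPLE_CONFIG))
```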

SweetAndLow · Dec 15, 2014

It's highly recommended to shut down before doing the HDD swap. With the machine running, you run the risk of bumping a different drive while trying to swap one out. This goes for any of the other parts as well.

cyberjock · Dec 15, 2014

Keep in mind that even if the hardware supports AHCI, if the FreeBSD driver for the hardware doesn't support hotswap/hotplug then it still won't work. This is another reason why it's recommended that you shut down unless you know for certain it should work. In your case you don't know, so you'd be VERY smart to shut down the server to replace the drive.

kam270 · Dec 16, 2014

OK, I’ll shut it down. Scheduled for tomorrow. Will let you know how it goes.

Thanks

kam270 · Dec 17, 2014

HDD replaced, currently resilvering:

2h3m to go 457G resilvered, 61.86% done

Might need to hit the detach button, as I still see the old HDD listed in the GUI.

kam270 · Dec 17, 2014

It isn’t going very well. I still see the old disk, and the new disk is offline:

zpool status
pool: raid_zfs
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 738G in 6h48m with 2 errors on Wed Dec 17 19:25:28 2014
config:

NAME STATE READ WRITE CKSUM
raid_zfs DEGRADED 2 0 0
mirror-0 DEGRADED 2 0 19
replacing-0 UNAVAIL 0 0 0
7478789364016303368 UNAVAIL 0 0 0 was /dev/gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397
gptid/a5e42f7e-85ae-11e4-a9a2-c89cdcab3397 ONLINE 0 0 0 block size: 512B configured, 4096B native
ada1 ONLINE 2 0 19

errors: 2 data errors, use '-v' for a list

cyberjock · Dec 17, 2014

You also have corruption. Do a "zpool status -v" to see what files are corrupt.

If you look closely, you replaced the old (but good) disk with another disk, and left the bad disk (ada1) in the pool. This is a totally self-inflicted error. At this point you'll need to destroy the pool and restore from backup. Your backup might be the disk you pulled out of the system too....
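
The first zpool status in this thread already showed which mirror member was the bad one: ada1 carried 73 checksum errors while the gptid/b249edeb... member showed none. A minimal sketch of reading the READ/WRITE/CKSUM counters from the config rows before choosing a disk to replace follows. It assumes plain integer counters as in the output above, and the function name is illustrative.

```python
import re

# Config rows copied from the first zpool status output in this thread.
SAMPLE_CONFIG = """\
NAME STATE READ WRITE CKSUM
raid_zfs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397 ONLINE 0 0 0
ada1 ONLINE 0 0 73
"""

# name, state, then three integer counters; the header row does not match.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+)\b"
)


def devices_with_errors(config_text):
    """Return [(name, (read, write, cksum))] for rows with any non-zero counter."""
    hits = []
    for line in config_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        counts = tuple(int(m.group(k)) for k in ("read", "write", "cksum"))
        if any(counts):
            hits.append((m.group("name"), counts))
    return hits


if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE_CONFIG):
        print(f"{name}: read={read} write={write} cksum={cksum}")
```

Pool and group rows (such as raid_zfs and mirror-0) are reported too when their counters are non-zero, so the leaf device rows are the ones to act on.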

kam270 · Dec 17, 2014

[root@nakham] ~# zpool status -v
pool: raid_zfs
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 738G in 6h48m with 2 errors on Wed Dec 17 19:25:28 2014
config:

NAME STATE READ WRITE CKSUM
raid_zfs DEGRADED 2 0 0
mirror-0 DEGRADED 2 0 19
replacing-0 UNAVAIL 0 0 0
7478789364016303368 UNAVAIL 0 0 0 was /dev/gptid/b249edeb-f8bc-11e1-b17f-c89cdcab3397
gptid/a5e42f7e-85ae-11e4-a9a2-c89cdcab3397 ONLINE 0 0 0 block size: 512B configured, 4096B native
ada1 ONLINE 2 0 19

errors: Permanent errors have been detected in the following files:

<0xb0>:<0x121c12>
<0xb0>:<0x1875f0>

kam270 · Dec 17, 2014

I checked that the serial of the HDD matched the disk with the error. I think the problem might have been that when I pressed replace, only ada1 was listed, so I replaced with that?

We have a separate system with the backups on it, so we can get the data back.

What should I do now? Scared now...

kam270 · Dec 17, 2014

Oh sh*t, I did pull the wrong disk. I checked the serials again and I pulled the wrong one.

The serials are exactly the same except for the last digit!
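
Serials that differ only in the last character are easy to misread on a drive label, so an exact, character-by-character comparison is safer than a glance. A minimal sketch of that check is below. FAILING_SERIAL is the serial reported by smartctl for /dev/ada1 earlier in the thread. LOOKALIKE is a made-up value for illustration, since the thread never gives the other disk's serial, and the function names are illustrative too.

```python
FAILING_SERIAL = "S246J90B919163"  # from the smartctl output for /dev/ada1
LOOKALIKE = "S246J90B919164"       # hypothetical; not taken from the thread


def normalize(serial):
    """Strip whitespace and ignore case so typing differences don't matter."""
    return serial.strip().upper()


def is_target_disk(label_serial, target_serial):
    """True only when the serial read off the label matches exactly."""
    return normalize(label_serial) == normalize(target_serial)


def differing_positions(a, b):
    """Indexes where two equal-length serials differ, or None if lengths differ."""
    a, b = normalize(a), normalize(b)
    if len(a) != len(b):
        return None
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    print(is_target_disk(LOOKALIKE, FAILING_SERIAL))
    print(differing_positions(LOOKALIKE, FAILING_SERIAL))
```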

kam270 · Dec 17, 2014

So my question now is: how do I recover from this mess I have created?

How do I destroy the pool?

Can’t I put the good HDD back and replicate from that? Import the ZFS disk?

How could I mount this good HDD from Ubuntu and get the data off it?

cyberjock · Dec 17, 2014

Destroy the pool by following the FreeNAS manual and create a new one.

I won't comment on how to mount ZFS on Ubuntu because I'm pretty sure that the flags that FreeNAS uses aren't compatible with Ubuntu, so Ubuntu wouldn't be able to mount the pool.

kam270 · Dec 17, 2014

cyberjock said:
Destroy the pool by following the FreeNAS manual and create a new one.

I won't comment on how to mount ZFS on Ubuntu because I'm pretty sure that the flags that FreeNAS uses aren't compatible with Ubuntu and therefore can't mount the pool.

I’ve been through the base manual for 9.2.1 and couldn’t find anything specific to destroying pools. Am I in the right manual? What page specifically?

danb35 · Dec 18, 2014

Section 6.3.1 of the 9.2.1 manual describes the View Volumes page. One of the options discussed there is how to detach a volume and mark the disks as new, which is what you want. See also figure 6.3m.
