Advice on faulty drive/resilvering

David Buchanan · Aug 16, 2016

Hi All,

I'm looking for some advice on resolving an issue with what I believe is a faulty drive (maybe two), as I'm still learning FreeNAS.

A little info about my system:

CPU: Intel Xeon W3530
MEMORY: 24GB Samsung ECC Ram
HDD: 5 x 2TB WD in RAIDZ1
MOBO: Asus P6T Deluxe
RUNNING: FreeNAS-9.10-STABLE-201606270534 (dd17351)

This morning I got an email from my NAS:
The volume ZFS1 (ZFS) state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

Running zpool status -v I get the following:

  pool: ZFS1
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Aug 17 12:47:49 2016
        73.1M scanned out of 6.81T at 18.6K/s, (scan is slow, no estimated time)
        6.61M resilvered, 0.00% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        ZFS1                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/bbb6c3ef-fd39-11e3-b935-001f29554225  ONLINE       0     0     0  (resilvering)
            gptid/bc263c59-fd39-11e3-b935-001f29554225  ONLINE       0     0     0
            gptid/bccdac9d-fd39-11e3-b935-001f29554225  ONLINE       0     0     0
            gptid/bd5b3a21-fd39-11e3-b935-001f29554225  ONLINE       0     0     0
            gptid/be37c79c-fd39-11e3-b935-001f29554225  ONLINE       0     0     0  (resilvering)

errors: Permanent errors have been detected in the following files:

        ZFS1/Backups:<0x2f>

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h11m with 0 errors on Sun Aug 7 03:56:03 2016
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

As you can see, two of my drives are showing as resilvering.

Does this mean only one has issues and another drive is being used to resolve it, or do I have two drives with problems?

Also, the resilver speed is ridiculously slow, 18.6K/s. Again this makes me think there is a faulty drive.

The console on the server shows the following message over and over again:

(ada0:ahcich0:0:0) : CAM status: ATA Status Error
(ada0:ahcich0:0:0) : ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada0:ahcich0:0:0) : RES: 41 40 c0 40 fb 19 00 00 00 00
(ada0:ahcich0:0:0) : Retrying command
(ada0:ahcich0:0:0) : READ_FPDMA_QUEUED. ACB: 60 08 c0............... massive string of digits.

Am I correct in thinking the drive ada0 is screwed and needs replacing?

Can I offline ada0 and replace it, or does the fact that I have two drives showing "resilvering" mean I need to take a different course of action? (I have a spare drive here ready to go into the system.)

I'd appreciate any help you can give me. :)

Thanks
David

CraigD · Aug 16, 2016

If they are resilvering, I would wait and not do a thing until at least one drive has finished resilvering, and then change the SATA data cable on ada0.

I would not stress the drives at all: no writes, and only limited reads. Do one thing wrong now and you may lose your data.

If you have a backup, the decision is easier: buy two new drives and test them, replace the faulty drives, then check the data cables, destroy the pool and recover from your backup, then look at the drives you pulled.

I am confused as to how a RAIDZ1 vdev is able to resilver two drives at the same time.

Have Fun

David Buchanan · Aug 16, 2016

OK, so there is definitely something wrong with ada0.

smartctl -a /dev/ada0 outputs:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF)
Device Model: WDC WD20EARS-00MVWB0
Serial Number: WD-WCAZA1436668
LU WWN Device Id: 5 0014ee 2afb5c646
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Wed Aug 17 15:33:44 2016 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (37200) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 359) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   198   051    Pre-fail  Always       -       31775
  3 Spin_Up_Time            0x0027   169   167   021    Pre-fail  Always       -       6516
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       297
  5 Reallocated_Sector_Ct   0x0033   153   153   140    Pre-fail  Always       -       904
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       39977
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       296
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       275
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       447815
194 Temperature_Celsius     0x0022   116   089   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       268
197 Current_Pending_Sector  0x0032   001   001   000    Old_age   Always       -       65504
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       185
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   184   179   000    Old_age   Offline      -       4515

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%            25343  3827330289
# 2  Extended offline    Completed: read failure        90%            23495  3827330290

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

If I'm reading that correctly, that's a lot of bad sectors on ada0. I've checked all the other drives, including ada4, which is the second drive resilvering, and they all have 0 bad sectors reported by SMART.

My issue here is, as you say Craig, how can I have two drives resilvering on a RAIDZ1?
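
For anyone wanting to run the same check across several drives, here is a minimal Python sketch that picks the sector-health rows out of a smartctl attribute table like the one above. It assumes the ten whitespace-separated columns shown in this output; the function name and the choice of watched IDs (5, 197, 198) are illustrative assumptions, not part of FreeNAS or smartmontools.

```python
# Attribute IDs that count bad or suspect sectors (a choice made for this sketch).
WATCHED_IDS = {5, 197, 198}


def bad_sector_attributes(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            found[fields[1]] = int(raw)
    return found


# Rows copied from the ada0 output in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 153 153 140 Pre-fail Always - 904
9 Power_On_Hours 0x0032 046 046 000 Old_age Always - 39977
197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 65504
198 Offline_Uncorrectable 0x0030 200 199 000 Old_age Offline - 185
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

if __name__ == "__main__":
    for name, raw in bad_sector_attributes(SAMPLE).items():
        print(f"{name}: {raw}")
```

Fed the text from `smartctl -a` for each disk in turn, an empty result would correspond to the "0 bad sectors" reported for the other drives.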

With the speed the resilver is progressing at (now 11.8K/s) I don't see how the 6.81TB will ever complete.
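
To put a number on that, here is a rough constant-rate extrapolation in Python. It is only a sketch: it assumes the K and T suffixes printed by zpool are powers of 1024 and that the rate never changes, and the helper names are made up for illustration.

```python
# Binary unit multipliers, assuming zpool's K/M/G/T suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(size):
    """Convert a string such as '6.81T' or '11.8K' to a byte count."""
    return float(size[:-1]) * UNITS[size[-1]]


def resilver_days(total, rate_per_second):
    """Days needed to scan `total` at a constant `rate_per_second`."""
    seconds = to_bytes(total) / to_bytes(rate_per_second)
    return seconds / 86400


if __name__ == "__main__":
    for rate in ("18.6K", "11.8K"):
        years = resilver_days("6.81T", rate) / 365
        print(f"{rate}/s -> about {years:.1f} years")
```

At either of the reported rates the answer comes out at well over a decade, which is why zpool status declines to print an estimate. The rate early in a resilver is not necessarily representative of the whole run, so treat the figure as an upper bound on how bad things look rather than a prediction.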

I have a complete backup of all the important data on this NAS, so I'm not in any panic about resolving this. This system is also a home NAS, it has very little I/O going onto it. I'm more interested in fixing it for the learning experience.

Thanks,
David.

Robert Trevellyan · Aug 17, 2016

David Buchanan said:
With the speed the resilver is progressing at (now 11.8K/s) I don't see how the 6.81TB will ever complete.

Is ada0 one of the drives that's marked as "resilvering"?

Ideally you'd replace that drive immediately based on the smartctl output, but with your current pool status and RAIDZ1, this might be a bad move.

David Buchanan · Aug 18, 2016

Ada0 is one of the drives resilvering, yes.


Robert Trevellyan · Aug 18, 2016

OK, then the poor performance is not surprising. You're in a tough spot.

Since you have a full backup, it might be worth risking a shutdown, to double-check all drive power and data cables.

If you can get to a zpool status that only shows one drive resilvering, then you can follow the directions in the manual for replacing a failing drive. If not, I would suspect a corrupted pool.
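
A minimal Python sketch of that check, assuming the config section of zpool status marks each affected device with a trailing "(resilvering)" tag as in the output earlier in this thread. The function name is hypothetical and the sample is trimmed from that output.

```python
def resilvering_devices(zpool_status_text):
    """Return the device names whose config line is tagged '(resilvering)'."""
    devices = []
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "(resilvering)":
            devices.append(fields[0])
    return devices


# Config rows taken from the first zpool status output in this thread.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
ZFS1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/bbb6c3ef-fd39-11e3-b935-001f29554225 ONLINE 0 0 0 (resilvering)
gptid/bc263c59-fd39-11e3-b935-001f29554225 ONLINE 0 0 0
gptid/be37c79c-fd39-11e3-b935-001f29554225 ONLINE 0 0 0 (resilvering)
"""

if __name__ == "__main__":
    busy = resilvering_devices(SAMPLE)
    print(f"{len(busy)} device(s) resilvering")
    for name in busy:
        print(" ", name)
```

With the sample above it reports two devices; the point at which it reports only one is the point at which the manual's drive replacement procedure applies.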

David Buchanan · Aug 18, 2016

The zpool is still working, I can access the CIFS shares. It's just very slow. Not surprising given the status of ada0.

I'll check all the cables tonight as you suggest and see how it goes. If that doesn't help, I'll leave it for a while to see if it gets to a status of just one drive resilvering. If not, I may just have to risk trying to replace ada0 even tho ada4 is also in a resilvering state.

I'm still confused as to how a RAIDZ1 system can end up with two drives resilvering tho....

Thanks,
David.

Robert Trevellyan · Aug 18, 2016

David Buchanan said:
I'm still confused as to how a RAIDZ1 system can end up with two drives resilvering

Same here, hence the concern about possible pool corruption.

SweetAndLow · Aug 18, 2016

It's still going because only some of the data can't be rebuilt. Any data that was on those 2 drives is lost, but data on the other drives is still there, so ZFS can return it to the user.

David Buchanan · Aug 21, 2016

So, 3 days later the re-silver completed. I've begun the replacement of ada0.

[root@nas1] ~# zpool status -v
  pool: ZFS1
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Aug 21 21:07:20 2016
        6.45M scanned out of 6.76T at 31.1K/s, (scan is slow, no estimated time)
        1.22M resilvered, 0.00% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        ZFS1                                              ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            replacing-0                                   ONLINE       0     0     0
              gptid/bbb6c3ef-fd39-11e3-b935-001f29554225  ONLINE       0     0     0
              gptid/bdc576aa-678e-11e6-bf2a-001f29554225  ONLINE       0     0     0  (resilvering)
            gptid/bc263c59-fd39-11e3-b935-001f29554225    ONLINE       0     0     0
            gptid/bccdac9d-fd39-11e3-b935-001f29554225    ONLINE       0     0     0
            gptid/bd5b3a21-fd39-11e3-b935-001f29554225    ONLINE       0     0     0
            gptid/be37c79c-fd39-11e3-b935-001f29554225    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h11m with 0 errors on Thu Aug 18 03:56:21 2016
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors

The re-silver to replace ada0 is again slow. But my assumption here is that because ada0 is in such bad shape it's causing the slow re-silver speed.

Am I correct in my thinking?

Would I be better off shutting down the NAS even tho it's re-silvering at the moment and removing ada0 altogether and then letting it re-silver without ada0 being online?

Thanks,
David.

SweetAndLow · Aug 21, 2016

You can let the resilver try to finish, but you do realize your pool is trashed and you will have to rebuild it?

Robert Trevellyan · Aug 21, 2016

David Buchanan said:
removing ada0

Which directions for drive replacement are you following?

joeschmuck · Aug 21, 2016

Just out of curiosity, why did you ignore the drive failures back when it reported them at 25343 hours, when it's now at 39977 hours? Also, you don't appear to do any SMART testing. Look at that value for ID 197, holy cow! This whole problem could have been avoided.

I also think your data is corrupt and you would be better off just destroying your pool if you have a backup of your data and then restoring it.

Stux · Aug 21, 2016

Robert Trevellyan said:
Same here, hence the concern about possible pool corruption.

Data is written in blocks across all disks. Say 5 disks. You need to be able to read 4 of those blocks to reconstruct the whole block of data.

Since the drives aren't totally dead, as long as each unreadable data block is readable on the other drive... You should be fine, assuming the resilver finishes.

Let it run. Whichever drive finishes first, replace the other.

While you're waiting, burn in the replacement drive, and refresh your backup.

If you were using raidz2 you'd be fine.

The big problem with raidz1 is that if you have a disk failure, you have 100% loss of the blocks on that disk; if you then lose any other single block on any other disk, you get data loss.
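
The "read 4 of 5" rule can be illustrated with a toy single-parity model in Python. This is a simplified sketch that assumes a fixed five-block stripe with plain XOR parity; real RAIDZ1 uses variable-width stripes and block checksums, and the function names here are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe):
    """Fill in the single missing block (None) of a stripe, or raise if more are missing."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot rebuild more than one missing block")
    rebuilt = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks (data and parity) yields the missing one.
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four data blocks
    stripe = data + [xor_blocks(data)]             # fifth block holds the parity
    damaged = [None] + stripe[1:]                  # one disk's block is unreadable
    print(reconstruct(damaged)[0])                 # prints b'AAAA'
```

Setting two entries of the same stripe to None makes `reconstruct` raise, which is the data-loss case described above. Two failing disks are survivable in this model only as long as their unreadable blocks never land in the same stripe.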

Luckily you didn't have a total loss. Yet.

David Buchanan · Aug 21, 2016

Robert Trevellyan said:
Which directions for drive replacement are you following?

I'm following Section 8.1.10 of the FreeNAS guide: https://doc.freenas.org/9.3/freenas_storage.html

joeschmuck said:
Just out of curiosity, why did you ignore the drive failures back when it reported them at 25343 hours, when it's now at 39977 hours? Also, you don't appear to do any SMART testing. Look at that value for ID 197, holy cow! This whole problem could have been avoided.

I don't have much of an answer. I do have a SMART test scheduled, but I suspect I've configured something wrong. I assume those hours are coming out of the drive's SMART info? The NAS is nowhere near that old, but the drives were used in a previous system.

Stux said:
Luckily you didn't have a total loss. Yet.

While I freely admit I'm no FreeNAS expert, I sure as hell would never have my data in just one place. As mentioned in the first post this is more about the learning experience and also seeing how FreeNAS handles situations like this. My data is safe, I'm not concerned at all. I'll be rebuilding the NAS in Z2 once I've had my fill of this.

Thanks for the input so far everyone! :)

Thanks,
David.

rs225 · Aug 22, 2016

Since you aren't worried about the data in the pool, I would try running iostat -t da -x 1 on the command line, and see if two drives are actually writing. If only one is, I would attempt the swap of the failing/stalling drive. The other 'resilver' may just be some kind of glitch caused by timeouts of the first drive, but without any problem with the data.
