Resilvering won't finish, restarts every few minutes

natobyte · Apr 29, 2016

I think I broke something while replacing a failed HDD. I took plenty of notes, though, in the hope that one of you kind and experienced FreeNAS users can help me.

Day 1 - Drive Failures
One drive in my RAIDZ2 volume has too many errors:
I check zpool status, which reports a degraded state and the offline drive (ada4).
I remove the drive and mail it back to Seagate.
When I reboot the server, another disk is reported missing:

Code:

~> zpool status
  pool: deepblue
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
    the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h32m with 0 errors on Sun Apr 10 00:32:57 2016
config:
    NAME                                            STATE     READ WRITE CKSUM
    deepblue                                        DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/15425d13-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/15db69e4-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/165fbb0a-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        9665275614133904940                         UNAVAIL      0     0     0  was /dev/gptid/16e24a17-181c-11e5-8529-00259047751d
        gptid/17661f3c-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        9790684121614842782                         UNAVAIL      0     0     0  was /dev/gptid/17e6fc25-181c-11e5-8529-00259047751d
errors: No known data errors
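
For readers following along: RAIDZ2 tolerates at most two missing members per vdev, so the listing above appears to leave no redundancy at all. A minimal Python sketch of that count, assuming the config rows are pasted in as text (the function name `remaining_redundancy` is made up for illustration and is not part of any ZFS tool):

```python
import re

def remaining_redundancy(config_rows):
    """Return (parity, missing, parity - missing) for one raidz vdev listing."""
    parity = None
    missing = 0
    for line in config_rows.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        match = re.fullmatch(r"raidz(\d)-\d+", name)
        if match:
            # The digit in "raidz2-0" is the number of tolerated failures
            parity = int(match.group(1))
        elif parity is not None and state != "ONLINE":
            # Leaf rows follow the vdev row; anything not ONLINE counts as missing
            missing += 1
    if parity is None:
        raise ValueError("no raidz vdev row found")
    return parity, missing, parity - missing

# Rows copied from the zpool status output in this post
SAMPLE = """\
      raidz2-0                                      DEGRADED     0     0     0
        gptid/15425d13-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/15db69e4-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/165fbb0a-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        9665275614133904940                         UNAVAIL      0     0     0  was /dev/gptid/16e24a17-181c-11e5-8529-00259047751d
        gptid/17661f3c-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        9790684121614842782                         UNAVAIL      0     0     0  was /dev/gptid/17e6fc25-181c-11e5-8529-00259047751d
"""

parity, missing, left = remaining_redundancy(SAMPLE)
print(f"raidz{parity}: {missing} members missing, {left} more failures tolerated")
```

With the rows above this reports two missing members and zero failures left to spare.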

Now I'm worried, so I power off the machine until I can get a replacement drive.

Day 2 - Install new HDD
I installed the WD Red and waited until the evening, when I was ready to add it to the volume.
I check the manual and *DOH* I didn't properly remove the old drive first, so now there is no GUI option to "replace" the original missing drive, as it's not listed in the table.
The Volume Status table lists 2 drives as UNAVAIL: the WD Red and the second failing drive (ada2).

At this point I plan to use the new WD Red to replace the second failing drive (ada2) instead of the original failing drive.

OK, here's where I may have broken it:
I shut down the machine and unplugged the new drive, then restarted, but the server wouldn't connect.
Plugged in a monitor and it says:
Port 2: ST3000DM001-1ER166
S.M.A.R.T Status Bad, Backup and Replace.
Press F1 to Resume...
OK then, I resume and log back into the GUI.
Now the Volume status seems more normal:

It shows only 1 UNAVAIL drive, the second failing drive. But I did not notice that it also says "Resilver Status: Completed". I don't know if that is significant.

Anyway, I installed the WD Red (without rebooting this time) and the resilvering started.
http://postimg.org/image/51sxvi1b3/

Now the dmesg is scrolling error messages like:
(ada2:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
Retrying command
READ_FPDMA_QUEUED

So I check the zpool status

Code:

~/scripts> zpool status -x
  pool: deepblue
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 00:36:49 2016
        10.3G scanned out of 721G at 51.8M/s, 3h53m to go
        1.69G resilvered, 1.43% done
config:
    NAME                                            STATE     READ WRITE CKSUM
    deepblue                                        ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/15425d13-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/15db69e4-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/165fbb0a-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/16e24a17-181c-11e5-8529-00259047751d  ONLINE     659 5.79K     2  (resilvering)
        gptid/17661f3c-181c-11e5-8529-00259047751d  ONLINE       0     0     0
        gptid/125aaef4-0ddd-11e6-a9d9-00259047751d  ONLINE       0     0     0  (resilvering)
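
Note that every row above says ONLINE, so the STATE column alone hides the problem; the READ/WRITE/CKSUM counters are what single out one member. A rough Python sketch for picking out such rows from pasted output, assuming the usual NAME/STATE/READ/WRITE/CKSUM column order (the helper `devices_with_errors` is illustrative only, and the counters are kept as strings because values like 5.79K are abbreviated by zpool):

```python
def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "errors:":
            break  # end of the config table
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # rows after this header are pool/vdev/device rows
            continue
        if not in_table or len(fields) < 5:
            continue
        counters = tuple(fields[2:5])
        if any(c != "0" for c in counters):
            flagged[fields[0]] = counters
    return flagged

# A few rows copied from the output above
SAMPLE = """\
config:
    NAME                                            STATE     READ WRITE CKSUM
    deepblue                                        ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/16e24a17-181c-11e5-8529-00259047751d  ONLINE     659 5.79K     2  (resilvering)
        gptid/125aaef4-0ddd-11e6-a9d9-00259047751d  ONLINE       0     0     0  (resilvering)
"""

for device, (read, write, cksum) in devices_with_errors(SAMPLE).items():
    print(device, "read:", read, "write:", write, "cksum:", cksum)
```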

Day 3 - What do I do?
Every time I check the zpool status the percent complete is a different amount

Code:

scan: resilver in progress since Fri Apr 29 07:12:32 2016
        7.50G scanned out of 721G at 175M/s, 1h9m to go
        2.44G resilvered, 1.04% done

Then again 5 minutes later:

Code:

scan: resilver in progress since Fri Apr 29 07:16:59 2016
        207M scanned out of 721G at 25.8M/s, 7h56m to go
        54.3M resilvered, 0.03% done
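
The changing "in progress since" timestamp is the telling detail in the two snippets above: the resilver is not merely slow, it is starting over. A small Python sketch that compares the two scan lines, assuming the date format shown in this thread and an English/C locale for the day and month names (`resilver_start` is a hypothetical helper name):

```python
import re
from datetime import datetime

SINCE_RE = re.compile(r"resilver in progress since (.+)$")

def resilver_start(scan_line):
    """Parse the start time from a 'scan: resilver in progress since ...' line."""
    match = SINCE_RE.search(scan_line.strip())
    if match is None:
        return None
    # Format as printed in the posts above, e.g. 'Fri Apr 29 07:12:32 2016'
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")

first = resilver_start("scan: resilver in progress since Fri Apr 29 07:12:32 2016")
second = resilver_start("scan: resilver in progress since Fri Apr 29 07:16:59 2016")

if first != second:
    # A different start time between two checks means a new resilver run began
    print("resilver restarted", second - first, "after the previous start")
```

For the two samples above the start times differ by 4 minutes 27 seconds, which matches the "restarts every few minutes" symptom.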

I ran "systat -vm" to verify that ada2 is being utilized at max capacity.

Here are the SMART attributes on the drive getting replaced (ada2):

Code:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       35555616
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   060   055   030    Pre-fail  Always       -       55848810753
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3773
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       29
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   001   000    Old_age   Always       -       125 125 25514
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   063   052   045    Old_age   Always       -       37 (Min/Max 35/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       61
194 Temperature_Celsius     0x0022   037   048   000    Old_age   Always       -       37 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   131   000    Old_age   Always       -       107515
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       3771h+48m+38.255s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       13924507252
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       36418178742

And the new WD Red HDD (ada5):

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       15
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   117   116   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

rs225 · Apr 29, 2016

You should probably leave ada5 alone.

Check ada2 for a loose or bad cable. If it still has problems, remove ada2 and then let ada5 finish resilvering. Then replace ada2 with a new drive and see if the new drive can complete a resilver.

Robert Trevellyan · Apr 29, 2016

natobyte said:
Port 2: ST3000DM001-1ER166
S.M.A.R.T Status Bad, Backup and Replace.
Press F1 to Resume...

This is from the BIOS?

natobyte said:
CAM status: Uncorrectable parity/CRC error

Agree with the possibility of a loose or bad cable.
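
The SMART table posted for ada2 seems to point the same way: reallocated and pending sector counts are 0, while UDMA_CRC_Error_Count has a raw value of 107515. That attribute counts transfer errors on the link between drive and controller, so it is commonly read as a cabling or port symptom rather than a media one. A minimal Python sketch for pulling the raw values out of pasted smartctl rows (`smart_raw_values` is an illustrative name, and the 10-column layout is assumed from the tables in this thread):

```python
def smart_raw_values(table_text):
    """Map ATTRIBUTE_NAME -> RAW_VALUE string from smartctl-style attribute rows."""
    values = {}
    for line in table_text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) == 10 and fields[0].isdigit():
            values[fields[1]] = fields[9].strip()
    return values

# Rows copied from the ada2 and ada5 tables earlier in the thread
ADA2 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   131   000    Old_age   Always       -       107515
"""
ADA5 = """\
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

for disk, table in (("ada2", ADA2), ("ada5", ADA5)):
    crc = int(smart_raw_values(table)["UDMA_CRC_Error_Count"])
    print(disk, "UDMA_CRC_Error_Count raw =", crc)
```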

natobyte said:
ST3000DM001

Unfortunately, this model is notorious for early failures. Do you have SMART tests and checks running, with email notifications working correctly?

natobyte said:
Every time I check the zpool status the percent complete is a different amount

This may be a symptom of a corrupted pool, which is possible, since at one point you had no redundancy.

If you can access the pool, consider backing up the data now. It looks like it could all fit on a 1TB drive (a mirrored pair would be better).

DrKK · Apr 30, 2016

Robert Trevellyan said:
Unfortunately, this model is notorious for early failures. Do you have SMART tests and checks running, with email notifications working correctly?

My colleague, Mr. Trevellyan, is understating the case. The ST3000DM001 is notorious for being the biggest piece of shit in the history of hard drives in the past 10 years, is what he meant to say. From an important online cloud provider:

Beginning in January 2012, Backblaze deployed 4,829 Seagate 3TB hard drives, model ST3000DM001, into Backblaze Storage Pods. In our experience, 80% of the hard drives we deploy will function at least 4 years. As of March 31, 2015, just 10% of the Seagate 3TB drives deployed in 2012 are still in service. This is the story of the 4,345 Seagate 3TB drives that are no longer in service.
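
As a quick sanity check on the quoted figures, assuming the 4,345 out-of-service drives are counted against the same 4,829 deployed (the quote does not say so explicitly), the arithmetic in Python:

```python
deployed = 4829        # drives deployed, from the quote
out_of_service = 4345  # drives no longer in service, from the quote

in_service = deployed - out_of_service
share = in_service / deployed
print(f"{in_service} still in service ({share:.1%})")
```

That works out to 484 drives, or about 10%, consistent with the "just 10%" in the quote.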

And this experience has been echoed by many, many users of FreeNAS who have been similarly disappointed by the lack of reliability and longevity of this drive model.

I think I speak for everyone when I suggest that you, as soon as possible, replace all of your Seagate desktop 3TB models.

natobyte · May 2, 2016

Thank you all so much for helping me out. I will take your advice as soon as my new HDD arrives.
@DrKK Thanks for the Backblaze article, it is an excellent read. I will add it to my signature.
@Robert Trevellyan Yep, the message was reported by the BIOS (to the best of my knowledge; I could not ping the box until I hit resume). Also, I do receive SMART email alerts. That is how I was first notified of the failure.

DrKK · May 2, 2016

It is our pleasure to assist you.
