Recovering a degraded RAIDZ2 on FreeNAS 11.2-U3

Burlumpu Bumpu

Dabbler
Joined
Jun 23, 2018
Messages
20
Dear community

It's still a learning process for me, as I shall demonstrate now:

I am running a RAIDZ2 on a Dell R510 with a crossflashed H200. Today my RAIDZ2 pool was marked degraded; there were some write errors on one disk.

My plan was (and is) to get a replacement for this disk and resilver the pool, as I have done before. Now comes the beginner part :) To make sure there was nothing wrong with the backplane or the cabling, I shut down the server and took a look at the failing hard drive, the backplane, and the server in general.

I then restarted, and now there's a problem: FreeNAS keeps trying to resilver my pool, and all the disks (including the faulty one) are now online. The resilver works its way to 1% and then starts again from 0%.

What should I do now in order to replace the failed disk? Can I replace a disk while the pool is resilvering?
There are no SMART errors in any of the disks.

Code:
[root@freenas ~]# zpool status
  pool: RAIDZ2
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed May  1 00:48:46 2019
        78.7G scanned at 167M/s, 50.9G issued at 108M/s, 16.4T total
        5.96G resilvered, 0.30% done, 1 days 20:11:54 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAIDZ2                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/fbbfea97-78cb-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/ff414dca-78cb-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/02bb0a27-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/0645a295-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06  ONLINE       0    3667
            gptid/0d636c3e-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/10eaa592-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/14762089-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
        cache
          gptid/73396a40-b812-11e8-8f7e-842b2b4f1e06    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:59:13 with 0 errors on Wed Apr 24 04:44:13 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da9p2   ONLINE       0     0     0
            da8p2   ONLINE       0     0     0

errors: No known data errors


da7 is the failing drive:

Code:
[root@freenas ~]# smartctl -a /dev/da7
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST4000NM0043  C1
Revision:             EC5C
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50083c28b3f
Serial number:        Z1Z8TF6D0000R546K5WD
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed May  1 00:57:44 2019 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        65 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 23000.42
  number of minutes until next internal SMART test = 4

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3928830803        0         0  3928830803          0     600550.787      0
write:         0        0         0         0          0     224974.613  0
verify: 988837325        0         0  988837325          0       6966.842    0

Non-medium error count:      161

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   23000                 - [-   -    -]
# 2  Background short  Completed                   -   22975                 - [-   -    -]
# 3  Background short  Completed                   -   22951                 - [-   -    -]
# 4  Background short  Completed                   -   22927                 - [-   -    -]
# 5  Background short  Completed                   -   22903                 - [-   -    -]
# 6  Background short  Completed                   -   22879                 - [-   -    -]
# 7  Background short  Completed                   -   22855                 - [-   -    -]
# 8  Background short  Completed                   -   22831                 - [-   -    -]
# 9  Background short  Completed                   -   22807                 - [-   -    -]
#10  Background short  Completed                   -     147                 - [-   -    -]
#11  Background short  Completed                   -     113                 - [-   -    -]
#12  Background long   Completed                   -      56                 - [-   -    -]
#13  Background short  Completed                   -      50                 - [-   -    -]
#14  Background short  Aborted (by user command)   -      37                 - [-   -    -]

Long (extended) Self Test duration: 32700 seconds [545.0 minutes]


Thanks and kind regards

Alex
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Burlumpu Bumpu said: "da7 is the failing drive:"
How did you determine this?
Burlumpu Bumpu said: "gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06 ONLINE 0 3667"
I would confirm that this is the problem drive and remove it so that it stops accumulating more errors. RAIDz2 gives you two disks of fault protection, so the pool should still be reasonably safe with one disk removed, and it should stop trying to resilver. Get a new drive in there as soon as you can and start a rebuild onto it, which should correct the errors from the redundant data on the other disks.
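One way to double-check which pool member is accumulating errors is to scan saved `zpool status` output for rows whose READ, WRITE, or CKSUM counters are non-zero. The following is only a rough sketch: the function name `devices_with_errors` is made up for illustration, the sample rows are copied from the `zpool status` listing posted later in this thread, and abbreviated counters (for example values with a `K` suffix) are simply skipped.

```python
# Hypothetical helper: flag pool members with non-zero error counters
# in the text of a `zpool status` listing.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        # A config row looks like: NAME STATE READ WRITE CKSUM [note...]
        if len(fields) < 5 or fields[1] not in STATES:
            continue
        try:
            read, write, cksum = (int(f) for f in fields[2:5])
        except ValueError:
            # Abbreviated counters are not handled in this sketch.
            continue
        if read or write or cksum:
            flagged[fields[0]] = (read, write, cksum)
    return flagged


# Rows copied from the later status output in this thread.
SAMPLE = """\
    replacing-4                                   FAULTED      0     0     0
      gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06  FAULTED      0    54    67  too many errors
      gptid/1625246f-6ccb-11e9-a16e-842b2b4f1e06  ONLINE       0     0     0
"""

for name, counters in devices_with_errors(SAMPLE).items():
    print(name, counters)
```

The result only names the gptid label; it still has to be matched to a physical disk (for instance with `glabel status`) before anything is pulled from the chassis.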
 

Burlumpu Bumpu

Thanks for your reply!
Chris Moore said: "How did you determine this?"
Before restarting the system, the pool was marked as DEGRADED, and glabel status identified the failing drive's ID as da7. So you would proceed "normally", which would mean setting the drive offline, shutting down the system once the replacement disk is here, and then replacing the disk in the GUI?
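The lookup from a gptid label to a disk name can be sketched as a small parser over saved `glabel status` output, which lists a label, a status, and the underlying component per row. This is an illustration only: `gptid_to_disk` is a hypothetical name, and the sample row is assumed rather than copied from the system in this thread. In particular, the partition suffix `p2` is a guess; only the `da7` part comes from the posts above.

```python
import re


def gptid_to_disk(glabel_text):
    """Map each gptid label in `glabel status` text to its parent disk name."""
    mapping = {}
    for line in glabel_text.splitlines():
        fields = line.split()
        # Data rows have three columns: Name, Status, Components.
        if len(fields) != 3 or not fields[0].startswith("gptid/"):
            continue
        label, _status, component = fields
        # Drop a trailing partition suffix such as "p2" to get the disk name.
        mapping[label] = re.sub(r"p\d+$", "", component)
    return mapping


# Assumed sample row; the "p2" partition number is hypothetical.
SAMPLE = """\
                                      Name  Status  Components
gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06     N/A  da7p2
"""

print(gptid_to_disk(SAMPLE))
```

Device names can change between boots, so it is safer to repeat the lookup (and compare the serial number from `smartctl`) right before removing a disk.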
 

Chris Moore

It's a rack chassis with hot-swap bays, so there's no need to turn the power off before, during, or after a drive replacement. That part of the instructions is for people who don't have a real server and need to disassemble the system to get the drive out.
 

Burlumpu Bumpu

Thanks, it works now (as far as I can judge). I am just formatting the disk; tomorrow I'll start the resilver (hopefully).
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626

Burlumpu Bumpu

No, I am sg_formatting the new disk because there were some partitions on it. (This has worked for me in the past with this type of disk in this system.)
 

Burlumpu Bumpu

Code:
root@freenas:~ # zpool status
  pool: RAIDZ2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May  2 13:11:56 2019
    4.68T scanned at 287M/s, 1.90T issued at 116M/s, 16.4T total
    238G resilvered, 11.59% done, 1 days 12:18:14 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    RAIDZ2                                            DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/fbbfea97-78cb-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/ff414dca-78cb-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/02bb0a27-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/0645a295-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        replacing-4                                   FAULTED      0     0     0
          gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06  FAULTED      0    54    67  too many errors
          gptid/1625246f-6ccb-11e9-a16e-842b2b4f1e06  ONLINE       0     0     0
        gptid/0d636c3e-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/10eaa592-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/14762089-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
    cache
      gptid/73396a40-b812-11e8-8f7e-842b2b4f1e06      ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 01:09:24 with 0 errors on Thu May  2 04:54:24 2019
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        da9p2   ONLINE       0     0     0
        da8p2   ONLINE       0     0     0

errors: No known data errors


It's working, thank you for your support!
 