Recovering a degraded RAIDZ2 FreeNas 11.2 U3

Burlumpu Bumpu · Apr 30, 2019

Dear community

It's still a learning process for me, as I shall demonstrate now:

I am running a RAIDZ2 on a DELL R510 with a crossflashed H200. Today my RAIDZ2 pool was marked degraded; there were some write errors on one disk.

My plan was (and is) to get a replacement for this disk and resilver the pool, as I have done before. To make sure there's nothing wrong with the backplane or the cabling (now comes the beginner part :) ), I shut down the server and took a look at the failing hard drive, the backplane, and the server in general.

I then restarted, and now there's a problem: FreeNAS keeps trying to resilver my pool, and all the disks (including the faulty one) are now online. The resilver works its way to 1% and then starts again from 0%.

What should I do now in order to replace the failed disk? Can I replace a disk while the pool is resilvering? There are no SMART errors on any of the disks.

Code:

[root@freenas ~]# zpool status
  pool: RAIDZ2
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed May  1 00:48:46 2019
        78.7G scanned at 167M/s, 50.9G issued at 108M/s, 16.4T total
        5.96G resilvered, 0.30% done, 1 days 20:11:54 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAIDZ2                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/fbbfea97-78cb-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/ff414dca-78cb-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/02bb0a27-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/0645a295-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06  ONLINE       0    3667
            gptid/0d636c3e-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/10eaa592-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
            gptid/14762089-78cc-11e8-8f71-842b2b4f1e06  ONLINE       0     0     0
        cache
          gptid/73396a40-b812-11e8-8f7e-842b2b4f1e06    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:59:13 with 0 errors on Wed Apr 24 04:44:13 2019
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            da9p2     ONLINE       0     0     0
            da8p2     ONLINE       0     0     0

errors: No known data errors

da7 is the failing drive..:

Code:

[root@freenas ~]# smartctl -a /dev/da7
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST4000NM0043  C1
Revision:             EC5C
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50083c28b3f
Serial number:        Z1Z8TF6D0000R546K5WD
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed May  1 00:57:44 2019 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        65 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 23000.42
  number of minutes until next internal SMART test = 4

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3928830803        0         0  3928830803          0     600550.787           0
write:         0        0         0         0          0     224974.613           0
verify: 988837325        0         0  988837325          0       6966.842           0

Non-medium error count:      161

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   23000                 - [-   -    -]
# 2  Background short  Completed                   -   22975                 - [-   -    -]
# 3  Background short  Completed                   -   22951                 - [-   -    -]
# 4  Background short  Completed                   -   22927                 - [-   -    -]
# 5  Background short  Completed                   -   22903                 - [-   -    -]
# 6  Background short  Completed                   -   22879                 - [-   -    -]
# 7  Background short  Completed                   -   22855                 - [-   -    -]
# 8  Background short  Completed                   -   22831                 - [-   -    -]
# 9  Background short  Completed                   -   22807                 - [-   -    -]
#10  Background short  Completed                   -     147                 - [-   -    -]
#11  Background short  Completed                   -     113                 - [-   -    -]
#12  Background long   Completed                   -      56                 - [-   -    -]
#13  Background short  Completed                   -      50                 - [-   -    -]
#14  Background short  Aborted (by user command)   -      37                 - [-   -    -]

Long (extended) Self Test duration: 32700 seconds [545.0 minutes]

Thanks and kind regards

Alex

Chris Moore · Apr 30, 2019

Burlumpu Bumpu said:
da7 is the failing drive..:

How did you determine this?

Burlumpu Bumpu said:
gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06 ONLINE 0 3667

I would confirm that this is the problem drive and remove it so it stops accumulating more data errors. You have two disks of fault protection (RAIDZ2), so the pool should still be reasonably safe, and it should stop trying to resilver. Get a new drive in there as soon as you can and start a rebuild onto it, which should correct the data errors from the redundant data on the other disks.
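
One way to confirm which member is accumulating errors is to filter the `zpool status` rows for non-zero READ, WRITE, or CKSUM counters. The following is only a minimal sketch: the function name `errored_devices` is made up for illustration, and it assumes each row carries all three counter columns.

```shell
#!/bin/sh
# Print every vdev row from `zpool status` text (read on stdin) whose
# READ, WRITE or CKSUM counter is non-zero. Expected row layout:
#   NAME  STATE  READ  WRITE  CKSUM  [notes]
# errored_devices is a hypothetical helper name, not part of ZFS.
errored_devices() {
    awk '
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            if ($3 + $4 + $5 > 0)
                printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# On a live system (not run here):
#   zpool status RAIDZ2 | errored_devices
```

Rows with a missing counter column are skipped rather than guessed at, so it is still worth reading the full output by eye.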

Burlumpu Bumpu · May 1, 2019

Thanks for your reply!

Chris Moore said:
How did you determine this?

Before restarting the system, the pool was marked as DEGRADED, and glabel status mapped the gptid of the failing drive to da7. So you would proceed "normally", meaning: set the drive offline, shut down the system once the replacement disk is here, and then replace the disk in the GUI?
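
The gptid-to-device lookup described here can be sketched as a small filter over `glabel status` output, whose columns are Name, Status, and Components. The helper name `gptid_to_device` is hypothetical, and the partition suffix in the comment (`da7p2`) is an assumption; the thread only states that the label resolved to da7.

```shell
#!/bin/sh
# Print the provider (for example da7p2) that backs a given gptid label.
# Reads `glabel status` text on stdin; columns: Name  Status  Components.
# gptid_to_device is a hypothetical helper name used for illustration.
gptid_to_device() {
    awk -v label="$1" '$1 == label { print $3 }'
}

# On a live system (not run here):
#   glabel status | gptid_to_device gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06
```

Nothing is printed when the label is not present, which can happen once the disk has dropped off the bus.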

Chris Moore · May 1, 2019

It's a rack chassis with hot-swap bays, so there's no need to turn the power off before, during, or after a drive replacement. That part of the instructions is for people who don't have a real server and need to disassemble the system to get the drive out.
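
For reference, the hot-swap sequence maps roughly onto the commands below. This is a dry-run sketch that only prints the commands and never runs them; `print_replace_plan` is a hypothetical helper. In this thread the replacement was done through the FreeNAS GUI, which also partitions the new disk, so treat the raw `zpool` commands as an illustration of the order of operations rather than the recommended path.

```shell
#!/bin/sh
# Dry run only: print the commands for a hot-swap replacement, in order.
# Arguments: pool name, failing member, replacement member.
# print_replace_plan is a hypothetical helper; it executes nothing.
print_replace_plan() {
    pool=$1
    old=$2
    new=$3
    echo "zpool offline $pool $old"
    echo "# swap the physical disk in its hot-swap bay (no shutdown needed)"
    echo "zpool replace $pool $old $new"
    echo "zpool status $pool"
}

# Member names taken from the zpool status output in this thread.
print_replace_plan RAIDZ2 \
    gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06 \
    gptid/1625246f-6ccb-11e9-a16e-842b2b4f1e06
```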

Burlumpu Bumpu · May 1, 2019

Chris Moore said:
It's a rack chassis with hot-swap bays, so there's no need to turn the power off before, during, or after a drive replacement. That part of the instructions is for people who don't have a real server and need to disassemble the system to get the drive out.

Thanks, it works now (as far as I can judge). I am just formatting the disk; tomorrow I'll start the resilver (hopefully).

pro lamer · May 2, 2019

Burlumpu Bumpu said:
formatting the disk

Did you mean burn in? And did you mean the new disk?


Burlumpu Bumpu · May 2, 2019

No, I am sg_formatting the new disk because there were some partitions on it. (This has worked for me in the past with this type of disk in this system.)

Burlumpu Bumpu · May 2, 2019

Code:

root@freenas:~ # zpool status
  pool: RAIDZ2
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May  2 13:11:56 2019
    4.68T scanned at 287M/s, 1.90T issued at 116M/s, 16.4T total
    238G resilvered, 11.59% done, 1 days 12:18:14 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    RAIDZ2                                            DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/fbbfea97-78cb-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/ff414dca-78cb-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/02bb0a27-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/0645a295-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        replacing-4                                   FAULTED      0     0     0
          gptid/7f2b5a7c-f102-11e8-84c2-842b2b4f1e06  FAULTED      0    54    67  too many errors
          gptid/1625246f-6ccb-11e9-a16e-842b2b4f1e06  ONLINE       0     0     0
        gptid/0d636c3e-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/10eaa592-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
        gptid/14762089-78cc-11e8-8f71-842b2b4f1e06    ONLINE       0     0     0
    cache
      gptid/73396a40-b812-11e8-8f7e-842b2b4f1e06      ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 01:09:24 with 0 errors on Thu May  2 04:54:24 2019
config:

    NAME          STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        da9p2     ONLINE       0     0     0
        da8p2     ONLINE       0     0     0

errors: No known data errors

It's working, thank you for your support!
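
As a sanity check on the "to go" estimate in the resilver output, the remaining time appears consistent with (total minus issued) divided by the issue rate. A minimal sketch follows, assuming the T and M figures are binary units (1T = 1024 * 1024 M); `resilver_eta_hours` is a hypothetical helper, not a ZFS tool.

```shell
#!/bin/sh
# Estimate remaining resilver time in hours from the zpool status figures.
# Arguments: total (T), issued so far (T), issue rate (M/s).
# Assumption: T and M are binary units, so 1T = 1024 * 1024 M.
resilver_eta_hours() {
    LC_ALL=C awk -v total="$1" -v issued="$2" -v rate="$3" 'BEGIN {
        remaining_m = (total - issued) * 1024 * 1024   # data left, in M
        printf "%.1f\n", remaining_m / rate / 3600     # seconds to hours
    }'
}

# Figures from the May 2 output: 16.4T total, 1.90T issued at 116M/s.
resilver_eta_hours 16.4 1.90 116   # prints 36.4
```

That is roughly 1 day 12 hours, close to the reported 1 days 12:18:14; the small difference is expected because the printed figures are rounded.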
