Resilvering Progress Keeps Resetting

burrm

Cadet
Joined
Jun 17, 2018
Messages
6
I had a drive in a pool which started throwing checksum errors, so I decided to initiate a replace with a spare from the GUI.

The resilvering process has been running for almost two days (it's a 1TB drive), which I guess is somewhat to be expected. However, whenever I check the status, the progress keeps fluctuating up and down. For example, here is the current status:

Code:
root@freenas:~ # zpool status external-tank-01
  pool: external-tank-01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr 26 12:13:17 2020
    1.90T scanned at 2.60G/s, 244G issued at 333M/s, 4.46T total
    40.4G resilvered, 5.33% done, 0 days 03:41:46 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    external-tank-01                                  DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        replacing-0                                   DEGRADED     0     0 33.3K
          da5p2                                       DEGRADED     0     0     0  too many errors
          gptid/16b1571d-86fb-11ea-bb2b-001517acd8c9  ONLINE       0     0     0  (resilvering)
        gptid/702d0621-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/71121431-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/720e10fe-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/72f36ff2-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/7681e9dd-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0

errors: No known data errors


These numbers keep moving up and down. For example, right now it is at 5.33% with 40.4G resilvered, but earlier this morning it was at 14.04% with 107G resilvered. Yesterday there was a similar pattern, with the status numbers moving up and down. This seems odd: one would expect the number to gradually increase toward 100%, even if slowly.

I also note that the reported start time of the resilver changes. For example, it now says "resilver in progress since Sun Apr 26 12:13:17 2020", but when I checked this morning it said "resilver in progress since Sun Apr 26 09:32:46 2020". In fact, I started the replacement process late on Friday night. This leads me to believe that the resilvering process keeps restarting.
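One way to confirm a restart rather than eyeball it is to compare the "since" timestamp between two samples of `zpool status`. Below is a minimal Python sketch of that check. The function names are hypothetical, the regular expressions only assume the output layout shown in this post, and pairing the 09:32:46 timestamp with the 14.04% reading assumes both came from the same check.

```python
import re
from datetime import datetime

# Patterns based only on the `zpool status` layout quoted in this thread.
SCAN_START = re.compile(r"resilver in progress since (.+)")
PERCENT_DONE = re.compile(r"([\d.]+)% done")


def parse_resilver(status_text):
    """Return (start_time, percent_done), or None if no resilver is reported."""
    start = SCAN_START.search(status_text)
    pct = PERCENT_DONE.search(status_text)
    if not start or not pct:
        return None
    started = datetime.strptime(start.group(1).strip(), "%a %b %d %H:%M:%S %Y")
    return started, float(pct.group(1))


def resilver_restarted(previous, current):
    """True if the start timestamp moved forward between two samples."""
    return current[0] > previous[0]


# The two readings quoted above (hypothetical pairing, see the note before the code).
earlier = parse_resilver(
    "scan: resilver in progress since Sun Apr 26 09:32:46 2020\n"
    "107G resilvered, 14.04% done"
)
later = parse_resilver(
    "scan: resilver in progress since Sun Apr 26 12:13:17 2020\n"
    "40.4G resilvered, 5.33% done"
)
print(resilver_restarted(earlier, later))  # True
```

Feeding it the output of `zpool status external-tank-01` every few minutes and logging the result would show exactly when each restart happens.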

Is there a reason to expect the process to restart and/or the numbers to go down as opposed to gradually up, or is this indicative of some other problem? Are there any detailed logs written by the resilvering process that I could check for potential issues?

Version is FreeNAS-11.1-U7 (I know it's old; I was planning to upgrade once this array problem is resolved, in order to reduce the number of variables at play).

Thanks in advance for any ideas.
 

burrm

Cadet
Joined
Jun 17, 2018
Messages
6
Update: I caught it in the act today right before the status reset, and it looks like one of the other drives was also throwing errors right before the status started over. Here is what I saw right before the status went back to 0% a few moments later:

Code:
root@freenas:~ # zpool status external-tank-01
  pool: external-tank-01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr 26 12:13:17 2020
    4.02T scanned at 435M/s, 3.08T issued at 333M/s, 4.46T total
    529G resilvered, 68.93% done, 0 days 01:12:49 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    external-tank-01                                  DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        replacing-0                                   DEGRADED     0     0 33.3K
          da5p2                                       DEGRADED     0     0     0  too many errors  (resilvering)
          gptid/16b1571d-86fb-11ea-bb2b-001517acd8c9  ONLINE       0     0     0  (resilvering)
        gptid/702d0621-e9fc-11e9-9ee0-001517acd8c9    ONLINE      86    93     0  (resilvering)
        gptid/71121431-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/720e10fe-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/72f36ff2-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0
        gptid/7681e9dd-e9fc-11e9-9ee0-001517acd8c9    ONLINE       0     0     0


It seems like that drive might be having issues too, but why would this cause the resilvering of the other drive to reset? Would it be best to let this replacement finish before attempting to replace a second drive?
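To spot a second misbehaving disk without catching it by hand, the config table can be scanned for any row with a nonzero READ, WRITE or CKSUM counter. This is a rough Python sketch, assuming the column layout shown in the output above; the function name is hypothetical, and counters are kept as strings because values such as `33.3K` are abbreviated.

```python
def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for config rows with a nonzero counter."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:1] == ["config:"]:
            in_config = True          # the device table starts after this line
            continue
        if not in_config:
            continue
        if fields[:1] == ["errors:"]:
            break                     # end of the device table
        if len(fields) < 5 or fields[0] == "NAME":
            continue                  # blank lines and the header row
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            found.append((name, read, write, cksum))
    return found
```

Run against the status output above, this would report the `replacing-0` row (33.3K checksum errors) and the `gptid/702d0621-...` disk (86 read, 93 write errors).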
 
Joined
Jul 2, 2019
Messages
648
What are the makes and models of your drives?
 

burrm

Cadet
Joined
Jun 17, 2018
Messages
6
The failing ones are quite old Western Digital Caviar Blacks (01FALS-00J7B1). It's not surprising that they are coming to end of life; I'm trying to replace them before it's too late.

The replacement is a Hitachi/HGST Ultrastar 7K4000.
 
Joined
Jul 2, 2019
Messages
648
Ok. I don't think that those disks would be affected by the SMR issues that have been reported. Could there be a cable problem?
 

burrm

Cadet
Joined
Jun 17, 2018
Messages
6
Could there be a cable problem?

Don't know.

I'm tempted to offline the drive that's throwing the read/write errors and see if that will allow the resilver to finish. Since it's raidz2 and I have a total of 6 drives in the pool, it should theoretically be OK (provided a third drive doesn't fail next), right?
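The redundancy arithmetic behind that question can be written out as a tiny Python sketch. It rests on a deliberately conservative assumption that is not stated anywhere in this thread: the member still being replaced is counted as unavailable until its resilver completes. The function name is hypothetical.

```python
def remaining_margin(parity, unavailable):
    """Further device failures a raidz vdev can absorb.

    parity: 2 for raidz2.
    unavailable: members not currently providing a complete, trusted copy.
    """
    return max(parity - unavailable, 0)


# raidz2 with every member healthy
print(remaining_margin(2, 0))  # 2
# Assumption: the disk under replacement counts as unavailable until resilvered
print(remaining_margin(2, 1))  # 1
# Additionally offlining the disk that reports read/write errors
print(remaining_margin(2, 2))  # 0
```

Under that assumption, offlining the second disk leaves no margin at all until the replacement finishes, which is the "provided a third drive doesn't fail" caveat in numbers.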
 