Resilvering appears to run over and over again.

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
I replaced a disk that I had received some errors on a particular disk. 427 currently unreadable (pending) sectors. I connected the new disk to an unused cable from my SAS 8087 to SATA cable. Via the GUI in Storage >Pools I selected the failing disk and selected replace and selected the new disk, this was three days ago on 03/15/2020. Then I left the Resilvering process to finish reportedly it was only going to take about 6 hours or so. However the next day it appeared to be running the resilver again. I don't have a way that I know of to verify that the resilver succeeded. but I can see my daily run output emails and see that it has run multiple times. I can include the few outputs of zpool status. Raid Z1. Other information in my signature but if you need anything else let me know. I'd appreciate any help or advice, Thanks.

Code:
 pool: Media
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Mar 16 02:04:20 2020
        3.84T scanned at 1.16G/s, 1.79T issued at 552M/s, 9.11T total
        289G resilvered, 19.64% done, 0 days 03:51:48 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            gptid/630a7525-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 2.25K
              gptid/63dc2528-de75-11e5-be1b-d050991c32b1  ONLINE   1.45K 7.97K     0
              gptid/0b751328-670a-11ea-a9ca-d050991c32b1  ONLINE       0     0     0
            gptid/64af09ea-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/6582f461-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/665b4259-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/672c632c-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0

errors: No known data errors


Code:
  pool: Media
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 17 02:36:10 2020
        3.02T scanned at 2.07G/s, 604G issued at 415M/s, 9.11T total
        94.6G resilvered, 6.47% done, 0 days 05:58:52 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            gptid/630a7525-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 3.92K
              gptid/63dc2528-de75-11e5-be1b-d050991c32b1  ONLINE   2.69K 12.1K     0
              gptid/0b751328-670a-11ea-a9ca-d050991c32b1  ONLINE       0     0     0
            gptid/64af09ea-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/6582f461-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/665b4259-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/672c632c-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0

errors: No known data errors


Code:
  pool: Media
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 18 02:01:41 2020
        3.71T scanned at 1.07G/s, 1.66T issued at 488M/s, 9.11T total
        268G resilvered, 18.19% done, 0 days 04:26:46 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            gptid/630a7525-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 3.92K
              gptid/63dc2528-de75-11e5-be1b-d050991c32b1  ONLINE   2.69K 12.1K     0
              gptid/0b751328-670a-11ea-a9ca-d050991c32b1  ONLINE       0     0     0
            gptid/64af09ea-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/6582f461-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/665b4259-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/672c632c-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0

errors: No known data errors


Code:
 pool: Media
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 18 12:58:45 2020
        5.08T scanned at 477M/s, 4.78T issued at 450M/s, 9.11T total
        775G resilvered, 52.50% done, 0 days 02:48:14 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            gptid/630a7525-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 3.92K
              gptid/63dc2528-de75-11e5-be1b-d050991c32b1  ONLINE   2.69K 12.1K     0
              gptid/0b751328-670a-11ea-a9ca-d050991c32b1  ONLINE       0     0     0
            gptid/64af09ea-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/6582f461-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/665b4259-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/672c632c-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0

errors: No known data errors
 
Joined
Jan 7, 2015
Messages
1,155
Run zpool status -v and see if there are files that are corrupt in that pool. Could be that one whole drive wasnt enough parity causing the resilver to run in a loop trying to fix itself. The good news is not all might be lost. If there are corrupt files like I might suspect, you can either restore the suspect files from backup or delete them. Once you do this you can probably run the resilver successfully, in theory.
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
Here is the output of zpool status -v. I don't see anything indicating corrupt files. Thanks for your help.


root@freenas:/var/log # zpool status -v
pool: Media
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Mar 18 19:40:41 2020
3.06T scanned at 1.08G/s, 1.20T issued at 434M/s, 9.11T total
194G resilvered, 13.20% done, 0 days 05:18:23 to go
config:

NAME STATE READ WRITE CKSUM
Media ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/630a7525-de75-11e5-be1b-d050991c32b1 ONLINE 0 0 0
replacing-1 ONLINE 0 0 3.92K
gptid/63dc2528-de75-11e5-be1b-d050991c32b1 ONLINE 2.69K 12.1K 0
gptid/0b751328-670a-11ea-a9ca-d050991c32b1 ONLINE 0 0 0
gptid/64af09ea-de75-11e5-be1b-d050991c32b1 ONLINE 0 0 0
gptid/6582f461-de75-11e5-be1b-d050991c32b1 ONLINE 0 0 0
gptid/665b4259-de75-11e5-be1b-d050991c32b1 ONLINE 0 0 0
gptid/672c632c-de75-11e5-be1b-d050991c32b1 ONLINE 0 0 0

errors: No known data errors

pool: freenas-boot
state: ONLINE
status: One or more devices are configured to use a non-native block size.
Expect reduced performance.
action: Replace affected devices with devices that support the
configured block size, or migrate data to a properly configured
pool.
scan: scrub repaired 0 in 0 days 00:00:36 with 0 errors on Wed Mar 11 03:45:36 2020
config:

NAME STATE READ WRITE CKSUM
freenas-boot ONLINE 0 0 0
gptid/172478fd-de9e-11e4-bb81-d050991c32b1 ONLINE 0 0 0 block s

errors: No known data errors
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
This might be easier to read.
1584581764131.png
 
Joined
Jan 7, 2015
Messages
1,155
What about edit: zpool history
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
Maybe there is something different about my system zfs history does not appear to be a command.

zfs history
unrecognized command 'history'
 
Joined
Jan 7, 2015
Messages
1,155
sorry its zpool history
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
No problem.
 

Attachments

  • zpool history.txt
    54.7 KB · Views: 286
Joined
Jan 7, 2015
Messages
1,155
Strange there are no entries after March 16th. But I can see you started a resilver again on the 18th.
Code:
2020-03-15.12:12:14 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 9108863970127805509
2020-03-15.12:12:14 zpool set cachefile=/data/zfs/zpool.cache Media
2020-03-15.12:48:28 <iocage> zfs set org.freebsd.ioc:active=yes Media
2020-03-15.12:49:08 <iocage> zfs set org.freebsd.ioc:active=yes Media
2020-03-15.13:24:49 <iocage> zfs mount Media/iocage/download/11.3-RELEASE
2020-03-15.13:25:03 <iocage> zfs mount Media/iocage/releases/11.3-RELEASE/root
2020-03-15.13:25:42 zfs snapshot Media/iocage/releases/11.3-RELEASE/root@test
2020-03-15.13:25:48 zfs clone -p Media/iocage/releases/11.3-RELEASE/root@test Media/iocage/jails/test/root
2020-03-15.13:30:41 <iocage> zfs set org.freebsd.ioc:active=yes Media
2020-03-15.13:30:46 <iocage> zfs set org.freebsd.ioc:active=yes Media
2020-03-15.13:31:58 <iocage> zfs set org.freebsd.ioc:active=yes Media
2020-03-15.13:31:59 zfs create -p Media/iocage/jails/transmission_1/root
2020-03-15.13:31:59 <warden-migration> zfs destroy Media/iocage/jails/transmission_1/root
2020-03-15.13:32:00 <warden-migration> zfs snapshot -r Media/jails/transmission_1@WardenMigration_2020-03-15
2020-03-15.13:32:54 <warden-migration> zfs destroy WardenMigration_2020-03-15
2020-03-15.13:35:11 <iocage> zfs snapshot -r Media/iocage/jails/transmission_1@ioc_update_9.3-RELEASE
2020-03-15.17:00:36 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 9108863970127805509
2020-03-15.17:00:36 zpool set cachefile=/data/zfs/zpool.cache Media
2020-03-15.17:12:39 zpool replace Media /dev/gptid/63dc2528-de75-11e5-be1b-d050991c32b1 gptid/0b751328-670a-11ea-a9ca-d050991c32b1
2020-03-15.17:20:43 <iocage> zfs destroy test
2020-03-15.17:20:57 <iocage> zfs destroy Media/iocage/jails/test/root
2020-03-15.17:21:21 <iocage> zfs destroy Media/iocage/jails/test
2020-03-15.18:31:38 zpool import -c /data/zfs/zpool.cache.saved -o cachefile=none -R /mnt -f 9108863970127805509
2020-03-15.18:31:38 zpool set cachefile=/data/zfs/zpool.cache Media
2020-03-16.09:41:22 <iocage> zfs rename -f  Media/iocage/jails/transmission_1
2020-03-16.09:41:31 zfs rename -r Media/iocage@transmission_1 @transmission


Not sure if destroying pools during a resilver could have mucked some things up. Best practice is to let the machine sit during a resilver minimize disc access during a resilver.

At this point id try to do a full scrub of that pool and see what it returns, I still think you might have some corrupt data lurking somewhere.
 
Joined
Jan 7, 2015
Messages
1,155
If you lose one disk, and then you replace it, and another disk discovers an error while resilvering, you lose some data. Is there anything more when you run zpool status -xv
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
No nothing else that I see. I didn't think about it removing a pool. I was just playing with iocage jails and created a test jail and removed it later. Also I did not intentionally start the resilver on the 18th. I started it initially when I "replaced" the disk in the GUI and it has just been running ever since. Should zpool history show every resilver operation? I would find it interesting and informative to see if the resilver does indeed finish and if it just starts again as soon as it is done. I will start the full scrub before I go to bed.

Code:
root@freenas:/var/log # zpool status -xv
  pool: Media
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 18 19:40:41 2020
        4.38T scanned at 787M/s, 2.52T issued at 453M/s, 9.11T total
        407G resilvered, 27.63% done, 0 days 04:14:29 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        Media                                             ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            gptid/630a7525-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0 3.92K
              gptid/63dc2528-de75-11e5-be1b-d050991c32b1  ONLINE   2.69K 12.1K     0
              gptid/0b751328-670a-11ea-a9ca-d050991c32b1  ONLINE       0     0     0
            gptid/64af09ea-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/6582f461-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/665b4259-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0
            gptid/672c632c-de75-11e5-be1b-d050991c32b1    ONLINE       0     0     0

errors: No known data errors
 
Joined
Jan 7, 2015
Messages
1,155
I would think it would show everything dealing with zfs, especially resilvering attempts and those results. How often do you scrub?
 
Joined
Jan 7, 2015
Messages
1,155
Also if possible keep checking in every hour or two and see how for along does it get before restart. 52.50% is your record.
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
IF I can read cron jobs correctly it looks like every Sunday at midnight. 00 00 * * 7 but I also have Threshold days set to 35 so I think that means it would not run if it had run withing 35 days not sure how I messed that up.
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
I found something that looks like what I am experiencing. https://github.com/openzfs/zfs/issues/840 it is years old but I think it could still be helpful. I borrowed the script from here and have it running so it should tell us when it happens and how far it gets next time. The Github forum looks like good information and looks to be related to OpenZFS which may not be exactly the same as what freenas is using however, I would be surprised if this was not related to my problem as the manner in which it restarts appears the same and they mention SATA disks connected to SAS ports via SAS expanders they may not be using the vocabulary that I am used to but I think they are talking about SFF-8087 to SATA breakout cables like this https://tinyurl.com/yyy7mey3 which is exactly what I am using. but, I may need help interpreting their comments. Problem solving a researching is fun.

Code:
#!/bin/bash

while true; do ts=$(date +"%F-%T"); echo -e "$ts \c" >> resilver.txt; zpool status Media | egrep -1 scanned >> resilver.txt; sleep 30; done
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
It looks to me like you're resilvering because the bad disk is still online.

What happens if you offline that disk which has the read/write errors?
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
The resilver is indeed restarting when it is between 50% and 57% done. As evidenced from the details below. Also I am not able to run a scrub as the resilver operation is still running. I have ordered new SAS sff 8087>SATA breakout cables so I will try replacing the cables when they come in. I have not tried disconnecting the bad drive as now I know the resilver is not completing. But I may try removing the bad drive or the replacement drive to see if I can run the scrub.

Code:
2020-03-18-23:08:27   scan: resilver in progress since Wed Mar 18 19:40:41 2020
    7.30T scanned at 614M/s, 5.10T issued at 429M/s, 9.11T total
    827G resilvered, 55.97% done, 0 days 02:43:26 to go
2020-03-18-23:08:57   scan: resilver in progress since Wed Mar 18 23:08:37 2020
    503G scanned at 1.17G/s, 5.22G issued at 268M/s, 9.11T total
    0 resilvered, 0.06% done, 0 days 09:54:58 to go

2020-03-19-02:01:56   scan: resilver in progress since Wed Mar 18 23:08:37 2020
    5.49T scanned at 553M/s, 4.58T issued at 462M/s, 9.11T total
    742G resilvered, 50.24% done, 0 days 02:51:38 to go
2020-03-19-02:02:26   scan: resilver in progress since Thu Mar 19 02:02:13 2020
    283G scanned at 1.81G/s, 4.98G issued at 392M/s, 9.13T total
    0 resilvered, 0.05% done, 0 days 06:46:16 to go

2020-03-19-05:39:06   scan: resilver in progress since Thu Mar 19 02:02:13 2020
    7.35T scanned at 592M/s, 5.21T issued at 420M/s, 9.13T total
    845G resilvered, 57.11% done, 0 days 02:42:51 to go
2020-03-19-05:39:36   scan: resilver in progress since Thu Mar 19 05:39:33 2020
    285G scanned at 3.07G/s, 4.98G issued at 1.66G/s, 9.13T total
    0 resilvered, 0.05% done, 0 days 01:33:44 to go

2020-03-19-09:01:16   scan: resilver in progress since Thu Mar 19 05:39:43 2020
    7.25T scanned at 629M/s, 5.14T issued at 446M/s, 9.13T total
    833G resilvered, 56.32% done, 0 days 02:36:19 to go
2020-03-19-09:01:46   scan: resilver in progress since Thu Mar 19 09:01:20 2020
    2.48T scanned at 1.49G/s, 9.49G issued at 374M/s, 9.13T total
    547M resilvered, 0.10% done, 0 days 07:06:13 to go

2020-03-19-12:30:10   scan: resilver in progress since Thu Mar 19 09:01:20 2020
    7.25T scanned at 606M/s, 5.23T issued at 438M/s, 9.13T total
    848G resilvered, 57.33% done, 0 days 02:35:26 to go
2020-03-19-12:30:40   scan: resilver in progress since Thu Mar 19 12:30:27 2020
    1.41T scanned at 2.78G/s, 5.91G issued at 466M/s, 9.13T total
    0 resilvered, 0.06% done, 0 days 05:42:07 to go
 

gg_morpheus

Dabbler
Joined
Jan 11, 2019
Messages
14
Sretalla, you were so right. I off-lined the disk and the resilver completed. In about 6 hours. Thanks John and Sretalla!
 
Joined
Jan 7, 2015
Messages
1,155
It's elementary apparently. I've never replaced an online disk. Good shiz.
 
Top