Replication will not send corrupt data - resends over and over

Status
Not open for further replies.

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
My pool had two disks (3T Barracudas, surprise:mad:) start throwing uncorrectable sector errors, so I replaced the worse of the two. I also noticed that replication hadn't been running since I updated the push machine from 9.3 to 9.10 a few weeks ago. During the resilver, I ended up with some minor data corruption, and I am trying to make sure I can "save" the rest in case the other drive craps out before the resilver completes, since its error count keeps going up.
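
For anyone following along, this is roughly how the error count on the remaining suspect drive can be watched from the shell; /dev/ada3 is only a placeholder for whichever device it actually is.

Code:
# A minimal sketch -- substitute the real device name for /dev/ada3.
# -A prints the SMART attribute table; these three attributes are the usual
# early-warning counters for failing sectors.
smartctl -A /dev/ada3 | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'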

I updated my replication pull target to 9.10 and moved the system dataset to make replication run again, but the one affected snapshot fails to send and replication just runs over and over.

This is the state of the pool. I appear to have the rest of the data replicated now, but even after I swap the disks and save the pool, I will still need to fix this somehow.

Code:
[root@freenas] ~# zpool status -v zfspool
  pool: zfspool
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu May 19 06:30:07 2016
        2.85T scanned out of 4.01T at 79.4M/s, 4h14m to go
        612G resilvered, 71.08% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfspool                                         ONLINE       0     0     8
          raidz1-0                                      ONLINE       0     0    16
            gptid/ae451901-b3d1-11e4-b68a-001e4fb0f51d  ONLINE       0     0     0
            ada1                                        ONLINE       0     0     0  (resilvering)
            gptid/1d7db6e9-add2-11e2-ab62-525400390d09  ONLINE       8     0     0
            gptid/1e1b3145-add2-11e2-ab62-525400390d09  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zfspool/home@auto-20160516.1822-7d:/path/to/badfile.gz


I continually get emails with this:


Code:
Hello,
    The replication failed for the local ZFS zfspool/home while attempting to
    apply incremental send of snapshot auto-20160515.1822-7d -> auto-20160516.1822-7d to 10.0.0.52
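
For what it's worth, the failing step can be reproduced by hand to see exactly what error zfs prints. The command below is a sketch built from the names in that email; the pull-side dataset path "backup/home" is a guess, as is plain root ssh for the transport.

Code:
# Replay the failing incremental outside the replication task (snapshot names
# and target IP from the email above; the receive-side dataset is hypothetical).
zfs send -v -i auto-20160515.1822-7d zfspool/home@auto-20160516.1822-7d \
    | ssh root@10.0.0.52 zfs receive -F backup/home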



  • Is there something like the zfs_send_corrupt_data tunable from ZFS on Linux that will override this behavior and allow me to send the rest of the intact data?
  • Can I tell ZFS to ignore the corrupt file in the snapshot and carry on?
  • If the fix is to stop replication so that I can unlock and delete the affected snapshot and its children, how would I do that? (See the sketch after this list.)
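
To make that last question concrete, here is the rough sequence I have in mind, pieced together from the zfs(8) man page; TAG is whatever (if anything) "zfs holds" reports, and the sysctl grep is just a way to check whether FreeBSD exposes anything like the ZFS-on-Linux tunable at all.

Code:
# Question 1: zfs_send_corrupt_data is a ZFS-on-Linux module parameter; check
# whether FreeBSD/FreeNAS exposes anything similar (no output = no such knob).
sysctl -a | grep -i corrupt

# Question 3: look for holds that would block deletion, release any that are
# listed (replace TAG with the tag shown by "zfs holds"), then destroy the
# snapshot recursively so same-named snapshots on child datasets go with it.
zfs holds zfspool/home@auto-20160516.1822-7d
zfs release TAG zfspool/home@auto-20160516.1822-7d
zfs destroy -r zfspool/home@auto-20160516.1822-7d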
 

dlavigne

Guest
AFAIK the only way to fix this is to check the "Delete stale snapshots on remote system" box on the replication task. Note that this will delete all of the snapshots on the remote system, which will require a full replication.
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
I checked, and that option is enabled for my replication job already. I seem to remember something about initializing the remote side when I set this up on 9.3, but I don't see that option anywhere since I switched to 9.10.
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
Gotcha. Since that's enabled, is this a bug? I'm not entirely sure what the expected behavior is if that doesn't do it.
 

dlavigne

Guest
Me neither, so it probably wouldn't hurt to create a bug at bugs.freenas.org. If you do, post the issue number here.
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
I just realized I posted the wrong version number. I am running 10.3, not 9.10. 9.10 is what I just upgraded from.
 

dlavigne

Guest
I assume you mean 9.3? If so, that is a downgrade from 9.10.
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
No, I upgraded from 9.10 to 10.3, the newest version.

Code:
FreeBSD 10.3-RC3 (FreeNAS.amd64) #0 86b9b91(freebsd10): Mon Mar 21 17:43:20 PDT 2016
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
Currently, the pool looks like this; presumably the affected snapshot rolled off due to age, since the error is now reported as an object ID rather than a file path.


Code:
[root@freenas] ~# zpool status -v zfspool
  pool: zfspool
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 44K in 22h34m with 1 errors on Mon May 23 06:29:11 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfspool                                         ONLINE       0     0   289
          raidz1-0                                      ONLINE       0     0   578
            gptid/ae451901-b3d1-11e4-b68a-001e4fb0f51d  ONLINE       0     0     0
            ada1                                        ONLINE       0     0     0
            gptid/1d7db6e9-add2-11e2-ab62-525400390d09  ONLINE     289     0     0
            gptid/1e1b3145-add2-11e2-ab62-525400390d09  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x230b>:<0x29927>
 

dlavigne

Guest
That's just the underlying FreeBSD version. What is the actual FreeNAS "Build" (from System -> Information)?
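
If a shell is handier than the GUI, the same build string should also be in /etc/version on FreeNAS (going from memory, so worth double-checking):

Code:
cat /etc/version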
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
[Attached screenshot: upload_2016-5-24_10-35-52.png]
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
Do you have any updates available under System --> Update? If I recall correctly, 9.10-RELEASE is the first version of 9.10.
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
Yes, it does offer me some updates, but I didn't see anything in the changelog emails it sends that appeared to reference this. I had planned to update it anyway, but it will be a while before I can schedule downtime to update this machine.

I have filed this issue in the bug tracker:
https://bugs.freenas.org/issues/15532
 

amorton12

Dabbler
Joined
May 19, 2016
Messages
10
That got quickly marked as "Behaves correctly", with the note that I need to remove the offending data. Since the snapshot expiring had already done that for me, I went ahead and started a scrub. As soon as the scrub started, the pool began reporting no errors.
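
For anyone who lands here later, the final cleanup amounted to roughly the following; the zpool clear at the end is only needed if the per-device error counters are still non-zero once the scrub finishes.

Code:
zpool scrub zfspool        # re-walk the pool now that the offending snapshot is gone
zpool status -v zfspool    # watch progress and confirm the error list is empty
zpool clear zfspool        # reset any leftover READ/WRITE/CKSUM counters if needed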
 