Replication process at 100%. Local replication.


scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Summary: I'm getting "error getting available space" and "error getting total space" on my backup mirror while replicating. I may have brought this on myself by changing the compression setting on the backup mirror while replicating to it.
--------------------------------------------------------------
Today I set up replication and got it running. I am using it to back up from one local ZFS array to another; the target dataset is nested one level below a top-level zvol. After the replication started, an entry appeared on the replication page with a status showing 0% progress. I'm replicating about 4 TB of data, but the progress percentage didn't move even after waiting well past the point where I thought it should have.
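(As I understand it, a local replication in FreeNAS is still a zfs send piped through ssh to localhost and into zfs receive, so it's roughly equivalent to the sketch below. The snapshot and dataset names here are just placeholders for illustration, not what the system actually generated.)
Code:
# rough sketch of what the replication amounts to (placeholder names, not the real command line)
zfs send -R firstvol/firstmain@auto-snapshot | \
    ssh -i /data/ssh/replication localhost zfs receive -F backupone/bfirstmain/firstmain
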

So I began to investigate:
  1. I looked at the disk usage and found that it was barely using the disks. We're talking occasional flashes of a few MB/s read and write.
  2. I looked at the process and found that it was at 100% WCPU.
My replication is set to use the fast encryption cipher and my CPU supports AES-NI (i3-4130), so I would not have expected the CPU to be such a choke point. Then it occurred to me... I had set the backup mirror zvol to use gzip (maximum, slow). This is my first backup and I have a copy of the data, so I figured I'd experiment and see whether I could change the compression setting of the backup mirror while a replication is in progress (see the sketch after the screenshot below for what that amounts to). First I changed it to gzip default, which didn't fix the CPU usage issue; then I changed it back to lz4, which didn't fix it either. A short while later I noticed this:
[Attached screenshot: 2014-04-16 20_00_26-Internet Explorer.png — the storage tab showing the error on the backup dataset]
...the storage tab shows an additional row with an error message about my backup dataset. So that's a problem. I did some searching on the forum and couldn't find anyone who had created this issue the same way I did. The zvol parent of this dataset is still reported as "healthy."
You'll also notice the backup dataset shows very little used space. So even if I borked this dataset by changing compression mid-replication, I doubt everything was A-OK before that, because the dataset should have been filling up faster.
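(For reference, the compression changes I made amount to something like this in shell terms; I actually did everything through the GUI, and the dataset name is just my backup target as an example.)
Code:
# what the GUI changes boil down to (hypothetical; I used the web UI, not the shell)
zfs set compression=gzip-9 backupone/bfirstmain   # original setting: gzip maximum (slow)
zfs set compression=gzip backupone/bfirstmain     # first change: gzip at the default level
zfs set compression=lz4 backupone/bfirstmain      # second change: back to lz4
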
Is this a bug or did I do something stupid? Or both?
Any data I could gather to help with an investigation?
Any advice how to cleanly stop a replication in progress?
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
My system specs, recorded here in case my signature ever changes.

FreeNAS-9.2.1.3-RELEASE-x64 (dc0c46b)
Intel i3-4130
Corsair 2x8GB RAM (CT2KIT102472BD160B)
Supermicro X10SL7-F (onboard IT-flashed LSI 2308 for 8 SAS / onboard Intel for 6 SATA)
Supermicro 933T-R760B, 3U 15-bay hotswap case
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Well, changing stuff while replicating.... not a good idea.

But ZFS goes bonkers reporting available disk space while replicating data. That's completely typical.

I'd say stop customizing stuff and stop playing with it while it's working. My guess is there's no bug, but you've got too many variables to say that for certain. Replication problems become a big problem for a lot of people real fast, and they end up all over the forums, bug tickets, and IRC. So my guess is this is user error.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Full disclosure, I forgot to mention: I have multiple snapshot tasks with different expirations set up for the volume being backed up. Some of my research suggests that this can be an issue for replication. Wish I had known that when I set it up.

I have to agree that user error is the prime suspect for the error message I see on the storage tab. But that's only one of the issues.

The high CPU usage and low disk usage seem wrong, and that was happening before I goofed with the compression.

Is 100% WCPU in a local replication scenario expected?
Code:
  PID USERNAME    THR PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
17811 root          1 103    0  149M 47328K CPU2    2 169:05 100.00% python2.7
17834 root          1  20    0 69524K  8772K select  0  3:04  1.27% sshd
17833 root          1  20    0 38340K  5168K select  3  3:01  1.27% ssh
17838 root          1  20    0 37588K  2844K piperd  1  0:26  0.00% zfs
3167 root          6  20    0  367M  178M usem    1  0:22  0.00% python2.7
17832 root          2  20    0 37720K  3280K nanslp  1  0:15  0.00% zfs
14836 root        12  20    0  142M 17464K uwait  0  0:10  0.00% collectd
21101 root          2  20    0  102M 20700K select  1  0:01  0.00% python2.7



Check my volume's iostats for proof of just how slow it is going:
Code:
[root@thumper] ~# zpool iostat firstvol backupone 1
              capacity    operations    bandwidth
pool        alloc  free  read  write  read  write
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0    22    577  1.56M
firstvol    4.01T  3.24T    13    13  1.51M  58.3K
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.75M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    30      0  3.87M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0    279      0  19.4M
firstvol    4.01T  3.24T    29    159  3.63M  675K
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.75M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.75M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    30      0  3.87M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.0G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.63M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0    272      0  18.5M
firstvol    4.01T  3.24T    28      0  3.62M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    30      0  3.87M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.75M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.75M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    28      0  3.62M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0    280      0  19.5M
firstvol    4.01T  3.24T    29    159  3.75M  675K
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    30      0  3.87M      0
----------  -----  -----  -----  -----  -----  -----
backupone  35.1G  3.59T      0      0      0      0
firstvol    4.01T  3.24T    29      0  3.63M      0
----------  -----  -----  -----  -----  -----  -----


Note that the backup target volume writes in little bursts just fine, then pauses and waits for the source. The source volume is being read at a constant, slow rate. Possibly that's because it's CPU-limited? It seems so damn slow!

I wouldn't blame anyone who's not willing to troubleshoot this with me, since I did goof with the compression. But how do I stop this bus so I can get off and try again? Killing the process seems wrong, and I can't block the network transfer like some other chump on these forums did, since this is a local replication. Is there no safe way to abort?

Thanks.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I just checked, and the only snapshots that have had a chance to be taken are from a single snapshot task. The second task on this volume hasn't run yet, since it is only scheduled weekly.
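(For anyone checking the same thing, listing the snapshots from the shell looks roughly like this; "firstvol" is my source pool.)
Code:
# list every snapshot on the source pool, oldest first
zfs list -t snapshot -r -o name,used,creation -s creation firstvol
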
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
There is a bug with replication... 9.2.1.4-beta has it fixed.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I am really showing myself to be a noob today. I just realized that my source volume is bigger than my target volume. Sorry to waste everyone's time.

How do I safely stop this replication?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Rebooting the server is what I'd do... or you could kill the task.
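(If you go the kill route, something like this will show you the pieces involved first; the PID below is just the one from your earlier top output, yours will differ.)
Code:
# find the replication worker plus the zfs send/receive and ssh it spawned
ps auxww | egrep 'autorepl|zfs|ssh' | grep -v grep
# then kill the autorepl python process (example PID from the thread)
kill 17811
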
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Thanks. Rebooting the server seemed to work fine and it also made the error on the storage tab go away.

If I remember, I'll report back with my replication results when I try again, after I move the things off the source dataset that I forgot to move before backing up. Then I won't have the space problem that I presume was making this flaky today.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
OK. I emptied out the source volume that I'm having trouble replicating, so it is now much smaller than my mirrored backup volume. I destroyed and recreated the mirrored backup volume, and I deleted and recreated the snapshot task (only a single one this time) and the replication task. I pretty much blew everything away and recreated it, so this should be a vanilla replication. I am still having the issue.
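(Before retrying I also compared the sizes from the shell with something like this; output omitted here.)
Code:
# compare used space on the source against what the backup pool can hold
zfs list -o name,used,avail firstvol backupone
zpool list firstvol backupone
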

Here's what the console logs when the replication starts. (I had the "Initialize remote side" option checked.)
Code:
Apr 17 18:48:01 thumper autorepl.py: [tools.autorepl:291] Destroying remote backupone/bfirstmain/firstmain
Apr 17 18:48:01 thumper autorepl.py: [common.pipesubr:72] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -o ConnectTimeout=7 -p 22 localhost /sbin/zfs destroy -rRf backupone/bfirstmain/firstmain
Apr 17 18:48:01 thumper common.pipesubr: cannot open 'backupone/bfirstmain/firstmain': dataset does not exist


...from reading that, it looks like it destroys the remote dataset and then complains that the dataset it just destroyed doesn't exist. Huh? But the replication seems to "start" anyway, and the dataset is still visible in the storage tab.
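(To see whether that dataset really exists at this point, a plain listing of the backup pool is enough; this is just how I checked, output not captured here.)
Code:
# show everything currently on the destination pool, including child datasets
zfs list -r -t all backupone
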

After the replication starts, I immediately get the red traffic light and the "volume/dataset -> replication host failed: None" error when I click the light. A critical email is sent to me with a similar message. The CPU goes to 100% WCPU for one of the python processes, as described above, and the zpool iostat profile is the same as above, indicating a very, very slow transfer. I am again seeing the "error getting available space" and "error getting total space" messages in the storage tab row for the target dataset being replicated to.

The only bug that I could find that seemed related to my issue was this one:
https://bugs.freenas.org/issues/4631

In that bug, William suggests looking at a /tmp/repl-* file for the true error message. The file was there, but it was empty when I opened it with ee.
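(For anyone else following that suggestion, this is roughly how I looked; the file name has a generated suffix, so the wildcard is deliberate.)
Code:
# per the bug report, the real replication error should land in here; mine was zero bytes
ls -l /tmp/repl-*
cat /tmp/repl-*
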

Eventually the critical warning went away, but the errors on the storage tab still exist. Another oddity: the row with the errors is a child of my mirrored volume, but it doesn't appear indented like it should. That's a minor issue, but it might be of use to someone.

Not sure what to do now. Should I file this as a new bug or is it really just the existing, resolved bug? Is there really a bug which prevents a vanilla replication like this? What is the best version of FreeNAS to be using right now? I am on 9.2.1.3.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
There should not have been any datasets on the destination pool. Additionally, you probably have to check the "initialize destination" option (or whatever it's called) when you create the task. The destination pool should have been empty when you started.

The error getting space is ALWAYS going to happen during replication. It's 100% normal. It's because the output the GUI parses to get that info can't be parsed properly while a replication is running. Get used to it. ;)
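(If you want to sanity-check the space yourself while a replication is running, ask ZFS directly instead of the GUI; swap in your own pool name.)
Code:
# space accounting straight from ZFS, which keeps working while the GUI row shows the error
zfs list -r -o name,used,avail backupone
zpool list backupone
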

There's a python bug for FreeNAS 9.2.1.3 too causing 100% CPU usage. It's fixed in the 9.2.1.4-beta.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
To be clear:

Replication task replicates to:
backupone/bfirstmain

...where bfirstmain is an empty dataset. I did check the "initialize remote side" option in the replication task.

You said "there should not have been any datasets on the destination pool." Then why is this in the GUI?
[Attached screenshot: 2014-04-17 23_10_15-thumper - FreeNAS-9.2.1.3-RELEASE-x64 (dc0c46b).png]
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
FreeNAS-9.2.1.4.1-RELEASE-x64 (fec915c) fixed this issue for me. All I had to do was re-enable the replication task I had previously disabled because of the bug, and it worked this time. Thanks!

Not sure if this means I can start using multiple snapshots or not.
 