Replication Error - Cannot Create Device Links


TScott

Cadet
Joined
Aug 30, 2012
Messages
9
I've run into a strange replication issue between 2 FreeNAS boxes. Details of their config are below:

SAN1
FreeNAS-8.2.0-RELEASE-p1-x64 (r11950)
1x 1Gb NIC for Management
1x 1Gb NIC for iSCSI
ZFS Pool:
1 ZFS pool with two 2-way mirror vdevs of 3TB drives
2 zvols (1.07TB and 2.05TB)

SAN2
FreeNAS-8.2.0-RELEASE-p1-x64 (r11950)
1x 1Gb NIC for Management
1x 1Gb NIC for iSCSI
ZFS Pool:
1 ZFS pool with two 2-way mirror vdevs of 3TB drives
2 zvols (1.07TB and 2.05TB)

The configuration for both (and the hardware) is practically identical, apart from unique items like IPs and public keys. Each has 2 zvols "native" to it, with one of those zvols replicated to the other SAN: one of SAN1's zvols gets sent over to SAN2, and one of SAN2's zvols gets sent over to SAN1.

At first blush, replication seemed to be working fine. With mostly empty zvols, a periodic snapshot task was created, followed by a replication task, on both boxes. The empty zvols were replicated over just fine: I could see the most recent snapshot of the partner SAN appear on the "Pull" side, and I could see the zvol added to the filesystem under the Active Volumes page as well as in zfs list (from SSH). I then ran an initial test with a few hundred MB of data on one zvol, and replication continued to work; the periodic snapshots were sent over (from both sides) just fine.

However, once I moved the roughly 500GB of data targeted for one of the zvols onto SAN1, the next snapshot/replication interval produced a "broken pipe" error. Further up the logs it noted that there was a "mismatch after replication" of the first snapshot vs. the new snapshot. That was in the SAN1-to-SAN2 direction; the SAN2-to-SAN1 replication continues to work, but it has virtually no data on it. Unfortunately, I did not save that specific log detail, as I figured it had something to do with the huge difference in data between the first snapshot and the next. So, for a fresh start, I deleted the replication task, the periodic snapshot task, and all of the snapshots for that zvol on the Push side, plus the snapshots and the replicated zvol itself on the Pull side.
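For reference, the cleanup boils down to commands like these (pool/zvol names are the ones in the logs below; this is a sketch of what the GUI deletions amount to at the CLI, not the exact commands I ran):

Code:
# On the Pull side: destroy the replicated zvol and everything under it
zfs destroy -r V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
# On the Push side: destroy just that zvol's snapshots, keeping the zvol itself
zfs list -Hr -o name -t snapshot -d 1 V3T-7k-STA3-2MRx2/0E1PRDvDR01-100 | xargs -n 1 zfs destroy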

After that, I recreated the periodic snapshot task on the Push side, along with the replication task and waited for it to kick off. This replication attempt failed as well, and the errors in the message log (/var/log/messages) are below:

Code:
Aug 30 00:00:01 0E1-PRD-SAN01P autosnap.py: [tools.autosnap:42] Popen()ing: /sbin/zfs snapshot -o freenas:state=NEW V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w
Aug 30 00:00:02 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:241] Destroying remote V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:00:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 /sbin/zfs destroy -rRf V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:00:02 0E1-PRD-SAN01P common.pipesubr: cannot open 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100': dataset does not exist
Aug 30 00:00:02 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:247] Creating V3T-7k-STA3-2MRx2/0E1PRDvDR01-100 on remote system
Aug 30 00:00:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 /sbin/zfs create -o readonly=on -p V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:00:03 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send -R V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 "/sbin/zfs receive -F -d V3T-7k-STA3-2MRx2 && echo Succeeded.") > /tmp/repl-6189 2>&1
Aug 30 00:39:02 0E1-PRD-SAN01P kernel: pid 6245 (zfs), uid 0: exited on signal 6
Aug 30 00:39:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 "zfs list -Hr -o name -S creation -t snapshot -d 1 V3T-7k-STA3-2MRx2/0E1PRDvDR01-100 | head -n 1 | cut -d@ -f2"
Aug 30 00:39:03 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:307] Replication of V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w failed with warning: cannot send 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w': Broken pipe internal error: Broken pipe Abort trap
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:241] Destroying remote V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 /sbin/zfs destroy -rRf V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:40:02 0E1-PRD-SAN01P common.pipesubr: cannot destroy 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100': dataset is busy
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:247] Creating V3T-7k-STA3-2MRx2/0E1PRDvDR01-100 on remote system
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 /sbin/zfs create -o readonly=on -p V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
Aug 30 00:40:02 0E1-PRD-SAN01P common.pipesubr: cannot create 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100': dataset already exists
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:49] Executing: (/sbin/zfs send -R V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w | /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 "/sbin/zfs receive -F -d V3T-7k-STA3-2MRx2 && echo Succeeded.") > /tmp/repl-6661 2>&1
Aug 30 00:40:02 0E1-PRD-SAN01P kernel: pid 6693 (zfs), uid 0: exited on signal 6
Aug 30 00:40:02 0E1-PRD-SAN01P autorepl.py: [common.pipesubr:42] Popen()ing: /usr/bin/ssh -i /data/ssh/replication -o BatchMode=yes -o StrictHostKeyChecking=yes -q -p 22 10.80.1.52 "zfs list -Hr -o name -S creation -t snapshot -d 1 V3T-7k-STA3-2MRx2/0E1PRDvDR01-100 | head -n 1 | cut -d@ -f2"
Aug 30 00:40:03 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:307] Replication of V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w failed with cannot receive new filesystem stream: dataset is busy cannot create device links for 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100': dataset is busy warning: cannot send 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w': Broken pipe internal error: Broken pipe Abort trap


So, I get the initial broken pipe error, followed by the "dataset is busy / cannot create device links" errors on the subsequent attempts. That error pattern repeats itself several times until I stop the replication (I have more of that log if needed; note that the kernel lines show the zfs process exiting on signal 6, i.e. SIGABRT, which matches the "Abort trap" in the warnings). Thinking that perhaps it's having a problem sending the initial FS stream over with whatever zfs send | receive command FreeNAS' script is using, I figured I would try a manual push. So, using the Oracle documentation as a reference (noting to myself the difference between a full send, which creates the FS, and an incremental send), I issued this command from an SSH session on the Push server:

Code:
zfs send V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0808-2w | ssh 10.80.1.50 zfs recv V3T-7k-STA3-2MRx2/0E1PRDvDR01-100

V3T-7k-STA3-2MRx2 is the pool name on both the Push and Pull sides, and 0E1PRDvDR01-100 is the zvol with about 500GB of data on it. If it matters, that zvol is set up as a device extent for an iSCSI LUN with a VMFS5 datastore on it. The initial, successful replication (from SAN1 to SAN2) had the datastore on it, but not the 500GB of data.

Trying the manual push of the FS: if I check the Pull end, I can see the zvol being created in the pool using zfs list. Re-running the zfs list command, I can see the zvol's size increase as time goes on (at a pretty steady pace, at that). However, before it finishes I get another broken pipe error (on the Push end), with not much more detail than that in the messages log:

Code:
Aug 30 05:06:57 0E1-PRD-SAN01P autorepl.py: [tools.autorepl:307] Replication of V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w failed with warning: cannot send 'V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.0000-2w': Broken pipe internal error: Broken pipe Abort trap
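For what it's worth, "checking the Pull end" was just re-running zfs list by hand; a small loop like this (names as above, run under /bin/sh) does the same thing:

Code:
# Poll the incoming zvol's size on the Pull side every 60 seconds
while true; do
    zfs list -o name,used,refer V3T-7k-STA3-2MRx2/0E1PRDvDR01-100
    sleep 60
done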


The Push IP and Pull IP are on different subnets with a router between them. They are, however, currently at the same physical site (for this initial seeding). To eliminate the router as an issue, I tried a manual push using two IPs on the same network on the iSCSI NICs (same switch). This resulted in the same error as above. Just to clarify: the zvol replication going from SAN2 to SAN1 has continued to function without issue, but again, it has virtually no data on it.
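A raw-throughput test over the same SSH path should also separate a network/SSH problem from a ZFS one (a sketch I have not run as-is; IP from the logs above):

Code:
# Stream ~10GB of zeroes over SSH to the Pull box, discarding it on arrival.
# If this also dies mid-transfer, the problem is the network/SSH, not zfs.
dd if=/dev/zero bs=1m count=10240 | ssh 10.80.1.52 "cat > /dev/null"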

Any suggestions that anyone can offer would be greatly appreciated. For that matter, if there are any other logs in particular I can look at to glean more details about the issue, that would be quite helpful as well. Thank you.
 

TScott

Cadet
Joined
Aug 30, 2012
Messages
9
I thought my original post got eaten. :p

I still have not been able to fully resolve the issue, but I did gather a lot of interesting, and sometimes conflicting, data. One key thing: a 110GB replication stream from SAN2 to SAN1 (once some data was finally placed on it) went through just fine. I was expecting that one to crash like the 500GB stream from SAN1 to SAN2. Since it didn't, this led me to believe that either zfs send/receive couldn't handle that volume of data, or there was a bottleneck somewhere else interfering with the send (maybe system resources). I also wasn't sure what effect being on zpool version 18 and zfs version 4 might have on the issue (as compared to our NexentaStor systems, which are on 28 and 5, respectively).
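(For anyone comparing, the pool and filesystem versions can be read straight off each box; pool/zvol names as above:)

Code:
# Report the current pool and dataset versions on this box
zpool get version V3T-7k-STA3-2MRx2
zfs get version V3T-7k-STA3-2MRx2/0E1PRDvDR01-100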

Regardless, with the volume of data being the most likely culprit, I trimmed down what would be sent in the initial stream to as small as I could get it, which was 155GB, only 45GB more than the largest successful transfer I saw going in the opposite direction. Its initial replication failed as well, unfortunately, somewhere between 50 and 75GB into the transfer. No detail in the error, just the same "broken pipe abort trap." This was quite discouraging, as volume of data was my best lead on the primary factor in the failing transfers. I left the periodic snapshot and replication tasks alone, however, as I set about manually getting the initial file system stream copied over. That was something I had experimented with earlier to good success, except I was unable to get FreeNAS replication jobs to kick off to handle the incremental updates without selecting the "Initialize remote side for once" option, which I didn't want in this case since that would put me back in the same boat.

The method I had used to manually get the initial FS data over was to pipe a zfs send into a gzip file, rsync that file over, and then pipe the gunzipped output into zfs receive:

Code:
SAN 1: cd /mnt/V3T-7k-STA3-2MRx2/
SAN 1: mkdir temp
SAN 1: zfs send V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.1848-2w | gzip > /mnt/V3T-7k-STA3-2MRx2/temp/0E1PRDvDR01.gz
SAN 2: cd /mnt/V3T-7k-STA3-2MRx2/
SAN 2: mkdir temp
SAN 1: rsync -av /mnt/V3T-7k-STA3-2MRx2/temp/0E1PRDvDR01.gz root@10.80.1.52:/mnt/V3T-7k-STA3-2MRx2/temp/
SAN 2: gunzip < /mnt/V3T-7k-STA3-2MRx2/temp/0E1PRDvDR01.gz | zfs receive V3T-7k-STA3-2MRx2/0E1PRDvDR01-100


The last time I tried that, it got the initial large FS (zvol) over to the other SAN, and I was able to manually do a zfs send | receive over SSH with incremental updates. Unfortunately, I wasn't able to get a new FreeNAS replication task to kick off in order to update that FS on the remote side; without the "Initialize remote side for once" option checked, replication tasks don't seem to run. This time I was hoping that, since the task was already created (even though it failed), if I got the initial data over there manually, the FreeNAS replication script would be able to handle one of the incremental updates during one of its retry attempts, which would allow automatic incremental updates without me manually scripting anything. Before I could test this, however, one of the retry attempts actually succeeded in sending the initial 155GB stream. The only difference I could note was that the transfer speed was artificially capped because of other large transfers going on across the same path (namely, the router was being taxed by these simultaneous large transfers).
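(For completeness, the manual incremental updates I mentioned were along these lines; the second snapshot name is illustrative, not one of my actual snapshots:)

Code:
# Send only the delta between two periodic snapshots to the Pull box
zfs send -i V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120830.1848-2w \
    V3T-7k-STA3-2MRx2/0E1PRDvDR01-100@auto-20120831.1848-2w \
    | ssh root@10.80.1.52 zfs receive V3T-7k-STA3-2MRx2/0E1PRDvDR01-100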

What further confuses the matter is that in my attempts to bypass any layer-3 bottlenecks and do an IP-to-IP L2 transfer, the zfs sends seemed to fail even faster (same generic broken pipe, no additional details). Each of these SANs has a quad-core Xeon (X3430) and 8GB of registered ECC memory. Could memory be the limiting factor here for fast, high-volume zfs send | receive transfers over SSH? This won't be an issue once the SANs are split across the WAN (as WAN speeds will be the bottleneck), but it would be good to have some idea of what the issue was (is), for reference when using a similar setup for other purposes.
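If anyone wants to check that memory theory along with me, this is roughly what I plan to watch on the Push box during the next big send (standard FreeBSD tools and sysctls; nothing FreeNAS-specific):

Code:
# ZFS ARC size vs. its configured ceiling
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
# Overall memory/paging activity, sampled every 5 seconds
vmstat 5
# Per-process view; watch the zfs and ssh processes and free memory
top -S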
 