ZFS Send/server migration help

Ixian

Patron
Joined
May 11, 2015
Messages
218
I built a new FreeNAS system (specs for both old and new below). What I'd like to do is:

Copy data from system_old to system_new
Switch to using system_new (jails, etc.) as primary
When system_new is settled, destroy pools on system_old, reconfigure new pools using more disks (8 instead of 6), then use it as an ongoing replication target/backup system.

system_old:
Xeon D-1541 64GB ECC
Asrock Rack D1541 D4U-2TR MB
6x 5TB WD Red drives (poolname: Slimz) RaidZ2
2x 5TB WD Red drives (poolname: Backups) Mirror
1x 512GB Toshiba SATA SSD (poolname: Jails)
1x 240GB Intel P905 (log & cache for pools)
Intel x520 10GBase-T NIC (Storage interface, 10.0.0.2)
Intel i350 1GbE NIC (Mgmt. interface, 192.168.0.90)

system_new:
Xeon E5-2680v3 64GB ECC
SM X10SRM-F MB
8x 10TB WD Red drives (poolname: Slimz) RaidZ2
2x 1TB Samsung 860 Evo SSD (poolname: Jails) Mirror
1x 400GB Intel DC3700 (log & cache for pools)
Intel x540 10GBase-T NIC (Storage interface, 10.0.0.3)
Intel i350 1GbE NIC (Mgmt. interface, 192.168.0.91)

This is my first FreeNAS server migration so I need a little help. Specifically, I'm not familiar enough with ZFS send/receive on the CLI to set this up successfully.

Ideally I'd sync the entire pool "Slimz" and its datasets (19TB total) from system_old to the same pool on system_new, and do the same for pool Jails and its datasets (340GB). As for pool Backups, I'd just like the datasets it contains to end up under the Slimz pool on system_new as well.
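For what it's worth, my rough mental model of the Backups-into-Slimz piece is something like the below (dataset names are placeholders and I'm ignoring the transport question for now, so corrections welcome):

Code:
# on system_old: one-off recursive snapshot of a dataset in the Backups pool
zfs snapshot -r Backups/mydata@migrate

# send it and receive it under the Slimz pool on system_new (10.0.0.3);
# naming the target explicitly puts it at Slimz/mydata
zfs send -R Backups/mydata@migrate | ssh root@10.0.0.3 zfs receive -v Slimz/mydata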

I've done some basic tuning for the 10GbE NICs on both - hw.ix.enable_aim = 0 as a tunable, mtu 9000 on both interfaces, etc. I have both servers peered together; here's my iperf report:

iperf10-success.jpg


Which seems pretty decent for Intel NICs. I'm aware I won't get data transfer speeds anywhere near that since the Reds are 5400rpm spindles, but obviously I'd like to max it out.
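For reference, the tuning so far amounts to roughly this (ix0 is just an example interface name for the X520/X540; on FreeNAS the tunable and MTU are actually set through the UI, but this is the gist):

Code:
# loader tunable: disable adaptive interrupt moderation on the ix driver
hw.ix.enable_aim=0

# jumbo frames on the storage interface
ifconfig ix0 mtu 9000

# quick sanity check of the link
iperf -s                        # on system_new
iperf -c 10.0.0.3 -t 30 -P 4    # on system_old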

Can anyone assist:

With the correct command line process to create manual snapshots and use ZFS send/recv? I've read up on piping with netcat to max the transfer rate but clearly I am doing something wrong.

Should I disable the log/cache on one/both server pools?

Once the data is sync'd over, can I promote the snapshot datasets so they can be used? Or clone instead?

Really appreciate any help/advice. Thanks!
 

Ixian

Working through this, now troubleshooting performance:

I created a recursive test snapshot of my Jails/appdata dataset, which is 493M:

Code:
zfs snap -r Jails/appdata@migratetest

Then I set up zfs recv on the target (server_new), piping through mbuffer and netcat, since this is a private direct peer link and ssh overhead would be a waste:
Code:
nc -l 3333 | \
  mbuffer -q -s 128k -m 1G | \
  pv -rtab | \
  sudo zfs receive -vF Jails/appdata

And on server_old, the sender:
Code:
zfs send -R Jails/appdata@migratetest | \
   mbuffer -q -s 128k -m 1G | \
   pv -b | \
   nc 10.0.0.3 3333

This works and data is sent, however it is either pitifully slow or I don't know how to interpret the results. It started at around 4MiB/s and dropped all the way down to 165KiB/s, and it's still running 50 minutes in. Clearly either A) something is very wrong, or B) it actually finished and is just idling along until I kill it off, because I didn't specify an idle timeout or something.

Anyone spot what I am doing wrong here?
 

Ixian

I'm going to keep updating the thread in case it helps others as I go along. Bonus: Enjoy reading my "duh" moments as I figure crap out.

Figured out one thing I was doing wrong, a simple thing: as I suspected, I didn't set a timeout for netcat. I'd forgotten (it has been many a year) that if you don't, it just keeps listening after the transfer completes.

Adding -w 20 after nc did the trick.
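To be concrete, the only change is on the nc end of the pipe; I've shown the timeout on the sending side here (adding it to the listener as well shouldn't hurt):

Code:
# receiver on server_new is unchanged from the previous post
# sender on server_old, with the idle timeout added to nc:
zfs send -R Jails/appdata@migratetest | \
   mbuffer -q -s 128k -m 1G | \
   pv -b | \
   nc -w 20 10.0.0.3 3333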

I just transferred a 75GiB recursive snapshot in 3m 23s, which works out to roughly 380MiB/s, or a bit over 3Gbit/s on the wire. I suspect that is pretty good, no?

Now to do some other datasets, and then figure out the best way to use them on server_new - rollback, or clone.
 

Ixian

Update 2:

I've settled on rolling back the transferred snapshots on system_new, because A) my intent is to use the datasets on that server directly, B) cloning them keeps a link back to the original snapshot, which seems unnecessary, and C) the snapshots from server_old are a one-time deal, since when this is done I'm going to blow away server_old's pools and rebuild. When that's done, server_old will become the replication target for server_new.

So, cloning is probably unnecessary and might even be a problem later. If I'm wrong, let me know.
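For reference, the two options boil down to something like this (dataset and snapshot names below are just placeholders):

Code:
# use the received dataset as-is, rolled back to the transferred snapshot
zfs rollback -r Slimz/media@migrate1

# ...or clone it, which stays dependent on the origin snapshot until promoted
zfs clone Slimz/media@migrate1 Slimz/media-work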

Jail transfer/rollback seems to have worked. Now I'm transferring my primary media set, which is a little over 13TB. Seems to be going well:

network-bandwidth-xfer.PNG


A little bursty in spots, but a 3.6Gbit/s average isn't bad for a bunch of WD Reds. I suspect I could improve this all the same. 13TB is going to take around 8 hours or so, certainly much faster than it would be over a 1GbE link, but I'm thinking I could tune this further for future replication tasks. As always, any advice appreciated.
 

Ixian

Update 3:

Success. My biggest xfer took a smidge over 9 hours:

bigexfer2.PNG


Which I'm thinking doesn't look bad at all, considering the source and destination are RAIDZ2 pools of 5400rpm disks.

Additional lessons learned:

Judging by my network stats for the largest snapshot, I'm not sure whether removing the SLOG device would have helped; there was the occasional dip, but it pretty much charged on steadily for the most part, so I don't think it was getting in the way.

For a migration like this, rollbacks are indeed the way to go vs. cloning the transferred snapshots.

I ran into a few minor gotchas, mostly due to not thinking every piece through:

I couldn't get a snapshot of my backup pool transferred - I'd get a fault at the end of the xfer, and rather than troubleshoot after the second attempt (each one took about an hour) I bagged it and ended up transferring those files via rsync, which worked well. That got me thinking that maybe, for this kind of server migration, rsync might actually be the better choice overall, though nc + zfs send/recv certainly was fast for my biggest pool.
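The rsync run was nothing fancy, something along these lines (the paths are just how my pools happen to be mounted; adjust to suit):

Code:
# run on server_old; pushes the Backups pool contents into a dataset under Slimz on server_new
rsync -avh --progress /mnt/Backups/ root@10.0.0.3:/mnt/Slimz/backups/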

I ran into an arcane problem with permissions, because back when I installed FreeNAS on server_old 4 years ago, the "media" user and group had a uid/gid of 816, but the newer versions of FreeNAS, which I installed on server_new, use 8675309 (Jeeeennny I got your number..). This was simple to fix in all my media jails that used the media user/group with pw usermod and groupmod, but it threw me for a bit of a loop at first since I wasn't on the lookout for it.
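For anyone hitting the same thing, the fix inside each jail was along these lines (I'm using 8675309 here purely as an example; standardize on whichever uid/gid your files actually need):

Code:
# run inside each affected jail
pw groupmod media -g 8675309
pw usermod media -u 8675309 -g 8675309
# files owned by the old ids may still need a chown afterwards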

Other than that it has gone smoothly and everything is running fine on server_new. I'm going to give it a few days to settle in and make sure I don't need server_old as a reference, then blow away the latter, reinstall it with new pools as server_backup, and then set up replication jobs from server_new.

Hope someone finds this useful, but even if not, it was a good learning experience for me :)
 