TrueNAS Core 13.0 hittin' the pipe

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
Code:
[zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2022/12/26 11:44:31] INFO     [replication_task__task_1] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/12/26 11:44:31] ERROR    [replication_task__task_1] [zettarepl.replication.run] For task 'task_1' non-recoverable replication error ReplicationError("Full ZFS replication failed to transfer all the children of the snapshot NASvol@auto-20221226.0111-2w. The error was: cannot unmount '/var/db/system/syslog-cadb1ce96f8c4a01a65be3ef8f5cb996': pool or dataset is busy\nBroken pipe. The snapshot NASvol/.system@auto-20221226.0111-2w was not transferred. Please run `zfs destroy -r NASvol@auto-20221226.0111-2w` on the target system and run replication again.")


I built a new box running Core 13.0-U3.1 (8TBx8, RAID-Z2) and need to move everything over from my old box running 11.0-U4 (2TBx8, RAID-Z2).

I looked up the best way to move everything, and the opinion I found on this very forum is that ZFS replication performs better than something like rsync. I definitely wanted to avoid copying over SMB.

I set up snapshots on the old box and set up the replication task on the new box as a PULL over an SSH connection. It found the snapshots on the old box. Then I ran into the "option c" problem, which I fixed with the solution I found here (disabling compressed writes), and the replication job started.

Here's the first problem: it's sloooow! Based on what I read here this was supposed to be the fast method. The old box easily saturates the gigabit link when I use SMB. (The two boxes sit on the same shelf, connected to the same switch.) But this replication was only going 16 MB/s!!! That's way too slow, it should be in the neighborhood of 100! At that rate it would take 6 days to replicate, instead of the expected day and a half. I was going to ask here when I got around to it, so I left it running overnight.
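
For a rough sanity check on those estimates, here is a minimal Python sketch of the arithmetic. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a constant sustained rate. The function names are made up for illustration, and the only inputs are the figures quoted above.

```python
SECONDS_PER_DAY = 86_400


def implied_size_tb(rate_mb_s: float, days: float) -> float:
    """Data volume in decimal TB moved at rate_mb_s MB/s over the given days."""
    return rate_mb_s * 1e6 * days * SECONDS_PER_DAY / 1e12


def days_needed(size_tb: float, rate_mb_s: float) -> float:
    """Days needed to move size_tb decimal TB at rate_mb_s MB/s."""
    return size_tb * 1e12 / (rate_mb_s * 1e6) / SECONDS_PER_DAY


slow = implied_size_tb(16, 6)      # "16 MB/s" for "6 days"
fast = implied_size_tb(100, 1.5)   # about 100 MB/s for "a day and a half"

print(f"{slow:.2f} TB implied by the slow estimate")
print(f"{fast:.2f} TB implied by the fast estimate")
print(f"{days_needed(slow, 100):.2f} days for the smaller total at 100 MB/s")
```

The two estimates imply different totals (roughly 8.3 TB versus 13 TB), so they are ballpark figures. At a steady 100 MB/s the smaller total would move in about a day.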

...When I woke up I found that replication failed, not even close to making much progress, with the above error.
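
The error message is long, but it names both the path that could not be unmounted and the cleanup command zettarepl suggests running on the target. Below is a small Python sketch for pulling those two pieces out of such a message. The regular expressions are assumptions based only on the wording of this one log line, not on a documented zettarepl format, and the code only extracts text; it does not run anything.

```python
import re

# Excerpt of the ReplicationError text from the log above.
MESSAGE = (
    "Full ZFS replication failed to transfer all the children of the snapshot "
    "NASvol@auto-20221226.0111-2w. The error was: cannot unmount "
    "'/var/db/system/syslog-cadb1ce96f8c4a01a65be3ef8f5cb996': "
    "pool or dataset is busy\n"
    "Broken pipe. The snapshot NASvol/.system@auto-20221226.0111-2w was not "
    "transferred. Please run `zfs destroy -r NASvol@auto-20221226.0111-2w` "
    "on the target system and run replication again."
)


def busy_path(message):
    """Return the path that could not be unmounted, or None if not found."""
    match = re.search(r"cannot unmount '([^']+)'", message)
    return match.group(1) if match else None


def suggested_command(message):
    """Return the command quoted in backticks after 'Please run', or None."""
    match = re.search(r"Please run `([^`]+)` on the target system", message)
    return match.group(1) if match else None


print(busy_path(MESSAGE))
print(suggested_command(MESSAGE))
```

Read this way, the message says the transfer broke while a syslog directory under /var/db/system was busy, and that the partially received snapshot should be destroyed on the target before retrying.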

So my question is, how do I make the full replication work, and make it saturate the network, like it normally does when I copy files using SMB?

Or is there a better way to move my files? My files are mostly large files, movies, disk images, backup images, that sort of thing, about 95% of it, no reason for slowdown due to small files.

Insight is appreciated.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
You could also simply copy them over using SSH/scp.

At the end of the day, your data volume is not huge by today's standards. So I personally wouldn't bother too much with how to copy files over.

What I recommend, though, is to have the new NAS run for a while as a burn-in phase. What I did during the first 3 months (yes, that was my burn-in period) was to have replication tasks from the old to the new NAS. I was still only working against the old one. So, if anything had gone wrong with the new NAS, nothing would be lost and no major re-configuration needed.

One last thing: Your signature mentions the number of HDDs but not the vdev setup. IMHO you should add that.
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
^ If I don't use replication to copy the data, would it correctly replicate subsequent snapshots the way you describe it? Wouldn't it have to hash files to check if they are the same? Are snapshots file-based or block-based?

The way I pictured it, I can use either file copy or replication, but not both...?

I tried the replication again last night after unticking the "Almost full filesystem replication" option, and left it running overnight. It made about 3GB of progress during the entire night (which means it's still much too slow to be useful!) and stalled there. It didn't fail, but I don't see it making any progress beyond that.

In this state I don't even know how to stop the stuck replication job, there is no option for that in the UI.

:frown:
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
No idea why I didn't notice earlier: Your MSI board has a RealTek NIC. I wouldn't bet my right arm that this is causing the problems. But it is certainly a possibility.
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
Think it would cause replication to be that much slower than SMB file copies I normally have? I don't have any discrete PCIE NICs on hand. I don't know about FreeBSD compatibility but I never had any problems with Realtek NICs (and Realtek products in general) on the Windows side of things.

I see the CPU on the old box is at 60-70% load during replication, and just about all of it looks like sshd. Then again it's not 100%, it didn't run out of physical memory, and swap is empty. Maybe SSH is a bit much for that modest CPU? Since I'm the only user on this LAN and no one else has access, is there a way to relieve some load from that CPU? Maybe forgo the cipher in some way?

On the new box I see replication progress is not updated in the UI anymore but if I refresh the pool view I can see the usage (slowly!) growing there.

...I guess if this is where we are I can let it keep doing its thing and monitor progress. Worst case is it'll take 6 days like I estimated... Which is shamefully slow.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
r00tb33r said: "I don't know about FreeBSD compatibility but I never had any problems with Realtek NICs (and Realtek products in general) on the Windows side of things."
You are not alone with this observation. Unfortunately, the Windows side has zero meaning for FreeBSD.
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
So in other words replication is different and doesn't saturate the gigabit interface quite the same way SMB file transfers on FreeNAS do...?

This hasn't come up in the 6 years I had the box running, and I guess it's too late for this replication.

Honestly, I still think sshd is just too heavy on that Celeron.
 
Joined
Oct 22, 2019
Messages
3,641
How old exactly is the "old box"?
CPU: Celeron 1037U

Yikes! :eek: That CPU does not have AES acceleration.

Wouldn't using SSH + NETCAT bypass the issue of CPU/encryption overhead? You'd likely see a nice performance boost from the side of your source (old) box.
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
I started looking into Netcat while researching the replication performance problem. What I don't understand in the UI is that when I switch the active side between LOCAL and REMOTE, the Netcat listen field doesn't change. Shouldn't it just be localhost by default when it's LOCAL?

Anybody have suggestions on how to set up Netcat and how to make this replication job use it, since it's already running?
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
Quote: "How old exactly is the "old box"? You could be bottlenecking on the encryption if it lacks any sort of hardware acceleration and the default cipher is highly dependent on it."
I bought the hardware sometime in 2015 and built the NAS in 2016. It was the only reasonably priced ITX board that fit the compact 8-drive NAS chassis (DS380B has a backplane) that also had enough PCIE lanes (!) for a decent HBA card. This is the first time I ran into the limitation of that processor, otherwise it's near idle most of the time.

At the time I followed a guide to build me a NAS like that, and it served me well for a number of years. I recognize that not everything about it is theoretically ideal. I think I did better with my new one, though not having a backplane for the drives is a disappointment, but it isn't worth hundreds of dollars.

And yes, I do believe encryption is in fact the bottleneck.
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
So overnight the new box crashed after losing the mirrored boot pool. My understanding is that it's a power issue, since at the moment it's running off a separate power supply until the SATA power splitters arrive on Friday. The power supply just doesn't have 10 SATA power connectors on it, and I don't have enough Molex splitters and adapters to cover it all.

Perhaps it's a blessing in disguise: I restarted the replication with SSH+Netcat. I found a post by someone here on the forum about the same thing; in the end they found they didn't need to fill out any of the UI fields to take advantage of Netcat. So I tried it, and it's saturating the gigabit connection. It should take about a day to complete the replication at this point, and I'm satisfied with that.

Elitists can crap on Realtek all they want, but for 99.99% use cases their NICs are more than adequate. This time it turned out it wasn't the NIC's fault at all anyway.

I will be looking for a reasonably priced 10gig NIC for the new box, though. My house isn't wired for 10gig, so I'm not sure when or how, but I'll take a suggestion for a PCIE 10gig NIC.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
r00tb33r said: "Elitists can crap on Realtek all they want, but for 99.99% use cases their NICs are more than adequate."
You have an "interesting" way to show appreciation for the time other people spent to help you.

In addition, your claim is simply wrong. If you had spent a little bit of time to research older posts here, you would have found a considerable number of cases where the RealTek NIC was the root cause for various issues. @jgreco has a lot more background on this than I, so perhaps he wants to add something.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
r00tb33r said: "Elitists can crap on Realtek all they want, but for 99.99% use cases their NICs are more than adequate."

That's not true, and we even have a resource that explains it in more detail. Please see


You could also search the forums for the terms "hamster" and "hamsters", which I used to use semi-regularly to tag Realtek performance problems. But you're certainly allowed to believe whatever fantasy about Realtek that you prefer.

r00tb33r said: "I will be looking for a reasonably priced 10gig NIC for the new box, though, my house isn't wired for 10gigs so not sure when or how, but I'll take a suggestion for a PCIE 10gig NIC."

Check out the 10 Gig Networking Primer.


By far the best choices are the Intel X520 cards, which come in single- and dual-port variants {DA1, DA2, SR1, SR2} (make sure to buy a legitimate Intel card and not a knock-off), or else the Chelsio T520-CR.
 