Replication fails between two TrueNAS Scale systems using a back-end jumbo frame network

70tas
Cadet · Joined Mar 30, 2022 · Messages: 8
I set up two TrueNAS systems, one bare metal and the other a VM under Proxmox. Both are running 23.10.2.
Each system has a front-end network (1GE) on a switch, with pfSense routing three subnets. Each also has a second (2.5GE) NIC on an isolated switch acting as a back-end storage network running jumbo frames.
Each system is connected to its own Terramaster D6-320, a six-disk USB 3.2 JBOD.

The first, bare-metal system has 16TB drives. The second has 12TB drives. I will be using the 12TB system as the A system and the other as the B system for backups. Since I've been running on B for now, I tried to migrate the datasets to A by running ZFS replication. At first I kept getting errors about the key exchange (KEX), so I set KexAlgorithms=ecdh-sha2-nistp521 in .ssh/config for both the admin and root users on both A and B.
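
For anyone hitting the same key exchange errors, this is roughly what the entry in ~/.ssh/config looks like; the host alias and address below are placeholders for your own systems:

    # ~/.ssh/config for both the admin and root users on A and B
    Host truenas-peer
        HostName 192.168.10.20
        # Pin a key exchange algorithm both sides support
        KexAlgorithms ecdh-sha2-nistp521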

After doing this, I still had a couple of problems. First, setting up the backup authentication selected SSH+NETCAT behind the scenes; this is not shown on the creation screen, but it shows up after the credential is saved. I modified it to SSH and removed netcat. The second issue was that the replication would show as running but would stay stuck at 0 bytes transferred until it timed out.

Searching the forums, I saw a note that replication will fail with jumbo frames. I changed my SSH connection to the front-end network and gave it a go. It is running and actually replicating data. I'm not sure if this is a known issue or if the workaround I used is not the correct one. I also don't know how to properly report it. If anyone can help, I would appreciate it.
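
If anyone wants to verify whether jumbo frames actually pass end-to-end on the back-end network before suspecting replication itself, a don't-fragment ping from one system to the other does it; the address is a placeholder, and 8972 is a 9000-byte MTU minus 28 bytes of IP/ICMP headers:

    # From system B to system A over the back-end NICs
    ping -M do -s 8972 -c 4 192.168.50.20
    # If this fails while "-s 1472" works, the jumbo path is broken somewhere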
70tas

70tas
OK, I've found two issues: 1) running under Proxmox while using USB 3.2 both for disk access and for the 2.5GE network. I don't think it is Proxmox or TrueNAS; I just think it is driver (im)maturity for USB 3.2. 2) As soon as I enabled jumbo frames on the back end, the system would stutter, freeze and unfreeze, and doing anything through the UI was almost impossible. Throughout all this, I was able to copy some datasets over before going jumbo and then reverting.
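
For reference, reverting just meant dropping the MTU on the back-end NIC; in TrueNAS the change should be made through the UI's network settings so it persists, but this is what it amounts to (the interface name is a placeholder):

    # Check the current MTU on the back-end interface
    ip link show enp2s0
    # Drop back from jumbo frames to the standard 1500-byte MTU
    ip link set dev enp2s0 mtu 1500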

I then removed the EliteDesk Mini 800 G5 from the PVE cluster and, after installing a second NVMe drive, installed bare-metal TrueNAS 23.10.2. I first tested importing the zpool back into the new system, and TrueNAS recognized it and proceeded to import it. However, given all the issues I had dealt with, I decided to wipe the disks, rebuild the array, and just start over.
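
For anyone else taking the same route, checking what the new install can see before committing is simple from the shell; the pool name here is a placeholder, and the UI's import does the same thing:

    # List pools that are visible but not yet imported
    zpool import
    # Import by name (or by the numeric ID shown above), then check health
    zpool import tank
    zpool status tank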

I am pleased to report that after creating the backup credentials and creating the datasets on my destination, I was able to kick off all six of my replications using SSH. I am not sure why I can't get netcat working, but I can worry about that another time. I've spent enough time already.
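
As I understand it, an SSH-transport replication boils down to something like the following under the hood; the dataset, snapshot, and host names are made up for illustration, and the receiving side needs sufficient privileges:

    # Send a recursive snapshot stream to the destination over SSH
    zfs send -R tank/data@repl1 | ssh root@truenas-a zfs recv -F tank/data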

Now, one thing for newbies to TrueNAS, like me. When the replication first kicked off, I had rather expected a maelstrom of disk activity and furious data transfers. It gave me pause, because after kicking off six replications I thought I had the same issue as before: not very much data was being transferred and not much disk activity was taking place. !!! Give it ten or so minutes and come back to it. !!!
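
If you want to reassure yourself that something is happening during that quiet period, watching pool I/O from the shell helps (the pool name is a placeholder):

    # Report pool-wide bandwidth every 5 seconds
    zpool iostat tank 5
    # Or with per-vdev detail
    zpool iostat -v tank 5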

When I came back, the drives were still quiet activity-wise, but the transfer was running furiously between the systems. I've copied about 2TB in a little over an hour and expect to have my remaining 20TB moved over overnight. My CPUs are averaging about 3% on the source and 1% on the destination.
And lastly, I forgot to modify the destination's configuration to go through the back-end network, so it is going in through the front end at 1GE. I am NOT going to stop the replication to fix it; I'll do that later. I intend to sleep tonight anyway.

I'll post an update tomorrow after the replication is done. It has been fun, and I see a lot of parallels between TrueNAS and ONTAP, which I've been working with for the last 25 years. And not bad for a USB JBOD drawer...

70tas
It is now two days later. My two smaller replications have finished. The two larger ones are stuck at 50%.
I tried to log in to each device and found that I could not log in to the replication source. It appears to have reverted to 'root' instead of 'admin'. I was able to log in as root and changed the password for admin, but that did not re-enable it. Not sure what is going on.

Solution: I think I found it. Looking at the user attributes for the admin account, I noticed it was an auxiliary member of wheel and sudo; I also noticed another post that showed the admin user as belonging to builtin_administrators. I changed it to builtin_administrators and was able to log in immediately as admin. I must have committed a faux pas!!! Learning new systems is fun... ...and frustrating.
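
For anyone wanting to double-check the same thing from a shell rather than the UI, the group membership is easy to confirm; I made the actual change through the admin user's settings in the UI:

    # Show which groups the admin account belongs to
    id admin
    # The working setup lists builtin_administrators among them
    getent group builtin_administrators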

Upon reboot, one of my replications was running; this is the first time I've seen that. The second had failed.
I restarted the second one, but both nodes are running at 0% utilization on average. I'm going to wait and see if it picks up.

70tas
I'm sorry it took so long to post further data, but work comes first.

To make a long story short, TrueNAS eventually flagged one of my disks as failed. As soon as I offlined the disk, things really picked up. Overnight, all my replications finished and I was ready for further work.
One more thing I found confusing: I took a snapshot at the top of my tree and made it recursive, yet my replication jobs were reporting that they could not find snapshots; I think this was a result of timeouts. I spent time creating manual snapshots for each dataset and trying to replicate that way, but it didn't work either. Regardless, once TrueNAS identified the failed drive, all went well.
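
For reference, the shell equivalents of what I did through the UI are straightforward; the pool, device, and snapshot names below are placeholders:

    # See which device ZFS has flagged
    zpool status -v tank
    # Take the suspect disk out of service so the pool stops waiting on it
    zpool offline tank sdf
    # The recursive snapshot at the top of the tree, and a check that it
    # exists on every child dataset
    zfs snapshot -r tank@repl1
    zfs list -t snapshot -r tank | grep @repl1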

I then ran a long SMART test on the failed drive, which succeeded, but I did see 47 write errors. I am not a SMART expert, but Water Panther sent me an RMA after reviewing the SMART results, which I presume means they saw a failure. But, of course, I couldn't leave well enough alone. I re-inserted the drive and let it start the resilver process. Two days later, two drives showed as failed. I guess I should trust TrueNAS.
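
The SMART test itself was the usual smartctl routine; the device name is a placeholder:

    # Start the long self-test, then come back after its estimated runtime
    smartctl -t long /dev/sdf
    # Review the attributes, error counters, and self-test log afterwards
    smartctl -a /dev/sdf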

I replaced the drive in the vdev and it is resilvering again now. However, performance is up to par and I don't see any errors.
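
The replacement itself, for anyone following along, is the standard ZFS procedure; pool and device names are placeholders:

    # Swap the failed disk for the new one and let ZFS resilver
    zpool replace tank sdf sdg
    # Watch resilver progress and the estimated time remaining
    zpool status tank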
More later.

70tas
An issue I found with the UI concerns the setup of protection jobs, in particular the "Load Previous..." option, which is a nice shortcut for creating multiple jobs. When you load a previous replication job, it keeps the datasets from the previous run. I found that, especially for nested datasets, it is very easy to select a new source and not notice that a source from the previous run is still selected. When the job runs, it replicates not only the new dataset but also the older datasets from the previous run. The same applies to the destination. Perhaps iXsystems can look at this if they feel it could use polish. Since I am new to TrueNAS, I may not be the best judge.

More info: I realized the issue happens because, when creating a new replication with "Load Previous...", all sources are blanked out, but the destination may list the previous path, and unless it is unchecked it stays selected. When the job is saved, if you go back and edit the job, the source shows both the current job's source path and the previous path from the "Load Previous..." selection. I'm not sure how to pass this on to an engineer who can take a look at it.

70tas
Another issue I encountered was with ZSTD-9 compression, which I tried to use on my destination TrueNAS host. Some time during the night, the host and the drives went crazy on me. The console was throwing all sorts of storage device errors, while the Terramaster D6-320 had all drives blinking. I shut the server down, but the Terramaster was still blinking like crazy.
After power cycling the enclosure, the drives came up with power LEDs only, but the pool was gone. Four of the drives showed as free, and I could not recover the pool. I probably could have, but since I was testing, I rebuilt the pool and datasets using LZ4. No further problems after that.
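
For reference, the compression property was the only thing I changed between the two attempts; the pool and dataset names are placeholders:

    # What I tried first (heavier CPU cost per write)
    zfs set compression=zstd-9 tank/backups
    # What I rebuilt with afterwards
    zfs set compression=lz4 tank/backups
    # Confirm what is actually in effect
    zfs get compression tank/backups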