Server keeps rebooting during snapshot replication

somewhatdamaged · Sep 30, 2022

Afternoon chaps

I've no idea why, but every time I kick off a snapshot replication to my tertiary server, the (receiving) server just reboots. I've checked the syslog and it really doesn't seem to say much.

Server is a Supermicro X9SCL+-F with 16GB DDR3 ECC RAM and an Intel X540-T2 10Gb NIC, booting off a 16GB InnoDisk SataDom. Nothing really out of the ordinary

I've attached the files I think are needed. Could someone see if anything jumps out?

thanks!

somewhatdamaged · Oct 1, 2022

Bit more info: to rule out the 10Gb NIC as the issue, I switched back to the onboard 1Gb LAN. No difference.

What's odd, though, is that it reboots while the exact same dataset is being copied. I've removed this dataset from the replication task and it's working (so far).

Has anyone seen this behaviour before, and how do I go about sorting it?

Davvo · Oct 1, 2022

Did you memtest your RAM?
As a side note, you have an issue with your UPS.

winnielinnie · Oct 1, 2022

Are you using native ZFS encryption?

somewhatdamaged · Oct 1, 2022

Ah yes, the UPS issue I need to sort (it's connected to my main server and the others connect via slave). Memtest doesn't show any issues; I ran ten passes and all seems OK.

The pool is using encryption, yeah.

Davvo · Oct 1, 2022

How is swap space?

somewhatdamaged · Oct 1, 2022

Hmm, how do I check that? I'm only replicating about 3TB of data to a 6TB pool.

Davvo · Oct 1, 2022

Under reporting.
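
For anyone who prefers a shell to the Reporting page, here is a minimal sketch of the same check. It assumes a Linux-based system such as TrueNAS SCALE, where `/proc/meminfo` lists `SwapTotal` and `SwapFree` in kB; the function name `parse_swap` is just illustrative.

```python
from pathlib import Path


def parse_swap(meminfo_text):
    """Return (total, free, used) swap in kB from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            # Lines look like "SwapTotal:  4194300 kB"
            values[key] = int(rest.split()[0])
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    return total, free, total - free


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        total, free, used = parse_swap(meminfo.read_text())
        print(f"swap total={total} kB, free={free} kB, used={used} kB")
    else:
        print("/proc/meminfo not found (not a Linux system?)")
```

A non-zero "used" figure that keeps climbing during the replication would point towards memory pressure; a flat zero makes that less likely.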

somewhatdamaged · Oct 1, 2022

OK, thanks.

Looks fine, it has 4GB available. The replication is still going strong since removing that one dataset. Very odd. The dataset is around 1.2TB of lossless music, so nothing unusual about it. If I were to re-add the dataset and start the sync, it would reboot almost immediately! (Technically it's a reset, as it doesn't actually shut down; it's as if someone presses the reset button, which I can watch via IPMI.)

Davvo · Oct 1, 2022

Yeah, my badly formulated question was: is it being used? I was considering a memory issue, but that doesn't look likely.

somewhatdamaged · Oct 1, 2022

Ah, I see! Yeah, I don't think it's a memory issue. I'll let it finish the current sync, set the syslog to debug, then re-add the FLAC dataset and see if anything is picked up in the log (I don't think it will be, since it's a reset and not a reboot). Thanks for the ideas though, much appreciated.

winnielinnie · Oct 1, 2022

somewhatdamaged said:
The pool is using encryption yeah

Is this native encryption on the top-level dataset, and all child datasets inherit the encryption?

What version of ZFS / TrueNAS is on the receiving server?

Is this an independent replication, of only the FLAC dataset? Or is it part of a larger replication task that includes this dataset?

What options are you using in your replication task?

somewhatdamaged · Oct 1, 2022

Yeah, top-level encryption, and the children all inherit it.

TrueNAS Scale 22.02.4 on both servers.

This is a selection of various datasets; I've not just selected the parent / all datasets.

Nothing out of the ordinary on the task: SSH+NETCAT to maximise my 10Gb NICs, no encryption, sending snapshots from all the periodic snapshot tasks with a custom 4 week retention.

winnielinnie · Oct 1, 2022

somewhatdamaged said:
This is a selection of various datasets, i've not just selected the parent / all datasets

somewhatdamaged said:
no encryption

Are the datasets on the destination encrypted after being sent from the source server?

somewhatdamaged · Oct 1, 2022

Yes that's correct. Every dataset that's replicated is showing encrypted on the destination server

Just checked, and the server rebooted again around 00:18... baffling!
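
To double-check from a shell what "showing encrypted" means on the destination, a minimal sketch is below. It parses the tab-separated output of `zfs get -H` for the `encryption`, `encryptionroot` and `keystatus` properties. The helper names are hypothetical and the pool name is whatever you pass in; nothing here is specific to how the TrueNAS UI derives its padlock icons.

```python
import subprocess


def zfs_get_command(dataset):
    """Build a recursive `zfs get` for the encryption-related properties."""
    return ["zfs", "get", "-H", "-r", "-t", "filesystem,volume",
            "-o", "name,property,value",
            "encryption,encryptionroot,keystatus", dataset]


def parse_zfs_get(output):
    """Turn `zfs get -H -o name,property,value` output into {name: {prop: value}}."""
    props = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, prop, value = parts
        props.setdefault(name, {})[prop] = value
    return props


def unencrypted(props):
    """Names of datasets whose encryption property is 'off' (or missing)."""
    return sorted(n for n, p in props.items()
                  if p.get("encryption", "off") == "off")


def report(dataset):
    """Run zfs get (needs the zfs CLI and suitable privileges) and parse it."""
    result = subprocess.run(zfs_get_command(dataset), check=True,
                            capture_output=True, text=True)
    return parse_zfs_get(result.stdout)
```

Calling `unencrypted(report("MIRROR"))` on the receiving server would list any datasets under that pool that are not encrypted. The `encryptionroot` value shows whether a child inherits from the top-level dataset or carries its own root.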

winnielinnie · Oct 1, 2022

somewhatdamaged said:
Yes that's correct. Every dataset that's replicated is showing encrypted on the destination server

Then try a quick test.

Create a new replication task only for the FLAC dataset, and make sure these three options are unchecked.

Then try to use this task to send to the destination server under a test dataset. (Perhaps something for destination that looks like poolname/test/FLAC)

* Make sure not to try to send/overwrite an existing dataset.

Yes, the destination will be non-encrypted. But you can delete the "test" dataset on the destination if this works. It's only a test.

somewhatdamaged · Oct 2, 2022

Excellent idea, thanks. This is now running, and I'll let you know what happens...

somewhatdamaged · Oct 2, 2022

OK completed successfully with no issues!

winnielinnie · Oct 2, 2022

It’s looking like a nasty bug when replicating a raw stream of a natively encrypted dataset. I’ve bumped into this a couple times before on TrueNAS Core.
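
To make the raw versus non-raw distinction concrete, here is a minimal sketch of the underlying commands. On the command line a raw stream is requested with `zfs send -w`, whereas a plain `zfs send` of an encrypted dataset sends decrypted data and needs the key loaded on the source. The dataset and snapshot names below are hypothetical, and the replication task in the UI may assemble its commands differently; this only builds and prints the argument lists and does not run anything.

```python
import shlex


def send_command(snapshot, raw):
    """Build a `zfs send` argv; raw=True adds -w to send blocks still encrypted."""
    cmd = ["zfs", "send"]
    if raw:
        cmd.append("-w")
    cmd.append(snapshot)
    return cmd


def receive_command(target):
    """Build a `zfs receive` argv for a target dataset that does not exist yet."""
    return ["zfs", "receive", target]


if __name__ == "__main__":
    snapshot = "tank/FLAC@manual-test"   # hypothetical source snapshot
    target = "MIRROR/test/FLAC"          # test destination, as in the thread
    for raw in (True, False):
        label = "raw" if raw else "non-raw"
        print(f"{label}: {shlex.join(send_command(snapshot, raw))} "
              f"| {shlex.join(receive_command(target))}")
```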

It is really unnerving, since ZFS is supposedly a reliable and stable bulwark. Yet an entire server can be brought to its knees in a catastrophic crash when used normally because it panics during a standard operation.

In the previous bug, which affected multiple users, it had something to do with native encryption and a dataset that contains a very deep symlink. It would always hard reboot the server at the same moment during the replication.

I doubt you’re satisfied that this new test “worked”, since it results in a non-encrypted destination dataset.

Some things to consider:

Were either of these pools originally created under TrueNAS Core?
Does the FLAC dataset contain any symlinks?
Does the FLAC dataset contain any file or folder names that are exceptionally long?
Are you comfortable destroying the FLAC dataset on the destination only, in order to try sending from the source from scratch? (Your MIRROR pool supposedly contains a partially complete dataset named MIRROR/FLAC.)
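
The symlink and long-name questions above can be answered with a quick scan of the dataset's mount point. This is a minimal sketch: the function name `scan_tree` and the 200-byte threshold are arbitrary choices for illustration, not a known trigger for the crash.

```python
import os


def scan_tree(root, max_name_bytes=200):
    """Walk root without following links; return (symlinks, long_names)."""
    symlinks, long_names = [], []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                symlinks.append(full)
            # Measure the encoded length, since limits apply to bytes, not characters
            if len(os.fsencode(name)) > max_name_bytes:
                long_names.append(full)
    return symlinks, long_names
```

Usage would be something like `scan_tree("/mnt/poolname/FLAC")`, with the path adjusted to wherever the dataset is actually mounted.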

somewhatdamaged · Oct 2, 2022

Pools were all created under TrueNAS Scale.
I've checked the entire FLAC folder, and there were a few rogue files that should not have been there (.cue files etc.). Now it's purely folders and FLAC files.
I've deleted all of the FLAC snapshots to start clean, removed the destination datasets from the target server, and kicked off replication using "send from source from scratch".

Let's see how it gets on...!
