Server keeps rebooting during snapshot replication

somewhatdamaged

Dabbler
Joined
Sep 5, 2015
Messages
49
Afternoon chaps

I've no idea why, but every time I kick off a snapshot replication to my tertiary server, the receiving server just reboots. I've checked the syslog and it really doesn't seem to say much

Server is a Supermicro X9SCL+-F with 16GB DDR3 ECC RAM and an Intel X540-T2 10Gb NIC, booting off a 16GB InnoDisk SataDom. Nothing really out of the ordinary

I've attached the files I think I need. Could someone see if anything jumps out?

thanks!
 

Attachments

  • debug.txt (94.8 KB)
  • error.txt (1.8 MB)
  • messages.txt (1.3 MB)
  • syslog.txt (4 MB)
  • debug-DATA-BACKUPS-20220930202649.tgz (1.3 MB)

somewhatdamaged

Bit more info: to rule out the 10Gb NIC as the issue, I switched back to the onboard 1Gb LAN. No difference

What's odd, though, is that it reboots on the exact same dataset being copied. I've removed this dataset from the replication task and it's working (so far)

Has anyone seen this behaviour before, and how do I go about sorting it?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Did you memtest your RAM?
As a side note, you have an issue with your UPS.
 
Joined
Oct 22, 2019
Messages
3,641
Are you using native ZFS encryption?
 

somewhatdamaged

Ah yes, the UPS issue is one I need to sort (it's connected to my main server and the others connect as slaves). Memtest doesn't show any issues; I ran ten passes and all seems OK

The pool is using encryption, yeah
 

somewhatdamaged

Hmm, how do I check that? I'm only replicating about 3TB of data to a 6TB pool
 

somewhatdamaged

OK, thanks

Looks fine, it has 4GB available. The replication is still going strong since I removed that one dataset. Very odd. The dataset is around 1.2TB of lossless music, so there's nothing unusual about it. If I were to re-add the dataset and start the sync, it would reboot almost immediately! (Technically it's a reset, as it doesn't actually shut down. Watching via IPMI, it's as if someone presses the reset button.)
 

Davvo

Yeah, my badly formulated question was: is it being used? I was considering a memory issue, but that doesn't look likely.
 

somewhatdamaged

Ah, I see! Yeah, I don't think it's a memory issue. I'll let it finish the current sync, set the syslog level to debug, then re-add the FLAC dataset and see if anything is picked up in the log (I don't think it will be, since it's a reset and not a reboot). Thanks for the ideas though, much appreciated
 
Joined
Oct 22, 2019
Messages
3,641
The pool is using encryption yeah
Is this native encryption on the top-level dataset, and all child datasets inherit the encryption?

What version of ZFS / TrueNAS is on the receiving server?

Is this an independent replication, of only the FLAC dataset? Or is it part of a larger replication task that includes this dataset?

What options are you using in your replication task?
 

somewhatdamaged

Yeah, top-level encryption, and the child datasets all inherit it

TrueNAS Scale 22.02.4 on both servers

This is a selection of various datasets; I've not just selected the parent / all datasets

Nothing out of the ordinary on the task: SSH+NETCAT to make the most of my 10Gb NICs, no encryption, sending snapshots from all the periodic snapshot tasks with a custom 4-week retention
 

somewhatdamaged

Yes, that's correct. Every dataset that's replicated is showing as encrypted on the destination server

Just checked, and the server rebooted again around 00:18... baffling!
 

Attachments

  • debug-DATA-BACKUPS-20221002002946.tgz (453.7 KB)
Joined
Oct 22, 2019
Messages
3,641
Yes that's correct. Every dataset that's replicated is showing encrypted on the destination server
Then try a quick test.

Create a new replication task only for the FLAC dataset, and make sure these three options are unchecked.
[Screenshot: uncheck-these-test-task.png]


Then try to use this task to send to the destination server under a test dataset. (Perhaps something for destination that looks like poolname/test/FLAC)

* Make sure not to try to send/overwrite an existing dataset.

Yes, the destination will be non-encrypted. But you can delete the "test" dataset on the destination if this works. It's only a test.
 

somewhatdamaged

Excellent idea, thanks. This is now running, and I'll let you know what happens...
 

somewhatdamaged

OK, it completed successfully with no issues!
 
Joined
Oct 22, 2019
Messages
3,641
It’s looking like a nasty bug when replicating a raw stream of a natively encrypted dataset. I’ve bumped into this a couple times before on TrueNAS Core. ☹️

It is really unnerving, since ZFS is supposedly a reliable and stable bulwark. Yet an entire server can be brought to its knees in a catastrophic crash when used normally because it panics during a standard operation.

In the previous bug, which affected multiple users, it had something to do with native encryption and a dataset that contains a very deep symlink. It would always hard reboot the server at the same moment during the replication.

I doubt you’re satisfied that this new test “worked”, since it results in a non-encrypted destination dataset.

Some things to consider:
  • Was either of these pools originally created under TrueNAS Core?
  • Does the FLAC dataset contain any symlinks?
  • Does the FLAC dataset contain any file or folder names that are exceptionally long?
  • Are you comfortable destroying the FLAC dataset on the destination only, in order to try sending from the source from scratch? (Your MIRROR pool supposedly contains a partially complete dataset named MIRROR/FLAC.)
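
For the symlink and long-name questions above, a quick scan of the mounted dataset can save clicking through folders by hand. This is only a rough sketch: the function name `scan_tree`, the 200-byte threshold for "exceptionally long", and the example path are all assumptions made for illustration, not anything taken from TrueNAS or from this thread.

```python
import os

def scan_tree(root, name_limit=200):
    """Return (symlinks, long_names) found under root, without following links.

    name_limit is an arbitrary threshold (an assumption), chosen to sit
    below the usual 255-byte limit for a single file name.
    """
    symlinks, long_names = [], []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks to directories show up in dirnames, symlinks to files
        # in filenames, so both lists need checking.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                symlinks.append(path)
            # File name limits are counted in bytes, not characters.
            if len(os.fsencode(name)) > name_limit:
                long_names.append(path)
    return symlinks, long_names

# Example use (the path is hypothetical, substitute your dataset's mountpoint):
#   links, long_ones = scan_tree("/mnt/yourpool/FLAC")
#   print(len(links), "symlinks;", len(long_ones), "long names")
```

An empty result for both lists would suggest that neither symlinks nor long names are involved, though it would not rule out other causes.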
 

somewhatdamaged

Pools were all created under TrueNAS Scale.
I've checked the entire FLAC folder, and there were a few rogue files that should not have been there (.cue files etc). Now it's purely folders and FLAC files.
I've deleted all of the FLAC snapshots to start clean, deleted the destination datasets from the target server, and kicked off replication using "send from source from scratch".
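
A check like the "rogue files" one above could be scripted along these lines. Again a sketch only: `non_flac_files` is a made-up helper name, and treating anything without a `.flac` extension as a stray is an assumption about how the folder is meant to look.

```python
import os

def non_flac_files(root, wanted=(".flac",)):
    """Return sorted paths under root whose extension is not in `wanted`.

    The extension is compared case-insensitively, so "track.FLAC" is accepted.
    """
    strays = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in wanted:
                strays.append(os.path.join(dirpath, name))
    return sorted(strays)

# Example use (hypothetical path, substitute your dataset's mountpoint):
#   for path in non_flac_files("/mnt/yourpool/FLAC"):
#       print(path)
```

This only lists the strays; whether they have anything to do with the reset is a separate question.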

Let's see how it gets on...!
 