TrueNAS Core Unscheduled Reboot

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
The QNAS in my sig started rebooting last night and I don't know why. This NAS is a replication target for my main NAS and is part of my backup strategy. There is no ECC (it's an old repurposed QNAP).

Step 1: Run memtest on the box for a few passes - no issues detected
Step 2: Revert to an older version of TN (I had recently upgraded to the latest version) - NAS still rebooting.

OK - so it's either hardware (it is old, although the disks are mostly new) or summat else going on. So what is going on?

On reflection - the NAS seems stable until I kick off a replication task.
So I kick off a small task with no real changes - it works
I kick off some others - they work
I kick off the big task - the whole dataset (including child datasets) is 25TB and there are regular, fairly significant, changes going on. The NAS reboots after 5-10 seconds - this is repeatable

So I zfs rename the old target dataset, create a new one and then kick off the replication again - and now it seems to be working. Given it's a 1Gb NIC, it's gonna take a while to finish a complete replication again, and whilst I do have 25TB of spare space (just), that will bring me to 98-99% full. So I will have to delete as I go along.
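For a rough sense of how long "a while" is on a 1Gb NIC, here is a back-of-envelope sketch in Python. The function name and the 90% link-efficiency figure are assumptions for illustration, not measurements from this NAS, and TB is taken as 10^12 bytes.

```python
def replication_hours(dataset_tb, link_gbps=1.0, efficiency=0.9):
    """Estimate hours needed to push dataset_tb terabytes over a network link.

    efficiency is an assumed fraction of line rate actually achieved
    (protocol overhead, disk speed, CPU); 0.9 is a guess, not a measurement.
    """
    if dataset_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and rates must be positive, efficiency in (0, 1]")
    total_bytes = dataset_tb * 1e12              # decimal terabytes
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    hours = replication_hours(25)
    print(f"25 TB over 1 Gb/s: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Even at full line rate (efficiency=1.0) a complete 25TB pass works out to a little over two days, and an old repurposed box may well manage a good deal less than line rate.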

I have even kicked off all the replication jobs and they are running (slowly) and the NAS is staying up.

A scrub showed no issues with the pool and I am scrubbing the source pool as well (currently says 38 years - but I am hoping that will shrink rapidly) :smile:

I is confused - and not sure what to make of this - looking for ideas
 
Joined
Jan 7, 2015
Messages
1,155
I'm not saying that it's your issue, but I have had horrible luck with Crucial SSDs. I got both of mine from Amazon and they failed in seriously short order (like ~3 months) under high load in my Plex rig, with no warning other than the type of gremlins you describe: random freezing/reboots/no boot. Not in a QNAP though, never touched one. Just BOLO. Good luck.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
Replication of encrypted snapshots? Another instance of the existing forum thread where deleting a replicated encrypted snapshot causes a crash?
 

NugentS

Well - it took an hour or so, but the NAS rebooted itself several times. I guess that wasn't it.
As for the Crucial - I used that to replace an NVMe to USB bridge that I thought might be the issue.
 

NugentS

I guess replacing the PSU is my only option now - which will be a nuisance, assuming I can even get one
 

NugentS

samarium said: Replication of encrypted snapshots? Another instance of the existing forum thread where deleting a replicated encrypted snapshot causes a crash?

One of the child datasets is encrypted - but the replication task hasn't got to that one yet. It's also been replicating just fine for a year or so

Actually, on consideration, there might be something to this. I will need to run some more tests
 

NugentS

@samarium it looks like you may be on to something. I cleared out all the snapshots (may have been a bit enthusiastic) and redid the replication. It's been running all day with several replications at the same time. I am waiting for the current set to finish before I retry the encrypted dataset

However the NAS has been stable - so it looks like software, not the hardware I thought initially
 

samarium

Interesting. From what I recall people saying, the server that received encrypted incremental snapshots was crashing when trying to delete snapshots, either as part of the replication or manually. So I don't know if this is exactly the same bug. It may be related, and you were lucky enough not to get a crash when you were cleaning out. There is a reference to a JIRA ticket and to a GitHub PR from iX somewhere around here; it might be worth tracking down and reading.

I suggest you use a temporary pool for testing the encrypted dataset, so you can destroy the pool if it becomes infected and you are unable to delete snapshots without a crash. Even building a zvol and then building a temporary pool on the zvol for testing would seem safer than allowing potentially undeletable datasets onto the main pool. I would also create a small temporary dataset to replicate for testing, rather than your main data, since you now have an idea of what might be tested. You could even do it in a VM for further isolation; that is what I would be doing in this case.
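The zvol-backed temporary pool idea could be sketched roughly as below. This is a dry-run helper only: it builds the command lines and prints them for review rather than running anything. The pool name `tank`, the zvol name, the 20G size and the scratch pool name are all placeholders, and whether a given platform lets a pool sit on top of a zvol without extra settings is an assumption to check first.

```python
import shlex


def scratch_pool_plan(parent_pool, zvol_name="scratchvol", size="20G",
                      scratch_pool="scratch"):
    """Return (setup, teardown) command lists for a throwaway pool on a zvol.

    Nothing is executed here; the caller reviews the commands first.
    All names and the size are placeholders.
    """
    zvol = f"{parent_pool}/{zvol_name}"
    setup = [
        # carve a fixed-size zvol out of the existing pool
        ["zfs", "create", "-V", size, zvol],
        # build the temporary test pool on top of that zvol
        ["zpool", "create", scratch_pool, f"/dev/zvol/{zvol}"],
    ]
    teardown = [
        # drop the test pool and everything received into it
        ["zpool", "destroy", scratch_pool],
        # then remove the backing zvol from the main pool
        ["zfs", "destroy", zvol],
    ]
    return setup, teardown


if __name__ == "__main__":
    setup, teardown = scratch_pool_plan("tank")
    for cmd in setup + teardown:
        print(shlex.join(cmd))
```

The point of keeping teardown as a whole-pool destroy is that it does not depend on being able to delete individual snapshots, which is the operation reported to crash.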
 

NugentS

Well - that was underwhelming.
All the non-encrypted datasets (ignoring the largest, which will take a week+) replicate just fine - destination NAS stays up
So I add the encrypted dataset - which replicates just fine as well.

I did delete everything at the destination - so it was starting afresh, and I deleted all the snapshots at the source end. I guess it's now wait and see what happens when snapshots start being deleted (2 months at the destination in this case). Might change that for testing purposes once the big one has completed
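To get an idea of when deletions will start at the destination, a small Python sketch like this can pick out which snapshots fall outside the retention window. The `auto-%Y-%m-%d_%H-%M` naming schema and treating "2 months" as 60 days are assumptions here; adjust both to match the actual replication task.

```python
from datetime import datetime, timedelta


def expired_snapshots(names, now, retention_days=60,
                      schema="auto-%Y-%m-%d_%H-%M"):
    """Return the snapshot names that are older than the retention window.

    names are the part after '@'. Names that do not match the assumed
    schema are skipped rather than treated as expired.
    """
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            taken = datetime.strptime(name, schema)
        except ValueError:
            continue  # not a snapshot from this schedule, leave it alone
        if taken < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    snaps = ["auto-2023-12-01_00-00", "auto-2024-02-15_00-00", "manual-1"]
    print(expired_snapshots(snaps, now=datetime(2024, 3, 1)))
```

Shortening `retention_days` in a test run is one way to see sooner which snapshots would be up for deletion, without waiting the full two months.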

[I need a faster destination NAS]
 

ikarlo

Dabbler
Joined
Apr 21, 2021
Messages
18
Hi,
The following threads deal with the above-mentioned issue:

 