Core vs Scale Snapshot Problems

Pmantis

Dabbler
Joined
Mar 11, 2015
Messages
19
I used to run FreeNAS, then TrueNAS (now CORE, of course), and several months ago I migrated to SCALE. I really like the Debian base, which allows for Docker integration and better virtualization, so I didn't migrate back... but...

Quick history:
  • When I upgraded in place from CORE to SCALE, it bombed - kernel panic on every boot.
  • I reinstalled SCALE, imported the pool, and reconfigured all services.
  • When I set up snapshots, the pool wedged at 100% full, 0 bytes free.
  • I had to 'zfs send' all datasets to a new pool - there's no way to delete from a full COW FS (see the sketch after this list).
  • I avoided setting up snapshots so this didn't happen again.
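For the curious, the evacuation was basically a recursive send/receive; a minimal sketch with 'oldpool' and 'newpool' as stand-in names (not my real pool names):
Code:
# one-off recursive snapshot of the dataset tree being moved
zfs snapshot -r oldpool/dsVMs@evacuate

# replicate the whole tree to the other pool (-u keeps the copies unmounted)
zfs send -R oldpool/dsVMs@evacuate | zfs recv -u newpool/dsVMs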
I updated to Bluefin recently and decided to clean up my pools (I have 2 on this system) by deleting old files and such, then set up snapshots to start having some protection again. I started at 64.9% usage on this pool which only houses VM datasets:
[screenshot: pool usage at 64.9%]


I created a snapshot task for the pool, scheduled for midnight (it was about 10:30 PM at the time). I then created a replication task to send the snapshots to the second pool on the same system (for now). About 10 minutes later, I saw the pool was at 0 bytes free... The snapshot task wasn't even SCHEDULED to run yet, since it wasn't midnight... but now the pool is wedged, and I can't delete any snapshots to free it up:
Code:
root@nas[/mnt/raid1]# zfs destroy sas15k/dsVMs/xcp-ng@auto-2022-12-27_22-28     
internal error: cannot destroy snapshots: Channel number out of range
zsh: abort (core dumped)  zfs destroy sas15k/dsVMs/xcp-ng@auto-2022-12-27_22-28
root@nas[/mnt/raid1]#
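In case it's useful, this is the sort of space accounting I was staring at while trying to work out where the space went (standard OpenZFS commands, nothing SCALE-specific):
Code:
# per-dataset breakdown: live data vs snapshots vs refreservation vs children
zfs list -o space -r sas15k

# pool-level view; zpool can still report free space when zfs shows 0B available
zpool list sas15k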


[screenshot: pool at 0 bytes free]


So, is SCALE still not ready for snapshots? It's quite ironic that snapshots (meant to protect data) are making a pool mostly unusable. At least my VMs are still running, but I cannot create or delete anything on the pool.

What options do I have other than backing up, recreating the pool, and restoring?

Thanks!
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694

Haven't seen this before... please report a bug. Provide detailed info on the snapshot task.

Anything unusual about the VMs ??

That's 250 GB in 90 minutes... roughly 2.8 GB per minute, or about 46 MB/s sustained.

What was your typical data write rate?
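If you get a chance, something like this run during normal VM activity would show the sustained rate (standard zpool tooling; substitute your pool name):
Code:
# pool-wide read/write bandwidth, sampled every 10 seconds
zpool iostat sas15k 10

# same thing broken out per vdev/disk
zpool iostat -v sas15k 10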
 

Pmantis

Dabbler
Joined
Mar 11, 2015
Messages
19
I don't have a lot of spare time right now, so I'm probably going to leave this pool as-is for a week or two until I decide what to do with the VMs. I am considering replacing the drives with a larger set. Will that make the pool usable again, or does expanding onto larger drives require a write somewhere that the full pool will block?

To answer your questions: I don't consider the VMs to be anything unusual. I have a Windows Server 2019 AD server, my accounting PC, and another VM for various things (both are Windows 10). The rest are little Linux machines: a small MySQL DB, a PostgreSQL DB, an Apache web server for some light development, a self-hosted Bitwarden VM, a Ubiquiti controller, Pi-hole, and a pfSense box for a VPN endpoint. I'm mostly the only user of these machines, so I don't consider it to be high volume. Is there a good way to get you some I/O information?

Your math seems to be based on an incorrect assumption. As I looked at the logs, I found that I had received a "full" alert before 10:30, so I was working on creating the snapshots a little before that... but not much. The pool seemed to fill up in seconds:
[screenshot: alert list]

Critical
Quota exceeded on dataset sas15k/.system/cores. Used 100.00% (1 GiB of 1 GiB).
2022-12-27 22:28:33
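For what it's worth, that particular alert is just the 1 GiB quota on the .system/cores dataset (where core dumps land), presumably tripped by the crashing zfs commands, and separate from the pool itself filling up. Checking it is a one-liner:
Code:
# how full the cores dataset is relative to its quota
zfs get used,quota,available sas15k/.system/cores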

I still don't understand how it ran well before it was scheduled to run. I first created a snapshot task, then a replication task. After that, I had two snapshot tasks for the same pool, so it appears that the replication task auto-created a snapshot task. I'm quite sure I scheduled it for 1 AM so it would trigger after the snapshot existed, but I may have done something with an incorrect understanding. This is, to the best of my recollection, how I created it:
[screenshot: replication task settings]


The snapshot task was mostly defaults: 2-week retention, recursive, run daily. I deleted the tasks for that pool, but in hindsight it really didn't matter... the damage was done. However, here is a snapshot task that I created for a different pool at the same time, so it's likely identical except for the excludes. I'm a little afraid of enabling it...
[screenshots: snapshot task settings]
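As I understand it, each run of that task boils down to a recursive, auto-named snapshot of the pool, something like this (illustrative only, not what the middleware literally executes):
Code:
# recursive snapshot of the pool, named like the auto-generated ones
zfs snapshot -r sas15k@auto-2022-12-28_00-00

# list the snapshots and how much space each one pins
zfs list -t snapshot -r -o name,used,referenced sas15k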


Regarding the bug report, I don't really know what to report. Is it a Linux/OpenZFS implementation issue, or a TrueNAS issue? Is the problem snapshots on a pool holding only zvols? Too many unknowns. Given my issues with SCALE, I'm hesitant to upgrade any customer NASes from CORE to SCALE.
 

truecharts

Guru
Joined
Aug 19, 2021
Messages
788

Just fill the bug report in as best as you can and the devs will get the rest out of the debug you need to attach.
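If you want to give the devs a head start, pasting the output of a couple of standard commands into the ticket alongside the debug usually helps (nothing SCALE-specific here either):
Code:
# pool health plus any reported errors
zpool status -v sas15k

# recent administrative history on the pool (snapshot/replication commands with timestamps)
zpool history sas15k | tail -n 100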
 