Datasets Disappearing/Deleting Randomly

Pennco Tech

Cadet
Joined
Jan 19, 2022
Messages
4
We have a few TrueNAS servers, and one of them is randomly deleting datasets. 1/14/22 was the last day I can say everything appeared fine, so something happened over the weekend that caused two datasets to be deleted. We are not running out of space on this server. We have two replication PULL tasks running, and a scrub task runs every Sunday. I don't see any HDD errors. The server is running TrueNAS-12.0-U3.1.

I recreated the two shares yesterday, but when I checked today, one of them was gone. I also remember that we had backed up our IT folder from our other TrueNAS, but now that is missing as well.

I've also never seen this critical message before:
Replication "BRPool2 - BRNas1-Pool1" failed: cannot unmount '/mnt/BRNas1-Pool1/Bristol Scans': unmount failed cannot destroy 'BRNas1-Pool1/PTSQL2SCSI': dataset is busy cannot destroy 'BRNas1-Pool1/BackupTest/Instructors/%recv': dataset is busy cannot destroy snapshot BRNas1-Pool1/BackupTest/Instructors@auto-2021-01-05_07-54: snapshot is cloned cannot destroy 'BRNas1-Pool1/BackupTest': dataset already exists cannot destroy 'BRNas1-Pool1/BWNas-Backup/Marketing': dataset is busy cannot destroy 'BRNas1-Pool1/BWNas-Backup': dataset already exists cannot unmount '/var/db/system/syslog-6c934beb66de482e8faef2d3b30acc82': unmount failed cannot unmount '/var/db/system/samba4': unmount failed cannot unmount '/var/db/system': unmount failed cannot destroy 'BRNas1-Pool1/PTSQL1SCSI': dataset is busy cannot destroy 'BRNas1-Pool1/CVSQL Backup': dataset is busy..
2022-01-19 00:01:34 (America/New_York)

Why is it trying to destroy our datasets?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You could start by looking at zpool history for that pool to understand exact times when things are happening... maybe that can start to give you clues as to why.
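For example (just a sketch, using the pool name from your alert):

zpool history BRNas1-Pool1
zpool history -l BRNas1-Pool1    # long format also records the user and hostname that issued each command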
 

Pennco Tech

Cadet
Joined
Jan 19, 2022
Messages
4
You could start by looking at zpool history for that pool to understand exact times when things are happening... maybe that can start to give you clues as to why.
Thank you! I didn't know there was a command like that.

I found this, which looks important. Unfortunately I don't understand the syntax here. I'm guessing recv is recovery? The bottom section of text looks like it deleted all the snapshots. I just see a bunch of the same output repeated, but for different datasets. I noticed it doesn't delete, or can't delete, whatever is being used at the moment. I stayed connected to the Bristol Scans share overnight and it's still there. The other dataset, BRMarketing, was deleted again.

Is there a way to see user logins?

2022-01-19.16:43:40 zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb BRNas1-Pool1/BackupTest/Maintenance
2022-01-19.16:43:43 zfs recv -s -F -x mountpoint -x sharenfs -x sharesmb BRNas1-Pool1/BackupTest/Maintenance
2022-01-19.16:43:48 zfs set readonly=on BRNas1-Pool1/BackupTest/Maintenance
2022-01-19.17:00:04 zfs snapshot -r BRNas1-Pool1@auto-2022-01-19_17-00
2022-01-19.17:00:09 zfs destroy BRNas1-Pool1/BackupTest/Maintenance@auto-2021-01-05_07-54,auto-2021-10-26_09-00,auto-2021-10-26_18-00,auto-2021-10-27_00-00,auto-2021-10-27_09-00,auto-2021-10-27_18-00,auto-2021-10-28_00-00,auto-2021-10-28_09-00,auto-2021-10-28_18-00,auto-2021-10-29_00-00,auto-2021-10-29_09-00,auto-2021-10-29_18-00,auto-2021-10-30_00-00,auto-2021-10-30_09-00,auto-2021-10-30_18-00,auto-2021-10-31_00-00,auto-2021-10-31_09-00,auto-2021-10-31_18-00,auto-2021-11-01_00-00,auto-2021-11-01_09-00,auto-2021-11-01_18-00,auto-2021-11-02_00-00,auto-2021-11-02_09-00,auto-2021-11-02_18-00,auto-2021-11-03_00-00,auto-2021-11-03_09-00,auto-2021-11-03_18-00,auto-2021-11-04_00-00,auto-2021-11-04_09-00,auto-2021-11-04_18-00,auto-2021-11-05_00-00,auto-2021-11-05_09-00,auto-2021-11-05_18-00,auto-2021-11-06_00-00,auto-2021-11-06_09-00,auto-2021-11-06_18-00,auto-2021-11-07_00-00,auto-2021-11-07_09-00,auto-2021-11-07_18-00,auto-2021-11-08_00-00,auto-2021-11-08_09-00,auto-2021-11-08_18-00,auto-2021-11-09_00-00,auto-2021-11-09_09-00,auto-2021-11-09_18-00,auto-2021-11-10_00-00,auto-2021-11-10_09-00,auto-2021-11-10_18-00,auto-2021-11-11_00-00,auto-2021-11-11_09-00,auto-2021-11-11_18-00,auto-2021-11-12_00-00,auto-2021-11-12_09-00,auto-2021-11-12_18-00,auto-2021-11-13_00-00,auto

2022-01-19.18:09:19 zfs recv -s -F -x sharesmb -x mountpoint -x sharenfs BRNas1-Pool1/BlackwoodFNPool1Backup/Admissions
2022-01-19.18:09:23 zfs recv -s -F -x sharesmb -x mountpoint -x sharenfs BRNas1-Pool1/BlackwoodFNPool1Backup/Admissions
2022-01-19.18:09:27 zfs recv -s -F -x sharesmb -x mountpoint -x sharenfs BRNas1-Pool1/BlackwoodFNPool1Backup/Admissions
2022-01-19.18:09:41 zfs destroy BRNas1-Pool1/BWBackupDataset@auto-2021-11-20_18-00
2022-01-19.18:09:41 zfs destroy BRNas1-Pool1/BackupTest/Instructors@auto-2021-11-20_18-00
2022-01-19.18:09:46 zfs destroy BRNas1-Pool1/BackupTest/Maintenance@auto-2021-11-20_18-00
2022-01-19.19:11:08 zfs snapshot -r BRNas1-Pool1/Bristol Scans@Bristol Scans manual-2022-01-19_19-10
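For reference, the destroy entries can be filtered out of the full history with something like this (a sketch; pool name taken from the output above, and the long format may also help with the login question since it records which user/host issued each command):

zpool history -l BRNas1-Pool1 | grep destroy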
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I'm guessing recv is recovery?
zfs recv is the receive end of a zfs replication.

I would start by looking at your replication tasks or other scripts you may have that are launched at 18:00.

If there are other systems that may be replicating toward the system in question, check them for replication tasks also.
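If you want to check from the shell rather than the web UI, something like this should list what's configured (a sketch; midclt is the middleware CLI that ships with TrueNAS, and the zettarepl log path is what I'd expect on 12.0):

midclt call replication.query | python3 -m json.tool    # all replication tasks, their schedules and last state
midclt call cronjob.query | python3 -m json.tool        # any cron jobs running custom scripts
tail -n 100 /var/log/zettarepl.log                       # the replication worker's own log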
 

Pennco Tech

Cadet
Joined
Jan 19, 2022
Messages
4
zfs recv is the receive end of a zfs replication.

I would start by looking at your replication tasks or other scripts you may have that are launched at 18:00.

If there are other systems that may be replicating toward the system in question, check them for replication tasks also.
Two of our TrueNAS servers are replicating; the third has no replication. There are two PULL tasks on the server in question. Both are in an error state, but they've been that way for a while and we just never cleaned them up. Our other server has 4 PUSH tasks. I hadn't noticed this before, but one of the PULL tasks is actually pulling from the TrueNAS that has no replication set up. It's coming up with this error:

Error:

cannot unmount '/mnt/BRNas1-Pool1/Bristol Scans': unmount failed
cannot destroy 'BRNas1-Pool1/PTSQL2SCSI': dataset is busy
cannot destroy 'BRNas1-Pool1/BackupTest/Marketing': dataset is busy
cannot destroy 'BRNas1-Pool1/BackupTest': dataset already exists
cannot destroy 'BRNas1-Pool1/BWNas-Backup/Marketing': dataset is busy
cannot destroy 'BRNas1-Pool1/BWNas-Backup': dataset already exists
cannot unmount '/var/db/system/syslog-6c934beb66de482e8faef2d3b30acc82': unmount failed
cannot unmount '/var/db/system/samba4': unmount failed
cannot unmount '/var/db/system': unmount failed
cannot destroy 'BRNas1-Pool1/PTSQL1SCSI': dataset is busy
cannot destroy 'BRNas1-Pool1/CVSQL Backup': dataset is busy.

Logs:

[2022/01/20 00:00:00] INFO [Thread-16396] [zettarepl.paramiko.replication_task__task_2] Authentication (publickey) successful!
[2022/01/20 00:00:02] WARNING [replication_task__task_2] [zettarepl.replication.run] No incremental base for replication task 'task_2' on dataset 'BRPool2', destroying destination dataset
[2022/01/20 00:00:36] ERROR [replication_task__task_2] [zettarepl.replication.run] For task 'task_2' unhandled replication error ExecException(1, "cannot unmount '/mnt/BRNas1-Pool1/Bristol Scans': unmount failed\ncannot destroy 'BRNas1-Pool1/PTSQL2SCSI': dataset is busy\ncannot destroy 'BRNas1-Pool1/BackupTest/Marketing': dataset is busy\ncannot destroy 'BRNas1-Pool1/BackupTest': dataset already exists\ncannot destroy 'BRNas1-Pool1/BWNas-Backup/Marketing': dataset is busy\ncannot destroy 'BRNas1-Pool1/BWNas-Backup': dataset already exists\ncannot unmount '/var/db/system/syslog-6c934beb66de482e8faef2d3b30acc82': unmount failed\ncannot unmount '/var/db/system/samba4': unmount failed\ncannot unmount '/var/db/system': unmount failed\ncannot destroy 'BRNas1-Pool1/PTSQL1SCSI': dataset is busy\ncannot destroy 'BRNas1-Pool1/CVSQL Backup': dataset is busy\n")
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 162, in run_replication_tasks
run_replication_task_part(replication_task, source_dataset, src_context, dst_context, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 252, in run_replication_task_part
run_replication_steps(step_templates, observer)
File "/usr/local/lib/python3.8/site-packages/zettarepl/replication/run.py", line 492, in run_replication_steps
step_template.dst_context.shell.exec(["zfs", "destroy", "-r", step_template.dst_dataset])
... 5 more lines ...
cannot destroy 'BRNas1-Pool1/PTSQL2SCSI': dataset is busy
cannot destroy 'BRNas1-Pool1/BackupTest/Marketing': dataset is busy
cannot destroy 'BRNas1-Pool1/BackupTest': dataset already exists
cannot destroy 'BRNas1-Pool1/BWNas-Backup/Marketing': dataset is busy
cannot destroy 'BRNas1-Pool1/BWNas-Backup': dataset already exists
cannot unmount '/var/db/system/syslog-6c934beb66de482e8faef2d3b30acc82': unmount failed
cannot unmount '/var/db/system/samba4': unmount failed
cannot unmount '/var/db/system': unmount failed
cannot destroy 'BRNas1-Pool1/PTSQL1SCSI': dataset is busy
cannot destroy 'BRNas1-Pool1/CVSQL Backup': dataset is busy

I'm going to delete this replication task since we don't use it anyway. I'll recreate the BRMarketing dataset and see whether it and our snapshots disappear again tomorrow.
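For reference, a quick way to check tomorrow whether anything was destroyed overnight (the full BRMarketing path below is my guess at where it lives):

zpool history BRNas1-Pool1 | tail -n 50          # most recent pool history entries
zfs list -t all -r BRNas1-Pool1/BRMarketing      # the dataset plus its snapshots, if they're still there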
 

Pennco Tech

Cadet
Joined
Jan 19, 2022
Messages
4
Good news: the BRMarketing dataset is still there and I don't see any critical messages about destroying datasets. I'm going to keep an eye on this for a week just to make sure. It's weird that the PULL replication from that one server was causing this. It doesn't even have the same datasets that it was trying to delete on our server, and I don't know why it suddenly started doing this. Thank you, sretalla, for your help!
 