Zvols disappeared

Status
Not open for further replies.

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
Pretty serious problem here guys. I have a FN box (source) that I replicate iSCSI zvols to a remote data center (target) for DR. Few days ago, the replication process complained it can no longer receive the incremental stream due to the snapshots being out of sync. I tried all kinds of ways suggested on the forum, but in the end I deleted all of the snapshots on the replication server hoping the source server will kick off a brand new replication stream.

This morning I noticed the entire zvol is now missing on the target server. The source server now complains that it cannot send incrementals because the dataset is missing. ????

What in the world happened? Perhaps I don't fully understand zfs, but why would deleting snapshots cause the entire zvol to disappear? And how do I fix this?
 
Joined
Oct 2, 2014
Messages
925
Have you done any updates to FreeNAS? Hardware specs would be nice. Any changes at all to the environment?
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
No updates; the only change was the snapshots I deleted. The source HW is a Dell R5500 with 32GB RAM, and the replication server is a Dell R515 with an AMD Opteron 4130 and 32GB of RAM. The R515 has an H200 HBA flashed to IT mode. Both are running 9.3-STABLE-201506292130.

I don't think it's hardware related, since the missing zvols (4 of them) are exactly the ones that had all their snapshots deleted. I would think a hardware issue would not be so selective.

Not sure if this is of any help, but the zvols were still there immediately after I deleted all the snapshots; this was verified with a zfs list. They disappeared maybe 2 hours later, before the scheduled replication job kicked off. Has no one else who does replication seen this? Is it a bug in the ZFS code, or just ZFS behavior that I'm not understanding? I can reproduce this, by the way.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It's probably your understanding of zfs. Can you outline your repro?
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
Sure, first I get a complaint from the sending server:

1. CRITICAL: Replication ZPOOL/iSCSIset -> rep01.myserver.com:REP01/STORE01 failed: cannot receive incremental stream: most recent snapshot of REP01/STORE01/ZVOLS/scratch01-lun0 does not match incremental source Error 33 : Write error : cannot write compressed block

2. I then verify that the replication server (target) is indeed a few snapshots behind, so I delete all of the snapshots on the target and verify none are left:
Code:
 [root@rep01 ~]# zfs list REP01/STORE01/ZVOLS/scratch01-lun0

NAME      USED  AVAIL  REFER  MOUNTPOINT
REP01/STORE01/ZVOLS/scratch01-lun0  17.8G  24.4T  2.20G  -

[root@rep01 ~]# zfs list -rt snapshot REP01/STORE01/ZVOLS/scratch01-lun0
NAME  USED  AVAIL  REFER  MOUNTPOINT
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.0820-6m  12.8K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.0820-1m  12.8K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1119-2w  863K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1219-2w  703K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1319-2w  703K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1419-2w  863K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150829.0820-1m  6.00M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150831.2352-1m  5.81M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150901.1104-2w  4.57M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150902.1104-2w  4.99M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150903.1104-2w  824K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150903.1300-1w  824K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1104-2w  818K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1159-1w  831K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1359-1w  850K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1559-1w  1.00M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1759-1w  863K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1959-1w  927K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150905.1104-2w  4.57M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150906.1104-2w  4.96M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.0700-1w  1.11M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.0900-1w  850K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1100-1w  639K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1104-2w  639K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1300-1w  856K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1500-1w  1016K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1700-1w  844K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1900-1w  850K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.0700-1w  1.06M  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.0900-1w  946K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1100-1w  486K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1104-2w  486K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1300-1w  856K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1500-1w  997K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1700-1w  837K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1900-1w  863K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.0700-1w  901K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.0900-1w  895K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1100-1w  652K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1104-2w  652K  -  2.20G  -
REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1302-1w  1.21M  -  2.20G  -

[root@rep01 ~]# zfs destroy -rRpv REP01/STORE01/ZVOLS/scratch01-lun0@%auto-20150909.1302-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.0820-6m
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.0820-1m
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1119-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1219-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1319-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.1419-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150829.0820-1m
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150831.2352-1m
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150901.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150902.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150903.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150903.1300-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1159-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1359-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1559-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1759-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150904.1959-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150905.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150906.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.0700-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.0900-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1100-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1300-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1500-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1700-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150907.1900-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.0700-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.0900-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1100-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1300-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1500-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1700-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150908.1900-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.0700-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.0900-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1100-1w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1104-2w
destroy REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1302-1w
reclaim 148764752

[root@rep01 ~]# zfs list -rt snapshot REP01/STORE01/ZVOLS/scratch01-lun0
no datasets available

[root@rep01 ~]# zfs list REP01/STORE01/ZVOLS/scratch01-lun0
NAME  USED  AVAIL  REFER  MOUNTPOINT
REP01/STORE01/ZVOLS/scratch01-lun0  17.8G  24.4T  2.20G  -



3. I wait a bit (time varies, anywhere from 30 minutes to 2 hours)

Code:
[root@rep01 ~]# zfs list REP01/STORE01/ZVOLS/scratch01-lun0
cannot open 'REP01/STORE01/ZVOLS/scratch01-lun0': dataset does not exist


4. I hold my breath, check the source server and verify the original zvol is still there, then I go and change my pants.
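
If anyone wants to trace what actually issued the destroy, zpool history on the target pool should show it (a sketch, using my pool name):

Code:
# -i includes internally-logged events, -l adds user/host to each entry
[root@rep01 ~]# zpool history -il REP01 | grep -i destroy | tail -20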
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
I forgot to add that the new error message on the source server is this:

CRITICAL: Replication ZPOOL/iSCSIset -> rep01.myserver.com:REP01/STORE01 failed: cannot receive incremental stream: destination 'REP01/STORE01/ZVOLS/scratch01-lun0' does not exist Error 33 : Write error : cannot write compressed block
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, snapshots should not normally be deleted by the end-user. The snapshots have ZFS tags on them to keep track of what has been sent and what hasn't.

Basically there has to be at least one snapshot in common between Server A and Server B. If there isn't then you cannot replicate.
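A quick way to check for a common snapshot (a sketch only; dataset names are taken from this thread, so adjust them to your systems):

Code:
# list snapshot names on each side, keep only the names both sides have
zfs list -H -o name -t snapshot -r ZPOOL/iSCSIset/ZVOLS/scratch01-lun0 | sed 's/.*@//' | sort -u > /tmp/src
ssh rep01.myserver.com "zfs list -H -o name -t snapshot -r REP01/STORE01/ZVOLS/scratch01-lun0" | sed 's/.*@//' | sort -u > /tmp/dst
# auto-YYYYMMDD.HHMM names sort chronologically, so tail -1 is the newest;
# empty output means no common snapshot, i.e. replication is broken
comm -12 /tmp/src /tmp/dst | tail -1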

If you did delete all of your snapshots on one side then you broke replication permanently. There is no fix except to reinitialize the destination via the checkbox for the replication task.

Unfortunately, since you deleted snapshots I can't really tell you what started the problem... But I can definitely tell you that by deleting all of the snapshots that existed you broke replication permanently and must reinitialize the destination.
 

DaveY

Contributor
Joined
Dec 1, 2014
Messages
141
Well, snapshots should not normally be deleted by the end-user. The snapshots have ZFS tags on them to keep track of what has been sent and what hasn't.

If snapshots should not normally be deleted by the end-user, then why does Storage->Snapshots have a checkbox next to each snapshot, along with options for the end-user to destroy them? There are also many posts on the forum of people deleting snapshots through the CLI as an alternative.

If deleting snapshots that are part of a replication task is not allowed, then that's a serious limitation of the replication process. People may NEED to delete snapshots for various reasons: to free up disk space in a hurry, to remove snapshots past their retention period that FreeNAS failed to destroy (there are various posts on the forum about this issue), or because older snapshots contain sensitive files that the company has asked IT to wipe for liability reasons. There are many situations where an end-user might need to manually delete snapshots, and the system should be (and I believe is) robust enough to recover. Just saying...

If you did delete all of your snapshots on one side then you broke replication permanently. There is no fix except to reinitialize the destination via the checkbox for the replication task.

Thanks for the clarification. I was able to fix replication by running zfs send -R on the latest snapshot, which is probably the same as the "reinitialize the destination" option you mentioned.
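
For reference, what I ran was roughly this (the source-side path and snapshot name here are illustrative, based on my layout from the earlier output):

Code:
# full (non-incremental) replication stream; recv -F overwrites whatever
# is on the destination so both sides start from the same snapshot again
zfs send -R ZPOOL/iSCSIset/ZVOLS/scratch01-lun0@auto-20150909.1302-1w | \
    ssh rep01.myserver.com zfs recv -F REP01/STORE01/ZVOLS/scratch01-lun0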

My original question wasn't so much about breaking and fixing replication, but about why destroying snapshots would cause the zvol itself to disappear. It's one thing to lose snapshots; it's extra frightening for the dataset itself to be gone.
 

ornstedt

Dabbler
Joined
Apr 26, 2014
Messages
10
When you set up ZFS replication, a filesystem is created on the destination; at that point it is empty. After the first replication there is a snapshot on the destination that is a copy of all the data from the source. If you delete all snapshots on the receiving end, there is no data left; i.e., you are back to the original empty filesystem. As long as you keep at least one snapshot in common, you can do an incremental replication, which is much less data to transfer, since only the changes are sent. In your case I would have deleted only the oldest snapshots on the destination. If your source has a lot of changes, the snapshots will be large, and you might not be able to keep that many of them on the destination.
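
As a sketch of that "delete only the oldest" approach (snapshot names taken from the listing earlier in this thread), a range destroy that keeps the newest snapshot as the common one would look something like this:

Code:
# sketch: destroy from the oldest snapshot through the second-newest,
# keeping auto-20150909.1302-1w as the common snapshot for incrementals
[root@rep01 ~]# zfs destroy -v REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150828.0820-6m%auto-20150909.1104-2w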
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I think there is a way to get the FreeNAS GUI back in sync with replication.
If I recall correctly, the GUI replication is driven by a Python (or similar) script. It checks which snapshot is tagged for replication, and that tag can be set (possibly) or removed (definitely) using CLI commands.
You can certainly perform replication from the CLI, and even sync it up manually, but I doubt the GUI will be able to catch up with it.
To sync your dataset or entire volume, you need to know the last snapshot being held on your remote server and use it when doing zfs send -RI .... (a sketch of this is at the end of this post).

I think the GUI replication script scrolls through the snapshots looking for the existing "last replicated snapshot" tag, then compares against the destination server. You can list which snapshot is tagged, and you can certainly replicate that dataset using the CLI; I think doing the replication will set the tag, clear it, or increment it (not entirely sure). If you did an entire recursive pool replication, then you need to restore or create the datasets with the missing snapshots on the remote server; otherwise, when the GUI checks for the latest snapshot on the remote server and can't find it, it will throw an error. It could be done, but implementing it could be very tricky. It is also possible the remote server has tags on its snapshots as well.
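
A rough sketch of that manual resync (snapshot names here are hypothetical, and doing this behind the GUI's back may leave its bookkeeping out of sync):

Code:
# send everything between the newest snapshot the remote side still has
# and the newest snapshot on the source (names are placeholders)
zfs send -R -I ZPOOL/iSCSIset@auto-remote-newest ZPOOL/iSCSIset@auto-source-newest | \
    ssh rep01.myserver.com zfs recv -F REP01/STORE01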

I find the Oracle documentation extremely useful, but it seems not everybody on this forum agrees, for reasons unclear to me; maybe because the ZFS in FreeNAS is a fork, and the Oracle documentation will not necessarily apply to it.
I would say dig into it, at least to broaden your knowledge and understanding. Just be cautious, as there is always a risk of messing up one of your servers, or both.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Ok, let me clarify what I said before.

You shouldn't be deleting snapshots if the snapshot is part of a replication task. The replication processes that iXsystems uses in FreeNAS and TrueNAS handle the ZFS tagging/holds, as well as removing snapshots when they expire. When you decide to delete stuff manually, you run the risk of breaking the tagging. If this happens, it takes considerable knowledge of how iXsystems' replication system works to "unbreak" what is broken. Depending on what you actually did, there may be no way to "unbreak" it (for example, if there are no snapshots shared between the source and destination). Then the only fix is to re-replicate everything again, from scratch.
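
If you want to look at that bookkeeping yourself, these are the two places I'd check (note: the freenas:state property name is from memory of the 9.x scripts, so treat it as an assumption):

Code:
# user holds placed on a snapshot (empty output means no holds)
zfs holds REP01/STORE01/ZVOLS/scratch01-lun0@auto-20150909.1302-1w

# assumption: the 9.x replication scripts track progress in a per-snapshot
# user property, commonly freenas:state ('-', 'NEW', or 'LATEST')
zfs get -r -t snapshot freenas:state ZPOOL/iSCSIset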

The Oracle documentation can be useful, but it can also be a problem. Oracle's ZFS implementation is a fork of Sun's ZFS implementation. Likewise, the ZFS that FreeNAS uses, OpenZFS, is its own fork of Sun's ZFS. Oracle also works hard to make sure that the features they add to their implementation aren't compatible with OpenZFS. So the documentation for a particular subject can be 100% correct, or 100% incorrect. If you are a ZFS expert, you probably know what is and isn't applicable, so the Oracle documentation can be extremely useful. Unfortunately, if you aren't already an expert, it's a very poor document to learn ZFS from, because you have no way to know whether something you are thinking about doing (or trying to do) should or shouldn't work. For learning, it is important to have a document that teaches you what you can actually do, not one where some parts apply and many don't.
 