Question about how zfs initializes a two drive pool vs three

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
implementation of this "checksum" block which it trusts as being accurate whenever it uses it.
Well, not really. If the data (or metadata, which is also checksummed) matches its checksum, both are concluded to be good. That's a safe assumption, because the odds of both being corrupted in such a way that the bad data still matches the bad checksum are astronomical. In other words, if they don't match, it really doesn't matter which one is in error: ZFS looks for either a duplicate (in the case of mirrors) or sufficient parity (in the case of RAIDZn), each also checksummed, that does match its checksum, uses that to provide the requested data, and at the same time repairs the corrupt data or checksum it detected.
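Self-healing like that is easiest to see during a scrub. A minimal illustration, assuming a redundant pool named tank (the pool name is just a placeholder):
Code:
# Read every allocated block and verify it against its checksum;
# blocks that fail verification are rewritten from the mirror copy
# or reconstructed from RAIDZ parity.
zpool scrub tank

# The CKSUM column counts blocks that failed their checksum; the
# "scan:" line reports how much the scrub repaired.
zpool status -v tank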
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
And the only reason why you are able to boot to the previous version of your Linux install is that you use ZFS?

...
No.

When I started using alternate boot environments, it was a second partition for the OS. The place I worked at the time, (Sun Microsystems), was using split mirrors for patching, (using Live Upgrade). This seemed like a good idea to me, so I started provisioning a second OS partition. Because most of my computers at the time did not have a second disk, I could not use a split mirror. Instead, I used "rsync" to copy the current OS to the non-active partition. I would then boot off of it and perform any updates; if the updates went bad, I went back to the prior partition.

Later I used other methods. Here are 4 Linux methods for alternate boot environments.
  1. Second OS partition, (could also be split mirror)
  2. Second LVM logical volume for OS, (LVM could be mirrored underneath)
  3. BTRFS snapshots
  4. ZFS snapshots & clones
I have personally used options 1, 3 & 4 extensively. Option 2 I have tested, but never used beyond that. A rough sketch of option 4 follows below.
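For option 4, the basic flow is snapshot, clone, and point the bootloader at the clone. A sketch only, assuming a root dataset named rpool/ROOT/default (the dataset names are hypothetical, and the bootloader step varies by distribution):
Code:
# Preserve the current root filesystem.
zfs snapshot rpool/ROOT/default@pre-upgrade

# Turn the snapshot into a writable, bootable fallback environment.
zfs clone rpool/ROOT/default@pre-upgrade rpool/ROOT/pre-upgrade

# If an upgrade of the active root goes bad, boot the clone instead.
# Some loaders honour the pool's bootfs property; others need their
# boot entry edited by hand.
zpool set bootfs=rpool/ROOT/pre-upgrade rpool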

Using snapshots allows me to have more than 1 alternate boot environment. I don't generally need to go back more than 1, but having 3 or so gives me the ability to go back in history. Somewhat like a backup, but easily available.

BTRFS, while native to Linux, never stabilized. It still had data loss potential at the time I moved to ZFS on Linux in 2014. With ZFS I get more advanced features that BTRFS did not support at the time: selectable on-line data compression per dataset, dataset quotas, (and reservations), and, for me, the VERY important per-dataset space usage from "df".

On that last item, it was a real annoyance that BTRFS could not give me the amount of used space in my "/home" dataset versus any other dataset. BTRFS gave the used & free space of the "pool" for all datasets. You had to use a special tool to get that information, not the common "df" command. This might have changed, but I've moved on and don't care about BTRFS anymore.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Can you use ZFS snapshots in the same way that we use VMWare snapshots? Only .... snapshots for bare metal booting?

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Can you use ZFS snapshots in the same way that we use VMWare snapshots? Only .... snapshots for bare metal booting?
Yes and no. ZFS snapshots can live for YEARS without problems.

VMWare snapshots, (and Linux LVM snapshots), appear to be some type of transaction log that accumulates all the writes while the main storage is read-only. This means the longer the snapshot lives, and the more data is written, the longer it takes to remove. That seems to be due to applying all the accumulated changes back to the main storage, which I've seen take hours in production. And if the VMWare snapshot has accumulated a huge number of changes, removing it can absolutely KILL performance, (because of all the disk I/O).

To be fair and clear, destroying a large or huge ZFS snapshot can take a long time too. But an extremely early feature added to OpenZFS, (and not to Sun / Oracle ZFS until years later), was asynchronous destroy. This means destroying a large or huge ZFS snapshot, (or dataset), might take hours, even days, but it's done in the background with full pool access, (though a little slower due to all the disk work). The formerly used space becomes incrementally available as it's freed. You can even export the pool or reboot without problems; the async destroy continues where it left off. ZFS will even tell you how much still needs to be "freed".
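The progress of an async destroy is visible as a pool property. A quick check, assuming a pool named tank (the name is hypothetical):
Code:
# "freeing" is the space still queued for background reclamation;
# it counts down toward zero as the async destroy works through it.
zpool get freeing tank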

A new feature probably not yet added to OpenZFS, (and which might not ever be), is an asynchronous delete. Deleting large or huge files can take time. There is a thought that OpenZFS could use the same method as async destroy, but for deleted files that are large or huge. The exact size threshold between an immediate delete and an async delete seemed to be tunable.
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
Yes and no. ZFS snapshots can live for YEARS without problems.

...
When I take a snapshot in VMWare Fusion while the OS is running, it immediately returns me to the OS to continue working, but when I look at the icon down in the dock, I can see a progress bar slowly growing (it can take up to 10 minutes to finish on an NVMe drive), which indicates to me that it's writing "something" - "somewhere" related to that snapshot. Obviously, it's wrong to assume that it's going through all of the data on the drive and making some kind of point-in-time record of it, since it gives you access to make any changes you want while it's going through whatever process it's going through. If I made a change that it had not yet accounted for, it would incorporate that change into the snapshot, which would place that data in the past relative to when I took the snapshot.

I had always visualized snapshots as a file-system-level process that simply marks blocks of data with some kind of serial number it can reference when the user asks it to restore a previous snapshot. I assumed the filesystem merely reserved space in the data blocks for tagging a snapshot - or, more efficiently, that it assigned an incrementing number to newly written blocks, so that creating a new snapshot merely causes the file system to mark new blocks with the next number in line. Then, when asked to go back in time, it changes that pointer value for reading only ... and I assumed, of course, some related code to ensure the file allocation table followed suit. It's the only way I could visualize what is happening and explain the speed at which a "snapshot" can occur.

But it seems that it's far more complex than my simple perspective.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Snapshots are implemented differently depending on the software or firmware. (I had an LSI disk array that supported LUN level snapshots.) What VMWare Fusion does may be different from what older VMWare did in the past. I don't know. While I use VMWare at work, it's as a Unix server admin. So I do console, power and other client side work; I don't administer VMWare itself.

I once had to explain to a senior Unix SysAdmin that ZFS snapshots did not have to be "applied" to the main data on deletion. He kept getting stuck comparing Linux LVM and others with ZFS.


The best way to think about ZFS snapshots is as read-only hard links to a dataset at the time of the snapshot. Even if you delete the original file, the snapshot's R/O data still exists. Thus, the snapshot maintains its data as of the time of the snapshot. (And all the used space too...)

So removing a ZFS snapshot is simple. If any data is still in use, that data stays with the parent dataset, but the snapshot's reference is removed. If the snapshot is now the sole "owner" of some data, that data is added to the free space list. No data needs to be moved or applied, the way it would with a transaction log.
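That accounting is visible directly in the space columns. A minimal check, assuming a dataset named tank/data with snapshots (the names are hypothetical):
Code:
# USEDSNAP is space held only by snapshots; USEDDS is space the live
# dataset references. Destroying a snapshot moves the blocks it alone
# "owns" straight onto the free list - nothing gets copied back.
zfs list -o space -r tank/data

# Per snapshot, USED is what destroying just that snapshot would free.
zfs list -t snapshot -o name,used,referenced -r tank/data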
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
Snapshots are implemented differently depending on the software or firmware.

...
This sounds like an incredible space consumer, although it seems like it's merely "locking" the data down to its current state when the snapshot is taken. I assume that taking a snapshot in ZFS, then, is not an instantaneous process, since it has to tag all the data?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I assume that taking a snapshot in ZFS, then, is not an instantaneous process, since it has to tag all the data?
It only has to tag the then-current transaction group at the time the snapshot is taken. Since the metadata of every block contains its "birth time" (i.e., the transaction group with which it was written), that's all ZFS needs in order to determine whether a particular block is part of the snapshot or not. Thus, snapshots are near-instantaneous.
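That is also why the cost of taking a snapshot does not scale with the amount of data. A small illustration, assuming a pool named tank (the names are hypothetical):
Code:
# Recursively snapshot every dataset in the pool. This records the
# current transaction group rather than walking any data, so it
# returns almost immediately even on a very large pool.
zfs snapshot -r tank@before-upgrade

# Roll one dataset back to that point in time later, if needed
# (rollback only targets the dataset's most recent snapshot).
zfs rollback tank/data@before-upgrade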
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Snapshots are, as @danb35 wrote, nearly instantaneous, and you can easily have thousands of them.
Code:
root@freenas[~]# zfs list -t snap|wc -l
    9712

That's just my home NAS - snapshots of everything every hour kept for a week.

This is one of our backup servers:
Code:
[ry93@backupr03 ~]$ zfs list -t snap|wc -l
   76371
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
This sounds like an incredible space consumer, although it seems like it's merely "locking" the data down to its current state when the snapshot is taken.

...
It's only a space consumer IF the source dataset / zVol deletes or over-writes lots of data. The amount of space a ZFS snapshot uses is directly related to the amount of changes in the source dataset / zVol. Thus, sometimes a snapshot uses a tiny amount of space.
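One way to see that relationship is the per-snapshot space properties. A quick check, assuming a dataset tank/data with a snapshot named monday (both hypothetical):
Code:
# "written" is the data newly written between the previous snapshot
# and this one; "used" is the space only this snapshot still pins
# (i.e., what destroying it would free).
zfs get written,used,referenced tank/data@monday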

Using ZFS snapshots is both optional, and a highly desirable feature. If you don't have space for them, don't use them. If you NEED them for various purposes, (ransomware recovery, file history, easy user restores, etc...), then plan on having extra space in your ZFS pool for the snapshots.

For example, my miniature media server's home dataset has nothing in use by its snapshots. That's because I rarely need to log in and make changes.

Code:
> zfs list -t all -r rpool/home
NAME                USED  AVAIL     REFER  MOUNTPOINT
rpool/home         21.3M  1003M     21.1M  legacy
rpool/home@dow_5      0B      -     21.1M  -
rpool/home@dow_6      0B      -     21.1M  -
rpool/home@dow_0      0B      -     21.1M  -
rpool/home@dow_1      0B      -     21.1M  -
rpool/home@dow_2      0B      -     21.1M  -
rpool/home@hod_20     0B      -     21.1M  -
rpool/home@hod_21     0B      -     21.1M  -
rpool/home@hod_22     0B      -     21.1M  -
rpool/home@dow_3      0B      -     21.1M  -
rpool/home@hod_23     0B      -     21.1M  -
rpool/home@hod_00     0B      -     21.1M  -
rpool/home@hod_01     0B      -     21.1M  -
rpool/home@hod_02     0B      -     21.1M  -
rpool/home@hod_03     0B      -     21.1M  -
rpool/home@hod_04     0B      -     21.1M  -
rpool/home@hod_05     0B      -     21.1M  -
rpool/home@hod_06     0B      -     21.1M  -
rpool/home@hod_07     0B      -     21.1M  -
rpool/home@hod_08     0B      -     21.1M  -
rpool/home@hod_09     0B      -     21.1M  -
rpool/home@hod_10     0B      -     21.1M  -
rpool/home@hod_11     0B      -     21.1M  -
rpool/home@hod_12     0B      -     21.1M  -
rpool/home@hod_13     0B      -     21.1M  -
rpool/home@hod_14     0B      -     21.1M  -
rpool/home@hod_15     0B      -     21.1M  -
rpool/home@hod_16     0B      -     21.1M  -
rpool/home@hod_17     0B      -     21.1M  -
rpool/home@hod_18     0B      -     21.1M  -
rpool/home@dow_4      0B      -     21.1M  -
rpool/home@hod_19     0B      -     21.1M  -
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
It's only a space consumer IF the source dataset / zVol deletes or over-writes lots of data. The amount of space a ZFS snapshot uses is directly related to the amount of changes in the source dataset / zVol. Thus, sometimes a snapshot uses a tiny amount of space.
Can ZFS snapshots be used for recovering a single file? That would be kind of like Apple's Time Machine, only Time Machine snapshots and then copies over to an external drive.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Can ZFS snapshots be used for recovering a single file? That would be kind of like Apple's Time Machine, only Time Machine snapshots and then copies over to an external drive.
Yes.

As long as the entire file was on the pool at the time of the snapshot. Continuously written files, like log files, only reflect the state at the time of the snapshot.

Some people make the normally hidden ZFS snapshot directory visible, to allow simple user-initiated restores. For example:
Code:
root@~# ls -la /home
total 43
drwxr-xr-x  5 root    root    5 Oct  5  2016 .
drwxr-xr-x 22 root    root   22 Dec  6 01:57 ..
drwxrwxrwx  1 root    root    0 Dec  6 01:57 .zfs
drwxr-xr-x  2 user1   users   2 Nov 21  2020 user1
drwx------ 64 user2   users 118 Dec 23 19:41 user2

root@~# ls -la /home/.zfs/snapshot/dow_1
total 34
drwxr-xr-x  4 root    root    5 Oct  5  2016 .
drwxrwxrwx  2 root    root    2 Dec 23 19:00 ..
drwxr-xr-x  2 user1   users   2 Nov 21  2020 user1
drwx------ 64 user2   users 118 Dec 18 18:47 user2

Note the difference in "user2"'s home directory date. I used the snapshot from Monday, which shows the last directory level change from Saturday night.

Don't be misled by the seemingly world-writable .zfs directory. ZFS still won't allow writing to it.

OpenZFS does add an additional feature over Oracle ZFS, (if I understand this correctly). OpenZFS allows you to also mount a ZFS snapshot at any location. Oracle ZFS as used on Solaris only supports the hidden / visible path method for access to ZFS snapshots.
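Putting that together, a single-file restore is just a copy out of the snapshot tree. A minimal sketch, assuming a dataset rpool/home, a snapshot dow_1, and a file name that are all hypothetical:
Code:
# Make the .zfs directory show up in listings (it is reachable even
# while hidden; this only changes visibility).
zfs set snapdir=visible rpool/home

# Copy one file back out of the read-only snapshot.
cp /home/.zfs/snapshot/dow_1/user2/report.txt /home/user2/report.txt

# OpenZFS on Linux also lets you mount a snapshot read-only at an
# arbitrary location, as mentioned above.
mount -t zfs rpool/home@dow_1 /mnt/restore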
 