Question about how zfs initializes a two drive pool vs three

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
When installing TrueNAS on fresh hardware (the drives have never been used before), I'm curious about how ZFS configures a new drive pool when you start with two drives vs. three drives.

And my question is rooted in my prior experience with RAID, where, in a two-drive scenario, you have to mirror the drives if you want redundancy, but with at least three drives you can move to RAID-5, which of course tolerates the loss of any single drive among the three.

If one were to start their TrueNAS with only two drives (assume a separate drive runs the OS and is not part of the pool), does ZFS set them up as a mirror? Or does it implement some other algorithm such that adding a third drive causes ZFS to be able to merge the data from the first two and spread the load over all three, so that you have a true three-drive scenario where you could lose any one of the three and still not lose data? Or would you have to back the data up, re-create your ZFS pool with all three drives, and then restore the data?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
When installing TrueNAS on fresh hardware (the drives have never been used before), I'm curious about how ZFS configures a new drive pool when you start with two drives vs. three drives.

TrueNAS will not decide any of this for you.

It will be up to you to configure your pool. For that, you create your vdevs first and then add them to a pool.

You cannot convert a mirror to RAID-Z1 (a group of drives presented as one logical drive, with one drive's worth of capacity used for redundancy).

Also, ZFS will not re-balance data across the disks when you add storage to the pool. That is one reason to put in all the storage you need, and more, on day 1.

Should you have a pool with a single mirror and wish to turn it into a RAID-Z1 pool with an extra drive, you will indeed have to empty the server, destroy the pool, and re-create it from scratch.
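For anyone wondering what that looks like at the command line (the TrueNAS GUI wraps the equivalent steps), here is a rough sketch. The pool name "tank", the temporary pool "backup", and the disk names da0-da2 are placeholders, not anything from this thread:

# snapshot everything in the existing mirror pool
zfs snapshot -r tank@migrate
# copy the whole pool to temporary storage (a second pool called "backup" here)
zfs send -R tank@migrate | zfs receive backup/tank
# destroy the two-disk mirror and build a three-disk RAID-Z1 in its place
zpool destroy tank
zpool create tank raidz1 da0 da1 da2
# restore the data onto the new pool
zfs send -R backup/tank@migrate | zfs receive -F tank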
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
rooted in my prior experience with RAID

This is mostly worthless; common experience such as "initializing an array", or insight into how RAID5 parity works, does not carry over to ZFS, which does it all differently.

ZFS doesn't need to initialize an array, because the act of initializing an array on RAID1/RAID5 is an artifact of the per-sector protection provided by a hardware controller which doesn't understand the nuances of what's being stored. It only knows that the parity block in a stripe is supposed to be the XOR result of the data blocks. Therefore, RAID5 initializes the entire array so that the XOR results are consistent, even though the space is not actually allocated to any files.

Meanwhile, ZFS writes parity as a function of DATA BLOCKS (not stripes). Since it understands which sectors are part of the stored data, it doesn't care what's in unallocated sectors. When it writes a data block, it calculates the parity as part of the data block write. Thus, there is only parity for written data blocks. The following graphic would be confusing to anyone familiar with RAID5:

[Attached image: RAIDZ-small.png, illustrating RAIDZ's data and parity layout across the disks]

Each color is a different block of stored data, and ZFS has variable-length blocks. There's no way to predict where the parity sectors end up, because you only know where they'll be once you've written a data block to the pool. Therefore there is no *way* to "initialize" a pool, and also no *need* to... this has melted the brains of more than one RAID5 storage designer.

does ZFS set them up as a mirror

No. ZFS is very UNIX-y and will do whatever you tell it to. The FreeNAS GUI will probably recommend a mirror for redundancy, but will also let you set them up as a stripe (storage only, no redundancy available).
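At the command line, those two-drive choices would look roughly like this (pool and disk names are placeholders; in TrueNAS you make the same choice in the pool-creation screen rather than typing commands):

# two drives as a mirror: half the raw capacity, survives losing either disk
zpool create tank mirror da0 da1
# two drives as a stripe: all of the capacity, no redundancy at all
zpool create tank da0 da1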

adding a third drive causes ZFS to be able to merge the data from the first two and spread the load over all three

No. ZFS was originally designed by Sun Microsystems for their large server storage systems. They intended to be able to sell systems with 12, 24, 48, hundreds, or even thousands of drives. They included the ability to expand storage by adding new vdevs (such as a new shelf of disks), but reorganizing existing storage is an INCREDIBLY complicated task, which requires a feature called "block pointer rewrite" which ZFS lacks (but everyone's been wishing for for many years). Because the original ZFS design did not include BPR, and there isn't anyone willing to fund a huge rewrite of ZFS fundamentals, we're not likely to see this any time soon.
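For completeness, this is what expansion by adding a vdev looks like, as a sketch with placeholder names. Existing data stays on the old vdev; only new writes get spread across both:

# existing pool: one 3-disk RAID-Z1 vdev; add a second 3-disk RAID-Z1 vdev
zpool add tank raidz1 da3 da4 da5
# the pool now has two vdevs, and nothing is rebalanced automatically
zpool status tank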
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
No. ZFS was originally designed by Sun Microsystems for their large server storage systems. They intended to be able to sell systems with 12, 24, 48, hundreds, or even thousands of drives. They included the ability to expand storage by adding new vdevs (such as a new shelf of disks), but reorganizing existing storage is an INCREDIBLY complicated task, which requires a feature called "block pointer rewrite" which ZFS lacks (but everyone's been wishing for for many years). Because the original ZFS design did not include BPR, and there isn't anyone willing to fund a huge rewrite of ZFS fundamentals, we're not likely to see this any time soon.
Everything you said made sense ... even when you said that there is no way to know where the parity bit will end up ... because I have no problem with the notion that the file system knows what it's doing in terms of redundancy.

But towards the end there ... and based on what Heracles said above ... I have two questions:

1) Let's say you start out with a four-drive pool ... and after some time, one of the drives fails. Are you then able to replace that drive and it then gets mixed back as a true replacement in that pool in the same way that it works with raid-5? I'm talking macro level - as a user and not an engineer - would I be able to expect the same net result as I would with raid-5 in a failed drive scenario?

2) Assuming we again have a four-drive pool that's been online and in use for a long time then I decide to add a 5th drive ... does that then become a five-drive pool where I can lose any of the 5 without losing data, or does that 5th drive merely become another segmented volume that is an island unto itself and it will have nothing to do with the original four-drive pool?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
1. Yes, full replacement.
2. No, ZFS does not allow that. You cannot add a drive to a RAIDZ vdev. Also, topology changes like RAIDZ1 -> RAIDZ2 are not possible. You can only add another vdev, and adding a single-disk non-redundant vdev to a redundant (e.g. RAIDZ2) pool doesn't make much sense.
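For question 1, the replacement sequence is roughly this (disk names are hypothetical; the TrueNAS GUI does the same thing from the pool status page):

# identify the failed disk
zpool status tank
# swap in the new disk, then tell ZFS to resilver onto it
zpool replace tank da2 da6
# watch the resilver progress; the pool stays online throughout
zpool status -v tank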

May I suggest the ZFS primer?

All you ever wanted to know but were afraid to ask :wink:
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Are you then able to replace that drive and it then gets mixed back as a true replacement in that pool in the same way that it works with raid-5?

Depends on what you mean by "in the same way".

Yes, it recovers your data using an XOR-style parity reverse computation, in something similar to the computation a RAID5 would. However, a RAID5 array does this by reading a stripe of all three surviving hard drives (in your example) at once, back-computing the missing content, and then writing that. If we ignore the obvious optimization of reading more than one sector at a time, it is effectively reading LBA 0 on each drive, doing the math and write, then LBA 1, then LBA 2, then LBA 3, 'til the end. At the end, the array is successfully rebuilt. That's RAID5.

However, RAIDZ1 does it differently.

And I'm going to segue for a moment. What I'm going to say next isn't technically true, but is still the correct way to THINK about ZFS and how it works:

Every time ZFS reads a block, it has a checksum for the block available. If the checksum matches the contents of the block, the block is returned to the reader. If the checksum fails, ZFS attempts to rebuild the block from redundancy. If it is successful, it returns the block to the reader AND ALSO writes the repaired data back to disk. If, instead, the repair is unsuccessful, it returns a zero-filled block to the caller.

So, an interesting thing here -- READING data from a ZFS pool will repair damage within the data that is being read. Key concept. Remember this.

Now consider a ZFS scrub, the process that checks the entire pool for consistency. It starts with the root metadata block, and then traverses the entire pool, reading every block and making sure it is consistent. It is VERY similar to just running "find /pool -print | xargs cat > /dev/null", except that it also deliberately reads and validates the redundant sectors. The act of reading the pool causes the scrub to also repair any damage that it finds.
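In practice a scrub is a single command, and the status output shows progress plus anything that was repaired (the pool name is a placeholder; TrueNAS also schedules scrubs for you):

# walk every allocated block in the pool, verifying and repairing via checksums
zpool scrub tank
# progress, repaired data, and any remaining errors show up here
zpool status -v tank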

The thing is, it is doing a filesystem traversal, so unlike a RAID5 rebuild, an empty ZFS pool scrubs and resilvers in mere seconds. However, if you have lots of files and need to do lots of seeks, ZFS can take much longer than the sequential RAID5 rebuild.

So, circling around to what I said earlier, most "RAID5" knowledge does not translate well or cleanly to ZFS. It works differently. Some significant advantages, some disadvantages.
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
So, circling around to what I said earlier, most "RAID5" knowledge does not translate well or cleanly to ZFS. It works differently. Some significant advantages, some disadvantages.
That's quite clever how they ended up implementing INTEGRITY of the data and it seems redundancy is achieved by the implementation of this "checksum" block which it trusts as being accurate whenever it uses it. I would have to assume that it stores that checksum at least in three different places because if the checksum block gets corrupted, it would have no way of knowing if it is validating correctly. Having two other copies of the checksum would then give it the assurance if two of the three are identical (where it could then repair the damaged checksum).

All of that is great, but I still can't visualize how that implements the ability to replace a failed drive where the original pool is able to be brought back into its original state ... unless I'm not seeing something obvious in your answer?

You did say:
unlike a RAID5 rebuild, an empty ZFS pool scrubs and resilvers in mere seconds. However, if you have lots of files and need to do lots of seeks, ZFS can take much longer than the sequential RAID5 rebuild.

Which I THINK means that when a failed drive is replaced, the pool gets scrubbed and the data is re-configured accordingly, but that process will take longer than it would in a hardware raid-5 as the amount of data being scrubbed is a lot... and that's fine ... time to rebuild in home use cases isn't all that important ... but am I understanding what you're saying accurately?

To me, the term SCRUBBED means WIPED OUT ... so maybe that's where my confusion exists?

... and I apologize for fumbling my way through this ... I'm doing my best to comprehend what you're saying.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
by the implementation of this "checksum" block which it trusts as being accurate whenever it uses it. I would have to assume that it stores that checksum at least in three different places

Actually, it does not.
The "root" checksum is stored in the fixed location, and there are multiple copies of it. 32 per disk or something. Then, when some location (parent) references another location (child) e.g. like a file references its contents, the parent location stores its child checksum. The usual rules of redundancy apply for anything other than "root" checksums. If you have a single disk pool with mirrored metadata, two copies are stored for metadata and one for user data. If you have RAIDZ1, the checksums themselves are stored in blocks protected by RAIDZ1. So the correctness sort of flows from the root of the filesystem to every object in it. If you can reach some object from the root of the filesystem by following appropriate pointers, and checksums are correct all along the way, then the object you just reached is also correct.


ability to replace a failed drive where the original pool is able to be brought back into its original state

The pool is not brought into its original state. It is brought into an equivalent redundant state once the resilver is complete, but the new state is not identical to the original state, and it is not intended to be. So, the filesystem starts at the root and walks through every object. If the read fails or there is a checksum mismatch, it will attempt to reconstruct the object through redundancy, to the point where the object matches the checksum stored in its parent. If successful, the reconstructed form of the object is written back using normal write operations.

Edit to add: the process of looking through all objects in the filesystem to find and fix anything which fails to read or fails to match a checksum is called a scrub.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The old style re-silver / scrub did start from the top-level directory, and read its way down to the object list, verifying and correcting if needed. On a fullish disk set, this would take longer than a normal RAID-1 or RAID-5/6, as ZFS would perform more disk seeks to get to the data.

There is a recent optimized version of the re-silver / scrub released for OpenZFS. It is more sequential, to reduce disk seeks, and is supposed to be noticeably faster. Naturally, it still does the same overall job.



Now on to one of the odd features of ZFS, "copies=", which defaults to 1. This is not something used in TrueNAS.

By default, file & zVol data have 1 copy, (ignoring the redundancy, Mirroring or RAID-Zx).

Standard metadata, like directories, has 2 copies, (ignoring the redundancy, Mirroring or RAID-Zx). This is done in an attempt to overcome the problem of a bad directory entry block or blocks taking out an entire file. (The file could be terabytes in size, yet a single bad bit in a directory entry block could have killed it.) So, normal metadata is better protected than file data.

Critical metadata has 3 copies.

So, if you have a 2 way mirror with "copies=1", you get 1 copy of your data on each sub-mirror of a mirrored virtual device. But, if you have "copies=2", you get 2 copies per sub-mirror of a mirrored virtual device, thus having 4 total copies.

In the case of standard metadata, using "copies=2" causes standard metadata to have 3 copies. Basically 1 more copy than data, because again standard metadata is a bit more important than the data itself.

This "copies=" feature is useful for single disk ZFS installations, where you can set specific datasets to be better protected than others. Or all could be set to higher than 1.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Oh, I forgot to mention an example of redundant metadata.

My miniature media server has 2 "disks". One is a 2TB SATA hard drive and the other is a 1TB mSATA SSD. I take a small portion from each and mirror the OS. But, the rest is concatenated into 1 single ZFS pool for the media.

Every now and then I get a file with a problem. Since I have 2 backups (almost 3 backups...), it's not a problem to restore the file. Most times the error is in a video file, not music, photos or eBooks; probably because the videos take up 1,000 times more space, so they are statistically more likely to be hit.

One day I noticed I had a checksum error. But ZFS did not list the file I needed to restore. It automatically corrected the error! HOW??? It took me a bit of thought to figure out what happened. It was in the metadata, which has at least 2 copies by default. So, one copy was bad, ZFS picked up the other copy, which was good, (by checksum), and re-wrote the bad copy.

So this is an example of why ZFS has redundant metadata. I could have "lost" a very large file due to a bad directory entry.
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
This "copies=" feature is useful for single disk ZFS installations, where you can set specific datasets to be better protected than others. Or all could be set to higher than 1.
OK, my brain hurts ... when you say "single disk ZFS installations" - you do mean single PHYSICAL disk and not some virtual disk?
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
Edit to add: the process of looking through all objects in the filesystem to find and fix anything which fails to read or fails to match a checksum is called a scrub.
Thanks for clarifying that ... as I was reading and when I saw the word scrub my brain instantly visualized data being wiped ... though assuming maybe it was some process of re-organizing data then wiping the left-behind trash or something ... but it makes more sense now that I know it's a term that has a different definition in this context.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
I hesitate to ask ... but how does one have "almost" 3 backups ... is that kind of like being "almost" pregnant? :smile:

@Arwen and I had a little discussion about this in some other threads :smile:

She has 2 pools in the same server. Being 2 different pools, one may well survive an incident that affects the other (so act as a backup), but may very well also fall to the same incident (so not be a valid backup).
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Single physical disk. Of course you can use ZFS on a laptop or an embedded device running e.g. OPNsense firewall software.

ZFS is not strictly a "RAID implementation". It's a block device/redundancy manager, a volume manager, and a POSIX compliant filesystem all rolled into one.

It implements checksums at the block level for everything: data and metadata blocks. So if you have a mirror or some more complicated setup, it knows which blocks are valid and which are not. And in the single-disk case it knows if the data is good, or if it's "sorry, that one particular file is broken beyond repair, better make a backup of the rest and shop for a new disk". These checksums are, as I wrote, everywhere, in addition to the redundancy in the form of mirror, RAIDZn, ..., so it knows if all redundant copies are still good. It's not simply XORing N copies of a block to compute a "parity".

This is, in my opinion, the single most important feature, specifically when there is no redundancy, as is possibly the case in a laptop. ZFS knows if the data is good.
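You can see this at work even on a single-disk pool. Checksumming is a per-dataset property (on by default), and a scrub will still detect bad blocks; "zpool status -v" then names any files that could not be repaired. Dataset names here are placeholders:

# checksums are on by default for every dataset
zfs get checksum tank
# a stronger algorithm can be chosen per dataset if desired
zfs set checksum=sha256 tank/important
# with no redundancy there is nothing to repair from, but errors are still detected
zpool scrub tank
zpool status -v tank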
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
when you say "single disk ZFS installations" - you do mean single PHYSICAL disk and not some virtual disk?

Yep. And TrueNAS must NEVER be presented with any kind of virtual disk. All drives must be physical.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Looks like others answered your questions.

I have and continue to use ZFS on my laptops. The old one had only 1 x 2.5" SATA disk, and my new one has both 2.5" SATA and NVMe, so I could do proper mirroring for the OS. (And for that matter, the leftover space on both goes into a separate pool.)

Besides the data integrity benefit, I use alternate boot environments with my home Linux computers, (mini-desktop, mini-media server, and both old and new laptops). Thus, if a kernel update or OS update goes south, simply reboot to prior boot environment selectable with Grub.
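The boot environment idea, in very rough ZFS terms (dataset names are made up; the real tooling differs per OS: FreeBSD has bectl, while Linux setups typically rely on GRUB integration or tools such as zfsbootmenu):

# before an OS or kernel upgrade, snapshot the root filesystem dataset
zfs snapshot rpool/ROOT/default@pre-upgrade
# if the upgrade goes badly, clone the snapshot into a new boot environment
zfs clone rpool/ROOT/default@pre-upgrade rpool/ROOT/rollback
# then point the bootloader at the clone (this step is distro/bootloader specific)
# on FreeBSD, bectl wraps all of this:
#   bectl create pre-upgrade
#   bectl activate pre-upgrade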
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
Looks like others answered your questions.

I have and continue to use ZFS on my laptops. The old one had only 1 x 2.5" SATA disk, and my new one has both 2.5" SATA and NVMe, so I could do proper mirroring for the OS. (And for that matter, the leftover space on both goes into a separate pool.)

Besides the data integrity benefit, I use alternate boot environments with my home Linux computers, (mini-desktop, mini-media server, and both old and new laptops). Thus, if a kernel update or OS update goes south, simply reboot to prior boot environment selectable with Grub.

And the only reason why you are able to boot to the previous version of your Linux install is that you use ZFS?

I spent the better part of the day trying to spec hardware for my TrueNAS build, and it's just overwhelming how many options there are these days. RAM seems to be relatively inexpensive, but when it comes to the CPU, I just can't seem to figure out which one I should go with. I intend to use the NAS for both file storage and virtual machines. I typically have one VM running on my MBP; from time to time, I'll have two running at the same time, and I almost never have three going at once ... like maybe once or twice in a year will I have three going at one time. I just can't seem to gauge which CPU I should get that can handle that load without over-spending on more power than I need.

Oh, and I'm the only one who will be using the NAS ... I might also use it for Plex movie serving, but that won't be used all that often.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
It depends on what the VM's are going to be doing. How much CPU do they need.
Most computers do very little most of the time and use very little CPU. I say most, as someone will have a use case with a server that uses a lot of CPU. This "use very little" is particularly true when you are talking about a personal setup (but again, someone will have an edge case).
NAS services require very little CPU. Someone on this board said you could run a NAS on an RPi, as far as CPU goes. SMB is single-threaded, so for one user = a single core. Plex needs very little CPU, unless it's transcoding video, in which case it needs a lot.

My opinion, based on my experience: 4-6 real cores at a reasonable speed, rather than slow cores (which strike me as a false saving), which should also keep the power use down a bit. This is however assuming a great deal. It will allow a reasonable number of VMs (don't forget memory) without costing the earth, and allow reasonable transcoding. High core counts + high speed = expensive.

For transcoding performance, look at the Plex website; this may help you define which CPUs are usable in your case.
 

EasyGoing1

Dabbler
Joined
Nov 5, 2021
Messages
42
How much CPU do they need.
If this helps at all ... right now, I have a 2019 MacBook Pro with 32 gigs of RAM and an 8-core Intel i9 CPU, and I can run two virtual machines, usually one Windows 11 and one Linux (usually assigned 2 vCPUs and 8 gigs of RAM each) ... then have Chrome open, sometimes with more than 50 tabs (I know ... I'm lazy when it comes to closing things out) ... spreadsheet, word processor, IntelliJ IDEA ... various SQL tools ... all of that without the fans kicking up at all.

So I would say that my VM CPU tax is quite low...

The reason why I want to offload the VMs onto the NAS is that I'm considering going to an M1 Mac, and they don't play nice with virtual machines, so I need something that handles that for me. And it's always nice to offload work from your main machine when you can.
 