Slideshow explaining VDev, zpool, ZIL and L2ARC for noobs!

trek102

Dabbler
Joined
May 4, 2014
Messages
46
Sorry for the disjointed text, but the quoting doesn't really work. Might be my browser.

Really? You've never had fsck do anything but preen, never found any contents in a lost+found directory, never found a mysteriously truncated file?
>> Yes, maybe, but nothing as disastrous as described in the PowerPoint about ZFS.


It's a complexity issue. The design expects to keep bad things from happening to begin with. A ZFS system can be storing a petabyte or more of information, and it would take both a huge amount of memory and a huge amount of time to do a full "fsck-like" pass on ZFS. Your Linux MDADM, LSI RAID, and ext4 do not have the ability to manage petabyte-sized filesystems.

>> I doubt size is so relevant. You can always split data across filesystems. Practicality, manageable complexity and recoverability are much more important. Not having recovery tools is just negligent. Who cares if an fsck takes resources? A filesystem should let the user decide whether their data is worth that resource.
>> Besides, ext4 can handle volumes up to 1 exbibyte (2^60 bytes).


Then ext3, ext4, and ffs all need to be labeled as no-gos too, because they also are not portable across "different server environments."

>> They are indeed. I have ported them many times.

How do you build a several petabyte sized filesystem with LVM?

>> Who needs a petabyte in one single filesystem?

Already discussed.



Really? "force" mounting shouldn't be allowed? There's a reason it is called "force" mounting, it bypasses the safety checks. And I betcha I can dd /dev/zero onto a bunch of MDADM or LVM disks and destroy them, so apparently they're not very good at preventing damaging actions.

>> That's not what I said. I said force-mounting should not be allowed to damage data integrity.

This is why ECC memory is pushed so heavily. But other filesystems can also be damaged by their own repair tools if bad memory is present, so this is a specious argument.



This is just an insane statement. The power of ZFS is that you can easily have a terabyte of RAM and terabytes of L2ARC fronting a petabyte of hard disk storage, and it will be insanely fast; your RAID controller with hardware cache and backup battery can't even begin to do that.

>> I doubt size is so relevant. You can always split data across filesystems. Practicality, manageable complexity and recoverability are maybe more important.
>> Besides, ext4 can handle volumes up to 1 exbibyte (2^60 bytes).

Well, that's just an opinion, and one not really backed up by any facts.



That's not really an issue, though. ZFS is just designed differently.



But virtually EVERYONE uses compression, so you don't really have your facts straight.

>> No one I work with bothers with compression, but I am not complaining if ZFS can do it. I am just saying it's not anything special and might not be a feature that's worth all the other downsides.

But no downside either.

iSCSI works fine and is absolutely recommended IF you are willing to play by ZFS's rules. I've said many times that ZFS, being a CoW filesystem, needs lots of resources to make iSCSI work well, but properly resourced, it will make HDD storage perform almost as well as SSD.

>> Maybe, but that doesn't come out in the PowerPoint. It reads like it's a mess even with resources and lots of trial and error, while under Linux it works out of the box and performs pretty well. Again, practicality is important.

And lots of people use it as ESXi datastore in the real world, so I don't know if this is just you making uninformed statements or what. Lots of enterprises use ZFS for their most challenging storage needs where performance on huge amounts of storage is a key consideration, because when you give ZFS a bunch of resources, it will give you storage that is much faster than standard Linux or BSD filesystems.

>> Well, the PowerPoint does not encourage the use as an ESXi datastore. So I am not sure if it's wise to risk it if some people say yes and some say no. At least with Linux filesystems it works and performs 100%.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
- First of all, and this hasn't even been mentioned here: zpools are not (!) compatible across different server environments.
This is either very ignorant or stated in bad faith. ZFS is probably the most portable filesystem out there that does not have "FAT" in the name. Yes, different platforms support different features, but minimal planning allows you to use the same pool on:
  • FreeBSD/FreeNAS/TrueNAS/other FreeBSD-derived things
  • Linux, with out-of-the-box support in Ubuntu, even
  • Illumos
  • OS X
  • Windows, if you're crazy enough
  • Some others, like NetBSD
If you're interested enough, you can even go for Oracle Solaris.
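To illustrate just how mundane moving a pool is, here is a minimal sketch (the pool name "tank" is hypothetical, and you still need to avoid enabling feature flags the target platform does not support):

# On the old system: cleanly export the pool
zpool export tank

# On the new system (FreeBSD, Linux, illumos, ...): list pools available for import
zpool import

# Import it by name
zpool import tank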

- ZFS comes with a bunch of complexity but no real new features (vdevs already existed in Linux, it's just called LVM); why even use more than one vdev in a zpool if a problem in any vdev kills the entire zpool? Obviously it's more secure to have a separate zpool for each vdev. So just even more complexity for no apparent benefit.
"RAID 10 comes with no apparent benefit. If any group of disks fails, there goes my whole array. Obviously it's more secure to have a bunch of individual RAID1 volumes."
To put it differently, it's a compromise. Sure, you risk losing more data at once, buuuuuut, you can, you know... manage your storage much more easily? No "oh, I need 1 TB of contiguous space, but I only have 3 blocks of 500 GB available in separate volumes!" moments. And if you don't like that, use separate pools. Or don't be a masochist and make your vdevs reliable. I mean, partial data loss is better than complete data loss, but the goal really needs to be zero data loss.
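As a sketch of what a multi-vdev pool buys you (pool and disk names are hypothetical), all the space shows up as one manageable unit and growth is just another vdev:

# One pool, two RAIDZ2 vdevs, each able to lose two disks
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 raidz2 da6 da7 da8 da9 da10 da11

# Later growth: add another vdev to the same pool
zpool add tank raidz2 da12 da13 da14 da15 da16 da17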

Also, in all of 20 years, I had three hard disks physically die on me, so RAID5 worked pretty well. Some of our servers had ECC memory and some didn't.
I'm a small-time admin at best and in the past year or two, I've dealt with nine-ish failing hard drives of various types, SSDs included. I find it downright inconceivable that someone would manage 20 years with three dead disks.

Do we need all the fancy features of self-healing, etc.? In my 20 years of data and storage experience, I have never had anything like bit rot happen to me.
How would you know? You're not checksumming all your data, so all you can really state is that you've never noticed corruption. A corrupted JPEG is awfully evident, but what about a plain text file? A single bit flip can make some fairly substantial changes without being immediately obvious. The same argument can be made for many other situations. That random crash that one time? Maybe the RAID controller returned the wrong LBA and caused the kernel to panic.
For anything of relevance, you cannot just state "everything's fine"; you have to prove that everything is fine, and ZFS does that.
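Proving it is, in fact, a one-liner (pool name hypothetical); a scrub reads back every checksummed block, and the status output names any files that failed verification:

# Verify every block in the pool against its checksum
zpool scrub tank

# Progress, per-device error counters and, with -v, the affected files
zpool status -v tank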

- Scrubs (a key feature of ZFS self-healing) can cause more damage than people know.
The idea that scrubs cause damage is borderline mythical and not a serious practical concern. Besides, ECC is standard in servers.

reading about it just makes you want to hug your RAID controller with hardware cache and backup battery :)
Hugging a RAID controller does not come to mind, really, at any time. Furiously cursing at it because it did something stupid and is wasting my time? That sounds more like a RAID controller.

- I am not sure if ZFS encryption has any advantages over LUKS so no benefit here again
You can zfs send your encrypted data to a third party and have them keep it safe, scrubbing it regularly and protecting it with all of ZFS's capabilities, without ever giving them the decryption key.
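A minimal sketch of what that looks like with OpenZFS native encryption (dataset, host, and pool names are made up); a raw send ships the ciphertext, so the receiving side never needs the key:

# Snapshot the encrypted dataset and send it raw (still encrypted)
zfs snapshot tank/secure@backup1
zfs send -w tank/secure@backup1 | ssh backuphost zfs receive backuppool/secure

# The third party can scrub and verify it without ever loading the key
zpool scrub backuppool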

compression is nice in theory
And awesome in practice. Just last weekend, I moved 9 TB of old satellite images and related data from a legacy system running LSI/Dell HW RAID to a FreeNAS server. The old system was 1 GbE only, so I could do gzip-6 at line speed. In the end, I got a ~1.18x compression ratio. I put together a bunch of sample satellite images the other day and got ~1.4x compression.
This adds up to substantial space savings. Plus, if this stuff ends up in ARC or L2ARC, it'll be compressed in there as well. Storing extra bits is nice, but magically increasing the amount of stuff you can cache in RAM? That's fantastic.
I'll grant you that dedupe is close to hopeless, but it's useful in some niche scenarios.
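For the record, a rough sketch of how compression is set and checked (dataset names hypothetical); it is a per-dataset property, and the achieved ratio is reported as another property:

# lz4 is the cheap everyday choice; gzip-6 trades CPU for a better ratio
zfs set compression=lz4 tank/data
zfs set compression=gzip-6 tank/archive

# See what it actually achieved
zfs get compressratio tank/data tank/archive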

>> Besides, ext4 can handle volumes up to 1 exbibyte (2^60 bytes).
Not with any sort of practicality.

>> Maybe, but that doesn't come out in the PowerPoint. It reads like it's a mess even with resources and lots of trial and error, while under Linux it works out of the box and performs pretty well. Again, practicality is important.

>> Well, the PowerPoint does not encourage the use as an ESXi datastore. So I am not sure if it's wise to risk it if some people say yes and some say no. At least with Linux filesystems it works and performs 100%.
You have a presentation explicitly aimed at newbies telling them not to jump into block storage thinking it's a panacea and that file storage is probably more suitable for their needs. Very different from your interpretation.
"Works out of the box" is also not something that fairly describes Linux or most Unix-like OSes. That's not a value judgement, just a statement of fact. Setting things up is work and no Linux distro is magical about it.

>> Yes, maybe, but nothing as disastrous as described in the PowerPoint about ZFS.
Think "healthy fear", not "death threat".

>> Who needs a petabyte in one single filesystem?
Again, small-time admin here. I'm uncomfortably close to filling a 90-ish TB pool in half the time I thought was an absurdly over-the-top estimate of how much data I'd be storing. And that's with rather limited volume; the competition is probably running way more services and larger image files. And I'm not storing all the things I'd like to store.
1 PB is not chump change, but it's far from crazy these days. Sure, you could have multiple volumes, but you could also be extra-ultra safe and not use any RAID and manually manage copies on multiple disks. Except that's nonsense - it doesn't scale, performs terribly and is less safe.

And there's one fundamental point that has only come up tangentially: Manageability.

LVM's UI is nothing short of a trainwreck. Wanna figure out what physical devices are backing a certain directory? Here's a wall of text or two. The administration experience has the same heritage as the work of the average XML enthusiast, with a lot of useless formalisms that address all sorts of problems except those that exist in the real world. Take lvdisplay. I need a damned pager just to figure out how many logical volumes I have. Same goes for physical volumes.
ZFS? zpool status and zpool list show the relevant information at a glance. Visually, even, with the vdev structure being clear as day. And this is the tip of the iceberg. Add storage with LVM? Well, now you have to go expand this filesystem and oh wait, I kinda need some of this space over there... With zfs? zpool add and there's my new storage. Minimum hassle, very flexible.
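To make the comparison concrete, here is a hedged sketch of growing storage on both sides (volume group, pool, and device names are all hypothetical):

# LVM + ext4: grow the volume group, then the logical volume, then the filesystem
pvcreate /dev/sdd
vgextend vg0 /dev/sdd
lvextend -L +2T /dev/vg0/data
resize2fs /dev/vg0/data

# ZFS: add a vdev; every dataset in the pool can use the new space immediately
zpool add tank mirror da4 da5
zpool list
zpool status tank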
And LVM is an improvement over HW RAID.
 

trek102

Dabbler
Joined
May 4, 2014
Messages
46
@Ericloewe
Thanks for your detailed comments.
I really appreciate it and I accept all your arguments. Maybe I have just been lucky in the past, and my infrastructure is also not as demanding as yours (I only run 10-20 TB of usage: 60% documents, 10% media files, 30% application development with heavy DB usage; around 300 users access it simultaneously).

It seems that ZFS has a clear superiority in many areas over standard Linux filesystems/volume managers/RAID managers. And especially through the TrueNAS GUI it is extremely easy and user-friendly to administer. So I am keen to rely on ZFS; my only concern left is this:

"What to do when things go wrong". The Powerpoint slides refer several times to the fact that there are no tools for ZFS and that mounting could be damaging. So I am asking the experts, what would you guys do if something does go wrong and you need to either try to fix the pool or recover as much data as possible? Whats the approach with ZFS?

thanks again for all your helpful input!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
mounting could be damaging
It's been a while since I've gone through the document, but it's probably meant as a general warning. "Don't go -f'ing things without understanding what's going on". I can imagine scenarios with some minor cases of data loss (rolling back the last few transactions, that sort of thing), but I have a hard time imagining a catastrophic scenario just from importing a pool. Maybe if you just push a disk beyond its breaking point and it draws its last breath while reading pool metadata, or maybe a system so truly borked that even having disks attached is a danger to data, but these scenarios are not exclusive to ZFS.
I believe that the presentation was also written from the point of view of someone trying to guide newbies with zero storage/sysadmin experience. So this sort of warning is general advice more than ZFS advice.
As a sidenote, an improvement made to ZFS something like two years ago allows for pools to be imported if there are missing vdevs, with a number of caveats. But, if you have a dataset that's set to copies=2 or =3, there's a very good chance that you can recover all its contents, since ZFS tries to store copies in different vdevs, as much as possible. It's not magic or a panacea, but it was a cheap improvement with no real downsides.
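As a hedged illustration (dataset and pool names are made up), the copies property is set per dataset, and a troubled pool is usually approached read-only first:

# Keep two copies of everything in this dataset, spread across vdevs where possible
zfs set copies=2 tank/important

# If a pool won't import normally, try it read-only; -F asks ZFS to roll back
# the last few transactions if the newest ones are unreadable
zpool import -o readonly=on -F tank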

So I am asking the experts, what would you guys do if something does go wrong and you need to either try to fix the pool or recover as much data as possible?
In a nutshell, the approach is not letting things get to the point where bad things can start to happen. Very, very, very, very rarely does a pool get corrupted without hardware having gone crazy or just plain incompetence.
Since ZFS makes it easy to maintain redundancy, recover from disk failures, identify misbehaving disks, etc., the focus really is on prevention.
Recovery is possible to some extent, but it's something needed rarely enough that you don't see it discussed much. I don't know how I'd really start to try to recover data from a pool that won't import, for instance, but I know that there are people who do.
I guess you could sum it up as "early warning". With traditional filesystems, you often only find out something is wrong when things start not working. With ZFS, you can immediately know if any bits are out of place and do something about it.
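In day-to-day terms, that early warning is just a glance at pool health (pool name hypothetical; FreeNAS/TrueNAS runs the scheduled scrubs for you):

# Quick check: prints "all pools are healthy" or details only for sick pools
zpool status -x

# Per-device read/write/checksum error counters and the result of the last scrub
zpool status tank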
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
So I am asking the experts, what would you guys do if something does go wrong and you need to either try to fix the pool or recover as much data as possible? What's the approach with ZFS?

Almost universally, "restore from backup" is my first option, because the one thing that ext4/lvm/mdadm/hardware RAID/ZFS all have in common is that they are not a substitute for proper backups. If I have a non-fatal read failure during a rebuild or other degraded-redundancy situation (e.g. I run mirrors, lose a disk, and the remaining member throws a CKSUM error on a read), ZFS will also do me a favour by identifying which file(s) had a problem, so I know what specifically needs to be recovered.

If I was helping someone else out, I'd be suggesting they pull the label information from a device to search for a previous transaction ID, then try force-importing the pool in read-only mode to see if it will present data there. Once that's done, back up and rebuild.
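A rough sketch of that sequence (device and pool names are hypothetical; exact zdb output varies by platform):

# Dump the ZFS labels from a member disk; among other things they record
# recent transaction group (txg) numbers
zdb -l /dev/da0

# Attempt a forced, read-only import and see what data is presented
zpool import -f -o readonly=on tank

# If it comes up: copy the data off, then destroy and rebuild the pool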

In the case where the pool is permanently unavailable, e.g. a hardware failure costing two devices in a RAIDZ1 vdev - to be honest, I wouldn't bother trying to recover data from any type of LVM/RAID/ZFS pool that's been damaged that heavily, as I can no longer trust the data. I'd revert to stage 1: "restore from backup."
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Use case matters, combined with a pool / hardware optimized for same. In a production environment where such planning and the funds for a proper setup should be available, no one expects to run every type of storage application (cold, video editing, VMs, databases, etc.) all out of one pool. There is no universal answer re: pool types but FreeNAS / TrueNAS and ZFS come pretty close to allowing you to tailor a specific, optimized solution for just about every storage scenario.

As for failures, I have had ~10 hard drives give up the ghost in the 30+ years I have used them. I have also had hardware RAID controllers (looking at you, XFX) do a fabulous job of corrupting data silently, with no fix in sight, because XFX on Mac OS. I have also had crummy software issues with the likes of SoftRAID, where stuff simply was not stable, or portable. Never mind their issues with encrypted volumes thanks to Apple's unwillingness to share specifications re: CoreStorage.

So I landed on ZFS and FreeNAS specifically, because it has a proven track record and it allowed me to put together a storage array that meets my needs well. That hasn't eliminated the need for backups, nor will my system break speed records, but it simply works and I sleep well having been able to consolidate a lot of data into a bit-rot-proof storage container. It's a much better solution than APFS with its absence of individual file integrity checksumming.

The irony for me is that my backups are less reliable re: the file system than my main server. However, I accept that in return for something that my computer can read natively. I'm still working on an off-site ZFS backup storage solution, but that may take some time.
 

trek102

Dabbler
Joined
May 4, 2014
Messages
46
Thank you all for the input on recovery which is all noted.
2 more questions.
1) I noticed that ZFS RAID expandability is somewhat limited. Is there any chance to make use of one or more additional HDDs when they become available in, let's say, a 3-disk RAIDZ array? The additional disk(s) would be the same size. Ideally, I would like to turn a RAIDZ1 into RAIDZ2 by adding a fourth disk, but I believe that doesn't work. Then I could do a RAID10, but it would mean moving the existing data around. Is there any other solution?
2) SSD cache: would you recommend even adding L2ARC and ZIL on a server with 8 GB RAM and a 4-disk array, mainly used for Nextcloud/Samba?
thanks
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Ideally, I would like to turn a RAIDZ1 into RAIDZ2 by adding a fourth disk, but I believe that doesn't work.
Correct. Not possible currently.

SSD cache: would you recommend even adding L2ARC and ZIL on a server with 8 GB RAM and a 4-disk array, mainly used for Nextcloud/Samba?
No and no. With less than 64 GB of RAM, don't even think about L2ARC, and an SLOG only accelerates synchronous writes (if at all) - which you don't have in your use case.
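For reference only - if you did have the RAM and a sync-heavy workload, cache and log devices are simply extra vdev types added to an existing pool (hypothetical names; not a recommendation for an 8 GB box):

# L2ARC: a read cache device (contents are disposable, no redundancy needed)
zpool add tank cache nvd0

# SLOG: a separate log device for synchronous writes, ideally mirrored
zpool add tank log mirror nvd1 nvd2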
 

trek102

Dabbler
Joined
May 4, 2014
Messages
46
Would your view change if the ZFS server also hosts an email server with large accounts, i.e. 10 years' worth of business email traffic including attachments?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
If speed is of great importance, I’d suggest a mirrored SSD pool for the database associated with the mail files. You can always back them up on a snapshot basis to a HDD pool.
 

trek102

Dabbler
Joined
May 4, 2014
Messages
46
Thanks - you mean L2ARC on SSD? And what size would you recommend (with 8 GB RAM)?
What about ZIL?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Would your view change if the ZFS server also hosts an email server with large accounts, i.e. 10 years' worth of business email traffic including attachments?
You want to host that on a machine with 8 GB of memory?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
Thanks - you mean L2ARC on SSD?
No, he means what he wrote: a separate pool for the jail running the email server, including the database or whatever the server in question uses for metadata, while the mail store stays on the spinning-disk pool.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
No, he means what he wrote: a separate pool for the jail running the email server, including the database or whatever the server in question uses for metadata, while the mail store stays on the spinning-disk pool.
Yup. Usually, it makes sense to keep the message corpus on an SSD. Attachments may be kept in a separate pool or in the same one - depends on how much room they take up. You can always back up the SSD pool to an HDD pool on a snapshot basis to secure it.
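A minimal sketch of that arrangement (pool, dataset, and disk names are all made up): a small mirrored SSD pool for the mail store, replicated to the HDD pool via snapshots:

# Mirrored SSD pool for the hot data
zpool create ssdpool mirror ada4 ada5
zfs create ssdpool/mail

# Periodic snapshot, replicated to the big HDD pool
# (assumes tank/backup already exists on the HDD pool)
zfs snapshot ssdpool/mail@day1
zfs send ssdpool/mail@day1 | zfs receive tank/backup/mail

# Later runs only send the delta between snapshots
zfs snapshot ssdpool/mail@day2
zfs send -i @day1 ssdpool/mail@day2 | zfs receive tank/backup/mail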

Running jails and other additional services in addition to FreeNAS / TrueNAS will take more than 8 GB of memory. The minimum recommended now is around 16 GB, with 32 GB preferable, and 64 GB+ for anyone contemplating an L2ARC.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
I'd focus on RAM. Better use of your money given your use case.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
OK, understood, so still no ZFS cache?
What are you aiming to achieve with the "ZFS cache"? Managing the L2ARC needs RAM. So with 8 GB in the system and an L2ARC, your performance will probably get worse. 8 GB is a ridiculously tiny amount of memory for a server. Our developer laptops have two or four times that much memory. You want to store "10 years' worth of business email" - so I guess it's business critical? Do yourself a favour and get a decent machine, please.

Look below in my signature for the specs of my "Home NAS". That's a reasonably sized server system for a small company while still being compact and rather quiet.
 

trek102

Dabbler
Joined
May 4, 2014
Messages
46
OK, thanks. I will look to upgrade.
In terms of RAID, for 4 disks, do you recommend RAIDZ2 or RAID10?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,828
Performance of a two-vdev mirror pool ("RAID 10") is likely better than that of a single-vdev RAIDZ2 pool. On the other hand, the single-vdev RAIDZ2 pool likely has better survivability.
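For illustration (pool and disk names hypothetical), the two four-disk layouts being compared would be created like this:

# Two mirrored vdevs ("RAID 10"): faster, survives one disk failure per mirror
zpool create tank mirror da0 da1 mirror da2 da3

# One RAIDZ2 vdev: roughly the same usable capacity here, survives any two disks failing
zpool create tank raidz2 da0 da1 da2 da3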
 