Is there a ZFS virtualization problem?

jixam

Dabbler
Joined
May 1, 2015
Messages
47
Because ZFS has no "fsck"/"chkdsk" type tools to fix your pool when you manage to corrupt it.
ZFS has zpool import -F, which reverts to a previous uberblock; it seems designed for exactly this situation. Along with zpool scrub, one can be sure that the file system is still consistent.
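
Concretely, I am thinking of something along these lines (the pool name "tank" is just a placeholder):

zpool import -F -n tank   # dry run: report whether a rewind import would succeed, without importing
zpool import -F tank      # discard the last few transactions and import the pool
zpool scrub tank          # re-verify every block against its checksum
zpool status -v tank      # inspect the scrub result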

Meanwhile, there is no way to know if fsck fixed the file system or broke it further by dropping inconsistent parts.

It is not clear to me that random, silent corruption is preferable to losing a few seconds of writes in an otherwise undamaged pool. In fact, if one is in this situation, it seems like the ZFS integrity check is as relevant as ever.

Sure, but you can easily replicate data to a second (on-site standby) machine and a third (off-site disaster recovery) machine.
That is a lot of extra hardware and extra work when one already has a redundant VM setup with working backups. Those extra machines and processes can break too, and IMHO that must be factored in when deciding what is "preferred and safest".
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
ZFS has zpool import -F, which reverts to a previous uberblock; it seems designed for exactly this situation.
Only in the specific situation where the corruption is to newly-written data.
It is not clear to me that random, silent corruption is preferable to losing a few seconds of writes in an otherwise undamaged pool.
Nobody's saying it is; the point is rather that there's no such thing as a free lunch.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
ZFS has zpool import -F, which reverts to a previous uberblock; it seems designed for exactly this situation. Along with zpool scrub, one can be sure that the file system is still consistent.

This is just FALSE. It's like saying you can fix gangrene by cutting off your leg. Which is true, but really not a generally applicable fix.

A scrub does not detect, and cannot repair, any damage beyond simple corruption of a data block that is detectable via a checksum error. If an inconsistency gets committed because your hardware or system has broken ZFS's rules, such as flushing data to the drives incorrectly, you might end up with an irretrievable block because the data is LOST (on read, ZFS sees the bad checksum, cannot rebuild the block from redundancy, and therefore returns zeroes), or you might hit other edge cases that the designers did not anticipate, because they expected direct and reliable access to the disks.
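
To illustrate the distinction (pool name "tank" is hypothetical): a scrub walks the pool, verifies checksums, and repairs from redundancy where it can; where it cannot, all it can do is list the casualties.

zpool scrub tank
zpool status -v tank   # unrepairable blocks show up as a list of permanently damaged files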

The write cache in a typical RAID controller may hold many megabytes of data; some of the more recent LSI controllers have 8GB or more of read and write cache. Losing some or all of that is very bad.

Rolling back to a particular uberblock is really only possible as a fix if the issue is detected and mitigated very soon after the error is introduced, as in within seconds. That might help if the system crashes, since the crash gives you the needed pause in your I/O, but if you trash your pool and then write tens of thousands of additional transaction groups, you're going to lose all of that more recent data. That's why I carefully selected the words no "fsck"/"chkdsk" type tools to fix your pool: those tools are designed to work weeks or months later, and they are intended to validate the STRUCTURE of the metadata, not the consistency of the block checksums (which is what a scrub does). ZFS lacks tools to validate or repair the structure of the filesystem metadata.
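
You can see how narrow that rollback window is yourself (the pool name and device path here are hypothetical): each vdev label keeps only a small ring of recent uberblocks, at most 128 entries and fewer with larger sector sizes, so anything older than the last hundred-odd committed transaction groups is simply gone.

zdb -u tank           # show the active uberblock and its transaction group number
zdb -ul /dev/da0p2    # dump the ring of recent uberblocks stored in a device's labels

On a busy pool, that many transaction groups can be written in a matter of minutes.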

Therefore I would have to submit that zpool import -F is not "designed for exactly this situation".

Meanwhile, there is no way to know if fsck fixed the file system or broke it further by dropping inconsistent parts.

Well, that's certainly true, but it would also be true for ZFS if ZFS had some hypothetical fsck. Once blocks are corrupt in ZFS, they typically read back as zero-filled blocks, which means it is not too hard to end up with large amounts of stranded data floating around on a pool.

Basically this all boils down to ZFS being a particular way of thinking about storing data. If you are not interested in taking advantage of the hard compsci that was put into the design, by all means, go use ext4 or whatever alternative you prefer. ZFS isn't for everyone, and those of us who are willing to discuss it honestly will also concede that it has weak points, such as its reliance on pool integrity and a lack of fsck. You can address these issues using "the ZFS way" and get a workable solution. Or you don't have to, and then you get what you get.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
As a data point, the only data recovery tool for ZFS seems to run abysmally slowly, which highlights the challenges involved in implementing repair tools for ZFS (i.e., it's not a philosophical choice shoved down users' throats).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was indeed talking about the specific case of a crash causing data not to be written. This is the only situation where it matters that writes are reordered.

This is not true. With a 4GB cache, you could potentially have several transaction groups sitting in a RAID controller's write cache. Reordering those writes to do, say, an elevator sort could certainly get a later transaction group written to disk before an earlier one. Order matters. Maybe not to you, but ZFS was not written to allow out-of-order writes. The correct sequence of writes is an underlying assumption that ZFS makes about how the disk subsystem will behave.
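
For a sense of scale (the sysctl name is the FreeBSD one, and the throughput figure is purely illustrative): ZFS commits a transaction group every few seconds by default, so a multi-gigabyte cache can easily span several of them.

sysctl vfs.zfs.txg.timeout   # default transaction group commit interval: 5 seconds
# At, say, 500MB/s of sustained writes, a 4GB cache holds about 8 seconds of data,
# i.e. two or more whole transaction groups that the controller is free to reorder.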
 

jixam

Dabbler
Joined
May 1, 2015
Messages
47
It is in fact true. Even if blocks are written in a different order, everything eventually ends up in the right place. ZFS cannot see the reordering – unless there is a crash where the delayed blocks never get written.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
It is in fact true. Even if blocks are written in a different order, everything eventually ends up in the right place. ZFS cannot see the reordering – unless there is a crash where the delayed blocks never get written.
But ZFS does make certain guarantees to its users. And it's not alone in that regard; UFS + softupdates is in the same boat. Both ZFS and softupdates guarantee that when you pull the power in the middle of "whatever", you will never have an inconsistent state on disk. You will lose data in flight, sure, but the on-disk structures will always be intact.

To be able to do that, both need to rely on a disk device not lying about write completion - ever.

That's what a transactional model is all about. Sendmail similarly guarantees never to lose mail: if the receiving MTA acknowledges the mail, fine; until it has, Sendmail will never delete anything from the queue. Transactions ...
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
...every few months someone comes out of the woodwork to tell the resident grinch that he is wrong and should just concede, and these threads make such informative reads.
 

jixam

Dabbler
Joined
May 1, 2015
Messages
47
...every few months someone comes out of the woodwork to tell the resident grinch that he is wrong and should just concede, and these threads make such informative reads.
I think the pinned advice is only "wrong" in that it heavily promotes a One True Way and ignores that all else is not equal.

The pinned posts left me wondering if there was something fundamentally wrong with virtualization and ZFS, the way the RAID5 write hole is a fundamental issue and not just a buggy implementation.

Therefore, this thread has indeed been informative for me. It clarifies that the advice is mostly a strong preference and that reasonable people can have other priorities.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
The "don't Virtualize you fools!" principle is from more than 10 years ago. Things might have improved.
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
The problem is the number of people who come to the forums looking to rescue a VM setup that was designed to fail. If they had simply followed the guidelines and recommendations, they would, at the very least, likely not be as badly off.

There was a post last week about this. They had used a Proxmox VM disk and the pool would no longer import.

Many of the frequenters here care about not losing data, and I know I really hate having to tell people their pool is dead and everything on it lost, when it could have been prevented with just a little effort.


If you feel you are a hypervisor wizard and you know all your risks and choose to disregard them cuz the data isn't important or you have backups? Then you aren't really the target of the recommendations; you are going offroad, and that's totally fine as long as you are prepared.
 