Risk of using non-ecc ram

cyberjock · May 21, 2015

/gets popcorn.

Yet another thread on this topic. Woohoo!

Fun to watch when you aren't participating. :P

jgreco · May 22, 2015

Actually I prefer a discussion minus the absolutism. There's colors other than black and white in the world, and while there's good reason to strongly promote ECC, there are also use cases where it isn't strictly necessary. The usual problem is the same as with virtualization: people with insufficient guru-points will assume that they too understand it well enough to do it.

"Fate protects fools, little children and ships named Enterprise" -- Cmdr. Riker

But fate has it out for NAS users who engage in risky behaviour.

cyberjock · May 22, 2015

jgreco said:
"Fate protects fools, little children and ships named Enterprise" -- Cmdr. Riker

I liked that movie, despite what critics said.

jgreco · May 22, 2015

cyberjock said:
I liked that movie, despite what critics said.

Look who's participating in the discussion! ;-)

SirMaster · May 22, 2015

I see it from two situations.

1. You are building a new system in which to store your data. In this case i'd definitely say go with ECC, it's really not too difficult to accommodate.

2. You have an existing system and want to use it to store your data. This system already does not have ECC. The comparison to make is, is lack of ECC really a valid reason to dismiss ZFS as a storage option for this existing server. I would argue that it's not and that ZFS without ECC is not worse than any other software RAID without ECC.

This idea that ZFS is no worse off has been confirmed to me at least by several of the ZFS developers as well as one of the co-founders himself too so I tend to trust it.

jgreco · May 23, 2015

SirMaster said:
This idea that ZFS is no worse off has been confirmed to me at least by several of the ZFS developers as well as one of the co-founders himself too so I tend to trust it.

Well, that's wrong as outlined above, of course. The additional risk is definitely there, but can be partially mitigated through proper burn-in testing (which so many times new users don't bother to do). It's kind of like the risk of RAIDZ1. Yeah, it works, for many people. Statistically speaking, you're not personally likely to have problems with it (especially if you do proper burn-in testing), but the risk remains there.

Just because I've never been saved by the seatbelts in my car doesn't mean I don't use them. I have airbags too.

As for the "ZFS developers", now we're talking about people with a vested interest in seeing wider adoption of ZFS. Software devs are often optimistic. The official FreeNAS position is that ECC is strongly recommended.

As someone who's been providing support on these forums for years, we definitely saw a lot more problems in the form of pool corruptions and pool losses back in the days before I started telling people to buy ECC-supporting server grade hardware. I'm not militantly ECC-or-GTFO (read the link) but I have *absolutely* observed that pool corruption complaints on server-grade platforms are a LOT rarer than what we were seeing on cheap-ass consumer grade builds using stuff like the E-350. Now the truth of the matter is that I'd bet a hundred bucks that for every problem we saw reported that there were at least ten, hell, maybe even a hundred, users who had systems that had no problems. The question boils down to ... is that risk level acceptable to you?

We kind of assume people end up with FreeNAS because they love their data and want the integrity and reliability of ZFS. If you don't give a fsck, then why the hell are you here? There are many other ways to do NAS that are nowhere near as resource-intensive, can work on SUBSTANTIALLY smaller systems, etc. By the time you've committed to spending the money to build a ZFS-capable system, it just seems insane to not go that last 10% for the ECC.

Now, of course, I realize that some people do things because they want to. I buy a truck because I need the cargo carrying capacity. I enjoy the ability to throw whatever I need in the back, or to hook up a trailer, or to wade through deep snow in bad weather. I have absolutely no idea what the point of this truck is:

But I guess that's a thing and someone wants that. Seems useless to me. (What happens when you hit a bump?) But you too can build yourself a FreeNAS system with an i7, 64GB of non-ECC RAM, and 24 15K RPM drives configured as non-redundant striped storage. I'm sure it will be fast and cool, right up 'til something goes wrong.

I see my role here as being that of trying to push people towards doing the safe and sensible thing. Non-ECC is less-safe. That's not debatable. It's a fact. Is it an issue for me? Will it ever kill YOUR pool? Who the hell knows. Personally I have better things to worry about and there's already enough things out there that are bad. I don't need more risks.

SirMaster · May 23, 2015

jgreco said:
Well, that's wrong as outlined above, of course.

But why does the co-founder say it is not wrong and not more dangerous than any other filesystem. That's what I'm trying to get at. I'm speaking of course about Matthew Ahrens who co-developed ZFS at Sun from 2001-2010 and is the OpenZFS leader currently working on Illumos-gate for Delphix. I've spoken to him from time-to-time on the OpenZFS IRC channel I am sure you are already familiar with him. All the other platforms, BSD included are ports of mostly his work. Even the developers who work on the Linux port seem to agree too.

I mean he has been developing ZFS for 14 years, surely nobody knows the internals and their capabilities better than him.

I'm not convinced that just because of the presence of an ARC in ZFS that it is automatically more prone to corruption via memory. As long as the ARC is handled between memory and disk intelligently the risks can be mitigated at the very least as well as any other filesystem. Plus of course the other data integrity strengths that ZFS holds (that no other FS does) even without ECC still can make it the most reliable choice.

I think it's a stretch to try to claim that if you had 2000 data storage servers all without ECC and 1000 of them ran NTFS or EXT4 and the other 1000 ran ZFS that you would see more on-disk corruption manifest in the ZFS servers.

jgreco said:
As for the "ZFS developers", now we're talking about people with a vested interest in seeing wider adoption of ZFS. Software devs are often optimistic. The official FreeNAS position is that ECC is strongly recommended.

See I do not doubt this. It's clearly strongly recommended, but that's because ZFS can only make it's data integrity guarantees in the presence of ECC thus why it is strongly recommended. It's the weakest link otherwise.

jgreco said:
As someone who's been providing support on these forums for years, we definitely saw a lot more problems in the form of pool corruptions and pool losses back in the days before I started telling people to buy ECC-supporting server grade hardware. I'm not militantly ECC-or-GTFO (read the link) but I have *absolutely* observed that pool corruption complaints on server-grade platforms are a LOT rarer than what we were seeing on cheap-ass consumer grade builds using stuff like the E-350.

Once again I feel you are missing the point. I am not every trying to make a claim or say that non-ECC is of equal reliability in a ZFS system or any system for that matter. I clearly agree that ECC is more reliable with ZFS and that you will have a lower chance of corruption with ECC.

What I am trying to say is, would ZFS have been worse off than another filesystem.

I know that you have reported cases of pool corruption where the user was using non-ECC. But we can't be completely sure that they would really have been safer with NTFS or EXT4 instead. I do not believe that we could say with confidence whether or not they would have stood a lesser chance of data corruption with a different filesystem on their faulty machine.

Because once again, the ONLY point I am trying to get across is whether or not someone should be turned away or discouraged from ZFS in favor of another filesystem simply because they lack ECC support. Neither strong recommendations about using ECC form FreeBSD/NAS developers nor examples of ZFS corruption on non-ECC ZFS systems are sufficient evidence to make this claim.

You can find provide plenty of references from the NTFS developers at Microsoft who will strongly recommended the use of ECC memory on Windows Servers while using NTFS as well as documented cases of systems with non-ECC getting unrecoverable corruption on NTFS as well.

If the people who actually developed the ARC and other bits of code that handle the data integrity say that ZFS is not more likely to propagate memory corruption onto the disk than any other filesystem on average, then I think they would be be trusted, unless there is some scientific test data that actually compares the different filesystems and compares how well they handle the different memory failure scenarios. The developers have claimed to me that ZFS has a few tricks up its sleeve that it specifically was designed with memory corruption in mind so that it would not so easily and blindly propagate onto the disk should memory corruption actually happen. Not that it's not possible, thus why ECC memory is recommended, just that it's at least somewhat resilient to memory errors and perhaps more resilient or at least not less resilient than any other filesystem.

jgreco · May 23, 2015

SirMaster said:
But why does the co-founder say it is not wrong and not more dangerous than any other filesystem.

Co-founder of what? Paetzel's one of the primary devs for FreeNAS and I linked to his opinion.

The real problem is that we're talking small differences in risk percentage. Opinions are like arses, everyone has one. I don't really give a crap what a software developer thinks of his own awesomeness and how well he thinks his crap will hold up under duress.

The lack of filesystem repair utilities for ZFS introduces additional considerations for ZFS that do not apply to many other filesystems. You are welcome to go run a stack of a thousand ZFS versus a thousand non-ZFS servers and see what the observed reliability is like; all I can say for sure is that I would be forever nervous about the state of a pool that had some sort of error introduced into it, because there's nothing there to repair it. It is potentially damaged forever.

My background is the design of operating systems for medical monitoring devices. My nightmare is that a system could panic in the operating room and fail to reboot, and a patient could die as a result. I am highly oriented towards minimizing unnecessary risks when designing systems. I have a strong preference for ECC in order to increase the chances of data integrity not being compromised.

You seem to be interested in trying to quantify what exactly that risk is. I don't give a frak. The additional cost of ECC isn't onerous when you look at the cost incurred to get into the ZFS game to begin with.

So let me summarize this for you:

Boring conversation anyways.

Bidule0hm · May 23, 2015

I think the disagree come from this:

ZFS without ECC is not worse than any other software RAID without ECC.

ZFS without ECC is not worse than any other filsystems without ECC.

I'd say the first is true (AFAIK) but the second is false.

jgreco · May 23, 2015

Bidule0hm said:
I'd say the first is true

I'd have to disagree. The problem is that ZFS has the ARC cache. Most other software RAID needing to update a block would read-update-write from/to the disk because most software RAID operates at a layer below the filesystem, on a per-block basis. ZFS, on the other hand, has a much greater likelihood of retrieving a block being updated from the ARC, where there's a lot more chance that bits have rotted away if you've got a bad memory module, and then flushing that out to disk. This results in a greater opportunity for corrupted data to be committed to disk. Complete with valid checksums. Eugh.

Bidule0hm said:
the second is false.

But I agree with that, because damage to the pool is generally not repairable through any rigorous process such as a fsck.

Bidule0hm · May 23, 2015

Ok, that's what I wanted to know ;)

jgreco · May 23, 2015

Bidule0hm said:
Ok, that's what I wanted to know ;)

Good. I'm not willing to entertain a discussion of "how likely" is it, because predicting how hardware fails is a losing game. All I know for sure is "greater opportunity." The specifics of that are like debating how many angels can fit on the head of a pin. Let someone else conduct a rigorous study. ;-)

Bidule0hm · May 23, 2015

Yep, I assumed that I was right but I wasn't sure (that's why the AFAIK) so I've learned something, thanks ;)

But yeah, between less costs + possible data loss (even if not very likely) and more costs + reliability I chose the second one :)

SirMaster · May 27, 2015

jgreco said:
Co-founder of what? Paetzel's one of the primary devs for FreeNAS and I linked to his opinion.

I was referring to the co-founder of ZFS itself, the guy who actually wrote most of the code that we are all using today. The guy and the rest of this small team who actually invented ZFS, came up with it's subroutines, figured out how to reliably prevent and repair bitrot, ran it through it's paces and vast amounts of systems testing, etc.

http://www.open-zfs.org/wiki/User:Mahrens
http://patents.justia.com/inventor/matthew-a-ahrens?page=3

The other co-founder would be Jeff Bonwick but he stopped being a part of it in 2010 when Oracle bought SUN. Matt though has still been continuing to push ZFS forward every day since it's inception in 2001.

Developers on other platforms like BSD and Linux are mainly just porting the work that was done at SUN Microsystems where ZFS was created. Nowadays they port the new ZFS development from Illumos (the ZFS upstream) which is being created by the developers at Delphix, Joyent, and Nexenta, all still being lead by Matt. Very little "new" code is developed on BSD or Linux, mainly just adjustments to the codebase to make the minor platform-specific changes in order to get everything to work there. But even then, the developers have been working at minimizing any platform differences.

jgreco said:
The real problem is that we're talking small differences in risk percentage. Opinions are like arses, everyone has one. I don't really give a crap what a software developer thinks of his own awesomeness and how well he thinks his crap will hold up under duress.

But it's not just opinion, I mean they obviously had to run ZFS through loads of real testing while designing it and getting it ready for production use. Surely they compared it and benchmarked it against other file systems and measured the impacts of ECC vs non-ECC memory.

jgreco said:
My background is the design of operating systems for medical monitoring devices. My nightmare is that a system could panic in the operating room and fail to reboot, and a patient could die as a result. I am highly oriented towards minimizing unnecessary risks when designing systems. I have a strong preference for ECC in order to increase the chances of data integrity not being compromised.

You seem to be interested in trying to quantify what exactly that risk is. I don't give a frak. The additional cost of ECC isn't onerous when you look at the cost incurred to get into the ZFS game to begin with.

Boring conversation anyways.

Yes, but this is a very different viewpoint than many users who come to this forum care about either.

The only reason I like to discuss this is because it's a real situation that I see users ask about often in regards to data storage and to ZFS. The user has an existing system that they want to use to store data of some sorts and of varying importance. Building a new system is sometimes out of the question. They simply want to know whether or not they can use ZFS or should choose a different software for this existing hardware.

There clearly are still many benefits to using ZFS even on a non-ECC system that can provide great reliability features that other systems cannot. I think it's worthwhile to explore the benefits and downsides to using ZFS even on non-ECC RAM.

After all, no data storage system is 100% reliable. Data storage like anything is a balance of risk, reliability, cost, and performance.

But surely you can see from my point of view that if i've been told from the guys who invented ZFS and who are still the primary guys driving the project forward today that:

"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM."

That I would be inclined to believe them as there is nobody else with nearly as much overall, in-depth, or otherwise experience with the ZFS filesystem and it's internals and operation than anyone else on the planet. Surely they more than anyone else know ZFS' true strengths as well as its limitations.

Robert Trevellyan · May 27, 2015

SirMaster said:
"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM."

Regardless of the risk of corruption, it is important to remember that all those other filesystems have recovery tools available, but ZFS doesn't. I'm sure this was already mentioned in this thread.

SirMaster · May 27, 2015

Yes, I did mention that in an earlier post when I was responding to one of jgreco's original posts. I thought that was a good way to put it and an important point to consider.

FS recovery tools are nice, but they aren't foolproof. Chkdsk on NTFS and fsck on EXT or BTRFS can work sometimes, but other times they may not help at all. ZFS takes another approach and that is to minimize the chances of an inconsistent filesystem via its atomic write design. They specifically choose not to have a fsck tool by design of course. It also tries to minimize metadata corruption by always keeping at least 2 copies of all metadata, and if your ZFS also has redundancy (nearly everyone does of course) then you have even more copies of the metadata. Many people use RAIDZ2 for instance where ZFS will keep 4 copies of all metadata and it only needs 1 copy in tact to be able to function.

I would bring to question how often ZFS would actually become corrupted beyond it's own ability to function compared to how often the other filesystems like EXT and NTFS become corrupt beyond the recovery ability of chkdsk and fsck.

Perhaps if NTFS or EXT became corrupt and the recovery tools were able to fix the issues, ZFS may not in that case have even become corrupted at all and may have been able to repair itself from the redundant metadata. And perhaps in cases where ZFS becomes so corrupt that it cannot recover itself, that level of corruption would be too great even for chkdsk and fsck to recover from either.

jgreco · May 27, 2015

SirMaster said:
But it's not just opinion, I mean they obviously had to run ZFS through loads of real testing while designing it and getting it ready for production use. Surely they compared it and benchmarked it against other file systems and measured the impacts of ECC vs non-ECC memory.

What a quaint claim. What exactly do you think they were designing ZFS to run on?

Hint: It might have been Sun kit, all of which is ECC.

solarisguy · May 27, 2015

jgreco said:
But it's not just opinion, I mean they obviously had to run ZFS through loads of real testing while designing it and getting it ready for production use. Surely they compared it and benchmarked it against other file systems and measured the impacts of ECC vs non-ECC memory.

Click to expand...

What a quaint claim. What exactly do you think they were designing ZFS to run on?
Hint: It might have been Sun kit, all of which is ECC.

Let me add to jgreco's comment, that at that time Sun has added more ECC checks to their systems, beyond ECC RAM. And I have serious doubts that Matt Ahrens or OpenZFS have carried large scale tests on systems with non-ECC memory after 2010.

Let me insert here the famous quote :)

we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.

Thus:

known knowns can be exemplified by non-ECC memory impact on ZFS reliability;
known unknowns are unattributed causes of ZFS total failures;
unknown unknowns are, well, unknown...

If one is concerned with reliability and availability, the known factors are taken into account: redundancy, high quality software and hardware, proper procedures, etc. There are calculations that show relative gains or losses in probability of data loss depending whether RAIDZ, RAIDZ2 or RAIDZ3 is chosen. There are no calculations, that I am aware of, showing how different filesystems behave under ideal circumstances. However, there seems to be a consensus that original UFS, FAT, ext and other filesystems with known flaws are not as good as Btrfs, NTFS and ZFS (just examples..., insert your best filesystem here).

So why would you risk using non-ECC memory, if non-ECC RAM is a known factor in data loss? Only because there are no specific numbers attached to the frequency of the phenomena? But you had chosen ZFS, despite no values that show how many times ZFS is better than NTFS...

It could very well be that systems with ECC RAM are just better hardware, so the positive effect of ECC RAM is compounded. But I do not care whether that factor is a known unknown or an unknown unknown, ECC RAM is something we know we know...

P.S.
I did not learn about non-ECC RAM impact until 9.1, so at home I did have FreeNAS 8.2.3 on a server with non-ECC RAM.

jgreco · May 28, 2015

SirMaster said:
FS recovery tools are nice, but they aren't foolproof. Chkdsk on NTFS and fsck on EXT or BTRFS can work sometimes, but other times they may not help at all. ZFS takes another approach and that is to minimize the chances of an inconsistent filesystem via its atomic write design. They specifically choose not to have a fsck tool by design of course. It also tries to minimize metadata corruption by always keeping at least 2 copies of all metadata, and if your ZFS also has redundancy (nearly everyone does of course) then you have even more copies of the metadata. Many people use RAIDZ2 for instance where ZFS will keep 4 copies of all metadata and it only needs 1 copy in tact to be able to function.

No, it's still only two copies, though those data blocks are protected with redundancy. The redundancy isn't really an additional copy.

I would bring to question how often ZFS would actually become corrupted beyond it's own ability to function compared to how often the other filesystems like EXT and NTFS become corrupt beyond the recovery ability of chkdsk and fsck.

Perhaps if NTFS or EXT became corrupt and the recovery tools were able to fix the issues, ZFS may not in that case have even become corrupted at all and may have been able to repair itself from the redundant metadata. And perhaps in cases where ZFS becomes so corrupt that it cannot recover itself, that level of corruption would be too great even for chkdsk and fsck to recover from either.

Well, let's think about this hypothetically. There are all sorts of metadata smashes that could adversely impact a pool, but some of the more insidious ones would involve the marking of allocated blocks as free in the freelist. Especially if some of those involved metadata, which would then be a great way to eat ones own tail.

So you're "okay" after the pool damage and you're "okay" for a long time afterwards, because by sheer dumb luck the space isn't allocated. Then one day you try to fill the pool a little more, and suddenly you've overwritten both copies of the metadata for inode 4.

That's clearly damage that's well within the scope of chkdsk/fsck, but incredibly dangerous to a ZFS pool.

Important Announcement for the TrueNAS Community.

Risk of using non-ecc ram

Inactive Account

Resident Grinch

Inactive Account

Resident Grinch

Patron

Resident Grinch

Patron

Resident Grinch

Server Electronics Sorcerer

Resident Grinch

Server Electronics Sorcerer

Resident Grinch

Server Electronics Sorcerer

Patron

Pony Wrangler

Patron

Resident Grinch

Guru

Resident Grinch

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Risk of using non-ecc ram"

Similar threads