FreeNAS/ZFS reliability

Status: Not open for further replies.

divB (Dabbler, joined Aug 20, 2012, 41 messages)
Hey,

First of all: I have been using FreeNAS with ZFS as a backup machine for over a year, and it works flawlessly.

However, by accident I came across a tutorial and a couple of posts that worry me a little, because they describe worst-case scenarios caused by RAM failure or too little RAM.

I realize this tends to be an emotional discussion, but here I am looking purely for objective explanations.

In my opinion, the ultimate MCA (maximum credible accident, i.e. total data loss) should be "impossible" for a reliable system to reach through minor causes like a power outage or resource exhaustion.

So from the beginning, I read:

1.) Using non ECC RAM can cause total data loss.
I understand that there are no further checks, and that if a RAM cell is defective the error could be replicated over and over. But is this really a much bigger problem than on a conventional system (e.g. Linux + XFS/ext, Windows + NTFS, whatever)? I get that, due to COW, the probability of writing wrong data might be increased, but does this really lead to the ultimate MCA?
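To make the scenario in (1) concrete, here is a toy sketch (hypothetical Python, not anything from ZFS itself) of why a bad RAM cell is nastier for a checksumming filesystem: corruption that happens *before* the checksum is computed gets stored as "verified" data, while corruption that happens afterwards is caught.

```python
import hashlib

def write_block(data):
    """Toy model of a checksumming write: store the data with its checksum."""
    return data, hashlib.sha256(data).hexdigest()

def verify(stored):
    """Re-hash the stored data and compare against the stored checksum."""
    data, checksum = stored
    return hashlib.sha256(data).hexdigest() == checksum

good = b"important backup payload"

# Case 1: a bit flips AFTER the checksum was computed (e.g. on-disk bit rot).
data, checksum = write_block(good)
rotted = bytes([data[0] ^ 0x01]) + data[1:]
print(verify((rotted, checksum)))  # False: mismatch detected, repairable from redundancy

# Case 2: a bit flips BEFORE the checksum is computed (defective RAM cell).
corrupted_in_ram = bytes([good[0] ^ 0x01]) + good[1:]
print(verify(write_block(corrupted_in_ram)))  # True: the corruption is now "blessed"
```

Whether case 2 can really cascade into total pool loss in practice is exactly what the ECC debate is about.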

2.) Using less than 8GB RAM can cause total data loss.
How can that be? Of course a system can become unreliable, processes may crash, and parts of the data can get lost. But how on earth could a minor cause like running out of memory lead to total data loss?
This would also suggest that any simple DoS attack (a fork bomb, loading too many applications, or anything else that causes an out-of-memory condition) could destroy a complete FreeBSD + ZFS system. That's [...]

In particular, (2) worries me: if this is true, how is it different from a crash? A power outage? Any other unexpected event?

I understand that FreeBSD and ZFS form a server system, not a desktop system, and require reasonably reliable hardware. I also run a 24x7 server with a UPS etc.
But if my backup really is that sensitive and "BIBO unstable" with respect to external events, I can't trust it any more and will need to move away.

Can anyone clarify (objectively, not emotionally) what's really behind this?

Thanks
 

cyberjock (Inactive Account, joined Mar 25, 2012, 19,525 messages)
As for ECC RAM, there's a thread where I discuss that in detail, so I won't repeat it here: http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

As for less than 8GB of RAM: we don't know. It's one of those things where, if you are really interested in maximizing the reliability of your system, and you are choosing to go with ZFS, it's a concession you'll just have to accept. The real problem isn't that 8GB of RAM is required. The real problem is that too many people think "minimum requirements" means "I can ignore these because I know better". We already know better, which is why I personally put the 8GB of RAM in there. If you want to take the risk of doing less, feel free. But you'll get no sympathy from us when you choose to use less than the minimum requirements for FreeNAS. We see this all the time, and to be honest, I don't even bother answering people anymore. I don't care how much you want to cry over your data; I'm not in the business of spending my volunteer/unpaid time solving a problem you created by choosing to ignore the manual, regardless of why you did it. It was your conscious choice, period. 8GB seems to be the minimum amount of RAM, regardless of your expected performance, to keep the system stable. You want a 100TB pool and the minimum RAM for stability? 8GB. You want a 1TB pool and the minimum RAM for stability? 8GB. Now, I can guarantee you that a 100TB pool on 8GB will suck horribly in the performance arena, so you will definitely feel the pressure to upgrade.

How are these things different? Well, we're not exactly sure. Nobody has investigated it and knows the mechanism for failure. Generally, we consider you somewhat crazy to use less than 8GB anyway, so spending our limited developer resources on solving a problem that shouldn't exist just isn't reasonable. It's a matter of efficiency.

As for DoS attacks: that's a different way of bringing a system down, and you should be using a firewall to separate FreeNAS from the internet anyway. So unless you also have problems on the internal network behind the firewall, the argument is somewhat moot. Personally, I've never seen anyone complain of crashes related to a DoS attack.

As for BIBO, I'm not familiar with that term. If you could clarify what it means, I could probably discuss it further.
 

divB
Thanks, the ECC explanation is good. That's clear now.

But for <8GB RAM:
[...] We see this all the time, and to be honest, I don't even bother answering people anymore. [...]

I understand that, and I can imagine that such a system might not run reliably enough.

You can safely assume I am NOT asking whether I can run my box with <8GB RAM.

My point is that this is similar to a power outage, a crash, or any other exception that might occur. An operating system (and in particular a modern, enterprise one) MUST be able to handle exceptions, and I can hardly imagine that FreeBSD/ZFS does not.

There is a difference between a system that runs unreliably and one that loses all of its data simply because it does not handle exceptions.

So in particular:
* Can I lose all my data because of a power outage?
* Can I lose more data than what is currently in caches or being processed due to a power outage?
* Can I lose all my data because of a crash?
* Can I lose more data than what is currently in caches or being processed due to a crash?

If any of these questions is answered with "yes", that worries me. A lot. And I would like to understand why it should be the case.
If all of them are answered with "no", then I don't understand the difference from an out-of-memory condition. Stated differently: why would an arbitrarily low amount of RAM destroy my data (as opposed to the system merely not working at all, or not working reliably)?

Nobody has investigated it and knows the mechanism for failure.

I had classes in operating systems and even built one myself (a long time ago, though).
An operating system must handle exceptions and all possible external events properly.
As for RAM, for example: an OS is usually intelligent enough to detect out-of-memory conditions before they occur, so allocations simply fail. In rare cases, in an out-of-memory condition, the OOM killer must kill processes to free memory. But even this does not happen arbitrarily: less important/critical processes are killed first, for example.
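As a toy illustration of that victim-selection idea (hypothetical Python; the names and the scoring formula are made up, loosely inspired by Linux's badness score, not FreeBSD's actual implementation):

```python
def pick_victim(processes):
    """Kill the process with the highest 'badness': large memory use,
    adjusted by a priority knob so critical daemons are spared."""
    def badness(p):
        # Higher oom_adj makes a process a better kill target; negative
        # values shield important daemons even if they use a lot of RAM.
        return p["rss_pages"] * 2 ** p["oom_adj"]
    return max(processes, key=badness)

procs = [
    {"name": "critical daemon", "rss_pages": 5000, "oom_adj": -4},
    {"name": "nfsd",            "rss_pages": 8000, "oom_adj": -2},
    {"name": "runaway script",  "rss_pages": 9000, "oom_adj": 0},
]
print(pick_victim(procs)["name"])  # -> runaway script
```

The runaway script is chosen even though the shielded daemons also use plenty of memory; the system degrades gracefully instead of losing everything.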

So once again: I accept and believe that a system does not run reliably with less RAM, but I want to know whether I lose all my data when unpredictable conditions (out of memory (this includes <8GB!), infinite loops, kernel oopses, crashes, power outages, ...) occur.

As for DoS attacks, different method for bringing a system down, and you should be using a firewall to separate FreeNAS [...]

It's not about attacks! Of course my FreeNAS box is protected.

See above: bugs, glitches in the power supply, a cat sitting on the keyboard (with no active shell), power outages: nearly anything can cause an exception.
An infinite loop that allocates all memory or pegs the CPU could also simply happen due to a bug!
If the system goes down, that's fine.
If all data is lost, that's not good, and it's important to understand under which conditions that can happen.

As for BIBO, I'm not familiar with that term...

Never mind, it's a term from control theory; I just used it as a metaphor here.

Thanks
 

cyberjock
There is a difference between a system that runs unreliably and one that loses all of its data because it does not handle exceptions. [...] Why would an arbitrarily low amount of RAM destroy my data?

You aren't understanding me. Read this again:

How are these things different? Well, we're not exactly sure. Nobody has investigated it and knows the mechanism for failure. Generally, we consider you somewhat crazy to use less than 8GB anyway, so spending our limited developer resources on solving a problem that shouldn't exist just isn't reasonable. It's a matter of efficiency.

ZFS should be immune to any kind of partially written transaction. It should intelligently roll back to the last completed transaction, discarding the partially written data. So the answer to all of your questions should be that you won't lose the pool, but you might lose a couple of seconds' worth of writes.

* Can I lose all my data because of a power outage? Yes, and everyone should be using a UPS anyway, so this isn't a change to the status quo.
* Can I lose more data than what is currently in caches or being processed due to a power outage? Yes, but only if the transaction is incomplete. Typically it won't be more than about 10 seconds' worth of data, except in extreme circumstances and depending on how your data is being written. Any in-flight transaction could be lost.
* Can I lose all my data because of a crash? Yes, and it happens from time to time. Almost always to those with less than 8GB of RAM; people with more RAM seem to have crashes that don't cost them their pool. Again, the reason why <8GB of RAM matters is not understood. Just take it for what it is.
* Can I lose more data than what is currently in caches or being processed due to a crash? If you don't meet the system requirements, you might.
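To sketch what I mean by rolling back to the last completed transaction (illustrative Python only, not ZFS code; the class and method names are made up):

```python
class Pool:
    """Toy model of ZFS-style transaction-group (txg) semantics: the on-disk
    state only advances when a whole transaction group commits, so a crash
    mid-write discards the partial txg instead of corrupting the pool."""

    def __init__(self):
        self.committed = {}  # state reachable from the last valid root pointer
        self.pending = {}    # open txg: writes buffered in RAM (a few seconds)

    def write(self, key, value):
        self.pending[key] = value

    def commit_txg(self):
        # Copy-on-write: new blocks are written out, THEN the root pointer
        # flips atomically to reference the new state.
        self.committed = {**self.committed, **self.pending}
        self.pending = {}

    def crash(self):
        # Power loss or panic: anything not yet committed simply vanishes.
        self.pending = {}

pool = Pool()
pool.write("a", 1)
pool.commit_txg()
pool.write("b", 2)   # in flight...
pool.crash()
print(pool.committed)  # {'a': 1} -- pool intact, only the open txg is lost
```

That's the design intent: you lose the in-flight writes, never the already-committed state.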

Then let me add these:

* Can I lose data because I put ZFS on a hardware RAID controller? Yep.. sure can.
* Can I lose data because I put ZFS on a redundant hardware RAID? Yep.. sure can.

It's more about doing the right things with ZFS and less about specs, models, and brands. I could probably come up with more examples, but I think my point was made.

Those who "do the right things" are virtually immune to all of the nasty pool-eating problems. Every time we see someone lose a pool, it's usually because they didn't do what we tell people to do in the stickies. I'm always asking myself whether ZFS really is stable when all these people are losing data all the time; I think about it a lot, actually. And to be honest, I've never seen a case where someone did all the right things and still lost data. They always did several things wrong, justified their excuses to themselves, ignored the stickies, and paid the price for it. People joke with me in IRC because people really do show up with 2GB of non-ECC RAM, a Realtek NIC that isn't working right, and ZFS on a hardware RAID, and ask why they lost their data. Well, gee, all of those things are no-nos here. So, not surprisingly, your pool was lost. Feel free to be upset; it was your choice and you didn't make the right one.
 