So... what comes after RAID-Z3?

cyberjock · May 26, 2015

jgreco said:
However, it is worth noting that that this has always been a dicey metric to begin with, and probably doesn't translate to useful data, in the same way that MTBF isn't really directly meaningful. Nonrecoverable read errors aren't likely to magically all be 1x10^14 for consumer grade drives and 1x10^15 for the enterprise drives, across many years, underlying technology changes, etc. It MAY be indicative of somewhat better materials/design/etc but it is also definitely indicative of the fact that they'd prefer enterprises to buy the more pricey drives.

The whole point of RAID, however, was to create an redundant array of inexpensive disks, and to tackle the problem that way. The difference between 1x10^14 and 1x10^15 isn't particularly meaningful in that context, because, again, data loss is tied to the probability of two drives losing the same block simultaneously - not just two drives losing some arbitrary unrelated blocks simultaneously.

Which is precisely why I said "claim". I did engineering work for many years. I was one of those poor souls that provide input into the MTBF calculations that we had to go by. They, for all intents and purposes, are basically useless metrics.

What makes this whole conversation uglier isn't what I'm about to quote below. But the fact that if we are going to dismiss the 1x10^15 as a bunch of crap, why should we even trust the 1x10^14 value? Isn't that number just as meaningless then? I'm not saying we should or shouldn't dismiss those numbers. I'm just saying that if we are going to dismiss the smaller number for enterprise drives, why the hell aren't we also dismissing the higher value for consumer drives? Seems to be logical to me...

jgreco said:
It's actually really only tangentially related, and I'm kinda surprised you'd say such a thing. What you're actually looking for is the likelihood of data loss on a pool, and how we can affect that in the future.

Bull shit it's only tangentially related. It's totally and unequivally related. Those failure rates that are sold are the math behind why RAIDZ1/RAID5 died in 2009. That precise math. Nothing else was involved.

RAIDZ1 dies "in 2009" for a very specific reason: the loss of the parity disk results in the elimination of redundancy for the pool. When you're rebuilding, you actually do need each and every sector on the remaining drives that contains pool data to be readable, or you will encounter some loss of data. That is very much intertwined with the URE values you're discussing.

jgreco said:
RAIDZ2, however, retains redundancy. Because of that, the URE values are of less concern. As long as the redundancy is capable of recovering the data, you're still fine. The problem with RAIDZ2 is that if you lose a drive, any block on the remaining drives which falls victim to a URE is still recoverable, but has effectively lost redundancy. Still, it is totally recoverable.

And you don't hear me harping on RAIDZ2 being dead do we? Because I know better than to assume that a single URE is a death kneel for ZFS. For hardware RAID, it could very well be since many RAID controllers drop a drive for the controller on first sign of problem.

We do run into a problem with that, however, as the rebuild times increase. The likelihood of a second drive failing during a multi-day rebuild with these modern large drives is substantially greater than the chance of failure striking during the rebuild of a much smaller drive.

jgreco said:
RAIDZ3 extends that out further. At this point, the impact of the URE rate is essentially meaningless, because you're multiply covered even for two failures. Again, as I pointed out earlier on, this is actually a problem in statistics, and statistically speaking, you're very likely to retain availability of a data block as long as you haven't managed to lose access to that block either due to a URE on that same block on the other drive, or lose that drive entirely.

Which I've said many times on this forum...

RedBear said:
By the way, I don't know if WD Reds are still being recommended much around here but they don't seem to be doing well on Backblaze's report. Link. Failure rates of up to 13%.

You know that link has been provided twice on the last 7 days, right? Linking to it again doesn't make it any more trustworthy than the other thread that linked to it yesterday afternoon.

Yes, Reds are still recommended around here. Probably in more than 50% of cases compared to all other brands combined. I'm using 10xWD Reds and I have no reason to be concerned, at all. Backblaze has been publicly humiliated for more than one report they've put out in the past, and even admitted that the report wasn't meant to extrapolate assumptions that people started making.

Ericloewe · May 26, 2015

It's hard to trust data from someone who happily writes down "Confidence Interval: 0%-106.9% [failure rate]".

rogerh · May 26, 2015

Ericloewe said:
It's hard to trust data from someone who happily writes down "Confidence Interval: 0%-106.9% [failure rate]".

At least they didn't put -3.6%-106.9%. Otherwise there would be a small chance of ending up with more working drives than you started with.

Ericloewe · May 26, 2015

Not really relevant for the WD Reds, since they're new in the so-called study, but they keep repeating the biggest mistake they could make with their data short of averaging the results for all drives:
They don't track the probability distribution function as a function of time.
Overall failure rates are absolutely useless long-term, as they're pretty much guaranteed to tend towards 1, if drives aren't replaced before they die.

What I *do* care about is the expected lifespan of the drive, not whether it'll fail (it will), whether there are any clusters of failures and how nicely the results fit some common models for device failures.

Bidule0hm · May 26, 2015

Let alone the temps and vibrations problems...

SirMaster · May 27, 2015

The thing is, "traditional parity RAID" is not sustainable in the long run and is a dying paradigm in the large data storage world.

Even if you add higher parity levels, the computations get more complex and eventually you will cross a threshold where rebuilding a disk takes longer than the average time between disk failures in large arrays. I've seen documented systems where this is nearly the case with RAIDZ3 already and moving to "RAIDZ4"parity would surely push it past the limit in even more systems. This is obviously a problem. You would always be resilvering and would suffer from a pool that is always operating at degraded performance which is simply not acceptable and certainly not optimal.

The first simplest solution is to make more pools rather than single large pools.

Another answer that some have turned to is to use mirrors which scale a lot better. Also the main benefit to mirrors are that rebuilds are extremely fast. It's essentially just a copy operation from the other mirror. Meanwhile, all the other mirrored sets can operate at full capacity and serve up all the IOPS that the customer needs during these mirror resilvers.
http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/

For even larger systems companies are moving to distributed filesystems.

See this video to explain what companies who have outgrown traditional parity RAID solutions are now using:
https://www.youtube.com/watch?v=x54s9cjMjPU

Now parity itself is not dead, but as RAID it is. There are new custom storage systems that still utilize parity data that are sustainable and scalable much larger via distributed means.

To understand how these new custom data storage systems operate I found these articles very helpful:

https://www.backblaze.com/blog/vault-cloud-storage-architecture/
https://code.facebook.com/posts/1433093613662262/-under-the-hood-facebook-s-cold-storage-system-/

It's interesting that both these companies created their own data storage software from scratch and both came up with roughly the same solution. Reed-Solomon erasure coding and storing distributed "shards" of data with parity shards. Backblaze splits files into 20 shards, 3 of which are used as parity, and Facebook splits files into 14 shards, 4 of which are used for parity.

jgreco · May 27, 2015

cyberjock said:
Bull shit it's only tangentially related. It's totally and unequivally related. Those failure rates that are sold are the math behind why RAIDZ1/RAID5 died in 2009. That precise math. Nothing else was involved.

RAIDZ1 dies "in 2009" for a very specific reason: the loss of the parity disk results in the elimination of redundancy for the pool. When you're rebuilding, you actually do need each and every sector on the remaining drives that contains pool data to be readable, or you will encounter some loss of data. That is very much intertwined with the URE values you're discussing.

Well, welcome to the conversation, then. What I said was only tangentially related is indeed only tangentially related, because we're discussing the post-RAIDZ3 world, and the impact of URE's on anything other than RAIDZ1 is minimal at best.

mjws00 · May 27, 2015

Actually the links by @SirMaster probably hit the next step of storage in my mind. Shards, dfs, and ceph like structures seem like they will play and are already at scale for the big data guys. Variations on that tend to be where I'd look longer term. Scaling exponentially with separation from direct hardware crushes raid/ZFS style schemes.

In fact the subsequent YouTube rabbit hole will probably cost me a significant chunk of time and money. ;)

jgreco · May 28, 2015

SirMaster said:
Even if you add higher parity levels, the computations get more complex and eventually you will cross a threshold where rebuilding a disk takes longer than the average time between disk failures in large arrays. I've seen documented systems where this is nearly the case with RAIDZ3 already and moving to "RAIDZ4"parity would surely push it past the limit in even more systems. This is obviously a problem. You would always be resilvering and would suffer from a pool that is always operating at degraded performance which is simply not acceptable and certainly not optimal.

The problem here is more that RAIDZn isn't supposed to be used on pools with heavy workloads, as the vdev performance is similar to one of the component devices. You're getting an archival quality storage vdev.

I too have seen where "rebuilding a disk takes longer than the average time between disk failures" but this is invariably from a misdesigned system that is expecting ongoing high performance out of such a pool.

The first simplest solution is to make more pools rather than single large pools.

Huh??? You could just add more vdevs to a single large pool for the same effect.

Another answer that some have turned to is to use mirrors which scale a lot better. Also the main benefit to mirrors are that rebuilds are extremely fast. It's essentially just a copy operation from the other mirror. Meanwhile, all the other mirrored sets can operate at full capacity and serve up all the IOPS that the customer needs during these mirror resilvers.
http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/

Except that in practice, that's not what happens. If you've got a system that is expecting high performance out of a set of disks, you suddenly develop a massive hot spot on the surviving mirror disk when one of the disks dies. This is a major impediment to rebuild.

While ZFS has no easy way to deal with that, other than to provision your mirrors at N+1 (or more) width, I can tell you that there ARE ways to cope with that problem from a CS point of view. For example, consider a 24 disk system that is designed to have 12 disks and then another 12 disks "mirroring" the data of the first 12. Again, remember we're not talking ZFS here... but rather a concept. On the first set of disks, you use a hash function to determine on which disk to store a given unit of data. On the second set of disks, you use a different hash function to determine on which disk to store that same unit of data. The data is stored on both sets of disks, but if a disk fails, because you've used a different hash to distribute the data on the other set of disks, the missing data from the failed disk is stored evenly across the entire set of disks in the opposing mirror.

This is extremely cool, because each drive in the opposing mirror only sees a 1/12 increase in its workload due to the lost disk - severely reducing the "hot spot" effect. And when you're rebuilding the failed disk, the data is also pulled evenly from all those other drives, meaning your rebuild probably progresses at nearly the write speed of the replaced drive...!

So anyways this isn't theoretical. This is the technique I developed more than a decade ago for Usenet service providers to provide hotspot-resistant redundancy at a server level (the hash controls distribution of articles to storage servers). It works very well. The whole thing ended up functioning as a primitive pseudo-distributed-file-system that took advantage of some various statistical properties of the traffic.

Getting back to the issue at hand: The usual problem with mirrors is that you often get to a point where you're reliant on that performance from each component device, because most often that's just two devices. If you can commit to keeping at least N+1 (and preferably N+2) width, where N is the number of drives (minimum 2) you need in a vdev to sustain your required performance, then this isn't much of an issue. But I don't usually see that happening. Most people get all crazy eyed when they think about all the "wasted disk space."

But that's what mirrors are for. We have RAIDZn for mega space and poor performance. We have mirrors for modest space and high performance.

For even larger systems companies are moving to distributed filesystems.

See this video to explain what companies who have outgrown traditional parity RAID solutions are now using:
https://www.youtube.com/watch?v=x54s9cjMjPU

Now parity itself is not dead, but as RAID it is.

The linked video is essentially just discussing the next evolutionary step for RAID parity, which isn't parity. It's a more clever strategy - great, we do need that. The fact that it's being stored on various servers is just a necessary side effect of scale, because there's no way to attach 100PB of storage to a single server and still be able to make use of its performance potential. So, yes, distributed filesystem for that, but we already knew we needed multiple storage servers for massive scale storage.

The problem is, moving to multiple servers isn't a comprehensive answer. You have to be able to also scale up an individual machine. Drives will continue to get bigger, and we've got to figure out how to cope with this. We need a little of both.

SirMaster · May 28, 2015

jgreco said:
The problem is, moving to multiple servers isn't a comprehensive answer. You have to be able to also scale up an individual machine. Drives will continue to get bigger, and we've got to figure out how to cope with this. We need a little of both.

But do you not think that mirrors cope better with larger disks than RAIDZ? To me it seems they would, but maybe I'm not thinking about it right.

jgreco said:
Huh??? You could just add more vdevs to a single large pool for the same effect.

I was kind of thinking of LLNL's Lustre storage system when writing that. They are the ones driving the ZFS on Linux port for the specific reason of using it with a large-scale distributed file system. But like you said too they split up the zpools into separate servers because it doesn't make as much sense to try to connect that much storage into one system for a number of reasons already.

jgreco · May 28, 2015

SirMaster said:
But do you not think that mirrors cope better with larger disks than RAIDZ? To me it seems they would, but maybe I'm not thinking about it right.

I believe you're still doing a meta traversal for the resilver regardless. With mirrors, people are less likely to go beyond two drive mirrors, because the space consumed rapidly becomes onerous - but two-disk mirrors are basically vulnerable to the same sort of URE argument that kills RAID5.

To put it in real world perspective: I'm building a new VM storage server here. I've decided to go with 2TB 2.5" disks, which gets me 24 in a 2U form factor. I can get seven three-way mirrors in that space, plus three warm spare drives, so this gives me a 14TB pool from 48TB of disks.. But for iSCSI, you still cannot use more than ~50% of a pool, so in the end 48TB of tri-way mirrored disk delivers 7TB of usable space.

People are generally incapable of making that sort of commitment to "waste" space. By way of comparison, you could do two 11-disk RAIDZ3's, end up with a 32TB pool, of which 16TB would be usable. It'd be SLOWER, yes, but it'd actually have better redundancy characteristics AND a lot more space.

Sometimes it isn't just the technology. Sometimes you have to get things through to people too.

I was kind of thinking of LLNL's Lustre storage system when writing that. They are the ones driving the ZFS on Linux port for the specific reason of using it with a large-scale distributed file system. But like you said too they split up the zpools into separate servers because it doesn't make as much sense to try to connect that much storage into one system for a number of reasons already.

Well, okay, yes, I would agree with the sentiment of moving towards multiple servers, but really only once a single server had grown to some size where it was impractical to expand, or due to some other similar issue (such as a DFS strategy). The problem is that large numbers of servers can scale just as poorly as large numbers of disks, and you're probably just moving pain points around.

So to bring that back around, the answer is that, no, I don't necessarily think RAID is dead. I do think that we need to be using smarter strategies than just trying to make a bunch of little disks into a larger LUN (the way your typical RAID controller does with RAID5/6).

I would love to see some additional vdev types added to ZFS, such as a strategy to implement some sort of better mirroring as I suggested a few messages above. However, ultimately, I expect that ZFS will end up being just a lower layer on a software stack that provides redundancy and high availability strategies through clustered techniques. ZFS will still be extremely beneficial in that it is able to provide for stored data the same sorts of protection that ECC provides to in-core data, and with modern systems now being large enough that ZFS is no longer onerously piggy, I would love to imagine that in ten years silent corruption of data would be a "!!!!WTF!!!!".

JoanTheSpark · Jun 7, 2015

wait a second..
On page 1 SSDs got mentioned, but the discussion wasn't up to speed for 'conclusions' of page2&3 - "URE doesn't matter with Raid/Z2 or higher, the problem is the time it takes to resilver the replaced drive" - right?
So far no one did lay down what is needed from a write/read speed point of view to get back to where 'we' were in the golden days of the 90's or 00's, when the capacity to write/read-speed ratio was better.
Wouldn't SSD's (or any technology coming thereafter) with higher write/read speeds alleviate this?

jgreco · Jun 7, 2015

It's the seek speed that's killing us more than anything else with HDD's these days.

While SSD's and other non-physical-platter based technologies may address the seek speed issue, the fact remains that we've gotten rather used to storing large amounts of data connected through a relatively narrow pipe. We have 1TB SSD now, and the ability to do 2TB, so if you look at something like the 1TB 850 EVO and that it's limited by the 6Gbps SATA (450MByte/sec read), we still have a device that takes nearly an hour to read fully. Okay, yeah, SAS 12Gbps. But oh look, a hypothetical 2TB 12Gbps drive would read at around 1000MByte/sec, so the doubling of bus speed is offset by the doubling of capacity.

And we're not likely to see 24Gbps SAS showing up RSN, since 12Gbps is still pretty new. So what we're likely to see is something happening with flash that's similar to what happened back in the 1990's with hard disks... the amount of time it takes to recover data off them is going to steadily increase.

JoanTheSpark said:
So far no one did lay down what is needed from a write/read speed point of view to get back to where 'we' were in the golden days of the 90's or 00's, when the capacity to write/read-speed ratio was better.

1) Make smaller drives, and

2) Store less data?

JoanTheSpark · Jun 7, 2015

So the cycle repeats?

..> Parallel > Serial > Parallel > Serial >..

Is just a matter of time really.. what happens if you could attach 8x 6Gbps SATA ports to a single SSD ~3GB/s?
That would get the time down from ~40min to ~6mins for that 1TB volume.
Too dreamy?

Consider that the stuff get's smaller with every iteration, this means shorter cables or even straight plug in ports - see mSATA or M.2.

Some nice read here: StorageSearch.com this way to the petabyte SSD - 2016 to 2020 roadmaps

jgreco · Jun 7, 2015

JoanTheSpark said:
So the cycle repeats?

..> Parallel > Serial > Parallel > Serial >..

Is just a matter of time really.. what happens if you could attach 8x 6Gbps SATA ports to a single SSD ~3GB/s?
That would get the time down from ~40min to ~6mins for that 1TB volume.
Too dreamy?

Consider that the stuff get's smaller with every iteration, this means shorter cables or even straight plug in ports - see mSATA or M.2.

Too dreamy and impractical.

Look back 20 years at a typical server grade drive like the Seagate Barracuda 4XL ST32272N. 2GB, about 5 MBytes/sec read speed, which implies it can read the full disk in around 7 minutes (400 seconds).

Now we have a 4TB drive like the Seagate ST4000DM000, 4TB, about 150MBytes/sec read, which implies that it can read the full disk in around 8 hours.

But even a 1TB SSD running at 6Gbps (around 500MBytes/sec) today would take around 30 minutes.

So, yeah, you'd be TEMPTED to say "well that SSD, we could just jack up the transfer rate and get that down to 7 minutes." Correct, but, there are practical limits to how much stuff we can throw around inside a system, and how fast we can process it, and the reality is that the storage grows more quickly than some of those other variables. And we're basically just ignoring the HDD problem, which isn't good.

Fundamentally, we've found storing more things is convenient, and the scales now tip towards having a huge storage device.

jgreco · Jun 7, 2015

Code:

# dd if=/dev/da0 of=/dev/null bs=1048576
2157+1 records in
2157+1 records out
2262765568 bytes transferred in 449.028252 secs (5039250 bytes/sec)
#

Hey lookitthat. Almost right on the nose.

SirMaster · Jun 10, 2015

Found another article talking about the current paradigm in large scale storage. The direction that storage system architectures have moved instead of RAIDZ4 etc.

http://wikibon.org/wiki/v/Erasure_Coding_and_Cloud_Storage_Eternity

Important Announcement for the TrueNAS Community.

So... what comes after RAID-Z3?

cyberjock

Inactive Account

Ericloewe

Server Wrangler

rogerh

Guru

Ericloewe

Server Wrangler

Bidule0hm

Server Electronics Sorcerer

SirMaster

Patron

jgreco

Resident Grinch

mjws00

Guru

jgreco

Resident Grinch

SirMaster

Patron

jgreco

Resident Grinch

JoanTheSpark

Dabbler

jgreco

Resident Grinch

JoanTheSpark

Dabbler

jgreco

Resident Grinch

jgreco

Resident Grinch

SirMaster

Patron

Similar threads

Important Announcement for the TrueNAS Community.

So... what comes after RAID-Z3?

Inactive Account

Server Wrangler

Guru

Server Wrangler

Server Electronics Sorcerer

Patron

Resident Grinch

Guru

Resident Grinch

Patron

Resident Grinch

Dabbler

Resident Grinch

Dabbler

Resident Grinch

Resident Grinch

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "So... what comes after RAID-Z3?"

Similar threads