So... what comes after RAID-Z3?

Status
Not open for further replies.


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Cloud storage is becoming many folks' primary way of using data?

It probably is true.

The cloud is just an idiotic marketing term for making something that used to be your responsibility into some random company's responsibility, with a corresponding reduction in any ability to determine what sort of actual reliability and security characteristics might be in play.

In most cases, the interests of that random company are more in line with "making money fast" than "reliably keeping your data safe and secure." To the extent that they're a larger company like a Netflix or an Apple, there's a better chance that the stuff will still be there next year, but look at the history of the industry. AOL, once huge, now a shell. MySpace, once THE default place for users to go for home pages, dead. T-Mobile/Microsoft Danger. Laugh. Etc. People these days live on Google and Facebook and Netflix. What happens when they eventually fail? What happens to all the pictures and data and content you've stored with them over the years?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
It is well-known, sir, that using "cyber-" and/or "-cloud" and especially "cybercloud" in any presentation, talk, or written proposal, for suits, results in increased gravitas and funding.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Heh. We aren't there yet. But I chuckle a little when I think about the amount of data I hoard. It is completely unnecessary, simply convenient, and mainly habit/hobby/relevant to business. There is almost nothing I couldn't stream, or download. Given instant access and corresponding bandwidth.... I would NEVER store it.

I'm not commenting on the masses, only the plugged in folks and enthusiasts. Might be a generation or two for trickle down... don't know. But consider current teens, my son is pretty plugged in. He has never had less than LTE mobility. He literally has unlimited bandwidth and storage (mine). From day one his content is cross-platform, mobile, and cloud based. Everything he consumes in terms of media streams. Period. It is 24/7/365 streaming (we have no data caps).

Whether that is safe or reliable is a different conversation. But most end users do a piss poor job of protecting irreplaceable data. Hedging across Google, Microsoft, and Amazon protects at a FAR higher level than most are capable of. For me it is about convenience, mobility, and cross-platform. I work everywhere and on every device with all assets available. (I am still paranoid and maintain private infrastructure. However, that would be unnecessary if it didn't make sense as a platform for me to service and provide.)

I'm not pretending to be "normal". Not saying it will work for anyone else. But I do see trends and have the luxury of bleeding edge.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
After reading the article online talking about how (and why) RAID5 "died" in 2009, and seeing a reference in the newbie slideshow that even RAID-Z3 will stop being capable of guaranteeing data protection as soon as 2019, I am very curious as to what comes next.

The way I see it, replication (combined with automatic testing and recovery, a la ZFS scrubs) is what comes next, and is already the solution when reliability is paramount. For example, when you put your data on S3, which has all those lovely nines after the decimal point, it's not relying simply on the integrity of the local array; the data is also replicated to two other locations. Obviously geographic redundancy is one reason for that, but for sure replication is part of the solution to concerns about the reliability of individual storage arrays.
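To get a feel for why replication piles up nines, here's a minimal sketch (my own back-of-the-envelope, not Amazon's actual math) that assumes each copy has a made-up annual loss probability and that failures are independent, which real deployments only approximate:

Code:
# Minimal sketch: durability from replication, assuming independent copies.
# The per-copy loss probability below is a made-up illustrative number.
def combined_loss_probability(per_copy_loss, copies):
    """Probability that every copy is lost in the same period."""
    return per_copy_loss ** copies

p = 1e-4  # hypothetical annual chance of losing one site's copy of an object
for n in (1, 2, 3):
    print(f"{n} copy/copies: annual loss probability ~ {combined_loss_probability(p, n):.0e}")

Each additional independent copy roughly multiplies the number of nines, which is why replication across sites does so much of the heavy lifting on top of local array integrity.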
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The part of the equation that's going to fail is ultimately the fact that seek times aren't improving linearly with storage - in fact, a 20-year-old drive seeks at least half as fast as a current drive, despite the factor-of-3000 difference in size (2GB Barracuda vs 6TB SATA).

So if you're seeking around storing small files, the 2GB drive will fill a lot more quickly than the 6TB. The question becomes, at what point can you no longer get data on and off the drive expeditiously enough without the drive failing in the meantime? It's just a more practical variation on the problem.

Growing the pool size doesn't exactly help.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You are so right. Silly me, my brain assumed that since Figure 7 has four sections and it's just labeled "probability" it was 25% per section, but it's just an expansion of the same data on Figure 6. My mistake.

I assume Figure 6 is supposed to be the one actually showing 0-100% on the left, which means RAID5 reaches 100% probability of data loss about 2017. RAID-Z3 (or their very interesting RAID-7.3 terminology) appears to be in pretty good shape till about, what, 2030, at a wild guess?

Thanks for the perspective.

I don't think it's showing 0-100% either. I believe that the scale was actually left off intentionally because to come up with actual solid figures, you need to have a specific number of drives, amount of space, and other characteristics to derive an actual failure probability. The conclusion to draw is that RAIDZ3 is substantially better than RAIDZ1 or RAIDZ2.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
When the drive density starts to approach the mathematical limits of URE, it would seem to me that we will build error correction into the device.

Uh.. we already have that. Do you know what SMART parameter 195 is for? That's to track the errors the hard drive had to correct with its own internal ECC. For some brands and models, that value tracks the number of times the ECC had to correct more than X bits, where X is some threshold the hard drive manufacturer chose to show that the drive is relying on internal ECC often enough to be considered at risk of not being able to reliably store data. I remember talking to a Seagate engineer about 10 years ago. I asked about internal ECC and he said that it is used more often than people want to realize, and that some people would be flat-out terrified if they knew how often it is relied on to restore their data.

Remember the whole 4k sector revolution that took place years ago? That was to increase performance (more bytes per read), increase reliability (by having more ECC bits in total, but fewer ECC bits per byte of user data), and as a result, increase disk platter utilization for user data.

Here, read about it yourself: http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-master-ti/
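A rough way to see the efficiency argument in numbers (the per-sector ECC sizes below are the ballpark figures from the Advanced Format literature, roughly 50 bytes of ECC per 512-byte sector versus roughly 100 bytes per 4K sector, not exact values for any particular drive):

Code:
# Back-of-the-envelope ECC overhead: eight legacy 512-byte sectors vs one
# Advanced Format 4K sector covering the same 4096 bytes of user data.
# ECC byte counts are approximate; sync/address-mark/gap overhead is ignored.
def ecc_overhead(user_bytes, ecc_bytes):
    return ecc_bytes / (user_bytes + ecc_bytes)

legacy_ecc = 8 * 50    # ~50 bytes of ECC per 512-byte sector, eight sectors
advanced_ecc = 100     # ~100 bytes of ECC for one 4K sector

print(f"legacy 512-byte sectors: {ecc_overhead(4096, legacy_ecc):.1%} ECC overhead")    # ~8.9%
print(f"Advanced Format 4K:      {ecc_overhead(4096, advanced_ecc):.1%} ECC overhead")  # ~2.4%

Bigger sectors give the drive a stronger code per block while spending fewer bytes per byte of user data, which is exactly the trade described above.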

In fact, even solid-state media does this, and has for decades. Even BIOSes themselves have CRC or ECC of some type (depending on various aspects of the system).

So yeah, your great idea was implemented long ago. Even if you are my age, it was implemented when your parents were in grade school.

SSDs are their own beast because MLC (and TLC) has required more and more bytes of ECC. With each downsize in the memory cell, more ECC bytes are needed to maintain reliability of the media. In fact, one of Samsung's initial problems with TLC testing was that it was so unreliable that the number of bytes of ECC needed to make the media usable exceeded the added value that TLC was expected to bring to the table. At that time, TLC cost more and stored less than MLC of the same time period. Luckily for Samsung they managed to fix it somehow, probably with a combination of firmware and manufacturing changes. You can read about SLC and MLC a bit here... The Inconvenient Truths of NAND Flash Memory. That presentation is a bit old, but it's still a very good read IMO.

Remember these things:

1. Hard drive manufacturers want to sell more drives.
2. Hard drive manufacturers want the drives to be as big as they can make them.
3. They accomplish #2 by trying to minimize the amount of ECC they need, therefore maximizing the amount of user-data they can use. (this is literally robbing from Peter to pay Paul)
4. They will do #3 as best as they can, so long as everyone isn't losing data due to their drives being unreliable.
5. If *you* aren't happy with that, you are always welcome (and even invited by the hard drive manufacturers) to buy those horribly overpriced drives that have error rates an order of magnitude better (according to their numbers) and provide better reliability (again, according to their numbers).

Remember, hard drive manufacturers will juggle all of this to their advantage. If it's not good for you because you can't buy those cheap WD Greens and do a RAIDZ2, well, fsck you. They don't owe you anything, but you owe them your hard-earned money, right? :P

But seriously, this is nothing more than business decisions right now. There is no doubt that things will change between now and 2019. I have no doubt that innovation of some kind will take shape, somehow. How much that will affect things right now is anyone's guess. HAMR technology will be out in large numbers by 2019. For all we know HAMR will make hard drives so reliable that doing RMAs for hard drives will be an exception instead of the norm. They might need 1/2 the ECC bits of current disks. On the other side, they might need 10x the ECC bits. Only those working in laboratories can probably provide any clue as to what is really going on with that technology.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
LOL. I should have written 'additional' error correction into the device. I was proposing something stronger than a larger ECC block... But also NOT suggesting my idea was original or unique.

Thanks for the nice links. I didn't notice they had to ratchet up the ECC block size.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Cloud storage is becoming many folks' primary way of using data? I mean, sure, it's growing, but I don't know that I would characterize it as taking over traditional data storage. My only cloud storage is emergency backups.
Without going to one of our cable company's "business" plans, backing up any reasonable proportion of my data to "the cloud" is out of the question: our "residential" plan is only 4Mbps up, and there is a monthly cap as well. The "business" plan seems to eliminate the "cap" but is still only 4Mbps up.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
LOL. I should have written 'additional' error correction into the device. I was proposing something stronger than a larger ECC block... But also NOT suggesting my idea was original or unique.

Thanks for the nice links. I didn't notice they had to ratchet up the ECC block size.

Right, but every byte they add to ECC means one less byte to store real data. So they have to decide how much to allocate for real data, and how much to allocate for ECC. People don't buy hard drives that have a box that says "now with 4 more bytes of ECC per sector!". People *do* buy the drives that say "6TB" over "5TB".

This is one of those "secret sauce behind the recipe" things that 99.9% of sheeple will never know about, never care about, and never bother to question. So long as Brand-X's hard drives aren't noticeably less reliable than Brand-Y's, everything is okay. Even if they are all equally unreliable, that's not a problem that the manufacturers care about.

They are there to sell you hard drives; as many of them as they can for as much profit as they can. Any choice that doesn't promote at least one of those goals (and preferably both) is not going to happen. So manufacturers are not going to double the ECC on their hard drives tomorrow and tell you that the drive is so much better because they did that. At least, not unless some other company does it first. And will you be the guy that buys that 7TB hard drive for $300, or will you be the guy that buys the 7.5TB drive for the same price? Betting you'll buy the 7.5TB, totally unaware that maybe the 7TB drive has more ECC.

Decisions like this have already been made based on market forces. So unless you are going to start your own company and compete with the other companies to make the most reliable hard drive, you are stuck with what you can buy.
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Right, but every byte they add to ECC means one less byte to store real data. So they have to decide how much to allocate for real data, and how much to allocate for ECC. People don't buy hard drives that have a box that says "now with 4 more bytes of ECC per sector!". People *do* buy the drives that say "6TB" over "5TB".

This is one of those "secret sauce behind the recipe" things that 99.9% of sheeple will never know about, never care about, and never bother to question. So long as Brand-X's hard drives aren't noticeably less reliable than Brand-Y's, everything is okay. Even if they are all equally unreliable, that's not a problem that the manufacturers care about.

They are there to sell you hard drives; as many of them as they can for as much profit as they can. Any choice that doesn't promote at least one of those goals (and preferably both) is not going to happen. So manufacturers are not going to double the ECC on their hard drives tomorrow and tell you that the drive is so much better because they did that. At least, not unless some other company does it first. And will you be the guy that buys that 7TB hard drive for $300, or will you be the guy that buys the 7.5TB drive for the same price? Betting you'll buy the 7.5TB, totally unaware that maybe the 7TB drive has more ECC.

Decisions like this have already been made based on market forces. So unless you are going to start your own company and compete with the other companies to make the most reliable hard drive, you are stuck with what you can buy.
Do the "Enterprise" drives have better ECC?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The future of the IT professional ...


If you can get the isolinear chips in the right slots, you get to live another day.

They're secretly running ZFS, so the chips can go in any slot, but Engineering pretends the slots matter, for job security reasons.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Do the "Enterprise" drives have better ECC?

I would guess not, but there's no information to really validate or invalidate that assumption. Companies don't publish how many bytes of ECC they use per sector.

But, Enterprise drives claim a lower rate of UREs, which *is* the mathematical chance of a read error, which is what this whole thread *is* actually talking about even if not everyone knows that.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Besides better ECC, I can think of several other options:
  • Tighter platter binning
  • Better/more expensive surface coatings
  • Better signal processing hardware
  • Sensors to improve feedback loops
  • Marketing lies
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
But, Enterprise drives claim a lower rate of UREs, which *is* the mathematical chance of a read error,

However, it is worth noting that this has always been a dicey metric to begin with, and probably doesn't translate to useful data, in the same way that MTBF isn't really directly meaningful. Nonrecoverable read errors aren't likely to magically all be 1x10^14 for consumer-grade drives and 1x10^15 for the enterprise drives, across many years, underlying technology changes, etc. It MAY be indicative of somewhat better materials/design/etc but it is also definitely indicative of the fact that they'd prefer enterprises to buy the more pricey drives.

The whole point of RAID, however, was to create a redundant array of inexpensive disks, and to tackle the problem that way. The difference between 1x10^14 and 1x10^15 isn't particularly meaningful in that context, because, again, data loss is tied to the probability of two drives losing the same block simultaneously - not just two drives losing some arbitrary unrelated blocks simultaneously.
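For a sense of scale, here's a quick sketch that treats the spec-sheet URE rate as an independent per-bit probability (which real drives don't actually follow, so these are order-of-magnitude illustrations only):

Code:
# Expected unrecoverable read errors when reading a whole drive end to end,
# treating the URE spec as an independent per-bit probability. Illustrative
# only; real error behavior is clustered and far messier.
def expected_ures(drive_bytes, ure_spec_bits):
    return drive_bytes * 8 / ure_spec_bits

six_tb = 6e12
print(f"6TB drive @ 1 in 10^14: ~{expected_ures(six_tb, 1e14):.2f} UREs per full read")
print(f"6TB drive @ 1 in 10^15: ~{expected_ures(six_tb, 1e15):.3f} UREs per full read")

Roughly 0.48 versus 0.048 expected errors per full read is a real difference, but either one is comfortably absorbed by a surviving level of parity, which is the point.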

which is what this whole thread *is* actually talking about even if not everyone knows that.

It's actually really only tangentially related, and I'm kinda surprised you'd say such a thing. What you're actually looking for is the likelihood of data loss on a pool, and how we can affect that in the future.

RAIDZ1 dies "in 2009" for a very specific reason: the loss of a single disk (parity is distributed, so it doesn't matter which one) eliminates all redundancy for the pool. When you're rebuilding, you actually do need each and every sector on the remaining drives that contains pool data to be readable, or you will encounter some loss of data. That is very much intertwined with the URE values you're discussing.

RAIDZ2, however, retains redundancy. Because of that, the URE values are of less concern. As long as the redundancy is capable of recovering the data, you're still fine. The problem with RAIDZ2 is that if you lose a drive, any block on the remaining drives which falls victim to a URE is still recoverable, but has effectively lost redundancy. Still, it is totally recoverable.

We do run into a problem with that, however, as the rebuild times increase. The likelihood of a second drive failing during a multi-day rebuild with these modern large drives is substantially greater than the chance of failure striking during the rebuild of a much smaller drive.

RAIDZ3 extends that out further. At this point, the impact of the URE rate is essentially meaningless, because you're multiply covered even for two failures. Again, as I pointed out earlier on, this is actually a problem in statistics, and statistically speaking, you're very likely to retain availability of a data block as long as you haven't lost access to it through UREs hitting that same block on the other drives, or through losing those drives entirely.

As we move out from RAIDZ1 to RAIDZ3, the statistical likelihood of URE's simultaneously affecting enough replicas of a block to render the block unavailable becomes more and more unlikely, making the issue of URE's less and less important. What's becoming more and more important is MTBF of a drive, since loss of a drive renders ALL blocks on the drive as unrecoverable, and the window during which reduced redundancy exists continues to grow due to the massive size of modern drives.

Increasing the RAIDZ level reduces the importance of URE's from fairly important to almost meaningless.

Increasing the size of the drive increases the resilvering time, which provides a larger window of reduced redundancy for the pool.

These are actually independent variables, but they can cooperate to help build a more reliable pool.
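For anyone who wants to play with those two variables, here's a toy model of the "one drive already dead, now resilver" scenario. It assumes independent failures, a constant URE spec treated as a per-bit probability, and exponential whole-drive failures over the rebuild window - all big simplifications, so it's a sketch for intuition, not a prediction, and every number in it is hypothetical:

Code:
# Toy model of a resilver after one drive has failed: chance of hitting a URE
# somewhere on the surviving drives, and chance of losing additional whole
# drives before the resilver finishes. All assumptions are simplifications.
import math

def p_ure_during_rebuild(drives_remaining, drive_bytes, ure_spec_bits):
    """Chance of at least one URE while reading every surviving drive once."""
    bits_read = drives_remaining * drive_bytes * 8
    return 1 - math.exp(-bits_read / ure_spec_bits)

def p_extra_drive_failures(drives_remaining, rebuild_hours, mtbf_hours, needed):
    """Chance that at least `needed` more drives die during the rebuild window."""
    p_one = 1 - math.exp(-rebuild_hours / mtbf_hours)
    return sum(math.comb(drives_remaining, k)
               * p_one**k * (1 - p_one)**(drives_remaining - k)
               for k in range(needed, drives_remaining + 1))

# Hypothetical pool: 8x 6TB, one drive failed, 1-in-10^14 URE spec,
# 48-hour resilver, 1,000,000-hour MTBF.
remaining, size, ure, hours, mtbf = 7, 6e12, 1e14, 48.0, 1_000_000.0
print("P(URE somewhere during resilver):        ", round(p_ure_during_rebuild(remaining, size, ure), 3))
print("P(1+ more drives lost; fatal for RAIDZ1):", round(p_extra_drive_failures(remaining, hours, mtbf, 1), 5))
print("P(2+ more drives lost; fatal for RAIDZ2):", round(p_extra_drive_failures(remaining, hours, mtbf, 2), 8))

With these made-up inputs, a URE somewhere during the resilver is nearly certain, which is why RAIDZ1 is a gamble, while losing enough whole drives to sink RAIDZ2 or RAIDZ3 during the window stays tiny - until rebuild times stretch out or the drives turn out to be far less reliable than their MTBF suggests.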
 

RedBear

Explorer
Joined
May 16, 2015
Messages
53
However, it is worth noting that this has always been a dicey metric to begin with, and probably doesn't translate to useful data, in the same way that MTBF isn't really directly meaningful. Nonrecoverable read errors aren't likely to magically all be 1x10^14 for consumer-grade drives and 1x10^15 for the enterprise drives, across many years, underlying technology changes, etc. It MAY be indicative of somewhat better materials/design/etc but it is also definitely indicative of the fact that they'd prefer enterprises to buy the more pricey drives.

The whole point of RAID, however, was to create a redundant array of inexpensive disks, and to tackle the problem that way. The difference between 1x10^14 and 1x10^15 isn't particularly meaningful in that context, because, again, data loss is tied to the probability of two drives losing the same block simultaneously - not just two drives losing some arbitrary unrelated blocks simultaneously.



It's actually really only tangentially related, and I'm kinda surprised you'd say such a thing. What you're actually looking for is the likelihood of data loss on a pool, and how we can affect that in the future.

RAIDZ1 dies "in 2009" for a very specific reason: the loss of a single disk (parity is distributed, so it doesn't matter which one) eliminates all redundancy for the pool. When you're rebuilding, you actually do need each and every sector on the remaining drives that contains pool data to be readable, or you will encounter some loss of data. That is very much intertwined with the URE values you're discussing.

RAIDZ2, however, retains redundancy. Because of that, the URE values are of less concern. As long as the redundancy is capable of recovering the data, you're still fine. The problem with RAIDZ2 is that if you lose a drive, any block on the remaining drives which falls victim to a URE is still recoverable, but has effectively lost redundancy. Still, it is totally recoverable.

We do run into a problem with that, however, as the rebuild times increase. The likelihood of a second drive failing during a multi-day rebuild with these modern large drives is substantially greater than the chance of failure striking during the rebuild of a much smaller drive.

RAIDZ3 extends that out further. At this point, the impact of the URE rate is essentially meaningless, because you're multiply covered even for two failures. Again, as I pointed out earlier on, this is actually a problem in statistics, and statistically speaking, you're very likely to retain availability of a data block as long as you haven't lost access to it through UREs hitting that same block on the other drives, or through losing those drives entirely.

As we move out from RAIDZ1 to RAIDZ3, the statistical likelihood of URE's simultaneously affecting enough replicas of a block to render the block unavailable becomes more and more unlikely, making the issue of URE's less and less important. What's becoming more and more important is MTBF of a drive, since loss of a drive renders ALL blocks on the drive as unrecoverable, and the window during which reduced redundancy exists continues to grow due to the massive size of modern drives.

Increasing the RAIDZ level reduces the importance of URE's from fairly important to almost meaningless.

Increasing the size of the drive increases the resilvering time, which provides a larger window of reduced redundancy for the pool.

These are actually independent variables, but they can cooperate to help build a more reliable pool.


lolcat-i-love-this-thread-so-much.jpg

It's meaty posts like this that make me glad I started this thread.

For sure the probability of losing a second or even a third drive during ever-longer rebuild times is a big factor in why even RAID-Z2 makes me nervous on more than about 6 or 7-disk arrays. According to the Backblaze blog they've had annual failure rates of around 30% with some models of 1.5TB & 3TB Seagates. If I'd been unlucky enough to choose those drives for a 9-disk array, even RAID-Z3 would have been hard pressed to keep that array functional. Couple those failure rates with the fact that in the real world an array will often be built with drives from the same manufacturing batch that experience identical usage, vibration and heat conditions, and you've got a drastically higher probability of multiple drive failures within a relatively shorter time period than MTBF would seem to suggest. Fortunately most drive models have failure rates less than 5%. I'm planning on sticking with HGST NAS drives as they seem to mostly hover around 0.5 to 2.5% failure rates pretty reliably.
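A crude way to put numbers on that worry is a binomial model using the quoted annual failure rates, assuming independent failures (which, as noted, same-batch drives in the same chassis probably aren't, so this understates the risk); the failure rates and drive counts below are just the ones from the paragraph above:

Code:
# How likely is it that several drives in a 9-disk vdev fail within a year,
# for a "normal" AFR versus the bad Backblaze Seagate batches? Binomial model,
# assumes independence, so it understates correlated same-batch failures.
from math import comb

def p_at_least(n_drives, afr, k):
    return sum(comb(n_drives, i) * afr**i * (1 - afr)**(n_drives - i)
               for i in range(k, n_drives + 1))

for afr in (0.05, 0.30):
    print(f"AFR {afr:.0%}: P(>=3 of 9 in a year) = {p_at_least(9, afr, 3):.1%}, "
          f"P(>=4 of 9) = {p_at_least(9, afr, 4):.2%}")

At a 5% AFR, four failures in a year is a freak event; at 30%, it's better than a one-in-four shot even before correlated batch effects, so "even RAID-Z3 would have been hard pressed" is not an exaggeration.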

I am fully in agreement that long rebuild times are becoming more of an issue than URE rates. Either way, ZFS needs to be able to deal with it.

By the way, I don't know if WD Reds are still being recommended much around here but they don't seem to be doing well on Backblaze's report. Link. Failure rates of up to 13%.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's meaty posts like this that make me glad I started this thread.

It's interesting, to be sure.

For sure the probability of losing a second or even a third drive during ever-longer rebuild times is a big factor in why even RAID-Z2 makes me nervous on more than about 6 or 7-disk arrays.

I'm not terribly scared of that, but RAIDZ3 does give that extra good-feely. One of the things I'd note is that 20-25 years ago, your typical low end SCSI drive (1GB Seagate Hawk, 5400RPM, lowish end drive) was $500 in bulk. These days, you're seeing 4TB drives for $125. We thought *$500* was the "inexpensive" in "RAID." From my perspective, I find it hard to go with RAIDZ2 except for practical reasons, such as you can't do RAIDZ3 in a 4-drive chassis, or the data isn't critical, or whatever. The extra "cost" for RAIDZ3 in my mind is a total noncost.

According to the Backblaze blog they've had annual failure rates of around 30% with some models of 1.5TB & 3TB Seagates.

We've seen 50% on the early gen 3TB'ers (small sample size though) and I know of some arrays of the 1.5TB'ers where they literally cannot finish a resilver of a 1.5 before another one fails. Not Kidding.

If I'd been unlucky enough to choose those drives for a 9-disk array, even RAID-Z3 would have been hard pressed to keep that array functional.

Quite possibly.

Couple those failure rates with the fact that in the real world an array will often be built with drives from the same manufacturing batch that experience identical usage, vibration and heat conditions, and you've got a drastically higher probability of multiple drive failures within a relatively shorter time period than MTBF would seem to suggest.

I've argued that as being a design error in the past, but I got tired of arguing with people who have been indoctrinated to the hardware RAID array sales strategy. I've had people argue with me that it is "impossible" to get heterogeneous arrays from a vendor (it isn't, you just need to be pushy), or that somehow mixing drive types will cause things to explode or the universe to end. Some or all of that was true in the old days of SCSI where you'd really want all the drives on a bus to have the same hardware and firmware to reduce SCSI communications errors and to support spindle sync, but with modern SAS/SATA topologies, this is significantly less of a factor. Ideally you want drives with similar performance characteristics (spindle speed, seek time, cache), but even if you don't, performance will approximate the worst of your devices - which, while not great, is perfectly acceptable in many cases.

Fortunately most drive models have failure rates less than 5%. I'm planning on sticking with HGST NAS drives as they seem to mostly hover around 0.5 to 2.5% failure rates pretty reliably.

I am fully in agreement that long rebuild times are becoming more of an issue than URE rates. Either way, ZFS needs to be able to deal with it.

Well, yes. And this is where I see one aspect of ZFS that troubles me ... rebuilds work on the basis of metadata traversal, which is why a rebuild of a nearly empty pool is much faster. I haven't looked at this in some time, but basically there's a lot more seek activity going on during a ZFS RAIDZ rebuild than there would be during a RAID5 rebuild, which is just linearly traversing all LBA's.

As drive sizes have increased, the practicality of filling a disk with small files has decreased. In the old days with a Seagate Hawk 1GB, with 512 byte sectors, you had approximately 2 million sectors. So at a seek time of ~10ms, and two seeks per file write, and 1 byte files, that means you could write 50 files per second. (Note this is a demonstration of a concept and not intended to be a technically rigorous analysis of actual filesystem performance.) At that speed, for two million files, you need about 42,000 seconds, or half a day, to fill the disk.

Now, with a new 6TB disk, with 4K sectors, you have about 1.6 BILLION sectors. At a seek time of ~5ms, and two seeks per file write, and 1 byte files, you can write 100 files per second. At that speed, for 1.6 billion files, you need 16 million seconds, or 186 DAYS, to fill the disk. That's even assuming you waste all the space for the 4096 byte sectors, and don't try to use it as a 512 emulated disk. If you do that, then suddenly you're out at about 4 YEARS to fill the disk.
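The same back-of-the-envelope math in a few lines, using binary GB/TB and the seek-limited assumptions above (again, a concept demo, not a filesystem model):

Code:
# Seek-limited time to fill a drive with tiny files: one file per sector,
# two seeks per file write. Reproduces the rough figures above; real
# filesystems batch writes, so treat this as an illustration only.
def fill_time_days(capacity_bytes, sector_bytes, seek_ms):
    files = capacity_bytes / sector_bytes       # one tiny file per sector
    seconds = files * 2 * (seek_ms / 1000.0)    # two seeks per file write
    return seconds / 86400

GB, TB = 2**30, 2**40
print(f"1GB Hawk, 512-byte sectors, 10ms seeks: {fill_time_days(1 * GB, 512, 10):.1f} days")
print(f"6TB drive, 4K sectors, 5ms seeks:       {fill_time_days(6 * TB, 4096, 5):.0f} days")
print(f"6TB drive as 512e, 5ms seeks:           {fill_time_days(6 * TB, 512, 5) / 365:.1f} years")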

And that's around the estimated lifetime of a device. You've maybe barely managed to write all your stuff on the drive and it's failing!

The takeaway from this is that the big drives are not good for storing lots of small files, and ZFS has some issues related to the size of drives as well.

By the way, I don't know if WD Reds are still being recommended much around here but they don't seem to be doing well on Backblaze's report. Link. Failure rates of up to 13%.

Eh. I'm not impressed either way.
 