Let's talk about UREs?

souporman

Explorer
Joined
Feb 3, 2015
Messages
57
I've been thinking about UREs and RAID1 (or RAID10). Long ago, in a galaxy far, far away (the Spiceworks forums), I was reading about UREs. There was a trusted storage resource around those parts named Scott Alan Miller, who posited that RAID1 and RAID10 were "immune" to UREs because they do not use "parity" disks but rather a mirrored clone, which is inherently not at risk of encountering a URE; or rather, that when a URE is encountered, it simply does not cause a failure as this is not a block-by-block rebuild, but instead, a simple file copy (here is the link to the pinned forum post I'm referring to). He has since flamed out of that forum in a very cyberjock-esque fashion, is no longer a member there, and now gives advice on becoming an expat on his YouTube channel, so maybe he wasn't the right person to be taking advice from.

I made a comment here a few months ago about how I thought RAID1/10 was immune to URE encounters and was quickly corrected that a URE will still kill a RAID1/10 during a resilver. Indeed, I cannot find a single other source on the internet that thinks the same way as the gentleman from Spiceworks, so I am inclined to believe that the one source of opposing data is probably the wrong one, and yet...

I built a 360-disk TrueNAS of mirrors: 180 mirrors of 14TB disks with a rated URE of 1 in 10^14 bits. It clocks in at 2PB usable and has been running for over 2 years now. Before upgrading to 14TB disks, the same NAS housed 8TB disks. The whole monster has been running without data loss since 2016. It holds old, cold WORM data and has lived most of its life at or around 85% - 90% capacity. I've replaced at least 30 - 40 failed disks over the years and never sweated a resilver. I've never lost any data, but the math of a 1-in-10^14 URE rate says I should encounter a URE while resilvering ~12TB of data onto a replacement nearly 100% of the time, right? I know the math doesn't say exactly that, but what I'm getting at is: if a URE should kill a 12TB resilver pretty darn frequently, then am I the luckiest man on the face of the Earth? Should this NAS be admitted to the Smithsonian as a monument to the hubris of not doing enough research? I'd love to hear some thoughts!
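
For the sake of argument, here's the naive back-of-envelope I mean, assuming the spec is a flat, independent 1-in-10^14 chance per bit read (which is almost certainly not how real drives actually behave):

Code:
# Naive model: every bit read has an independent 1e-14 chance of being
# unrecoverable, and a resilver has to read all of the allocated data.
def p_at_least_one_ure(bytes_read, ure_rate_bits=1e14):
    bits = bytes_read * 8
    return 1 - (1 - 1 / ure_rate_bits) ** bits

for tb in (12, 14):
    print(f"{tb} TB read: {p_at_least_one_ure(tb * 1e12):.0%} at 10^14, "
          f"{p_at_least_one_ure(tb * 1e12, 1e15):.0%} at 10^15")
# 12 TB read: 62% at 10^14, 9% at 10^15
# 14 TB read: 67% at 10^14, 11% at 10^15

So even the naive model says roughly two out of three of my resilvers should have tripped over something at 10^14, not "nearly 100%", and the fact that I've never seen one suggests real drives sit well below that worst-case figure.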

EDIT: @Arwen @danb35 I hope you don't mind me tagging a few of you whose opinion I've learned to respect over the years!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
when a URE is encountered, it simply does not cause a failure as this is not a block-by-block rebuild, but instead, a simple file copy
That's utterly nonsensical; no RAID system I'm aware of operates at the file level, and mirrors aren't any more immune to URE-caused damage than RAIDZ. The "RAID5 is dead" hysteria has always been exaggerated, as it assumes a URE during rebuild is fatal to the volume, and that's unlikely to be the case unless the URE was in critical metadata (or it causes the drive to drop offline). ZFS does some things that make it even less likely:
  • It stores multiple (at least 2, up to 6 IIRC) copies of all metadata
  • It checksums all data and metadata, so it can know if the drive returns bad data
I also strongly suspect it to be the case that real-world URE rates are considerably better than spec.
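
To illustrate the checksum point, here's a toy sketch (not real ZFS code, and nothing like its actual on-disk layout) of why a checksummed mirror can shrug off one side returning garbage:

Code:
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def mirror_read(copies, expected_checksum):
    """Return the first copy that verifies; rewrite any copy that doesn't."""
    good = next((d for d in copies.values()
                 if d is not None and checksum(d) == expected_checksum), None)
    if good is None:
        raise IOError("no good copy left - this is where data loss happens")
    for disk, data in copies.items():
        if data is None or checksum(data) != expected_checksum:
            copies[disk] = good  # "self-heal" the bad side from the good one
    return good

block = b"precious WORM data"
copies = {"disk0": b"g@rbage from a URE or bit rot", "disk1": block}
assert mirror_read(copies, checksum(block)) == block
assert copies["disk0"] == block  # damaged side repaired, no drama

A plain RAID1 controller has no expected checksum to compare against, so it can only act on errors the drive admits to, not on silently wrong data.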
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I'll try to revisit in detail later, but the short version is that a URE rate is not only a "worst case scenario" but also an "average over lifespan" - a drive is expected to produce unrecoverable read errors no more frequently than its stated URE rate, averaged over its entire warranty period.

It can absolutely produce them less frequently - in fact, it may give you petabytes of working life before finally giving you a bad bit.
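
To put the spec in plain units (assuming the rate is quoted per bit read, as it usually is):

Code:
# 1 URE per 1e14 bits, worst case over the warranty period
for rate_bits in (1e14, 1e15):
    tb_per_error = rate_bits / 8 / 1e12
    print(f"<= 1 error per {rate_bits:.0e} bits  =  >= {tb_per_error:.1f} TB read per error")
# 1e14 -> 12.5 TB per error, 1e15 -> 125 TB per error (a floor, not a promise
# that errors actually show up that often)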

As mentioned by @danb35, ZFS also has the advantage of knowing when a device hands it bad data.
 

souporman

Explorer
Joined
Feb 3, 2015
Messages
57
Interesting stuff. I was hoping you'd chime in as well, @HoneyBadger, but I thought it would be annoying if I just started tagging a bunch of people. So, with ZFS being aware of bad data and able to deal with it, having multiple copies of the precious metadata, and with a heck of a lot of anecdotal evidence pointing to real-world URE rates being considerably better than spec... it almost seems like UREs aren't even worth mentioning, especially when even "consumer-y" drives like the Exos line are rated at 10^15 these days.

I built the giant 360-disk mirror NAS in part just to see if it would work. I also managed several other, smaller NASes of ~100 disks or fewer, arranged in 10-disk-wide VDEVs in a RAIDz2 configuration. I'd hit UREs on those when they were rebuilding from time to time. I'd see the files that were lost on the zpool status -v output. I wouldn't say it was a regular occurrence, but out of the 9 NASes I administered (8 were the smaller RAIDz2 systems, and 1 was the mirror NAS), I lost 2 disks and hit a URE during resilver twice over the course of 7 years, resulting in minor data loss, but not the entire array.

It would be really fun to just keep adding mirrors. I wish I had the resources! I was reaching the limits of what the TrueNAS GUI could handle with that many disks. I know they've made great strides in the last release or two to improve performance when many hundreds of disks are present, but I left that job a year ago, so I can't say how much better it is now. It would be fun to see how many mirrors one can successfully deploy... the resilver time was astoundingly fast considering I had 2PB usable!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd hit UREs on those when they were rebuilding from time to time. I'd see the files that were lost on the zpool status -v output.
Well, something was incredibly wrong, then. The odds of a typical hard drive spitting out a bad read are "low", I think we can mostly agree. With RAIDZ2, you'd need three disks to spit out a bad read affecting the same block. Unless some machine at the HDD assembly line was bonking a very specific sector of the disks, that's just not a realistic failure mode.
I lost 2 disks and hit a URE during resilver twice over the course of 7 years, resulting in minor data loss
How often were you scrubbing those pools? That's a lot of co-located bad data making it through until it's too late.
Anecdote time: I've recently had to deal with a pair of Samsung 870 Evos that trashed themselves way too early in their life (a year in, with moderate writes well below their spec). They were each throwing hundreds or thousands of read errors and returning data that didn't match the checksums... After scrubbing, I had zero corruption. Just added a third disk while replacements were being sorted and poof. Zero data loss, zero downtime.
Statistically, that makes sense. A disk these days has a ton of sectors, and even a thousand sectors are a drop in the bucket. How likely is it for even one to match up, if the distribution is homogeneous over the LBAs?
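
Rough numbers, with hypothetical figures for a 14 TB drive and two unusually bad disks (a thousand bad sectors each), assuming bad sectors land uniformly at random over the LBAs:

Code:
import math

sectors = 14e12 / 4096            # ~3.4e9 4-KiB LBAs per disk
bad_a = bad_b = 1000              # two generously bad disks
expected_overlap = bad_a * bad_b / sectors
p_any_overlap = 1 - math.exp(-expected_overlap)
print(f"expected LBAs bad on both disks: {expected_overlap:.1e}")   # ~2.9e-04
print(f"P(at least one overlapping LBA): {p_any_overlap:.1e}")      # ~2.9e-04

So under the homogeneous assumption, overlapping damage is a one-in-thousands event even for two visibly unhealthy disks.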

It stores multiple (at least 2, up to 6 IIRC) copies of all metadata
Pool-level metadata has three copies, other metadata has two copies. Commentators disagree on whether copies=3 results in data/meta/critical being written with 3/3/3 or 3/6/9 copies. The source doesn't lie, but I haven't had the motivation to go spelunking yet. I know that Block Pointers have room for three copies only, so I suspect it's the former.

I've been thinking about UREs and RAID1 (or RAID10). Long ago, in a galaxy far, far away (the Spiceworks forums), I was reading about UREs. There was a trusted storage resource around those parts named Scott Alan Miller, who posited that RAID1 and RAID10 were "immune" to UREs because they do not use "parity" disks but rather a mirrored clone, which is inherently not at risk of encountering a URE; or rather, that when a URE is encountered, it simply does not cause a failure as this is not a block-by-block rebuild, but instead, a simple file copy (here is the link to the pinned forum post I'm referring to).
The whole vibe I get from that reasoning is weird and somewhat mystical.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
How likely is it for even one to match up, if the distribution is homogeneous over the LBAs?
I distinctly remember reading somewhere that someone observed that certain (round) LBAs were more prone to failure than others across a particular model of (rotational) drive, suggesting a firmware problem of some sort. With the increasing complexity of flash firmware, this sort of thing is only going to become more common.

There was a trusted storage resource around those parts named Scott Alan Miller, who posited that RAID1 and RAID10 were "immune" to UREs because they do not use "parity" disks but rather a mirrored clone, which is inherently not at risk of encountering a URE
No, that is not correct. If one side of the mirror fails and the other develops a URE, the same data loss happens as in RAID5. However, the amount of data that must be read to rebuild is smaller in RAID1, so the probability of encountering a URE during the rebuild is lower.
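
A quick sketch of that difference, with a hypothetical 8-disk array of 14 TB drives and the same naive 1-in-10^14-per-bit model from above:

Code:
# Probability of finishing a rebuild without a single URE, naive per-bit model
def p_clean_rebuild(bytes_read, ure_rate_bits=1e14):
    return (1 - 1 / ure_rate_bits) ** (bytes_read * 8)

disk = 14e12  # bytes per member
print(f"RAID1 rebuild (read 1 surviving mirror):   {p_clean_rebuild(disk):.0%}")
print(f"RAID5 rebuild (read 7 surviving members):  {p_clean_rebuild(7 * disk):.2%}")
# ~33% vs ~0.04%: the mirror is not immune, it just has far less to read

Both numbers assume the drive is only ever as good as its worst-case rating, which the rest of this thread suggests is pessimistic.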
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
...
Pool-level metadata has three copies, other metadata has two copies. Commentators disagree on whether copies=3 results in data/meta/critical being written with 3/3/3 or 3/6/9 copies. The source doesn't lie, but I haven't had the motivation to go spelunking yet. I know that Block Pointers have room for three copies only, so I suspect it's the former.
...
The thing about "copies=" is that it operates at the block level. So if that block is written to a Mirrored vDev, the mirroring doubles the copies automatically. "copies=2" would therefore mean 4 copies of data, 6 of metadata, and whatever it works out to for critical metadata. And yes, ZFS will attempt to write any additional copies to different storage drives in the vDev. So in the case of a 2-way Mirror, there are 2 copies of the data on each storage drive. Though "copies" is not quite the right term when comparing ZFS "copies=" and Mirroring.

This spreading out of copies is supposed to work even with striped vDevs that have no redundancy: with "copies=1" there is no data redundancy, but metadata will still be written to 2 separate vDevs (if space is available).

If the block is on a RAID-Zx vDev, there are no extra copies. But you get parity to protect and recover however many copies you have.


@souporman - One thing that can help prevent problems during disk replacements is to perform the replacement in place. In the case of a 2-way Mirror vDev with bad blocks on one disk, ZFS can add the replacement disk as a 3rd Mirror element and, on resilver completion, detach the failing drive.

This gives any bad sector on the supposedly "good" disk a chance that the failing (but not yet failed) disk still holds an undamaged copy of that specific ZFS block.

This works with RAID-Zx too, and can eliminate the need to restore files after a URE.


Back to UREs: yes, you got lucky, though perhaps not that lucky, since you did have data loss on RAID-Z2. I do agree with the others that the quoted URE rate of 1 per 10^14 or 10^15 bits is just a worst-case figure; real drives generally do better.
 

souporman

Explorer
Joined
Feb 3, 2015
Messages
57
How often were you scrubbing those pools? That's a lot of co-located bad data making it through until it's too late.
Scrubbing every 30 days, but this one was particularly ignored.

This NAS in particular had 8x 10-disk-wide RAIDz2 VDEVs (6TB disks), about 70% full, with no production data on it. I was going to be repurposing it, and as I was looking around to verify that all of its data had been copied to another destination, I found that 2 disks from one VDEV had failed completely and weren't even recognized anymore. I replaced them anyway, because we weren't quite sure what the plan was going to be for this NAS, so I figured why not. I popped out both bad disks, put in two new ones, and started resilvering. Kind of a fun failure mode to experience when the stakes aren't high. Anyway, this is the one that had some files show up in the "errors" list of the zpool status -v output.

As far as that fella's reasoning seeming weird and mystical... well, since nobody else on the entire internet seems to share it in any way I was able to Google, I would say you are not alone! I wish I still had access to these thousands of drives and chassis; it might be fun to dig a little deeper into how frequently UREs would actually kill a large RAID10 resilver.

I promise I'm not obsessed with UREs. I appreciate the conversation, guys! I always learn a lot around here, even if it's mostly lurking.
 

souporman

Explorer
Joined
Feb 3, 2015
Messages
57
@souporman - One thing that can help prevent problems during disk replacements is to perform the replacement in place. In the case of a 2-way Mirror vDev with bad blocks on one disk, ZFS can add the replacement disk as a 3rd Mirror element and, on resilver completion, detach the failing drive.

This gives any bad sector on the supposedly "good" disk a chance that the failing (but not yet failed) disk still holds an undamaged copy of that specific ZFS block.

This works with RAID-Zx too, and can eliminate the need to restore files after a URE.

This was something I'd never considered until somewhat recently (it may even have been you who said it in another thread). I consider this great advice, and will always try to leave open space (or a warm spare, depending on circumstances) in a NAS going forward for this reason.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I distinctly remember reading somewhere that someone observed that certain (round) LBAs were more prone to failure than others across a particular model of (rotational) drive, suggesting a firmware problem of some sort. With the increasing complexity of flash firmware, this sort of thing is only going to become more common.
No argument here; I'm talking about a typical, somewhat hypothetical disk that is an equal-opportunity data corrupter. Naturally, there are plenty of effects that can cause damage to be more localized, such as physical damage to all the (mechanical) disks simultaneously. Of course, then we end up in a 6000 hulls sort of scenario.
I wish I still had access to these thousands of drives and chassis
Don't we all...
 