Two large disks mirrored or array of smaller disks?

Sokonomi

Contributor
Joined
Jul 15, 2018
Messages
115
I'm looking for some advice on my current situation:

My current build is an ASROCK Workstation C236 motherboard with an i3-6100 CPU and 32Gb of Kingston ECC memory. The drives are a Teamgroup 120Gb SSD as the OS drive, along with a bank of 5 WD Red 3Tb drives running in a RAIDZ1 config. All drives have seen about 50k hours of service, and sadly one of the WD Red Plus drives has started to fault, so I need to come up with a plan.

My use scenario is not that demanding:
I am the sole user of the network, and I just need centralized storage for documents that I can access from 5 different PCs and a phone. It boils down to light/mild business use (messing with Photoshop files all day) and accessing things like Plex from any machine in the house. As a side note, my documents are backed up off site weekly.

My current plan is to just discard the clapped-out WD Red drive, then mirror 2 of the 4 remaining drives together to use as a dedicated business pool, and cold-store the other 2 as replacements. Then I'd just buy whatever it takes to have a nice multimedia pool of at least 12Tb.

So my question is: how do I proceed in a sensible and cost-efficient way?
Is keeping the higher-traffic document dataset on a separate pair of disks desirable in terms of wear and safety?
Regarding the large multimedia pool, is it better to just mirror a pair of large drives, or should I opt for an array of smaller ones?

I've looked at some of the info from Backblaze, and it appears the best horse in the race in the 12Tb+ category currently is the Western Digital Ultrastar DC HC530 14Tb drive (WUH721414ALE6L4). Any thoughts on that?

Any other insight is gladly received as well. I'm just trying to gather info on what's best for me to do.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
It sounds like your goal is to stick with your existing hardware, except for drives. Is that correct? (Since that's my assumption, all the following is just advice pertaining to drives and pool configuration).

mirror 2 of the 4 remaining drives together to use as a dedicated business pool
Not sure what your budget is, but whenever I see "business", I immediately go into cost-benefit mode. For most business applications, losing business data is a death knell, so this data needs to be properly protected. If you need performance, I'd recommend a mirrored pair (or even a three-way mirror) of SSDs. Otherwise, I'd still recommend buying new drives; your drives are around six years old, after all.

In your shoes, I'd probably buy three new HDDs, and put them in a three-way mirror. Having a cold spare doesn't benefit you much over a three-way mirror.

Then I'd just buy whatever it takes to have a nice multimedia pool of at least 12Tb.
From a price/TB perspective, doing parity RAID is going to be the best. For multimedia, you probably don't need that much redundancy, since you can re-rip whatever is on the array, so RAIDZ1 will do what you need.

However, if you're also putting other data on that array, then you'll want at least RAIDZ2. I'd probably do a 6x4TB RAIDZ2 array, which gives you 16TB of capacity after parity, or about 11.6TiB usable (considering the 80% threshold). If you needed more, then I'd probably go 8x4TB, or even 10x4TB.
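If you want to sanity-check that arithmetic, here's a quick back-of-the-envelope sketch in Python (it ignores ZFS metadata and padding overhead, so treat the result as ballpark only):

# Rough check of the 6x4TB RAIDZ2 sizing above (ballpark only).
disks, size_tb, parity = 6, 4, 2
data_tb  = (disks - parity) * size_tb      # 16 TB of capacity after parity
data_tib = data_tb * 1e12 / 2**40          # ~14.55 TiB
usable   = data_tib * 0.8                  # ~11.6 TiB at the 80% fill rule
print(f"{data_tb} TB after parity -> {data_tib:.2f} TiB -> {usable:.1f} TiB usable")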

One of the beautiful things about ZFS is that, with enough redundancy, "bad" drives won't take out your data. For my "backup" NASes, I use almost exclusively refurbished enterprise drives that I buy from eBay. The cost/TB is substantially lower, and I'm able to use RAIDZ2 or RAIDZ3 to survive multiple drive failures. And since these are just one component in a multi-layered backup approach, a catastrophic failure won't take down my data.

I've looked at some of the info from Backblaze, and it appears the best horse in the race in the 12Tb+ category currently is the Western Digital Ultrastar DC HC530 14Tb drive (WUH721414ALE6L4). Any thoughts on that?
Backblaze's data must be taken with a huge grain of salt. The way that they use drives is substantially different to the way most home users use drives, so the failure modes they are capturing in their data are at best loosely correlated with the kinds of failure modes you should be anticipating.

The vast majority of Backblaze's data is WORN - write once, read never. This means that drives are heavily loaded during the early parts of their lives, and then sit idle for the vast majority of the rest of it. For most NASes, at least some of the data is read regularly, which means that the NAS sees regular use throughout its life.

Furthermore, Backblaze uses custom-designed pods for their drives. This subjects the drives to significantly more heat and vibration compared to a typical home or business NAS case, which means that their data shows drive failures under those conditions, not the conditions that you or I are running. And don't assume this is a case of "well, if that drive survives the more rigorous Backblaze conditions, then it would survive even better in my setup." Backblaze generally kills their drives significantly faster than home or business users do, so that data tells you nothing about how a drive will survive your conditions. Put another way, imagine that Backblaze is killing its drives by shooting at them, while the drives in your system are dying of cancer. The fact that a drive survives Backblaze's bullets tells you nothing about how well it will survive cancer.

Lastly, Backblaze is not testing drives under any kind of controlled conditions or random sampling. They make specific buying decisions about their drives (including many shucked drives), which significantly biases their results. If Backblaze never owns a drive model, then it will never show up in their data. And if they buy a relatively small quantity of a particular model, then the statistical confidence in the failure rates for that model is lower, which makes the conclusions less reliable.

All this to say, I put no stock in the Backblaze data as a useful indicator of drive longevity.
 

Sokonomi

Contributor
Joined
Jul 15, 2018
Messages
115
I'm glad you highlighted the business part of my NAS, as this is the most mission-critical bit of it, and I should have elaborated more on how I intend to run this part in my opening post. I'll start off with the fact that it's not a large dataset. I think at 3Tb it's still almost three times as roomy as I'd need it to be. To secure my business data I have 3 points of recovery in mind:
1. A mirrored pair of HDDs: if one fails, I can replace it and let it resilver, with zero loss.
2. A nightly incremental backup to the second pool (sketched below). If the entire business pool fails at once, I still only lose 24hrs at most.
3. An offsite backup, weekly incremental, monthly full. If my entire NAS catches fire, I'd lose a week at most.
Would you consider this sufficient redundancy to warrant just using my remaining four 3Tb drives for at least another year or two?
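For reference, point 2 above would be something along the lines of this rough Python sketch; the dataset names are just placeholders and I haven't settled on the exact mechanism yet:

#!/usr/bin/env python3
# Nightly incremental: snapshot the business dataset and send the delta
# to the second pool. Placeholder dataset names, no error handling.
import subprocess
from datetime import date

SRC = "business/documents"            # placeholder source dataset
DST = "multimedia/backup/documents"   # placeholder target on the media pool

def zfs(*args):
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

snap = f"{SRC}@nightly-{date.today():%Y%m%d}"
zfs("snapshot", snap)

# The most recent earlier nightly snapshot becomes the incremental base.
prior = [s for s in zfs("list", "-H", "-t", "snapshot", "-o", "name",
                        "-s", "creation", "-d", "1", SRC).splitlines()
         if "@nightly-" in s and s != snap]

send = ["zfs", "send"] + (["-i", prior[-1]] if prior else []) + [snap]
recv = ["zfs", "recv", "-F", DST]
sender = subprocess.Popen(send, stdout=subprocess.PIPE)
subprocess.run(recv, stdin=sender.stdout, check=True)
sender.wait()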

The multimedia pool will mainly just serve some Plex at night, and won't see heavy traffic since I'm essentially the only one using it. There is the nightly backup from the business pool, but that will rarely exceed 100mb of writes a day.

There are some future plans for a surveillance system, though that will have its own pool as well, and I'm not sure where to house those drives (likely a pair of WD purples) just yet, as it might be more sensible to keep them on the NVR itself.

I do agree that a RAIDZ1 setup is good enough for multimedia, since it can recover from a single drive failure all the same, and it would just be cheaper to replace a faulty drive since it's smaller. However, I'm a bit on the fence about buying such small-capacity drives. Apart from the noise and power draw (which might be negligible anyway), there is also the issue of availability once the time comes to seek a replacement drive. I ran into this issue with my current 3Tb Reds, as my specific version turned out to be no longer available. In my experience a typical HDD sees about 5 years of life before it starts going; how do you reckon availability is going to look in 5 years from now for 4Tb drives? Also, I'm somewhat limited by the number of SATA ports my motherboard provides, as it seems expansion cards are rather costly and my bank account is finite. ;-)

I have bought some refurbished Reds before; sadly, due to my inexperience with this, one of those turned out to be a lemon, for which I'm now paying the price. So I'd like to stick to factory-new drives to avoid such disappointment.

And that is some insight on Backblaze that I hadn't considered. Their use scenario is way different, so not that applicable to our use case. So there's no merit in seeing which drives survive best in their charts? In that case, how do you find out which drives are currently most suitable?

Thanks for your detailed response!
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
I would not agree that RaidZ1 is good enough for multimedia data, or anything else when using large disks. Sure, you can probably re-rip and restore your library if you have a catastrophic disk failure - but what is your time worth, not to mention the hassle and aggravation of having to repeat all the work? I would always go with RaidZ2. That way, in case of a disk failure you still have some redundancy - which will save your pool in case you have a resilvering problem or another disk crashes. You could replace your 5-disk RaidZ1 with a 4- or 5-disk RaidZ2 (with larger disks) without using more SATA ports and have a much more resilient setup.

When you shop for new drives, be certain to avoid the SMR (shingled) models. They don't play nice with ZFS.

There are lots of posts on the forum that discuss the issues and concerns of using RaidZ1 with large disks.

As for the business data, I would do a nightly backup off-site or to the cloud. I would not want to risk losing a week's worth of business data.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
You're overthinking it. Go buy a 4Tb CMR drive and perform a 24-hour burn-in to guard against early factory failure. If at all possible, place it in service without removing the failing unit; you'll need an extra SATA/SAS plug/slot to do this. Perform a disk replacement and wait for it to silver into the pool, then remove the failed drive. The 4Tb unit will simply waste the extra 1Tb until all the drives are updated. Repeat 4 more times over the next several weeks/months for the other drives as your budget allows. Once all the 3Tb drives are gone, expand the pool and enjoy the extra space & new drives.
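Scripted, the same steps look roughly like this (pool and device names are placeholders for whatever your system calls them; you can just as easily run the underlying zpool commands from the shell):

#!/usr/bin/env python3
# In-place replacement sketch: swap the failing 3Tb member for the
# burned-in 4Tb drive while both are attached. Placeholder names only.
import subprocess, time

POOL = "tank"                 # placeholder pool name
OLD  = "gptid/failing-3tb"    # placeholder id of the faulting member
NEW  = "gptid/new-4tb"        # placeholder id of the new drive

def zpool(*args):
    return subprocess.run(["zpool", *args], check=True,
                          capture_output=True, text=True).stdout

# Let the pool grow on its own once the last small drive has been swapped.
zpool("set", "autoexpand=on", POOL)

# Replace with the old drive still attached; ZFS can read from it as well
# as from parity while it rebuilds onto the new drive.
zpool("replace", POOL, OLD, NEW)

# Don't pull the old drive until the resilver has finished.
while "resilver in progress" in zpool("status", POOL):
    time.sleep(600)
print(zpool("status", POOL))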
 

Sokonomi

Contributor
Joined
Jul 15, 2018
Messages
115
I would not agree that RaidZ1 is good enough for multimedia data, or anything else when using large disks. Sure, you can probably re-rip and restore your library if you have a catastrophic disk failure - but what is your time worth, not to mention the hassle and aggravation of having to repeat all the work? I would always go with RaidZ2. That way, in case of a disk failure you still have some redundancy - which will save your pool in case you have a resilvering problem or another disk crashes. You could replace your 5-disk RaidZ1 with a 4- or 5-disk RaidZ2 (with larger disks) without using more SATA ports and have a much more resilient setup.

When you shop for new drives, be certain to avoid the SMR (shingled) models. They don't play nice with ZFS.

There are lots of posts on the forum that discuss the issues and concerns of using RaidZ1 with large disks.

As for the business data, I would do a nightly backup off-site or to the cloud. I would not want to risk losing a week's worth of business data.
Most of the ripping process is automated; I could literally do it while I sleep. ;) But I get your point, it's not a fun activity. You make it sound like RaidZ1 has no safety at all though; how so? If one disk fails you can simply replace it and resilver the pool, can you not? The way I suggested (4-disk RAIDZ1 + 2-disk mirror) would let my easy-to-replace multimedia survive one failure, and my business data would survive potentially 3 disk failures all at once. At which point does it become overkill though? I'm genuinely curious now: are there stats that show the probability of 2 disks dying simultaneously? I can increase the offsite backup frequency to a nightly incremental, so I'll never get set back more than a day even if the whole thing implodes.
You're overthinking it. Go buy a 4Tb CMR drive and perform a 24-hour burn-in to guard against early factory failure. If at all possible, place it in service without removing the failing unit; you'll need an extra SATA/SAS plug/slot to do this. Perform a disk replacement and wait for it to silver into the pool, then remove the failed drive. The 4Tb unit will simply waste the extra 1Tb until all the drives are updated. Repeat 4 more times over the next several weeks/months for the other drives as your budget allows. Once all the 3Tb drives are gone, expand the pool and enjoy the extra space & new drives.
You can actually do this? That certainly opens up some options. The extra strain of performing a resilver 5 times won't be an issue? I always thought it was a no-no to mix drives in a pool.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
You can actually do this? That certainly opens up some options. The extra strain of performing a resilver 5 times won't be an issue? I always thought it was a no-no to mix drives in a pool.
Yes, people do it all the time. Nobody can promise you the strain of resilvering won't kill another drive, but you're already degraded, so you need to do something to get back to healthy; with the next failure, the pool is gone. I suggested a 4Tb drive as they're readily available in CMR at many local retailers. I'm under the impression 3Tb CMR drive production has been halted for some time.

Mixing drives in a pool isn't a great idea, but it's not forbidden either. The goal here is to get you back to a healthy state as fast as possible. The replacement drive has to be the same size as or bigger than the drive being replaced. You can figure out what to do once you're not degraded. This is a common way of expanding a pool. The vdev's capacity is limited by the smallest drive in it, so in effect you get no extra space until the last drive is replaced and silvered in. If you can connect the new drive without disconnecting the old one, so much the better. You can move the new drive after data integrity has been restored. ZFS tracks drives by a kind of UUID, not cable/slot position. You can scramble the drive connections and ZFS just stitches it back together at boot/import. Yes, this means you can resilver a drive in an external USB case and move it to your SAS/SATA ports after it's a member of the pool, provided the USB interface doesn't alter the drive geometry. (I don't recommend this USB method, but any port in a storm...)

One caveat for you to consider while thinking about this: RAIDz1 starts to get unsafe somewhere between 2Tb & 4Tb drives, depending on the uncorrectable error rate for the drive. Some are one in 1e14 LBAs, and some are 1e15. Once you get to 8Tb drives, you're almost guaranteed to have one LBA on each drive that will throw an error, and a drive failure will result in data loss if the bad LBA on the surviving drives contains data, so the recommendation is to go to RAIDz2. If you check my sig you'll see I haven't followed this advice, but I do split my pools by data importance, and keep lower-value bulk/transient stuff on the RAIDz1 pool. Only you can assess the risk, but this is the reason I didn't suggest just jumping to an 8Tb drive. Wasting 1Tb is reasonable under your circumstances. Converting a RAIDz1 pool from 3Tb to 8Tb drives is unwise, and you can't convert from RAIDz1 to RAIDz2.
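If you want to see where that back-of-the-envelope worry comes from, here's a rough Python calculation under the naive model where every bit read fails independently at the published rate (real drives don't behave this uniformly, so treat it as illustrative only):

import math

def p_ure(tb_read, ber):
    # Probability of at least one URE while reading tb_read terabytes.
    bits = tb_read * 8e12
    return -math.expm1(bits * math.log1p(-ber))

# Rebuilding a degraded 5-wide RAIDZ1 of 8Tb drives reads ~4 full drives.
for ber in (1e-14, 1e-15):
    print(f"1 per {1/ber:.0e} bits: P(URE during rebuild) ~ {p_ure(4 * 8, ber):.0%}")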
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
Most of the ripping process is automated; I could literally do it while I sleep. ;) But I get your point, it's not a fun activity. You make it sound like RaidZ1 has no safety at all though; how so? If one disk fails you can simply replace it and resilver the pool, can you not? The way I suggested (4-disk RAIDZ1 + 2-disk mirror) would let my easy-to-replace multimedia survive one failure, and my business data would survive potentially 3 disk failures all at once. At which point does it become overkill though? I'm genuinely curious now: are there stats that show the probability of 2 disks dying simultaneously? I can increase the offsite backup frequency to a nightly incremental, so I'll never get set back more than a day even if the whole thing implodes.
If you search just a little bit, you will find much discussion about using RaidZ2 versus RaidZ1 for large drives. But at the end of the day, it's all about risk tolerance. RaidZ1 gives some resiliency against drive failure - RaidZ2 gives more. For large volumes that would be a pain to restore, I am much more comfortable with RaidZ2. When I built my first system many years ago, I used a 3 disk RaidZ1. After a few years, I had one disk fail. It gave me a terrible feeling knowing that any additional disk error (or problem with resilvering) would cause me to lose data and I was extremely uncomfortable during that time. Although I didn't actually lose any data, I didn't want to feel that way again and immediately rebuilt the volume as a RaidZ2.

I have two datasets defined on my volume: 1) business data, and 2) media and other stuff. The business data gets backed up nightly. My media does not change very often, so I do an external back up once a week for that.

My family has become a big user of Plex. They don't have much tolerance when something doesn't work, so uptime and quick recovery when something does go wrong are priorities for me.
 

Sokonomi

Contributor
Joined
Jul 15, 2018
Messages
115
For me, at the end of the day, at least for the media pool, it's about ease of access, not so much about keeping the files secure. The fact that all may not yet be lost if one drive blows is just an extra bit of comfort. Only a very small part of my NAS is of actual importance, and that bit sees regular backups to several places (cloud and external). So I don't feel the need to 'double bag' my stuff as much as others might. I'm comfortable keeping it a Z1 raid, given what the files are worth and the other backup systems in place.

That said, after you mentioned that you can just 'silver in' new drives on the fly and let the pool organically grow into its new size, I went looking around, and plenty of people seem to follow this approach. I guess for some reason I figured that when it's time to upgrade you just decommission the whole array for new drives, to reset the clock on the whole array. But doing so would cause all drives to hit EOL at about the same time, so I presume it's better to spread it out a bit?

My motherboard conveniently has one SATA port left available, so I guess that's my cue to go find a nice new replacement drive to silver in and chuck the broken one. It seems 4Tb drives are the most cost-effective at the moment (priced the same as 3Tb for some reason), though I am spoiled for choice. Is the WD Red Plus still a solid choice? Is it worth considering the considerably more expensive Red Pro (5yr warranty instead of 3, but pegged at 7200rpm)?
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
One caveat for you to consider while thinking about this: RAIDz1 starts to get unsafe somewhere between 2Tb & 4Tb drives, depending on the uncorrectable error rate for the drive. Some are one in 1e14 LBAs, and some are 1e15. Once you get to 8Tb drives, you're almost guaranteed to have one LBA on each drive that will throw an error, and a drive failure will result in data loss if the bad LBA on the surviving drives contains data, so the recommendation is to go to RAIDz2
This really isn't true in a real-world sense.

The whole "RAID5 is Dead" is based on a misunderstanding of the URE rate with respect to independent vs dependent events. I go into it a bit more in my article on Hard Drive Failure Math (link in my signature).

As such, single parity RAID is not the disaster it "should" be for what are today considered "large" drives. I would have no problem recommending single parity for a 10x20TB array (or similar massive system) for relatively low-value data, assuming proper burn-in, testing, reporting, and backup were also present.
 

pschatz100

Guru
Joined
Mar 30, 2014
Messages
1,184
This really isn't true in a real-world sense.

The whole "RAID5 is Dead" is based on a misunderstanding of the URE rate with respect to independent vs dependent events. I go into it a bit more in my article on Hard Drive Failure Math (link in my signature).
@Nick2253 Maybe I am looking at your numbers wrong, but it looks like the potential for failure drops tremendously for a given array when using RaidZ2 versus RaidZ1.

@Sokonomi For the time and money I have invested in my system, including data, I consider the addition of an extra disk for RaidZ2 to be easily justified. This is why we refer to volume resiliency and risk tolerance. If you are willing to take more risk, then you can get by with less resiliency. I would say, however, that you should keep a sharp eye on your drives... they're old. If one of them is beginning to show problems, the others cannot be far behind.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
@Nick2253 Maybe I am looking at your numbers wrong, but it looks like the potential for failure drops tremendously for a given array when using RaidZ2 versus RaidZ1.
No doubt! Double parity is a huge step above single parity as far as resiliency is concerned. The point is that single parity RAID is far from "dead" in the sense that it was meant in the OG ZDNet article from the mid-2000s.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
This really isn't true in a real-world sense.

The whole "RAID5 is Dead" is based on a misunderstanding of the URE rate with respect to independent vs dependent events. I go into it a bit more in my article on Hard Drive Failure Math (link in my signature).

As such, single parity RAID is not the disaster it "should" be for what are today considered "large" drives. I would have no problem recommending single parity for a 10x20TB array (or similar massive system) for relatively low-value data, assuming proper burn-in, testing, reporting, and backup were also present.

Go re-read both of my posts. You and I are both stating the same thing, and I even admitted to running a large drive RAIDz1. That's hardly calling it dead.

The point I was making is that ZFS is quite flexible in drive replacement, and they should treat reestablishing redundancy as a higher priority than pool redesign. They are at the point where the URE decision-making starts to come into play, but it's secondary; I just didn't want to leave it out. Leaving the failing drive in the pool during the resilver/replacement allows it to be a parity source for the non-failing LBAs, offering a further reduction in the URE risk for exactly the reason you mention in your article. The failures are likely in clusters, and the clusters would have to land on the same LBAs as the theoretical UCEs in the "good" drives. Or am I missing something you intended to convey?

Is the WD Red Plus still a solid choice? Is it worth considering the considerably more expensive Red Pro (5yr warranty instead of 3, but pegged at 7200rpm)?
That's always a tough call. WD blighted us with SMR drives badged "NAS", so you have to do your homework. I believe the "Red Plus" is supposed to be CMR only, but they tried it once, so I have trust issues. The 3- vs 5-year warranty is something only you can assess. I have some 7200 RPM drives in my pools, but they're mostly old, rock-solid HGSTs (disclaimer: former HGST employee, I'm biased), and I find they tend to run a bit hotter in a home environment. I have a 7200 RPM Toshiba as well, but it's a single data point in a backup role at this point.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
I'm not sure we were saying the same thing. You said:
Once you get to 8Tb drives, you're almost guaranteed to have one LBA on each drive that will throw an error, and a drive failure will result in data loss if the bad LBA on the surviving drives contains data
This is just not true. Empirical data just doesn't bear this out.

Furthermore, the reported URE rate is nominally based on bits, not on LBAs or sectors. And these numbers have been effectively the same for decades, even though the technology underneath has substantially changed multiple times. (You're telling me that SMR didn't change the URE rate? Or that new motor technology hasn't improved it? Etc.) This is why the spec error rate is simply not a useful measure of actual UREs.

My point is not that RAIDZ1 has no risk; it's that you are dramatically overstating that risk to the point where you are effectively calling single parity RAID unsuitable. The fact that you are using RAIDZ1 vdevs for replicated data only seems, again, to say that you and I are saying different things.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I'm not sure we were saying the same thing. You said:

This is just not true. Empirical data just doesn't bear this out.
I don't work on empirical data. I work with manufacturers' published specifications.

Furthermore, the reported URE rate is nominally based on bits, not on LBAs or sectors.
And this distinction means nothing. A flipped bit is an error, either on the drive or when ZFS performs its checksum.

And these numbers have been effectively the same for decades, even though the technology underneath has substantially changed multiple times. (You're telling me that SMR didn't change the URE rate? Or that new motor technology hasn't improved it? Etc.) This is why the spec error rate is simply not a useful measure of actual UREs.
SMR UCE rates are irrelevant; SMR is not suitable for ZFS use. If you have another manufacturer-published specification to consider, offer it. For that matter, why would they not tout their improvements in UCE rates? You're trying to make an argument to ignore the manufacturers' published data. You have to bring some serious proof to the table for that.

My point is not that RAIDZ1 has no risk; it's that you are dramatically overstating that risk to the point where you are effectively calling single parity RAID unsuitable. The fact that you are using RAIDZ1 vdevs for replicated data only seems, again, to say that you and I are saying different things.
Dramatically? All I said was that the discussion has to start at some point around the time drives 4Tb and larger come into play. There's lots of slop between 1e14 and 1e15, even at 8Tb.

You seem to be here simply to argue. Good day.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
You seem to be here simply to argue. Good day.
I'm sorry that it's coming across this way. I promise you that I'm not. Tone is exceptionally hard to convey via text, so I'd ask you to give me the benefit of the doubt that I'm not here to be petty.

And if there are opportunities for me to learn, I'm here to learn. I would hope that 8 years and 1600 posts here shows that I'm not a fly-by-night rabble-rouser. I also recognize that you've been here quite a while, and that 800+ posts is a heck of a commitment. So let me be clear: I recognize that there's a lot of knowledge here, and I'm willing to learn when I'm wrong.

My intention here is to help educate and pass on useful information. Most people who are here are dealing with a massively complex problem, and are looking for good information to help them make good decisions to manage risk to their data.

I don't work on empirical data. I work with manufacturers' published specifications.
OK, so let's be clear here, then, when we talk about manufacturers' published specifications. Just as MTBF is not a straight-up expectation of lifetime for an individual disk, neither is the "non-recoverable error rate" simply a measure of how many UREs you should expect to get on a specific drive. These specifications are statistical abstractions based on manufacturer-decided workloads, and they must also account for marketing needs and liability concerns.

Empirical data is useful because it helps us understand what these statistical "specifications" actually mean in a real-world sense. Again, just as empirical data shows us that MTBF is not actually the measure of the expected lifetime of a single drive, we also know that non-recoverable error rate is not actually the measure of the expected URE rate for a single drive.

And I want to point out that this empirical testing is not something that just happens in some far-away lab. Every time a ZFS user scrubs their data, they are checking for UREs. If you look at the health status of your pool, any such error will show up as a "READ" or "CKSUM" error. In this day and age, if published URE rates were even close to accurate, nearly every ZFS user on r/DataHoarder would get hundreds or thousands of UREs every time they scrubbed, and we obviously don't see that.

SMR UCE rates are irrelevant; SMR is not suitable for ZFS use. If you have another manufacturer-published specification to consider, offer it. For that matter, why would they not tout their improvements in UCE rates? You're trying to make an argument to ignore the manufacturers' published data. You have to bring some serious proof to the table for that.
I'm afraid that I was unclear when I started talking about SMR. I would hope that you can give me the benefit of the doubt that I'm aware that SMR is grossly unsuitable for ZFS.

My point in bringing up SMR is that published URE rates for SMR and CMR drives are identical, even though we know that SMR drive heads do a lot more work (and are therefore subjected to a corresponding increase in wear). And that's not the only technology change that has happened while the published URE rate has remained constant. Of course, this, in and of itself, is not justification for ignoring manufacturers' published data. I would, however, hope that it opens the door to recognizing that the published URE rate may not be quite what it seems on the surface.

And you are exactly right: serious claims require serious proof. And we actually have that serious proof: we have peer-reviewed data on this issue, and that data shows that manufacturers' published specifications are exceptionally conservative, by at least an order of magnitude, and quite possibly by two or even three orders of magnitude.

Dramatically? All I said was that the discussion has to start at some point around the time drives 4Tb and larger come into play. There's lots of slop between 1e14 and 1e15, even at 8Tb.
Yeah, maybe I was being overdramatic :wink:.

And to your point, the slop between 1e14 and 1e15 for 4TiB of data is the difference between a URE happening roughly once a quarter (assuming a monthly scrub) and once every couple of years.
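(Quick check of those two figures, under the same naive independent-bit model the spec sheet implies:)

bits = 4 * 2**40 * 8                        # bits in 4TiB
for ber in (1e-14, 1e-15):
    per_scrub = bits * ber                  # expected UREs per full scrub
    print(f"1 per {1/ber:.0e} bits: ~{per_scrub:.2f} UREs/scrub, "
          f"roughly one every {1/per_scrub:.0f} monthly scrubs")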

The reason I think it's important to have this conversation, however, is that, at published URE rates, even double parity RAID is theoretically "dead" at this point. With a 10x3TB RAIDZ2 array, you still have a ~10% chance of total array failure per year, assuming published URE rates are accurate. And triple parity is getting there: with an 11x4TB RAIDZ3 array, you still have over a 1% chance of total array failure per year at those published URE rates.

And that's with the relatively small 3 and 4 TB drives available when I wrote that article. With modern drives of 10TB+, the annual array failure rate (again, using those published URE rates) of a double parity array gets into the 30%+ range.

Of course, anecdotally we see nothing like this kind of failure rate. Double parity RAID still provides exceptional risk prevention, and is plenty for most people. So it really comes down to being accurate about UREs, so that people can make a good decision about their data risks.

And when we're accurate, we realize that single parity RAID is actually good enough for most people as well. With most people storing easily replaceable media on their home NASes, single parity RAID works just fine. Sure, if they are unlucky enough to suffer a URE during a rebuild, they usually only lose a single file, and that can quite easily be recreated (re-ripped). Even in cases where much less replaceable data is stored (like family photos, etc.), a good backup plan is a lot better than a second parity drive.

But since we are living in the real world, we also have to recognize that a good backup plan is difficult for most people to implement and maintain, so adding a second parity drive is a much lower-hanging-fruit way of getting some extra security. Of course, RAID =/= backup, but they both help in the specific case of HDD failure. I will forever recommend that double parity be the minimum that anyone ever use on high-value data, simply because the cost of an extra HDD is so much lower than the cost of that data. It's much harder for someone to accidentally shoot themselves in the foot when they have double parity, so to speak.

Anyway, I feel like I've belabored the point long enough, and I'm sure I've said one or two things that are as clear as mud, so apologies in advance if this is somewhat confusing to follow.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
@Nick2253 - Unfortunately I'm time-limited. It's a long slog to work through, and my goal was to help Sokonomi get back to a data-safe state as quickly and economically as possible. We're kind of hijacking their thread to discuss this here.

Some thoughts: I suspect the root cause of the difference between the ZFS community's empirical data and the drive manufacturers' published URE rates lies with the ZFS scrub, environmental constraints, and of course the lawyers. Consider: the ZFS scrub is going to sweep the device, allowing the drive firmware to detect a "weak" LBA before it presents as a problem to ZFS. What the drive manufacturers do under the covers is considered a trade secret. The drive manufacturers are going to publish a URE spec that they can adhere to at the edges of the drive's environmental envelope. They have to meet that URE spec at max temp, min temp, max temp + max vibration, etc... Most of us only encroach on those edges inadvertently, as we've far too much $$ invested in drives to abuse them. The reason I mention lawyers is that there are people making purchasing decisions with much larger risk considerations than my Zoneminder feeds, or your Plex library. Somewhere there's a bank with a monster Oracle or IBM DB deployment sitting on drives containing a monarch's bank balance. The people who insure that bank like conservative URE numbers, and the drive manufacturers like selling extra parity drives. :smile:

On the SMR wear point: the SMR actuator should actually see less wear. It has to remain in position for 2-3(?) platter revolutions per write to perform the re-shingling. A 5900 rpm drive spins at 98.3 revolutions per second, or 9.8% of a revolution per millisecond. I'll guess average seek times are around 8ms today, maybe half that on high-end drives, and the track-to-track seek invoked for SMR is likely ~2ms... So a full seek is close to parity with rotation, but the point is the head isn't taking off to the other side of the platter and returning half a rotation later like a CMR drive might do.
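Rough numbers behind that, for anyone who wants to check the mental math:

rpm = 5900
rev_per_s  = rpm / 60                 # ~98.3 revolutions per second
ms_per_rev = 1000 / rev_per_s         # ~10.2 ms per revolution
print(f"{rev_per_s:.1f} rev/s, {1/ms_per_rev:.1%} of a revolution per ms, "
      f"{ms_per_rev:.1f} ms per revolution vs ~8 ms average seek")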
 