Of course not. Only you and other selected unnamed members who meet your criteria of "senior" do. ;)
I consider people that frequent the forums and try to provide help regularly as "senior", even if they don't officially have that title in the forum. I consider you a "senior" member.
Unless I see some calculations you used to arrive at that 99% number, it's a made up number and adds no more weight to your position than saying a unicorn told you.
I said 99% because I'm sure there's an exception somewhere that I haven't heard of. There's always "that one exception" people grab onto. Just ask the morons that don't wear a seatbelt because they heard a story about a single person that was saved from one accident because they didn't wear a seatbelt, and ignore the fact that many more are saved every year because they wore one. I prefer not to use absolutes and accept the fact that I don't know everything about everything and there are always exceptions in the world. There is almost always a non-zero chance of any given event occurring. Big deal. But feel free to argue with my "math" and not the real issue though.
You're cherry picking when you want to trust manufacturers based on when it lends credence to your already-held positions. Here you trust the manufacturer, but when it comes to long tests (the true subject of our discussion), you don't. Again, I have no problem with that. My own model is I trust them "somewhat" always. And I distrust them "somewhat" always.
I don't cherry pick what they do and don't tell me. I trust them at their word because there's more than a decade of SMART experience from many vendors (many that don't even exist anymore) that backs it up, and because the technical manuals on SMART describe an established standard (one that is only loosely followed by some manufacturers *cough* Seagate *cough*). If you aren't going to trust the standard to be mostly followed (if not absolutely), you shouldn't be trusting SMART at all. For all you know the manufacturer could just have a timer that says "after 2 minutes say it passes" and you and I would be none the wiser for it. We'd be a little upset when a hard drive keeps saying nothing is broken but we know we can't trust the drive with data. I'd bet everyone has had one of "those" drives before.
I get from that a long test will give some information about the quality of the drive. I don't need any further recommendations to run that once a month. If the drive can't handle a long test once per month, I want to know it so I can stop buying them. I'll address this more later.
It's not that a hard drive can't "handle" a long test once per month, it's just what they recommend. There's a point at which too much work on a hard drive wears it down faster. Where you want to draw that line is up to you. The manufacturers did their own failure analysis and came up with their own numbers. If you think you can do that analysis better than they can, I'm sure they'd be happy to hire you.
Alternatively, you could do it overnight like I said and avoid the pounding. I don't think it's useful to create situations that don't exist to support your position.
And you've never gone to bed leaving a movie streaming and a scrub started, right? And you've never accidentally had a long test and scrub run at the same time, right? It's already been mentioned that running scrubs is hard on the pool. It's not that reading is hard, it's that the constant seeking from a drive having to multitask between the scrub and retrieving data is hard on them.
Another hyperbolic (and definitive) statement you have no way of supporting. Allow for the possibility I (basically) know what it does and simply value it more than you do.
Really? I have no way of supporting what a SMART short test does? Check out this short paragraph from Wikipedia:
http://en.wikipedia.org/wiki/S.M.A.R.T. Don't be shocked... it says a short test checks the electrical and mechanical performance. If I had my FreeNAS server up right now I'd post the white paper on what a short test should include and what should NOT be included. The entire document (200 pages or so) is a very interesting read. If I can remember in 2 days when my server is back up I'll link it for you. So yes, I can support my comment.
Now that we've recapped what we both know, I'll restate that I see value in doing this regularly on some servers. You're in no position to say it definitively provides no value for my situation (or anyone else's, unless you really know their situation).
You are right, I am in no position to say it provides no value. But I'm not the one saying it provides no value. It doesn't, if you think about it (and read the white paper). But I let the manufacturers provide their own recommendations, along with the recommendations provided by the SMART documents on what a test should and shouldn't do.
It's not that I'm not in "their situation". See below, where I explain long and short tests and what they should (and shouldn't) find.
Titan_rw has it right when he said that it's been his experience that SMART short tests don't really do much. Their intended purpose was made moot because the issues that a SMART short test was intended to find would render the drive broken before you ever ran the test. Some manufacturers don't even have short tests at all anymore. They're easy to spot because if you run the test it returns "passed" 1 second later and does not show up in the short list of testing logs (because it literally does nothing and is only there because some RAID controllers freak out if a hard drive says it doesn't support SMART short tests).
I use each tool for what it's for. I don't expect a SMART Short Test to do what a SMART Long Test does, and I don't expect a SMART Long Test to do what a Scrub does (or vice versa).
If you did you'd recognize really quickly that a long test was never meant to be a way to preemptively find errors. That was the intent of SMART when it was created in the mid 90s or so: via monitoring... but not via testing. The aberration with long tests is that they were supposed to find bad sectors preemptively, but then ECC testing/verification was deliberately excluded (whoops). Whether that was an accident or on purpose is up for a very heated debate that started when SMART was being developed. Not to mention I don't think you even know what you expect to learn from a short or long test. You don't seem to have a grasp of what is actually tested when you run one. You just seem to understand that a short test isn't as thorough as a long test, but have no idea what is actually tested. There is some stuff that a short test checks but a long test doesn't, specifically because if you had those problems you'd know from other symptoms. (Car analogy: check whether your car can do 55mph without bad sounds, but don't test whether the starter is good. If the starter was bad you'd know, because you'd never get to the point of doing 55mph.)
You want to do a good useful test of the drive? Do a
dd if=/dev/da0 of=/dev/null. That'll make the drive read every sector and even do the ECC calcs that the long test doesn't do. In fact, it's often mentioned that this is better than any other test you can do, short of a write test to the drive, for a "surface scan". If you test all of your drives simultaneously and see a dd command that returns an error, or one drive that took significantly longer than the rest, you'll know something is up.
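That simultaneous check can be sketched as a tiny shell function. The device names in the example are assumptions; substitute your own, and treat this as a sketch rather than a polished tool:

```shell
# surface_scan: read every sector of each given target with dd in parallel,
# timing each read. A read error, or one target taking far longer than its
# peers, flags a suspect drive.
surface_scan() {
  for t in "$@"; do
    (
      start=$(date +%s)
      if dd if="$t" of=/dev/null bs=1048576 2>/dev/null; then
        echo "$t OK $(( $(date +%s) - start ))s"
      else
        echo "$t READ-ERROR"
      fi
    ) &
  done
  wait
}

# Example (assuming your data disks are da0-da2):
# surface_scan /dev/da0 /dev/da1 /dev/da2
```

The numeric block size (1048576 bytes) is used because `bs=1m` vs `bs=1M` differs between BSD and GNU dd.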
"Once per month or whatever" doesn't equal "once per month." For some reason, you're picking the most innocuous of comments to argue against. Montlhy (because that's the default) or whatever (you set it to). Cool? ;) But again, you're cherry picking how you apply your logic. When it suits your position, you'll say the people who put out a product (such as FreeNAS) know best. But when it doesn't suit your position, you call their decision making into question.
You've said you consider FreeNAS 8.x targeted towards enterprise users. It shouldn't be very difficult, then, to figure out why the team has defaulted to monthly.
I really have no idea what you were trying to convey right there. Check out Wikipedia's comment about scrubs and resilvering...
The official recommendation from Sun/Oracle is to scrub once every month with Enterprise disks, because they have much higher reliability than cheap commodity disks. If using cheap commodity disks, scrub every week.[41][42]
And check out references 41 and 42. 41 actually directs you to FreeNAS documentation that used to say weekly and monthly. How cool is that? Even the FreeNAS documents themselves call out weekly and monthly as I prescribed above. Now maybe you understand why FreeNAS' default scrub of 35 days seems very awkward, considering their own documents disagree. Of course, this is still ignoring Sun/Oracle, who also say weekly and monthly. But don't let a few little links get in the way of your facts. I don't consider 35 days monthly (but I'm anal like that with what is defined as "weekly" or "monthly"). Monthly implies 12 times per year; 35 days gives you only 10.4 tests per year.
Also, I don't consider FreeNAS to be a product that is targeted at enterprises. I'm not sure exactly who iXsystems really targets TrueNAS at. Enterprises would hire true FreeBSD wizards to run their servers and would keep their own in-house support when using FreeBSD. There are so many options, settings, etc. that FreeNAS doesn't even begin to cover every scenario. I'd say it targets small to mid-sized businesses, but home users are also enjoying it. FreeNAS built the bridge between expensive salaried employees that manage FreeBSD servers and smaller businesses that would like most (but not all) of the benefits without the "expensive salaried employees".
Out of curiosity, who does "you" refer to here? It seems incredibly condescending to make such a statement because someone chooses to do something differently than you would. I assure you there are things you say on a regular basis that are different than how I would do them. But I have a big enough world view to understand other people can arrive at the same destination by taking different routes more in sync with their situations (needs, experiences, funds, etc). And that although my ways may seem "99%" right to me, that doesn't always make it so right for someone else.
Just to clarify, that was your response to my statement that " If you can't figure out which category you fall into and if you don't want to follow those recommendations, that's your own business." All I was saying was that if you can't figure out if you are using enterprise class or consumer class drives and if you choose to not follow the recommendations for the particular class you are in, that is your business. And you for some reason felt it necessary to bring up the 99% again when I was discussing SMART tests and not zpool scrubs. Please keep my comments to the actual topic I was discussing.
So you ignore their recommendations? Because you can't figure out which category you belong to? ;)
I don't ignore their recommendations for weekly scrubs (since I have consumer drives). If you read the entire ZFS document provided by Sun, they recommend weekly for zpools that have changing data. My data is actually very static. Less than 10MB/day actually changes. And when I do move large quantities of data, I either move the data before a scheduled scrub or I run a manual scrub afterwards. I helped a friend with his desktop yesterday (copied 2TB to my server), and a scrub will trigger tomorrow morning. His copy of the data still exists on his desktop until Wednesday, so I preemptively set myself up for the scrub. Just like the manual says, if the server is offline for extended periods of time, scrubs every 45 days are recommended. I wouldn't go booting up my server every single week just to run a weekly scrub.
In other words, you tailor your approach based on knowledge of the value of each product and your situation (experiences, etc.). Seems reasonable to me. Allow for others to do the same. I have no problem with anything you said here. What I have a problem with is if you think 14 days is the correct time period based on your situation, if someone else says 21 days (or 30 days), you seem to get a feeling in your gut they're wrong. My position is they'd be wrong only in the sense that you'd be wrong to someone doing them every 7 days. Some things aren't an exact science and we shouldn't treat them as if they are because we see a choice different from one we'd make.
No, I feel they're wrong because they're deviating from the recommendations of the inventors of ZFS. Who really thinks they know more than the inventors of it? I'd add someone to my ignore list immediately if they started making those kinds of claims. I really don't need to read what spews from their keyboard into the forums. Just look at the discussion on another thread about the sync=disabled recommendation. That was an interesting topic to say the least.
You don't monitor because things do what they should do. You monitor to see when they do what they shouldn't do.
You're talking about normal operation, which to me is completely beside the point.
No, I'm talking about even the abnormal. I had a fan fail and it still took hours to see a 4C difference. If 4C is the difference between life and death for your hard drive, your hard drives are already running much too hot.
Just like FreeNAS' "Difference" field for temperature checking. If you check it every single minute and set 2C, you'll probably never ever hit the trigger. Hard drives don't heat up that fast even with no cooling. That's why you have the "Informational" and "Critical" fields as shown in 8.12 of the FreeNAS manual.
What does monitoring temps when things are going well have to do with monitoring for when things don't? I don't design systems based on things going well, and I doubt you do either. I design systems trying to anticipate every point of failure.
So what you're telling me is a hard drive in a case will either run hot on day 1 or basically never? If so, I disagree.
Yes, but only insofar as if you build a system and everything is static, the temps shouldn't change drastically. But things aren't static (fans fail, A/C breaks, etc.) so temps do change. Still, you aren't going to see a hard drive go from 40C to 60C in a short time frame because a fan failed. Even when a server room I worked in had its A/C break, it was 2 hours before it was even relatively warm for the humans. I'd wager that if the ambient room temperature increases so fast that you're worried about cooking a hard drive between 30 minute temp checks, you have a unique situation which you're already aware of and probably have other plans in place to help mitigate. Monitoring hard drive temps should be the least of your worries, because you should already know from experience that you need to preemptively shut down the system.
I didn't quote everything you said here since most of it was just clarifying what you were saying previously (I get what you're saying now), but if you feel I'm quoting you out of context, I apologize. That said, wow. I really don't get how you say option A) is the same whether you do a long test or a scrub when the scrub doesn't touch any portion of the disk without data. I simply want to check monthly whether there's something wrong with the drive in areas of unwritten data, and I believe that can happen at times other than when the drive is first put into service. In fact, I know from experience it can. Thus I do long tests periodically.
I'll explain it slightly different:
When a read request is made to the drive and something goes wrong, the 3 most common failures are that there is a firmware bug, something physically wrong with the head or platter, or sector data + ECC don't match.
Firmware bugs are typically disk-wide and present themselves very quickly after the issue begins (see the Seagate firmware bug from 2008 for a great example). Firmware bugs often turn into a drive that drops itself from the host machine, fails to even be detected on bootup, or other very catastrophic errors. You'll get an email if a drive drops from the host machine because the 30 minute SMART query will fail with no drive attached (and you'll continue to get an email every 30 minutes forever). If it fails on boot up you'll get your nightly email that a disk is missing, as well as an email that the SMART query failed (again, every 30 minutes until a SMART query succeeds).
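The 30 minute query I keep mentioning can be sketched like this. `smartctl -H` is the real smartmontools health-query flag, but the script path, device name, and alert hook are placeholders, not anything FreeNAS ships:

```shell
# smart_health_check: query a drive's overall SMART health with smartctl
# (smartmontools). If the query itself fails -- e.g. the drive has dropped
# off the bus entirely -- report it so a cron/mail job can raise the alarm.
smart_health_check() {
  dev="$1"
  if smartctl -H "$dev" > /dev/null 2>&1; then
    echo "$dev: SMART query OK"
  else
    # Hypothetical alert hook: pipe this to mail(1) or your notifier of choice.
    echo "SMART query failed on $dev"
    return 1
  fi
}

# Run it from cron every 30 minutes, e.g.:
# */30 * * * * /root/smart_check.sh /dev/da0
```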
If there is something wrong with the head or platters (such as debris, or something physically breaking off) you won't have a single sector or even a small number of sectors go bad. Physical issues inside the drive manifest themselves very quickly, and are made worse very quickly if it's debris, because the disk head will spread the debris all over the platter (and the head may even be damaged by it). That disk will go very quickly from working just fine to significant problems with reads and writes all over the drive. Zpool performance will tank (the pool may even become unresponsive), you'll start getting emails about disk read and write errors, zpool status will show error rates that increase rapidly as you read and write data, and you'll possibly hear the click of death from the bad disk. Even something as innocuous as streaming a movie will manifest major problems on a drive with a small amount of debris. Once the debris is deposited on the platter and the head makes a single pass over it, the damage will begin and progress rapidly.
A sector + ECC mismatch would only be found by a short test if the test happened to hit that one sector in the 1-2 minutes that it runs. It's very unlikely that, out of a few billion sectors, your test will happen to hit that "one" bad sector. So I consider it a waste of time to even consider that any SMART test will find a sector whose data and ECC don't match.
So, for firmware bugs as well as physical issues (which are the only 2 failure mechanisms that long tests can easily find), you will know the drive has major problems long before you even run that nightly/weekly/whatever long test. So what failure mode does a long test find exactly? None. It'll provide you with an error that will allow you to do an RMA, but it did nothing to preemptively find any new issue that a scrub (or even regular use, since you'll get an inbox full of emails in short order even if the system is idle) wouldn't have found. But you've added wear and tear by doing lots of long tests.
Now let's examine short tests...
Short tests are 90% silicon and only include a very short read test (it could be pseudo-random, or just read the first 10 tracks). So not much reading from the hard drive, but enough to prove that the head and servo function properly. But if you look at the silicon tests, they include things such as communication checks between the SATA controller and hard drive, internal communications between different components on the hard drive, a firmware checksum check, an on-disk cache check, etc. If any of those things "went bad" you'd know right away. For communication issues, firmware checksum failures, and the like, the disk won't even finish its own POST and will be unavailable to the host system. AKA you'll start getting 30 minute emails that a SMART query failed until you fix it. If the on-disk cache failed, you'll also know very quickly, because as data is corrupted ZFS should start racking up lots and lots of errors. AKA you'll get an email at night for sure with loads of errors, and you'll potentially have other issues with the drive besides just some reads and writes, since the loaded firmware may be trashed. If the disk goes offline you're sure to see 30 minute emails in your inbox.
So, what common failure mode was not sent to you in an email via a SMART query or regular use? None. That's precisely why I don't do the SMART tests at all except when I first buy them.
Now, what about desktops? Take your standard desktop, look at how those failure modes work and how they manifest themselves into "the drive no worky worky", and then ask whether a long and/or short test would have helped. Would they have? Nope. Not any more than using them on FreeNAS.
What SMART does do is Self-Monitoring, Analysis, and Reporting. It records certain values throughout the drive's life and tries to predict its own failures. Check out the list at
http://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes. Many of those are recorded based on regular use. They aren't related to or influenced by scrubs, SMART tests, etc. For example, 0x03 is spin-up time. If that starts getting longer and longer then the drive can reasonably assume that it's having some kind of error outside of established parameters. Your SMART reporting tool should give you a warning (I have a bad drive that gives an error for this exact parameter and warns to replace it soon; I keep it in my box because it's a great way to experiment with SMART apps in Windows, Linux, FreeNAS, etc.).
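Watching those raw values drift can be scripted instead of eyeballed. This is a sketch that diffs two archived `smartctl -A` outputs; the field positions match the usual smartmontools attribute table (ID in column 1, name in column 2, RAW_VALUE in column 10), but verify them against your own drives' output:

```shell
# smart_drift: print SMART attributes whose RAW_VALUE differs between two
# archived `smartctl -A` snapshots (oldest file first).
smart_drift() {
  # Keep only attribute rows (lines starting with a numeric ID), as name/raw pairs.
  awk 'NF>=10 && $1 ~ /^[0-9]+$/ {print $2, $10}' "$1" | sort > /tmp/smart_old.$$
  awk 'NF>=10 && $1 ~ /^[0-9]+$/ {print $2, $10}' "$2" | sort > /tmp/smart_new.$$
  join /tmp/smart_old.$$ /tmp/smart_new.$$ |
    awk '$2 != $3 {printf "%s: %s -> %s\n", $1, $2, $3}'
  rm -f /tmp/smart_old.$$ /tmp/smart_new.$$
}

# Example: smart_drift archive/2014-01-01.txt archive/2014-02-01.txt
```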
So now let's throw ZFS scrubs into the mix. Sure, only half the drive may be full, but the point will still be made. ZFS is so great because if anything is wrong in the path the data takes between the hard drive platters and the CPU, ZFS will throw a fit. So even if the issue isn't necessarily with the hard drive itself, but it's still responsible for corrupting data, you are still safe. (This is where Sun/Oracle list ECC RAM as a requirement: to assume the data is being corrupted somewhere besides RAM, you are implicitly trusting that the data in RAM is correct.) Even in a worst case where the hard drive is deliberately giving you trash results, ZFS will still keep you safe. Because ZFS is doing its own internal checks to verify the data is correct, you don't need to worry too much about a single point of failure (although a SATA controller failure, PSU failure, etc. could be devastating).
So what should be taken away from this whole thing? Long tests don't really tell you much you won't learn on your own just by having the disk in the system. Short tests don't really tell you much you won't learn on your own just by having the disk in the system. SMART parameter monitoring WILL tell you what is going on... just by having the disk in your system. So just use the drive and monitor the SMART parameters. That's why I set up the nightly SMART emails and keep them in my archive. Every so often I do a line by line comparison of one from today and one from the archive. Drives that only have small temperature changes can be ignored. But that one drive with 3 parameters that are slowly changing over a period of weeks or months may be a drive to look at more closely. Is it in a location where it is susceptible to higher temperatures, more vibration, etc.? Usually you can see what the trigger value is for the SMART warning that the disk should be replaced. If I see that some value increases by 2 per month, it's at 4, and the warning comes at 10, I can decide if I want to replace the drive right now or wait until it gets to 10. Do I want to replace the drive when it hits 10? Do I want to ignore the warning and keep using the drive as long as scrubs keep coming back clean? Those are the questions the server admin gets to choose for themselves.
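The back-of-the-envelope projection in that last example can be captured in a throwaway helper. The numbers and the round-up choice are mine, purely illustrative:

```shell
# months_left: rough integer estimate of how many months until a SMART raw
# value, drifting by a fixed amount per month, reaches its warning threshold.
months_left() {
  current="$1"; threshold="$2"; change_per_month="$3"
  # Ceiling division: count a partial month as a whole one.
  echo $(( (threshold - current + change_per_month - 1) / change_per_month ))
}

# Example: value is at 4, warning fires at 10, drifting +2/month:
# months_left 4 10 2   -> 3
```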
You're making up 99% again to support your position. It's an interesting way to structure a statement by excluding the value to make the statement there is no value. I'm not sure what you mean by "correct" but I think I've already said or implied I want to know about a fading disk so I can replace it. If it's fading 23 months into a 24-month warranty, I want to know before I start writing to that area in month 25 and start experiencing catastrophic errors.
I didn't exclude anything by saying 99%. I specifically included the chance of something going wrong that is completely unexpected from the manufacturer's perspective. For instance, I'm sure they didn't account for the possibility that the hard drive might spontaneously catch on fire, because it's just not that likely.
Can you provide a cite that running a long test once per month on a healthy drive during downtime is harmful to the drive? Of course I could say I'm 99% sure that isn't true, but we've covered my view on statistics. Rather I'll just say that in my experience (and I've been doing it long before experimenting with FreeNAS), it has not caused my drives to prematurely fail. And we're talking thousands of drives if you include all the systems I'm responsible for. (Though I'm not the admin, I am responsible for approving policy.)
1000s of drives... big deal. I used to be the lead IT person at the largest Naval command on the east coast. We had 1000s of machines, many with more than 1 drive. Quantity only means so much for so long. I'm sure your company wasn't handholding and SMART-monitoring every single drive on every single machine, getting an email the instant one machine had anything that remotely resembled an error, was it? Companies don't pay employees to monitor large quantities of hard drives in desktops. It's more cost effective to just replace a drive when it fails, or when the user calls up and says that when the computer POSTs they get this weird warning they don't understand, and to tell employees to do their own backups or store their data on the servers.
I think you do a lot right. I'd have no problem hiring you to build or admin a system for me. And one day I might. But only because I'm used to dealing with people who think they invented logic and I'm pretty good at letting it roll off most of the time. ;) But let's take the flip side of what you said above. If I tell you I've been doing things since soon after SMART technology arrived and they have been working for me, how does that fit in your world view? As I've said, sometimes there are many ways to skin a cat, all roads lead to Rome, etc. If you want to tell me you've decided you don't value the benefit of SMART long tests, I'm OK with that (as I've said). If you say there IS no value, I'm not.
It doesn't affect my world view at all. I was around when SMART was first being conceived. It never worked out as well as anyone had hoped, and some companies hate it (and deliberately modify SMART to be less thorough to fit their wants/needs... just look at the ECC thing) because the industry standard is that a failure of the long test (or failure to be able to run a long test) is the de-facto standard for "an RMA is authorized". Before SMART, each company had their own tools you had to run, their own ways of testing the drive, and different levels of "bad" regarding when an RMA is or isn't authorized. SMART was supposed to help everyone by giving consumers a baseline for "good" and "bad", (hopefully) giving consumers a preemptive warning of impending failure, and giving manufacturers the ability to gather data on their hard drives' health and see how they could improve their own reliability. None of that really worked out too well.
Maybe, maybe not. I don't presume to know exactly what a long test will find because I don't keep up with what every firmware of every drive from every manufacturer does. I doubt many do. So in that sense I trust that it performs "some tests" and finds "some conditions". I'm good with that since it's only the blank areas of the drive I care about anyway (scrub will handle the other areas).
You don't need to. As long as you understand what stuff is possibly tested, and understand that anything that would show up as a test failure would almost always manifest itself with your drive behaving so badly you'd be getting emails from your server, who cares? Just like if Seagate does only half the tests WD does, that doesn't mean I'd trust WD's test any more or less than Seagate's. It means that if I have to RMA the drive, Seagate's test may be less thorough than WD's, and Seagate might tell me the drive is good where WD wouldn't.
Again, for an unused area of the drive, scrub will do even less than a long test.
I explained above why the unused areas really don't matter anywhere near as much as you think, and included a dd command that works well if you really think they do matter.
Are we talking LONG or SMART tests here? I don't think a SMART test every 30 minutes (mea culpa, I actually do the SMART inquiry every 30 minutes) or a LONG test every month is equivalent to what you're suggesting, but either way, I want to know what you're referring to so I can respond to it. But that said, if I agreed they WERE analogous, I'd agree with your statement.
So you can't grasp the concept that when you use something you are always causing wear and tear, even if only so very slightly? Why do you think the servo and drive head are almost, but not completely, insusceptible to regular wear and tear? The head doesn't touch the platter and a magnetic field moves the head, so the amount of wear and tear is minimal, but not zero. Drives wouldn't last an hour if they had the wear and tear associated with physically contacting moving parts.
Non-IT story: In a past life we had a pressure switch. It performed a very, very important function. By law, we had to test it every 3 years. Well, we thought "we want to be superior performers, so we're going to calibrate the pressure switch every 18 months". So we did. After 5 years we noticed that these switches were going bad far before their expected lifespan. Upper management freaked out. So let's start doing them every 12 months! So now we're calibrating them every 12 months, and they're failing even more frequently than before. At $25k per switch and a 6 month lead time for a replacement, this was a very big deal. We had a lot of these switches that we were trusting for safety reasons, and now their reliability was being questioned. They should have lasted at least 20 years, but we could predict the failures: if we had calibrated a switch more than 3 times, there was about a 50/50 chance it would fail the next time. The manufacturers were baffled, because these switches were sold all over the world, many in far more inhospitable conditions than ours were in.

What did we eventually figure out? That when we were calibrating the switch, even microscopic pieces of debris could get inside when we hooked up our calibration equipment, and that we were actually ruining the switches by testing them more frequently than the manufacturer recommended. The actual test itself wasn't the problem, nor was the equipment we used. It was just the fact that we were opening up the system via a single very small valve and hooking up test equipment. Surprise! Despite the fact that we were in the business of calibrating 1000s of switches from many different manufacturers, different designs, different materials, we learned a very important lesson: you shouldn't always think you know better and deviate from the manufacturer. So we replaced all the switches and started testing them every 3 years. Never had any problems with them again.
"Too much" is a subjective term. So while the statement is true on the surface, it doesn't inherently shed any light on our respective positions.
It is subjective. So be objective. At what point are you testing but not really getting any value from the test? If your test is doing more harm than good, why are you running it? And by value I mean you are predicting failures ahead of time more than you are adding wear and tear to the drives. Hint: you as a consumer are not privy to the information on how much wear and tear you cause on a drive just by testing it.
So remind me why you think you know better than the manufacturers?
But again, you're using a far exaggerated example that has no relationship to anything we've discussed to support your point. Even a 30 minute SMART test (which you admit is mostly in silicon) is nowhere near doing a glucose test seconds after you just did one.
Exactly. Because your testing methods are wildly exaggerated relative to the relationship between failures and your ability to find those failures. I chose seconds instead of 30 minutes because the human body is designed for self-repair while a hard drive is not. A single drop of blood every 30 minutes wouldn't cause much blood loss over your entire lifetime, so I said seconds to make the point. Your hard drive can only monitor its own demise over its lifespan.
I agree with the conclusion of your strawman. ;)
You call it a strawman because you don't like the fact that I've proved my point.
OK cool, we're back to long tests... ;) So I guess when I tell you I HAVE had long tests find failing drives preemptively, you can only conclude that I'm lying to you. At which point, I'm not sure where we go in the discussion.
Nice of you to presume that I'm accusing you of lying. I really appreciate it! I don't think you're lying, but I think your drive would have had a longer lifespan, and you still would have been able to replace the drive without losing data, if you hadn't been running all of those tests. There's no way to know either way, but it's easy for you to jump to the conclusion that the long test must have saved you because the two clearly are correlated.
Shorten it from expected, or shorten it from some theoretical maximum (such as 10 years down to, say, 8)? Because we've been doing this on our production drives without issue. Let's say WD Blacks have a 5 year warranty, so the expected life is 5 years. That's 60 long tests if you're doing one per month. You're telling me you think 60 long tests is likely to kill the drive? Or are you just saying that over time, those 60 long tests will shorten the life expectancy from (say) 5 years to 4 years, 11 months, 30 days? Or what exactly is it you're saying, and can you provide any proof of it? Because anecdotal experience won't settle it, since we each have our own and they don't agree.
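For scale, do the math (8 hours per long test is my assumption; the actual duration depends on drive capacity and speed):

```python
# Back-of-envelope: how much of a 5-year warranty period is actually
# spent running monthly SMART long tests? (8 h per test is an assumed
# figure; real duration depends on the drive.)
warranty_years = 5
tests = warranty_years * 12              # one long test per month
hours_per_test = 8                       # assumed
test_hours = tests * hours_per_test
warranty_hours = warranty_years * 365 * 24
share_pct = 100 * test_hours / warranty_hours
print(tests, test_hours, round(share_pct, 1))   # 60 480 1.1
```

Under those assumptions the drive spends roughly 1% of its warranty life running long tests.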
I don't give a crap about theoretical. Nor do I really care about the warranty. The warranty is only enough to release the manufacturer from replacing my drive.
I think that running long tests that really don't tell you anything is broken that you wouldn't figure out on your own from emails, etc. only adds wear and tear. How much, I have no clue. To put a solid value on "how much" would take a few million dollars and quite a bit of time and analysis. But I think we can both agree that if you can decrease the wear and tear on hard drives, you should. And I've tried to explain above why the long and short tests really don't help identify a failing drive:
1. A test can easily be "destructive": it might pick up a stray particle from a part of the disk you would never have used anyway, damaging the head or platters and breaking the drive.
2. A long or short test doesn't have even a reasonable chance of finding a common hard drive problem that wouldn't be self-evident very quickly anyway.
3. A long or short test adds wear and tear.
So why are you running tests that are really only useful in remote circumstances, can be destructive to the drive, and don't have a reasonable chance of identifying a failing disk before it's already unreliable?
Again, I'm finding it difficult to quantify what you're saying here. By your logic, doing CHKDSK on Windows should be avoided because it wears down the drives. Or for that matter, scrubs. Because doing a scrub every 2 weeks is SURELY killing the disk faster than doing scrubs every month. Right? All this is subjective? We have to separate the 2 sides of the equation so we're clear where your complaint lies. There's the cost side and the benefit side. I disagree with you on the benefit, but also consider it subjective based on situation. I think that ship has sailed and we probably won't meet up. On the cost side, I currently disagree with you, but I'm open to reading evidence that 60 long tests over 5 years poses an unacceptable risk to a healthy hard drive. Because I give it a greater value and a lesser cost, we arrive at different conclusions. The problem is you don't accept that the valuations are subjective, and not as absolute as you believe they are.
No, what I'm saying is that there is no value in long and short tests. I never made any comparison to chkdsk, and if you were doing scrubs every day I'd make the same argument: you're adding no real value but definitely making those drives work hard for their paycheck.
I laughed at the morons that had scripts set up on their first-gen SSDs to erase the unused sectors (this was before GC and TRIM), and gee... later they found out that doing it more frequently than once a week (for most users, once a month) only caused excessive writing to the limited write cycles of the memory cells, because the custom tool always rearranged the data (kind of like doing a defrag). And I laughed all the way home because I never had any issues: I ran the zeroing-out tool once a month just like the manufacturers recommended, and I was happy. Every SSD manufacturer of the time that released tools to do GC saw a very high failure rate as soon as those tools were made available. Some of them even started rejecting RMAs from drives that had run out of write cycles, because it wasn't likely that in 4-6 months you'd actually hit that limit with regular drive use. It wasn't a coincidence either. It was all the "cool kids" that wanted every single ounce of write performance from their SSD and thought they were so much smarter than the manufacturer that they did their own thing. Even now, I know a guy that uses the Intel SSD Toolbox and does daily runs of the tool "for maximum performance". I wonder why the SMART data on his drive says less than 80% life remaining, despite his drive being slightly newer than mine, same gen and firmware, while mine says 96%. I don't make any effort to save cycles on my drive, and he's got 10% of his drive unpartitioned and does a lot of other things to try to extend its life because "his SSD lifespan is sucking".
Exactly my feeling the first time I learned about ZFS scrubs, resilvers, et al. "Woah, that's really going to make the drives work up a sweat." Then I got over it. Because a healthy drive can deal with it. And if it makes a marginal drive fail, screw it. Life's short. I'll replace it and move on. It's worth the cost to me. And the flaky drive was probably going to go soon anyway -- probably on a Friday at 5 p.m.
I disagree! 6pm on a Friday before a 3 day weekend. That way you get home then get the emails that things are getting ugly. :P
But again, WRT ZFS scrubs, if you were doing them daily I'd be arguing that you're crazy for doing it that frequently. You're adding wear and tear with no value added. So why work a healthy drive down to marginal when you don't have to?
If it's under warranty, probably. It depends on how bad the drive is. You've no way of knowing a long test will only return 1 (or a few) bad sectors. You have to run the test. I'm not sure I understand what you mean by pass/fail. That's not what I recall seeing. On the last drive I RMA'd, it told me how many errors it encountered. It even told me the LBA. It just didn't stop on the first error and say FAIL.
I disagree, but that's your own personal preference. I'd never want to waste my time trying to RMA a drive that won't fail me in its expected lifetime. For all you know, the new drive might not be any better. I've had far more "recertified" drives require replacement than new drives. I have an old laptop IDE drive that has some problems around the 119GB mark of its 120GB. Did I RMA it? Nope. Because I understand that an issue that suddenly appears and doesn't spread rapidly isn't likely to get worse, I just partitioned the first 110GB and left the rest unused. Statistically, I had better odds from that drive than from a "recertified" one. I've had the drive for 7 years now and it still makes a great external USB drive. I ran a write/read/verify pattern test on it in January: zero errors.
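In case anyone wants to do the same, a write/read/verify pattern test is conceptually just this (sketched here against a scratch file so it's safe to run; on a real drive you'd point it at the whole block device, which is destructive and needs root):

```python
import os
import tempfile

# Sketch of a write/read/verify pattern test. Against a real drive you
# would open the block device (e.g. /dev/sdX) and cover the whole usable
# region; a scratch file stands in here so the sketch can't hurt anything.
BLOCK = 4096
BLOCKS = 64                                   # tiny region for the sketch
PATTERN = bytes(range(256)) * (BLOCK // 256)  # repeating 0x00..0xFF pattern

path = tempfile.NamedTemporaryFile(delete=False).name
with open(path, "wb") as f:                   # write pass
    for _ in range(BLOCKS):
        f.write(PATTERN)
    f.flush()
    os.fsync(f.fileno())

errors = 0
with open(path, "rb") as f:                   # read/verify pass
    for _ in range(BLOCKS):
        if f.read(BLOCK) != PATTERN:
            errors += 1
os.remove(path)
print("blocks:", BLOCKS, "errors:", errors)
```

Any nonzero error count on a region you just wrote means the drive (or the path to it) can't be trusted with data.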
Your choice, and a valid one. Different choices are also valid. As I said, I'd rather not wait until the warranty runs out. We have too many drives and honestly, to use one of your favorite phrases, I should REALLY be fired if I'm letting faulty drives sit in servers until after their warranties expire when I could know they're faulty before the warranty expires. There are many factors that come into play. If you consider nothing else of what I've said, please at least consider this.
Yeah, except there are more factors than just getting a new drive before the warranty expires. What if the recertified drive only lasts until 1 day out of warranty? What did you save? This whole "what/if/but/maybe/kinda" is old. It's pure and simple statistics and the likelihood of a scenario you don't want. Everyone wants to be on the winning side of the curve, but you definitely aren't considering the big picture.
All I ask is that we discuss the factors as best we can and identify the credibility we assign to said factors, as well as our assumptions. I've no horse in the race. I trust my experience and my clients trust my experience, so no loss there. I've been doing this a long time as well (35 years). I just want the people reading to be able to make decisions for their own situations based on understanding the factors involved in the decision, not, "A senior moderator said so."
There's a reason why many RAID controllers don't let you do SMART tests, and it seems to be getting worse. They really are a waste of time. If you have a disk that you think is flaky, a long test may prove it is bad, but it won't do much else. At best, a long test tells you that a bad disk is bad (what a shocker), and at worst it is simply wearing down a drive that is currently fine.
So before you start prescribing tests that you don't understand, you should figure out what they do, how they do it, and if they really add any value in even performing them. If you don't intend to do that homework, you should seriously consider just following the manufacturer's recommendations.