HDD Standby and SMART problems


SavageBrewer

Cadet
So I am new to FreeNAS, but not to storage. I usually work on large EMC and Hitachi/HP storage arrays that are always on, so I don't worry about things like spinning down disks to save power.
But I am turning to FreeNAS for a home storage solution, and right now I am having an issue.

Is it normal to have SMART interfere with the spindown of drives?
In Services - S.M.A.R.T. I have the check interval at a default of 30 minutes.
I also adjusted the power mode to "Standby - Check the device unless it is in SLEEP or STANDBY mode"

Under System - S.M.A.R.T. Tests I have one job created and that is to do a "Short Self-Test" every 5 days at midnight.

After playing with APM I disabled it entirely; I want the disks either spun up at full speed, or in standby and not spinning at all, to cut down on electricity and wear.

Here is the issue: SMART appears to be interfering with my drives going into standby.
If the standby timer is set to less than 30 minutes (the check interval) the drive will spin down at the specified time.
But if the standby timer is set to say 30 or 60 minutes, it never spins down.
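(A quick way to check this from the shell, without waking a sleeping drive, is smartctl's power-mode option. A rough sketch; /dev/ada0 is just a placeholder for whatever device names your system uses.)

# Only query the drive if it is NOT in SLEEP or STANDBY; otherwise smartctl
# reports "Device is in STANDBY mode" and exits without spinning it up.
# As I understand it, this is the same behavior the GUI's "Standby" power
# mode asks smartd to use for its periodic checks.
smartctl -n standby -i /dev/ada0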

Is the behavior that I am seeing normal or expected?
I assumed that because I set the SMART test job to run only every 5 days, this should not happen.

This is an area where it would be nice to see better documentation, as the interactions between SMART, APM, and standby all seem to cause issues with each other.

On another note, if the drives are in standby when the test job is scheduled to run, will the system wait until the drives are spun up and then run the check, or is that interval skipped until the next scheduled job triggers?

It almost seems to me that you have two choices:
either you leave the drives spun up all the time and the SMART checks run properly,
or
you put the drives in standby and go without regular SMART checks.

Or am I missing something entirely here?
Spinning my drives down only saves about 20 watts of power consumption, and the drives are consumer grade so they are not designed to be spinning 24x365.
The reality is I will only be accessing files on this probably one or two days a week, so having it spun up 24 hours a day seems excessive.

Any input is appreciated.
 

ProtoSD

MVP
Do some searching here in the forums, this topic has been discussed here numerous times.
 

cyberjock

Inactive Account
Or am I missing something entirely here?
Spinning my drives down only saves about 20 watts of power consumption, and the drives are consumer grade so they are not designed to be spinning 24x365.
The reality is I will only be accessing files on this probably one or two days a week, so having it spun up 24 hours a day seems excessive.

This has been discussed to death, but I will give you one tidbit. Several of the senior posters in the forum have noticed something: despite the fact that consumer grade drives aren't designed to run 24x7x365, they seem to last longer if they do. This has been observed by quite a few people. I have no idea why this is the case, just that it is. My instincts tell me it's the constant temperature cycling from spinning up and down, but I don't have any solid evidence to prove that. In your case, I'd leave them spinning 24x7.
 

SavageBrewer

Cadet
I know the standby / spin-up / spin-down discussion has been done over and over again, and in my case, because I will probably only be using the array once or twice a week at most, I figured enabling some power features would be a good idea.
I agree that if you access your NAS multiple times a day or even daily, it is a good idea to leave the spindles spinning.
That is not what I am asking...

From my earlier post:
In Services - S.M.A.R.T. I have the check interval at a default of 30 minutes.
I also adjusted the power mode to "Standby - Check the device unless it is in SLEEP or STANDBY mode"

Under System - S.M.A.R.T. Tests I have one job created and that is to do a "Short Self-Test" every 5 days at midnight.

After playing with APM I disabled it entirely; I want the disks either spun up at full speed, or in standby and not spinning at all, to cut down on electricity and wear.

Here is the issue: SMART appears to be interfering with my drives going into standby.
If the standby timer is set to less than 30 minutes (the check interval) the drive will spin down at the specified time.
But if the standby timer is set to say 30 or 60 minutes, it never spins down.

Is the behavior that I am seeing normal or expected?

I don't need a huge explanation of why; I just want to know whether that is normal or whether my system is acting in a way it shouldn't.
 

Stephens

Patron
It depends on how you define normal. It depends on what the controller firmware does when it receives a given SMART command. FreeNAS can't possibly know what all drives will do for all commands. You can actually run the commands yourself from a shell/console if you're technically inclined that way. Then you can find the combination of options that works best for you, if any.
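For example, here is roughly what kicking off a test by hand and checking the result looks like (just a sketch; substitute your own device names for /dev/ada0):

# Start a short self-test in the drive's own firmware (runs in the background)
smartctl -t short /dev/ada0
# Short tests usually finish within a couple of minutes; then read the
# self-test log to see whether it completed or where it failed
smartctl -l selftest /dev/ada0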

I ran into the same "issue" on one of my systems, but it's an archive machine that I turn on when needed. And no, I'm not worried about the drives dying prematurely because of it.
 

paleoN

Wizard
Or am I missing something entirely here?
Spinning my drives down only saves about 20 watts of power consumption, and the drives are consumer grade so they are not designed to be spinning 24x365.
The reality is I will only be accessing files on this probably one or two days a week, so having it spun up 24 hours a day seems excessive.
Yes. Have the disks always on and just shut the system down between your once- or twice-a-week uses. You will save quite a bit more than 20 watts as well. Hit the power button to turn it back on.
 

joeschmuck

Old Man
Moderator
Here is the issue, SMART appears to be interfering with my drives going into standby.
If the standby timer is set to less than 30 minutes (the check interval) the drive will spin down at the specified time.
But if the standby timer is set to say 30 or 60 minutes, it never spins down.
So I'd recommend you turn off SMART. Let's be honest here: SMART almost never tells you there is a failure before a catastrophic failure actually occurs; a ZFS scrub will likely tell you about problems first, or the drive will just fail to spin up one day.

But if you still want SMART testing, say a test every 5 days, create a script [for which there are several on these forums] to manually run "smartctl" with the appropriate values to do a Short or Long test every 5 days. I recently posted on this for another person, so you should be able to locate it and adjust the script to match your needs. Look in the How-To-Guides --> Configuration.
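The gist of such a script is only a few lines (a bare-bones sketch, not the exact script from the guide; the drive list and test type are placeholders you would adjust, and the cron line only approximates "every 5 days" since it resets each month):

#!/bin/sh
# Kick off a SMART self-test on each data disk.
# Schedule from cron, e.g.:  0 0 */5 * *  /root/smart_test.sh
DRIVES="ada0 ada1 ada2"   # placeholder list - match your pool
for d in $DRIVES; do
    smartctl -t short /dev/$d   # use -t long for a long test
done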
 

Stephens

Patron
Let's be honest here, SMART almost never tells you there is a failure with the drive before a catastrophic failure really occurs, the ZFS scrub will likely tell you about problems first or possibly the drive will just fail to spin up one day.

That's not my experience. I think SMART's a great advance for consumer hard drives. There are problems with the lack of standards in how it's implemented, and it surely doesn't warn of all impending failures, but it's very helpful for monitoring certain types of degradation and decay which can lead to failure. There are numerous threads here on the forum about people reporting slow speeds (or other issues), running SMART, and finding out their drives are on a downward spiral. Pending sectors, CRC errors, seek errors. There's a lot of good information in there and, even just going by this forum, I've seen a lot of people forewarned of impending failure by the reports.

I consider SMART monitoring a huge feature for a NAS. Actually, I think it's important for any computer system. The problem is the link between the information the drive collects and the admin/user is often broken. Aside from reporting often not even being enabled, there are the issues with how to interpret the inconsistent meanings amongst different manufacturers and products. In a perfect world, as soon as a drive passes manufacturer-determined safety thresholds, I'd get a message soon afterwards regardless of whether I'm using Windows, Linux, FreeBSD, or whatever, and whether it's a desktop or server. We're not there, but the FreeNAS user can do some things to get closer to that level of automation. The fact that the reporting/management end of it isn't quite there doesn't eliminate the utility of the underlying SMART data.

That said, yes, there are times when the SMART data will say everything is fine right before it isn't fine.
 

joeschmuck

Old Man
Moderator
Yeah, I'm just not a particular believer in the SMART short or long tests, so I'm not running those any longer. From what I've read, when a SMART test actually does find a problem, you typically have less than 24 hours to recognize it and fix it. This has nothing to do with a user taking note of other SMART values and interpreting them, which is what I do, as you mentioned above. As far as I'm concerned, the SMART tests are completely different from the SMART raw data, but I get your point. I am a fan of RAIDZ2! HUGE FAN.

I am working on getting additional SMART data to the View Disks tab screen, stuff like temp, pending sector errors, etc... And I'd like specific alarm thresholds to trigger an email and maybe the speaker on the PC to beep. I have a lot of ideas and I'm just trying to figure out how to put them into the code.
 

cyberjock

Inactive Account
Here's my 2 cents regarding SMART.

- It's a useful tool to monitor your drive.
- The SMART long test is the "de-facto" standard by the HD manufacturers to verify if a drive is "good" for RMA purposes.
- I've never had a long test tell me a drive was failing before I wouldn't trust it with my data any longer.
- I have had a long test tell me the drive is "fine" even after I wouldn't trust it with my data any longer.
- I've had drives fail without any warning via SMART info.
- I've had drives work great for months despite warning signs on the SMART info.
- If you ignore SMART, what other alternative is really available?

So what do you get from all of this? SMART is great as a monitor. It's like a credit rating. It shows that historically you've been good or bad with your money, but it doesn't tell you how you will do in 6 months. Banks still use credit ratings for a reason: they're a fairly good indicator over the long term, but not a 100% solution.
 

Stephens

Patron
It's all SMART, just different aspects. But that's not important (and I don't want to derail the thread). What's important is that the integration of the raw data into a stable computing environment is unnecessarily fraught with complexity. I'll give an example. Manufacturers know what the acceptable operating temperature range of a drive is. Yet the drives will happily allow you to toast them to death. Why wouldn't a firmware simply notice the drive is hot and refuse to spin up the drive? Then the next time you do a short test (which you'd do when the drive doesn't work), it should tell you the temperature sensor was triggered. It could all operate so much better if the various parties really cared.

I've been following (from 10k feet) your efforts to integrate SMART info into the GUI. I've also seen scripts here that attempt to refine the focus on some of the raw data and send e-mail alerts when necessary. They're all nice hacks, especially for admins. Unfortunately, FreeNAS has a lot of "users", and SMART currently places far too much of a burden on the user to know what's really a problem. It all sounded so clean and easy in the design stages (when I first heard about it). There'd be a register, and there'd be a trip level for the register. If the register exceeded the trip level, you have a problem. How'd that ever get so screwed up? As is always the case, some manufacturer decides, "I know the register is supposed to mean X, but we think it's far more meaningful to report Y. And not only that, we're not really going to tell you what Y means. If you think you have a problem, run the report and send it to us and we'll evaluate whether it's problematic or not." Voodoo black box much? Gee, thanks.

Anyway, I'm still a fan of SMART, I'm just disappointed with the implementation from platter to user. That disappointment isn't directed at FreeNAS specifically. I like a SMART short test every 30 minutes or so. I even like a SMART long test every month or so (overnight), because a scrub will only tell you about problems in the areas of a disk currently occupied by data. I'd really rather know about problems developing in other areas of the disk before I start trying to use them. But most importantly, I'm of the belief that computers are there to make my life easier. Making me constantly monitor things IT should be monitoring isn't my idea of what 2013 should look like. It's not fun and it isn't a test of my abilities/intelligence. It's just a waste of time.

I like looking at my UPS drop from 61 watts to 21 watts when my disks spin down. But I don't like it enough that I'm willing to forgo the SMART tests that are monitoring my data. It's the best you have during the period between scrubs.

- I've never had a long test tell me a drive was failing before I wouldn't trust it with my data any longer.

I have. One example: I purchased a Seagate 3TB drive for my NAS, did the long test before using it, and it reported a problematic drive. Just as a test, I went ahead and used it (it was already in the NAS). Sure enough, it failed. I don't have that NAS in my signature, but I have it. It's basically the same as the others but with 3TB Seagate 7200 RPM drives and 16GB of RAM.
 

cyberjock

Inactive Account
30-minute SMART tests are FAAAAAR too frequent. That's very far into the "far too frequent" category for me. I think once a day is plenty. The stuff the short test seems to test, if it were bad, would be noticed immediately in normal use. Hard drive disconnecting itself from the SATA controller, the small number of sector reads causing errors, etc. are all things you'd notice on a scrub (or within a few seconds of the drive going offline).

Long tests are nice, but what is the criterion for "not good"? I consider a sector "bad" when I can no longer read the data I wrote to that sector and trust it to be correct. ECC corrects many sector issues (and is used more than 20% of the time).

Some manufacturers consider the sectors good if the sector can be read (regardless of whether ECC can correct the issue). Well, if ECC can't correct the sector then in my book the sector is bad. But that long test you just ran said the disk is fine because it didn't acknowledge a bad sector in my definition of it.

There are pluses and minuses to long tests as well as zpool scrubs. What has been fairly well observed by end users is that the SMART long test used to be extremely comprehensive and very thorough, but because of high RMA costs companies are doing whatever they can to cut costs. If they can make the test less comprehensive (or make the warranty shorter... because they'd NEVER do that) they'll do it. Now we see very short warranty terms compared to 5 years ago and long tests that don't even do the same tests as before.

Now think about this... if a free sector is bad and you later write data to it, one of two things will happen. The drive will realize it's FUBAR (error signal to the system) or it won't notice (hooray for silent corruption). If it won't notice on the write what makes you think it's going to notice on the read? After all, you are looking at a method for silent corruption. There are many mechanisms to find write errors on drives without verifying the data was written correctly. What is the only known way to find silent corruption? zpool scrubs. Additionally, just because you could write and verify the data right now is no indicator that the data will be correct in 5 minutes, 5 hours, or 5 days. The future is always an unknown. So ultimately scrubs are a required component for identifying failing disks. Long tests, not so much. They don't provide any guarantee of problems in the past, now, or in the future. But scrubs will absolutely tell you about any errors from the past or right now. The future is still left to the unknown.
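For anyone following along, a scrub is trivial to run from the shell (the pool name "tank" is just a placeholder):

# Start a scrub of the whole pool, then watch the per-device
# READ/WRITE/CKSUM counters; CKSUM hits are the silent corruption
# being discussed above.
zpool scrub tank
zpool status -v tank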
 

Stephens

Patron
OK, I'll bite... ;)

30-minute SMART tests are FAAAAAR too frequent.

That is, of course, a subjective, not objective statement. It's a statement that can be disproven as objective fact simply by me stating that if my hard drive temperatures are rising, I'd like to know within 30 minutes. Thus I run SMART short tests every 30 minutes.

That's very far into the "far too frequent" category for me.

Which is why I don't like to tell people what they should do without having comprehensive knowledge of their situation. I don't know your situations, only my own. For some, I don't run SMART short tests at all, because it simply doesn't matter. For others, a 30 minute quick check of the condition of the HDD's is completely reasonable because I want an early warning as soon as possible.

I think once a day is plenty.

For ALL situations? You really feel comfortable saying that? Personally, I'd feel more comfortable making sure people know what the test does and applying that knowledge to their own situation because I wouldn't even feel comfortable saying once per day is valid for "most" situations (because realistically speaking, I don't know what "most" situations are).

The stuff the short test seems to test, if it were bad, would be noticed immediately in normal use.

No. Because the end users using the system aren't sitting there monitoring the fact that I/O is slowing down, or any number of other "usability" hints to disk problems. And frankly, neither am I. I have better things to do and would rather the system notify ME when there are problems. If that costs me 1-2 minutes of a quick SMART check every 30 minutes, I can live with that -- and so can my clients.

Hard drive disconnecting itself from the SATA controller, the small number of sector reads causing errors, etc. are all things you'd notice on a scrub (or within a few seconds of the drive going offline)

The scrub that is being done once per month or whatever is a non-starter. Scrubs have nothing to do with the type of tests (SMART short) we're talking about. No way I'm waiting until my monthly scrub to find out I have trouble with my drives for a production system. On those systems, I do scrubs in addition to long tests on alternating 30 day cycles (example: long test on 1st of month, scrub on 15th). But we're talking about SMART short tests here. IMO, 30 minutes is actually a LONG time to get updates on drive temperatures. It's a tradeoff I'm living with because I don't have the time to develop something better. On Windows, my HDD temps are monitored on a minute by minute basis, complete with pop-up alarms and e-mails.

Long tests are nice, but what is the criterion for "not good"? I consider a sector "bad" when I can no longer read the data I wrote to that sector and trust it to be correct. ECC corrects many sector issues (and is used more than 20% of the time).

I have no problem with your definition with the exception that I'd add the word "reliably" as in "...reliably (repeatedly) read the data I wrote...". It's a clarification for completeness as I know you meant it even if you didn't say it. After all, the reason drives retry upon failure is sometimes a 2nd read works. But if it's taking my drive 5 reads to finally decide it has good data, I'd really like to know that too. But let's move on from here...

Some manufacturers consider the sectors good if the sector can be read (regardless of whether ECC can correct the issue).

If the sector can be read, it should be considered good. I'm not sure what you mean by the part in parenthesis. To clarify... what are you suggesting happens...
A) If the read fails, but "would have" been OK if ECC were included
B) If the read fails, even after attempting to use ECC
C) If the read succeeds, but only because ECC was used

Well, if ECC can't correct the sector then in my book the sector is bad. But that long test you just ran said the disk is fine because it didn't acknowledge a bad sector in my definition of it.

I think you may have missed sharing some important information in laying out the scenario you're devising. I may be able to address it better if you restate it. Specifically, what are you saying the long test isn't catching in this scenario?

There are pluses and minuses to long tests as well as zpool scrubs.

Right. So if I have 2 tools (limiting the discussion to what you've outlined above) to give me information about the validity of my data, and they do different things, why would I not use them both if there's little to no cost? I honestly don't understand why anyone would look at a test a manufacturer gives you to report on the condition of your drive and decide not to use it. If you want to go above and beyond, I'm on board. But to ignore the manufacturer provided tool? Sure they make funky decisions due to profit pressures, but they also happen to know at least a little bit about the devices they build.

What has been fairly well observed by end users is that the SMART long test used to be extremely comprehensive and very thorough, but because of high RMA costs companies are doing whatever they can to cut costs. If they can make the test less comprehensive (or make the warranty shorter... because they'd NEVER do that) they'll do it. Now we see very short warranty terms compared to 5 years ago and long tests that don't even do the same tests as before.

Depending on how you're defining an "end user", I find it hard to believe average end users know what SMART long tests do. If, however, you mean techs/geeks who purchase drives, I'll buy that. I would also buy that HDD manufacturers have profit pressures that impact product quality. However, it doesn't matter to me. I've said it here before (using different words) but we're dealing with today's game, not the game as it existed in the past or as it "should" be. Within the last year, I've seen several long tests identify bad drives (from all manufacturers). That tells me it works as far as it goes. I used the information to plan drive replacements at my convenience. In some cases, I put the drives into other machines storing non critical data just to see how long they'd last. I didn't keep a spreadsheet, but they did tend to fail (anecdotally) faster than the drives with no errors. Why would I want to wait (if I can avoid it) until the drive's 80% full, replace, and resilver 80% data when I could do it when the drive is only 20% full? That's just running at reduced redundancy (and potentially lowering the performance of the pool) for a lot longer period of time than is necessary.

Now think about this... if a free sector is bad and you later write data to it, one of two things will happen. The drive will realize it's FUBAR (error signal to the system) or it won't notice (hooray for silent corruption). If it won't notice on the write what makes you think it's going to notice on the read?

Let me flip it. What makes you think it won't? After all, in order for low-level ECC to work, it has to write out parity information. So even if I mean to write out "ABC" with (let's use a checksum as it's easier) a checksum value of 0x41 + 0x42 + 0x43 = 0xC6, the data gets corrupted to "BBC" on write while the stored checksum remains 0xC6. The HDD might not notice that on write, but certainly would on read. And I can think of MANY other scenarios where it'd notice on read when it didn't notice on write.

After all, you are looking at a method for silent corruption. There are many mechanisms to find write errors on drives without verifying the data was written correctly. What is the only known way to find silent corruption? zpool scrubs.

Strictly speaking, that's actually not altogether true. I'll use your words... Let's think about it. If the firmware and electronics couldn't detect errors, what exactly would it use ECC to do? HDDs (and optical media too) have BLER rates and have low-level errors all the time (as you allude to with your 20% number). They mostly just get corrected silently on the fly. But for that to happen, they have to be detected. ZFS is great and does some great things, but we should be clear about what it actually does. Reliable computing existed long before ZFS ever came along. ATMs exist. Banks exist. Stock exchanges exist. Insurance companies exist. Military systems exist. They don't rely on ZFS to exist. There are specific cases of bitrot ZFS addresses, but hard drives themselves can also detect a lot of corruption. They can even fix a lot of it.

Additionally, just because you could write and verify the data right now is no indicator that the data will be correct in 5 minutes, 5 hours, or 5 days. The future is always an unknown.

That's not a reason not to run a long test any more than the lack of the same guarantee for scrubs would be a reason not to run a scrub. It's just a matter of balance based on individual user situations. If I'm prone to high blood pressure and diabetes, testing my pressure and glucose now and finding it's OK won't tell me it'll be OK tomorrow, but it's still valid to test it today. How often I "need" to test it becomes an educated calculation based on my knowledge of my own situation (a theme I'll always keep returning to). If I know I'm avoiding salt, exercising, etc. I may not test as frequently as if I'm eating McDonald's for breakfast, lunch and dinner. I'd never feel comfortable telling someone else whether -- and how often -- they should run SMART tests except to say that you should probably at least do it once when you get a new drive, and perhaps periodically afterwards if you consider what it does useful. How often depends on when you feel it's "necessary" to know the information it provides.

So ultimately scrubs are a required component for identifying failing disks.

Scrubs are a great additional tool in the arsenal. However, I've read nothing that convinces me I shouldn't care about unwritten areas of the disk before I start using them. Using simple logic, if I have a disk that writes from beginning to end (that's not how it works, but for simplicity, let's use that model), and I've filled up 10% of the disk, but 20% at the end is bad, I'd rather know that reasonably soon. I don't want to wait until I have 80% of the disk full and only find out when I start writing to that last 20%. Is it possible a long test could have told me everything was OK all the way up until I actually started writing to that 20%? Sure. It's possible. But I don't think it's likely.

Long tests, not so much.

Again, I disagree, for the reasons I've stated.

They don't provide any guarantee of problems in the past, now, or in the future.

Knowing your blood pressure and glucose level aren't high doesn't guarantee you won't have a heart attack or stroke either, but I wouldn't ignore the early warning information they provide even if it doesn't account for everything (including the future).

But scrubs will absolutely tell you about any errors from the past or right now. The future is still left to the unknown.

Except that you're not giving much credence to the fact that long tests can tell you about errors in the unwritten portions of the disk ahead of time. Because you personally attach no value to knowing that, you seem to be suggesting no one should value it. For whatever reason, the fact that a successful long test today can't guarantee the disk will be OK in the future eliminates its utility in your eyes, even though a scrub can't guarantee that either. Because HDDs go bad and generally don't get better, if I find out today about errors on a disk via SMART long test in an area not yet filled with data, they're likely to be there in the future too. If you personally say, "I'll deal with it when I get there," I have zero problem with that. Because you understand your choices and you're making an educated decision. Even if you make that decision for your clients, I have zero issue with it, because you will have made personal support choices I find sound. But I do take issue with telling the public at large THEY shouldn't care about knowing about disk errors in the unwritten portion of a disk as soon as possible.
 

titan_rw

Guru
That is, of course, a subjective, not objective statement. It's a statement that can be disproven as objective fact simply by me stating that if my hard drive temperatures are rising, I'd like to know within 30 minutes. Thus I run SMART short tests every 30 minutes.


Maybe I'm reading this wrong.

You're using SMART short self-tests to detect abnormal operating temperatures? By the time a SMART test reports anything wrong about a drive due to temperature, the drive is probably on fire. I think most drives would have to hit at least 60C before they trigger any kind of SMART alert. Probably higher. They log any peaks in temperature though, which can be read back later (use "smartctl -l scttempsts /dev/ada0" to check on temp stats).


For temperature monitoring, why not use the smart service, with the interval set to 30 minutes, and whatever temperature thresholds are deemed appropriate. You'll get temperature alerts before the drive is in complete meltdown.

It's been my experience that SMART short tests don't really do much. A basic "can I seek my head to a couple different tracks" kind of test. A drive is going to be completely unusable in FreeNAS long before it starts failing a short test. I'd almost guarantee you'll get read / write / checksum errors on the zpool before any short test failure.


What I do is scrubs every week, SMART long tests every week, and manual inspection of "smartctl -a /dev/adaX" attributes about every week. Any drive temperature issues will be emailed to me immediately (within 30 minutes), as will any drives 'disconnecting' due to excessive errors or such. FreeNAS is pointing to another FreeBSD box (a VM actually) for syslog, so I can go over logs from any point in time if I wish.

That being said, I don't think any harm will come from running short tests every 30 minutes, but it's not something I would do.
 

cyberjock

Inactive Account
That is, of course, a subjective, not objective statement. It's a statement that can be disproven as objective fact simply by me stating that if my hard drive temperatures are rising, I'd like to know within 30 minutes. Thus I run SMART short tests every 30 minutes.

You don't have to run a SMART test. Query it via smartctl -a /dev/ada0 | grep Temperature. Poof, instant temp without running a bunch of unnecessary tests. So IMO not a good reason to do a SMART short test.
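If you want that automated, a few lines of shell in cron are enough (a rough sketch; the drive list, the 45C threshold, and the mail target are made-up placeholders, the attribute is called Airflow_Temperature_Cel on some drives, and the built-in S.M.A.R.T. service can email on temperature thresholds too):

#!/bin/sh
# Email a warning if any drive's current temperature exceeds LIMIT.
DRIVES="ada0 ada1 ada2"   # placeholder - list your data disks
LIMIT=45                  # degrees C, placeholder threshold
for d in $DRIVES; do
    # Column 10 of the attribute line is the raw (current) value
    temp=$(smartctl -A /dev/$d | awk '/Temperature_Celsius/ {print $10; exit}')
    [ -n "$temp" ] && [ "$temp" -gt "$LIMIT" ] && \
        echo "/dev/$d is at ${temp}C" | mail -s "Drive temperature warning" root
done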

Which is why I don't like to tell people what they should do without having comprehensive knowledge of their situation. I don't know your situations, only my own. For some, I don't run SMART short tests at all, because it simply doesn't matter. For others, a 30 minute quick check of the condition of the HDD's is completely reasonable because I want an early warning as soon as possible.

You do realize that you can find out if something is wrong in a bunch of different ways. If the 30 minute query of the drive fails (because the drive has been disconnected) you'll get a warning email that something is wrong because a script returned an error. Now you know to go see what is really wrong. So again, no test needed to find out if the HDD is attached and functioning within the specification of a short test anyway. For most other errors, a short test isn't inclusive enough so it won't find issues. And as I said before, the majority of stuff the short test does (read from platter, SATA communication test, etc.) is BLATANTLY obvious within a few seconds of it happening and will probably give you an email within a minute or two (and definitely within 30 minutes). Even if you don't enable SMART tests, your nightly emails will clue you in when you have a bunch of read and write errors. So again, not much value added in doing short tests except for the random chance that, in the <60 seconds the short test takes, the small number of sectors it reads happens to include the bad sector you are hoping to find prematurely.

For ALL situations? You really feel comfortable saying that? Personally, I'd feel more comfortable making sure people know what the test does and applying that knowledge to their own situation because I wouldn't even feel comfortable saying once per day is valid for "most" situations (because realistically speaking, I don't know what "most" situations are).

I feel absolutely comfortable saying that for 99% of situations. I'd even go so far as to say that if you are asking in a forum setting what settings to use, you do NOT have the necessary knowledge to be in that 1%. One manufacturer recommends short tests no more frequently than weekly and no less frequently than monthly. I haven't found any recommendations from any of the other manufacturers except for a long test before doing an RMA. Also keep in mind that servers being pounded regularly won't ever complete a long test, because the long test is temporarily interrupted by any read or write command from the host system. Some hard drives will actually fail the long test if you perform it while the system is heavily loaded, because they consider the test a fail if it exceeds a given time limit. So now you are getting failures that might not even be failures! Some hard drives will just test as much as they can within their time limit and say the drive "passed" even if it didn't finish the test because the time expired! Those vendors, if you read the documentation, tell you to only perform a long test with the drive offline. If you are like most people (including myself) you'll ignore that message, thinking it's just stupid, and do the test online. So in some cases, you'll never actually finish a long test because they'll always time out and will tell you it passed only because it didn't fail! Really useful test if your drive is doing constant re-reads of bad sectors and never actually finishes the test, huh?

No. Because the end users using the system aren't sitting there monitoring the fact that I/O is slowing down, or any number of other "usability" hints to disk problems. And frankly, neither am I. I have better things to do and would rather the system notify ME when there are problems. If that costs me 1-2 minutes of a quick SMART check every 30 minutes, I can live with that -- and so can my clients.

And you are SURE that the SMART short test will find your issue? You put too much faith in what the short test does. A short test typically takes 30 seconds to 2 minutes and is 90% a silicon test with a very small random read test (if any) just to prove the head functions. If you want to see real errors you need to do a full surface scan (and NOT a long SMART test, because ECC is excluded... see below).

The scrub that is being done once per month or whatever is a non-starter. Scrubs have nothing to do with the type of tests (SMART short) we're talking about. No way I'm waiting until my monthly scrub to find out I have trouble with my drives for a production system. On those systems, I do scrubs in addition to long tests on alternating 30 day cycles (example: long test on 1st of month, scrub on 15th). But we're talking about SMART short tests here. IMO, 30 minutes is actually a LONG time to get updates on drive temperatures. It's a tradeoff I'm living with because I don't have the time to develop something better. On Windows, my HDD temps are monitored on a minute by minute basis, complete with pop-up alarms and e-mails.

Where you think monthly scrubs are recommended is beyond me. I still don't get why FreeNAS' default is 35 days. Sun (and Oracle) recommend 30 days for enterprise class and weekly for consumer grade. If you can't figure out which category you fall into and if you don't want to follow those recommendations, that's your own business. I do bi-weekly because I've never had a single repair under my zpool since constructing it, except for a drive that suddenly failed with no spiny sounds from it. No warning, just dropped from the array and stopped spinning. Got an email <30 mins later because the drive failed to respond to the SMART query (see above). But you can be sure that when I start seeing errors I'll be reconsidering how frequently I run a scrub. If the scrubs didn't take 18 hours I'd probably do weekly right now. I also consider running scrubs too frequently somewhat useless and potentially destructive. I prefer to work my drives enough to trust my data, but not so much that I'm prematurely wearing down the drives.

Your hard drive temps shouldn't be fluctuating much in 30 minutes except from a cold-metal bootup. Mine fluctuate by 3C from the hottest part of the day to the coldest part of the night in my living room. On all of the servers I manage I've yet to see temperatures change by more than 2C in 4 hours except when a scrub starts. I bet if you watched your temps you'd see that they don't really change that much, and shouldn't be changing very quickly. I've even seen a fan fail, and it took almost 4 hours for that drive to increase by 4C over the other disks and 24 hours to go up by 7C. I don't care how frequently you want to monitor temp though, since no test is required. Just a query of SMART drive info. Not to mention that FreeNAS includes a feature to send an email if temps change by more than a certain value. So again, not much value in short SMART tests when you can get the temp without running any test at all.



I have no problem with your definition with the exception that I'd add the word "reliably" as in "...reliably (repeatedly) read the data I wrote...". It's a clarification for completeness as I know you meant it even if you didn't say it. After all, the reason drives retry upon failure is sometimes a 2nd read works. But if it's taking my drive 5 reads to finally decide it has good data, I'd really like to know that too. But let's move on from here...

If the sector can be read, it should be considered good. I'm not sure what you mean by the part in parenthesis. To clarify... what are you suggesting happens...
A) If the read fails, but "would have" been OK if ECC were included
B) If the read fails, even after attempting to use ECC
C) If the read succeeds, but only because ECC was used

A "read" is defined as the actual physical retrieval of the voltage spike from the magnetic domains for that sector as well as the ECC data for that sector. For 4k-sector drives that means that the smallest read is 4096 bytes + 100 byte for ECC(512-byte sectored drives are 512-bytes + 50-bytes for ECC). For long tests the ECC data is read but is discarded immediately(at least, that's true for all major manufacturers). So if the data read is corrupted the hard drive long test will still report the sector as "good" but it shouldn't be, but if you did a zpool scrub and it included that sector it would return a read error on zpool status(by what I understand a read error to be.. there's little documentation explaining this). So no, just because a sector can be "read" is no indication at all that the data is even valid.

The two ways a non-test "read" fails are:
(a) a sector is read and the sector+ECC don't match or can't be repaired
(b) the hard disk reads a particular sector but the sector marker says a different sector was just read (whoops!)

So to answer your question:

A. Only possible with (b) above. And in that situation there is most likely something physically wrong with the drive or a firmware bug. In any case, you are in trouble regardless of a long test or zpool scrub (so there's minimal value in doing a long test if a scrub is already considered necessary maintenance for a zpool).
B. On a scrub, attempt to rewrite the sector. Long test doesn't check ECC so N/A. No value for long test.
C. Only possible on a scrub or disk read. Again, Long test doesn't check ECC so N/A. No value added for long test.

So as you can see, I just showed that for all 3 situations you mentioned there is virtually no value added except if there is a physical issue or firmware bug causing problems with spare disk space. However, in more than 99% of cases, either of those issues will not be limited to a small area of the disk and generally results in corruption of significant portions of the disk in short order. This type of issue, when found, is often "too late" to correct since the disk is effectively trashed (thank goodness for redundancy).

I think you may have missed sharing some important information in laying out the scenario you're devising. I may be able to address it better if you restate it. Specifically, what are you saying the long test isn't catching in this scenario?

On any disk read, the sector's data, its ECC, and a sector marker are read. This verifies the correct sector was read and then verifies that the stored data was accurate. In 20% or so of cases the data is not correct and requires correction by ECC. Not surprisingly, Seagate counts each correction in its SMART parameters as "Raw_Read_Error_Rate" in the SMART info. For Seagate, that number tracks the number of ECC corrections needed since last power-on. Values in the hundreds of millions or billions over long-term power-on hours are not uncommon. For all other manufacturers, a non-zero number is "bad". All other manufacturers hide this value, because what customer wants to see a drive with a billion "Raw_Read_Error_Rate" and not instantly feel tightness in their chest and want to do an RMA right now?



Right. So if I have 2 tools (limiting the discussion to what you've outlined above) to give me information about the validity of my data, and they do different things, why would I not use them both if there's little to no cost? I honestly don't understand why anyone would look at a test a manufacturer gives you to report on the condition of your drive and decide not to use it. If you want to go above and beyond, I'm on board. But to ignore the manufacturer provided tool? Sure they make funky decisions due to profit pressures, but they also happen to know at least a little bit about the devices they build.

Depending on how you're defining an "end user", I find it hard to believe average end users know what SMART long tests do. If, however, you mean techs/geeks who purchase drives, I'll buy that. I would also buy that HDD manufacturers have profit pressures that impact product quality. However, it doesn't matter to me. I've said it here before (using different words) but we're dealing with today's game, not the game as it existed in the past or as it "should" be. Within the last year, I've seen several long tests identify bad drives (from all manufacturers). That tells me it works as far as it goes. I used the information to plan drive replacements at my convenience. In some cases, I put the drives into other machines storing non critical data just to see how long they'd last. I didn't keep a spreadsheet, but they did tend to fail (anecdotally) faster than the drives with no errors. Why would I want to wait (if I can avoid it) until the drive's 80% full, replace, and resilver 80% data when I could do it when the drive is only 20% full? That's just running at reduced redundancy (and potentially lowering the performance of the pool) for a lot longer period of time than is necessary.

There is a cost though. Doing high cycles of random reads and writes is bad for drives (similar to working your server and running a long test at the same time). When I had a hardware RAID, until Dec 2012, I used to offline my array via script, then do a long test, then initiate a RAID6 redundancy check (basically the hardware RAID version of a scrub), in that order and sequentially, every month. I did it that way because mixing the tests with live I/O causes wear and tear and shortens the drives' lifespan. This is why, if you read around Google and this forum, you'll find that when people ask "how do I verify a new hard drive is trustworthy" the answers always include a VERY hard test of random I/Os, sometimes for days. Out of the 24 drives in my server I've had one failure (Feb 2013) in just over 3 years of continuous use. I'd say I'm doing something right, or I'm VERY VERY lucky.

Let me flip it. What makes you think it won't? After all, in order for low-level ECC to work, it has to write out parity information. So even if I mean to write out "ABC" with (let's use a checksum as it's easier) a checksum value of 0x41 + 0x42 + 0x43 = 0xC6, the data gets corrupted to "BBC" on write while the stored checksum remains 0xC6. The HDD might not notice that on write, but certainly would on read. And I can think of MANY other scenarios where it'd notice on read when it didn't notice on write.

You are correct, there are some checks to validate that the data will be read, but there is no guarantee. Unless you do a verify read, you are somewhat gambling. It works for 10^19 or whatever the error rate is. But do you really think that the long test you perform is going to be able to find that issue if it doesn't even verify the ECC? Remember, the long test just says "yep.. I wanted 4096 bytes from sector *** and I got 4096 bytes from sector ***, so I'm happy". A scrub will say "WTF are you doing handing me this crap data da0? I'm gonna rewrite you and fix you."

Strictly speaking, that's actually not altogether true. I'll use your words... Let's think about it. If the firmware and electronics couldn't detect errors, what exactly would it use ECC to do? HDDs (and optical media too) have BLER rates and have low-level errors all the time (as you allude to with your 20% number). They mostly just get corrected silently on the fly. But for that to happen, they have to be detected. ZFS is great and does some great things, but we should be clear about what it actually does. Reliable computing existed long before ZFS ever came along. ATMs exist. Banks exist. Stock exchanges exist. Insurance companies exist. Military systems exist. They don't rely on ZFS to exist. There are specific cases of bitrot ZFS addresses, but hard drives themselves can also detect a lot of corruption. They can even fix a lot of it.

They do... with the ECC data for each sector. That's what's kept them reliable. How do you make your data more reliable? Put more ECC bits in there. But then you take away from the total storage space the drive can offer. So it's a give and take. The whole reason HD manufacturers went to 4k sector sizes is that 4096 bytes of data consumes 4196 bytes of storage space on the platter, but 512-byte-sector drives use 4496 bytes (512 bytes per sector + 50 ECC bytes, times 8 sectors). So HD manufacturers love that they just increased their usable disk space by about 7% just by changing the sector size! How awesome is that! Notice that the warranties have also gotten shorter around the same time? Not a coincidence. They're literally getting something for nothing and the average consumer bought it lock, stock, and barrel.
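To make the arithmetic explicit (the 100-byte and 50-byte ECC figures are the round numbers from the paragraph above, not vendor specs):

# Platter bytes needed to store 4 KiB of user data
legacy=$(( (512 + 50) * 8 ))   # eight 512-byte sectors + ECC = 4496 bytes
native=$(( 4096 + 100 ))       # one 4K sector + ECC          = 4196 bytes
# exact figure is 300/4496 ~= 6.7% (the ~7% quoted above); shell math rounds down
echo "saved ~ $(( (legacy - native) * 100 / legacy ))%"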

That's not a reason not to run a long test any more than the lack of the same guarantee for scrubs would be a reason not to run a scrub. It's just a matter of balance based on individual user situations. If I'm prone to high blood pressure and diabetes, testing my pressure and glucose now and finding it's OK won't tell me it'll be OK tomorrow, but it's still valid to test it today. How often I "need" to test it becomes an educated calculation based on my knowledge of my own situation (a theme I'll always keep returning to). If I know I'm avoiding salt, exercising, etc. I may not test as frequently as if I'm eating McDonald's for breakfast, lunch and dinner. I'd never feel comfortable telling someone else whether -- and how often -- they should run SMART tests except to say that you should probably at least do it once when you get a new drive, and perhaps periodically afterwards if you consider what it does useful. How often depends on when you feel it's "necessary" to know the information it provides.

Precisely. But if you were checking your glucose so frequently that you were fainting from loss of blood, you'd be a little upset about that too. Too much testing is bad and adds no value. Would you really argue that if your glucose was normal right now it desperately needed testing seconds later just to make sure it hasn't changed significantly? There's no value added in testing it 1 minute later. This is what you are doing with all of the testing, but at 30-minute increments (which is the whole point I was trying to make, along with the test not really telling you anything).

Think about this. The average hard drive stores the test results for the last 3-5 tests (long and short). You can view them with smartctl -a /dev/da0. There's a reason why it's 3-5 and not 500. It's not something the manufacturers expect to be performed frequently. It's nice for them to be able to look back and see that at 1000 power-on hours it did a long test that was good, at 3000 hours it was good, but at 5000 hours there was an issue. It gives them some statistical info. A log that only tells them it was good at 3000 hours, 3024, and 3048, but that at 3072 you had trouble, is much less useful. Were there any errors before 3000 hours? They don't know, because they can't see that far back when you RMA the drive.

Scrubs are a great additional tool in the arsenal. However, I've read nothing that convinces me I shouldn't care about unwritten areas of the disk before I start using them. Using simple logic, if I have a disk that writes from beginning to end (that's not how it works, but for simplicity, let's use that model), and I've filled up 10% of the disk, but 20% at the end is bad, I'd rather know that reasonably soon. I don't want to wait until I have 80% of the disk full and only find out when I start writing to that last 20%. Is it possible a long test could have told me everything was OK all the way up until I actually started writing to that 20%? Sure. It's possible. But I don't think it's likely.

Again, it goes back to what I said above: for all 3 situations you mentioned there is virtually no value added except if there is a physical issue or firmware bug causing problems with spare disk space. However, in more than 99% of cases, either of those issues will not be limited to a small area of the disk and generally results in corruption of significant portions of the disk in short order. This type of issue, when found, is often "too late" to correct since the disk is effectively trashed (thank goodness for redundancy).

That 1% will represent the magnetic domains breaking down on the disk (generally a manufacturing defect that should have been caught during manufacturing testing). I have yet to see anyone ever have a hard drive that they proved failed due to magnetic domain breakdown from regular use. I've seen it from overheating a hard drive (the hard drive smelled funny even when off).

So what scenario does a long test stand a reasonable chance of finding problems preemptively? None.

But does a long test stand a reasonable chance of shortening drive life? Absolutely.

Does it give you peace of mind (even if that thought is mistaken)? Absolutely. It's still hard for me to tell myself that a long test is literally worthless and technically destructive. But I have learned to accept the fact that this is the lesser of the two evils. The important thing to keep in mind is that with redundancy you are ready for a disk to fail at any moment, and you are banking on the extra hard drives saving your zpool from going offline permanently.


Except that you're not giving much credence to the fact that long tests can tell you about errors in the unwritten portions of the disk ahead of time. Because you personally attach no value to knowing that, you seem to be suggesting no one should value it. For whatever reason, the fact that a successful long test today can't guarantee the disk will be OK in the future eliminates its utility in your eyes, even though a scrub can't guarantee that either. Because HDDs go bad and generally don't get better, if I find out today about errors on a disk via SMART long test in an area not yet filled with data, they're likely to be there in the future too. If you personally say, "I'll deal with it when I get there," I have zero problem with that. Because you understand your choices and you're making an educated decision. Even if you make that decision for your clients, I have zero issue with it, because you will have made personal support choices I find sound. But I do take issue with telling the public at large THEY shouldn't care about knowing about disk errors in the unwritten portion of a disk as soon as possible.

The fact that a long test can't guarantee the future does not by itself invalidate its usefulness. There is no doubt in my mind that it is conceivable that a long test may identify an error before a scrub. I give it a non-zero probability that it may work. But I also give it a non-zero probability that you are wearing down the disk by doing all the testing that you are doing. And I weight the wear and tear as far more of a "need to avoid" than the chance of a long test finding a previously unknown defect.

My problem with the long test is that it doesn't find the most common defects until it's "too late" (at which point a scrub is just as effective at finding them), for the reasons given above, but it definitely puts more burden on your disks. So why do a test that adds little (if any) value but certainly does add wear and tear? Then consider this... what if you don't use that bad sector for months, or years? Is it really worth it to replace a "failing" disk with a few bad sectors that you may never use anyway? I'm sure your local mall doesn't repave the entire parking lot when a couple of parking spaces in the back are in bad shape. I'd much rather use the disk until I actually find problems that genuinely interfere with my ability to access my data than be so proactive that I'm replacing a disk at any sign of error from a long test. After all, long tests are basically pass/fail. There is no "kinda sorta broken but keep using it". Either it's good or it's not. Would you choose to use a hard drive that reports results that say "Your drive is usable, but is also broken. But RMA is not authorized yet"? You'd be a little upset, and rightly so. SMART tests have always been go/no-go tests for RMAs, even more now that companies are trying to save money everywhere they can.

I've followed my practices for about 10 years, and I've had very, very few disks fail in my servers. I think 3 out of more than 60 disks over the last 10 years, owning all the drives for no less than 3 years before upgrading. Drives in desktops have been dropping like flies, though, but not any more than the norm. Overall, I think I have an excellent track record with server drives. Whether it's luck, doing something right, or something else, I don't know. But I will agree that it is very difficult to "step aside" and consider whether any given maintenance really adds value and really does have any kind of chance of finding an error.

Another good example of maintenance gone bad... a friend who insists on having a clean computer interior. Every other week he takes his computer outside and blows it out with a can of compressed air. The reason... he doesn't want dust to make the CPU get too hot. So despite the extra engineering involved with heat removal for the system, which overestimates heat generation and underestimates how much heat the cooling can remove, he's cleaning his computer every week anyway. CPU temps are not high, but he has stripped almost every screw on his case from constantly removing the panels, the DVI connector screws on his video card are stripped from constantly being connected and disconnected, and his USB ports are actually "loose" so anything plugged in can sometimes fall out under its own weight. But he still insists that it's "best" for his computer to blow it out every single week.

The real question is "How much is too much?" and for that, everyone has their own opinion. The only "evidence" you can really present is your personal experience and the recommendations of the manufacturer. And frankly, I don't see any reason to think the manufacturers are lying about how often to perform the tests when they've made the same recommendations for over 10 years. Additionally, I've accounted for them potentially being wrong (as well as bad luck) by using dual redundancy (or triple redundancy) and having good backups. I won't lose any sleep if your drives are constantly failing. But I go to sleep every night and I don't worry about a hard drive overheating within 30 minutes, or that a long test might have changed the outcome for a zpool that I didn't lose anyway despite being less aggressive. I've lost nothing, and I'm getting outstanding drive life. What more can I ask for?

So what do I do to monitor my hard drives? My total disk maintenance consists of the following:

1. Make sure my fans work anytime I'm inside the computer. Do a powerup with the case off and make sure they spin at normal speeds and don't make any bad noises.
2. Scrubs on 1st and 15th like clockwork.
3. If I do any maintenance on my server hardware, I always do a scrub after I boot it up.
4. If I'm going to move my server(I've had to move it a few times over the years) I do a scrub before and after moving it.
5. Check the temperatures daily (I do it via emails now). I try to schedule the check during the hottest part of the day, or give myself extra temperature room if I have to schedule it during the night when the drives are idle. I also always set up scrubs to be running for at least an hour before a daily temperature check so I can see just how hot the drives get. I try to keep all the drives between 35 and 40C per Google's whitepaper "Failure Trends in a Large Disk Drive Population". It's an excellent read if drive life is important to you. It may be the single best document us geeks can read if we want to know what to do right and what to do wrong with hard drives. (A rough sketch of the temperature-check script is below.)
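For anyone who wants to copy the idea, here's a minimal sketch of that daily temperature email. The device names and the mail address are placeholders, not what I actually run, so adjust them for your own hardware:

#!/bin/sh
# Grab the temperature line from each drive's SMART output and mail the result.
# ada0-ada3 and admin@example.com are placeholders; substitute your own devices/address.
{
    for d in ada0 ada1 ada2 ada3; do
        printf '%s: ' "$d"
        smartctl -a /dev/$d | grep -i temperature
    done
} | mail -s "Drive temperatures on $(hostname)" admin@example.com

Cron it for whatever time of day you decided on above.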

That's all. For new disks, it's a different story and not one I intend to discuss in this thread. It's been somewhat discussed elsewhere and is a whole different beast entirely.
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
You don't have to run a SMART test. Query it via smartctl -a /dev/ada0 | grep Temperature. Poof, instant temp without running a bunch of unnecessary tests. So IMO not a good reason to do a SMART short test.

IMO, combined with the other information it monitors, it is. I don't think it's your job to tell people what they should value based on your values. I think it's far more useful to just provide information and let them apply it to their situation. I did make a mistake, though (long day, late night), in mixing up "SMART short tests" and "SMART inquiries", but it's done now and I'll live with it. My SMART inquiries are every 30 minutes, SMART short tests are daily, SMART long tests are monthly, and scrubs are also monthly (they alternate bi-weekly). But I'll still live with my original misstatement because if that's what I were doing, it still wouldn't be your place to definitively say it's "wrong".
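For anyone who wants to see what that schedule looks like outside the GUI, here's a rough sketch of a smartd.conf entry that would express it on a plain FreeBSD or Linux box. FreeNAS generates its own smartd.conf from the GUI settings, and ada0 is just a placeholder device, so treat this as illustrative only:

# /usr/local/etc/smartd.conf (illustrative sketch, not my production config)
# -a            monitor all SMART attributes
# -n standby,q  skip the check if the drive is spun down, and don't log the skip
# -s ...        short test daily at 02:00, long test on the 1st of the month at 03:00
# -m root       mail reports to root
/dev/ada0 -a -n standby,q -s (S/../.././02|L/../01/./03) -m root

The 30-minute inquiry interval isn't set here; that's smartd's own polling interval (the -i option, e.g. "smartd -i 1800").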

You do realize that you can find out if something is wrong in a bunch of different ways.

Of course not. Only you and other selected unnamed members who meet your criteria of "senior" do. ;)

I feel absolutely comfortable saying that for 99% of situations.

Unless I see some calculations you used to arrive at that 99% number, it's a made up number and adds no more weight to your position than saying a unicorn told you.

I'd even go so far to say if you are asking in a forum setting what settings to use you do NOT have the necessary knowledge to be in that 1%.

I didn't. Regarding SMART short tests, I simply said what I do and you made a definitive statement that (effectively) I shouldn't do it. Not that YOU wouldn't, that *I* (anyone) shouldn't. I'm simply attempting to point out a pattern of making definitive statements for all situations based on your perceptions.

One manufacturer recommends short tests no more frequently than weekly and no less frequently than monthly.

You're cherry-picking when you want to trust manufacturers based on when it lends credence to your already-held positions. Here you trust the manufacturer, but when it comes to long tests (the true subject of our discussion), you don't. Again, I have no problem with that. My own model is I trust them "somewhat" always. And I distrust them "somewhat" always.

I haven't found any recommendations from any of the other manufacturers except for a long test before doing an RMA.

I get from that a long test will give some information about the quality of the drive. I don't need any further recommendations to run that once a month. If the drive can't handle a long test once per month, I want to know it so I can stop buying them. I'll address this more later.

Also keep in mind that servers being pounded regularly won't ever complete a long test, because the long test is temporarily interrupted by any read or write command from the host system. Some hard drives will actually fail the long test if you perform it while the system is heavily loaded, because the drive considers the test a fail if it exceeds a given time limit. So now you are getting failures that might not even be failures!

Alternatively, you could do it overnight like I said and avoid the pounding. I don't think it's useful to create situations that don't exist to support your position.

You put too much faith in what the short test does.

Another hyperbolic (and definitive) statement you have no way of supporting. Allow for the possibility I (basically) know what it does and simply value it more than you do.

A short test typically takes 30 seconds to 2 minutes and is 90% a silicon test, with a very small random read test (if any) just to prove the head functions.

Now that we've recapped what we both know, I'll restate that I see value in doing this regularly on some servers. You're in no position to say it definitively provides no value for my situation (or anyone else's, unless you really know their situation).

If you want to see real errors you need to do a full surface scan (and NOT a long smart test because ECC is excluded.. see below).

I use each tool for what it's for. I don't expect a SMART Short Test to do what a SMART Long Test does, and I don't expect a SMART Long Test to do what a Scrub does (or vice versa).

Where you think monthly scrubs are recommended is beyond me. I still don't get why FreeNAS' default is 35 days.

"Once per month or whatever" doesn't equal "once per month." For some reason, you're picking the most innocuous of comments to argue against. Montlhy (because that's the default) or whatever (you set it to). Cool? ;) But again, you're cherry picking how you apply your logic. When it suits your position, you'll say the people who put out a product (such as FreeNAS) know best. But when it doesn't suit your position, you call their decision making into question.

Sun (and Oracle) recommend 30 days for enterprise class and weekly for consumer grade.

You've said you consider FreeNAS 8.x targeted towards enterprise users. It shouldn't be very difficult, then, to figure out why the team has defaulted to monthly.

If you can't figure out which category you fall into and if you don't want to follow those recommendations, that's your own business.

Out of curiosity, who does "you" refer to here? It seems incredibly condescending to make such a statement because someone chooses to do something differently than you would. I assure you there are things you say on a regular basis that are different than how I would do them. But I have a big enough world view to understand other people can arrive at the same destination by taking different routes more in sync with their situations (needs, experiences, funds, etc). And that although my ways may seem "99%" right to me, that doesn't always make it so right for someone else.

I do bi-weekly

So you ignore their recommendations? Because you can't figure out which category you belong to? ;)

because I've never had a single repair on my zpool since constructing it, except for a drive that suddenly failed with no spinning sounds from it. No warning, it just dropped from the array and stopped spinning. Got an email <30 mins later because the drive failed to respond to the SMART query (see above). But you can be sure that when I start seeing errors I'll be reconsidering how frequently I run a scrub. If the scrubs didn't take 18 hours I'd probably do weekly right now. I also consider running scrubs too frequently somewhat useless and potentially destructive. I prefer to work my drives enough to trust my data, but not so much that I'm prematurely wearing them down.

In other words, you tailor your approach based on knowledge of the value of each product and your situation (experiences, etc.). Seems reasonable to me. Allow for others to do the same. I have no problem with anything you said here. What I have a problem with is that if you think 14 days is the correct period based on your situation and someone else says 21 days (or 30 days), you seem to get a feeling in your gut that they're wrong. My position is they'd be wrong only in the sense that you'd be wrong to someone doing them every 7 days. Some things aren't an exact science, and we shouldn't treat them as if they are just because we see a choice different from the one we'd make.

Your hard drive temps shouldn't be fluctuating much in 30 minutes

You don't monitor because things do what they should do. You monitor to see when they do what they shouldn't do.

On all of the servers I manage I've yet to see temperatures change by more than 2C in 4 hours except when a scrub starts.

You're talking about normal operation, which to me is completely beside the point.

I bet if you watched your temps you'd see that temps don't really change that much, and shouldn't be changing very quickly.

What does monitoring temps when things are going well have to do with monitoring for when things don't? I don't design systems based on things going well, and I doubt you do either. I design systems trying to anticipate every point of failure.

I've even seen a fan fail, and it took almost 4 hours for the drive to increase by 4C over the other disks and 24 hours to go up by 7C.

So what you're telling me is a hard drive in a case will either run hot on day 1 or basically never? If so, I disagree.

I don't care how frequently you want to monitor temp though, since no test is required. Just a query of SMART drive info. Not to mention that FreeNAS includes a feature to send an email if temps change by more than a certain value. So again, not much value in short SMART tests when you can get the temp without running any test at all.

Agreed.

Just to keep everyone reading along up to date, we're now going back to the core of the discussion (phew), SMART Long Tests.

A. Only possible with (b) above. And in that situation there is most likely something physically wrong with the drive or a firmware bug. In any case, you are in trouble regardless of a long test or zpool scrub. (so minimal value in doing a long test if a scrub is already considered necessary maintenance for a zpool).

I didn't quote everything you said here since most of it was just clarifying what you were saying previously (I get what you're saying now), but if you feel I'm quoting you out of context, I apologize. That said, wow. I really don't get how you can say option A) is the same whether you do a long test or a scrub, when the scrub doesn't touch any portion of the disk without data. I simply say I want to check monthly whether there's something wrong with the drive in areas of unwritten data, and I believe that can happen at times other than when the drive is first put into service. In fact, I know from experience it can. Thus I do long tests periodically.

So as you can see, I just showed that for all 3 situations you mentioned there is virtually no value added except if there is a physical issue or firmware bug causing problems with spare disk space. However, in more than 99% of cases, either of those issues will not be limited to a small area of the disk and will generally result in quick corruption of significant portions of the disk in short order. This type of issue, when found, is often "too late" to correct since the disk is effectively trashed (thank goodness for redundancy).

You're making up 99% again to support your position. It's an interesting way to structure a statement by excluding the value to make the statement there is no value. I'm not sure what you mean by "correct" but I think I've already said or implied I want to know about a fading disk so I can replace it. If it's fading 23 months into a 24-month warranty, I want to know before I start writing to that area in month 25 and start experiencing catastrophic errors.

There is a cost though. Doing high cycles of random reads and writes is bad for drives (similar to working your server and long test at the same time).

Can you provide a cite that doing a long test on a healthy drive during downtime is harmful to the drive running it once per month? Of course I could say I'm 99% sure that isn't true, but we've covered my view on statistics. Rather I'll just say that in my experience (and I've been doing it long before experimenting with FreeNAS), it has not caused my drives to prematurely fail. And we're talking thousands of drives if you include all the systems I'm responsible for. (Though I'm not the admin, I am responsible for approving policy).

When I had a hardware RAID (until Dec 2012) I used to offline my array via script, then do a long test, then initiate a RAID6 redundancy check (basically a hardware RAID version of a scrub), in that order and sequentially, every month. I did them sequentially rather than concurrently because running them at the same time causes wear and tear and shortens the drives' lifespan. This is why, if you read around Google and this forum, you'll find that when people ask "how do I verify a new hard drive is trustworthy" the answers always include a VERY hard test of random I/Os, sometimes for days. Out of the 24 drives in my server I've had one failure (Feb 2013) in just over 3 years of continuous use. I'd say I'm doing something right, or I'm VERY VERY lucky.

I think you do a lot right. I'd have no problem hiring you to build or admin a system for me. And one day I might. But only because I'm used to dealing with people who think they invented logic and I'm pretty good at letting it roll off most of the time. ;) But let's take the flip side of what you said above. If I tell you I've been doing things since soon after SMART technology arrived and they have been working for me, how does that fit in your world view? As I've said, sometimes there are many ways to skin a cat, all roads lead to Rome, etc. If you want to tell me you've decided you don't value the benefit of SMART long tests, I'm OK with that (as I've said). If you say there IS no value, I'm not.

But do you really think that the long test you perform is going to be able to find that issue if it doesn't even verify the ECC?

Maybe, maybe not. I don't presume to know exactly what a long test will find because I don't keep up with what every firmware of every drive from every manufacturer does. I doubt many do. So in that sense I trust that it performs "some tests" and finds "some conditions". I'm good with that since it's only the blank areas of the drive I care about anyway (scrub will handle the other areas).

Remember, the long test just says "yep.. I wanted 4096 bytes from sector *** and I got 4096 bytes from sector ***, so I'm happy". A scrub will say "WTF are you doing handing me this crap data da0? I'm gonna rewrite you and fix you."

Again, for an unused area of the drive, scrub will do even less than a long test.

Precisely. But if you were checking your glucose so frequently you were fainting from a loss of blood, you'd be a little upset about that too.

Are we talking LONG or SMART tests here? I don't think a SMART test every 30 minutes (mea culpa, I actually do the SMART inquiry every 30 minutes) or a LONG test every month is equivalent to what you're suggesting, but either way, I want to know what you're referring to so I can respond to it. But that said, if I agreed they WERE analogous, I'd agree with your statement.

Too much testing is bad and adds no value.

"Too much" is a subjective term. So while the statement is true on the surface, it doesn't inherently shed any light on our respective positions.

Would you really argue that if your glucose was normal right now that it desperately needed testing seconds later just to make sure it hasn't changed significantly?

But again, you're using a far exaggerated example that has no relationship to anything we've discussed to support your point. Even a 30 minute SMART test (which you admit is mostly in silicon) is nowhere near doing a glucose test seconds after you just did one.

There's no value added in testing it 1 minute later.

I agree with the conclusion of your strawman. ;)

This is what you are doing with all of the testing but at 30 minute increments (which is the whole point I was trying to make along with the test not really telling you anything).

Ah, so we're back to the SMART short test, which wasn't the initial focus of the discussion. I'm focused on the comment that the Long Test is basically useless. If you don't mind, I'm going to ignore the rest of the SMART Short Test comments as I think we've covered everything important on both sides (both our positions) for anyone who doesn't know and reads all this later.

Think about this.

I'd like to think I do, even when my experiences and conclusions differ from your own.

So in what scenario does a long test stand a reasonable chance of finding problems preemptively? None.

OK cool, we're back to long tests... ;) So I guess when I tell you I HAVE had long tests find failing drives preemptively, you can only conclude that I'm lying to you. At which point, I'm not sure where we go in the discussion.

But does a long test stand a reasonable chance of shortening drive life? Absolutely.

Shorten it from expected, or shorten it from some theoretical maximum (such as 10 years down to, say, 8)? Because we've been doing this on our production drives without issue. Let's say WD Blacks have a 5 year warranty, so the expected life is 5 years. That's 60 long tests if you're doing one per month. You're telling me you think 60 long tests is likely to kill the drive? Or are you just saying that over time, those 60 long tests will shorten the life expectancy from (say) 5 years to 4 years, 11 months, 30 days? Or what exactly is it you're saying, and can you provide any proof of it? Because the anecdotal experiences don't do it, since we each have our own and they don't agree.

Does it give you peace of mind (even if that thought is mistaken)? Absolutely.

Yes, this placebo is absolutely delicious. ;)

It's still hard for me to tell myself that a long test is literally worthless and technically destructive.

That can happen, and has happened. I once made a casual comment about the sun being a ball of gas, and someone showed me on the internet (at that time) that scientists currently speculated it had a solid core (perhaps of iron) or something. So in that sense I get where your 99%'s come from, because I only believe anything 99% myself (even if I'm "totally sure"). Who knows? I'm open and learn new things all the time, even about things I previously "knew". But I'm not in a position to accept that 1 long test per month kills drives when I have so much experience that refutes that, and my assumptions based on my knowledge also refute it. I disagree with your conclusion that there is no value provided, so we'll agree to disagree on that one. I think we've both fully stated our respective positions on that. But if you have evidence that 1 SMART long test per month makes drives die sooner than their expected life, I'd like to read up on it.

But I have learned to accept the fact that this is the lesser of the two evils.

It bears repeating: I've no problem with your conclusion, other than that you present it as universal, and apparently that anyone who disagrees is inherently "wrong".

The important thing to keep in mind is that with redundancy you are ready for a disk to fail at any moment, and you are banking on the extra hard drives saving your zpool from going offline permanently.

I think that's something we can all agree on.

A long test does not invalidate its usefulness. There is no doubt in my mind that it is conceivable that a long test may identify an error before a scrub. I give it a non-zero probability that it may work. But I also give it a non-zero probability that you are wearing down the disk by doing all the testing that you are doing.

Again, I'm finding it difficult to quantify what you're saying here. By your logic, doing CHKDSK on Windows should be avoided because it wears down the drives. Or for that matter, scrubs. Because doing a scrub every 2 weeks is SURELY killing the disk faster than doing scrubs every month. Right? All this is subjective? We have to separate the 2 sides of the equation so we're clear where your complaint lies. There's the cost side and the benefit side. I disagree with you on the benefit, but also consider it subjective based on situation. I think that ship has sailed and we probably won't meet up. On the cost side, I currently disagree with you, but I'm open to reading evidence that 60 long tests over 5 years poses an unacceptable risk to a healthy hard drive. Because I give it a greater value and a lesser cost, we arrive at different conclusions. The problem is you don't accept that the valuations are subjective, and not as absolute as you believe they are.

I also give the amount of wear and tear far more "need to avoid" than I do of a long test having a chance of finding a previously unknown defect.

Exactly my feeling when I first learned about ZFS scrubs, resilvers, et al. "Whoa, that's really going to make the drives work up a sweat." Then I got over it. Because a healthy drive can deal with it. And if it makes a marginal drive fail, screw it. Life's short. I'll replace it and move on. It's worth the cost to me. And the flaky drive was probably going to go soon anyway -- probably on a Friday at 5 p.m.

My problem with the long test is that it doesn't find the most common defects until it's "too late" (which a scrub is just as effective at finding), for the explanations given above, but it definitely puts more burden on your disks. So why do a test that adds little (if any) value but certainly does add wear and tear?

To me, it's not a competition. They do different things and I use them to complement each other.

Then consider this... what if you don't use that bad sector for months, or years? Is it really worth it to replace a "failing" disk with a few bad sectors that you may never use anyway?

If it's under warranty, probably. It depends on how bad the drive is. You've no way of knowing a long test will only return 1 (or a few) bad sectors. You have to run the test. I'm not sure I understand what you mean by pass/fail. That's not what I recall seeing. On the last drive I RMA'd, it told me how many errors it encountered. It even told me the LBA. It just didn't stop on the first error and say FAIL.

I'd much rather use the disk until I actually find problems that genuinely interfere with my ability to access my data than be so proactive that I'm replacing a disk at any sign of error from a long test.

Your choice, and a valid one. Different choices are also valid. As I said, I'd rather not wait until the warranty runs out. We have too many drives and honestly, to use one of your favorite phrases, I should REALLY be fired if I'm letting faulty drives sit in servers until after their warranties expire when I could know they're faulty before the warranty expires. There are many factors that come into play. If you consider nothing else of what I've said, please at least consider this.

I've followed my practices for about 10 years, and I've had very, very few disks fail in my servers. I think 3 in more than 60 disks over the last 10 years, having owned all the drives for no less than 3 years before upgrading. Drives in desktops have been dropping like flies though, but not any more than the norm. Overall, I think I have an excellent track record with server drives. Whether it's luck, doing something right, or something else, I don't know. But I will agree that it is very difficult to "step aside" and consider whether any given piece of maintenance really adds value and really has any kind of chance of finding an error.

All I ask is that we discuss the factors as best we can and identify the credibility we assign to said factors, as well as our assumptions. I've no horse in the race. I trust my experience and my clients trust my experience, so no loss there. I've been doing this a long time as well (35 years). I just want the people reading to be able to make decisions for their own situations based on understanding the factors involved in the decision, not "a senior moderator said so."

I won't lose any sleep if your drives are constantly failing.

Except that doesn't happen.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
OMG I'm laughing so hard! I'm sorry I stirred this up. I hope the OP got the fact that they could run a script to conduct the SMART tests.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Of course not. Only you and other selected unnamed members who meet your criteria of "senior" do. ;)

I consider people that frequent the forums and try to provide help regularly as "senior", even if they don't officially have that title in the forum. I consider you a "senior" member.

Unless I see some calculations you used to arrive at that 99% number, it's a made up number and adds no more weight to your position than saying a unicorn told you.

I said 99% because I'm sure there's an exception somewhere that I haven't heard of. There's always "that one exception" people grab onto. Just ask the morons who don't wear a seatbelt because they heard a story about a single person who was saved from one accident by not wearing a seatbelt, and who ignore the fact that many more people are saved every year because they wore one. I prefer not to use absolutes and accept the fact that I don't know everything about everything and there are always exceptions in the world. There is almost always a non-zero chance of any given event occurring. Big deal. But feel free to argue with my "math" and not the real issue though.

You're cherry-picking when you want to trust manufacturers based on when it lends credence to your already-held positions. Here you trust the manufacturer, but when it comes to long tests (the true subject of our discussion), you don't. Again, I have no problem with that. My own model is I trust them "somewhat" always. And I distrust them "somewhat" always.

I don't cherry-pick what they do and don't tell me. I trust them at their word because there's more than a decade of SMART experience from many vendors (many of which don't even exist anymore) that backs it up, and because the technical manuals on SMART describe an established standard that is only loosely followed by some manufacturers (*cough* Seagate *cough*). If you aren't going to trust the standard to be mostly followed (if not absolutely), you shouldn't be trusting SMART at all. For all you know the manufacturer could just have a timer that says "after 2 minutes say it passes" and you and I would be none the wiser for it. We'd be a little upset when a hard drive keeps saying nothing is broken but we know we can't trust the drive with data. I'd bet everyone has had one of "those" drives before.

I get from that a long test will give some information about the quality of the drive. I don't need any further recommendations to run that once a month. If the drive can't handle a long test once per month, I want to know it so I can stop buying them. I'll address this more later.

It's not that a hard drive can't "handle" a long test once per month; that's just what they recommend. There's a point at which too much work on a hard drive wears it down faster. Where you want to draw that line is up to you. The manufacturers have their own values they use and came up with their own numbers. If you think you can do that failure analysis better than they can, I'm sure they'd be happy to hire you.

Alternatively, you could do it overnight like I said and avoid the pounding. I don't think it's useful to create situations that don't exist to support your position.

And you've never gone to bed leaving a movie streaming and a scrub started, right? And you've never accidentally had a long test and a scrub run at the same time, right? It's already been mentioned that running scrubs is hard on the pool. It's not that reading is hard, it's that the constant seeking from a drive having to multitask between the scrub and retrieving data is hard on them.

Another hyperbolic (and definitive) statement you have no way of supporting. Allow for the possibility I (basically) know what it does and simply value it more than you do.

Really? I have no way of supporting what a SMART short test does? Check out this short paragraph from Wikipedia: http://en.wikipedia.org/wiki/S.M.A.R.T. Don't be shocked... it says a short test checks the electrical and mechanical performance. If I had my FreeNAS server up right now I'd post the white paper on what a short test should include and what should NOT be included. The entire document (200 pages or so) is a very interesting read. If I can remember in 2 days when my server is back up I'll link it for you. So yes, I can support my comment.

Now that we've recapped what we both know, I'll restate that I see value in doing this regularly on some servers. You're in no position to say it definitively provides no value for my situation (or anyone else's, unless you really know their situation).

You are right, I am in no position to say it provides no value. But I'm not the only one saying it provides no value. It doesn't, if you think about it (and read the white paper). But I let the manufacturers provide their own recommendations, along with the recommendations provided by the SMART documents on what a test should and shouldn't do.

It's not that I'm not in "their situation". See below, where I explain long and short tests and what they should (and shouldn't) find, for my explanation of the short tests.

Titan_rw has it right when he said that it's been his experience that SMART short tests don't really do much. Their intended purpose was rendered moot because the issues that a SMART short test was intended to find would render the drive broken before you ever ran the test. Some manufacturers don't even have short tests at all anymore. They're easy to spot because if you run the test it returns "passed" 1 second later and does not show up in the short list of testing logs (because it literally does nothing and is only there because some RAID controllers freak out if a hard drive says it doesn't support SMART short tests).

I use each tool for what it's for. I don't expect a SMART Short Test to do what a SMART Long Test does, and I don't expect a SMART Long Test to do what a Scrub does (or vice versa).

If you did, you'd recognize really quickly that a long test was not meant to be any kind of way to preemptively find errors. That was the intent of SMART when it was created in the mid-'90s or so, via monitoring... but not via testing. The aberration with long tests is that they were supposed to find bad sectors preemptively, but then ECC testing/verification was deliberately excluded (whoops). Whether that was an accident or on purpose is up for a very heated debate that started when SMART was being developed. Not to mention I don't think you even know what you expect to learn from a short or long test. You don't seem to have a grasp of what is actually tested when you do a test. You just seem to understand that a short test isn't as thorough as a long test, but have no idea what stuff is actually tested. There is some stuff that a short test does test but that isn't tested in a long test, specifically because if you had those problems you'd know from other tests. (Car analogy: check whether your car can do 55 mph without bad sounds, but don't test whether the starter is good. If the starter were bad you'd know, because you'd never get to the point of doing 55 mph.)

You want to do a good, useful test of the drive? Do a dd if=/dev/da0 of=/dev/null. That'll make the drive read every sector and even do the ECC calcs that the long test doesn't do. In fact, it's often mentioned that this is the best "surface scan" you can do short of a write test to the drive. If you test all of your drives simultaneously and see a dd command that returns an error, or one drive that took significantly longer than the rest, you'll know something is up.
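Here's a minimal sketch of that idea as a script. The device names (da0-da3) and the block size are assumptions you'd adapt to your own system (bs=1m is FreeBSD-style; GNU dd wants bs=1M):

#!/bin/sh
# Read every sector of every disk in parallel and record exit status and elapsed time.
# da0-da3 are placeholder device names; substitute your own.
: > /tmp/dd_summary.log
for d in da0 da1 da2 da3; do
    (
        start=$(date +%s)
        dd if=/dev/$d of=/dev/null bs=1m > /tmp/dd_$d.log 2>&1
        rc=$?
        echo "$d: exit $rc, $(( $(date +%s) - start )) seconds" >> /tmp/dd_summary.log
    ) &
done
wait
cat /tmp/dd_summary.log

A drive that errored out, or that ran much slower than its otherwise-identical peers, deserves a closer look.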

"Once per month or whatever" doesn't equal "once per month." For some reason, you're picking the most innocuous of comments to argue against. Montlhy (because that's the default) or whatever (you set it to). Cool? ;) But again, you're cherry picking how you apply your logic. When it suits your position, you'll say the people who put out a product (such as FreeNAS) know best. But when it doesn't suit your position, you call their decision making into question.

You've said you consider FreeNAS 8.x targeted towards enterprise users. It shouldn't be very difficult, then, to figure out why the team has defaulted to monthly.

I really have no idea what you were trying to convey right there. Check out Wikipedia's comment about scrubs and resilvering...

The official recommendation from Sun/Oracle is to scrub once every month with Enterprise disks, because they have much higher reliability than cheap commodity disks. If using cheap commodity disks, scrub every week.[41][42]

And check out references 41 and 42. Reference 41 actually directs you to FreeNAS documentation that used to say weekly and monthly. How cool is that? Even the FreeNAS documents themselves call out weekly and monthly, as I prescribed above. Now maybe you understand why I find FreeNAS' default scrub of 35 days so awkward, considering their own documents disagree. Of course, this is still ignoring Sun/Oracle, who also say weekly and monthly. But don't let a few little links get in the way of your facts. I don't consider 35 days monthly (but I'm anal like that about what is defined as "weekly" or "monthly"). Monthly implies 12 times per year; 35 days gives you only 10.4 scrubs per year.

Also, I don't consider FreeNAS to be a product that is targeted at enterprises. I'm not sure exactly who iXsystems really targets TrueNAS at. Enterprises would hire true FreeBSD wizards to run their servers and would keep their own in-house support when using FreeBSD. There are so many options, settings, etc. that FreeNAS doesn't even begin to cover every scenario. I'd say it targets small to mid-sized businesses, but home users are also enjoying it. FreeNAS built the bridge between expensive salaried employees that manage FreeBSD servers and smaller businesses that would like most (but not all) of the benefits without the "expensive salaried employees".


Out of curiosity, who does "you" refer to here? It seems incredibly condescending to make such a statement because someone chooses to do something differently than you would. I assure you there are things you say on a regular basis that are different than how I would do them. But I have a big enough world view to understand other people can arrive at the same destination by taking different routes more in sync with their situations (needs, experiences, funds, etc). And that although my ways may seem "99%" right to me, that doesn't always make it so right for someone else.

Just to clarify, that was your response to my statement that "If you can't figure out which category you fall into and if you don't want to follow those recommendations, that's your own business." All I was saying was that if you can't figure out whether you are using enterprise-class or consumer-class drives, and if you choose not to follow the recommendations for the particular class you are in, that is your business. And you for some reason felt it necessary to bring up the 99% again when I was discussing SMART tests and not zpool scrubs. Please keep my comments to the actual topic I was discussing.

So you ignore their recommendations? Because you can't figure out which category you belong to? ;)

I don't ignore their recommendations for weekly scrubs (since I have consumer drives). If you read the entire ZFS document provided by Sun, they recommend weekly for zpools that have changing data. My data is actually very static. Less than 10MB/day actually changes. And when I do move large quantities of data, I either move the data before a scheduled scrub or I run a manual scrub afterwards. I helped a friend with his desktop yesterday (copied 2TB to my server), and a scrub will trigger tomorrow morning. And his copy of the data still exists on his desktop until Wednesday, so I preemptively set myself up for the scrub. Likewise, the manual says that if the server is offline for extended periods of time, doing scrubs every 45 days is recommended. I wouldn't go booting up my server every single week just to run a weekly scrub.
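For anyone following along at home, kicking off that manual scrub after a big copy is just this ("tank" is a placeholder pool name; use your own):

# start a scrub by hand after moving a lot of data
zpool scrub tank
# check progress and any errors it has found so far
zpool status -v tank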

In other words, you tailor your approach based on knowledge of the value of each product and your situation (experiences, etc.). Seems reasonable to me. Allow for others to do the same. I have no problem with anything you said here. What I have a problem with is that if you think 14 days is the correct period based on your situation and someone else says 21 days (or 30 days), you seem to get a feeling in your gut that they're wrong. My position is they'd be wrong only in the sense that you'd be wrong to someone doing them every 7 days. Some things aren't an exact science, and we shouldn't treat them as if they are just because we see a choice different from the one we'd make.

No, I feel they're wrong because they're deviating from the recommendations of the inventors of ZFS. Who really thinks they know more than the inventors of it? I'd add someone to my ignore list immediately if they started making those kinds of claims. I really don't need to read what spews from their keyboard into the forums. Just look at the discussion in another thread about the sync=disabled recommendation. That was an interesting topic, to say the least.

You don't monitor because things do what they should do. You monitor to see when they do what they shouldn't do.

You're talking about normal operation, which to me is completely beside the point.

No, I'm talking about even the abnormal. I had a fan fail and it still took hours for a 4C difference. If 4C is the difference between life and death for your hard drive, your hard drives are already running much too hot.

Just like FreeNAS' "Difference" field for temperature checking. If you check it every single minute and put in 2C, you'll probably never ever hit the trigger. Hard drives don't heat up that fast even with no cooling. That's why you have the "Informational" and "Critical" fields, as shown in section 8.12 of the FreeNAS manual.
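As far as I can tell (treat this as an assumption rather than gospel), those GUI fields just end up as smartd's -W DIFF,INFO,CRIT temperature directive in the smartd.conf that FreeNAS generates. Something like this, on a hypothetical drive:

# warn if the temperature changes by 2C between checks, log an informational
# message at 40C and a critical one at 45C (ada0 is a placeholder device)
/dev/ada0 -a -W 2,40,45 -m root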



What does monitoring temps when things are going well have to do with monitoring for when things don't? I don't design systems based on things going well, and I doubt you do either. I design systems trying to anticipate every point of failure.

So what you're telling me is a hard drive in a case will either run hot on day 1 or basically never? If so, I disagree.

Yes, only insofar as, if you build a system and everything is static, the temps shouldn't change drastically. But things aren't static (fans fail, A/C breaks, etc.), so temps do change; you just aren't going to see a hard drive go from 40C to 60C in a short time frame because a fan failed. Even when a server room I worked in had the A/C break, it was 2 hours before it was even relatively warm for the humans. I'd wager that if the ambient room temperature increases so fast you are worried about cooking a hard drive between 30-minute temp checks, you have a unique situation that you are already aware of and probably have other plans in place to help mitigate. Monitoring hard drive temps should be the least of your worries, because you should already know from experience that you need to preemptively shut down the system.


I didn't quote everything you said here since most of it was just clarifying what you were saying previously (I get what you're saying now), but if you feel I'm quoting you out of context, I apologize. That said, wow. I really don't get how you can say option A) is the same whether you do a long test or a scrub, when the scrub doesn't touch any portion of the disk without data. I simply say I want to check monthly whether there's something wrong with the drive in areas of unwritten data, and I believe that can happen at times other than when the drive is first put into service. In fact, I know from experience it can. Thus I do long tests periodically.

I'll explain it slightly differently:

When a read request is made to the drive and something goes wrong, the 3 most common failures are that there is a firmware bug, something physically wrong with the head or platter, or sector data + ECC don't match.

Firmware bugs are typically disk-wide and present themselves very quickly after the issue begins (see the Seagate firmware bug from 2008 for a great example). Firmware bugs often turn into a drive that drops itself from the host machine, fails to even be detected on bootup, or throws other very catastrophic errors. You'll get an email if a drive drops from the host machine because the 30-minute SMART query will fail with no drive attached (and you'll continue to get an email every 30 minutes forever). If it fails on bootup you'll get your nightly email that a disk is missing, as well as an email that the SMART query failed (again, every 30 minutes forever until a SMART query is made that succeeds).

If there is something wrong with the head or platters (such as debris, or something physically breaking off), you won't have a single sector or even a small number of sectors that go bad. Physical issues inside the drive manifest themselves very quickly, and are made worse very quickly if it's debris, because the disk head will spread the debris all over the platter (and the head may even be damaged by such debris). That disk will go very quickly from working just fine to significant problems with reads and writes all over the drive. Zpool performance will tank (the pool may even become unresponsive), you'll start getting emails about disk read and write errors, zpool status will show error rates that increase rapidly as you read and write data, and you'll possibly hear the click of death from the bad disk. Even something as innocuous as streaming a movie will turn into major problems for a drive with a small amount of debris. Once the debris is deposited on the platter and the head makes a single pass over it, the damage will begin and progress rapidly.

A sector + ECC mismatch would only be found by a short test, and only if the short test happened to read that one sector in the 1-2 minutes that it runs. It's very unlikely that, out of a few billion sectors, your hard drive will happen to hit that "one" bad sector. So I consider it a waste of time to even hope that any SMART test will find an issue where the sector and ECC data don't match.

So, for firmware bugs as well as physical issues (which are the only 2 failure mechanisms that long tests can easily find), you will know the drive has major problems long before you even run that nightly/weekly/whatever long test. So what failure mode does a long test find, exactly? None. It will provide you with an error that will allow you to do an RMA, but it did nothing to preemptively find any new issue that a scrub (or even regular use, since you'll get a mailbox full of emails in short order) wouldn't have exposed. But you've added wear and tear by doing lots of long tests.

Now lets examine short tests...

Short tests are 90% silicon and only include a very short read test (it could be pseudo-random or even just read the first 10 tracks). So not much reading from the hard drive, but enough to prove that the head and servo function properly. But then, if you look at the silicon tests, they include things such as communication checks between the SATA controller and hard drive, internal communications between different components on the hard drive, a firmware checksum check, an on-disk cache check, etc. If any of those things "went bad" you'd know right away. For communication issues, firmware checksum failures, and the like, the disk won't even finish its own POST and will be unavailable to the host system. AKA you'll start getting 30-minute emails that a SMART query failed until you fix it. If the on-disk cache failed, you'll also know very quickly, because as data is corrupted ZFS should start racking up lots and lots of errors. AKA you'll get an email at night for sure with loads of errors, and you'll potentially have other issues with the drive besides just some reads and writes, since the loaded firmware may be trashed. If the disk goes offline you're sure to see 30-minute emails to your inbox.

So, what common failure mode was not sent to you in an email via a SMART query or regular use? None. That's precisely why I don't do the SMART tests at all except when I first buy the drives.

Now, what about desktops? In your standard desktop, if you look at how those failure modes work and how they manifest themselves into "the drive no worky worky", and then think about whether a long and/or short test would have helped, would they have? Nope. Not any more than using them on FreeNAS.

What SMART does do is Self-Monitoring, Analysis, and Reporting. It records certain values throughout the drive's life and tries to predict its own failures. Check out the list at http://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes. Many of those are recorded based on regular use. They aren't related to or influenced by scrubs, SMART tests, etc. For example, 0x03 is spin-up time. If that starts getting longer and longer, then the drive can reasonably assume that it's having some kind of error outside of established parameters. Your SMART reporting tool should give you a warning (I have a bad drive that gives an error for this exact parameter and warns to replace it soon, and I keep it in my box because it's a great way to experiment with SMART apps in Windows, Linux, FreeNAS, etc.).
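If you want to peek at those attributes yourself, smartctl will dump the raw table without running any test at all. ada0 here is a placeholder device name, and the awk line is just one way to pull out a single attribute by its ID:

# dump the full SMART attribute table (no self-test involved)
smartctl -A /dev/ada0

# watch one attribute over time, e.g. Spin_Up_Time (ID 3): print its name and raw value
smartctl -A /dev/ada0 | awk '$1 == 3 { print $2, "raw:", $NF }'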

So now let's throw ZFS scrubs into the mix. Sure, only half the drive may be full, but the point still stands. ZFS is so great because if anything is wrong in the path the data takes between the hard drive platters and the CPU, ZFS will throw a fit. So even if the issue isn't necessarily with the hard drive itself but is still responsible for corrupting data, you are still safe. (This is where Sun/Oracle list ECC RAM as a requirement: to assume the data is being corrupted somewhere besides RAM, you are implicitly trusting that the data in RAM is correct.) Even in a worst case where the hard drive is deliberately giving you trash results, ZFS will still keep you safe. Because ZFS is doing its own internal checks to verify the data is correct, you don't need to worry too much about a single point of failure (although a SATA controller failure, PSU failure, etc. could be devastating).

So what should be taken away from this whole thing? Long tests don't really tell you much you won't learn on your own just by having the disk in the system. Short tests don't really tell you much you won't learn on your own just by having the disk in the system. SMART parameter monitoring WILL tell you what is going on... just by having the disk in your system. So just use the drive and monitor the SMART parameters. That's why I set up the nightly SMART emails and keep them in my archive. Every so often I do a line-by-line comparison of one from today and one from the archive. Drives that only have small temperature changes can be ignored. But that one drive with 3 parameters that are slowly changing over a period of weeks or months may be a drive to look at more closely. Is it in a location where it is susceptible to higher temperatures, more vibration, etc.? Usually you can see what the trigger value is for the SMART warning that the disk should be replaced. If I see that some value increases by 2 per month, it's currently at 4, and the warning triggers at 10, I can decide whether I want to replace the drive right now or wait until it gets to 10. Do I want to replace the drive when it hits 10? Do I want to ignore the warning and keep using the drive as long as scrubs keep coming back clean? Those are the questions the server admin gets to answer for themselves.
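A bare-bones sketch of that archive-and-compare habit, for anyone who wants to automate it. The archive path and device names are placeholders I made up for the example:

#!/bin/sh
# Nightly: snapshot each drive's SMART attribute table, then diff today's
# snapshot against the oldest one still on hand, so slow drift stands out.
DIR=/mnt/tank/smart_archive      # placeholder path
mkdir -p "$DIR"
today=$(date +%Y-%m-%d)
for d in ada0 ada1 ada2; do      # placeholder device names
    smartctl -A /dev/$d > "$DIR/${d}_${today}.txt"
    oldest=$(ls "$DIR/${d}"_*.txt 2>/dev/null | head -n 1)
    if [ -n "$oldest" ] && [ "$oldest" != "$DIR/${d}_${today}.txt" ]; then
        echo "=== $d: changes since $(basename "$oldest") ==="
        diff "$oldest" "$DIR/${d}_${today}.txt"
    fi
done

Pipe its output into the nightly email and you get exactly the kind of slow-drift comparison I described, without running any extra self-tests.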

You're making up 99% again to support your position. It's an interesting way to structure a statement by excluding the value to make the statement there is no value. I'm not sure what you mean by "correct" but I think I've already said or implied I want to know about a fading disk so I can replace it. If it's fading 23 months into a 24-month warranty, I want to know before I start writing to that area in month 25 and start experiencing catastrophic errors.

I didn't exclude anything by saying 99%. I specifically included the chance of something else going wrong that is completely outside the manufacturer's expectations. For instance, I'm sure they didn't account for the possibility that the hard drive might spontaneously catch on fire, because it's just not that likely.

Can you provide a cite that doing a long test on a healthy drive during downtime is harmful to the drive running it once per month? Of course I could say I'm 99% sure that isn't true, but we've covered my view on statistics. Rather I'll just say that in my experience (and I've been doing it long before experimenting with FreeNAS), it has not caused my drives to prematurely fail. And we're talking thousands of drives if you include all the systems I'm responsible for. (Though I'm not the admin, I am responsible for approving policy).

1000s of drives... big deal. I used to be the lead IT person at the largest Naval command on the East Coast. We had 1000s of machines, many with more than 1 drive. Quantity only means so much for so long. I'm sure your company wasn't hand-holding and SMART-monitoring every single drive on every single machine, getting an email the instant one machine had anything that remotely resembled an error, were you? Companies don't pay employees to monitor large quantities of hard drives in desktops. It's more cost-effective to just replace the drive when it fails, or when the user calls up and says that when the computer POSTs they get this weird warning they don't understand, and to tell employees to do their own backups or store their data on the servers.



I think you do a lot right. I'd have no problem hiring you to build or admin a system for me. And one day I might. But only because I'm used to dealing with people who think they invented logic and I'm pretty good at letting it roll off most of the time. ;) But let's take the flip side of what you said above. If I tell you I've been doing things since soon after SMART technology arrived and they have been working for me, how does that fit in your world view? As I've said, sometimes there are many ways to skin a cat, all roads lead to Rome, etc. If you want to tell me you've decided you don't value the benefit of SMART long tests, I'm OK with that (as I've said). If you say there IS no value, I'm not.

It doesn't affect my world view at all. I was around when SMART was first being conceived. It never worked out as well as anyone had hoped, and some companies hate it (and deliberately modify SMART to be less thorough to fit their wants/needs... just look at the ECC thing) because the industry standard is that a failure of the long test (or a failure to even be able to run a long test) is the de facto standard for "an RMA is authorized". Before SMART, each company had its own tools you had to run, its own ways of testing the drive, and different levels of "bad" regarding when an RMA is or isn't authorized. SMART was supposed to help everyone by giving consumers a baseline for "good" and "bad", (hopefully) giving consumers a preemptive warning of impending failure, and giving manufacturers the ability to gather data on their hard drives' health and see how they could improve their own reliability. None of that really worked out too well.


Maybe, maybe not. I don't presume to know exactly what a long test will find because I don't keep up with what every firmware of every drive from every manufacturer does. I doubt many do. So in that sense I trust that it performs "some tests" and finds "some conditions". I'm good with that since it's only the blank areas of the drive I care about anyway (scrub will handle the other areas).

You don't need to. As long as you understand what stuff is possibly tested, and understand that anything that would show up as a test failure would almost always manifest itself with your drive behaving so badly you'd be getting emails from your server, who cares? Just like if Seagate does only half the tests WD does, that doesn't mean I'd trust WD's tests any more or less than Seagate's. It means that if I have to RMA the drive, Seagate's test may be less thorough than WD's, and Seagate might tell me the drive is good where WD wouldn't.

Again, for an unused area of the drive, scrub will do even less than a long test.

I explained above why the unused areas really don't matter anywhere near as much as you think, and included a dd command that works well if you really think they do matter.

Are we talking LONG or SMART tests here? I don't think a SMART test every 30 minutes (mea culpa, I actually do the SMART inquiry every 30 minutes) or a LONG test every month is equivalent to what you're suggesting, but either way, I want to know what you're referring to so I can respond to it. But that said, if I agreed they WERE analogous, I'd agree with your statement.

So you can't grasp the concept that when you use something you are always causing wear and tear, even if only very slightly? Why do you think the servo and drive head are almost completely insusceptible to regular wear and tear? The head doesn't touch the platter, and a magnetic field moves the head, precisely so that the amount of wear and tear is minimal, but it's not zero. Drives wouldn't last an hour if they had the wear and tear associated with parts in physical contact.

Non-IT story: in a past life we had a pressure switch. It performed a very, very important function. By law, we had to test it every 3 years. Well, we thought "we want to be superior performers, so we're going to calibrate the pressure switch every 18 months". So we did. After 5 years we noticed that these switches were going bad far before their expected lifespan. Upper management freaked out. So let's start doing them every 12 months! So now we're calibrating them every 12 months, and they're failing even more frequently than before. At $25k per switch and a 6-month lead time for a replacement, this was a very big deal. We had a lot of these switches that we were trusting for safety reasons, and now their reliability was being questioned. They should have lasted at least 20 years, but we could predict the failures: if we had calibrated a switch more than 3 times, there was about a 50/50 chance it would fail the next time. The manufacturers were baffled because these switches were sold all over the world, many in far more inhospitable conditions than ours were in. What did we eventually figure out? That when we were calibrating the switch there was the possibility that even microscopic pieces of debris could get in when we hooked up our calibration equipment, and that we were actually ruining the switches by testing them more frequently than the manufacturer recommended. The actual test itself wasn't the problem, nor was the equipment we used. It was just the fact that we were opening up the system via a single very small valve and hooking up test equipment. Surprise! Despite the fact that we were in the business of calibrating 1000s of switches from many different manufacturers, different designs, and different materials, we learned a very important lesson: you shouldn't always think you know better and deviate from the manufacturer. So we replaced all the switches and started testing them every 3 years. Never had any problems with them again.



"Too much" is a subjective term. So while the statement is true on the surface, it doesn't inherently shed any light on our respective positions.

It is subjective. So be objective. At what point are you testing but not really getting any value from the test? If your test is doing more harm than good, why are you running the test? And by value I mean the ability to predict failures ahead of time outweighing the wear and tear you are placing on the drives. Hint: you as a consumer are not privy to the information on how much wear and tear you are causing on a drive just by testing it.

So remind me why you think you know better than the manufacturers?

But again, you're using a far exaggerated example that has no relationship to anything we've discussed to support your point. Even a 30 minute SMART test (which you admit is mostly in silicon) is nowhere near doing a glucose test seconds after you just did one.

Exactly. Because your testing is exaggerated relative to the relationship between failures and your ability to actually catch those failures. I chose seconds instead of 30 minutes because the human body is designed for self-repair while a hard drive is not. A single drop of blood every 30 minutes wouldn't cause much blood loss over your entire lifetime, so I said seconds to make the point. Your hard drive, on the other hand, can only monitor its own demise over its life span.

I agree with the conclusion of your strawman. ;)

You call it a strawman because you don't like the fact that I've proved my point.


OK cool, we're back to long tests... ;) So I guess when I tell you I HAVE had long tests find failing drives preemptively, you can only conclude that I'm lying to you. At which point, I'm not sure where we go in the discussion.

Nice of you to presume that I'm accusing you of lying. I really appreciate it! I don't think you're lying, but I do think your drive would have had a longer lifespan, and you still would have been able to replace the drive without a loss of data, if you hadn't been running all of those tests. There's no way to know either way, but it's easy to jump to the conclusion that the long test must have saved you because the two clearly seem correlated.

Shorten it from the expected lifespan, or shorten it from some theoretical maximum (such as 10 years down to, say, 8)? Because we've been doing this on our production drives without issue. Let's say WD Blacks have a 5-year warranty, so the expected life is 5 years. That's 60 long tests if you're doing one per month. Are you telling me you think 60 long tests are likely to kill the drive? Or are you just saying that over time, those 60 long tests will shorten the life expectancy from (say) 5 years to 4 years, 11 months, 30 days? What exactly is it you're saying, and can you provide any proof of it? Because anecdotal experience doesn't settle it, since we each have our own and they don't agree.

I don't give a crap about theoretical numbers. Nor do I really care about the warranty. The warranty only determines whether the manufacturer has to replace my drive; it doesn't tell you how long the drive should actually last.

I think that running long tests, which really don't tell you anything is broken that you wouldn't figure out on your own from the email alerts, etc., is only adding wear and tear. How much, I have no clue. To get a solid value for "how much" would take a few million dollars and quite a bit of time and analysis. But I think we can both agree that if you can decrease the wear and tear on hard drives, you should. And I've tried to explain above why the long and short tests really don't help identify a failing drive:

1. A test can easily be "destructive": the head can pick up a stray particle from a part of the disk you would never have used anyway, damaging the head or platters and breaking the drive.
2. Long and short tests don't have a reasonable chance of finding a common hard drive problem that wouldn't become self-evident very quickly on its own.
3. Long and short tests add wear and tear.

So why are you running tests that are really only useful in remote circumstances, can be destructive to the drive, and don't have a reasonable chance of identifying a disk before it's already unreliable?
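To put a command behind the "you'd figure it out on your own from emails" point: the indicators people usually act on (reallocated, pending, and uncorrectable sectors) are counters the drive updates during normal use, and they can be read, and emailed about, without ever launching a self-test. A minimal sketch, assuming FreeBSD-style device names:

Code:
# Read the counters the drive maintains on its own; no self-test is triggered.
smartctl -A /dev/ada0 | egrep "Reallocated_Sector|Current_Pending|Offline_Uncorrectable"

# smartd (what the S.M.A.R.T. service runs) can email when these change.
# A bare-bones smartd.conf line would look something like:
/dev/ada0 -a -m root -n standby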

Again, I'm finding it difficult to quantify what you're saying here. By your logic, doing CHKDSK on Windows should be avoided because it wears down the drives. Or for that matter, scrubs. Because doing a scrub every 2 weeks is SURELY killing the disk faster than doing scrubs every month. Right? Is all of this subjective? We have to separate the two sides of the equation so we're clear where your complaint lies. There's the cost side and the benefit side. I disagree with you on the benefit, but I also consider it subjective based on the situation; I think that ship has sailed and we probably won't meet up. On the cost side, I currently disagree with you, but I'm open to reading evidence that 60 long tests over 5 years pose an unacceptable risk to a healthy hard drive. Because I give the tests a greater value and a lesser cost, we arrive at different conclusions. The problem is you don't accept that the valuations are subjective, and not as absolute as you believe they are.

No, what I'm saying is that there is no value in the long and short tests. I never made any comparison to CHKDSK, and if you were doing scrubs every day I'd make the same argument: you're adding no real value but definitely making those drives work hard for their paycheck.

I laughed at the morons that had scripts set up on their first-gen SSDs to erase the unused sectors (this was before GC and TRIM), and gee... later they found out that doing it more frequently than once a week (for most users, once a month) only caused excessive writing against the limited write cycles of the memory cells, because the custom tool always rearranged the data (kind of like doing a defrag). And I laughed all the way home because I never had any issues; I ran the zeroing-out tool once a month just like the manufacturers recommended, and I was happy. Every SSD manufacturer of the time that released tools to do GC saw a very high failure rate as soon as those tools were made available. Some of them even started rejecting RMAs from drives that had run out of write cycles, because it wasn't likely that in 4-6 months you'd actually hit that limit with regular drive use. It wasn't a coincidence either. It was all the "cool kids" that wanted every single ounce of write performance from their SSD that they could get and thought they were so much smarter than the manufacturer that they did their own thing.

Even now, I know people that use the Intel SSD Toolbox and do daily runs of the tool "for maximum performance". I wonder why the SMART data on one guy's drive says less than 80% life remaining, despite his drive being slightly newer than mine and the same generation and firmware, while mine says 96%. I make no effort to save cycles on my drive, and he's got 10% of his drive unpartitioned and does a lot of other things to try to extend its life because "his SSD lifespan is sucking".

Exactly how I felt when I first learned about ZFS scrubs, resilvers, et al. "Whoa, that's really going to make the drives work up a sweat." Then I got over it. Because a healthy drive can deal with it. And if it makes a marginal drive fail, screw it. Life's short. I'll replace it and move on. It's worth the cost to me. And the flaky drive was probably going to go soon anyway -- probably on a Friday at 5 p.m.

I disagree! 6 p.m. on a Friday before a 3-day weekend. That way you get home and then get the emails telling you things are getting ugly. :P

But again, WRT ZFS scrubs, if you were doing them daily I'd be arguing that you're crazy for doing it that frequently. You're adding wear and tear with no value added. So why work a healthy drive down to marginal when you don't have to?
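For concreteness, the disagreement here is only about how often the scrub is scheduled, not whether to scrub at all. A sketch of the underlying commands (the pool name is an example; FreeNAS schedules this from the GUI):

Code:
# Kick off a scrub by hand and watch its progress:
zpool scrub tank
zpool status tank

# The argument is over the schedule, e.g. monthly instead of daily.
# A system crontab entry for a monthly scrub might look like:
0 3 1 * * root /sbin/zpool scrub tank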


If it's under warranty, probably. It depends on how bad the drive is. You've no way of knowing ahead of time that a long test will only return one (or a few) bad sectors; you have to run the test. I'm not sure I understand what you mean by pass/fail. That's not what I recall seeing. On the last drive I RMA'd, it told me how many errors it encountered. It even told me the LBA. It just didn't stop on the first error and say FAIL.
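That error detail lives in the drive's self-test log, which can be pulled after a test finishes; a quick sketch (device name is an example):

Code:
# Results of previous short/long tests, including the LBA of the first
# error found, are stored on the drive itself:
smartctl -l selftest /dev/ada0
# The drive's own error log is read the same way:
smartctl -l error /dev/ada0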

I disagree, but that's your own personal preference. I'd never want to waste my time trying to RMA a drive that won't fail me within its expected lifetime. For all you know, the new drive might not be any better. I've had far more "recertified" drives require replacement than new drives. I have an old laptop IDE drive that has some problems at about the 119GB mark of its 120GB capacity. Did I RMA it? Nope. Because I understand that an issue that suddenly appears and doesn't spread rapidly isn't likely to get worse, I just partitioned the first 110GB and left the rest unused. Statistically, I had better odds with that drive than with a "recertified" one. I've had the drive for 7 years now and it still makes a great external USB drive. I did a write/read/verify test pattern on the drive in January and got zero errors.
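Carving out only the good part of a disk like that is straightforward; on FreeBSD it might look something like this (the device name, partition type, and size are just illustrative, taken from the story, and the first command wipes the disk):

Code:
# Only for a drive you've already written off: this destroys its contents.
gpart create -s gpt da1
# Use the first 110GB and leave the suspect end of the disk untouched:
gpart add -t freebsd-ufs -s 110G da1
newfs /dev/da1p1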

Your choice, and a valid one. Different choices are also valid. As I said, I'd rather not wait until the warranty runs out. We have too many drives, and honestly, to use one of your favorite phrases, I should REALLY be fired if I'm letting faulty drives sit in servers until after their warranties expire when I could know they're faulty before the warranty expires. There are many factors that come into play. If you consider nothing else of what I've said, please at least consider this.

Yeah, except there are more factors than just getting a new drive before the warranty expires. What if the recertified drive only lasts until one day out of warranty? What did you save? This whole "what/if/but/maybe/kinda" is old. It's pure and simple statistics and the likelihood of a scenario you don't want. Everyone wants to be on the winning side of the curve, but you definitely aren't considering the big picture.

All I ask is that we discuss the factors as best we can and identify the credibility we assign to those factors, as well as our assumptions. I've no horse in the race. I trust my experience and my clients trust my experience, so no loss there. I've been doing this a long time as well (35 years). I just want the people reading this to be able to make decisions for their own situations based on understanding the factors involved, not on "a senior moderator said so."

There's a reason why many RAID controllers don't let you run SMART tests, and it seems to be getting more common. The tests really are a waste of time. If you have a disk that you think is flaky, a long test may prove it is bad, but it won't do much else. At best, a long test is likely to tell you that a bad disk is bad (what a shocker), and at worst a long test is simply wearing down a drive that is currently fine.

So before you start prescribing tests that you don't understand, you should figure out what they do, how they do it, and whether they really add any value at all. If you don't intend to do that homework, you should seriously consider just following the manufacturer's recommendations.
 

SavageBrewer

Cadet
Joined
Apr 10, 2013
Messages
3
Oh wow... I appreciate all of your comments; I learned a lot about SMART in this thread.
What I am doing now is splitting up my drives.
I have three in a RAIDZ (3.6 TB usable) for general file usage and media streaming; they stay spun up 24 hours a day with APM set to 192 to park the heads when not in use.
I have a single ZFS-formatted drive for backups; on that one I turned APM off and set the spindown timer to 10 minutes.
Once a week I plan to run an rsync to copy my files over (roughly like the sketch below). That way I have a second copy in case I do something stupid...
And then a fifth drive that only spins up when needed. This one may change: I use it to dump surveillance footage to, and it is getting more use than I originally thought, so I will probably keep this one spun up 24 hours a day as well...
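Roughly what that per-drive setup amounts to at the command line, sketched with example device names and paths (FreeNAS exposes the APM and standby settings per disk in the GUI, so treat this as illustration rather than a recipe):

Code:
# APM 192 permits power management but not spindown (values 128-254 never spin down):
camcontrol apm ada0 -l 192
# Put the backup drive on a ~10 minute standby timer (this also spins it down now):
camcontrol standby ada3 -t 600

# Weekly copy of the main pool onto the backup drive (paths are examples;
# no --delete, so accidentally removed files survive on the backup copy):
rsync -a /mnt/tank/ /mnt/backup/tank/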

As with anything new, I am sure that it will take some experimentation and reading.
In the end I will probably wind up rebuilding it to use all 5 drives in a single RAIDZ configuration, all spun up 24 hours a day, now that I understand more about the product and have seen how much the motion-triggered cameras activate throughout the day.

FreeNAS is a really good product for being free, and as I have found, there are some very intelligent people here.
I just have more experience working with large-scale enterprise SANs, and scaling back to a storage solution for home has been very interesting.

Thanks again for all of the input.
 