Errors on sectors of a drive, now unreadable and S.M.A.R.T. seems to say the drive is OK-ish.

Status
Not open for further replies.

Vrakfall

Dabbler
Joined
Mar 2, 2014
Messages
42
If you saw anything on the internet stating this, we would be interested in it. I'm sure Seagate may have the popular vote for Enterprise class drives but it's my gut feeling they do not for Consumer class. One other warning: we generally stay away from drives using the Shingled Magnetic Recording (SMR) method. It's not that we feel it's bad, but mostly that it's not very proven yet.
I saw nothing really going that way; it's just my feeling after all the disks I've seen failing, consumer grade only. Nothing precise, of course, and fully opinionated. I guess I'm looking for stats that prove me right or wrong, like the ones scrappy gave.
There are areas where Seagate reliability has improved recently. Take BackBlaze's hard drive failure report Q1 2017 for example. With the exception of one 8TB model, most Seagate drives 4TB and larger are proving themselves to be reliable. On a side note: those 4TB ST4000DM000 desktop drives are usually very cheap to buy online. Even though they aren't the most reliable 4TB drive on the market, if you have good storage redundancy and a proper backup, I see no serious reason not to buy them.
Thank you for these stats, it's very interesting. I'd also be interested in stats that cover all brands in a more balanced way. :)
The SMART service simply checks the SMART attributes that every modern hard drive maintains internally. SMART tests are a different beast and must be configured manually.
Alright, thank you. Then I don't know how the SMART tests were activated so often on my drives. :/
FWIW, anecdotally, I just had one of my new 8 IronWolfs spontaneously die after about 3 weeks of use. It's currently being replaced by the retailer and I'll start an IronWolf thread when the process is complete.

My 8 Seagate NAS-HD (which afaict are the same things) are about 12 months old now and I haven't had an issue.

Don't recall ever having an issue with dozens of reds.
Ouich. :s Where is that thread btw? I couldn't find it. Also, couldn't it just fall under the "possible failure rate"? Does that mean I shouldn't buy one from that brand?
If all things are roughly equal I would take a 7200 RPM over 5400/5900 any day of the week. If enterprise was 10% - 20% more I would buy Enterprise without a second thought.

Why use 7200 RPM?
  • They usually come with 5 year warranty.
  • They are made with higher quality parts and design.
  • Because of the above you will generally see better longevity and fewer problems.
  • Less downtime: 2 million hours MTBF vs. 1 million (or less)
  • *SOME* 7200 use less electricity than the "favored" drives you hear people talk about here. See my spreadsheet to see which ones.
  • Generally speaking, they use a small amount more electricity - $2-10 per drive per year. A 5-drive RAIDZ2 might cost you only $20-30 more in electricity per year.
  • 20-30% faster.
  • Faster scrubs
  • Faster resilver of a new drive into a RAIDZx vdev
  • If you get Enterprise drives they are designed to last and have fewer problems vs. commercial or home brands.
Only use NAS drives if you care about reliability and recoverability. The cheap drives do not do TLER, and in a RAIDZ1 with 10TB drives you run a higher risk of losing the vdev: a second drive failure during the long resilver and you are screwed. I have seen it happen more than once and I would sure hate to hear of it happening again.
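As an aside, on drives that support it you can check (and sometimes set) the TLER/ERC timeout from the shell via smartctl's SCT ERC commands. A sketch, assuming the drive appears as /dev/ada2 (an example name) and supports SCT:

```shell
# Show the drive's current SCT ERC (TLER) read/write timeouts, if supported.
smartctl -l scterc /dev/ada2

# Set both read and write recovery timeouts to 7.0 seconds
# (values are in tenths of a second; not all drives allow setting this).
smartctl -l scterc,70,70 /dev/ada2
```

Drives that report "SCT Error Recovery Control command not supported" cannot have TLER enabled this way.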
You make some good comments however I don't agree with all your statements. But everyone is allowed to have an opinion.
I don't know who to trust when it comes to RPMs now. xD I guess both are ok depending on the capacity of the drive and the $/TB of it?
Some thoughts about disk drive "recovery" when you see errors occur.

If the errors are from bad blocks, they generally tend to come in groups/clusters based on physical proximity on the disk platter.

What does that mean? The disk platter has a magnetic coating applied to it. When an area fails, the areas around it - think in 2 dimensions (simplistically): in front and behind, to the right and to the left - could easily fail too. I expect them to fail; it's just a matter of time. Hopefully it is just a small group, so I might end up with 5-10 bad blocks in the same locale.

At a minimum, take the drive out of the vdev. I usually do an erase (write all zeros) on the drive to see where else other bad blocks might be. This will also reset any read-flagged bad blocks and help fix up any newly found ones.
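The zero-fill erase described above can be done with dd; a minimal sketch, assuming the drive shows up as /dev/ada2 (substitute your own device, and triple-check it - this is destructive):

```shell
# DESTRUCTIVE: overwrites the entire drive with zeros.
# Run only on a drive that has been removed from the pool.
# Note: FreeBSD dd takes lowercase block sizes (1m); GNU dd would use 1M.
dd if=/dev/zero of=/dev/ada2 bs=1m
```

When dd finishes, re-check the SMART attributes (Reallocated_Sector_Ct, Current_Pending_Sector) to see how many blocks were remapped by the pass.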

Preferred would be to run badblocks, from the shell, on it. I run 3 patterns, which gives the best chance to really scrub the platters with a variety of data on the disk. This can take days or longer depending on the size of the drive. If I run this and only see a few more bad blocks appear, all in the same area, it gives me some confidence that the drive is probably still usable as a vdev member.
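A multi-pattern badblocks run like the one described might look like this; a sketch, where /dev/ada2 and the three patterns are example choices, and -w is destructive:

```shell
# DESTRUCTIVE write test with three explicit patterns (-t can be repeated
# in -w mode), showing progress (-s) and logging any bad blocks found
# to a file so runs can be compared later.
badblocks -ws -t 0xaa -t 0x55 -t 0xff -o /root/ada2-badblocks.txt /dev/ada2
```

An empty output file after all three passes is a good sign; a growing list, or new entries scattered across the disk rather than clustered, is when I would retire the drive.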

Some vendors provide spare "local" revector areas for bad blocks on the same cylinder. This lets performance stay the same after a bad block is revectored. If the cylinder revector area is full, the drive then goes to a large revector area in another place on the disk (beginning or end). This affects performance because the heads must move to the new location whenever it is used.

For home usage, this is not a big deal. For enterprise database usage it could affect writes, as the slowest disk drive (e.g. the one that must move the heads to a new location and back) will cause the user I/O on all drives to seem slower because of that one drive. The drives without bad blocks will finish their writes, while the revectored one performs multiple internal disk writes to hide the bad block from the OS.

If the revector area becomes full, then the bad block file comes into effect. If this occurs, dump the disk for NAS purposes, IMHO.

Not all manufacturers will do the above process. I know of multiple manufacturers that do it for their enterprise drives. I cannot promise that home or commercial grade will do this.

For what it is worth...
Interesting, I think I'll try that after my holidays. Do you have any link to a howto with commands and stuff like that? Do you use commands like `srm` or more low level ones?
 

Vrakfall

Dabbler
Joined
Mar 2, 2014
Messages
42
Oh and for your information, I started a thread to buy a new drive. I know it's a bit late but my life was quite busy lately and I was more abroad than anything and I came back to take care of this old NAS as soon as I could. I'll also start a thread to buy an old server-grade build (as suggested by danb35) but that needs to wait a bit as I'm soon gone for holidays.

Edit:
Another thing: Reading some things in the forums made me realize where the real SMART check tasks are in the GUI and I found that my past-and-silly me created a task to make a long self-test on /dev/ada2 (so, not the one that failed) every 4 hours.... D: u.u

I changed it so it checks them all the first day of the month at midnight. Is it still too short? What's the best practice when it comes to that? Any link to help me?
So here's my task, just to make sure. :P
[attached screenshot of the S.M.A.R.T. test task]
 
Last edited:

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
What's the best practice when it comes to that?
Opinions vary, but there's a degree of consensus on the ranges.
Short self-test: not more than once per day, not less than once per week
Extended self-test: not more than once per week, not less than once per month.

As for the SMART checks, the default was 30 minutes last time I checked, which seems completely reasonable to me.
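For reference, the self-tests these schedules kick off can also be started and inspected by hand with smartctl; a sketch, with /dev/ada2 as an example device name:

```shell
smartctl -t short /dev/ada2     # short self-test, typically a minute or two
smartctl -t long /dev/ada2      # extended self-test, can take many hours
smartctl -l selftest /dev/ada2  # show the self-test result log afterwards
```

Both tests run inside the drive itself, so the pool stays online while they run.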
 

Vrakfall

Dabbler
Joined
Mar 2, 2014
Messages
42
Opinions vary, but there's a degree of consensus on the ranges.
Short self-test: not more than once per day, not less than once per week
Extended self-test: not more than once per week, not less than once per month.

As for the SMART checks, the default was 30 minutes last time I checked, which seems completely reasonable to me.
Thank you for the info, I'll correct those asap.

Ah and for everyone that was following, I opened a thread to ask for suggestions for a cloud-based backup solution.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Ouich. :s Where is that thread btw? I couldn't find it. Also, couldn't it just fall under the "possible failure rate"? Does that mean I shouldn't buy one from that brand?

That's because I'm still waiting for the replacement ;)

Now have an additional 6 8TB Ironwolfs. So one out of 22 (so far) is not bad I guess. The Ironwolfs are so significantly cheaper for me than WD Reds that I can't justify the reds.
 

Vrakfall

Dabbler
Joined
Mar 2, 2014
Messages
42
That's because I'm still waiting for the replacement ;)

Now have an additional 6 8TB Ironwolfs. So one out of 22 (so far) is not bad I guess. The Ironwolfs are so significantly cheaper for me than WD Reds that I can't justify the reds.
Héhé. :rolleyes:

Neat. :) I also came to that conclusion after my thread asking for help choosing a drive. I finally got them at 90€ when Reds were at 116.90€. I even bought 3 (not only one) of them to replace all my drives, as I figured that at this price I'd better make everything right, straight from the beginning. ^^

Now I should update that thread as well.:rolleyes:

Please keep me up to date with that thread of yours with the ironwolf drive. ;)
 

farmerpling2

Patron
Joined
Mar 20, 2017
Messages
224
Opinions vary, but there's a degree of consensus on the ranges.
Short self-test: not more than once per day, not less than once per week
Extended self-test: not more than once per week, not less than once per month.

As for the SMART checks, the default was 30 minutes last time I checked, which seems completely reasonable to me.

I tend to view when to run the tests in a couple ways:
  • ROI
    If the test just does a quick check of internal stats and provides a response without wear and tear on the drive, you can use it quite frequently, with the hope that it provides information quicker so you can respond faster to a possible failure.
    Is it worth a complete disk scan to find one bad block that would possibly have been found during a write anyway, vs. finding a cluster of bad blocks and revectoring them?
    If you see an increase in bad blocks appearing, then it might make more sense to run a scan on the area where bad blocks are occurring, or possibly scan the whole drive more often.
    It would be helpful if some statistical analysis could affect how often tests run.
    --> New FreeNAS feature request <--
  • What do I get from the test?
    A SHORT test provides some functional testing of heads, motors, armature, etc., plus limited testing of the platter(s). There is some wear associated with this test on the previously mentioned hardware. The short test will likely "test" the same area of media on each run, so there is a higher likelihood of wear on that part of the platter (usually meaning a head crash, or the magnetic particles wearing quicker and causing bad blocks or other failures, e.g. uneven wear).
    A LONG test likely does all that the short test does, but it also tests the complete surface for failing media. The typical goal is to find failing media sooner so the drive can be replaced before it fails. A bad block every now and then is not a problem; it is an expected tendency of the media and technology.
  • I forgot the third one. :-(
Think the same way about any additional tests. What is my gain (e.g. finding problems sooner) vs. my loss (e.g. increasing the likelihood of failure by running a test more often)?

So run the tests based upon the desired result. Forget about weekly or monthly; think about the time frame in which testing should occur to find a problem you can fix before the drive fails. (If that makes sense.)

SOAP BOX
Why do people choose to do something on a unit of measurement (e.g. day, week, month, minute, hour, even boundaries, multiples of 10, etc.)? Should you run something every 10 days (a nice easy number to use) or every 8.6 (based upon numeric analysis)? Many people would choose 10 because it is easier to enter or remember, not because it provides the best coverage.
OFF MY SOAP BOX

I hope this helps you make a more knowledgeable decision on how to run your systems.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Why do people choose to do something on a unit of measurement (e.g. day, week, month, minute, hour, even boundaries, multiples of 10, etc.)
Because it's fairly easy to remember, and because we don't have a strong statistical basis for doing anything else. To do better than rough numbers, we'd need to have fairly hard answers, for each type of diagnostic, on (1) how much does this test "cost" (in whatever terms are relevant--energy cost, time, reduced performance, reduced service life of the DUT), (2) what kind of failure can this test predict, (3) how soon can it predict that failure (i.e., how long before the actual failure), and (4) what are the consequences of that failure. We have reasonable answers on (2) and (4), vague information at best on (3), and virtually no real data on (1). So we take wild guesses, and come up with things like this:
  • short SMART tests don't predict a whole lot, but they do catch some failures early, and they're pretty much free. Do them frequently.
  • long SMART tests are better at predicting (or finding) failure, but also "cost" more wear on the drive. Do them regularly, but less frequently.
  • Scrubs are good at finding and fixing corruption in data and metadata, but tax the whole system quite a bit. Also do them regularly, but probably less often than the long SMART tests.
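Expressed as a hypothetical cron-style schedule following those rough guesses (FreeNAS normally configures this through the GUI, and the device name is an example):

```shell
# min hour dom mon dow   command
0 3 * * 1-6   smartctl -t short /dev/ada2   # short test daily, Mon-Sat
0 3 * * 0     smartctl -t long  /dev/ada2   # long test weekly, on Sunday
# scrubs: schedule through the FreeNAS GUI, e.g. every few weeks
```

The exact days and hours are arbitrary; the point is only the relative frequencies from the list above.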
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
An important reason NOT to run short SMART tests too often is that there is a limit to how many test results are stored, and frequent short tests will push the results of the long tests (the important ones) out of the log.
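This is because the standard ATA SMART self-test log only holds the 21 most recent entries. On drives that support it, smartctl can also read the extended log, which keeps more history; a sketch, with the device name assumed:

```shell
smartctl -l selftest /dev/ada2    # standard log: last 21 self-test results
smartctl -l xselftest /dev/ada2   # extended self-test log, if supported
```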
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Because it's fairly easy to remember, and because we don't have a strong statistical basis for doing anything else. To do better than rough numbers, we'd need to have fairly hard answers, for each type of diagnostic, on (1) how much does this test "cost" (in whatever terms are relevant--energy cost, time, reduced performance, reduced service life of the DUT), (2) what kind of failure can this test predict, (3) how soon can it predict that failure (i.e., how long before the actual failure), and (4) what are the consequences of that failure. We have reasonable answers on (2) and (4), vague information at best on (3), and virtually no real data on (1). So we take wild guesses, and come up with things like this:
  • short SMART tests don't predict a whole lot, but they do catch some failures early, and they're pretty much free. Do them frequently.
  • long SMART tests are better at predicting (or finding) failure, but also "cost" more wear on the drive. Do them regularly, but less frequently.
  • Scrubs are good at finding and fixing corruption in data and metadata, but tax the whole system quite a bit. Also do them regularly, but probably less often than the long SMART tests.
A sillier way to put this is as follows:

Before we can apply a cost function from which we can get an optimum value, we need to apply a cost function to the act of applying a cost function to SMART test scheduling. The costs of applying a cost function to the problem are rather high, because data would need to be acquired and processed to obtain measures of how many and what type of errors each SMART test can detect, with what false positive and false negative rates, and what the impact on wear and tear of each SMART test is. Instead of gathering data and doing a very specific study, our cost function attributes a lower cost to just eyeballing it than to applying a cost function to optimize detection rate versus additional wear.
 