5 of 10 WD green drives failed in 2.5 years


maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
I am looking for advice on what my next step should be for drive replacement in my current setup, because a large percentage of my Western Digital 2 TB drives are failing. Some background first: I am running FreeNAS 8.2 x64 with 8 GB of RAM on an E8400 (I know more is recommended, but this setup goes back 3 years) and a single RAIDZ1, which I also know is not recommended. The pool is 10 x 2 TB WD Green drives that have had IntelliPark disabled to avoid unnecessary head parks. The system has been operating for 2½ years. Within the first year, 3 drives failed and I replaced them. Approaching the 3-year mark, another 2 drives have failed: one two weeks ago, which I replaced with a new drive, and one a few days ago. To make matters worse, the brand-new replacement's controller board has now died as well. The drives run at around 32-35C.
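
For what it's worth, the way I sanity-check that the parking really is tamed is to watch the Load_Cycle_Count raw value; it should barely move between checks. A quick sketch (ada0 here is just one of my drive names):

Code:
# SMART attribute 193 is the head-park counter; if IntelliPark is really
# disabled, the raw value should stay essentially flat over time.
# /dev/ada0 is just an example device name.
smartctl -A /dev/ada0 | grep Load_Cycle_Count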

I know my settings are not optimal by today's standards, and scrubs run once a month. That being said, is there something I am missing that explains such a high failure rate? Reading cyberjock's posts about 24 drives with only 1 failure in 3 years is amazing. What am I missing? I am considering a new build with more RAM, and WD RE drives if necessary. Is the failure rate down to a settings or hardware configuration issue, or just bad luck? The server is only used for streaming, nothing intensive.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Interesting, to say the least. How are the drives mounted? Did you mount them with all 4 screws? All 4 screws are highly recommended to minimize vibration along the x, y, and z axes.

It really "sounds" like you are doing everything right. When I saw the title of the thread my first thought was "wdidle", but you said you already handled it. I actually didn't disable mine; I set it to 300 seconds. I chose that because it does cause "some" wear and tear (I get 10-20 load cycles per day), but having the head parked completely off the platter is a big plus (in my opinion). Heads touching media is bad, so vibration on a hard drive that is spinning but idle can still be damaging.
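
If you want to see where your own drives sit, the per-day rate is easy to estimate from SMART data. A rough sketch (ada0 is just an example name, and the awk columns assume standard smartctl -A output):

Code:
#!/bin/sh
# Rough load cycles per day for one disk, from Power_On_Hours (attr 9)
# and Load_Cycle_Count (attr 193). /dev/ada0 is only an example.
disk=/dev/ada0
hours=$(smartctl -A "$disk" | awk '$1 == 9 {print $10}')
cycles=$(smartctl -A "$disk" | awk '$1 == 193 {print $10}')
echo "$disk: $cycles load cycles over $hours hours (~$((cycles * 24 / hours)) per day)"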

A poor quality PSU can also kill drives prematurely. Fluctuating voltages are a bane for hard drives.

So, what PSU do you have? Anything else out of the ordinary with your system? What kind of case are you using? How many hard drives are installed in it?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
One other thing.. do you turn off your server at night, or otherwise have the hard drives set to spin down when idle?
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
I had a nightmare setting up my Green... problem after problem on my first quick FreeNAS 8 build. I finally ran WDIDLE after a bit of fiddling with FreeNAS 8, and have since upgraded to 9. I'm pretty sure I disabled it, and I use the FreeNAS drive control to spin the mirror down quickly after startup; this mirror is pretty much never accessed. Cyberjock, what is the difference between disabling it and setting 300s? Also, I think the drives spin up when FreeNAS powers down.

I just set everything up to cycle during off-peak electricity hours. I think drives should run 24/7 if you have the choice. With SSDs so common now it's easy to get fairly fast hardware. I haven't looked for a full breakdown or review of both drives yet; I suspect the Greens are more eco-friendly, whereas the Reds use higher-quality, longer-lasting parts. The question is how much faster the Reds really are, considering both run at roughly 5900 RPM. I think Reds will beat Greens on writes, and most likely on reads as well.

I'd check whether you got a bad batch of Greens. Definitely check or replace the PSU, inspect the capacitors, tighten the screws, etc.
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
Cyberjock, you bring up a good point about vibration; I think that might be the cause. Some of the drives are only secured on one side. It was just a convenience thing at the time, and I might have paid the price for it.

The server runs 24/7. Initially I did not spin down the drives, but after the first 3 failed in the first year I decided to spin them down after 3 hours of inactivity. I've heard the school of thought that it's better to leave them running 24/7, but the high failure rate spooked me. Is yours running 24/7?
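
For reference, the spin-down itself boils down to a standby timer on each disk; from the shell it is roughly something like this (ada0 and the 3-hour timeout are just my example values, and the exact camcontrol flags are worth verifying against the man page before relying on them):

Code:
#!/bin/sh
# Sketch: ask one disk to drop into standby after ~3 hours (10800 s) of
# inactivity. ada0 and the timeout are examples; check camcontrol(8).
camcontrol standby ada0 -t 10800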

I'm going to check later today to see which drives failed: the fully secured ones, or the ones that are only supported on one side.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Mine do run 24x7. They never spin down. The only time I shut down my server is if I plan to go on vacation for more than a weekend.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Get Reds to avoid future headaches.. I forgot that I had a 50% failure rate right from purchase; I had to RMA my first WD20EARS pretty much immediately after buying it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Personally, I think that model matters less than how you treat the drives. Take 5 different IT guys and each one will tell you they've had bad experiences with a different brand and nobody will agree that one brand is better.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
I like Seagate drives.. Always have.. Never have had to run a wdidle or something similar..
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, I have a box of Seagates that won't stay functional long enough to trust them. They'll pass tests all day long, but put them in a system and they'll randomly disconnect and reconnect. Try using drives like that in a RAID, or even as a bootable drive for a desktop. I dropped $1800+ on them, used them for about 90 days, then realized they were so unreliable I couldn't trust them with my data. Two weeks after that I had dropped another $2000 on WD drives. So I have paid the price with Seagates. But before that incident, I swore by Seagate. In fact, I had been using them exclusively for 10 years, and if you had told me before I bought them that I would be spending my money twice for one good server, I would have told you that you were full of it.

Remember what I said above....
Take 5 different IT guys and each one will tell you they've had bad experiences with a different brand and nobody will agree that one brand is better.

Score another one for brand bias and personal experience.
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
I took some pictures of the setup showing which drives failed and when. Two of the failed drives were fully supported on both sides in the upper cage, and 3 failed in the lower cage, where the drives are screwed in on only one side.

I was thinking about the effects of vibration. I wonder if it's worthwhile having an external drive cage weighted down with some sort of mass to reduce vibration. I am also curious whether anyone has had better success with WD RE drives. I am going to build a new server soon, as this one is done and I don't trust any of the remaining drives.

I forgot that I had a 50% failure rate right from purchase; I had to RMA my first WD20EARS pretty much immediately after buying it.
What was your story? How many drives in total, what brand, and what lifespan?
 

Attachments

  • 2.jpg
  • 1.jpg
  (photos of the drive setup, including thermal-camera shots)

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh! Oh! You have a thermal camera.. I want one! I'd love to use one on my overclocked system just to see where the heat is concentrating (if it is).

If those top 4 drives don't have a fan behind them, I wouldn't use that spot.

The bottom bays are probably okay (cooling-wise) as long as the fans there aren't low-CFM. Remember that your best indicator is staying below 40C; if your hard drives are below 40C, then all is probably OK.
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
I'm resilvering now, and other than the main drive that I am replacing, two more drives are showing errors. At this point I am going to fix what remains, but I feel like scrapping the rest of the drives and switching over to Red 3 TB drives. They are half the price of RE drives and come with a 3-year warranty. Being half the price, I can add more drives for redundancy and step up the RAIDZ level on the next build. I am going to keep the vdevs smaller and opt for RAIDZ2 next time around.
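
Just to illustrate the layout I have in mind (sketch only; the real pool would be created through the FreeNAS GUI, not with raw device names like these):

Code:
#!/bin/sh
# Layout sketch only: two 6-disk RAIDZ2 vdevs instead of one wide RAIDZ1.
# In practice FreeNAS builds the pool itself (GUI, gptid labels, 4K
# alignment); the raw ada* names below are placeholders.
zpool create tank \
    raidz2 ada0 ada1 ada2 ada3 ada4 ada5 \
    raidz2 ada6 ada7 ada8 ada9 ada10 ada11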

I am starting to find more cases of high failure rates for WD 2 TB Green drives here. It seems one company is seeing 3-5 drives fail per year in a 24-drive setup. Again, I know it varies with the setup and cooling. I guess multiple drive failures per year are to be expected and you just plan accordingly. I wonder what best practice is: replace all the drives every 3-4 years?

The last thing I am considering is changing out the case, and either going with the 4224 or going custom, with a solid slab of steel boiler plate between each 4-drive cage in combination with an anti-vibration cage like this one or this. Is it better to have the drives bolted to something solid and steel, or to have rubber grommets isolating them from the main cage? The more I read, the more I find that vibration is what is killing these drives. The worst part is that no matter what I do, there is no way of knowing for a year or two whether it actually works.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I never go custom, the reason being that you don't have the engineering experience needed to deal with all aspects of mounting hard drives. If there's something critical you don't know, because your career isn't building hard drive cages, you could make things worse.

Why not just buy a case designed to hold the required number of drives in the first place, instead of buying a case that can't and then using these odd aftermarket devices to accomplish the same thing?

One thing I haven't commented on is your infrared imaging. I'll commend you for the effort you've put into checking your drives, but unfortunately I'm not sure how much value there is in it. The outside surface where you measured will naturally read lower than the back of the drives. What you really should do is use smartctl to query the drives for their internal temperature. After all, that's what Google did for their white paper, so it makes sense to use the same methodology yourself.
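
Something like this from the shell will pull what each drive reports internally (the ada* names are an assumption; adjust the glob for your controller):

Code:
#!/bin/sh
# Read the internal temperature (SMART attribute 194, raw value in C)
# from every disk. ada0..ada9 device names are assumed.
for disk in /dev/ada?; do
    temp=$(smartctl -A "$disk" | awk '$1 == 194 {print $10}')
    echo "$disk: ${temp} C"
done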

I'm gonna send you a PM...
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
I appreciate your assistance and the support, cyberjock. My system is set up for S.M.A.R.T. alerts and I get daily emails. I knew I was around 32-36C with the case closed, from smartctl -a /dev/adaX, and used the thermal camera for more insight. Thinking about it again last night, I am going to scrap this case and go with the 4224. I only considered the custom idea because it seems the only drives able to take this kind of vibration are the enterprise ones. My temperatures are fine, so the only explanations I can think of for the failures are a bad batch or vibration.

What really nailed me was that I didn't get a SMART alert for "UDMA_CRC_Error_Count" on one of my drives (ada8), which was out of control.
July 3 - ada8 failed; I replaced it and the system appeared normal.
July 26 - ada0 sent a sector-error email alert, so I replaced it too. I noticed resilvering was taking a long time, so I ran smartctl checks. That's when I discovered that the drive I had installed on July 3 (ada8) had a massive UDMA_CRC_Error_Count value.

After running more SMART tests I can see that my drives are hurting.

Here is the one I just replaced, ada8. Originally its UDMA error count was climbing by the thousands; after the resilver onto a new drive, the count is at 209. It is possible the cable is picking up EMI, or there is some other issue behind that error count.

UDMA_CRC_Error_Count
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  175  174  021    Pre-fail  Always      -      4216
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      13
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      29
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      12
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      10
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      9
194 Temperature_Celsius    0x0022  116  105  000    Old_age  Always      -      31
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      209
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0


Here is the 7th drive, ada6, with "Multi_Zone_Error_Rate" errors. It seems like another drive is starting to fail.

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  179  179  051    Pre-fail  Always      -      7402
  3 Spin_Up_Time            0x0027  244  172  021    Pre-fail  Always      -      2783
  4 Start_Stop_Count        0x0032  099  099  000    Old_age  Always      -      1195
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  073  073  000    Old_age  Always      -      19851
10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      78
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      58
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      1143
194 Temperature_Celsius    0x0022  124  111  000    Old_age  Always      -      26
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  001  000    Old_age  Always      -      15
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      19
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  175  175  000    Old_age  Offline      -      6742
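
Since the CRC errors crept up without tripping an alert, I'm now spot-checking attribute 199 on all the drives with a quick loop like this (ada* device names assumed):

Code:
#!/bin/sh
# Spot-check UDMA_CRC_Error_Count (attribute 199) on every disk; a climbing
# raw value usually points at cabling or EMI rather than the platters.
for disk in /dev/ada?; do
    crc=$(smartctl -A "$disk" | awk '$1 == 199 {print $10}')
    echo "$disk: UDMA_CRC_Error_Count = ${crc}"
done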
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You know, with the widespread issues you are having, I'd try a new PSU.

UDMA_CRC_Error_Count is usually a sign of cabling issues, so replacing the SATA cables is an option (as you've mentioned). I'd probably replace them anyway, just because you can and they are so cheap.

I have the 4224. There are some limitations to it though.

1. It has inadequate cooling (my thread: http://forums.freenas.org/threads/norco-rpc-4224-and-hard-drives-that-are-too-hot.13022/). The total airflow is just too low, and I'm using Green drives that are supposed to run cooler. :(
2. It requires you to provide your own PSU (not necessarily bad if you already have one).
3. To use all 24 ports you will need controllers that can handle that many ports. Supermicro chassis, by comparison, usually need only a single or dual SFF-8087 connector for all 24 bays.

In all seriousness, if you are really looking for an excellent long-term server, I'd buy a Supermicro chassis. I know they cost more, but after you factor in adding better fans, providing your own PSU, and the hardware needed to use all 24 bays, it's not really that much more, and the build quality is so far superior that it's hard to say no. I regret the Norco case, and my next build will be a Supermicro without a doubt.

Or, you can buy something like.. http://www.ebay.com/itm/200953227387?ssPageName=STRK:MEWAX:IT&_trksid=p3984.m1423.l2649

Then gut it and put your hardware in (or use the hardware it comes with). Even the included hardware should be powerful enough to run circles around FreeNAS, though more RAM for that setup might be a good idea.
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
The power supply is an Enermax 650 W. I am going to test it next; if I get really ambitious, maybe I will check the ripple voltage.

I saw your post the other day and was thinking of adding a Panasonic FV-20NLF1 to that case along with an Enermax 1200 W power supply. The fan is rated at 240 CFM and it will handle some static pressure. I was thinking that if I had that case I would take off the fan plate between the drives and the motherboard and just let the Panasonic do all the work. It should provide a massive amount of airflow at 55 watts, without all the noise. The fan would sit outside the case, attached with flex duct to a 4" x 8"-to-6" duct adapter via the PCI slots. Your Supermicro fans are 72 CFM each x 4 = 288 CFM.

My server is basically dead, as two drives dropped out. Luckily I have an offline backup of everything, so I am going to RMA all the drives and start the rebuild process. That used server you suggested looks really good; I am going to check it out.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That Panasonic doesn't publish the entire "WC vs CFM curve, but my guess is that the fan is a very poor choice. If you look at the Supermicro fan I bought in that other thread, it's rated for over 2"WC. The fan you listed is already down to 200 CFM at 0.4"WC and spins at only 1260 RPM. If you read my long explanation of fan theory, you'll see that lower-RPM, larger-diameter fans deliver lower CFM at higher differential pressure (D/P). I'd love to see what the shutoff head is for that fan, but if you extrapolate the curve from those 3 points, you may have cooling problems. With high-density cases you need fans that maintain high CFM at high D/P (which is the problem I had to deal with in April/May).

If you look at http://www.nidecamerica.com/fanpdfs/v80e_a5.pdf you'll see that as D/P goes up, the CFM drops way off. I'd bet I get maybe half of that theoretical 288 CFM at best, and probably more like a quarter in reality because of the D/P.

One rule I try to stick to is not to go with non-standard (and especially custom) designs. For example, server-class gear is built to be used as a server, which means 24x7 uptime, high reliability, excellent cooling, etc. There's no reason to reinvent the wheel, or to think you can make a better one with some other design. Just stick with what works. :)
 

maiitax

Cadet
Joined
Aug 20, 2013
Messages
9
I had a few days to think about the different options. I think I found a solution, and I agree with you 100% on avoiding custom designs; I don't feel like going crazy making one work, doing the calculations, and trying new ideas. I have done it in the past and it's too much of a pain.

What is your opinion on vibration-damping racks? I know I asked something similar before, but I was looking at the Caselabs http://www.caselabs-store.com/magnum-th10-case/ which can hold 32 drives with their adapters. This would let me spread out the drives so they can breathe, with fewer vibration issues. My biggest question now is whether it's better for metal screws to go straight into the drives, or to use these rubber grommets; I don't want to end up with the failure rate I had before. Here is their adapter: http://www.caselabs-store.com/standard-hdd-cage-assy/



I like this option because it lets me use slower-spinning 120 mm fans without worrying about the heat issues you experienced with the 4224 case and having to move to a 7k RPM fan. Plus, I am considering WD Se enterprise drives, with their 5-year warranty and 7200 RPM, so the extra breathing room will make them easier to cool. I am all for pro-grade gear if it works, but my house is not a data center, and at night I don't want to hear the buzz of fans even in the basement. I had a 4k RPM 80 mm fan a long time ago and it was loud; I can't imagine what a 7k RPM fan would sound like.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I have very mixed opinions about them. Here are my thoughts, in no particular order:

1. Those vibration-dampening gaskets aren't exactly a new idea. Surely if they made a significant difference in hard drive lifespan they'd have become far more mainstream before 2010 or so, right? Plus, if you look at servers like Supermicro chassis (or my own server), they don't have that stuff, and I'm having great results without it. So do they matter?

2. If you look at any commercial server chassis, they don't have that stuff either. I'd think that if it really had an impact on drive life, it would be in there, right?

3. Rubber only provides good dampening up to a point. If it is compressed too much (or not enough) it won't absorb vibration; it will just let the whole drive wobble, or serve no function at all. The rubber also can't just be any off-the-street rubber; it has to have certain characteristics that make it good at absorbing vibration, and I'm just not convinced these companies are investing much money in good-quality rubber. Also, of the many dampening designs I have seen, I can't say I've found one that looks like the engineering was done to get exactly the right amount of compression to do the job well. I was on a submarine in the Navy, so I know just a little about vibration dampening. ;)

4. Look at my server... I have none of that stuff and I've had great lifespans with my disks. I'm not convinced it was luck, either, as I bought the drives from a few different sources within about 2 months of each other.

But clearly you can have very good lifespans without them. And since I'm not convinced they provide a benefit, while they could certainly make things worse, I tend to avoid them when I can.

Personally, I think that if you mount the drives the way they are supposed to be mounted (i.e., use all 4 screws), leave them on 24x7, keep them below 40C, and use a good PSU with a proven track record, then you are doing about the best you can for yourself. From what I've seen on this forum, people get longer lifespans out of drives below 7200 RPM than out of those at 7200 RPM or higher (presumably because of heat, but possibly because the higher RPM just wears out the components faster).

Keep in mind that while 7200 RPM drives will give you higher transfer rates and lower seek times, you are almost certainly going to be bottlenecked at the NIC unless you are doing an ESXi datastore or something equivalent. So buy appropriately: if you don't have an actual use for the higher RPM, I'd recommend you consider the lower-RPM equivalents.
 