How long should 'NAS' HDDs last?

Status
Not open for further replies.

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
Can anybody give any indications for this? It seems to me that mine tend to crash after about 8 months - is that normal? I installed 3 WD Red WD40EFRX 4 TB on April 15, and one of those is now already offline with over 100 errors. Is there anything I could be doing wrong that would cause this? Thanks in advance for any pointers.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Can anybody give any indications for this? It seems to me that mine tend to crash after about 8 months - is that normal? I installed 3 WD Red WD40EFRX 4 TB on April 15, and one of those is now already offline with over 100 errors. Is there anything I could be doing wrong that would cause this? Thanks in advance for any pointers.
I just replaced a set after over five years of service, but not all of them lasted the full five years. It is not normal for them to only last 8 months.
There are many things you could be doing wrong. Tell us what hardware you are using, with as much detail as possible, and we may see the problem.
Also, what are the temperatures of the drives? The most frequent thing I see is overheating.
 

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
Thanks for that quick response. Here are my specs:
Asus M5A78L LE
Athlon II X2 250
8 gig of DDR-3-133MHz RAM
AOpen power supply 350W
and then the WD Red HDDs I mentioned.
I'm running FreeNAS 11 (regularly updated), with ZFS.
As to the temperatures - my system is off now (I took out the degraded one; and am waiting for the new one to arrive), so I can't check right now. But I did put them in the bay in such a way that there is an open slot between every one of them.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Athlon II X2 250
8 gig of DDR-3-133MHz RAM
AOpen power supply 350W
How old is this system, specifically the power supply?
Also, what are your chassis specifics? For example how many fans to keep the drives cool?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Other than the heat being a possible factor, the power supply could be creating what is called 'ripple' in the DC voltage it produces, which requires an oscilloscope to detect, but would be fairly damaging to the drives connected to it. If that is the problem, and it is more likely if the power supply is old, the answer would be to replace the power supply. If you decide to do that, you should go with a larger capacity power supply. I would suggest a 450 watt supply at the minimum.
 

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
The components vary in age. But the power supply still looks pretty recent (maybe 3-4 years?). It's an ATX tower; and the only fan (other than the one on the CPU of course) is in the chassis - see attached.
IMG_20171227_170706_small.jpg
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The components vary in age. But the power supply still looks pretty recent (maybe 3-4 years?). It's an ATX tower; and the only fan (other than the one on the CPU of course) is in the chassis - see attached.
Tiny photo, but I can see gaping holes in the back of the chassis, so that rear chassis fan is not pulling any air in from the front of the case to cool the drives. There should be a way to mount a fan at the front of the chassis to blow air directly onto the drives to keep them cool and you will need to do that. The drive temperature should, in an ideal case, be kept around 30 degrees centigrade but certainly not more than 45 degrees. Have you been monitoring the temperatures?

I would also suggest replacing the power supply, because it is a likely source of trouble and it is better to be safe than sorry. Power supplies, especially cheap ones like that, are pretty much shot after four or five years even if they are still putting out voltage. The components are just not built to last.
It doesn't need to be fancy, something like this should do the job and last for years:
https://www.newegg.com/Product/Product.aspx?item=N82E16817139027
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
As to the temperatures - my system is off now (I took out the degraded one; and am waiting for the new one to arrive), so I can't check right now. But I did put them in the bay in such a way that there is an open slot between every one of them.
When you have your system running again, you will probably want to set up some of these scripts:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

I have my NAS email me a daily report that includes the following:
Code:
########## SMART status report summary for all drives on server EMILY-NAS ##########

+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|Device|Serial            |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |Command|Last|
|      |Number            |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |Timeout|Test|
|      |                  |    |Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|Count  |Age |
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|da0   |x4x2xxxx          | 30 | 6018|   97|    0|      0|      0|       0|     0|  53730779|     3|      0|   0|
|da1   |x4x2xxxx          | 29 |11650|   70|    0|      0|      0|       0|     0|  59253256|     1|      0|   0|
|da2   |x307xxxx          | 30 | 3016|   38|    0|      0|      0|       0|     0|  12785487|     0|      0|   0|
|da3   |x307xxxx          | 30 | 5235|   40|    0|      0|      0|       0|     0|  13805147|     0|      0|   0|
|da4   |x307xxxx          | 30 | 5594|   81|    0|      0|      0|       0|     0|  14547109|     0|      0|   0|
|da5   |x307xxxx          | 30 | 6516|   73|    0|      0|      0|       0|     0|  16538488|     0|      0|   0|
|da6   |x4x3xxxx          | 30 | 9673|   22|    0|      0|      0|       0|     1|  30297374|     1|      0|   0|
|da7   |x305xxxx          | 30 | 2889|    7|    0|      0|      0|       0|     0|  16823925|     0|      0|   0|
|da8   |x305xxxx          | 31 | 2904|    9|    0|      0|      0|       0|     0|  17388197|     0|      0|   0|
|da9   |x4x3xxxx          | 30 | 9467|   22|    0|      0|      0|       0|     1|  28638457|     0|      0|   0|
|da10  |x4x3xxxx          | 30 | 9465|   21|    0|      0|      0|       0|     0|  29463419|     1|      0|   0|
|da11  |x4x2xxxx          | 30 | 9469|   25|    0|      0|      0|       0|     0|  27727878|     2|      0|   0|
|ada2  |x4x1xxxx          | 27 | 6433|   49|    0|      0|      0|       0|     0| 800207771|     0|      0|   0|
|ada3  |x4x1xxxx          | 27 | 6433|   53|    0|      0|      0|       0|     0| 789936201|     0|      0|   0|
|ada5  |x4x1xxxx          | 28 | 6433|   49|    0|      0|      0|       0|     0| 787824618|     0|      0|   0|
|ada4  |x4x1xxxx          | 28 | 6433|   48|    0|      0|      0|       0|     0| 783033839|     0|      0|   0|
|ada0  |x05Txxxx          | 36 |13611|  119|    0|      0|      0|       0|     0|      2995|   N/A|    N/A|   0|
|ada1  |x05Txxxx          | 35 |13612|  114|    0|      0|      0|       0|     0|      3583|   N/A|    N/A|   0|
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
Keeping an eye on the health of the drives is important to maintaining the data you want to keep.
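If you only want a quick check rather than the full report, something along these lines might do as a starting point. It is a rough sketch, not one of the scripts from that repository: the function name check_smart_attrs is made up for illustration, and it assumes the brief attribute table that smartctl -x prints (the layout shown later in this thread), where the raw value is the 8th whitespace-separated field. Other smartctl output modes use a different column layout, so adjust the field number if yours differs.

```sh
#!/bin/sh
# Hypothetical helper (not part of the FreeNAS scripts repository).
# Reads `smartctl -x` text on stdin and prints a warning for each watched
# attribute whose raw value is non-zero.
# Assumption: brief attribute table layout, raw value in field 8.
check_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # Adding 0 forces a numeric comparison on the raw value.
            if ($8 + 0 > 0) {
                printf "WARN %s raw=%s\n", $2, $8
            }
        }
    '
}
```

Run it under sh or bash, for example as smartctl -x /dev/ada0 | check_smart_attrs. No output means the three watched counters are all zero; any WARN line is worth a closer look.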
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I am guessing (because you didn't say) that with 3 drives you are running RAID-z1. Is that correct?
 

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
Hey Chris - thanks for the advice. Much appreciated.

Yes, raidz1 - is that an issue?

The new drive is resilvering now. I will look into the temperature issue (using the get_hdd_temp.sh script, I can see them running at 36 degrees, which should be fine - no?) and will buy a new power supply.
Code:
[root@freenas ~]# get_hdd_temp.sh
=== CPU (2) ===
CPU  0:   34C
CPU  1:   34C

=== DRIVES ===
  ada0:   37C [4.00TB] WD-WCC7K6AKJA2S   Western Digital Red (WDC WD40EFRX-68N32N0)
  ada1:   37C [4.00TB] WD-WCC7K6HDRKX2   Western Digital Red (WDC WD40EFRX-68N32N0)
  ada2:   36C [4.00TB] WD-WCC7K6ZF6TSR   Western Digital Red (WDC WD40EFRX-68N32N0)
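
For anyone who wants to turn readings like these into a simple pass/fail, here is a minimal sketch. The function name flag_drive_temps is hypothetical, and it assumes the drive lines look exactly like the get_hdd_temp.sh output above (device name with a trailing colon, then the temperature with a C suffix). The two thresholds are arguments, so nothing here is an official limit; 40 and 45 are just the figures mentioned earlier in this thread.

```sh
#!/bin/sh
# Hypothetical helper: label each drive line from get_hdd_temp.sh-style output.
# $1 = temperature at or above which a drive is labelled "warm"
# $2 = temperature at or above which a drive is labelled "hot"
flag_drive_temps() {
    awk -v warn="$1" -v crit="$2" '
        # Match drive lines such as "  ada0:   37C [4.00TB] ..."
        $1 ~ /^a?da[0-9]+:$/ && $2 ~ /^[0-9]+C$/ {
            temp = $2 + 0              # "37C" converts to the number 37
            dev = $1
            sub(/:$/, "", dev)         # drop the trailing colon
            label = "ok"
            if (temp >= warn) label = "warm"
            if (temp >= crit) label = "hot"
            printf "%s %d %s\n", dev, temp, label
        }
    '
}
```

Usage would be something like get_hdd_temp.sh | flag_drive_temps 40 45, run under sh or bash. CPU lines and section headers are skipped because they do not match the drive pattern.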
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You might want to run SMART checks (not tests) to get the 2 existing disks' maximum temperature.
I don't remember exactly how to do this, though from the command line it's something like:

smartctl -x /dev/HDD
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Exactly right, @Arwen. If you use smartctl -x /dev/da2, replacing da2 with your drive's device name, it will give you output something like the following, though it will look a little different on a Western Digital drive:
Code:
root@Emily-NAS:~ # smartctl -x /dev/da2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Desktop HDD.15
Device Model:	 ST4000DM000-1F2168
Serial Number:	xxxxxxxx
LU WWN Device Id: 5 000c50 09221622e
Firmware Version: CC54
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Dec 28 19:05:57 2017 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(  107) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 502) minutes.
Conveyance self-test routine
recommended polling time:		(   2) minutes.
SCT capabilities:			  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR--   117   099   006	-	161816304
  3 Spin_Up_Time			PO----   092   091   000	-	0
  4 Start_Stop_Count		-O--CK   100   100   020	-	38
  5 Reallocated_Sector_Ct   PO--CK   100   100   010	-	0
  7 Seek_Error_Rate		 POSR--   071   060   030	-	13098967
  9 Power_On_Hours		  -O--CK   097   097   000	-	3051
 10 Spin_Retry_Count		PO--C-   100   100   097	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   020	-	36
183 Runtime_Bad_Block	   -O--CK   100   100   000	-	0
184 End-to-End_Error		-O--CK   100   100   099	-	0
187 Reported_Uncorrect	  -O--CK   100   100   000	-	0
188 Command_Timeout		 -O--CK   100   100   000	-	0 0 0
189 High_Fly_Writes		 -O-RCK   100   100   000	-	0
190 Airflow_Temperature_Cel -O---K   070   057   045	-	30 (Min/Max 27/33)
191 G-Sense_Error_Rate	  -O--CK   100   100   000	-	0
192 Power-Off_Retract_Count -O--CK   100   100   000	-	12
193 Load_Cycle_Count		-O--CK   099   099   000	-	3729
194 Temperature_Celsius	 -O---K   030   043   000	-	30 (0 20 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000	-	0
198 Offline_Uncorrectable   ----C-   100   100   000	-	0
199 UDMA_CRC_Error_Count	-OSRCK   200   200   000	-	0
240 Head_Flying_Hours	   ------   100   253   000	-	2774h+01m+43.073s
241 Total_LBAs_Written	  ------   100   253   000	-	37581665099
242 Total_LBAs_Read		 ------   100   253   000	-	141153190667
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  5  Ext. Comprehensive SMART error log
0x04	   GPL,SL  R/O	  8  Device Statistics log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  SATA NCQ Queued Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x24	   GPL	 R/O   1223  Current Device Internal Status Data log
0x25	   GPL	 R/O   1223  Saved Device Internal Status Data log
0x30	   GPL,SL  R/O	  9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa1	   GPL,SL  VS	  20  Device vendor specific log
0xa2	   GPL	 VS	4496  Device vendor specific log
0xa8	   GPL,SL  VS	 129  Device vendor specific log
0xa9	   GPL,SL  VS	   1  Device vendor specific log
0xab	   GPL	 VS	   1  Device vendor specific log
0xb0	   GPL	 VS	5176  Device vendor specific log
0xbe-0xbf  GPL	 VS   65535  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL,SL  VS	  10  Device vendor specific log
0xc3	   GPL,SL  VS	   8  Device vendor specific log
0xc4	   GPL,SL  VS	   5  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  3033		 -
# 2  Extended offline	Completed without error	   00%	  3028		 -
# 3  Short offline	   Completed without error	   00%	  3009		 -
# 4  Short offline	   Completed without error	   00%	  2985		 -
# 5  Extended offline	Completed without error	   00%	  2977		 -
# 6  Short offline	   Completed without error	   00%	  2961		 -
# 7  Short offline	   Completed without error	   00%	  2937		 -
# 8  Extended offline	Completed without error	   00%	  2929		 -
# 9  Short offline	   Completed without error	   00%	  2913		 -
#10  Extended offline	Completed without error	   00%	  2905		 -
#11  Short offline	   Completed without error	   00%	  2889		 -
#12  Short offline	   Completed without error	   00%	  2865		 -
#13  Extended offline	Completed without error	   00%	  2857		 -
#14  Short offline	   Completed without error	   00%	  2841		 -
#15  Short offline	   Completed without error	   00%	  2817		 -
#16  Extended offline	Completed without error	   00%	  2809		 -
#17  Short offline	   Completed without error	   00%	  2793		 -
#18  Short offline	   Completed without error	   00%	  2769		 -
#19  Extended offline	Completed without error	   00%	  2761		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   522 (0x020a)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					30 Celsius
Power Cycle Min/Max Temperature:	 27/33 Celsius
Lifetime	Min/Max Temperature:	 20/43 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size		Value Flags Description
0x01  =====  =			   =  ===  == General Statistics (rev 2) ==
0x01  0x008  4			  36  ---  Lifetime Power-On Resets
0x01  0x010  4			3051  ---  Power-on Hours
0x01  0x018  6	 37822443344  ---  Logical Sectors Written
0x01  0x020  6	   292340112  ---  Number of Write Commands
0x01  0x028  6	 52156728054  ---  Logical Sectors Read
0x01  0x030  6	   296485851  ---  Number of Read Commands
0x01  0x038  6			   -  ---  Date and Time TimeStamp
0x03  =====  =			   =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4			3051  ---  Spindle Motor Power-on Hours
0x03  0x010  4			2474  ---  Head Flying Hours
0x03  0x018  4			3729  ---  Head Load Events
0x03  0x020  4			   0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4			   0  ---  Read Recovery Attempts
0x03  0x030  4			   0  ---  Number of Mechanical Start Failures
0x04  =====  =			   =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4			   0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4			   0  ---  Resets Between Cmd Acceptance and Completion
								|||_ C monitored condition met
								||__ D supports DSN
								|___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x000a  2			8  Device-to-host register FISes sent due to a COMRESET
0x0001  2			0  Command failed due to ICRC error
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS

root@Emily-NAS:~ #
Down toward the bottom there will be stats like this:
Code:
Current Temperature:					30 Celsius
Power Cycle Min/Max Temperature:	 27/33 Celsius
Lifetime	Min/Max Temperature:	 20/43 Celsius
You will need to keep an eye on the temperature and make sure it does not continue to climb. You don't want it to go over the drive's rated maximum operating temperature, and cooler is better. The WD Red drives I have at work are rated to 50°C and the WD Red Pro drives are rated to 60°C, but you would never want to run them that hot on a regular basis. The optimum temperature, from the research I have read, is between 30 and 40 degrees Celsius.
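
If you would rather not scroll through the whole dump each time, a small filter can pull out just the lifetime maximum. This is only a sketch: the function name lifetime_max_temp is invented for illustration, and it assumes the drive reports the SCT status line in the same "Lifetime Min/Max Temperature: MIN/MAX Celsius" form shown above. Drives without SCT support will not print that line, in which case the function prints nothing.

```sh
#!/bin/sh
# Hypothetical helper: read `smartctl -x` text on stdin and print the
# lifetime maximum temperature from the SCT status section.
# Assumption: the line reads "Lifetime Min/Max Temperature: 20/43 Celsius".
lifetime_max_temp() {
    awk '
        $1 == "Lifetime" && $2 == "Min/Max" && $3 == "Temperature:" {
            split($4, range, "/")   # range[1] = minimum, range[2] = maximum
            print range[2]
        }
    '
}
```

For example, smartctl -x /dev/da2 | lifetime_max_temp (under sh or bash) would print 43 for the drive shown above, which you can then compare against the rated maximum for your model.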
 

expediter

Dabbler
Joined
Dec 16, 2017
Messages
15
While you are replacing the power supply, look at the capacitors on the motherboard. Check whether any of them have bulging or rounded tops instead of flat ones.

I had a power supply once that was marginal, ok it was bad, but it still worked. It blew the caps on the motherboard, and after the new power supply was put in, the MB ended up breaking the new power supply.

At least that's what I think happened. It is possible there was a surge, or a couple of them.

And you do have a UPS? Are the batteries fairly fresh?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Yes, raidz1 - is that an issue?
Generally, with anything larger than a 1TB drive, we suggest using RAID-Z2 as a minimum, so there are two drives' worth of parity and the pool can survive a second failure while a failed drive is being replaced. Drive failures happen; it is just a matter of when, not if. Did you look at the resources section of the forum? There are many guides there that will help you get a better understanding of the suggested hardware configurations. All the recommendations are geared toward keeping the data protected.
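
To put rough numbers on the trade-off, here is a back-of-the-envelope sketch. The function name raidz_usable_tb is made up, and the formula (disks minus parity, times drive size) is only an approximation: it ignores ZFS metadata, padding, and the free space you should leave in the pool, so real usable capacity will be somewhat lower.

```sh
#!/bin/sh
# Hypothetical helper: approximate usable capacity of a single RAID-Z vdev.
# Usage: raidz_usable_tb DISKS PARITY SIZE_TB
# Approximation only: (disks - parity) * size, ignoring ZFS overhead.
raidz_usable_tb() {
    disks=$1
    parity=$2
    size_tb=$3
    if [ "$disks" -le "$parity" ]; then
        echo "need more disks than the parity level" >&2
        return 1
    fi
    echo $(( (disks - parity) * size_tb ))
}
```

With this, raidz_usable_tb 3 1 4 gives 8 for the three 4 TB drives in RAID-Z1 described here, and raidz_usable_tb 4 2 4 also gives 8, so moving to RAID-Z2 at the same capacity would take one more drive.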
 

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
Here's what smartctl gives:
Code:
Current Temperature:					37 Celsius
Power Cycle Min/Max Temperature:	 27/37 Celsius
Lifetime	Min/Max Temperature:	 12/38 Celsius

and
Code:
Min/Max recommended Temperature:	  0/65 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius

So it seems that must not have been the problem. I suppose it may have been the power supply. I did replace it with a 550W one with 80 Plus Bronze rating. Capacitors look ok to me (all nice - straight, flush, flat tops, no oozing electrolyte). I have a surge protector. And as to batteries - which ones do you mean?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Here's what smartctl gives:
~ snip ~
So it seems that must not have been the problem. I suppose it may have been the power supply.
I did replace it with a 550W one with 80 Plus Bronze rating. Capacitors look ok to me (all nice - straight, flush, flat tops, no oozing electrolyte). I have a surge protector. And as to batteries - which ones do you mean?
He was talking about batteries in a UPS. Do you have a UPS to deal with power failure? It is a good idea to do that.
This could have just been a random fault in the drive, though. Just keep an eye on it and see how it does. You might not have any more trouble, but it is best to be prepared. Did you purchase a new drive? If the old drive is under warranty, you probably want to send it in for replacement and keep the replacement as a spare. Hopefully, you can make some upgrades to the system as time and budget allow.
If the system is working properly now, did you have any other questions?
 

sdspieg

Contributor
Joined
Aug 6, 2012
Messages
168
Well, I cleared the other two drives (which now showed errors - I guess I should have done that first thing), and it is now resilvering. I hope that after that, I'll be fully back on 'green'. Thanks for your help, and if something is still not right after all resilvering is done, I'll post back here...
 