Do I have a bad disk?

treboR2Robert · Nov 30, 2018

Strange issue I have here.

My server is in my room and I was getting fed up with hdd noise when trying to get to sleep at night.

I moved the system dataset to my mirrored USB boot drives and disabled my jails, which shut it up.
(I know this is not recommended, so I have ordered 2 SSDs for the jails and system dataset.)

But then I could hear a quiet click every 15 seconds or so.

After a long time I identified that it was coming from 1 of my 8 WD Red 3TB drives (da1).

I shut down the system, removed the drive and turned it back on.

No more click at all. :)

I plugged the drive into my laptop with a SATA-to-USB adapter and ran a 7-hour extended test using WD Data Lifeguard Diagnostic, which came back fine.

The whole 7 hours during the test the drive was silent but as soon as it finished the click came back.

I got the stopwatch out and it occurs exactly every 15 seconds.

Luckily I have a brand new spare drive, I plugged this into my laptop to see if it had the same click.

At first I could hear nothing, but if I put my ear very close to it I can hear the exact same thing. It is so quiet that I literally have to have my ear a few cm away, but it is bang on every 15 seconds, the same as the old drive.

I plugged the old drive back into my laptop to have another listen and it is so loud you can hear it easily from the other side of the room.

All the drives are about 4 years old now so past RMA time.

I thought, OK, I will just put the new one in and keep the old one as a single disk for backup, since testing says it's OK.

I then looked into what I had to do to replace the disk in FreeNAS and read that the old disk should still be attached. (oops)

Anyway, I looked at the pool status and it said one of the OTHER disks (da4) had "write" and "checksum" errors. See below.

So I thought I had better shut down and put the old disk back in to see what's up.

After putting the old disk back in, there are no errors on da4 anymore (which is actually da5 when all 8 drives are attached).

But now it says there are errors on da1 (and da1 is the drive that I pulled out earlier because of the loud click). See below.

So now I am unsure what to do. I had read something about clearing the error and running a scrub, but I'm not sure how to clear the error or whether I even need to.

I have started a scrub, but there is about 10TB of data so I'm assuming it's going to take 15 hours or so.

Should I stop the scrub, clear the errors and restart the scrub?

Thanks in advance for anyone's help.

Chris Moore · Nov 30, 2018

treboR2Robert said:
I plugged the drive into my laptop with a SATA-to-USB adapter and ran a 7-hour extended test using WD Data Lifeguard Diagnostic, which came back fine.

WD Diagnostics are not worth downloading. They can't tell you anything unless the drive is completely unusable. They don't even know how to make a diagnostic. Pure waste of time. If the drive is making a ticking noise, it is bad. If it is in warranty, send it in; if not, replace it with a new purchase.

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/

Building, Burn-In, and Testing your FreeNAS system
https://forums.freenas.org/index.php?resources/building-burn-in-and-testing-your-freenas-system.38/

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

Chris Moore · Nov 30, 2018

treboR2Robert said:
I then looked into what I had to do to replace the disk in FreeNAS and read that the old disk should still be attached. (oops)

Not needed. Just do a replacement of the missing drive. Here is the guide:
https://forums.freenas.org/index.php?resources/replacing-a-failed-failing-disk.75/

Chris Moore · Nov 30, 2018

treboR2Robert said:
Anyway, I looked at the pool status and it said one of the OTHER disks (da4) had "write" and "checksum" errors. See below.

It needs to be replaced also. You need to be monitoring your system more closely. Are you running SMART tests, and do you have your email alerts set up?

Uncle Fester's Basic FreeNAS Configuration Guide
https://www.familybrown.org/dokuwiki/doku.php?id=fester:intro

Chris Moore · Nov 30, 2018

treboR2Robert said:
I have started a scrub, but there is about 10TB of data so I'm assuming it's going to take 15 hours or so.

Should I stop the scrub, clear the errors and restart the scrub?

Just let the scrub finish and then run a long SMART test on all the drives. You may have two bad drives, but at 4 years of age, you need to be thinking about replacing them. Drives do not last forever.
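
As a rough sketch of "a long SMART test on all the drives": `smartctl -t long <device>` starts an extended self-test. The snippet below only builds the command lines; it assumes, purely for illustration, that the 8 pool drives show up as /dev/da0 through /dev/da7, which may not match the real system (USB boot devices can also take da# names).

```python
def long_test_commands(device_count: int) -> list:
    """Build argv lists that would start an extended self-test on /dev/da0 .. /dev/da(N-1)."""
    return [["smartctl", "-t", "long", f"/dev/da{i}"] for i in range(device_count)]


# Print the commands for review instead of running them.
# On the NAS itself each one could be run as root, for example with
# subprocess.run(argv, check=True) after `import subprocess`.
for argv in long_test_commands(8):
    print(" ".join(argv))
```

The tests run inside the drives, so the commands return immediately; results show up later in the self-test log section of `smartctl -a`.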

treboR2Robert · Dec 1, 2018

Chris Moore said:
WD Diagnostics are not worth downloading. They can't tell you anything unless the drive is completely unusable. They don't even know how to make a diagnostic. Pure waste of time. If the drive is making a ticking noise, it is bad. If it is in warranty, send it in; if not, replace it with a new purchase.

Hi Chris, thank you for your quick reply.
Yes, I was going to replace it with the new spare that I have until I noticed the errors on da4 (da5 with all 8 drives attached).

Now I am just a bit confused as to why these only showed up once I had removed da1.

Chris Moore said:
It needs to be replaced also. You need to be monitoring your system more closely. Are you running SMART tests, and do you have your email alerts set up?

Yes I have it set to do daily short SMART tests and also have my email alerts set up.

I did not have any problems until I shut down, removed da1 and powered back on.

Chris Moore said:
Just let the scrub finish and then run a long SMART test on all the drives. You may have two bad drives, but at 4 years of age, you need to be thinking about replacing them. Drives do not last forever.

The scrub finished and is now also reporting problems with da6, but nothing wrong with da5 like it did when da1 was removed. See below.

I ran a LONG SMART test on da1 last night and it came back fine. See below.

The click that I mentioned isn't very loud and only occurs once every 15 seconds. All of the drives seem to do it (even a brand new one), but only da1 is audible from the other side of the room.

I think what I will do now is clear the errors, if I can figure out how to do that.
Then run another scrub.

If there are still errors with da6 I will run a LONG SMART test on that.

At the top of the pool status it says that there are 0 errors. Does this mean that all my data is still intact?

Thanks again for your help.

treboR2Robert · Dec 1, 2018

So I just did a zpool clear and the status page still shows checksum errors, but not "write" errors. See below.

I am now running a LONG SMART test on all 8 drives at once. (I should have a result by 3am.)

I am very confused and starting to wish I had just ignored the faint click noise.

Chris Moore · Dec 1, 2018

Best bet is to reference your drives by serial number due to the da# changes. The last four digits are usually enough to identify the drive.
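
A minimal sketch of that idea, assuming the `Serial Number:` line that `smartctl -i` prints in its information section. The device names and serial numbers below are made-up placeholders, not drives from this thread.

```python
import re


def serial_suffix(smartctl_info: str, digits: int = 4) -> str:
    """Return the last `digits` characters of the serial number found in smartctl -i output."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_info, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Serial Number:' line found")
    return match.group(1)[-digits:]


# Hypothetical output fragments, keyed by the da# name each drive had at that boot.
sample = {
    "da1": "Device Model:     WDC WD30EFRX-68EUZN0\nSerial Number:    WD-EXAMPLE0001\n",
    "da5": "Device Model:     WDC WD30EFRX-68EUZN0\nSerial Number:    WD-EXAMPLE0002\n",
}
labels = {dev: serial_suffix(text) for dev, text in sample.items()}
print(labels)
```

Keeping a note of the suffix per drive bay means a da# that shifts after pulling a disk no longer causes confusion about which physical drive reported an error.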


treboR2Robert · Dec 2, 2018

Chris Moore said:
Best bet is to reference your drives by serial number due to the da# changes. The last four digits are usually enough to identify the drive.

Thanks, yes, I had already done that.

The long SMART tests came back clear for all 8 drives. So da1 has now had 2 long SMART tests and it passed both. See below.

I then ran another scrub and according to "zpool status" everything is fine, but the GUI pool status still shows checksum errors.
Again, see below.

Code:

root@freenas[~]# zpool status Boo
  pool: Boo
 state: ONLINE
  scan: scrub repaired 0 in 0 days 05:54:26 with 0 errors on Sun Dec  2 09:55:40 2018
config:

		NAME											STATE	 READ WRITE CKSUM
		Boo											 ONLINE	   0	 0	 0
		  raidz2-0									  ONLINE	   0	 0	 0
			gptid/d8434892-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/d95a33f2-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/da51e619-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/db7785b5-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/dc91bcc9-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/dda98813-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/dec40a2f-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0
			gptid/dfe41c77-0748-11e6-9934-0cc47a340a24  ONLINE	   0	 0	 0

errors: No known data errors
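
For anyone who wants to check the counters without reading every row, here is a small sketch that scans `zpool status` text for members whose READ, WRITE or CKSUM column is not 0. The first sample is a shortened copy of the output above; the second sample and its `gptid/example-disk` name are hypothetical, made up only to show a non-zero case.

```python
def devices_with_errors(status_text: str) -> dict:
    """Map each pool member to its (READ, WRITE, CKSUM) strings when any of them is not '0'."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "errors:":
            break  # the device table is over
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # rows after this header are pool members
            continue
        if in_config and len(fields) >= 5:
            counters = tuple(fields[2:5])  # kept as strings; zpool may abbreviate large counts
            if counters != ("0", "0", "0"):
                flagged[fields[0]] = counters
    return flagged


clean = """\
  pool: Boo
 state: ONLINE
config:

        NAME                                            STATE     READ WRITE CKSUM
        Boo                                             ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/d8434892-0748-11e6-9934-0cc47a340a24  ONLINE       0     0     0
            gptid/d95a33f2-0748-11e6-9934-0cc47a340a24  ONLINE       0     0     0

errors: No known data errors
"""

# Hypothetical status with write and checksum errors on one member.
dirty = """\
        NAME                    STATE     READ WRITE CKSUM
        Boo                     ONLINE       0     0     0
          raidz2-0              ONLINE       0     0     0
            gptid/example-disk  ONLINE       0     3    12

errors: No known data errors
"""

print(devices_with_errors(clean))
print(devices_with_errors(dirty))
```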

Any ideas ?

I'm tempted to copy my stuff onto an external drive, do a fresh install of FreeNAS and remake the pool from scratch.

What do you think ?

Thanks again

treboR2Robert · Dec 2, 2018

So I just restarted the system and all errors in the gui are GONE.

so confused

Chris Moore · Dec 2, 2018

That is strange, but there have been a lot of errors over the years where, once the error flag gets set in the GUI, the only way to clear it is to reboot the system. I would suggest that you report this as a bug, and hopefully the team that is designing the new GUI (what you are using) can make some changes that will allow it to update the status without needing a reboot.

treboR2Robert · Dec 2, 2018

Chris Moore said:
That is strange, but there have been a lot of errors over the years where, once the error flag gets set in the GUI, the only way to clear it is to reboot the system. I would suggest that you report this as a bug, and hopefully the team that is designing the new GUI (what you are using) can make some changes that will allow it to update the status without needing a reboot.

Yes I will do that.

So do you have any idea why I was getting the errors to begin with ?

Thanks

Chris Moore · Dec 2, 2018

treboR2Robert said:
Yes I will do that.

So do you have any idea why I was getting the errors to begin with ?

Thanks

I would suggest setting up your FreeNAS to allow you to SSH in and getting a client like PuTTY. There are many others; I personally like to use a Cygwin terminal. However you do it, you need to be able to SSH into your FreeNAS so that you can see all the output of your SMART status and scroll through it to examine each drive in detail. You are not just looking to see if the test completed without errors.
You need to look at this section of the results:

Code:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   118   099   006	Pre-fail  Always	   -	   171252360
  3 Spin_Up_Time			0x0003   091   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   107
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   076   060   030	Pre-fail  Always	   -	   44030377
  9 Power_On_Hours		  0x0032   085   085   000	Old_age   Always	   -	   13723
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   106
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   063   055   045	Old_age   Always	   -	   37 (Min/Max 35/39)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   59
193 Load_Cycle_Count		0x0032   084   084   000	Old_age   Always	   -	   33885
194 Temperature_Celsius	 0x0022   037   045   000	Old_age   Always	   -	   37 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   11270h+54m+20.650s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   56530597161
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   1159350999253

SMART Error Log Version: 1

and pay particular attention to these lines:

Code:

  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0

You should schedule a long test at least once a month. I do one every week. If your drives are healthy, it won't hurt and if they are not, it will help you find the problem more quickly.
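
The attribute IDs singled out above (5, 183, 184, 187, 197, 198, 199) can also be checked with a few lines of script. This is only a sketch: it assumes the ten-column layout that `smartctl -a` prints, and the second sample line with a raw value of 8 is hypothetical.

```python
WATCHED = {5, 183, 184, 187, 197, 198, 199}


def nonzero_watched(attribute_lines: str) -> dict:
    """Return {attribute name: raw value} for watched IDs whose raw value is not 0."""
    found = {}
    for line in attribute_lines.splitlines():
        # The raw value is the 10th column and may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9].strip()
        if attr_id in WATCHED and raw.split()[0] != "0":
            found[name] = raw
    return found


# Lines copied from the listing above: every watched raw value is 0.
healthy = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Hypothetical line showing what a drive with pending sectors would report.
suspect = "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8\n"

print(nonzero_watched(healthy))
print(nonzero_watched(healthy + suspect))
```

An empty result does not prove a drive is fine; it only means none of these particular counters has started to climb.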

Be sure to review all the links I gave you. Lots of good info there, especially the monitoring scripts and it makes setting those up much easier when you can SSH into your NAS.

Chris Moore · Dec 2, 2018

PS. Here is a good example of a WD drive that has failing numbers, one of my old drives:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   100   253   051	-	0
  3 Spin_Up_Time			POS--K   167   167   021	-	6641
  4 Start_Stop_Count		-O--CK   100   100   000	-	45
  5 Reallocated_Sector_Ct   PO--CK   134   134   140	NOW  1254
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   008   008   000	-	67405
 10 Spin_Retry_Count		-O--CK   100   253   000	-	0
 11 Calibration_Retry_Count -O--CK   100   253   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	43
192 Power-Off_Retract_Count -O--CK   200   200   000	-	42
193 Load_Cycle_Count		-O--CK   001   001   000	-	802190
194 Temperature_Celsius	 -O---K   117   113   000	-	33
196 Reallocated_Event_Count -O--CK   001   001   000	-	1254
197 Current_Pending_Sector  -O--CK   187   187   000	-	4234
198 Offline_Uncorrectable   ----CK   195   195   000	-	1661
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   001   001   000	-	177867

A healthy drive would look more like this:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	0
  3 Spin_Up_Time			POS--K   253   253   021	-	1083
  4 Start_Stop_Count		-O--CK   100   100   000	-	45
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   200   000	-	0
  9 Power_On_Hours		  -O--CK   008   008   000	-	67436
 10 Spin_Retry_Count		-O--CK   100   253   000	-	0
 11 Calibration_Retry_Count -O--CK   100   253   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	43
192 Power-Off_Retract_Count -O--CK   200   200   000	-	42
193 Load_Cycle_Count		-O--CK   001   001   000	-	2414193
194 Temperature_Celsius	 -O---K   116   113   000	-	34
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   200   200   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	0
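
The difference between the two listings can be checked mechanically. smartctl treats an attribute as failed when its normalized VALUE is at or below THRESH; the sketch below applies that rule to rows in the eight-column layout used in these two tables (an assumption about the input format).

```python
def failed_attributes(table: str) -> list:
    """Names of attributes whose normalized VALUE is at or below a non-zero THRESH."""
    failed = []
    for line in table.splitlines():
        # Columns: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if thresh > 0 and value <= thresh:
            failed.append(name)
    return failed


# Rows copied from the failing drive's listing.
failing = """\
  5 Reallocated_Sector_Ct   PO--CK   134   134   140    NOW  1254
197 Current_Pending_Sector  -O--CK   187   187   000    -    4234
198 Offline_Uncorrectable   ----CK   195   195   000    -    1661
"""

# Row copied from the healthy drive's listing.
healthy = "  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0\n"

print(failed_attributes(failing))
print(failed_attributes(healthy))
```

Note that attributes 197 and 198 on the failing drive carry raw counts in the thousands yet never trip this check, because their threshold is 000. That is why the raw column is worth reading as well.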

I actually still have both of those drives, but I don't use them right now. One is not being used because it doesn't work properly, the other because it is a little on the old side. Still works though.

The math on 67436 power-on hours works out to about 7.7 years, and the second drive never skipped a beat the whole time.
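
The conversion is simply the Power_On_Hours raw value divided by the hours in a year. A small sketch, assuming a 365-day year and ignoring leap days:

```python
HOURS_PER_YEAR = 24 * 365  # leap days ignored


def power_on_years(hours: int) -> float:
    """Convert a Power_On_Hours raw value to years of continuous running."""
    return hours / HOURS_PER_YEAR


# Raw values taken from SMART listings in this thread.
for hours in (67436, 29122):
    print(hours, "hours is about", round(power_on_years(hours), 1), "years")
```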

Chris Moore · Dec 2, 2018

PPS. Notice how the WD drives give much less data about the drive health than the drive in the previous post. One of the reasons I like Seagate drives is because they give more information on what the drive is doing so I can make informed decisions about the drive health. WD drives are horrible about not reporting problems until they are totally unusable.
Voice of experience. WD drives can be great drives, until they malfunction. Any drive can malfunction, but Seagate drives will try to give you a warning, if you are looking at the SMART data, to let you know it is on the way. WD either doesn't monitor or doesn't report the data that would let you figure out if there is a problem so the first thing you know is that the drive is corrupting your data for you or it just totally dies during a reboot and doesn't come back.

treboR2Robert · Dec 2, 2018

Chris Moore said:
I would suggest setting up your FreeNAS to allow you to SSH in and getting a client like PuTTY. There are many others; I personally like to use a Cygwin terminal. However you do it, you need to be able to SSH into your FreeNAS so that you can see all the output of your SMART status and scroll through it to examine each drive in detail. You are not just looking to see if the test completed without errors.
You need to look at this section of the results:

You should schedule a long test at least once a month. I do one every week. If your drives are healthy, it won't hurt and if they are not, it will help you find the problem more quickly.

Be sure to review all the links I gave you. Lots of good info there, especially the monitoring scripts and it makes setting those up much easier when you can SSH into your NAS.

Thanks, yes, I already have PuTTY, but I usually just use the GUI shell unless I need PuTTY.
I used PuTTY earlier to copy the zpool status.

And yes I have learned a lot over the past few days, and setting up long smart scans fortnightly is something I will be doing.
Thanks for the links, yes they are very helpful.

This article was very helpful for smart scans
https://forums.freenas.org/index.php?threads/scrub-and-smart-testing-schedules.20108/

Anyway here is the output of "smartctl -q noserial -a /dev/da1" (the drive that has a slightly louder click than the others)

Code:

root@freenas[~]# smartctl -q noserial -a /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Dec  2 20:50:08 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(39840) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 399) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   180   176   021	Pre-fail  Always	   -	   6000
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   131
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   061   061   000	Old_age   Always	   -	   29122
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   124
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   115
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   453
194 Temperature_Celsius	 0x0022   117   108   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 29105		 -
# 2  Extended offline	Completed without error	   00%	 29095		 -
# 3  Conveyance offline  Completed without error	   00%	 29070		 -
# 4  Short offline	   Completed without error	   00%	 28113		 -
# 5  Short offline	   Completed without error	   00%	 28089		 -
# 6  Short offline	   Completed without error	   00%	 28065		 -
# 7  Short offline	   Completed without error	   00%	 28041		 -
# 8  Short offline	   Completed without error	   00%	 28017		 -
# 9  Short offline	   Completed without error	   00%	 27993		 -
#10  Short offline	   Completed without error	   00%	 27969		 -
#11  Short offline	   Completed without error	   00%	 27945		 -
#12  Short offline	   Completed without error	   00%	 27921		 -
#13  Short offline	   Completed without error	   00%	 27897		 -
#14  Short offline	   Completed without error	   00%	 27873		 -
#15  Short offline	   Completed without error	   00%	 27849		 -
#16  Short offline	   Completed without error	   00%	 27825		 -
#17  Short offline	   Completed without error	   00%	 27802		 -
#18  Short offline	   Completed without error	   00%	 27778		 -
#19  Short offline	   Completed without error	   00%	 27754		 -
#20  Short offline	   Completed without error	   00%	 27730		 -
#21  Short offline	   Completed without error	   00%	 27706		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Chris Moore said:
PPS. Notice how the WD drives give much less data about the drive health than the drive in the previous post. One of the reasons I like Seagate drives is because they give more information on what the drive is doing so I can make informed decisions about the drive health. WD drives are horrible about not reporting problems until they are totally unusable.
Voice of experience. WD drives can be great drives, until they malfunction. Any drive can malfunction, but Seagate drives will try to give you a warning, if you are looking at the SMART data, to let you know it is on the way. WD either doesn't monitor or doesn't report the data that would let you figure out if there is a problem so the first thing you know is that the drive is corrupting your data for you or it just totally dies during a reboot and doesn't come back.

I am not the most informed by far when it comes to this stuff. I have an idea, but I'm definitely no expert.

When I built the machine 4 years ago, from what I read WD Reds were most people's recommendation for performance to price.

I have read quite a few bad things about them over the last few days though.

What Seagate drives do you recommend that are good value? Maybe I could slowly change over to those, if it is OK to mix them with WD Reds?

treboR2Robert · Dec 2, 2018

Done it again with

smartctl -q noserial -a /dev/da1

Code:

root@freenas[~]# smartctl -q noserial -a /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Dec  2 21:14:16 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(39840) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 399) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   8
  3 Spin_Up_Time			0x0027   180   176   021	Pre-fail  Always	   -	   6000
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   131
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   061   061   000	Old_age   Always	   -	   29122
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   124
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   115
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   453
194 Temperature_Celsius	 0x0022   117   108   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 29105		 -
# 2  Extended offline	Completed without error	   00%	 29095		 -
# 3  Conveyance offline  Completed without error	   00%	 29070		 -
# 4  Short offline	   Completed without error	   00%	 28113		 -
# 5  Short offline	   Completed without error	   00%	 28089		 -
# 6  Short offline	   Completed without error	   00%	 28065		 -
# 7  Short offline	   Completed without error	   00%	 28041		 -
# 8  Short offline	   Completed without error	   00%	 28017		 -
# 9  Short offline	   Completed without error	   00%	 27993		 -
#10  Short offline	   Completed without error	   00%	 27969		 -
#11  Short offline	   Completed without error	   00%	 27945		 -
#12  Short offline	   Completed without error	   00%	 27921		 -
#13  Short offline	   Completed without error	   00%	 27897		 -
#14  Short offline	   Completed without error	   00%	 27873		 -
#15  Short offline	   Completed without error	   00%	 27849		 -
#16  Short offline	   Completed without error	   00%	 27825		 -
#17  Short offline	   Completed without error	   00%	 27802		 -
#18  Short offline	   Completed without error	   00%	 27778		 -
#19  Short offline	   Completed without error	   00%	 27754		 -
#20  Short offline	   Completed without error	   00%	 27730		 -
#21  Short offline	   Completed without error	   00%	 27706		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
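
The LifeTime(hours) column of the self-test log shows how far apart the tests were in power-on hours. A rough sketch that extracts it and computes the gaps, assuming rows shaped like the ones above (a test name ending in "offline", a percentage, then the hour count); only the six newest rows are copied in as sample input.

```python
import re

# Matches e.g. "# 3  Conveyance offline  Completed without error  00%  29070  -"
ROW = re.compile(r"^#\s*\d+\s+(\w+) offline\s+.*?\d+%\s+(\d+)")


def self_test_hours(log_text: str) -> list:
    """(test type, lifetime hours) per row, in the order smartctl lists them (newest first)."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((match.group(1), int(match.group(2))))
    return rows


def gaps(rows: list) -> list:
    """Power-on hours between each test and the next older one."""
    return [newer[1] - older[1] for newer, older in zip(rows, rows[1:])]


log = """\
# 1  Extended offline    Completed without error       00%     29105         -
# 2  Extended offline    Completed without error       00%     29095         -
# 3  Conveyance offline  Completed without error       00%     29070         -
# 4  Short offline       Completed without error       00%     28113         -
# 5  Short offline       Completed without error       00%     28089         -
# 6  Short offline       Completed without error       00%     28065         -
"""

rows = self_test_hours(log)
print(gaps(rows))
```

On these six rows the short tests sit 24 hours apart, and the largest gap is the 957 hours between the short test at 28113 and the conveyance test at 29070. The log alone does not say why.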

Chris Moore · Dec 2, 2018

treboR2Robert said:
Done it again with

You might want to try -x instead of -a
smartctl -x /dev/da10
It gives a report with a little more detail. I don't see any indication in the test results that tell me where the problem is but if it is making significantly more noise, it is a good indicator that something mechanical is failing.

treboR2Robert said:
When I built the machine 4 years ago, from what I read WD Reds were most people's recommendation for performance to price.

I have read quite a few bad things about them over the last few days though.

No drives are perfect. Any of them can fail, and there is often very little way to tell before purchase whether a drive will last. Drives are often bought while they are relatively new technology, when nobody (not even the manufacturer) can say whether they will work out the way they were planned. At work I have a server filled with 60 WD Red Pro 6 TB drives, and in the two years they have been running there have been only 4 hard failures where the drive needed replacement.
I have another server that we just put in service. It only has around 600 power-on hours so far. It is filled with 60 of the 10 TB Seagate Exos drives. At this time I can only tell you that they all passed the burn-in test and that we have dumped (as of Friday) about 54 TB of data into the server with no errors. In six months I will know more, but based on my experience with drives over the last few years, I expect these drives to be fine, potentially with an even lower failure rate than the WD Red Pro drives. We also have some HGST drives and other models of Western Digital and Seagate drives, but it is very difficult to make a judgement call about what to buy. I have recently changed my position on the recent consumer drives from Seagate because they use SMR (Shingled Magnetic Recording) to put more data on drives that have lower physical capacity, and this technology makes the drive perform poorly under many use scenarios in a NAS. If I were buying drives today, I would want a drive that was engineered for multi-drive array use, such as the HGST drives:
https://serverpartdeals.com/hgst-ul...n600-0f23663-6tb-7-2k-rpm-sata-6gb-s-3-5-hdd/

treboR2Robert · Dec 2, 2018

Here is the -x output.
Unfortunately, I have no idea what I'm looking at.

Code:

root@freenas[~]# smartctl -q noserial -x /dev/da1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Dec  3 02:29:27 2018 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(39840) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 399) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x703d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   200   200   051	-	8
  3 Spin_Up_Time			POS--K   180   176   021	-	6000
  4 Start_Stop_Count		-O--CK   100   100   000	-	131
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   100   253   000	-	0
  9 Power_On_Hours		  -O--CK   061   061   000	-	29127
 10 Spin_Retry_Count		-O--CK   100   100   000	-	0
 11 Calibration_Retry_Count -O--CK   100   100   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	124
192 Power-Off_Retract_Count -O--CK   200   200   000	-	115
193 Load_Cycle_Count		-O--CK   200   200   000	-	453
194 Temperature_Celsius	 -O---K   116   108   000	-	34
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	0
							||||||_ K auto-keep
							|||||__ C event count
							||||___ R error rate
							|||____ S speed/performance
							||_____ O updated online
							|______ P prefailure warning

General Purpose Log Directory Version 1
SMART		   Log Directory Version 1 [multi-sector log support]
Address	Access  R/W   Size  Description
0x00	   GPL,SL  R/O	  1  Log Directory
0x01		   SL  R/O	  1  Summary SMART error log
0x02		   SL  R/O	  5  Comprehensive SMART error log
0x03	   GPL	 R/O	  6  Ext. Comprehensive SMART error log
0x06		   SL  R/O	  1  SMART self-test log
0x07	   GPL	 R/O	  1  Extended self-test log
0x09		   SL  R/W	  1  Selective self-test log
0x10	   GPL	 R/O	  1  NCQ Command Error log
0x11	   GPL	 R/O	  1  SATA Phy Event Counters log
0x21	   GPL	 R/O	  1  Write stream error log
0x22	   GPL	 R/O	  1  Read stream error log
0x80-0x9f  GPL,SL  R/W	 16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS	  16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS	   1  Device vendor specific log
0xbd	   GPL,SL  VS	   1  Device vendor specific log
0xc0	   GPL,SL  VS	   1  Device vendor specific log
0xc1	   GPL	 VS	  93  Device vendor specific log
0xe0	   GPL,SL  R/W	  1  SCT Command/Status
0xe1	   GPL,SL  R/W	  1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 29105		 -
# 2  Extended offline	Completed without error	   00%	 29095		 -
# 3  Conveyance offline  Completed without error	   00%	 29070		 -
# 4  Short offline	   Completed without error	   00%	 28113		 -
# 5  Short offline	   Completed without error	   00%	 28089		 -
# 6  Short offline	   Completed without error	   00%	 28065		 -
# 7  Short offline	   Completed without error	   00%	 28041		 -
# 8  Short offline	   Completed without error	   00%	 28017		 -
# 9  Short offline	   Completed without error	   00%	 27993		 -
#10  Short offline	   Completed without error	   00%	 27969		 -
#11  Short offline	   Completed without error	   00%	 27945		 -
#12  Short offline	   Completed without error	   00%	 27921		 -
#13  Short offline	   Completed without error	   00%	 27897		 -
#14  Short offline	   Completed without error	   00%	 27873		 -
#15  Short offline	   Completed without error	   00%	 27849		 -
#16  Short offline	   Completed without error	   00%	 27825		 -
#17  Short offline	   Completed without error	   00%	 27802		 -
#18  Short offline	   Completed without error	   00%	 27778		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:				  3
SCT Version (vendor specific):	   258 (0x0102)
SCT Support Level:				   1
Device State:						Active (0)
Current Temperature:					34 Celsius
Power Cycle Min/Max Temperature:	 26/37 Celsius
Lifetime	Min/Max Temperature:	  2/42 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:	 2
Temperature Sampling Period:		 1 minute
Temperature Logging Interval:		1 minute
Min/Max recommended Temperature:	  0/60 Celsius
Min/Max Temperature Limit:		   -41/85 Celsius
Temperature History Size (Index):	478 (305)

Index	Estimated Time   Temperature Celsius
 306	2018-12-02 18:32	34  ***************
 ...	..(  3 skipped).	..  ***************
 310	2018-12-02 18:36	34  ***************
 311	2018-12-02 18:37	33  **************
 ...	..( 47 skipped).	..  **************
 359	2018-12-02 19:25	33  **************
 360	2018-12-02 19:26	34  ***************
 ...	..( 30 skipped).	..  ***************
 391	2018-12-02 19:57	34  ***************
 392	2018-12-02 19:58	33  **************
 393	2018-12-02 19:59	33  **************
 394	2018-12-02 20:00	33  **************
 395	2018-12-02 20:01	34  ***************
 396	2018-12-02 20:02	34  ***************
 397	2018-12-02 20:03	34  ***************
 398	2018-12-02 20:04	33  **************
 ...	..( 67 skipped).	..  **************
 466	2018-12-02 21:12	33  **************
 467	2018-12-02 21:13	34  ***************
 ...	..( 35 skipped).	..  ***************
  25	2018-12-02 21:49	34  ***************
  26	2018-12-02 21:50	35  ****************
 ...	..(112 skipped).	..  ****************
 139	2018-12-02 23:43	35  ****************
 140	2018-12-02 23:44	34  ***************
 ...	..( 30 skipped).	..  ***************
 171	2018-12-03 00:15	34  ***************
 172	2018-12-03 00:16	32  *************
 ...	..( 94 skipped).	..  *************
 267	2018-12-03 01:51	32  *************
 268	2018-12-03 01:52	33  **************
 ...	..( 15 skipped).	..  **************
 284	2018-12-03 02:08	33  **************
 285	2018-12-03 02:09	32  *************
 286	2018-12-03 02:10	32  *************
 287	2018-12-03 02:11	33  **************
 ...	..(  7 skipped).	..  **************
 295	2018-12-03 02:19	33  **************
 296	2018-12-03 02:20	34  ***************
 ...	..(  8 skipped).	..  ***************
 305	2018-12-03 02:29	34  ***************

SCT Error Recovery Control:
		   Read:	 70 (7.0 seconds)
		  Write:	 70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID	  Size	 Value  Description
0x0001  2			0  Command failed due to ICRC error
0x0002  2			0  R_ERR response for data FIS
0x0003  2			0  R_ERR response for device-to-host data FIS
0x0004  2			0  R_ERR response for host-to-device data FIS
0x0005  2			0  R_ERR response for non-data FIS
0x0006  2			0  R_ERR response for device-to-host non-data FIS
0x0007  2			0  R_ERR response for host-to-device non-data FIS
0x0008  2			0  Device-to-host non-data FIS retries
0x0009  2			3  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2			4  Device-to-host register FISes sent due to a COMRESET
0x000b  2			0  CRC errors within host-to-device FIS
0x000f  2			0  R_ERR response for host-to-device data FIS, CRC
0x0012  2			0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4	   173405  Vendor specific

treboR2Robert · Dec 2, 2018

Chris Moore said:
No drives are perfect. Any of them can fail, and there is often very little way to tell before purchase whether a drive will last. Drives are often bought while they are relatively new technology, when nobody (not even the manufacturer) can say whether they will work out the way they were planned. At work I have a server filled with 60 WD Red Pro 6 TB drives, and in the two years they have been running there have been only 4 hard failures where the drive needed replacement.
I have another server that we just put in service. It only has around 600 power-on hours so far. It is filled with 60 of the 10 TB Seagate Exos drives. At this time I can only tell you that they all passed the burn-in test and that we have dumped (as of Friday) about 54 TB of data into the server with no errors. In six months I will know more, but based on my experience with drives over the last few years, I expect these drives to be fine, potentially with an even lower failure rate than the WD Red Pro drives. We also have some HGST drives and other models of Western Digital and Seagate drives, but it is very difficult to make a judgement call about what to buy. I have recently changed my position on the recent consumer drives from Seagate because they use SMR (Shingled Magnetic Recording) to put more data on drives that have lower physical capacity, and this technology makes the drive perform poorly under many use scenarios in a NAS. If I were buying drives today, I would want a drive that was engineered for multi-drive array use, such as the HGST drives:
https://serverpartdeals.com/hgst-ul...n600-0f23663-6tb-7-2k-rpm-sata-6gb-s-3-5-hdd/

Thank you, I will look into it more.
