Time to replace drive?

johan851 · Jun 4, 2017

I got notifications about the following set of errors over the weekend. Looks serious enough, but logging in and looking through the UI, I can't tell if this was a one-off transient event or if it's an ongoing condition. I can't find where to view the results from SMART checks.

Or maybe it's a one-off from the last SMART check that ran? I have them scheduled weekly. Can I run one again on demand?

http://imgur.com/a/taT4A

nojohnny101 · Jun 4, 2017

A simple search would have answered most of your questions.

Yes, you can run a manual SMART test if you wish (long, short, or both). You can view previous SMART test results by running the command

Code:

smartctl -l selftest /dev/adaX (with X being appropriate for your drive)

Can you still view the disk in the GUI under "Storage"->"View Disks"?

zoomzoom · Jun 4, 2017

Log in via SSH and issue the following, then please reply with the output within [code] brackets:

smartctl -t short /dev/ada5 ; sleep 65 ; smartctl -a /dev/ada5

johan851 · Jun 4, 2017

nojohnny101 said:
A simple search would have answered most of your questions.

Yes, you can run a manual SMART test if you wish (long, short, or both). You can view previous SMART test results by running the command

Code:
smartctl -l selftest /dev/adaX (with X being appropriate for your drive)

Can you still view the disk in the GUI under "Storage"->"View Disks"?

I can still view the disk under Storage -> View Disks.

On the console, I can see the message about the "currently unreadable pending sector" every few minutes.

smartctl -l selftest /dev/adaX

Code:

~# smartctl -l selftest /dev/ada5
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 44011		 -
# 2  Extended offline	Completed without error	   00%	 43924		 -
# 3  Short offline	   Completed without error	   00%	 43749		 -
# 4  Short offline	   Completed without error	   00%	 43605		 -
# 5  Short offline	   Completed without error	   00%	 43461		 -
# 6  Extended offline	Completed without error	   00%	 43217		 -
# 7  Short offline	   Completed without error	   00%	 43043		 -
# 8  Short offline	   Completed without error	   00%	 42899		 -
# 9  Short offline	   Completed without error	   00%	 42755		 -
#10  Short offline	   Completed without error	   00%	 42612		 -
#11  Extended offline	Completed without error	   00%	 42474		 -
#12  Short offline	   Completed without error	   00%	 42372		 -
#13  Short offline	   Completed without error	   00%	 42228		 -
#14  Short offline	   Completed without error	   00%	 42084		 -
#15  Short offline	   Completed without error	   00%	 41940		 -
#16  Extended offline	Completed without error	   00%	 41803		 -
#17  Short offline	   Completed without error	   00%	 41628		 -
#18  Short offline	   Completed without error	   00%	 41484		 -
#19  Short offline	   Completed without error	   00%	 41340		 -
#20  Short offline	   Completed without error	   00%	 41196		 -
#21  Extended offline	Completed without error	   00%	 41058		 -

zoomzoom said:
Log in via SSH and issue the following, then please reply with the output within [code] brackets:

Code:

# smartctl -t short /dev/ada5 ; sleep 65 ; smartctl -a /dev/ada5
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sun Jun  4 21:42:40 2017

Use smartctl -X to abort test.
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Hitachi Deskstar 5K3000
Device Model:	 Hitachi HDS5C3020ALA632
Serial Number:	ML0221F304GTGD
LU WWN Device Id: 5 000cca 369c20900
Firmware Version: ML6OA580
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	5940 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Jun  4 21:42:45 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
										was aborted by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(21777) seconds.
Offline data collection
capabilities:					(0x5b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   1) minutes.
Extended self-test routine
recommended polling time:		( 363) minutes.
SCT capabilities:			  (0x003d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   077   077   016	Pre-fail  Always	   -	   787476
  2 Throughput_Performance  0x0005   132   132   054	Pre-fail  Offline	  -	   110
  3 Spin_Up_Time			0x0007   145   145   024	Pre-fail  Always	   -	   352 (Average 403)
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   142
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   69
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   146   146   020	Pre-fail  Offline	  -	   29
  9 Power_On_Hours		  0x0012   094   094   000	Old_age   Always	   -	   44011
 10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   138
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   319
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   319
194 Temperature_Celsius	 0x0002   107   107   000	Old_age   Always	   -	   56 (Min/Max 16/63)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   127
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   1

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 93 b2 1c 0f  Error: UNC at LBA = 0x0f1cb293 = 253538963

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 e0 90 b2 1c 40 00   4d+03:27:08.561  READ FPDMA QUEUED
  b0 d0 01 00 4f c2 40 00   4d+03:27:03.357  SMART READ DATA
  2f 00 01 10 00 00 00 00   4d+03:27:03.353  READ LOG EXT
  60 08 e0 90 b2 1c 40 00   4d+03:26:48.192  READ FPDMA QUEUED
  b0 da 00 00 4f c2 40 00   4d+03:26:43.533  SMART RETURN STATUS

Error 20 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 92 b2 1c 0f  Error: UNC at LBA = 0x0f1cb292 = 253538962

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 e0 90 b2 1c 40 00   4d+03:26:48.192  READ FPDMA QUEUED
  b0 da 00 00 4f c2 40 00   4d+03:26:43.533  SMART RETURN STATUS
  ef 02 00 00 00 00 40 00   4d+03:26:43.533  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00   4d+03:26:43.533  SET FEATURES [Enable read look-ahead]
  e5 00 00 00 00 00 40 00   4d+03:26:43.533  CHECK POWER MODE

Error 19 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 61 43 1c 0f  Error: UNC at LBA = 0x0f1c4361 = 253510497

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 28 40 44 1c 40 00   4d+03:20:16.723  READ FPDMA QUEUED
  60 08 20 60 43 1c 40 00   4d+03:20:16.723  READ FPDMA QUEUED
  60 80 18 30 77 1d 40 00   4d+03:20:16.721  READ FPDMA QUEUED
  60 00 10 30 76 1d 40 00   4d+03:20:16.721  READ FPDMA QUEUED
  60 c0 08 70 75 1d 40 00   4d+03:20:16.718  READ FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 0b 42 1c 0f  Error: UNC at LBA = 0x0f1c420b = 253510155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 18 90 c0 46 40 00   4d+03:20:09.540  READ FPDMA QUEUED
  60 08 10 08 42 1c 40 00   4d+03:20:09.540  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   4d+03:20:09.536  READ LOG EXT
  60 08 10 90 c0 46 40 00   4d+03:20:05.856  READ FPDMA QUEUED
  60 08 08 08 42 1c 40 00   4d+03:20:05.856  READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 0b 42 1c 0f  Error: UNC at LBA = 0x0f1c420b = 253510155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 10 90 c0 46 40 00   4d+03:20:05.856  READ FPDMA QUEUED
  60 08 08 08 42 1c 40 00   4d+03:20:05.856  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   4d+03:20:05.852  READ LOG EXT
  60 08 08 90 c0 46 40 00   4d+03:20:01.510  READ FPDMA QUEUED
  60 08 00 08 42 1c 40 00   4d+03:20:01.510  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 44011		 -
# 2  Short offline	   Completed without error	   00%	 44011		 -
# 3  Extended offline	Completed without error	   00%	 43924		 -
# 4  Short offline	   Completed without error	   00%	 43749		 -
# 5  Short offline	   Completed without error	   00%	 43605		 -
# 6  Short offline	   Completed without error	   00%	 43461		 -
# 7  Extended offline	Completed without error	   00%	 43217		 -
# 8  Short offline	   Completed without error	   00%	 43043		 -
# 9  Short offline	   Completed without error	   00%	 42899		 -
#10  Short offline	   Completed without error	   00%	 42755		 -
#11  Short offline	   Completed without error	   00%	 42612		 -
#12  Extended offline	Completed without error	   00%	 42474		 -
#13  Short offline	   Completed without error	   00%	 42372		 -
#14  Short offline	   Completed without error	   00%	 42228		 -
#15  Short offline	   Completed without error	   00%	 42084		 -
#16  Short offline	   Completed without error	   00%	 41940		 -
#17  Extended offline	Completed without error	   00%	 41803		 -
#18  Short offline	   Completed without error	   00%	 41628		 -
#19  Short offline	   Completed without error	   00%	 41484		 -
#20  Short offline	   Completed without error	   00%	 41340		 -
#21  Short offline	   Completed without error	   00%	 41196		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
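
For anyone reading a report like the one above, here is a minimal Python sketch that pulls the raw values out of the attribute table and lists the ones that are non-zero among a small watch list. The choice of watched IDs (5, 196, 197, 198) and the helper names are assumptions for illustration, not an official list or part of smartctl; the sample rows are copied from the report.

```python
# Attributes often treated as early-failure indicators.
# This selection is an assumption for illustration only.
WATCHED_IDS = {5, 196, 197, 198}


def parse_attributes(text):
    """Return {id: (name, raw_value)} for rows of a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns, a numeric ID, and a hex FLAG.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        try:
            # Column 10 is RAW_VALUE; trailing notes like "(Min/Max 16/63)" are ignored.
            raw = int(fields[9])
        except ValueError:
            continue
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {i: attrs[i] for i in sorted(WATCHED_IDS)
            if i in attrs and attrs[i][1] > 0}


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       69
194 Temperature_Celsius     0x0002   107   107   000    Old_age   Always       -       56 (Min/Max 16/63)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       127
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in nonzero_watched(parse_attributes(SAMPLE)).items():
        print(attr_id, name, raw)
```

The normalized VALUE column still reads 100 for these attributes, which is why looking at the raw counts separately can be useful.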

johan851 · Jun 4, 2017

You can see in the SMART report the temperature reading of 56C. That's way higher than usual - I found that after I replaced a failed motherboard a couple of weeks ago, the fan controller switch got bumped. So these drives have been baking.

The errors I saw were shortly after a scheduled scrub started, so I'm guessing the extra stress pushed this one over the edge? Now I'm worried about the damage done to all six of the drives in the heat...

zoomzoom · Jun 4, 2017

@johan851 I'm not sure what the errors mean, but the last 5 occurred at power-on hour 43,975 (current is 44,011)... did something happen ~36 hours ago? Someone more knowledgeable will need to help with those errors, but it appears they're due to read errors, possibly a few write errors when trying to set drive options (enable write cache & read look-ahead).
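
The ~36 hour figure is just the current Power_On_Hours minus the hour stamp on each logged error. A small Python sketch of that arithmetic, assuming the error-log header lines look like the ones in the report above (the function name is made up for illustration):

```python
import re

# Matches header lines such as:
# "Error 21 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)"
ERROR_LINE = re.compile(
    r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours"
)


def hours_since_errors(log_text, current_power_on_hours):
    """Map each logged error number to how many power-on hours ago it occurred."""
    return {
        int(number): current_power_on_hours - int(hours)
        for number, hours in ERROR_LINE.findall(log_text)
    }


LOG = """\
Error 21 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
Error 17 occurred at disk power-on lifetime: 43975 hours (1832 days + 7 hours)
"""

if __name__ == "__main__":
    print(hours_since_errors(LOG, 44011))
```

The counter only has one-hour resolution, so treat the result as approximate.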

Something that's highly concerning is your drive temps, as 56C is WAY too high... it should normally be in the 30s, and under heavy load, the low 40s (anything over 40 is high and of concern).

Until someone with more knowledge replies, I would replace the drive with a spare and do a resilver, as it's better to be safe than sorry... especially with a drive that's near, or at, its EOL [End of Life].

Two things that would likely show any hardware issues are a long S.M.A.R.T test (smartctl -t long /dev/ada5) and an HDD benchmark test run afterwards for a heavy drive load (look for higher than normal seek times and low r/w speeds).
There were a few postings I found via Google that mentioned it could be the sign of bad blocks, but again, I don't have a lot of experience troubleshooting HDD errors, so someone with more knowledge than I will need to chime in for adequate help to be given.
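
On the bad-blocks angle: the LBA that smartctl prints for each UNC error in the report above matches the SN, CL, CH and low four bits of DH registers combined as a 28-bit address. Here is a hedged Python sketch of that decoding, using the register rows from the report; the function names are hypothetical, and the 28-bit layout is an assumption that happens to reproduce the printed values.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA registers into a 28-bit LBA (low nibble of DH is the top 4 bits)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_row(row):
    """Decode an 'ER ST SC SN CL CH DH' hex row from the SMART error log."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split())
    return lba28_from_registers(sn, cl, ch, dh)


# Register rows for errors 21 down to 17, copied from the report.
ROWS = [
    "40 51 05 93 b2 1c 0f",
    "40 51 06 92 b2 1c 0f",
    "40 51 07 61 43 1c 0f",
    "40 51 05 0b 42 1c 0f",
    "40 51 05 0b 42 1c 0f",
]

if __name__ == "__main__":
    lbas = [decode_row(r) for r in ROWS]
    for lba in lbas:
        print(hex(lba), lba)
    # Distance in sectors between the lowest and highest failing LBA.
    print("span in sectors:", max(lbas) - min(lbas))
```

With 512-byte sectors, the five logged errors fall within roughly 14 MiB of each other, which would be consistent with a localized patch of bad sectors rather than failures scattered across the platter.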

tvsjr · Jun 5, 2017

69 reallocated sectors and ongoing read errors on a drive that's 5 years old, running at 56C with at least one excursion to 63C (that's 145F... ouch!)? It's dead, Jim. Take a backup, shut down the system, fix your cooling issues, and replace the drive.
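
The conversions behind those numbers are easy to check. A quick Python sketch (the helper names are made up, and the year length of 365.25 days is an assumption):

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def hours_to_years(hours):
    """Convert power-on hours to years, assuming 365.25-day years."""
    return hours / (24 * 365.25)


if __name__ == "__main__":
    # Values taken from the SMART report: current and max temperature, Power_On_Hours.
    print(f"56C = {celsius_to_fahrenheit(56):.1f}F")
    print(f"63C = {celsius_to_fahrenheit(63):.1f}F")
    print(f"44011 hours = {hours_to_years(44011):.2f} years powered on")
```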

johan851 · Jun 5, 2017

Yep, cooling issues fixed, drives are back down to 28-30 C. I'll swap the drive out. Thanks.

zoomzoom · Jun 5, 2017

johan851 said:
Yep, cooling issues fixed, drives are back down to 28-30 C. I'll swap the drive out. Thanks.

If you wanted to do so, you could run diagnostics on the drive after it's replaced.

1. Do a S.M.A.R.T long test.
2. Once S.M.A.R.T finishes, run an HDD stress test via an HDD benchmarking utility, paying attention to the seek, read, and write times.
   - You should be able to find the normal range for your drive model via google, duckduckgo, etc. to compare the results to.
3. Run a bad block scan via a DOS utility, or a Linux/BSD equivalent. This will take 24hrs+, so ensure it's performed from a desktop or your FreeNAS server if you have a spare bay or spare motherboard SATA port. I know for DOS programs the drive has to be connected to a SATA interface, not sure about the Linux/BSD equivalents.

tvsjr · Jun 5, 2017

I would strongly suggest a long SMART test on all of your drives once you get the failed drive swapped. Perhaps a non-destructive badblocks test. At that temperature, there's a good chance you've shortened the life of other drives as well. Be prepared for it.
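
A rough Python sketch of kicking off the long test on every drive in one go. The device names ada0 through ada5 are an assumption (only ada5 appears in this thread), and the script defaults to a dry run that only prints the commands.

```python
import subprocess


def long_test_commands(devices):
    """Build one 'smartctl -t long' command per device name."""
    return [["smartctl", "-t", "long", f"/dev/{dev}"] for dev in devices]


def start_long_tests(devices, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in long_test_commands(devices):
        if dry_run:
            print(" ".join(cmd))
        else:
            # smartctl returns right away; the test itself runs inside the drive.
            subprocess.run(cmd, check=False)


# Hypothetical device names for a six-drive pool; adjust to match your system.
DEVICES = [f"ada{i}" for i in range(6)]

if __name__ == "__main__":
    start_long_tests(DEVICES)
```

The report above lists 363 minutes as the recommended polling time for the extended test on this drive model, so expect to wait about six hours before checking results with smartctl -l selftest.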

danb35 · Jun 5, 2017

zoomzoom said:
I know for DOS programs the drive has to be connected to a SATA interface, not sure about the Linux/BSD equivalents.

Badblocks on Linux or FreeBSD will work on any block device irrespective of the interface. My disks are all connected via a SAS HBA, and badblocks runs just fine.

Robert Trevellyan · Jun 5, 2017

johan851 said:
I'm worried about the damage done to all six of the drives in the heat...

Since there's nothing you can do about it now, just keep up your proper FreeNAS care and feeding and be happy that the system will let you know if another drive starts to fail.
