Help with SMART results

Zebbe152 · Oct 21, 2018

Yesterday I noticed that I got this error in the console

Code:

Oct 20 05:13:11 freenas (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Oct 20 05:13:11 freenas (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Oct 20 05:13:11 freenas (ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
Oct 20 05:13:11 freenas (ada1:ahcich1:0:0:0): RES: 51 04 38 96 ee 40 58 02 00 00 00
Oct 20 05:13:11 freenas (ada1:ahcich1:0:0:0): Retrying command
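
For readers unfamiliar with these lines, the status and error bytes can be unpacked bit by bit. The sketch below is only an illustration: the names DRDY, SERV, ERR and ABRT come from the log above, while the other bit names are the conventional ATA register labels and are assumed here. `decode_bits` is a hypothetical helper, not part of any tool.

```python
# Bit names, index 0 = bit 0. DRDY/SERV/ERR/ABRT appear in the log above;
# the rest are conventional ATA labels (an assumption of this sketch).
STATUS_BITS = ["ERR", "INDEX", "CORR", "DRQ", "SERV", "DF", "DRDY", "BSY"]
ERROR_BITS = ["ILI", "NM", "ABRT", "MCR", "IDNF", "MC", "UNC", "ICRC"]


def decode_bits(value, names):
    """Return the names of the bits set in an 8-bit register, highest bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


status = decode_bits(0x51, STATUS_BITS)  # "ATA status: 51"
error = decode_bits(0x04, ERROR_BITS)    # "error: 04"
print(status, error)
```

Read this way, the drive reported itself ready but aborted the cache flush command; the lines alone do not say why.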

As soon as I noticed that, I ran the following SMART tests (in this order): short, conveyance, and finally a long test.

I checked the SMART results after each test. There were no errors after either the short or the conveyance test.

But after the long test I got this SMART result:

Code:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 113) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				( 2624) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 680) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   197   197   021	Pre-fail  Always	   -	   9108
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   19
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   747
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   19
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   8
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   158
194 Temperature_Celsius	 0x0022   119   114   000	Old_age   Always	   -	   33
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   10%	   747		 1492031032
# 2  Conveyance offline  Completed without error	   00%	   736		 -
# 3  Short offline	   Completed without error	   00%	   736		 -
# 4  Short offline	   Completed without error	   00%	   689		 -
# 5  Short offline	   Completed without error	   00%	   522		 -
# 6  Extended offline	Completed without error	   00%	   437		 -
# 7  Short offline	   Completed without error	   00%	   354		 -
# 8  Extended offline	Completed without error	   00%		23		 -
# 9  Conveyance offline  Completed without error	   00%		10		 -
#10  Short offline	   Completed without error	   00%		10		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Since it says "Completed: read failure" I'm assuming that this isn't good. But I'm a bit confused since the Raw Read Error Rate still is at 0.

Is this an early indication of drive failure or could it be something else, like a bad cable or something?

I'm new to FreeNAS and I've never completely understood how to read the SMART results, so I'm hoping you guys can help me out.
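
One way to check whether the console error and the failed extended test point at the same place is to decode the RES line. This is a sketch under two assumptions that are not confirmed anywhere in this thread: that the RES bytes are ordered status, error, LBA low/mid/high, device, LBA low/mid/high (extended), count, count (extended), and that a failed FLUSHCACHE48 reports the failing sector in those LBA fields. `res_to_lba48` is a hypothetical helper name.

```python
def res_to_lba48(res_line):
    """Assemble a 48-bit LBA from the hex bytes of a 'RES:' line.

    Assumed byte order (not confirmed in the thread):
    status, error, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, count, count_exp
    """
    b = [int(x, 16) for x in res_line.split()]
    low, mid, high = b[2], b[3], b[4]
    low_exp, mid_exp, high_exp = b[6], b[7], b[8]
    return (
        (high_exp << 40) | (mid_exp << 32) | (low_exp << 24)
        | (high << 16) | (mid << 8) | low
    )


lba = res_to_lba48("51 04 38 96 ee 40 58 02 00 00 00")  # RES bytes from the console
selftest_lba = 1492031032  # LBA_of_first_error from the self-test log

print(lba)                                # full 48-bit value
print(lba & 0xFFFFFFFF)                   # low 32 bits only
print((lba & 0xFFFFFFFF) == selftest_lba)
```

Under those assumptions, the low 32 bits of the decoded address equal the LBA_of_first_error reported by the extended test. That would suggest both messages refer to the same sector, with the self-test log showing only a truncated address. Treat it as a hint rather than proof.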

Johnnie Black · Oct 21, 2018

Zebbe152 said:
like a bad cable or something?

An extended test read failure can't be cable related; it's a disk surface problem, though that's not very common without any other SMART attribute warnings. Except for the failed test, the SMART report looks perfect. I would repeat the extended test once more just to make sure it wasn't some kind of fluke, and if it fails again, replace the disk.

Zebbe152 · Oct 21, 2018

If the test was a fluke, isn't it a little weird that the test would fail within a day of the console error?

Anyhow, I'll run the test again and see what happens. If it fails again, do you think that this would be enough for a warranty replacement?

Just asking since the "SMART overall-health self-assessment test result" says PASSED?

Johnnie Black · Oct 21, 2018

Zebbe152 said:
If the test was a fluke, isn't it a little weird that the test would fail within a day of the console error?

It will most likely fail again, but no harm in doing it to be sure.

Zebbe152 said:
If it fails again, do you think that this would be enough for a warranty replacement?

Yes

Zebbe152 said:
Just asking since the "SMART overall-health self-assessment test result" says PASSED?

SMART overall health status is mostly useless: you can have a disk with 100 pending sectors and it will still say PASSED. It only shows FAILED if there's an attribute that SMART marks as failing now.
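
To make that concrete, here is a minimal sketch of the rule as described above: an attribute counts as failing now only when its normalized VALUE is at or below THRESH, regardless of RAW_VALUE. The drive makes the real overall-health decision itself, so this only mirrors the idea. `failing_attributes` is a hypothetical helper, and treating a threshold of 0 as "never fails" is an assumption of the sketch.

```python
def failing_attributes(rows):
    """Return names of attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for row in rows:
        fields = row.split()
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # Assumption: a threshold of 0 means the attribute can never fail.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


# Rows taken from the report earlier in the thread (whitespace normalized).
rows = [
    "1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0",
    "5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0",
    "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0",
]
print(failing_attributes(rows))

# Hypothetical row: RAW_VALUE of 100 pending sectors, normalized VALUE unchanged.
hypothetical = ["197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 100"]
print(failing_attributes(hypothetical))
```

Both calls return an empty list, which is the point being made: a raw count of pending sectors does not by itself flip the overall result.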

Zebbe152 · Oct 22, 2018

As expected the extended test failed on the second pass as well.

I replaced the drive with a spare one I had. I'll be requesting a warranty replacement for the one with the SMART error.

Thank you for your help!

Zebbe152 · Oct 25, 2018

I hooked the drive up to my Windows desktop machine and installed the Western Digital Data Lifeguard tool. I ran the extended test with the software; it found some bad sectors and asked if it should repair them. According to the software, it was able to "repair" them successfully, so I ran the extended test again and the drive passed without any issues.

I checked the SMART data with HDTune Pro:

The only error is "Interface CRC Error Count" and according to the description it can be caused by a bad cable. Since it's only 1 error I'm starting to think that this might actually be a fluke or maybe a badly seated cable.

The WD software claimed that it found bad sectors and repaired them. If those sectors were actually bad physical sectors, they would show up in the SMART data as re-allocated sectors, correct? I'm thinking that the communications error caused a bad logical sector that the software was able to correct. So, what do you guys think?

I also ran the error scan in the HDTune software and the drive passed without any issues.

Johnnie Black · Oct 25, 2018

Zebbe152 said:
The only error is "Interface CRC Error Count" and according to the description it can be caused by a bad cable. Since it's only 1 error I'm starting to think that this might actually be a fluke or maybe a badly seated cable.

That's usually a bad cable, but it has nothing to do with the previous failed SMART tests. In fact, that attribute was 0 on the first SMART report, so it happened after that. The WD tool remapped the bad sectors with spare ones, so the disk should be OK, at least for now.

Zebbe152 · Oct 25, 2018

But if the bad sector was remapped with a spare sector, wouldn't the re-allocated sector count increase? That value is still at 0.

Johnnie Black · Oct 25, 2018

Zebbe152 said:
But if the bad sector was remapped with a spare sector, wouldn't the re-allocated sector count increase?

Not when it's done by a disk utility; it would only increase if done by the disk's firmware. Even in that case it sometimes doesn't increase. Firmware isn't perfect, as in your case, where the disk wasn't showing any bad sectors in the SMART report despite having some.

Zebbe152 · Oct 25, 2018

Ok. I'll check with WD and see what they say. But I don't think that they'll replace the drive since it passed both the quick and extended test in the software.

Johnnie Black · Oct 25, 2018

Zebbe152 said:
But I don't think that they'll replace the drive since it passed both the quick and extended test in the software.

They usually still do, since they send you a refurbished drive as soon as they get yours. Whether that refurbished drive will last any longer than yours would is a different matter, since they are a crapshoot. I would make sure to schedule regular extended SMART tests and, if there are no more issues, keep that one for now. The devil you know...

Zebbe152 · Nov 8, 2018

So... A couple of days ago I got the CAM status error again (almost identical, only some different numbers).

Code:

(ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(ada1:ahcich1:0:0:0): RES: 51 04 60 02 4c 40 33 01 00 00 00
(ada1:ahcich1:0:0:0): Retrying command

I noticed that the error was related to the same SATA port as before. I replaced both the drive and the SATA cable earlier, so I started to think that this might be a controller problem. Anyway, I ran the extended SMART test on the drive and it passed without any issues. I thought that I'd wait and see if the error occurs again.

Today however I got another SMART error, but on a different drive (ada4 this time). To me the SMART data look fine, just like on the first drive (strange?).

Code:

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Red
Device Model:	 WDC WD60EFRX-68L0BN1
Serial Number:	WD-WX61D38DL9K8
LU WWN Device Id: 5 0014ee 004880444
Firmware Version: 82.00A82
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5700 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Thu Nov  8 16:57:14 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
										was never started.
										Auto Offline Data Collection: Disabled.
Self-test execution status:	  ( 113) The previous self-test completed having
										the read element of the test failed.
Total time to complete Offline
data collection:				( 7184) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 725) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   197   197   021	Pre-fail  Always	   -	   9133
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   22
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   1181
 10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   22
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   8
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   159
194 Temperature_Celsius	 0x0022   118   111   000	Old_age   Always	   -	   34
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   10%	  1180		 2954611968
# 2  Short offline	   Completed without error	   00%	  1097		 -
# 3  Short offline	   Completed without error	   00%	   856		 -
# 4  Short offline	   Completed without error	   00%	   689		 -
# 5  Short offline	   Completed without error	   00%	   522		 -
# 6  Extended offline	Completed without error	   00%	   438		 -
# 7  Short offline	   Completed without error	   00%	   354		 -
# 8  Extended offline	Completed without error	   00%		24		 -
# 9  Conveyance offline  Completed without error	   00%		10		 -
#10  Short offline	   Completed without error	   00%		10		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Do I just have incredibly bad luck and got two defective drives, or is something else wrong with my system?

Johnnie Black · Nov 8, 2018

Zebbe152 said:
To me the SMART data look fine, just like on the first drive (strange?).

SMART attributes do look fine, but it failed the extended test:

Zebbe152 said:
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   10%	  1180		 2954611968
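
Since the attribute table can look clean while a self-test has failed, it can help to scan the self-test log on its own. The following is a rough sketch that assumes the log layout printed in this thread; `failed_tests` and the `ENTRY` pattern are hypothetical, not part of smartctl.

```python
import re

# One log entry: "#N  <test>  <status>  <remaining>%  <hours>  <lba or ->".
# Test name and status are separated by a tab or by two or more spaces.
ENTRY = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)(?:\t+|\s{2,})(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_tests(log_text):
    """Return (test, power-on hours, LBA) for every entry whose status reports a failure."""
    failed = []
    for line in log_text.splitlines():
        m = ENTRY.match(line.strip())
        if m and "failure" in m.group("status"):
            failed.append((m.group("test"), int(m.group("hours")), m.group("lba")))
    return failed


# Entries copied from the ada4 report (whitespace normalized).
LOG = """\
# 1  Extended offline    Completed: read failure       10%      1180         2954611968
# 2  Short offline       Completed without error       00%      1097         -
# 6  Extended offline    Completed without error       00%       438         -
"""

print(failed_tests(LOG))
```

Matching on the word "failure" is a simplification; other non-success statuses would need their own handling.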

Zebbe152 · Nov 8, 2018

Yes, I know that :). So basically I just have some incredibly bad luck and got two defective drives that show good SMART data but fail the extended tests?

Can any of this be related to the CAM status error, or is that a totally different issue? Should I be concerned about that, and if so, what should I do about it (I've already replaced the SATA cable and the drive)?

Sorry for all the stupid questions.

Johnnie Black · Nov 8, 2018

A failed SMART test can't be cable/connection related; it's a problem with the disk surface, and IMO the disk should be replaced.

Zebbe152 · Nov 9, 2018

I ran the WD tool on the drive and it found bad sector(s) (as expected). The software offered to repair the bad sectors. Just like the previous time, the re-allocated sector count stayed at 0 after the "repair". I know you said that some disk utilities might re-allocate the sector by themselves and that this might not trigger the re-allocated count in the SMART data. You also said that the drive's firmware must re-allocate the sector in order for the re-allocated count to increase.

So I checked the help in the software and it says the following:

"The Extended Test scans the entire disk media from LBA #0 to the maximum LBA. When a bad sector is detected, you are prompted to repair it. If you choose to repair the sector, the program writes zeros to the bad sector; this causes the drive firmware to relocate the bad sector and return the sector to a defect-free state."

Since it clearly states that the drive's firmware should re-allocate the sector, isn't it strange that the re-allocated sector count is still at 0?

I basically don't want to RMA two drives if there's a chance that this might be caused by something other than a disk surface issue, especially since I would get refurbished drives in return if I were to RMA them.

Johnnie Black · Nov 9, 2018

Zebbe152 said:
Since it clearly states that the drive's firmware should re-allocate the sector, isn't it strange that the re-allocated sector count is still at 0?

I've never seen bad sectors repaired by any disk utility, from the manufacturer or other utilities, result in SMART showing reallocated sectors, and I do that on a regular basis at work, but can't say it will never happen.

Zebbe152 · Nov 9, 2018

So, what would you personally do, replace them both?

I emailed WD about a warranty replacement when the first drive had issues (I attached both the SMART report and result of the WD tool). They replied and said that the drive "should" be fine but if I don't trust the drive they could replace it. However they asked me to contact the reseller that I bought the drive from since the drive is brand new. The thing is that I don't think that the reseller will replace it since the drive passes all tests and the SMART data looks ok.

Johnnie Black · Nov 9, 2018

Zebbe152 said:
So, what would you personally do, replace them both?

Difficult to say. I hate refurbished drives; they fail like crazy. IIRC the first disk didn't fail again after the repair; if that's correct, I would give it a second chance. Same for the other: repair it once, though if either fails again I would likely replace it and take my chances with a refurbished drive.

Zebbe152 · Nov 9, 2018

I actually never put the first drive back in the system. I started the email conversation with WD and in the meantime I bought a new one. I figured that it wouldn't hurt to have a spare lying around anyway.

I'm thinking that I might put one of them back in the system and see what happens.

I've read that there are two types of bad sectors: bad physical sectors and bad logical sectors. If I understand correctly, bad physical sectors can't be repaired in any way; the only solution is to re-allocate them. Logical bad sectors, however, can turn out to be OK if the drive succeeds when it tries to overwrite the sector with new data.

Might that be a plausible explanation for why the SMART count never increases? Can bad logical sectors be caused by, say, a bad SATA controller?
