Help with SMART errors

Status: Not open for further replies.

s25a (Explorer, joined Jan 16, 2016, 76 messages)
Hi,

I need your help: I have a faulty disk and want to understand what happened.

First of all, I run regular extended SMART tests on my disks and then send an email report. The scripts I use are hosted here: https://github.com/Spearfoot/FreeNAS-scripts/blob/master/smart_report.sh

Last night's output is shown below (details only for the first three disks). I have several questions, and I hope someone can help, because I am really confused.

1) The faulty disk is ada1. The overall status report, however, says everything is fine, and at first glance the detailed section for ada1 looks fine too.
It says "No Errors Logged", but CRC_Error_Count is over 6,000 (6076). So I guess this means the SSD is almost broken, right?
If so, I should replace it immediately. It is a mirrored volume, so I assume I should simply shut down, swap the disk and replace it in the web GUI. Sorry for the stupid question; I have never had this case before. The good thing is that everything is backed up, so even if something goes wrong it would not be a disaster.

2) About the script: I think a lot of people here use it. My question is: shouldn't there be a clear message in the header that one disk is broken? I do not fully understand the script, so can someone help me understand it better?
Maybe everything is right and I just misread the parameters.

3) For disk 3 I also see the status "Aborted by host". What does that mean?

Thanks a lot for your help and support

S

Code:
########## SMART status report summary for all drives on server NAS ##########

+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|Device|Serial            |Temp|Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |Command|Last|
|      |Number            |    |On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |Timeout|Test|
|      |                  |    |Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|Count  |Age |
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+
|ada0  |S252NXAG820276K   | 34 |24303|     |     |      0|       |        |   N/A|       N/A|   N/A|    N/A|   1|
|ada1  |S252NCAGA00376M   | 34 |23966|     |     |      0|       |        |   N/A|       N/A|   N/A|    N/A|   1|
|ada2  |WD-WCC4E4AH9899   | 29 |13896| 4797|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|ada3 ?|WD-WCC4E1XUF2Z4   | 29 |13879| 4225|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|ada4 ?|WD-WCC4E5NF9F5F   | 28 |13892| 4860|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
|ada5 ?|WD-WCC4E4RXR2E0   | 28 |13883| 4178|    0|      0|      0|       0|   N/A|       N/A|   N/A|    N/A|   1|
+------+------------------+----+-----+-----+-----+-------+-------+--------+------+----------+------+-------+----+

########## SMART status report for ada0 drive (Samsung based SSDs: S252NXAG820276K) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       24303
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       200
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       14
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   066   045   000    Old_age   Always       -       34
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       1
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       156
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       13190459927

No Errors Logged

Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%        24290            -

########## SMART status report for ada1 drive (Samsung based SSDs: S252NCAGA00376M) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       23966
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       197
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   066   046   000    Old_age   Always       -       34
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   093   093   000    Old_age   Always       -       6076
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       152
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       12669998984

No Errors Logged

Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Interrupted (host reset)      00%        23952            -

########## SMART status report for ada2 drive (Western Digital Red: WD-WCC4E4AH9899) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   179   021    Pre-fail  Always       -       7841
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4797
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       13896
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       123
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       50
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       11205
194 Temperature_Celsius     0x0022   123   099   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                        Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Aborted by host               90%        13883            -


m0nkey_ (MVP, joined Oct 27, 2015, 2,739 messages)
> It says no errors logged but looking at the CRC_Error_Count - it has more than 6K.
These are data transmission errors between the drive and the controller, not errors on the flash itself. It might be an idea to replace the SATA cable first.
> 2) Coming to this script - I am not sure but I think lot of people use this here. My question is...There should be a clear message in the header that one disk is broken. As I do not fully understand the script can someone help me to understand better.
Since it's a third-party script, I suggest you open an issue on its GitHub page.
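The summary table in the report has no CRC column, so a non-zero CRC_Error_Count never shows up there. The only place it appears is the per-drive attribute list. Below is a minimal Python sketch of a do-it-yourself check. It assumes the plain-text attribute layout printed by `smartctl -A`, as in the report above. `parse_attributes`, `crc_warnings` and the warning text are hypothetical names for illustration and are not part of smart_report.sh. The check flags any non-zero value, so ada0's single CRC error would be reported too. What really matters is whether the count keeps growing.

```python
# Hypothetical helper (not part of smart_report.sh): flag non-zero interface
# CRC error counters in the attribute table printed by `smartctl -A`.
CRC_ATTRIBUTE_IDS = {199}  # 199 = CRC_Error_Count / UDMA_CRC_Error_Count in this thread


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # the "ID# ATTRIBUTE_NAME ..." header row is skipped here.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        # Non-numeric raw values are stored as None.
        attrs[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else None)
    return attrs


def crc_warnings(device, text):
    """Return one warning line per non-zero CRC counter of a device."""
    warnings = []
    for attr_id, (name, raw) in parse_attributes(text).items():
        if attr_id in CRC_ATTRIBUTE_IDS and raw:
            warnings.append(f"{device}: {name} = {raw}, check the cable")
    return warnings


# Two rows copied from the ada1 report above.
sample_ada1 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 CRC_Error_Count         0x003e   093   093   000    Old_age   Always       -       6076
"""

for warning in crc_warnings("ada1", sample_ada1):
    print(warning)  # ada1: CRC_Error_Count = 6076, check the cable
```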
> What I also see in disk 3 is: "Status aborted by host". What does that mean?
The SMART test did not finish: it was either cancelled or the system was rebooted while it was running.
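ada1's last test has a related status, "Interrupted (host reset)". So neither ada1 nor ada2 completed its extended test normally, and ada2 stopped with 90% still remaining. Either way, the result says nothing about the disk and the test simply needs to be run again. Below is a minimal sketch of how a report could sort the self-test statuses seen in this thread. It assumes smartctl's status strings exactly as printed above. `classify_self_test` and its return labels are hypothetical.

```python
# Hypothetical helper: decide whether the last SMART self-test actually ran to the end.
def classify_self_test(status):
    s = status.strip().lower()
    if s == "completed without error":
        return "passed"
    if s.startswith("aborted by host") or s.startswith("interrupted"):
        # The test never finished (cancelled, host reset or reboot), so it says
        # nothing about the disk; simply schedule it again.
        return "not finished"
    if s.startswith("completed"):
        # Any other "Completed..." status (e.g. "Completed: read failure")
        # means the drive itself reported a problem.
        return "failed"
    return "unknown"


# The three statuses from the report above (ada0, ada1, ada2).
for status in ("Completed without error", "Interrupted (host reset)", "Aborted by host"):
    print(f"{status} -> {classify_self_test(status)}")
```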

s25a (Explorer, joined Jan 16, 2016, 76 messages)
Hi,

Thanks a lot for your answer. I replaced the cable as suggested, and the repeating CRC errors I could see in the shell are gone :)

However, I still see the errors in the SMART overview. I am not sure whether that value is updated after a new test. I just want to be sure that no further CRC errors occur, so it would make sense to have a 0 there and then check after a few hours whether it is still 0. Is that possible?

Thanks,
S

Joined May 10, 2017, 838 messages
The UDMA_CRC_Error_Count attribute (ID 199, shown as CRC_Error_Count on the Samsung SSDs) doesn't reset. As long as it doesn't keep increasing, the problem is solved.
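Because the counter cannot be reset to 0, the practical check is to note its current raw value as a baseline and compare later readings against it. Below is a minimal sketch. It assumes you collect the raw CRC_Error_Count values yourself, for example from `smartctl -A` output, oldest first. The function name and the sample readings are hypothetical.

```python
# Hypothetical helper: check that a counter which never resets has stopped growing.
def new_crc_errors(readings):
    """Return how many CRC errors were added between the first and last reading.

    `readings` are raw CRC_Error_Count values, oldest first, e.g. one taken
    right after the cable swap and more taken a few hours later.
    """
    if any(later < earlier for earlier, later in zip(readings, readings[1:])):
        raise ValueError("counter went down: readings are probably from different drives")
    return readings[-1] - readings[0]


readings = [6076, 6076, 6076]  # hypothetical readings taken after the cable swap
added = new_crc_errors(readings)
print("no new CRC errors" if added == 0 else f"{added} new CRC errors since the baseline")
```

If the result stays at 0 over several readings, the cable swap fixed the problem. If it keeps rising, the next suspects are the port or the drive itself.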