Unrecoverable Error Critical warning

Status
Not open for further replies.

Jonnhy

Dabbler
Joined
Jan 19, 2017
Messages
32
Hi FreeNAS forum members,

Today I received the following alert from FreeNAS. I have posted both the alert and the zpool output. I have some checksum errors on one of the drives. Other posts I have read recommend clearing the errors and watching whether they occur again. Should I be worried about it?

Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b).png


Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(1).png


Thanks in advance
 


Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
I have never had a checksum error; it sounds terrible.
What kind of hardware do you have there? Details, please.
I would replace the drive and put the suspect drive in another system for intensive testing.
If the suspect drive survives testing, I would keep it as a spare, but the drives I pull usually fail testing, which leads to either a warranty claim or disposal.
However, I am a bit on the paranoid side and I don't like risking my data. I replace drives at the very first reallocated sector.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
It looks like the drive dropped offline. When it came back online, some writes were missing (the checksum errors) and it updated 35MB of pending writes. Look for a failing drive, power, or cable problem. Post the smartctl -a for the drive.

I'm also curious if there is a reason it happened at 5:03AM.
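If you would rather check the counters from a script than read them off a screenshot, here is a minimal Python sketch. It assumes the usual column layout of `zpool status` (NAME, STATE, READ, WRITE, CKSUM). The pool name, device names, and counts in the sample are made up for illustration, and the function name is hypothetical.

```python
import re

# Made-up sample in the layout `zpool status` prints; not the poster's real output.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        ada0p2  ONLINE       0     0     0
        ada1p2  ONLINE       0     0    12
"""

# A device row is: name, a known state word, then the three error counters.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counters = (read, write, cksum)
        # Counters are kept as strings because large values may be abbreviated.
        if any(value != "0" for value in counters):
            flagged[name] = counters
    return flagged

print(devices_with_errors(SAMPLE_STATUS))
```

Running it again after clearing the errors shows whether the counters start climbing once more.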
 

Jonnhy

Dabbler
Joined
Jan 19, 2017
Messages
32
No idea about the time it happened. My hardware is listed in my signature. Here is the smartctl output for that drive. Wasn't sure which part you wanted, so here is all of it.
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(2).png
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(3).png
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(4).png
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(5).png
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(6).png
Screenshot-2017-10-11 freenas - FreeNAS-11 0-U4 (54848d13b)(7).png
 


rs225

Guru
Joined
Jun 28, 2014
Messages
878
You show 49 UDMA CRC errors, which could be a cable problem. It looks like you haven't run a short test in a while, so I recommend doing that now.

The drive also logged some read errors about 5 hours ago. On some drives, that is a sign the drive is failing or has a power problem. Not sure with Toshiba.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That's a lot of CRC errors. I suggest replacing the cable on that drive. Also, the last SMART test was waaaaay too long ago. Definitely fix that.
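As a rough illustration of what to look for in the attribute table, here is a small Python sketch that flags the counters mentioned in this thread when their raw value is nonzero. It assumes the ten-column attribute layout that `smartctl -a` prints for ATA drives. The sample rows are made up apart from the count of 49, and the watch list and function name are my own choice, not an official rule.

```python
# Attributes treated as warning signs in this thread when the raw value is nonzero.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

# Made-up excerpt; only the 49 CRC errors come from the discussion above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       49
"""

def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found

print(nonzero_watched(SAMPLE))
```

The CRC counter does not reset after a cable swap, so what matters afterwards is whether the number keeps growing.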
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
rs225 said:
It looks like you haven't run a short test in a while, so I recommend doing that now.

The short tests are the least of it. It's the long tests that are really important.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Ericloewe said:
The short tests are the least of it. It's the long tests that are really important.

And it looks like the last one that was run failed.

Run a long test on that drive and see if it will pass. If not replace that drive ASAP.
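To see at a glance whether the most recent long test passed and how stale it is, something like the following Python sketch could help. It assumes the self-test log layout that `smartctl -a` prints, with the newest entry listed first as `# 1`. The sample log and the power-on hours are made up, and both function names are hypothetical.

```python
import re

# Made-up log in smartctl's self-test layout; entry "# 1" is the most recent.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10890         -
# 2  Extended offline    Completed: read failure       90%      4200         123456
"""

ENTRY = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.+?)"
    r"\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

def last_extended_test(log_text):
    """Return details of the newest 'Extended offline' entry, or None."""
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match and match.group(2) == "Extended offline":
            status = match.group(3)
            return {
                "status": status,
                "lifetime_hours": int(match.group(5)),
                "passed": status == "Completed without error",
            }
    return None

def hours_since_test(test, power_on_hours):
    """Power-on hours elapsed since the test was logged."""
    return power_on_hours - test["lifetime_hours"]

test = last_extended_test(SAMPLE_LOG)
print(test, hours_since_test(test, 10898))
```

The second function compares the test's LifeTime(hours) with the drive's Power_On_Hours, which is how you can tell a test is months old.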
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Right, but it's sometimes an indicator of a failing drive.
Seems relatively unlikely. Bad power perhaps, but an internal failure causing a reset seems weird.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Cable problems could result in the UDMA errors, and possibly even trigger the controller reset behavior. Or a power problem, too.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Ericloewe said:
Seems relatively unlikely. Bad power perhaps, but an internal failure causing a reset seems weird.

I have a faulted drive in my pool right now that showed that very behavior. It showed one pending sector and an aborted SMART test last Friday. The next day it dropped out of the pool due to too many errors. The replacement is being burned in as we speak.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Am I the only one who sees 60 °C as the max temp it has experienced? We know drives can do weird things even at 45-50 °C, so that may well be the cause. Either way, you definitely need to look at the cooling design, because something is really wrong: the drives should be kept under 40 °C.
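For reference, here is a minimal Python sketch of pulling the current and maximum temperature out of a SMART attribute row and comparing them with the 40 °C ceiling. It assumes the raw value uses the "NN (Min/Max A/B)" form, which not every vendor does. The sample row is made up apart from the 60 °C maximum, and the function name is hypothetical.

```python
import re

LIMIT_C = 40  # ceiling suggested in this post

# Made-up row; only the 60 °C maximum comes from the screenshots discussed here.
SAMPLE_ROW = (
    "194 Temperature_Celsius     0x0022   100   100   000    "
    "Old_age   Always       -       38 (Min/Max 17/60)"
)

# Matches raw values such as "38 (Min/Max 17/60)".
TEMP = re.compile(r"(\d+) \(Min/Max (\d+)/(\d+)\)")

def temperature_report(row):
    """Return current/max temperature and whether each exceeds LIMIT_C."""
    match = TEMP.search(row)
    if not match:
        return None  # raw value is in some other vendor-specific format
    current, _lowest, highest = map(int, match.groups())
    return {
        "current": current,
        "max": highest,
        "current_over_limit": current > LIMIT_C,
        "max_over_limit": highest > LIMIT_C,
    }

print(temperature_report(SAMPLE_ROW))
```

A drive can look fine right now and still have a recorded maximum far above the limit, which is why both values are reported.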
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Oh, yeah, it does say that. That's a huge red flag.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
Jonnhy said:
Wasn't sure which part you wanted, so here is all of it.

Next time, you could just copy the text and paste it in code tags like this:
Code:
########## SMART status report for da1 drive (Seagate NAS HDD: W7210xxxx) ##########

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       28207176
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       4411014225
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       10898
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       34
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   071   061   045    Old_age   Always       -       29 (Min/Max 28/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 25 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

No Errors Logged

Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error  00%        10894            -
Abbreviated listing as it is just an example.
 