Lost iSCSI connection, Disk Alert

Status
Not open for further replies.

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Today the iSCSI connection was lost. I checked FreeNAS and got the following alerts. I'm new to FreeNAS, so please give me a hand with troubleshooting steps to bring the hard drive back, if possible.

- unreadable (pending) sectors
- offline uncorrectable sectors
- ATA error count increased from 5385 to 5415
111_ada1_3_errors.JPG


Thanks,
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Thanks for the quick response. Does it mean the disk is totally damaged and there is no way to get the data back? Should I run some command via PuTTY or the shell to verify the disk status?
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Yes, that guide doesn't deal with any diagnosis steps - sorry.

Looks like you have problems on two drives, ada1 and ada2.

I suggest you pull the SMART report for each drive and post the results in code tags for review by the many experts here: smartctl -a /dev/adaX (replace X with the drive number).
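For anyone following along, here is a minimal sketch of that step. It assumes a POSIX sh (type sh first if your login shell is csh) and that the drives in question are ada1 and ada2. The helper name smart_report and the SMARTCTL override are made up for this example and are not part of smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: print the full SMART report for each named drive.
# SMARTCTL is an illustrative override (e.g. SMARTCTL=echo for a dry run);
# by default the real smartctl binary is used.
SMARTCTL="${SMARTCTL:-smartctl}"

smart_report() {
    # Arguments are bare device names such as ada1 ada2
    for dev in "$@"; do
        echo "=== /dev/$dev ==="
        # -a prints identity, attributes, error log and self-test log
        $SMARTCTL -a "/dev/$dev"
    done
}

# Usage, as root:
#   smart_report ada1 ada2 > smart-report.txt
```

Note that -a only reads what the drive has already recorded. Starting a self-test is a separate smartctl -t run, as the smartctl output itself points out.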
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Also, in the web GUI the volume is shown as healthy, and the disk status shows 0 errors, 0 repaired. However, that result dates back to Oct 29, even though I have a scrub scheduled for every day. What should I do next? The GUI doesn't allow me to start a scrub job right away. Is there a command line that will let me run a scrub?
Thanks,
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Let me run the report now.
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Here are the SMART results for ada1 / ada2. Please help me understand what the issue is. Thanks,
 

Attachments

  • putty-smart-mon-111-a1a2.txt
    16.7 KB · Views: 338

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
This isn't my specialty, but both drives look to be running hot, both have reallocated sector counts and ada1 has current pending and offline uncorrectable counts.

ada12 (?) has the "failing now" End-to-end error - I don't know anything about that one.

Hopefully someone more conversant can jump in here with comment.
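A rough sketch for pulling just those counters out of the attribute table, so they are easier to compare between runs. It assumes the ten-column attribute layout of the smartctl output posted in this thread and a POSIX sh. key_attrs is a made-up helper name, and the choice of IDs 5, 184, 197 and 198 simply mirrors the attributes mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl attribute output on stdin and print
# ID, name, WHEN_FAILED and RAW_VALUE for the attributes discussed here.
key_attrs() {
    # NF >= 10 and the 0x... FLAG column keep non-attribute lines out
    awk 'NF >= 10 && $3 ~ /^0x/ && $1 ~ /^(5|184|197|198)$/ {
        print $1, $2, $9, $10
    }'
}

# Usage, as root (-A prints only the attribute table):
#   smartctl -A /dev/ada1 | key_attrs
```

Saving the output with a date in the file name makes it easy to see whether the raw values grow over time.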
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Nice guide, thanks! I do have 1 in IDs 5, 197 and 198. I can keep the drive for now, but I will have to run SMART checks more frequently to see whether those numbers increase.
If I opt to keep using the drive, how do I get it working again at this point without losing the data?
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Here is the SMART output for my two drives. Could someone guide me on the next steps? Thanks in advance!

Code:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2017.11.03 20:33:06 =~=~=~=~=~=~=~=~=~=~=~=
login as: root
root@172.16.88.111's password: 
Last login: Mon Oct  9 20:19:47 2017 from 192.168.2.18

FreeBSD 10.3-STABLE (FreeNAS.amd64) #0 r295946+21897e6695f(HEAD): Tue Jul 25 00:03:12 UTC 2017

FreeNAS (c) 2009-2016, The FreeNAS Development Team
All rights reserved.
FreeNAS is released under the modified BSD license.

For more information, documentation, help or support, go here:
 http://freenas.org
Welcome to FreeNAS
[root@freenas111] ~# smartctl -a /dev/ada1

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda 7200.11
Device Model:	 ST31000340AS
Serial Number:	6QJ04NKN
LU WWN Device Id: 5 000c50 00e21aa29
Firmware Version: SD15
User Capacity:	1,000,203,804,160 bytes [1.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 1.5 Gb/s
Local Time is:	Fri Nov  3 20:34:51 2017 EDT

==> WARNING: There are known problems with these drives,
THIS DRIVE MAY OR MAY NOT BE AFFECTED,
see the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/207951en
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=632758

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection: (  634) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:			(0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:		(0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:  (   1) minutes.
Extended self-test routine
recommended polling time:  ( 225) minutes.
Conveyance self-test routine
recommended polling time:  (   2) minutes.
SCT capabilities:		(0x103b)SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   102   079   006	Pre-fail  Always	   -	   150475746
  3 Spin_Up_Time			0x0003   091   086   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   099   099   020	Old_age   Always	   -	   1931
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   49
  7 Seek_Error_Rate		 0x000f   072   060   030	Pre-fail  Always	   -	   51758464288
  9 Power_On_Hours		  0x0032   074   074   000	Old_age   Always	   -	   22991
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   8
 12 Power_Cycle_Count	   0x0032   099   099   020	Old_age   Always	   -	   1929
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   001   001   000	Old_age   Always	   -	   5326
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   5
189 High_Fly_Writes		 0x003a   001   001   000	Old_age   Always	   -	   2749
190 Airflow_Temperature_Cel 0x0022   051   041   045	Old_age   Always   In_the_past 49 (Min/Max 45/52 #405)
194 Temperature_Celsius	 0x0022   049   059   000	Old_age   Always	   -	   49 (0 8 0 0 0)
195 Hardware_ECC_Recovered  0x001a   032   019   000	Old_age   Always	   -	   150475746
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   1
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   1
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
ATA Error Count: 5415 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5415 occurred at disk power-on lifetime: 22989 hours (957 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9c 01 00 00  Error: UNC at LBA = 0x0000019c = 412

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 01 00 e0 00   8d+20:36:38.611  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:33.407  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:28.226  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:23.052  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:17.844  READ DMA

Error 5414 occurred at disk power-on lifetime: 22989 hours (957 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9c 01 00 00  Error: UNC at LBA = 0x0000019c = 412

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 01 00 e0 00   8d+20:36:33.407  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:28.226  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:23.052  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:17.844  READ DMA
  35 00 18 ff ff ff 4f 00   8d+20:36:17.843  WRITE DMA EXT

Error 5413 occurred at disk power-on lifetime: 22989 hours (957 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9c 01 00 00  Error: UNC at LBA = 0x0000019c = 412

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 01 00 e0 00   8d+20:36:23.052  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:17.844  READ DMA
  35 00 18 ff ff ff 4f 00   8d+20:36:17.843  WRITE DMA EXT
  c8 00 08 d8 27 40 e0 00   8d+20:36:17.843  READ DMA

Error 5412 occurred at disk power-on lifetime: 22989 hours (957 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9c 01 00 00  Error: UNC at LBA = 0x0000019c = 412

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 01 00 e0 00   8d+20:36:23.052  READ DMA
  c8 00 00 80 01 00 e0 00   8d+20:36:17.844  READ DMA
  35 00 18 ff ff ff 4f 00   8d+20:36:17.843  WRITE DMA EXT
  c8 00 08 d8 27 40 e0 00   8d+20:36:17.843  READ DMA
  c8 00 00 80 00 00 e0 00   8d+20:36:17.812  READ DMA

Error 5411 occurred at disk power-on lifetime: 22989 hours (957 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9c 01 00 00  Error: UNC at LBA = 0x0000019c = 412

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 01 00 e0 00   8d+20:36:17.844  READ DMA
  35 00 18 ff ff ff 4f 00   8d+20:36:17.843  WRITE DMA EXT
  c8 00 08 d8 27 40 e0 00   8d+20:36:17.843  READ DMA
  c8 00 00 80 00 00 e0 00   8d+20:36:17.812  READ DMA
  35 00 18 ff ff ff 4f 00   8d+20:36:17.812  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@freenas111] ~# 
[root@freenas111] ~# smartctl -a /dev/ada12

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda 7200.11
Device Model:	 ST31000333AS
Serial Number:	6TE0F86P
LU WWN Device Id: 5 000c50 01175f4d9
Firmware Version: CC1F
User Capacity:	1,000,203,804,160 bytes [1.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Fri Nov  3 20:36:59 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection: (  617) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:			(0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:		(0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:  (   1) minutes.
Extended self-test routine
recommended polling time:  ( 208) minutes.
Conveyance self-test routine
recommended polling time:  (   2) minutes.
SCT capabilities:		(0x103f)SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   118   099   006	Pre-fail  Always	   -	   181549753
  3 Spin_Up_Time			0x0003   100   096   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   099   099   020	Old_age   Always	   -	   1037
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   2
  7 Seek_Error_Rate		 0x000f   079   060   030	Pre-fail  Always	   -	   91654799
  9 Power_On_Hours		  0x0032   091   091   000	Old_age   Always	   -	   8554
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   3
 12 Power_Cycle_Count	   0x0032   099   099   020	Old_age   Always	   -	   1369
184 End-to-End_Error		0x0032   093   093   099	Old_age   Always   FAILING_NOW 7
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   62
189 High_Fly_Writes		 0x003a   001   001   000	Old_age   Always	   -	   173
190 Airflow_Temperature_Cel 0x0022   055   046   045	Old_age   Always	   -	   45 (Min/Max 41/46)
194 Temperature_Celsius	 0x0022   045   054   000	Old_age   Always	   -	   45 (0 11 0 0 0)
195 Hardware_ECC_Recovered  0x001a   028   021   000	Old_age   Always	   -	   181549753
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   7524 (33 91 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   3935769474
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   3452445201

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  7007		 -
# 2  Short offline	   Completed without error	   00%	  7006		 -
# 3  Short offline	   Completed without error	   00%	  7005		 -
# 4  Short offline	   Completed without error	   00%	  7004		 -
# 5  Short offline	   Completed without error	   00%	  7003		 -
# 6  Short offline	   Completed without error	   00%	  7002		 -
# 7  Short offline	   Completed without error	   00%	  7001		 -
# 8  Short offline	   Completed without error	   00%	  7000		 -
# 9  Short offline	   Completed without error	   00%	  6999		 -
#10  Short offline	   Completed without error	   00%	  6998		 -
#11  Short offline	   Completed without error	   00%	  6997		 -
#12  Short offline	   Completed without error	   00%	  6997		 -
#13  Short offline	   Completed without error	   00%	  6996		 -
#14  Short offline	   Completed without error	   00%	  6995		 -
#15  Short offline	   Completed without error	   00%	  6994		 -
#16  Short offline	   Completed without error	   00%	  6993		 -
#17  Short offline	   Completed without error	   00%	  6992		 -
#18  Short offline	   Completed without error	   00%	  6991		 -
#19  Short offline	   Completed without error	   00%	  6990		 -
#20  Short offline	   Completed without error	   00%	  6989		 -
#21  Short offline	   Completed without error	   00%	  6988		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@freenas111] ~# 
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
iSCSI is back now. I will migrate the data off ada1. Should I low-level format the drive and put it back in use? I also wonder whether I can RMA the drive. What is the HDD RMA policy?
Thanks,
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
SubX said: Should I low-level format the drive
No, especially because you have no way of doing so, and doing so would be absolutely pointless anyway.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
ada1 is toast, replace it. You also need to check the status of ada2. ada12 is suspect as well from the "failing now" in the smart output.

Are you actually running scheduled smart tests every hour? And a scrub every day is a bit excessive.

Your drives are running very hot, you need to address that. The Seagate 7200.11 drives are total crap and known to be failure prone.
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Thanks. Regarding the 184 error on ada2, what steps can I follow? Should I open a new post for that?

Code:
184 End-to-End_Error		0x0032 093 093 099	Old_age Always FAILING_NOW 7
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Not sure how others feel about it but if I had a smart attribute that says "failing now" I would replace the drive.
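To spot that condition quickly across several drives, here is a small sketch that lists every attribute whose WHEN_FAILED column is not "-". It relies on the column layout of the smartctl output posted earlier and assumes a POSIX sh. failed_attrs is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl attribute output on stdin and print
# ID, name and WHEN_FAILED for attributes smartctl has flagged
# (FAILING_NOW or In_the_past).
failed_attrs() {
    # The 0x... FLAG column check skips error-log and header lines
    awk 'NF >= 10 && $3 ~ /^0x/ && $9 != "-" { print $1, $2, $9 }'
}

# Usage, as root:
#   smartctl -A /dev/ada2 | failed_attrs
```

An empty result only means that no attribute has crossed its threshold. It says nothing about pending or reallocated sectors, which are judged by their raw values.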
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
ada1 has dangerous firmware SD15. You should replace that drive because it is failing, but if you have any others, you should upgrade the firmware.
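A quick way to check which firmware the other drives carry, sketched under the assumption that smartctl -i prints the "Device Model:" and "Firmware Version:" lines seen in the report above. fw_summary is a made-up helper name, the device list is only an example, and a POSIX sh is assumed.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl identity output on stdin and print
# "<model> <firmware>" on one line.
fw_summary() {
    # Split each line at the colon plus any following blanks or tabs
    awk -F ':[ \t]*' '
        /^Device Model:/     { model = $2 }
        /^Firmware Version:/ { print model, $2 }
    '
}

# Usage, as root:
#   for d in ada1 ada2; do smartctl -i "/dev/$d" | fw_summary; done
```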
 

SubX

Explorer
Joined
Sep 15, 2017
Messages
56
Thanks! Is SD1A the latest firmware? How do I upgrade? I Googled it but can't find a guide showing detailed steps.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Did you check the two Seagate page links in the ada1 smart report? The second link has the detailed upgrade instructions, the first should tell you/confirm if SD1A is the latest.

Do you have ada2 AND ada12 with End-to-End "Failing Now"? Your first post had the Alert System paste showing ada2, but the smart reports you posted cover ada12, not ada2.
 