Gnarly hard error prevents full boot

Status
Not open for further replies.

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
Hi,

I'm running the latest FreeNAS v9.3 on my backup server, and when rebooting it I noticed this error:

terminated ioc 804b scsi 0 state c xfr 0
(probe9:mps1:0:47:0) INQUIRY. CBD: 12 00 00 00 24 00 length 36 SMID 57

It's in an endless cycle. The last number in this example increments after 10 repeats and keeps going, seemingly forever.
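
For context on the second log line: the hex string is the SCSI command the driver was sending while probing the drive. Here is a minimal sketch that decodes it, assuming the standard 6-byte INQUIRY layout. The function name is made up for illustration and is not part of FreeNAS or the mps driver.

```python
# Decode the 6-byte SCSI CDB quoted in the mps error line.
# Field layout follows the standard INQUIRY command (opcode 0x12).
# decode_inquiry_cdb is a hypothetical helper name, used only for this sketch.

def decode_inquiry_cdb(hex_bytes: str) -> dict:
    cdb = bytes.fromhex(hex_bytes.replace(" ", ""))
    if len(cdb) != 6 or cdb[0] != 0x12:
        raise ValueError("not a 6-byte INQUIRY CDB")
    return {
        "opcode": cdb[0],
        "evpd": bool(cdb[1] & 0x01),          # vital product data bit
        "page_code": cdb[2],
        "allocation_length": int.from_bytes(cdb[3:5], "big"),
        "control": cdb[5],
    }

fields = decode_inquiry_cdb("12 00 00 00 24 00")
print(fields)
```

The allocation length decodes to 0x24 = 36, which agrees with the "length 36" in the log. So the HBA appears to be asking the drive only to identify itself, and that command keeps getting terminated, which would fit a drive that cannot answer during probe.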

Through trial and error I found the offending drive and pulled it. The system now boots normally, although I pulled the drive without grace, meaning I didn't use the UI to remove and replace it, so it boots in a degraded state.

I was able to replace the drive via the UI, since I keep 6 spares on hand for this one server.

It's currently rebuilding.

Would anyone mind shedding some light on what's going on?

I can post specs if needed.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Yes, please post specs. You should also run a SMART test and look at the results for the drive you pulled. There seem to be a couple of posts and bugs that might relate to this, but I'm not sure if they actually match your problem.
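
For anyone following along, a rough sketch of the smartctl calls involved, wrapped in Python. The device path /dev/sdX is a placeholder and the helper names are hypothetical. The flags themselves (-t long, -l selftest, -a) are standard smartctl options.

```python
import shutil
import subprocess

def smartctl_commands(device):
    """Return the smartctl invocations for a basic health check."""
    return [
        ["smartctl", "-t", "long", device],      # start an extended self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log
        ["smartctl", "-a", device],              # full SMART report
    ]

def run_all(device):
    """Run each command in turn; needs smartctl installed and root access."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found on PATH")
    for argv in smartctl_commands(device):
        # smartctl's exit status is a bitmask, so a nonzero code is not
        # automatically a failure of the command itself.
        subprocess.run(argv, check=False)

commands = smartctl_commands("/dev/sdX")  # placeholder device path
for argv in commands:
    print(" ".join(argv))
```

Note that the extended test runs in the background on the drive and can take hours, so the self-test log will only show the final result once it finishes.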
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
OK cool, I will run a SMART test on that problem drive from one of my Linux boxes and post the results.

As for the specs:

Intel Storage Server running E5-2430 procs
4x Intel S2000 JBODs
48 Seagate 3TB Enterprise SATA drives
192GB RAM
No dedicated ZIL or L2ARC
LSI 9206-16E HBA
8x RAIDZ2 groups in a stripe
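
To put numbers on that layout, here is a rough back-of-the-envelope sketch. It assumes all 48 drives are in the pool and split evenly across the 8 RAIDZ2 groups, which the spec list does not state outright.

```python
# Rough layout arithmetic for the pool in the spec list.
# ASSUMPTION: all 48 drives are pool members, split evenly across 8 RAIDZ2 vdevs.
DRIVES = 48
VDEVS = 8
DRIVE_TB = 3        # decimal terabytes, as marketed
RAIDZ2_PARITY = 2   # RAIDZ2 spends two drives' worth of space per vdev on parity

drives_per_vdev = DRIVES // VDEVS
data_per_vdev = drives_per_vdev - RAIDZ2_PARITY
raw_tb = DRIVES * DRIVE_TB
usable_tb = VDEVS * data_per_vdev * DRIVE_TB  # before ZFS metadata overhead

print(f"{drives_per_vdev} drives per vdev, {data_per_vdev} data + {RAIDZ2_PARITY} parity")
print(f"raw {raw_tb} TB, roughly {usable_tb} TB before filesystem overhead")
```

Under that assumption each group can lose two drives, so a single pulled drive leaves the affected group degraded but still redundant.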

Is this good enough?

PS: BTW, thank you very much. I'm surprised to have gotten such a rapid reply to my post. WOW!
 

aufalien

Patron
Joined
Jul 25, 2013
Messages
374
The drive is pretty hosed. Upon inserting it into my Linux box, it sounds almost as if someone is scraping a five o'clock shadow with a credit card.

Immediate log results:
kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
kernel: ata4: irq_stat 0x00000040, connection status changed
kernel: ata4: SError: { CommWake DevExch }
kernel: ata4: hard resetting link
kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata4.00: ATA-8: ST33000650NS, 0006, max UDMA/133
kernel: ata4.00: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32)
kernel: ata4.00: configured for UDMA/133
kernel: ata4: EH complete
kernel: scsi 3:0:0:0: Direct-Access ATA ST33000650NS 0006 PQ: 0 ANSI: 5
kernel: sd 3:0:0:0: Attached scsi generic sg2 type 0
kernel: sd 3:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
kernel: sd 3:0:0:0: [sdb] Write Protect is off
kernel: sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
kernel: sdb: sdb1
kernel: sd 3:0:0:0: [sdb] Attached SCSI disk
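
Two quick sanity checks on that log, as a hedged sketch. The bit names come from the Linux libata SError definitions, and only a handful of bits are listed here, so the table is deliberately incomplete.

```python
# Partial map of SATA SError bits to the names libata prints.
# Only a few bits are included; unknown bits fall back to "bitN".
SERR_BITS = {16: "PHYRdyChg", 18: "CommWake", 21: "BadCRC", 26: "DevExch"}

def decode_serror(value):
    """Return the names of the set bits in an SError register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(SERR_BITS.get(bit, f"bit{bit}"))
    return names

flags = decode_serror(0x4040000)        # SErr value from the first log line
capacity_bytes = 5860533168 * 512       # sectors * logical block size

print(flags)
print(capacity_bytes)
```

The two set bits correspond to the "CommWake DevExch" the kernel printed, which is what a hot-plug event normally looks like rather than a fault in itself. The capacity works out to 3,000,592,982,016 bytes, so the drive is at least identifying and reporting its size correctly.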




smartctl --all /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Constellation ES.2 (SATA 6Gb/s)
Device Model: ST33000650NS
Serial Number: Z2936WG6
LU WWN Device Id: 5 000c50 04d8b3966
Firmware Version: 0006
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Jun 19 15:12:26 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 609) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 452) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   044    Pre-fail  Always       -       180627218
  3 Spin_Up_Time            0x0003   090   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   055   054   030    Pre-fail  Always       -       880544920813
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5611
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       12
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   050   045    Old_age   Always       -       36 (Min/Max 29/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       246
194 Temperature_Celsius     0x0022   036   050   000    Old_age   Always       -       36 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   022   004   000    Old_age   Always       -       180627218
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        5611         -
# 2  Short offline       Completed without error       00%        5611         -
# 3  Extended offline    Aborted by host               90%        5611         -
# 4  Extended offline    Completed without error       00%        5495         -
# 5  Short offline       Completed without error       00%        5320         -
# 6  Extended offline    Completed without error       00%        5094         -
# 7  Short offline       Completed without error       00%        4912         -
# 8  Extended offline    Completed without error       00%        4753         -
# 9  Short offline       Completed without error       00%        4576         -
#10  Extended offline    Completed without error       00%        4367         -
#11  Short offline       Completed without error       00%        4192         -
#12  Extended offline    Interrupted (host reset)      80%        4025         -
#13  Short offline       Completed without error       00%        3855         -
#14  Extended offline    Completed without error       00%        3623         -
#15  Short offline       Completed without error       00%        3447         -
#16  Extended offline    Completed without error       00%        3287         -
#17  Short offline       Completed without error       00%        3113         -
#18  Extended offline    Completed without error       00%        2953         -
#19  Short offline       Completed without error       00%        2777         -
#20  Extended offline    Completed without error       00%        2616         -
#21  Short offline       Completed without error       00%        2440         -
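
To make the attribute table easier to read, here is a small sketch that compares each normalized VALUE and WORST against THRESH. The rows are copied from the smartctl output in this post. The parsing is a simple whitespace split written for illustration, not a general smartctl parser.

```python
# A few rows copied from the attribute table in this post.
ROWS = """\
1 Raw_Read_Error_Rate 0x000f 082 064 044 Pre-fail Always - 180627218
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 055 054 030 Pre-fail Always - 880544920813
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 064 050 045 Old_age Always - 36 (Min/Max 29/36)
195 Hardware_ECC_Recovered 0x001a 022 004 000 Old_age Always - 180627218
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
"""

margins = {}        # attribute name -> VALUE minus THRESH
failing_now = []    # VALUE at or below THRESH
failed_ever = []    # WORST at or below THRESH

for line in ROWS.splitlines():
    fields = line.split()
    name = fields[1]
    value, worst, thresh = (int(x) for x in fields[3:6])
    if thresh == 0:
        continue  # a threshold of 000 can never trip, so skip these
    margins[name] = value - thresh
    if value <= thresh:
        failing_now.append(name)
    if worst <= thresh:
        failed_ever.append(name)

for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} margin {margin}")
print("failing now:", failing_now)
print("failed at some point:", failed_ever)
```

Nothing is at or below its threshold, which is consistent with the PASSED verdict and the clean error log. Given the noise the drive makes, that suggests SMART simply isn't catching whatever is wrong mechanically.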

Would you like a specific test?
 