blckhm
Dabbler
- Joined
- Sep 24, 2018
- Messages
- 42
Hi,
Our FreeNAS box is configured with 20x 8TB Seagate Archive HDDs in a single pool of 2-way mirrors.
We brought this setup online 4 weeks ago without any problems, but since then some drives have started generating errors, and FreeNAS appears to detach them from their mirrors without warning.
Each time, we have to pull the drive from the enclosure by hand and reinsert it. The pool then resilvers and everything is fine again.
This has happened 3 times in 2 weeks, and now we are worried.
I'm attaching some logs showing the errors.
We could not find any reports of a similar problem.
Code:
Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-STABLE #2 r321665+366f54a78b2(freenas/11.1-stable): Wed Mar 21 23:04:13 UTC 2018
    root@gauntlet:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64
FreeBSD clang version 5.0.0 (tags/RELEASE_500/final 312559) (based on LLVM 5.0.0svn)
CPU: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz (1800.00-MHz K8-class CPU)
  Origin="GenuineIntel" Id=0x306f2 Family=0x6 Model=0x3f Stepping=2
  Features=0x1f83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS,HTT>
  Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x27ab<FSGSBASE,TSCADJ,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,NFPUSG>
  XSAVE Features=0x1<XSAVEOPT>
  TSC: P-state invariant
Hypervisor: Origin = "VMwareVMware"
real memory = 56908316672 (54272 MB)
avail memory = 54198685696 (51687 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <PTLTD APIC >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
WARNING: VIMAGE (virtualized network stack) is a highly experimental feature.
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 2.0> irqs 0-23 on motherboard
..
..
..
..
..
(da4:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 890 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da4:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 02 68 92 b9 20 00 00 00 08 00 00 length 4096 SMID 981 terminated ioc 804b lo(da4:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
ginfo 31110e03 scsi 0 state c xfer 0
(da4:mpr0:0:11:0): CAM status: CCB request completed with an error
(da4:mpr0:0:11:0): Retrying command
(da4:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 02 68 92 b9 20 00 00 00 08 00 00
(da4:mpr0:0:11:0): CAM status: CCB request completed with an error
(da4:mpr0:0:11:0): Retrying command
(da4:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 02 68 92 b9 20 00 00 00 08 00 00
(da4:mpr0:0:11:0): CAM status: SCSI Status Error
(da4:mpr0:0:11:0): SCSI status: Check Condition
(da4:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da4:mpr0:0:11:0): Retrying command (per sense data)
(da4:mpr0:0:11:0): READ(16). CDB: 88 00 00 00 00 02 68 92 b9 20 00 00 00 08 00 00
(da4:mpr0:0:11:0): CAM status: SCSI Status Error
(da4:mpr0:0:11:0): SCSI status: Check Condition
(da4:mpr0:0:11:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mpr0:0:11:0): Info: 0x26892b920
(da4:mpr0:0:11:0): Error 5, Unretryable error
mpr0: mprsas_prepare_remove: Sending reset for target ID 11
da4 at mpr0 bus 0 scbus3 target 11 lun 0
da4: <ATA ST8000AS0002-1NA AR13> s/n Z8403XVT detached
(da4:mpr0:0:11:0): Periph destroyed
mpr0: clearing target 11 handle 0x000d
mpr0: At enclosure level 0, slot 3, connector name ( )
mpr0: Unfreezing devq for target ID 11
mpr0: SAS Address for SATA device = 371a444567948a47
mpr0: SAS Address from SAS device page0 = 5003048001ce5083
mpr0: SAS Address from SATA device = 371a444567948a47
mpr0: Found device <81<SataDev>,End Device> <12.0Gbps> handle<0x000d> enclosureHandle<0x0002> slot 3
mpr0: At enclosure level 0 and connector name ( )
ses0: da4,pass5: Element descriptor: 'Slot03'
da4 at mpr0 bus 0 scbus3 target 11 lun 0
ses0: da4,pass5: SAS Device Slot Element: 1 Phys at Slot 3
ses0: phy 0: SATA device
ses0: phy 0: parent 5003048001ce50bf addr 5003048001ce5083
da4: <ATA ST8000AS0002-1NA AR13> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number Z8403XGD
da4: 1200.000MB/s transfers
da4: Command Queueing enabled
da4: 7630885MB (15628053168 512 byte sectors)
da4: quirks=0x80<SMR_DM>
Local NSM refuses to monitor worker2
Limiting closed port RST response from 280 to 200 packets/sec
Limiting closed port RST response from 273 to 200 packets/sec
(da13:mpr0:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 329 Aborting command 0xfffffe000105d8f0
mpr0: Sending reset from mprsas_send_abort for target ID 20
mpr0: mprsas_prepare_remove: Sending reset for target ID 20
da13 at mpr0 bus 0 scbus3 target 20 lun 0
da13: <ATA ST8000AS0002-1NA AR17> s/n Z840WDAJ detached
(da13:mpr0:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 854 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c8 28 00 01 00 00 length 131072 SMID 1142 terminated ioc 804b loginfo 31130000(da13:mpr0:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
 scsi 0 state c xfer 0
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c7 28 00 01 00 00 length 131072 SMID 1015 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c9 28 00 01 00 00 length 131072 SMID 748 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: clearing target 20 handle 0x0017
mpr0: At enclosure level 0, slot 12, connector name ( )
mpr0: Unfreezing devq for target ID 20
mpr0: Unfreezing devq for target ID 20
(da13:mpr0:0:20:0): CAM status: CCB request completed with an error
(da13:mpr0:0:20:0): Error 5, Periph was invalidated
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c8 28 00 01 00 00
(da13:mpr0:0:20:0): CAM status: CCB request completed with an error
(da13:mpr0:0:20:0): Error 5, Periph was invalidated
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c7 28 00 01 00 00
(da13:mpr0:0:20:0): CAM status: CCB request completed with an error
(da13:mpr0:0:20:0): Error 5, Periph was invalidated
(da13:mpr0:0:20:0): READ(10). CDB: 28 00 44 cd c9 28 00 01 00 00
(da13:mpr0:0:20:0): CAM status: CCB request completed with an error
(da13:mpr0:0:20:0): Error 5, Periph was invalidated
(da13:mpr0:0:20:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:mpr0:0:20:0): CAM status: Command timeout
(da13:mpr0:0:20:0): Error 5, Periph was invalidated
GEOM_MIRROR: Device swap3: provider da13p1 disconnected.
(da13:mpr0:0:20:0): Periph destroyed
mpr0: SAS Address for SATA device = 371a44458b80844d
mpr0: SAS Address from SAS device page0 = 5003048001ce509c
mpr0: SAS Address from SATA device = 371a44458b80844d
mpr0: Found device <81<SataDev>,End Device> <12.0Gbps> handle<0x0017> enclosureHandle<0x0002> slot 12
mpr0: At enclosure level 0 and connector name ( )
ses0: da13,pass14: Element descriptor: 'Slot12'
da13 at mpr0 bus 0 scbus3 target 20 lun 0
ses0: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 12
ses0: phy 0: SATA device
ses0: phy 0: parent 5003048001ce50bf addr 5003048001ce509c
da13: <ATA ST8000AS0002-1NA AR17> Fixed Direct Access SPC-4 SCSI device
da13: Serial Number Z840WDAJ
da13: 1200.000MB/s transfers
da13: Command Queueing enabled
da13: 7630885MB (15628053168 512 byte sectors)
da13: quirks=0x80<SMR_DM>
(da18:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 722 Aborting command 0xfffffe0001080de0
mpr0: Sending reset from mprsas_send_abort for target ID 25
mpr0: mprsas_prepare_remove: Sending reset for target ID 25
da18 at mpr0 bus 0 scbus3 target 25 lun 0
da18: <ATA ST8000AS0002-1NA AR17> s/n Z840WDWG detached
(da18:mpr0:0:25:0): WRITE(10). CDB: 2a 00 45 c9 81 b0 00 00 08 00 length 4096 SMID 353 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
mpr0: clearing target 25 handle 0x001c
(da18:mpr0:0:25:0): WRITE(10). CDB: 2a 00 45 c9 81 b0 00 00 08 00
mpr0: At enclosure level 0, slot 17, connector name ( )
mpr0: Unfreezing devq for target ID 25
mpr0: Unfreezing devq for target ID 25
(da18:mpr0:0:25:0): CAM status: CCB request completed with an error
(da18:mpr0:0:25:0): Error 5, Periph was invalidated
(da18:mpr0:0:25:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da18:mpr0:0:25:0): CAM status: Command timeout
(da18:mpr0:0:25:0): Error 5, Periph was invalidated
GEOM_MIRROR: Device swap1: provider da18p1 disconnected.
(da18:mpr0:0:25:0): Periph destroyed
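To see at a glance which drives dropped out and how often, the "detached" lines in a log like the one above can be tallied. This is only a minimal sketch: the regular expression is modelled on the three detach lines shown in this log, the function name `count_detaches` is made up for illustration, and other driver or FreeBSD versions may format the message differently.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   da4: <ATA ST8000AS0002-1NA AR13> s/n Z8403XVT detached
# The pattern is an assumption based on the log excerpt in this post.
DETACH_RE = re.compile(r"\b(da\d+): <[^>]*> s/n (\S+) detached")

def count_detaches(log_text):
    """Return a Counter mapping (device, serial) to the number of detach events."""
    return Counter(DETACH_RE.findall(log_text))

# Detach lines copied from the log above.
sample = """\
da4: <ATA ST8000AS0002-1NA AR13> s/n Z8403XVT detached
da13: <ATA ST8000AS0002-1NA AR17> s/n Z840WDAJ detached
da18: <ATA ST8000AS0002-1NA AR17> s/n Z840WDWG detached
"""

for (dev, serial), n in sorted(count_detaches(sample).items()):
    print(dev, serial, n)
```

In practice the text would come from `/var/log/messages` or the output of `dmesg` rather than an inline string. Keying on the serial number as well as the device name matters because `daN` numbering can change when a drive is reattached.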
And here is a sample SMART log from one of the disks that was detached from the pool:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Archive HDD
Device Model:     ST8000AS0002-1NA17Z
LU WWN Device Id: 5 000c50 0936b1c4e
Firmware Version: AR17
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 24 18:20:17 2018 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 936) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       19827984
  3 Spin_Up_Time            0x0003   092   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       4970814419
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9676
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       45
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       8590065668
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   061   045    Old_age   Always       -       28 (Min/Max 27/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       454
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       498
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 16 0 0 0)
195 Hardware_ECC_Recovered  0x001a   108   099   000    Old_age   Always       -       19827984
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9603 (13 255 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       11219455640
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       294052578821

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9630         -
# 2  Short offline       Completed without error       00%      9541         -
# 3  Short offline       Completed without error       00%      9373         -
# 4  Extended offline    Completed without error       00%      9296         -
# 5  Short offline       Completed without error       00%      9205         -
# 6  Short offline       Completed without error       00%      8965         -
# 7  Extended offline    Completed without error       00%      8886         -
# 8  Short offline       Completed without error       00%      8797         -
# 9  Short offline       Completed without error       00%      8621         -
#10  Extended offline    Completed without error       00%      8546         -
#11  Short offline       Completed without error       00%      8453         -
#12  Short offline       Completed without error       00%      8213         -
#13  Extended offline    Completed without error       00%      8134         -
#14  Short offline       Completed without error       00%      8045         -
#15  Short offline       Completed without error       00%      7877         -
#16  Extended offline    Completed without error       00%      7798         -
#17  Extended offline    Interrupted (host reset)      00%      7735         -
#18  Short offline       Completed without error       00%      7659         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
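A note on reading attribute 188 in the SMART output above: the raw Command_Timeout value (8590065668) is probably not a single count. Seagate drives are commonly reported to pack several 16-bit counters into the 48-bit raw field. That packing is an assumption here, not something confirmed by vendor documentation in this thread, and the helper name `split_raw48` is made up for illustration. Under that assumption the value splits as follows:

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into three 16-bit words (high, mid, low).

    Assumption: the drive packs three independent 16-bit counters into the
    raw field. This is how Seagate raw values are commonly read, but it is
    not verified against vendor documentation.
    """
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF

# Raw Command_Timeout value from the smartctl output above.
print(split_raw48(8590065668))  # prints (2, 2, 4)
```

If that reading is right, the drive has logged only a handful of command timeouts rather than billions. What each word counts exactly is not stated anywhere in this thread.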
HBA
Code:
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

Adapter Selected is a Avago SAS: SAS3008(C0)

Num  Ctlr         FW Ver       NVDATA       x86-BIOS     PCI Addr
----------------------------------------------------------------------------
0    SAS3008(C0)  12.00.00.00  0b.02.00.07  08.29.00.00  00:03:00:00