Import Volume does not show all disks from an encrypted Raid-Z1

Status
Not open for further replies.

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
Problem description:
Pool "all" was detached before reboot. It is a Raid-Z1 with encryption. After the reboot I try to "Import Volume" via the GUI. The first page asks "Encrypted ZFS volume?"; I answer "yes, decrypt disks" and click OK. The second page asks me to select disks, an encryption key and a passphrase. The problem is that I only see the following list:
- ada1p2 -- this is one of the 4 drives
- ada0p2 -- this is one of the 4 drives
- ada2p2 -- this is one of the 4 drives
- da2p2 -- this is mirrored ZIL
- da1p1 -- this is L2ARC
- da0p2 -- this is mirrored ZIL
and I don't see ada3p2 in that list!

When I click "view disks" I see all 4 disks.
I believe the disk is healthy: I have hit this problem many times (it reproduces pretty consistently), and each time I ended up importing 3 drives out of 4 with the pool in DEGRADED state, then "replaced" the missing disk with the 4th disk and had it resilver successfully. The last time, I hit an unrecoverable read error during resilvering, which corrupted a file (luckily not an important one, and I also have regular snapshots for backups), so I decided to stop taking risks and ask for help.

Not sure if it's related, there's another problem with this pool that I observed - if I do not detach pool before rebooting, but just reboot, it will show up in unencrypted state (this is scary, because normally pool would show up LOCKED and I need to enter password). Though I am not sure how reproducible it is, because I was more worried about the problem with mounting/importing drives.


System configuration:
FreeNAS version: FreeNAS-9.3-STABLE-201502271818
Motherboard: Intel C226 Chipset w/LGA1150 socket (standard Lenovo TS140 70A4001LUX Server build)
CPU: Xeon E3-1225 v3 (4C/3.2GHz/8MB/1600MHz)
RAM: 24GB (8+8+4+4) 1600MHz ECC unbuffered.
SAS card: IBM M1015 / 9220-8i SAS RAID Card - Flashed to LSI-9211-8i IT MODE
Hard drives:
- ada0 - WD Caviar Green WDC WD20EARS-00MVWB0 (firmw 51.0AB51) - 2TB -- attached to motherboard
- ada1 - Seagate Barracuda - ST3000DM001-9YN166 (firmw CC4H) - 3TB -- attached to motherboard
- ada2 - Samsung Spinpoint F4EG - Samsung HD204UI (firmw 1AQ10001) - 2TB -- attached to motherboard
- ada3 - Seagate Barracuda - ST3000DM001-9YN166 (firmw CC46) - 3TB -- attached to motherboard
- da0 - Intel SSD 320 Series - SSDSA2BW12 (firmw 0362) - 120GB - attached to SAS card.
- da1 - Samsung SSD 830 (firmw 3B1Q) - 128GB - attached to SAS card.
- da2 - Intel SSD 320 Series - SSDSA2BW12 (firmw 0362) - 120GB - attached to SAS card.
- da3 - WD Red - 5 TB - attached to USB 3.0 card.

Pool Configuration:
- da3 forms the single-drive "ext-backup" pool
- ada0+ada1+ada2+ada3 normally form the Raid-Z1 "all" pool, where da0+da2 are mirrored for ZIL and da1 is used for L2ARC.


dmesg
Code:
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.3-RELEASE-p10 #0 r275790+a21079f: Thu Feb 26 22:13:30 PST 2015
    root@build3.ixsystems.com:/tank/home/jkh/build/FN93/freenas/objs/os-base/amd64/fusion/jkh/FN93/freenas/FreeBSD/src/sys/FREENAS.amd64 amd64
gcc version 4.2.1 20070831 patched [FreeBSD]
CPU: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz (3192.68-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x306c3  Family = 0x6  Model = 0x3c  Stepping = 3
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,<b11>,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Standard Extended Features=0x2fbb<GSFSBASE,TSCADJ,SMEP,ENHMOVSB,INVPCID>
  TSC: P-state invariant, performance statistics
real memory  = 26287800320 (25070 MB)
avail memory = 24629456896 (23488 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <LENOVO TC-FB   >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  2
 cpu2 (AP): APIC ID:  4
 cpu3 (AP): APIC ID:  6
WARNING: VIMAGE (virtualized network stack) is a highly experimental feature.
ACPI Warning: FADT (revision 5) is longer than ACPI 5.0 version, truncating length 268 to 256 (20111123/tbfadt-325)
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ispfw: registered firmware <isp_1040>
ispfw: registered firmware <isp_1040_it>
ispfw: registered firmware <isp_1080>
ispfw: registered firmware <isp_1080_it>
ispfw: registered firmware <isp_12160>
ispfw: registered firmware <isp_12160_it>
ispfw: registered firmware <isp_2100>
ispfw: registered firmware <isp_2200>
ispfw: registered firmware <isp_2300>
ispfw: registered firmware <isp_2322>
ispfw: registered firmware <isp_2400>
ispfw: registered firmware <isp_2400_multi>
ispfw: registered firmware <isp_2500>
ispfw: registered firmware <isp_2500_multi>
kbd1 at kbdmux0
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS> on motherboard
padlock0: No ACE support.
acpi0: <LENOVO TC-FB> on motherboard
acpi0: Power Button (fixed)
acpi0: reservation of 67, 1 (4) failed
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1808-0x180b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
mpslsi0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff,0xf7c80000-0xf7cbffff irq 16 at device 0.0 on pci1
mpslsi0: Firmware: 19.00.00.00, Driver: 20.00.00.00
mpslsi0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7400000-0xf77fffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
pci0: <multimedia, HDA> at device 3.0 (no driver attached)
xhci0: <Intel Lynx Point USB 3.0 controller> mem 0xf7f20000-0xf7f2ffff irq 16 at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: <simple comms> at device 22.0 (no driver attached)
uart2: <Intel Lynx Point KT Controller> port 0xf0e0-0xf0e7 mem 0xf7f3e000-0xf7f3efff irq 19 at device 22.3 on pci0
em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port 0xf080-0xf09f mem 0xf7f00000-0xf7f1ffff,0xf7f3d000-0xf7f3dfff irq 20 at device 25.0 on pci0
em0: Using an MSI interrupt
em0: Ethernet address: 44:39:c4:4e:fd:9b
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xf7f3c000-0xf7f3c3ff irq 17 at device 26.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
pci0: <multimedia, HDA> at device 27.0 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pci3: <ACPI PCI bus> on pcib3
xhci1: <XHCI (generic) USB 3.0 controller> mem 0xf7e00000-0xf7e0ffff,0xf7e11000-0xf7e11fff,0xf7e10000-0xf7e10fff irq 17 at device 0.0 on pci3
xhci1: 32 byte context size.
usbus2 on xhci1
pcib4: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
pci4: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 19 at device 0.0 on pci4
pci5: <ACPI PCI bus> on pcib5
ehci1: <EHCI (generic) USB 2.0 controller> mem 0xf7f3b000-0xf7f3b3ff irq 23 at device 29.0 on pci0
usbus3: EHCI version 1.0
usbus3 on ehci1
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Lynx Point AHCI SATA controller> port 0xf0d0-0xf0d7,0xf0c0-0xf0c3,0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf060-0xf07f mem 0xf7f3a000-0xf7f3a7ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 5 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
acpi_button0: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
orm0: <ISA Option ROM> at iomem 0xd5000-0xd5fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
p4tcc1: <CPU Frequency Thermal Control> on cpu1
coretemp2: <CPU On-Die Thermal Sensors> on cpu2
est2: <Enhanced SpeedStep Frequency Control> on cpu2
p4tcc2: <CPU Frequency Thermal Control> on cpu2
coretemp3: <CPU On-Die Thermal Sensors> on cpu3
est3: <Enhanced SpeedStep Frequency Control> on cpu3
p4tcc3: <CPU Frequency Thermal Control> on cpu3
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
ipfw2 (+ipv6) initialized, divert enabled, nat enabled, default to accept, logging disabled
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
usbus2: 5.0Gbps Super Speed USB v3.0
usbus3: 480Mbps High Speed USB v2.0
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
ugen2.1: <0x1b73> at usbus2
uhub2: <0x1b73 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus2
ugen3.1: <Intel> at usbus3
uhub3: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
uhub2: 8 ports with 8 removable, self powered
uhub0: 21 ports with 21 removable, self powered
uhub1: 3 ports with 3 removable, self powered
uhub3: 3 ports with 3 removable, self powered
ugen0.2: <CPS> at usbus0
ugen1.2: <vendor 0x8087> at usbus1
uhub4: <vendor 0x8087 product 0x8008, class 9/0, rev 2.00/0.04, addr 2> on usbus1
ugen3.2: <vendor 0x8087> at usbus3
uhub5: <vendor 0x8087 product 0x8000, class 9/0, rev 2.00/0.04, addr 2> on usbus3
uhub4: 6 ports with 6 removable, self powered
uhub5: 8 ports with 8 removable, self powered
ugen2.2: <Prolific Technology Inc.> at usbus2
umass0: <Prolific Technology Inc. USB-SATA Bridge, class 0/0, rev 3.00/1.00, addr 1> on usbus2
umass0:  SCSI over Bulk-Only; quirks = 0xc100
umass0:6:0:-1: Attached to scbus6
ugen0.3: <PNY Technologies> at usbus0
umass1: <PNY Technologies USB 3.0 FD, class 0/0, rev 2.10/10.75, addr 2> on usbus0
umass1:  SCSI over Bulk-Only; quirks = 0x0100
umass1:7:1:-1: Attached to scbus7
ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
da0 at mpslsi0 bus 0 scbus0 target 0 lun 0
da0: <ATA INTEL SSDSA2BW12 0362> Fixed Direct Access SCSI-6 device
da0: Serial Number BTPR2172006K120LGN
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 114473MB (234441648 512 byte sectors: 255H 63S/T 14593C)
da2 at mpslsi0 bus 0 scbus0 target 2 lun 0
da2: <ATA INTEL SSDSA2BW12 0362> Fixed Direct Access SCSI-6 device
da2: Serial Number BTPR217100JU120LGN
da2: 300.000MB/s transfers
da2: Command Queueing enabled
da2: 114473MB (234441648 512 byte sectors: 255H 63S/T 14593C)
da1 at mpslsi0 bus 0 scbus0 target 1 lun 0
da1: <ATA SAMSUNG SSD 830 3B1Q> Fixed Direct Access SCSI-6 device
da1: Serial Number S0Z3NEAC830867
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 122104MB (250069680 512 byte sectors: 255H 63S/T 15566C)
da1: quirks=0x8<4K>
da4 at umass-sim1 bus 1 scbus7 target 0 lun 0
da4: <PNY USB 3.0 FD 1.00> Removable Direct Access SCSI-6 device
da4: Serial Number 693669314
da4: 40.000MB/s transfers
da4: 30474MB (62411243 512 byte sectors: 255H 63S/T 3884C)
da4: quirks=0x13<NO_SYNC_CACHE,NO_6_BYTE,NO_RC16>
ada0: <WDC WD20EARS-00MVWB0 51.0AB51> ATA-8 SATA 2.x device
ada0: Serial Number WD-WMAZA3295120
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada0: quirks=0x1<4K>
ada0: Previously was known as ad4
da3 at umass-sim0 bus 0 scbus6 target 0 lun 0
da3: <WDC WD50 EFRX-68MYMN1 82.0> Fixed Direct Access SCSI-4 device
da3: Serial Number PROLIFICMP000000002
da3: 400.000MB/s transfers
da3: 4769307MB (9767541168 512 byte sectors: 255H 63S/T 608001C)
da3: quirks=0xa<NO_6_BYTE,4K>
ada1 at ahcich1 bus 0 scbus2 target 0 lun 0
ada1: <ST3000DM001-9YN166 CC4H> ATA-8 SATA 3.x device
ada1: Serial Number W1F1M0JR
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada1: quirks=0x1<4K>
ada1: Previously was known as ad6
ada2 at ahcich2 bus 0 scbus3 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: Serial Number S2H7JD2B218016
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2: quirks=0x1<4K>
ada2: Previously was known as ad8
ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
ada3: <ST3000DM001-9YN166 CC46> ATA-8 SATA 3.x device
ada3: Serial Number W1F070M3
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada3: quirks=0x1<4K>
ada3: Previously was known as ad10
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
Timecounter "TSC-low" frequency 1596338189 Hz quality 1000
Trying to mount root from zfs:freenas-boot/ROOT/FreeNAS-9.3-STABLE-201502271818 []...
GEOM_RAID5: Module loaded, version 1.3.20140711.62 (rev f91e28e40bf7)
wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff
vboxdrv: fAsync=0 offMin=0x2ae offMax=0xef1
nfsd: can't register svc name
GEOM_ELI: Device gptid/9c1b7a3f-921a-11e4-844e-4439c44efd9b.eli created.
GEOM_ELI: Encryption: AES-XTS 128
GEOM_ELI:     Crypto: hardware
nfsd: can't register svc name
GEOM_ELI: Device da3p1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: hardware



smartctl -a /dev/ada3
(the drive that wasn't listed)

Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Serial Number:    W1F070M3
LU WWN Device Id: 5 000c50 044e52a02
Firmware Version: CC46
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar  8 08:07:01 2015 PDT

==> WARNING: A firmware update for this drive is available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 355) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       126414416
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always       -       48046291
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2309
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   042   045    Old_age   Always   In_the_past 36 (0 2 37 36 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   078   078   000    Old_age   Always       -       45281
194 Temperature_Celsius     0x0022   036   058   000    Old_age   Always       -       36 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1974h+12m+07.642s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       49325562067260
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2854919293193

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2299         -
# 2  Short offline       Completed without error       00%      2275         -
# 3  Short offline       Completed without error       00%      2255         -
# 4  Short offline       Completed without error       00%      2227         -
# 5  Short offline       Completed without error       00%      2203         -
# 6  Short offline       Completed without error       00%      2179         -
# 7  Short offline       Completed without error       00%      2155         -
# 8  Short offline       Completed without error       00%      2131         -
# 9  Short offline       Completed without error       00%      2107         -
#10  Short offline       Completed without error       00%      2083         -
#11  Short offline       Completed without error       00%      2059         -
#12  Short offline       Completed without error       00%      2035         -
#13  Short offline       Completed without error       00%      2011         -
#14  Short offline       Completed without error       00%      1987         -
#15  Short offline       Completed without error       00%      1963         -
#16  Short offline       Completed without error       00%      1939         -
#17  Short offline       Completed without error       00%      1915         -
#18  Short offline       Completed without error       00%      1891         -
#19  Short offline       Completed without error       00%      1867         -
#20  Short offline       Completed without error       00%      1842         -
#21  Short offline       Completed without error       00%      1818         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
> ==> WARNING: A firmware update for this drive is available,
I hesitate to respond since I'm no expert, but has it occurred to you to address this?
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
74. mpslsi0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff,0xf7c80000-0xf7cbffff irq 16 at device 0.0 on pci1
75. mpslsi0: Firmware: 19.00.00.00, Driver: 20.00.00.00
76. mpslsi0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Your dmesg output shows a driver/firmware mismatch at line 75.
A quick search of the forum will provide lots of reasons to correct that asap.
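
For readers who want to check their own dmesg for the same condition, here is a minimal Python sketch. It assumes that the first field of each version string is the "phase" number (so 19.00.00.00 is P19 and 20.00.00.00 is P20, as the versions are referred to in this thread). The function name `phase_mismatch` is made up for illustration and is not part of any FreeNAS tool.

```python
import re

# Line copied from the dmesg output posted earlier in this thread.
DMESG_LINE = "mpslsi0: Firmware: 19.00.00.00, Driver: 20.00.00.00"

# Matches "<device>: Firmware: <version>, Driver: <version>".
PATTERN = re.compile(
    r"^(?P<dev>\w+): Firmware: (?P<fw>[\d.]+), Driver: (?P<drv>[\d.]+)"
)


def phase_mismatch(line):
    """Return (device, fw_phase, drv_phase) if the phases differ, else None.

    Assumption: the first dotted field of each version is the phase number.
    """
    m = PATTERN.match(line)
    if m is None:
        return None
    fw_phase = int(m.group("fw").split(".")[0])
    drv_phase = int(m.group("drv").split(".")[0])
    if fw_phase == drv_phase:
        return None
    return m.group("dev"), fw_phase, drv_phase


print(phase_mismatch(DMESG_LINE))
```

For the line quoted above this reports the device together with phases 19 and 20; a line where both versions start with the same number yields None.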
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
Yes, I actually just discovered it today while writing this post and running smartctl, and I am going to fix it.
I thought the symptoms still pointed to a software or configuration problem, so I went ahead and posted this.
I'd be surprised if updating the drive's firmware fixed it (the drive looks healthy in every way), but even if it did, there's a chance that someone may ask me NOT to rush the firmware update in order to debug a software/configuration problem.

Please advise whether I should fix it ASAP or wait.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
FreeNAS ships with, and uses, the P16 driver.
Please make sure your firmware version is (IT mode) P16 as well.
ZFS must have direct access to the drives, so IT mode is a must.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Also, about "snapshots for backups": snapshots are good, but they are not backups. I strongly recommend that you make some backups (I mean on another physical device), especially with a RAID-Z1 ;)
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
BigDave:
1) OK, I'll flash to P16 (found these instructions: https://forbesconrad.com/blog/flashing-lsi-sas-9211-8i-hba-for-freenas)

2) It just crossed my mind that the whole problematic array is connected solely to the motherboard, so the SAS card shouldn't be at fault. The SAS card is only used for ZIL and L2ARC.
Do you think I should also upgrade the firmware on that Seagate drive ASAP?

Bidule0hm:
Yes, I meant both creating and sending those snapshots (I am very new to FreeNAS/ZFS, sorry for terminology slips :)).
The purpose of that USB 3.0 drive is to receive continuous replication of backup snapshots (and yes, I know USB is bad; I am going to switch to eSATA for that drive, but since USB isn't a critical issue I am not rushing it yet).
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, in this case there is no problem ;) it's just that I prefer to warn before anything (very) bad happens.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
I'm not saying that the driver/firmware mismatch is the cause of your issue, but based
on reading your posted outputs that was what popped out at me. I had no idea your
main drives were not attached to the M1015 / 9220-8i SAS card.
Someone more schooled than myself will have to address the rest of your issue.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Hold it a minute.

mpslsi0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7d00000-0xf7d03fff,0xf7c80000-0xf7cbffff irq 16 at device 0.0 on pci1
mpslsi0: Firmware: 19.00.00.00, Driver: 20.00.00.00
mpslsi0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>

What happened here? The driver is supposed to be P16. Did you hack P20 in? That's not a good idea (if it was, FreeNAS would have P20 instead of P16).

I recommend you do a clean install of FreeNAS and upload your config. The P20 driver has not been validated at all and it doesn't include some fixes from the community P16 driver.

> I hesitate to respond since I'm no expert, but has it occurred to you to address this?

That's boilerplate. Most Seagate drives of certain models show that warning, but it only applies for older drives. Hard drive firmware is a pain in the ass to detect and update reliably.

As for the "real" problem:

Are the "missing" drives always in the same SATA port? If so, move them to a different one and see if the problem persists.

> Not sure if it's related, there's another problem with this pool that I observed - if I do not detach pool before rebooting, but just reboot, it will show up in unencrypted state (this is scary, because normally pool would show up LOCKED and I need to enter password). Though I am not sure how reproducible it is, because I was more worried about the problem with mounting/importing drives.

Have you backed up all keys and related files? If not, do so ASAP and that behavior should stop. If you have, something weird is going on.
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
Bidule0hm, BigDave: yes, I am very grateful for all sorts of useful warnings, and I'll clear these things to minimize "noise" in the future.

Ericloewe:

> What happened here? The driver is supposed to be P16. Did you hack P20 in? That's not a good idea (if it was, FreeNAS would have P20 instead of P16).

OK, I pulled my installation logs (did that a few months ago).
Here is what happened: I had read on the forums that a proper SAS card is a big deal and that it should be in IT mode, not RAID mode (the default), so I decided to take a shortcut and purchased a card on eBay that was already flashed to IT mode.
It happened to have P19 firmware, so it didn't work. Then I found this bug https://bugs.freenas.org/issues/6678#note-5 and followed comment #5: I got the P20 driver and added a tunable to enable it.

Here is my current list of tunables:

mpslsi_load = YES, type = Loader, comment = LSI P20 Driver
xhci_load = YES, type = Loader, comment = Enable USB 3.0

That's all of the modified drivers/hacks that I have; everything else I did is configuration like pools and datasets, users, permissions, shares, etc.
As for hardware, the SAS card and the USB 3.0 card are the only non-standard cards on top of the ThinkServer TS140 config (plus I added compatible memory and a bunch of disks).

Since you're saying it's not a good idea, I'll remove the hack and put P16 on card.
I'd still like to know, though: is this a side issue that I should fix just to minimize noise so we can better focus on the real problems, or do you think this driver/firmware mismatch can actually affect the problem I described (given that the drives at issue are attached straight to the motherboard, not to the SAS card)?


> Are the "missing" drives always in the same SATA port? If so, move them to a different one and see if the problem persists.

I didn't keep track of it. I'll try.


>> Not sure if it's related, there's another problem with this pool that I observed - if I do not detach pool before rebooting, but just reboot, it will show up in unencrypted state (this is scary, because normally pool would show up LOCKED and I need to enter password). Though I am not sure how reproducible it is, because I was more worried about the problem with mounting/importing drives.

> Have you backed up all keys and related files? If not, do so ASAP and that behavior should stop. If you have, something weird is going on.

I create a passphrase, save the geli key and add a geli_recovery key every time I make any modification to the pool that says the keys will be invalid, every time I see something weird (like encryption being reset after a reboot), and every time after resilvering. I am highly confident that this happens regardless. I can try to figure out the exact repro steps.


One note: these issues have motivated me to buy an extra drive so that I can set up Raid-Z2 with 5 drives rather than Raid-Z1 with 4 drives. My plan is to make sure I have two full backups on separate disks (for safe redundancy), then wipe the disks and re-create the volume. There's a chance that the issue will go away after I do that. Or it will stay, which may indicate that it's hardware related (such as disk problems). There is also a chance that the issue will disappear if I reinstall FreeNAS on a clean USB stick. I wonder how "interesting" the problems I described are for you to investigate. If they are, I can hold off on any radical step (like re-creating the volume, or putting P16 firmware on the card and reinstalling FreeNAS) and can do more troubleshooting steps on the current system. If they aren't, I can just go ahead with my plan, and will also update drivers on SAS card + reinstall FreeNAS, and then report back if the issue is still here or it's gone. What do you think?
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
The fact that the volume has some drives named dax suggests they ARE on that card.
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
> The fact that the volume has some drives named dax suggests they ARE on that card.

Hm, among those only ZIL and L2ARC belong to the same volume at issue, but data drives are all adaX.
Can this issue happen because the ZIL is on the card and the card has a driver/firmware issue?

I didn't think of it this way, because the data drive which isn't on the list is an adaX drive (not on the card). But maybe I am wrong?
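
To see which disks sit behind which controller, the "at" lines from the dmesg output in the first post can be grouped by parent device. The following is only a sketch: the attachment lines are copied from that dmesg output, while the helper name `disks_by_controller` and the idea of grouping by driver name with the unit number stripped are assumptions made for illustration.

```python
import re

# Attachment lines copied from the dmesg output in the first post.
DMESG = """\
ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
da0 at mpslsi0 bus 0 scbus0 target 0 lun 0
da2 at mpslsi0 bus 0 scbus0 target 2 lun 0
da1 at mpslsi0 bus 0 scbus0 target 1 lun 0
da4 at umass-sim1 bus 1 scbus7 target 0 lun 0
da3 at umass-sim0 bus 0 scbus6 target 0 lun 0
ada1 at ahcich1 bus 0 scbus2 target 0 lun 0
ada2 at ahcich2 bus 0 scbus3 target 0 lun 0
ada3 at ahcich3 bus 0 scbus4 target 0 lun 0
"""

# "<disk> at <parent> bus N ...", where disk is adaN or daN.
ATTACH = re.compile(r"^(a?da\d+) at ([\w-]+) bus \d+", re.MULTILINE)


def disks_by_controller(dmesg):
    """Group disk names by the driver name of the device they attach to."""
    groups = {}
    for disk, parent in ATTACH.findall(dmesg):
        # Drop the trailing unit number: "ahcich3" -> "ahcich".
        driver = parent.rstrip("0123456789")
        groups.setdefault(driver, []).append(disk)
    return {driver: sorted(disks) for driver, disks in groups.items()}


for driver, disks in disks_by_controller(DMESG).items():
    print(driver, disks)
```

On the lines above, this puts ada0 through ada3 on the onboard AHCI channels (ahcich), da0 through da2 on the SAS card (mpslsi), and da3 and da4 on USB mass storage (umass-sim).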
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
The SLOG and L2ARC are part of your pool.
> do more troubleshooting steps on the current system. If they aren't, I can just go ahead with my plan, and will also update drivers on SAS card + reinstall FreeNAS, and then report back if the issue is still here or it's gone. What do you think?
Trust me, the firmware/driver stuff needs to be fixed first, then start over with fresh install.
Then if the ada3 drive is bad, replacing it will end your troubles.
May I also recommend you study a bit about SLOG/L2ARC before proceeding...
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
When you're done with the other recommendations, can I suggest you schedule some SMART Long Self-Tests too?
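
While waiting for long self-tests, the attribute table from the smartctl output in the first post can also be screened mechanically. The sketch below is illustrative only: the rows are copied from that output, but the set of "watched" attribute IDs (5, 187, 197, 198) is a common rule of thumb rather than anything stated in this thread, and `flag_attributes` is a made-up helper name.

```python
# Rows copied from the "smartctl -a /dev/ada3" output in the first post.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   064   042   045    Old_age   Always   In_the_past 36 (0 2 37 36 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# Assumption: attributes whose non-zero raw value deserves a closer look.
WATCHED_RAW = {5, 187, 197, 198}


def flag_attributes(table):
    """Return (id, name, reason) for rows that look worth investigating."""
    findings = []
    for line in table.splitlines():
        # Ten columns; the last one (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name = int(fields[0]), fields[1]
        when_failed, raw = fields[8], fields[9].strip()
        if when_failed != "-":
            findings.append((attr_id, name, "threshold crossed: " + when_failed))
        elif attr_id in WATCHED_RAW and raw.split()[0] != "0":
            findings.append((attr_id, name, "non-zero raw value: " + raw))
    return findings


for finding in flag_attributes(SMART_ROWS):
    print(finding)
```

With the rows above, this flags Reported_Uncorrect (raw value 1) and Airflow_Temperature_Cel (WHEN_FAILED is In_the_past). Neither proves that the drive is failing, but both are cheap to keep an eye on.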
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
I'll later give updates on other things.

For now - flashing the LSI SAS 9211-8i card is the biggest FreeNAS-related configuration megafuck I've had so far, and I am giving up.
Please advise on the cheapest way to solve this problem for $$.
Maybe you know an online service, preferably in CA, where I can mail a card and get it back flashed to P16?
Maybe you can recommend another card that I can buy which doesn't require flashing?

Here is what I've tried:
- FreeDOS and Win98 DOS disk: "Failed to initialize PAL"
- Many other DOS-es: just won't start.
- WinPE (Win8-based): "Erase flash command not supported on this platform"
  Note: erase -o -e 7 completed only partially. I seem to have successfully flashed the BIOS from WinPE, however, so I now have P19 firmware and P16 BIOS. Yes, it boots up, but I'm not sure how reliable it is. I'm not sure what to do with this card now. I will probably sell it as is, describing its state, and let someone else deal with flashing.
- rEFInd's EFI shell: "InitShellApp: Application not Started from Shell".

Based on what I've read on the forums, I need a true BIOS-supported EFI shell, which the ThinkServer TS140 doesn't have. The ability to boot via EFI is not enough. I tried two computers, with the same problem on both.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Some eBay sellers offer the IBM M1015 pre-flashed. Edit: never mind, I just reread the thread, doh!

Although HighPoint controllers get no love on this forum, I've had no trouble with my HighPoint 640L (no flashing required). YMMV.
 

morfizm

Cadet
Joined
Mar 8, 2015
Messages
8
danb35 -- thanks a lot! following up in a "conversation" (hope it's the right way to do private messages here, I am very new to freenas forums).

One more interesting observation to share:

I've rebuilt my array (now 5 disks in RAID-Z2 instead of 4 in RAID-Z1) but haven't connected the ZIL and L2ARC to it, which means the SAS card isn't being used for the main array. However, I removed the USB 3.0 card and connected my external drive (which holds the backup pool) to the SAS card via an eSATA cable.

I no longer see any of the problems I complained about, but I do see a new type of problem with the external drive: restoring all volumes from backup went smoothly, but after I re-created the backup volume and did a full backup, one volume failed to replicate, and I noticed that the volume has errors in metadata nodes. This strongly supports the idea that the SAS card is at fault, and I hope the problem will go away once I either replace the card or get P16 on it.
 