from "Unreadable (pending) sector" to "One or more devices has experienced an unrecoverable error."

MrWraith · Jul 19, 2016

Hi guys. I've done a lot of searching on this, and followed some advice found here, but it has got to the stage where I need to just ask for help.

What I've done so far:
I have a FreeNAS box (HP ProLiant N40L) with four WD 3TB drives. (WD-WCAWZXXXXXXX).

The day before yesterday, I had an error "CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sector"

I did some reading, learned about SMART long tests, and did one of those. It took ~8 hours, and after that I had 4 Currently unreadable (pending) sectors.

I tried to write to those bad sectors using dd (as per suggestions on these forums). dd gave me an input/output error when I tried to use it. I didn't know if that meant it was doing nothing, or if the error was a sign that the block was being reassigned. The LBA_of_first_error was changing after my attempted dd and new long self-test, so I thought it may be working. However, after a reboot and another long test, I still had 4 unreadable (pending) sectors.

So I did a zpool scrub (as per some more advice on this forum). It took 18hrs, still 4 unreadable sectors, and now more problems. I have a FreeNAS GUI error "The volume VolumeZero (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." (same error in my zpool status output).

I just noticed some errors in dmesg when preparing to write this thread. I'll attach the output below.

Where I am now:
These disks are relatively old, and have been running without spinning down for years. I will replace one if I need to, but I'd like to explore other options first if other options exist. Disks are expensive in Australia and I'm a student.

I've already externally backed up all the critical info to my unRAID box, and I'm currently backing up the rest (apart from the 6TB of movies that I can just re-download in case of total failure).

Now for the info dumps:

DMESG Output:
(not sure if the whole thing was needed. Errors near the bottom)

Code:

[root@Micro] ~# dmesg
tun0: link state changed to DOWN
tun0: link state changed to UP
arp: 192.168.1.66 moved from 02:61:7a:00:0c:0a to e8:39:35:ee:28:49 on epair1b
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
arp: 192.168.1.66 moved from 02:61:7a:00:0c:0a to e8:39:35:ee:28:49 on epair1b
tun0: link state changed to DOWN
tun0: link state changed to UP
epair0a: link state changed to DOWN
epair0b: link state changed to DOWN
ifa_del_loopback_route: deletion failed
Freed UMA keg (udp_inpcb) was not empty (40 items).  Lost 4 pages of memory.
Freed UMA keg (udpcb) was not empty (336 items).  Lost 2 pages of memory.
Freed UMA keg (tcptw) was not empty (300 items).  Lost 6 pages of memory.
Freed UMA keg (tcp_inpcb) was not empty (160 items).  Lost 16 pages of memory.
Freed UMA keg (sackhole) was not empty (101 items).  Lost 1 pages of memory.
Freed UMA keg (tcpcb) was not empty (44 items).  Lost 11 pages of memory.
Freed UMA keg (ripcb) was not empty (20 items).  Lost 2 pages of memory.
hhook_vnet_uninit: hhook_head type=1, id=1 cleanup required
hhook_vnet_uninit: hhook_head type=1, id=0 cleanup required
epair1a: link state changed to DOWN
epair1b: link state changed to DOWN
bridge0: link state changed to DOWN
bge0: promiscuous mode disabled
tun0: link state changed to DOWN
pid 1608 (syslog-ng), uid 0: exited on signal 6 (core dumped)
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...0 0 0 0 0 0 0 done
All buffers synced.
GEOM_ELI: Device ada0p1.eli destroyed.
GEOM_ELI: Detached ada0p1.eli on last close.
GEOM_ELI: Device ada1p1.eli destroyed.
GEOM_ELI: Detached ada1p1.eli on last close.
GEOM_ELI: Device ada2p1.eli destroyed.
GEOM_ELI: Detached ada2p1.eli on last close.
GEOM_ELI: Device ada3p1.eli destroyed.
GEOM_ELI: Detached ada3p1.eli on last close.
Uptime: 66d20h33m11s
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.3-RELEASE-p26 #0 r281084+93c5885: Mon Sep 28 13:25:20 PDT 2015
    root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys/FREENAS.amd64 amd64
gcc version 4.2.1 20070831 patched [FreeBSD]
CPU: AMD Turion(tm) II Neo N40L Dual-Core Processor (1497.54-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0x100f63  Family = 0x10  Model = 0x6  Stepping = 3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  TSC: P-state invariant
real memory  = 9126805504 (8704 MB)
avail memory = 8104435712 (7728 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <HP     ProLiant>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
cpu0 (BSP): APIC ID:  0
cpu1 (AP): APIC ID:  1
WARNING: VIMAGE (virtualized network stack) is a highly experimental feature.
ioapic0 <Version 2.1> irqs 0-23 on motherboard
ispfw: registered firmware <isp_1040>
ispfw: registered firmware <isp_1040_it>
ispfw: registered firmware <isp_1080>
ispfw: registered firmware <isp_1080_it>
ispfw: registered firmware <isp_12160>
ispfw: registered firmware <isp_12160_it>
ispfw: registered firmware <isp_2100>
ispfw: registered firmware <isp_2200>
ispfw: registered firmware <isp_2300>
ispfw: registered firmware <isp_2322>
ispfw: registered firmware <isp_2400>
ispfw: registered firmware <isp_2400_multi>
ispfw: registered firmware <isp_2500>
ispfw: registered firmware <isp_2500_multi>
kbd1 at kbdmux0
cryptosoft0: <software crypto> on motherboard
aesni0: No AESNI support.
padlock0: No ACE support.
acpi0: <HP ProLiant> on motherboard
acpi0: Power Button (fixed)
acpi0: reservation of fee00000, 1000 (3) failed
acpi0: reservation of ffb80000, 80000 (3) failed
acpi0: reservation of fec10000, 20 (3) failed
acpi0: reservation of fed80000, 1000 (3) failed
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, d7f00000 (3) failed
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 450
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xe000-0xe0ff mem 0xf0000000-0xf7ffffff,0xfe8f0000-0xfe8fffff,0xfe700000-0xfe7fffff irq 18 at device 5.0 on pci1
vgapci0: Boot video device
pcib2: <ACPI PCI-PCI bridge> irq 18 at device 6.0 on pci0
pci2: <ACPI PCI bus> on pcib2
bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xfe9f0000-0xfe9fffff irq 18 at device 0.0 on pci2
bge0: CHIP ID 0x05784100; ASIC REV 0x5784; CHIP REV 0x57841; PCI-E
miibus0: <MII bus> on bge0
brgphy0: <BCM5784 10/100/1000baseT PHY> PHY 1 on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: Ethernet address: e8:39:35:ee:28:49
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ohci0: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe6fe000-0xfe6fefff irq 18 at device 18.0 on pci0
usbus0 on ohci0
ehci0: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe6ff800-0xfe6ff8ff irq 17 at device 18.2 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
ohci1: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe6fd000-0xfe6fdfff irq 18 at device 19.0 on pci0
usbus2 on ohci1
ehci1: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe6ff400-0xfe6ff4ff irq 17 at device 19.2 on pci0
usbus3: EHCI version 1.0
usbus3 on ehci1
atapci0: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
ata0: <ATA channel> at channel 0 on atapci0
ata1: <ATA channel> at channel 1 on atapci0
isab0: <PCI-ISA bridge> at device 20.3 on pci0
isa0: <ISA bus> on isab0
pcib3: <ACPI PCI-PCI bridge> at device 20.4 on pci0
pci3: <ACPI PCI bus> on pcib3
ohci2: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe6fc000-0xfe6fcfff irq 18 at device 22.0 on pci0
usbus4 on ohci2
ehci2: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe6ff000-0xfe6ff0ff irq 17 at device 22.2 on pci0
usbus5: EHCI version 1.0
usbus5 on ehci2
amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb4
acpi_button0: <Power Button> on acpi0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff
hwpstate0: <Cool`n'Quiet 2.0> on cpu0
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
ipfw2 (+ipv6) initialized, divert enabled, nat enabled, default to accept, logging disabled
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 480Mbps High Speed USB v2.0
usbus2: 12Mbps Full Speed USB v1.0
usbus3: 480Mbps High Speed USB v2.0
usbus4: 12Mbps Full Speed USB v1.0
usbus5: 480Mbps High Speed USB v2.0
ugen0.1: <ATI> at usbus0
uhub0: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <ATI> at usbus1
uhub1: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
ugen2.1: <ATI> at usbus2
uhub2: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
ugen3.1: <ATI> at usbus3
uhub3: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
ugen4.1: <ATI> at usbus4
uhub4: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
ugen5.1: <ATI> at usbus5
uhub5: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus5
uhub4: 4 ports with 4 removable, self powered
uhub0: 5 ports with 5 removable, self powered
uhub2: 5 ports with 5 removable, self powered
uhub5: 4 ports with 4 removable, self powered
uhub1: 5 ports with 5 removable, self powered
uhub3: 5 ports with 5 removable, self powered
ugen3.2: <Kingston> at usbus3
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 2.10/1.00, addr 2> on usbus3
umass0:  SCSI over Bulk-Only; quirks = 0x8100
umass0:7:0:-1: Attached to scbus7
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
ada0: Serial Number WD-WCAWZ2042347
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: quirks=0x1<4K>
ada0: Previously was known as ad4
da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
da0: <Kingston DataTraveler 3.0 PMAP> Removable Direct Access SCSI-6 device
da0: Serial Number 60A44C4250DBBD904B47169E
da0: 40.000MB/s transfers
da0: 7500MB (15360000 512 byte sectors: 255H 63S/T 956C)
da0: quirks=0x2<NO_6_BYTE>
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
ada1: Serial Number WD-WCAWZ2023116
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada1: quirks=0x1<4K>
ada1: Previously was known as ad6
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
ada2: Serial Number WD-WMAWZ0178959
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada2: quirks=0x1<4K>
ada2: Previously was known as ad8
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD30EZRX-00MMMB0 80.00A80> ATA-8 SATA 3.x device
ada3: Serial Number WD-WMAWZ0183668
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada3: quirks=0x1<4K>
ada3: Previously was known as ad10
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 1497538359 Hz quality 800
Trying to mount root from zfs:freenas-boot/ROOT/FreeNAS-9.3-STABLE-201509282017 []...
GEOM_RAID5: Module loaded, version 1.3.20140711.62 (rev f91e28e40bf7)
ipmi0: <IPMI System Interface> on isa0
ipmi0: KCS mode found at mem 0x0 alignment 0x1 on isa
ipmi0: couldn't configure I/O resource
device_attach: ipmi0 attach returned 6
wbwd0: HEFRAS and EFER do not align: EFER 0x2e DevID 0xff DevRev 0xff CR26 0xff
GEOM_ELI: Device ada0p1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada1p1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada2p1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
GEOM_ELI: Device ada3p1.eli created.
GEOM_ELI: Encryption: AES-XTS 256
GEOM_ELI:     Crypto: software
vboxdrv: fAsync=0 offMin=0x358 offMax=0xb5a
bridge0: Ethernet address: 02:e9:08:1c:c8:00
bge0: promiscuous mode enabled
bridge0: link state changed to UP
epair0a: Ethernet address: 02:b4:99:00:0b:0a
epair0b: Ethernet address: 02:b4:99:00:0c:0b
epair0a: link state changed to UP
epair0b: link state changed to UP
epair0a: promiscuous mode enabled
ng_ether_ifnet_arrival_event: can't re-name node epair0b
epair1a: Ethernet address: 02:35:32:00:0c:0a
epair1b: Ethernet address: 02:35:32:00:0d:0b
epair1a: link state changed to UP
epair1b: link state changed to UP
epair1a: promiscuous mode enabled
ng_ether_ifnet_arrival_event: can't re-name node epair1b
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
arp: 192.168.1.66 moved from 02:35:32:00:0c:0a to e8:39:35:ee:28:49 on epair1b
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 50 b2 b5 40 f5 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 01 (ILI)
(ada3:ahcich3:0:0:0): RES: 41 01 50 b2 b5 40 f5 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 50 b2 b5 40 f5 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 50 b2 b5 40 f5 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 50 b2 b5 40 f5 00 00 00 00
(ada3:ahcich3:0:0:0): Error 5, Retries exhausted
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 f3 99 40 f7 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 80 f3 99 40 f7 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 f3 99 40 f7 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 80 f3 99 40 f7 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 f3 99 40 f7 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 80 f3 99 40 f7 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 f3 99 40 f7 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 80 f3 99 40 f7 00 00 00 00
(ada3:ahcich3:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 30 f3 99 40 f7 00 00 01 00 00
(ada3:ahcich3:0:0:0): CAM status: ATA Status Error
(ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada3:ahcich3:0:0:0): RES: 41 40 80 f3 99 40 f7 00 00 00 00
(ada3:ahcich3:0:0:0): Error 5, Retries exhausted
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
tun0: link state changed to DOWN
tun0: link state changed to UP
arp: 192.168.1.66 moved from 02:35:32:00:0c:0a to e8:39:35:ee:28:49 on epair1b
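
To see which sectors ada3 is failing on, the ACB and RES byte strings above can be turned into LBAs. This is only a minimal sketch: it assumes the byte order FreeBSD's CAM layer prints for 48-bit commands (command, features, LBA low/mid/high, device, the three LBA "exp" bytes, then features-exp and sector count for ACB; status, error, then the same LBA layout for RES), and that for queued (FPDMA) reads the transfer length is carried in the features bytes. The function names are made up for illustration.

```python
def _hex_bytes(line, tag):
    """Return the hex bytes that follow `tag` (e.g. "ACB:") as a list of ints."""
    return [int(tok, 16) for tok in line.split(tag, 1)[1].split()]


def _lba48(b):
    """Assemble a 48-bit LBA from bytes 2-4 (low/mid/high) and 6-8 (exp bytes)."""
    low, mid, high = b[2], b[3], b[4]
    low_exp, mid_exp, high_exp = b[6], b[7], b[8]
    return (low | mid << 8 | high << 16
            | low_exp << 24 | mid_exp << 32 | high_exp << 40)


def decode_acb(line):
    """Return (start_lba, sector_count) for a READ_FPDMA_QUEUED ACB line.

    Assumption: sector count = features | features_exp << 8 for FPDMA commands.
    """
    b = _hex_bytes(line, "ACB:")
    return _lba48(b), b[1] | b[9] << 8


def decode_res(line):
    """Return the LBA the drive reported in a RES line."""
    return _lba48(_hex_bytes(line, "RES:"))


acb = "(ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 b1 b5 40 f5 00 00 01 00 00"
res = "(ada3:ahcich3:0:0:0): RES: 41 40 50 b2 b5 40 f5 00 00 00 00"
print(decode_acb(acb))
print(decode_res(res))
```

If those assumptions hold, the first failing command is a 256-sector read starting at LBA 4122325416, and the drive reports the unreadable sector at LBA 4122325584. These are raw-disk LBAs, not offsets inside the encrypted ada3p1.eli provider.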

zpool status

Code:

[root@Micro] ~# zpool status
  pool: VolumeZero
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 284K in 18h7m with 0 errors on Wed Jul 20 07:39:31 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    VolumeZero                                      ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/6b5bbd30-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     0
        gptid/6c218b94-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     0
        gptid/6cf0180d-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     0
        gptid/6db64f2c-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     1

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
    Expect reduced performance.
action: Replace affected devices with devices that support the
    configured block size, or migrate data to a properly configured
    pool.
  scan: scrub repaired 0 in 0h1m with 0 errors on Thu Jun 16 03:46:22 2016
config:

    NAME        STATE     READ WRITE CKSUM
    freenas-boot  ONLINE       0     0     0
      da0p2     ONLINE       0     0     0  block size: 8192B configured, 8388608B native

errors: No known data errors
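
To pick out which device the status message is talking about, the READ/WRITE/CKSUM columns can be scanned for non-zero counters. A minimal sketch, assuming the column layout shown above and plain integer counters (the function name is hypothetical, and abbreviated counts such as "1.2K" are simply skipped):

```python
def devices_with_errors(zpool_status_output):
    """Return (name, state, (read, write, cksum)) for rows with any non-zero counter."""
    flagged = []
    for line in zpool_status_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        # Only config rows have three integer counters after the state column.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged.append((name, state, counts))
    return flagged


SAMPLE = """\
    NAME                                            STATE     READ WRITE CKSUM
    VolumeZero                                      ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/6b5bbd30-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     0
        gptid/6db64f2c-abaf-11e1-9c11-e83935ee2849  ONLINE       0     0     1
"""
print(devices_with_errors(SAMPLE))
```

On the output above this flags only gptid/6db64f2c-abaf-11e1-9c11-e83935ee2849, with a single checksum error.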

smartctl -a /dev/ada3

Code:

[root@Micro] ~# smartctl -a /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p26 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WMAWZ0183668
LU WWN Device Id: 5 0014ee 206df0349
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jul 20 11:22:49 2016 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:         (51660) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 496) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   151   150   021    Pre-fail  Always       -       9408
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   056   056   000    Old_age   Always       -       32747
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       3002349
194 Temperature_Celsius     0x0022   116   106   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     32725         3825966784
# 2  Extended offline    Completed: read failure       90%     32725         3825966788
# 3  Extended offline    Completed: read failure       90%     32725         3825966790
# 4  Extended offline    Completed: read failure       90%     32725         3825966786
# 5  Extended offline    Completed: read failure       90%     32725         3825966788
# 6  Extended offline    Completed: read failure       90%     32725         3825966790
# 7  Extended offline    Completed: read failure       70%     32715         1987001840

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
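
A quick check on the LBA_of_first_error values above, given that this drive reports 512-byte logical and 4096-byte physical sectors: eight consecutive LBAs share one physical sector. A minimal sketch, assuming physical sectors start at LBA 0 with no alignment offset (the helper name is hypothetical):

```python
LOGICAL_BYTES = 512    # "Sector Sizes: 512 bytes logical"
PHYSICAL_BYTES = 4096  # "4096 bytes physical"
LBAS_PER_PHYSICAL = PHYSICAL_BYTES // LOGICAL_BYTES  # 8


def physical_sector(lba):
    """Return (physical_index, first_lba, last_lba) for the 4K sector holding `lba`."""
    index = lba // LBAS_PER_PHYSICAL
    first = index * LBAS_PER_PHYSICAL
    return index, first, first + LBAS_PER_PHYSICAL - 1


# LBA_of_first_error values from the self-test log
for lba in (3825966784, 3825966786, 3825966788, 3825966790, 1987001840):
    print(lba, physical_sector(lba))
```

Under that assumption, 3825966784, 3825966786, 3825966788 and 3825966790 all fall in one 4K physical sector, while 1987001840 from the older test is in a different one.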

Jailer · Jul 19, 2016

Load cycle count of over 3 MILLION????? You need to replace the drive ASAP!

MrWraith · Jul 19, 2016

Yeah? That sucks.

What if I told you that all four disks have a load cycle count of over 3 million. lol.

Maybe I should replace them all.

MrWraith · Jul 19, 2016

I did some quick reading and it looks like the recommended maximum is ~600k. I don't know much about disks so I had to google what the LCC is. Am I right in thinking these need to be trashed?

Jailer · Jul 19, 2016

If they all have LCC over 3 million then yes you need to replace all of them. You're already living on borrowed time IMHO.

MrWraith · Jul 19, 2016

OK, thanks for the info. Good thing I've already backed up the important stuff. I guess I'll get onto backing up the rest as well.

AVB · Jul 20, 2016

I've got an old WD 2TB drive with 31,000 hours on it and the LCC is just under 5000. It has been on 24/7 for the last 3 years. I just can't imagine how yours got to 3 million. I wonder if that was part of the reason your scrub took 18 hours. My 16 3TB drives with over 20TB of data took 5H42M last time. As was already said, change them out ASAP.

MrWraith · Jul 20, 2016

Yeah, I've been wondering the same thing. IIRC I bought these new for this server. It's been maybe 5 years with them on 24/7. I used mostly default FreeNAS settings. AFAIK that means they never spin down? They have been accessed via Samba and FTP every day to stream movies to a few computers in the house.

I don't want to spend the money getting new disks, so I think I'll just retire the server when I can. I have already backed up the important data. I might move over to unRAID. FreeNAS has been too much work to maintain.

EDIT: Oh also there are a few jails running. One is a transmission server. That would have added some more constant read/write to the equation.

AVB · Jul 20, 2016

Actually, if it has been on 24/7 for its whole life, then it has been about 3.5 years. Not sure if unRAID is any better or not. I've been pretty satisfied for the past 15 months or so, once I got everything the way I wanted it. There's a learning curve, to be sure, and it doesn't seem to be getting any easier as new things are added, but it does work well once you get it right.

MrWraith · Jul 20, 2016

I have an unRAID box and I'm happy with it. My favourite feature is that if two disks die, you only lose one disk of data (the disks are individual disks using ReiserFS, with a parity disk). It doesn't have nearly the same level of control as FreeNAS though.

FreeNAS has served me well. I'll use it again if I get a chance. But the fact that you can't expand an existing volume is a big problem for me.

Can I ask: what do you mean by "it has been about 3.5 years"? Are you talking about your disks or mine?

AVB · Jul 20, 2016

Your discs. 5 years would be over 43,000 hours and you only have 32,000. A year is 8760 hours - why do I know this? I have no idea but it comes in handy once in a while.
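
That arithmetic, using the Power_On_Hours raw value of 32747 from the ada3 smartctl output earlier in the thread, as a minimal sketch (it assumes the counter has run the whole time the drive was powered, and it ignores leap days):

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def years_powered_on(power_on_hours):
    """Convert a SMART Power_On_Hours raw value to years of continuous running."""
    return power_on_hours / HOURS_PER_YEAR


five_years_in_hours = 5 * HOURS_PER_YEAR
ada3_years = years_powered_on(32747)
print(five_years_in_hours)
print(round(ada3_years, 2))
```

That comes to 43800 hours for five years and roughly 3.7 years of running for ada3.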

MrWraith · Jul 20, 2016

Oh I see. I didn't see that stat. I guess 3.5 sounds about right. It's a shame that they're dead so soon. I blew like $800 on those disks!

AVB · Jul 20, 2016

It wasn't the discs; something caused that unusual problem. I don't know what it might be, but once it's found the discs should last far longer.

MrWraith · Jul 20, 2016

Yeah my solution to the problem is to not use FreeNAS any longer haha

Robert Trevellyan · Jul 20, 2016

I don't think you should be in any rush to replace all the disks. A high LCC simply indicates that the heads have been parked a lot. It does not indicate that the disks are about to fail.

Run WDIDLE3 on all disks, setting the head park time to either 300 seconds or to disabled.
Keep a close eye on the disk with 4 pending sectors.
Make sure you have SMART checks, SMART short and extended tests, and email alerts working.
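
To put the LCC in perspective, the two raw values from the ada3 smartctl output (Load_Cycle_Count 3002349, Power_On_Hours 32747) give an average parking rate. A minimal sketch; the function name is hypothetical, and the 300-second comparison assumes the heads park at most once per idle-timer interval:

```python
def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Average number of head load/unload cycles per powered-on hour."""
    return load_cycle_count / power_on_hours


rate = load_cycles_per_hour(3002349, 32747)
seconds_between_parks = 3600 / rate

# Upper bound on parks per hour if the idle timer were set to 300 seconds
# (assumption: at most one park per timer interval).
max_rate_with_300s_timer = 3600 / 300

print(round(rate, 1), round(seconds_between_parks, 1), max_rate_with_300s_timer)
```

That works out to roughly 92 load cycles per hour, or one about every 39 seconds on average, against a ceiling of 12 per hour with a 300-second timer.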

MrWraith · Jul 20, 2016

Robert Trevellyan said:
I don't think you should be in any rush to replace all the disks. A high LCC simply indicates that the heads have been parked a lot. It does not indicate that the disks are about to fail.

Run WDIDLE3 on all disks, setting the head park time to either 300 seconds or to disabled.

Keep a close eye on the disk with 4 pending sectors.

Make sure you have SMART checks, SMART short and extended tests, and email alerts working.

Yeah, you reckon? OK, maybe I'll do that!

SMART tests are running, that's how I detected the first bad sectors. I'll configure the rest.

When you say "keep a close eye on the disk" - do you mean I should try to force-reallocate the bad sectors, or I should just leave it and see how it goes? Am I right in thinking those bad sectors will reallocate themselves in time/use?

Thanks for the input!

Jailer · Jul 20, 2016

Robert Trevellyan said:
I don't think you should be in any rush to replace all the disks. A high LCC simply indicates that the heads have been parked a lot. It does not indicate that the disks are about to fail.

It may not be an indicator of impending failure but I wouldn't trust those drives with any data that matters. I've got a Seagate drive that's been hanging on for years now but I wouldn't put it in anything mission critical.

kaushal7007 · Jul 20, 2016

AVB said:
Actually, if it has been on 24/7 for its whole life, then it has been about 3.5 years. Not sure if unRAID is any better or not. I've been pretty satisfied for the past 15 months or so, once I got everything the way I wanted it. There's a learning curve, to be sure, and it doesn't seem to be getting any easier as new things are added, but it does work well once you get it right.

How do I track the LCC in my current FreeNAS setup? I'm a new FreeNAS user.

MrWraith · Jul 20, 2016

I got it from

Code:

smartctl -a /dev/ada0
smartctl -a /dev/ada1
smartctl -a /dev/ada2
smartctl -a /dev/ada3

These give you some stats (among other things) for each of your disks.
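
If you only want the LCC, it is the raw value on the Load_Cycle_Count row (attribute 193) of that output. A minimal sketch that pulls a raw value out of saved smartctl text; the function name is hypothetical, and it assumes the ten-column attribute table layout shown in my first post:

```python
def smart_raw_value(smartctl_output, attribute_name):
    """Return the RAW_VALUE of a named SMART attribute as an int, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == attribute_name:
            return int(fields[9])
    return None


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   056   056   000    Old_age   Always       -       32747
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       3002349
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
"""
print(smart_raw_value(SAMPLE, "Load_Cycle_Count"))
```

Not every drive reports a Load_Cycle_Count row, in which case this returns None.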

kaushal7007 · Jul 21, 2016

MrWraith said:
I got it from

Code:
smartctl -a /dev/ada0
smartctl -a /dev/ada1
smartctl -a /dev/ada2
smartctl -a /dev/ada3

These give you some stats (among other things) for each of your disks.

Thanks, but I can't find the LCC value there.
Log:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE Serial ATA
Device Model: WDC WD4000KD-00NAB0
Serial Number: WD-WMAMY1125951
Firmware Version: 01.06A01
User Capacity: 400,088,457,216 bytes [400 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 published, ANSI INCITS 397-2005
Local Time is: Thu Jul 21 00:10:57 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (10530) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 152) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   153   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   222   212   021    Pre-fail  Always       -       5941
  4 Start_Stop_Count        0x0032   098   098   040    Old_age   Always       -       2120
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   253   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       17537
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       656
194 Temperature_Celsius     0x0022   104   091   000    Old_age   Always       -       48
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       5
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1190 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1190 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 80 50 bd f8 40 Error: UNC 128 sectors at LBA = 0x00f8bd50 = 16301392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 1f bd f8 48 00 9d+02:43:34.350 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:34.350 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:34.350 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:34.350 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:34.350 READ DMA EXT

Error 1189 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 80 50 bd f8 40 Error: UNC 128 sectors at LBA = 0x00f8bd50 = 16301392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 1f bd f8 48 00 9d+02:43:29.050 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:29.050 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:29.050 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:29.050 READ DMA EXT
25 00 1e df df 1c 43 00 9d+02:43:29.050 READ DMA EXT

Error 1188 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 80 50 bd f8 40 Error: UNC 128 sectors at LBA = 0x00f8bd50 = 16301392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 1f bd f8 48 00 9d+02:43:23.700 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:23.700 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:23.700 READ DMA EXT
25 00 1e df df 1c 43 00 9d+02:43:23.700 READ DMA EXT
25 00 2d 4f d9 49 40 00 9d+02:43:23.700 READ DMA EXT

Error 1187 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 80 50 bd f8 40 Error: UNC 128 sectors at LBA = 0x00f8bd50 = 16301392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 1f bd f8 48 00 9d+02:43:18.400 READ DMA EXT
25 00 80 1f bd f8 48 00 9d+02:43:18.400 READ DMA EXT
25 00 1e df df 1c 43 00 9d+02:43:18.400 READ DMA EXT
25 00 2d 4f d9 49 40 00 9d+02:43:18.400 READ DMA EXT
35 00 40 8f 13 f7 48 00 9d+02:43:18.400 WRITE DMA EXT

Error 1186 occurred at disk power-on lifetime: 10748 hours (447 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 80 50 bd f8 40 Error: UNC 128 sectors at LBA = 0x00f8bd50 = 16301392

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 1f bd f8 48 00 9d+02:43:13.050 READ DMA EXT
25 00 1e df df 1c 43 00 9d+02:43:13.050 READ DMA EXT
25 00 2d 4f d9 49 40 00 9d+02:43:13.050 READ DMA EXT
35 00 40 8f 13 f7 48 00 9d+02:43:13.050 WRITE DMA EXT
35 00 40 cf 46 c6 42 00 9d+02:43:13.050 WRITE DMA EXT

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 11535 625843504

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
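
As a reading aid for the numbers in a log like this, here is a small sketch of the two conversions smartctl prints alongside each error: the hex LBA to decimal, and power-on hours to days plus hours. The function names are hypothetical; the arithmetic is plain shell.

```shell
#!/bin/sh
# Convert a hex LBA from the SMART error log (e.g. 0x00f8bd50) to decimal.
lba_to_dec() {
    printf '%d\n' "$1"
}

# Split a power-on-hours figure into whole days plus remaining hours.
hours_to_days() {
    printf '%d days + %d hours\n' "$(( $1 / 24 ))" "$(( $1 % 24 ))"
}

# lba_to_dec 0x00f8bd50   prints 16301392
# hours_to_days 10748     prints 447 days + 20 hours
```

Both results match the "LBA = 0x00f8bd50 = 16301392" and "10748 hours (447 days + 20 hours)" figures in the error entries above.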
