ZFS reports read, write and checksum errors after scrub

Status
Not open for further replies.

someuser77

Dabbler
Joined
Oct 1, 2011
Messages
12
Hello,

I am using FreeNAS-8.0.1-RELEASE-amd64 (8081) (FreeBSD 8.2-RELEASE-p3) on a GA-D525TUD motherboard with two 2TB Samsung F4 HD204UI drives configured as a ZFS mirror.

Yesterday a scheduled scrub was running and the result was:
Code:
[root@freenas] ~# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 4h48m with 0 errors on Fri Aug 31 07:50:06 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror                                        ONLINE       0     0     0
            gptid/43ee42df-f1ce-11e0-b772-6cf049de8d09  ONLINE   1.35M 1.36M 72.7K
            gptid/44617302-f1ce-11e0-b772-6cf049de8d09  ONLINE       0     0     0

errors: No known data errors


Do I need to replace the drive or should I format it and let ZFS resilver it?

I have a few days before my warranty expires, so I would like to know what I must do to diagnose the problem, and whether I should replace the drive and on what grounds.

Is it a software problem or a hardware problem? :smile:
 

freshfeesh

Explorer
Joined
Oct 10, 2011
Messages
72
It certainly looks like you could have a bad disk. Do a "zpool clear" from the command line (via SSH or "Shell" in the menu tree on the left), check that all your cables are seated properly (probably a good idea to power down for that), then run another scrub and see if there are still errors. A scrub corrects errors where it can, so I don't think wiping the disk and resilvering will buy you anything. Even if you get no errors after another scrub, you'll want to look at the SMART data to see if you need to send the drive back.
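If you want to compare the counters before and after the clear and re-scrub, here is a rough sketch that picks out the devices with non-zero READ/WRITE/CKSUM values from saved "zpool status" text. The function names are made up for illustration, and the multipliers for the K/M suffixes are an assumption (powers of 1024), so treat the expanded numbers as rough magnitudes only.

```python
# Assumed multipliers for the abbreviated counters (e.g. 72.7K, 1.35M).
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_count(text):
    """Turn '0', '72.7K' or '1.35M' into an approximate integer."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1].upper()]))
    return int(text)


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with any non-zero counter."""
    bad = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # the device table starts after this header
            continue
        if not in_config:
            continue
        if line.startswith("errors:"):
            break                     # end of the device table
        if len(fields) < 5:
            continue                  # blank or unrelated line
        try:
            counts = tuple(parse_count(f) for f in fields[2:5])
        except ValueError:
            continue                  # not a counter row
        if any(counts):
            bad[fields[0]] = counts
    return bad
```

Pass it the text captured from "zpool status -v" once before "zpool clear" and once after the second scrub. An empty result the second time only means ZFS saw no new errors during that scrub; it does not prove the drive is healthy.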
 

leenux_tux

Patron
Joined
Sep 3, 2011
Messages
238
someuser77....

When the server boots, do you get error messages stating that the disk could have firmware issues related to "SMART" being turned on? I can't remember the exact message, but I have these disks and I get this type of message. It's fairly well documented on the web that specific firmware versions on this drive (and others) do not play well with SMART and can cause complete data corruption. I would turn off SMART until you can figure out what firmware you have and whether it needs to be upgraded.

leenux_tux
 

peterh

Patron
Joined
Oct 19, 2011
Messages
315
Replace the drive ASAP. It's broken
 

someuser77

Dabbler
Joined
Oct 1, 2011
Messages
12
So the bad drive is back online: it is recognized again and FreeNAS can read its label.

I did not see any SMART errors during POST.
This is the output of dmesg:
Code:
[root@freenas] ~# dmesg
Copyright (c) 1992-2011 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.2-RELEASE-p3 #7: Fri Sep 30 12:51:49 PDT 2011
    jpaetzel@servant.iXsystems.com:/b/sf_freenas_build/obj.amd64/b/sf_freenas_build/FreeBSD/src/sys/FREENAS.amd64 amd64
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Atom(TM) CPU D525   @ 1.80GHz (1799.97-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x106ca  Family = 6  Model = 1c  Stepping = 10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x40e31d<SSE3,DTES64,MON,DS_CPL,TM2,SSSE3,CX16,xTPR,PDCM,MOVBE>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 4294967296 (4096 MB)
avail memory = 4083167232 (3894 MB)
ACPI APIC Table: <GBT    GBTUACPI>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 HTT threads
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP/HT): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP/HT): APIC ID:  3
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
netsmb_dev: loaded
cryptosoft0: <software crypto> on motherboard
acpi0: <GBT GBTUACPI> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, bf4e0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0xff00-0xff07 mem 0xfdf00000-0xfdf7ffff,0xd0000000-0xdfffffff,0xfdd00000-0xfddfffff irq 16 at device 2.0 on pci0
agp0: <Intel Pineview SVGA controller> on vgapci0
agp0: detected 8188k stolen memory
agp0: aperture size is 256M
pci0: <multimedia, HDA> at device 27.0 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
re0: <RealTek 8168/8111 B/C/CP/D/DP/E PCIe Gigabit Ethernet> port 0xee00-0xeeff mem 0xfdcff000-0xfdcfffff,0xfdcf8000-0xfdcfbfff irq 16 at device 0.0 on pci1
re0: Using 1 MSI-X message
re0: Chip rev. 0x2c000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211B media interface> PHY 1 on miibus0
rgephy0:  10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Ethernet address: 6c:f0:49:de:8d:09
re0: [ITHREAD]
pcib2: <ACPI PCI-PCI bridge> irq 17 at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
ahci0: <JMicron JMB363 AHCI SATA controller> mem 0xfdefe000-0xfdefffff irq 17 at device 0.0 on pci2
ahci0: [ITHREAD]
ahci0: AHCI v1.00 with 2 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
atapci0: <JMicron JMB363 UDMA133 controller> port 0xdf00-0xdf07,0xde00-0xde03,0xdd00-0xdd07,0xdc00-0xdc03,0xdb00-0xdb0f irq 18 at device 0.1 on pci2
atapci0: [ITHREAD]
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
uhci0: <Intel 82801G (ICH7) USB controller USB-A> port 0xfe00-0xfe1f irq 23 at device 29.0 on pci0
uhci0: [ITHREAD]
usbus0: <Intel 82801G (ICH7) USB controller USB-A> on uhci0
uhci1: <Intel 82801G (ICH7) USB controller USB-B> port 0xfd00-0xfd1f irq 19 at device 29.1 on pci0
uhci1: [ITHREAD]
usbus1: <Intel 82801G (ICH7) USB controller USB-B> on uhci1
uhci2: <Intel 82801G (ICH7) USB controller USB-C> port 0xfc00-0xfc1f irq 18 at device 29.2 on pci0
uhci2: [ITHREAD]
usbus2: <Intel 82801G (ICH7) USB controller USB-C> on uhci2
uhci3: <Intel 82801G (ICH7) USB controller USB-D> port 0xfb00-0xfb1f irq 16 at device 29.3 on pci0
uhci3: [ITHREAD]
usbus3: <Intel 82801G (ICH7) USB controller USB-D> on uhci3
ehci0: <Intel 82801GB/R (ICH7) USB 2.0 controller> mem 0xfdfff000-0xfdfff3ff irq 23 at device 29.7 on pci0
ehci0: [ITHREAD]
usbus4: EHCI version 1.0
usbus4: <Intel 82801GB/R (ICH7) USB 2.0 controller> on ehci0
pcib3: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci3: <ACPI PCI bus> on pcib3
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci1: <Intel ICH7 AHCI SATA controller> port 0xfa00-0xfa07,0xf900-0xf903,0xf800-0xf807,0xf700-0xf703,0xf600-0xf60f mem 0xfdffe000-0xfdffe3ff irq 19 at device 31.2 on pci0
ahci1: [ITHREAD]
ahci1: AHCI v1.10 with 4 3Gbps ports, Port Multiplier not supported
ahcich2: <AHCI channel> at channel 0 on ahci1
ahcich2: [ITHREAD]
ahcich3: <AHCI channel> at channel 1 on ahci1
ahcich3: [ITHREAD]
ahcich4: <AHCI channel> at channel 2 on ahci1
ahcich4: [ITHREAD]
ahcich5: <AHCI channel> at channel 3 on ahci1
ahcich5: [ITHREAD]
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff irq 0,8 on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
atrtc0: <AT realtime clock> port 0x70-0x73 on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: [FILTER]
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
uart1: [FILTER]
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppc0: [ITHREAD]
ppbus0: <Parallel port bus> on ppc0
lpt0: <Printer> on ppbus0
lpt0: [ITHREAD]
lpt0: Interrupt-driven port
orm0: <ISA Option ROM> at iomem 0xd0000-0xd2fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
coretemp0: Can not get Tj(target) from your CPU, using 100C.
p4tcc0: <CPU Frequency Thermal Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
coretemp1: Can not get Tj(target) from your CPU, using 100C.
p4tcc1: <CPU Frequency Thermal Control> on cpu1
coretemp2: <CPU On-Die Thermal Sensors> on cpu2
coretemp2: Can not get Tj(target) from your CPU, using 100C.
p4tcc2: <CPU Frequency Thermal Control> on cpu2
coretemp3: <CPU On-Die Thermal Sensors> on cpu3
coretemp3: Can not get Tj(target) from your CPU, using 100C.
p4tcc3: <CPU Frequency Thermal Control> on cpu3
Timecounters tick every 1.000 msec
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 12Mbps Full Speed USB v1.0
usbus2: 12Mbps Full Speed USB v1.0
usbus3: 12Mbps Full Speed USB v1.0
usbus4: 480Mbps High Speed USB v2.0
ugen0.1: <Intel> at usbus0
uhub0: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus1
ugen2.1: <Intel> at usbus2
uhub2: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
ugen3.1: <Intel> at usbus3
uhub3: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus3
ugen4.1: <Intel> at usbus4
uhub4: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus4
uhub0: 2 ports with 2 removable, self powered
uhub1: 2 ports with 2 removable, self powered
uhub2: 2 ports with 2 removable, self powered
uhub3: 2 ports with 2 removable, self powered
uhub4: 8 ports with 8 removable, self powered
ugen4.2: <Kingston> at usbus4
umass0: <Kingston DataTraveler G3, class 0/0, rev 2.00/1.00, addr 2> on usbus4
ada0 at ahcich1 bus 0 scbus1 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
da0: <Kingston DataTraveler G3 PMAP> Removable Direct Access SCSI-0 device
da0: 40.000MB/s transfers
da0: 7429MB (15215808 512 byte sectors: 255H 63S/T 947C)
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich2 bus 0 scbus3 target 0 lun 0
ada1: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #2 Launched!
GEOM: da0s1: geometry does not match label (16h,63s != 255h,63s).
Trying to mount root from ufs:/dev/ufs/FreeNASs1a
ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
            to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
ZFS filesystem version 4
ZFS storage pool version 15
re0: link state changed to UP


freshfeesh said:
do another scrub and see if there are still errors
I did a scrub from the GUI and then checked the status:
Code:
[root@freenas] ~# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h43m with 0 errors on Mon Sep  3 12:37:34 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror                                        ONLINE       0     0     0
            gptid/43ee42df-f1ce-11e0-b772-6cf049de8d09  ONLINE       0     0     1  159G resilvered
            gptid/44617302-f1ce-11e0-b772-6cf049de8d09  ONLINE       0     0     0

errors: No known data errors

Then I cleared the pool and did another scrub:
Code:
[root@freenas] ~# zpool status
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h57m with 0 errors on Mon Sep  3 16:36:45 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror                                        ONLINE       0     0     0
            gptid/43ee42df-f1ce-11e0-b772-6cf049de8d09  ONLINE       0     0     0
            gptid/44617302-f1ce-11e0-b772-6cf049de8d09  ONLINE       0     0     0

errors: No known data errors


I'm not sure what to make of it. Is the data back to a normal state?

The manufacturing date is 2011.7 and one of the labels says SEC-HD204UI(B) so I understand the drive is patched. Should I patch again just in case?

peterh said:
Replace the drive ASAP. It's broken
Is there a way I can confirm that? Maybe run a SMART test?

I called the shop and they said they can run a test on it, and if it passes they will not replace it.
Should I try my luck? I would not want to replace a good drive with a refurbished one.
 

freshfeesh

Explorer
Joined
Oct 10, 2011
Messages
72
Errors can be bad cables or connections as well as bad drives, but I wouldn't trust any data to that drive. You should mentally prep yourself to accept a refurbished drive or be ready to get a new drive. Make sure your data is backed up; it looks like you're running on a single good drive.

Whether dealing directly with the shop or with the manufacturer, you're going to have to prove that the drive is bad by using some test that they authorize. It sounds like the shop may be prepared to replace the bad drive with new, which is obviously preferable. You can save yourself a trip there by asking them what test they plan on running, then if you can, download that test and run it on the drive yourself. It's likely a Windows utility, meaning you'll have to pull the drive and hook it up to a Windows box, but you were going to have to pull the drive to take it to the store anyway. Whatever test it is, it should report SMART info. If the store's test is an unknown quantity, you can run Samsung's (Seagate's?) test, which is often the basis for manufacturer returns. You can deal with the shop with that info. If they're reasonable, they would accept a "failure" result from a manufacturer test. You'll want to run a test routine, not just pull the SMART data from the drive.
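If you would rather try a self-test without pulling the drive first, here is a minimal sketch of the difference between running a test routine and only reading SMART data, using smartctl from smartmontools. It is not the vendor test the shop or manufacturer may ask for. The device node is an assumption (check which adaN is the suspect disk), the helper names are hypothetical, and given the firmware warning earlier in the thread it is worth confirming the drive is patched before issuing SMART commands.

```python
import subprocess


def smart_test_commands(device):
    """Build smartctl invocations for an extended self-test on `device`.

    `device` is an assumption, e.g. "/dev/ada0"; substitute the real node.
    """
    return {
        # Starts the drive's own extended (long) self-test in the background.
        "start_long_test": ["smartctl", "-t", "long", device],
        # Shows the self-test log once the test has finished.
        "selftest_log": ["smartctl", "-l", "selftest", device],
        # Only reads the health verdict and attributes; runs no test.
        "health_and_attributes": ["smartctl", "-H", "-A", device],
    }


def run(cmd):
    """Run one command and return its stdout (needs root and smartmontools)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
```

For example, run(smart_test_commands("/dev/ada0")["start_long_test"]) starts the test, and the "selftest_log" command shows the result once the drive reports it has finished, which can take hours on a 2TB disk.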
 

someuser77

Dabbler
Joined
Oct 1, 2011
Messages
12
I have a GA-D525TUD motherboard with 2 SATA controllers, one from the chipset and another one on a dedicated chip.

Initially the drives were each plugged into a different controller in case one malfunctioned. The bad drive was connected to the controller on the dedicated chip.
I then moved the bad drive to the chipset SATA controller, alongside the good drive.
Because I was unable to find a testing tool on the Samsung site (I got redirected to Seagate) and could not install SeaTools on a thumb drive using the methods found on the web, I ended up using the UltimateBootCD, booted off a USB stick.

I ran the ES-Tools test from the UltimateBootCD on the bad drive; it ran for 16 hours and all the tests passed, including the read surface scan.

I then plugged both drives into the dedicated chip and both of them were recognized.

So it's either a bad drive, a bad chip, both, or a temporary bug.

Could anyone please give me ideas on what I can troubleshoot next?

I can try to return the drive and the motherboard to the store, but currently it looks like they both work.

If I plug the drives into different controllers each boot, can it cause problems with FreeNAS?

Update:
I replaced the board with a new GA-E350N-USB3 and so far it looks okay.
 