Determining which drive is going bad

JCasanova

Dabbler
Joined
Dec 20, 2014
Messages
22
Hi everyone... I have a FreeNAS box that is really just a glorified network-attached hard drive for my home. Nothing fancy about it...

It has developed a problem where my bulk storage pool chokes after writing ~4 GB of data. It's a stripe pool with only 2 disks. I'm not wholly concerned with data loss, as everything on there is recoverable within a reasonable amount of time. The pool becomes unresponsive, and so does the FreeNAS web GUI. Both recover in 2-10 minutes every time.

I've narrowed it down to the pool itself, as I have a second pool of SSDs that I use for high(er)-speed operations and that pool has zero problems. I've tested the SSD pool by writing to it from 0% to 100% capacity, and it takes the writes without skipping a beat or slowing down in any way, at the full speed the aging hardware can take the data in (~225 GB @ ~85 MB/sec).

The hardware itself is old, but not obsolete. I understand it isn't something the community would approve NOW, but it was a while back. It's a Core2Duo E8400 with 8 GB of RAM.

The box ONLY runs storage. No jails... it just serves data via FTP, CIFS, and NFS. I occasionally will use rsync. All of these protocols have this issue writing to the pool. As of now, I haven't done any read tests - I will do that tonight.

I'm currently running FreeNAS 9.3-stable.

None of the disks in the pool report any problem. They are getting up there in duty hours and will be replaced within a year or less. But I'd like to figure out how I can determine which disk is "choking" so I can replace it. While I don't care about data loss on this pool, I'd prefer to avoid it altogether to save me time in the long run.

I suspect that one of the disks is having an issue with its on-board controller. If what I've described here says otherwise, please feel free to point it out to me.

Thanks!
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
My guess without any more detail is that you are running out of RAM. Because the SSDs can flush faster, the problem isn't appearing on that pool. Because the spinning disks are slower, there is more buildup in the ZIL and stuff swaps.

Of course, it could also be a hardware issue. Have you run zpool status and looked for errors?

Does this happen with an empty pool?

Does this happen if you write locally to the pool (vs using a sharing protocol)?
 

JCasanova

Sakuru said: "Please post the full specs of your FreeNAS box, including motherboard: http://forums.freenas.org/index.php?threads/forum-rules.22553/#post-134570/
Do you run SMART long tests on your drives?"

Sadly, I'm about 400 miles away from the box right now for another 1-2 months. Best I can do is provide the output of dmidecode which tells us it is an EVGA 132-CK-NF78 (http://www.newegg.com/Product/Produ...8024&cm_re=132-CK-NF78-_-13-188-024-_-Product)

Code:
# dmidecode 2.12
SMBIOS 2.4 present.
38 structures occupying 1123 bytes.
Table at 0x000F0000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
    Vendor: Phoenix Technologies, LTD
    Version: 6.00 PG
    Release Date: 03/24/2008
    Address: 0xE0000
    Runtime Size: 128 kB
    ROM Size: 512 kB
    Characteristics:
        ISA is supported
        PCI is supported
        PNP is supported
        APM is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        BIOS ROM is socketed
        EDD is supported
        5.25"/360 kB floppy services are supported (int 13h)
        5.25"/1.2 MB floppy services are supported (int 13h)
        3.5"/720 kB floppy services are supported (int 13h)
        3.5"/2.88 MB floppy services are supported (int 13h)
        Print screen service is supported (int 5h)
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        CGA/mono video services are supported (int 10h)
        ACPI is supported
        USB legacy is supported
        LS-120 boot is supported
        ATAPI Zip drive boot is supported
        BIOS boot specification is supported
        Targeted content distribution is supported

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer:  EVGA
    Product Name: 132-CK-NF78
    Version: 2
    Serial Number: 1
    UUID: EF91D15A-164B-0400-0000-000000000000
    Wake-up Type: Power Switch
    SKU Number:
    Family:

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
    Manufacturer:  EVGA
    Product Name: 132-CK-NF78
    Version: 2
    Serial Number: 1

Handle 0x0003, DMI type 3, 17 bytes
Chassis Information
    Manufacturer:  EVGA
    Type: Desktop
    Lock: Not Present
    Version: 132-CK-NF78
    Serial Number:
    Asset Tag:
    Boot-up State: Unknown
    Power Supply State: Unknown
    Thermal State: Unknown
    Security Status: Unknown
    OEM Information: 0x00000000

Handle 0x0004, DMI type 4, 35 bytes
Processor Information
    Socket Designation: Socket 775
    Type: Central Processor
    Family: Other
    Manufacturer: Intel
    ID: 76 06 01 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 23, Stepping 6
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (FXSAVE and FXSTOR instructions supported)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Multi-threading)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Intel(R) Core(TM)2 Duo CPU
    Voltage: 1.7 V
    External Clock: 360 MHz
    Max Speed: 200 MHz
    Current Speed: 3240 MHz
    Status: Populated, Enabled
    Upgrade: ZIF Socket
    L1 Cache Handle: 0x000A
    L2 Cache Handle: 0x000B
    L3 Cache Handle: Not Provided
    Serial Number:
    Asset Tag:
    Part Number:

Handle 0x0005, DMI type 5, 24 bytes
Memory Controller Information
    Error Detecting Method: None
    Error Correcting Capabilities:
        None
    Supported Interleave: One-way Interleave
    Current Interleave: One-way Interleave
    Maximum Memory Module Size: 32 MB
    Maximum Total Memory Size: 128 MB
    Supported Speeds:
        70 ns
        60 ns
    Supported Memory Types:
        Standard
        EDO
    Memory Module Voltage: 5.0 V
    Associated Memory Slots: 4
        0x0006
        0x0007
        0x0008
        0x0009
    Enabled Error Correcting Capabilities: None

Handle 0x0006, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: A0
    Bank Connections: 0 1
    Current Speed: 10 ns
    Type: Other
    Installed Size: 2048 MB (Double-bank Connection)
    Enabled Size: 2048 MB (Double-bank Connection)
    Error Status: OK

Handle 0x0007, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: A1
    Bank Connections: 2 3
    Current Speed: 10 ns
    Type: Other
    Installed Size: 2048 MB (Double-bank Connection)
    Enabled Size: 2048 MB (Double-bank Connection)
    Error Status: OK

Handle 0x0008, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: A2
    Bank Connections: 4 5
    Current Speed: 10 ns
    Type: Other
    Installed Size: 2048 MB (Double-bank Connection)
    Enabled Size: 2048 MB (Double-bank Connection)
    Error Status: OK

Handle 0x0009, DMI type 6, 12 bytes
Memory Module Information
    Socket Designation: A3
    Bank Connections: 6 7
    Current Speed: 10 ns
    Type: Other
    Installed Size: 2048 MB (Double-bank Connection)
    Enabled Size: 2048 MB (Double-bank Connection)
    Error Status: OK

Handle 0x000A, DMI type 7, 19 bytes
Cache Information
    Socket Designation: Internal Cache
    Configuration: Enabled, Not Socketed, Level 1
    Operational Mode: Write Back
    Location: Internal
    Installed Size: 32 kB
    Maximum Size: 32 kB
    Supported SRAM Types:
        Synchronous
    Installed SRAM Type: Synchronous
    Speed: Unknown
    Error Correction Type: None
    System Type: Instruction
    Associativity: 8-way Set-associative

Handle 0x000B, DMI type 7, 19 bytes
Cache Information
    Socket Designation: External Cache
    Configuration: Enabled, Not Socketed, Level 2
    Operational Mode: Write Back
    Location: External
    Installed Size: 0 kB
    Maximum Size: 0 kB
    Supported SRAM Types:
        Synchronous
    Installed SRAM Type: Synchronous
    Speed: Unknown
    Error Correction Type: None
    System Type: Unified
    Associativity: 8-way Set-associative

Handle 0x000C, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: PRIMARY IDE
    Internal Connector Type: On Board IDE
    External Reference Designator: Not Specified
    External Connector Type: None
    Port Type: Other

Handle 0x000D, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: FDD
    Internal Connector Type: On Board Floppy
    External Reference Designator: Not Specified
    External Connector Type: None
    Port Type: 8251 FIFO Compatible

Handle 0x000E, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: COM1
    Internal Connector Type: 9 Pin Dual Inline (pin 10 cut)
    External Reference Designator:
    External Connector Type: DB-9 male
    Port Type: Serial Port 16450 Compatible

Handle 0x000F, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Keyboard
    Internal Connector Type: PS/2
    External Reference Designator:
    External Connector Type: PS/2
    Port Type: Keyboard Port

Handle 0x0010, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: PS/2 Mouse
    Internal Connector Type: PS/2
    External Reference Designator:
    External Connector Type: PS/2
    Port Type: Mouse Port

Handle 0x0011, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB0
    External Connector Type: Other
    Port Type: USB

Handle 0x0012, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB1
    External Connector Type: Other
    Port Type: USB

Handle 0x0013, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB2
    External Connector Type: Other
    Port Type: USB

Handle 0x0014, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB3
    External Connector Type: Other
    Port Type: USB

Handle 0x0015, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB4
    External Connector Type: Other
    Port Type: USB

Handle 0x0016, DMI type 8, 9 bytes
Port Connector Information
    Internal Reference Designator: Not Specified
    Internal Connector Type: None
    External Reference Designator: USB5
    External Connector Type: Other
    Port Type: USB

Handle 0x0017, DMI type 9, 13 bytes
System Slot Information
    Designation: PCI0
    Type: 32-bit PCI
    Current Usage: Available
    Length: Long
    ID: 1
    Characteristics:
        5.0 V is provided
        PME signal is supported

Handle 0x0018, DMI type 9, 13 bytes
System Slot Information
    Designation: PCI1
    Type: 32-bit PCI
    Current Usage: Available
    Length: Long
    ID: 2
    Characteristics:
        5.0 V is provided
        PME signal is supported

Handle 0x0019, DMI type 13, 22 bytes
BIOS Language Information
    Language Description Format: Long
    Installable Languages: 3
        n|US|iso8859-1
        n|US|iso8859-1
        r|CA|iso8859-1
    Currently Installed Language: n|US|iso8859-1

Handle 0x001A, DMI type 16, 15 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: None
    Maximum Capacity: 2 GB
    Error Information Handle: Not Provided
    Number Of Devices: 4

Handle 0x001B, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x001A
    Error Information Handle: Not Provided
    Total Width: 128 bits
    Data Width: 128 bits
    Size: 2048 MB
    Form Factor: DIMM
    Set: None
    Locator: A0
    Bank Locator: Bank0/1
    Type: DRAM
    Type Detail: None
    Speed: 800 MHz
    Manufacturer: None
    Serial Number: None
    Asset Tag: None
    Part Number: None

Handle 0x001C, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x001A
    Error Information Handle: Not Provided
    Total Width: 128 bits
    Data Width: 128 bits
    Size: 2048 MB
    Form Factor: DIMM
    Set: None
    Locator: A1
    Bank Locator: Bank2/3
    Type: DRAM
    Type Detail: None
    Speed: 800 MHz
    Manufacturer: None
    Serial Number: None
    Asset Tag: None
    Part Number: None

Handle 0x001D, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x001A
    Error Information Handle: Not Provided
    Total Width: 128 bits
    Data Width: 128 bits
    Size: 2048 MB
    Form Factor: DIMM
    Set: None
    Locator: A2
    Bank Locator: Bank4/5
    Type: DRAM
    Type Detail: None
    Speed: 800 MHz
    Manufacturer: None
    Serial Number: None
    Asset Tag: None
    Part Number: None

Handle 0x001E, DMI type 17, 27 bytes
Memory Device
    Array Handle: 0x001A
    Error Information Handle: Not Provided
    Total Width: 128 bits
    Data Width: 128 bits
    Size: 2048 MB
    Form Factor: DIMM
    Set: None
    Locator: A3
    Bank Locator: Bank6/7
    Type: DRAM
    Type Detail: None
    Speed: 800 MHz
    Manufacturer: None
    Serial Number: None
    Asset Tag: None
    Part Number: None

Handle 0x001F, DMI type 19, 15 bytes
Memory Array Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x001FFFFFFFF
    Range Size: 8 GB
    Physical Array Handle: 0x001A
    Partition Width: 1

Handle 0x0020, DMI type 20, 19 bytes
Memory Device Mapped Address
    Starting Address: 0x00000000000
    Ending Address: 0x0007FFFFFFF
    Range Size: 2 GB
    Physical Device Handle: 0x001B
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1

Handle 0x0021, DMI type 20, 19 bytes
Memory Device Mapped Address
    Starting Address: 0x00080000000
    Ending Address: 0x000FFFFFFFF
    Range Size: 2 GB
    Physical Device Handle: 0x001C
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1

Handle 0x0022, DMI type 20, 19 bytes
Memory Device Mapped Address
    Starting Address: 0x00100000000
    Ending Address: 0x0017FFFFFFF
    Range Size: 2 GB
    Physical Device Handle: 0x001D
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1

Handle 0x0023, DMI type 20, 19 bytes
Memory Device Mapped Address
    Starting Address: 0x00180000000
    Ending Address: 0x001FFFFFFFF
    Range Size: 2 GB
    Physical Device Handle: 0x001E
    Memory Array Mapped Address Handle: 0x001F
    Partition Row Position: 1

Handle 0x0024, DMI type 32, 11 bytes
System Boot Information
    Status: No errors detected

Handle 0x0025, DMI type 127, 4 bytes
End Of Table




In the past I have run SMART long tests on the disks. No issues were reported. I just kicked off smartctl -t long on each of the disks in the pool in question. I'll have new results in a few hours (it estimates 455 minutes for each).

depasseg said: "My guess without any more detail is that you are running out of RAM. Because the SSDs can flush faster, the problem isn't appearing on that pool. Because the spinning disks are slower, there is more buildup in the ZIL and stuff swaps.

Of course, it could also be a hardware issue. Have you run zpool status and looked for errors?

Does this happen with an empty pool?

Does this happen if you write locally to the pool (vs using a sharing protocol)?"

It could very well be the lack of RAM issue. Is there any way to determine that? Would adding more LOG cache on an SSD help the issue?

I will play around with tunables for ZFS tonight and see what I can do. Do you know if there is a way to push ARC into swap? I could then move swap to an SSD to get better performance (hopefully) out of swap and therefore better out of ARC (theoretically, based on my knowledge).

zpool status shows 36 read errors on one disk and 4 in the other. As I mentioned, the disks have seen their fair share of use and are approaching end of life within ~6 months.

I don't have an empty pool to test with. It'll be 1-2 months before I have physical access to the box again and am able to install some disks to test this theory. But no, this didn't happen on the SSD pool when it was originally installed, though I do acknowledge your point #1 above about the SSDs flushing faster.

I will perform some tests on local writes from the SSD pool to the HDD pool. I'm going to try cp, mv and rsync to see if any of those commands yield different results.

My goals here are to:
1. Determine whether a disk (or disks) is causing the problem, and
2. Determine which disk in the pool is causing the problem, if it is only one disk.
 

depasseg

JCasanova said: "It could very well be the lack of RAM issue. Is there any way to determine that? Would adding more LOG cache on an SSD help the issue?"
How much RAM is there? If it's less than 8GB, then it could be a problem. Look at the reporting graphs and see what the utilization looks like.

JCasanova said: "Do you know if there is a way to push ARC into swap? I could then move swap to an SSD to get better performance (hopefully) out of swap and therefore better out of ARC (theoretically based on my knowledge)."
You aren't reading, so the issue isn't with the ARC. I know a lot of the ARC logic is automatic, and basically will try to stay within the resources that are available. I don't think you would want to try to force it to swap.

JCasanova said: "zpool status shows 36 read errors on one disk and 4 in the other."
This could certainly be part of the problem. Let's see what the SMART tests look like in a couple hours.
 

JCasanova

8 GB of RAM.

I generated a ~7.5 GB tar.gz file for testing. It is on the SSD pool.

rsyncing from the SSD to a location on the HDD pool while monitoring with zpool iostat -v 1 shows some interesting results...

Write bandwidth stays at about 12 MB/sec on disk #1. The other disk mostly sits at 0 MB/sec but has bursts of 25-35 MB/sec. The transfer speed reported by rsync --progress ranges anywhere from 900 KB/sec to 35 MB/sec.

Once the transfer is complete, I see a lot of read operations/bandwidth on disk #1 and a lot of write requests on disk #2, at a constant 10-12 MB/sec. It's almost like one disk is being preferred for taking in data to the pool, and then once things calm down, the pool levels itself back out. I can confirm this by watching the free capacity values change in that manner during and after large write operations.

Could that just be caused by the fact that it is a stripe? Or shouldn't it split the write requests between the two disks?

But during this ~7.5 GB write operation, there was no point at which the pool completely stopped responding like I've seen via CIFS, NFS, or FTP.
 

depasseg

8GB should be enough.

Did you add one of the drives to create the stripe after the first drive already had a bunch of data on it? I'm with your diagnosis.

You can use a dd command to write to a large file in your dataset and monitor with the zpool iostat -v command. Keep in mind that if you use if=/dev/zero, that you need to disable compression on the dataset you are writing to or the results aren't really indicative.
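
For anyone who would rather script that write test, here is a rough Python sketch of the same idea. It is not a replacement for dd, just an illustration: it writes random (incompressible) data so compression cannot flatter the numbers, which sidesteps the /dev/zero caveat above. The function name, the chunk size, and the temporary-directory target are all placeholders I made up. It also assumes dedup is off on the dataset, since one random chunk is reused for the whole file.

```python
import os
import tempfile
import time


def write_incompressible(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of random data to path; return throughput in MB/sec.

    One random chunk is generated up front and reused, so the random number
    generator is not the bottleneck. This assumes dedup is disabled.
    """
    chunk = os.urandom(chunk_bytes)
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to commit the data before timing stops
    elapsed = max(time.monotonic() - start, 1e-9)
    return written / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Placeholder target: a temporary directory. To exercise a pool, point
    # this at a directory on the dataset under test instead.
    with tempfile.TemporaryDirectory() as target_dir:
        target = os.path.join(target_dir, "write_test.bin")
        rate = write_incompressible(target, 16 * 1024 * 1024)
        print(f"wrote {os.path.getsize(target)} bytes at {rate:.1f} MB/sec")
```

Run it with a much larger total_bytes (several GB) while zpool iostat -v 1 is running in another session, so the per-disk numbers can be compared against the overall rate.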
 

JCasanova

Both drives were blank when I created the striped pool.

I'd love to see what is going on at the block level with these write operations... get inside the head of why this is writing to the pool the way it is.

Same results with cp (as I expected).

I just realized that disk #1 has more free space than disk #2. Both disks are roughly the same size (Disk #1 is 1.57TB and Disk #2 is 1.61TB), yet Disk #1 has 36 GB more free space than Disk #2. Perhaps that is why it prefers to be written to? I'm going to continue to write data to the pool and see if the two disks level each other out.

I did delete about 400 GB of data from the pool just a few weeks ago and haven't written much data since.

Is there any way to force ZFS to level the disks out? Or maybe a thrifty way to cycle data to the SSD pool and then back onto the HDD pool so it is forced to effectively rewrite the entire pool? I bet I could write a very simple Python script to do this.
 

depasseg

I don't think you can balance the 2 drives, but you certainly can write a bunch of large files to them and see what happens. :)
 

JCasanova

depasseg said: "I don't think you can balance the 2 drives, but you certainly can write a bunch of large files to them and see what happens. :)"
I'm copying about 75 GB of data to the HDD pool to see if it attempts to balance out the free space across the two disks.

I feel like I'm off on a tangent, though.

I'll put together a python script that mv's files between my two pools to rebalance the stripe. If it is reliable I'll share the code here.
 

depasseg

Keep an eye on zpool iostat while it's copying. I'm curious whether things change once the ~35 GB delta is balanced.
 

JCasanova

I have a crude python script that copies one file to the SSD pool via rsync, removes it from its original location on the HDD pool... then rsyncs it back to its original location and removes the temp copy on the SSD pool... it does this sequentially for every file in a provided directory and sub-directories...
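
To make the sequence concrete, here is a minimal sketch of that round trip. It is not the actual script: shutil.copy2 stands in for rsync, the SHA-256 check before each delete is an extra safeguard added for illustration, and the names rewrite_file, rewrite_tree, root, and staging_dir are made up. It assumes staging_dir lives on the other pool and outside root. Note that copy2 keeps timestamps and permission bits but not ownership or ACLs, and that hard links would be broken.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_bytes=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            h.update(chunk)
    return h.hexdigest()


def rewrite_file(src, staging_dir):
    """Round-trip one file through staging_dir so it gets rewritten in place."""
    staged = os.path.join(staging_dir, os.path.basename(src) + ".staged")
    if os.path.exists(staged):
        # A leftover from an interrupted run may be the only good copy.
        raise RuntimeError(f"{staged} already exists; resolve it by hand first")
    original_sum = sha256_of(src)

    shutil.copy2(src, staged)
    if sha256_of(staged) != original_sum:
        os.remove(staged)
        raise RuntimeError(f"staged copy of {src} is bad; original left untouched")

    # Only now is it safe to drop the original and write it back.
    os.remove(src)
    shutil.copy2(staged, src)
    if sha256_of(src) != original_sum:
        raise RuntimeError(f"restored {src} is bad; good copy kept at {staged}")
    os.remove(staged)


def rewrite_tree(root, staging_dir):
    """Rewrite every regular file under root, one at a time; return the count."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a plain file
            rewrite_file(path, staging_dir)
            count += 1
    return count
```

The ordering is the point of the sketch: the original is only removed after the staged copy verifies, and the staged copy is only removed after the restored file verifies, so an interruption at any step should leave at least one intact copy on disk.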

There's still ~5 GB delta between the two resident disks... Disk #1 is still lagging behind and being preferred to be written to. I'm going to let it run through one of the larger bulk storage datasets for a while and then I'll do some more write testing to see what happens. Currently the free space for each disk reads at 278G/272G.

What's interesting now is that, according to zpool iostat -v 1, I'm seeing different values for allocated capacity than I did earlier: 1.54T/1.55T compared to the 1.57T/1.61T we saw earlier. Any idea why that could have changed?
 

JCasanova

The python script is still running. I've noticed that the box responds very slowly if at all when the HDD pool is being written to. Disk #1 rarely goes over 13 MB/sec write bandwidth. Disk #2 seems to perform writes in bursts; essentially waiting for Disk #1 to catch up.

I also started monitoring disk activity with a second tool. My background in cyber security has taught me well not to trust a single tool. Running systat -iostat reports that tps and MB/s are more balanced between the two disks than zpool iostat reports. I'll have to look into how each tool determines its statistics; unlike in cyber security, I don't suspect a rootkit in this case. Perhaps zpool iostat polls the data in a different way, a way in which Disk #2 is failing?

The script hasn't finished running. Progress is slow when trying to rewrite ~3 TB of data at about 12 MB/sec. I did transfer a few hundred GB of external data to the pool last night, so free space isn't comparable to yesterday. What is important here is the delta between the two disks: Disk #1 is at 137 G free and Disk #2 is at 149 G free. What's further odd is that the allocated capacity has increased again, to 1.68T/1.67T respectively. I wonder what could be causing that to change.

smartctl results on that long test... Disk #1 failed, Disk #2 passed.

Disk #1 has over 500 pending sectors; Disk #2 has 33. Time to replace both disks. What's really interesting is that both of these disks were installed new at the same time, yet Disk #1 is reading 18872 power_on_hours and Disk #2 is reading 51527 power_on_hours. 51,527 hours is over 5 years (close to 6), but I haven't had the disks that long. Refurbished somehow? The 18872 hours works out to a little over 2 years, which is about right.
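
As a quick sanity check on those numbers, here is a small Python sketch that pulls the raw values out of smartctl attribute rows and converts power-on hours to years. The sample rows are copied from the Disk #1 (ada1) output in this post; the function names and the choice of which attributes to watch are my own, and the parser simply assumes the ten-column layout that smartctl 6.3 prints here.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   184   184   140    Pre-fail  Always       -       319
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18872
197 Current_Pending_Sector  0x0032   199   198   000    Old_age   Always       -       570
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       507
"""

# Attributes where any non-zero raw value deserves a closer look (my choice).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name -> raw value for rows that look like attribute lines."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return attrs


def nonzero_warnings(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


def hours_to_years(hours):
    """Convert power-on hours to years, using 365.25 days per year."""
    return hours / (24 * 365.25)


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(nonzero_warnings(attrs))
    print(f"Disk #1: {hours_to_years(attrs['Power_On_Hours']):.2f} years")
    print(f"Disk #2: {hours_to_years(51527):.2f} years")
```

With the figures above, 18872 hours comes to about 2.15 years and 51527 hours to about 5.88 years.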

Anyway, thanks for the assist and letting me bounce ideas off of you. I really appreciate it.

I may make that striped pool write balance script available if I can work in some solid error handling to mitigate data loss. Right now, if the session gets disconnected (I'm working remotely), there is a potential for data loss if I'm not careful when I restart the script.

Code:
joe@SAN01:~ % sudo smartctl -a /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA5575004
LU WWN Device Id: 5 0014ee 205a7f76a
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Apr  2 08:16:14 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x85)    Offline data collection activity
                    was aborted by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  73)    The previous self-test completed having
                    a test element that failed and the test
                    element that failed is not known.
Total time to complete Offline 
data collection:         (39660) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 382) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 2615306
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       941
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       5265
  5 Reallocated_Sector_Ct   0x0033   184   184   140    Pre-fail  Always       -       319
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18872
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       5261
192 Power-Off_Retract_Count 0x0032   195   195   000    Old_age   Always       -       4005
193 Load_Cycle_Count        0x0032   129   129   000    Old_age   Always       -       213001
194 Temperature_Celsius     0x0022   117   088   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       203
197 Current_Pending_Sector  0x0032   199   198   000    Old_age   Always       -       570
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       507
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   124   124   000    Old_age   Offline      -       20298

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: unknown failure    90%     18868         -
# 2  Extended offline    Completed: unknown failure    90%     18853         -
# 3  Extended offline    Completed: unknown failure    90%     18844         -
# 4  Extended offline    Completed: unknown failure    90%     18821         -
# 5  Extended offline    Completed: unknown failure    90%     18798         -
# 6  Extended offline    Completed: unknown failure    90%     18774         -
# 7  Extended offline    Completed: unknown failure    90%     18726         -
# 8  Extended offline    Completed: unknown failure    90%     18699         -
# 9  Extended offline    Completed: unknown failure    90%     18675         -
#10  Extended offline    Completed: unknown failure    90%     18651         -
#11  Extended offline    Completed: unknown failure    90%     18628         -
#12  Extended offline    Completed: unknown failure    90%     18604         -
#13  Extended offline    Completed: unknown failure    90%     18561         -
#14  Extended offline    Completed: unknown failure    90%     18537         -
#15  Extended offline    Completed: unknown failure    90%     18513         -
#16  Extended offline    Completed: unknown failure    90%     18490         -
#17  Extended offline    Completed: unknown failure    90%     18466         -
#18  Extended offline    Completed: unknown failure    90%     18442         -
#19  Extended offline    Completed: unknown failure    90%     18394         -
#20  Extended offline    Completed: unknown failure    90%     18370         -
#21  Extended offline    Completed: unknown failure    90%     18346         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

joe@SAN01:~ % sudo smartctl -a /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green
Device Model:     WDC WD20EADS-11R6B1
Serial Number:    WD-WCAVY1234580
LU WWN Device Id: 5 0014ee 258ea0529
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Apr  2 08:16:18 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 120)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:         (39960) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 455) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x3031)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   194   192   051    Pre-fail  Always       -       158113
  3 Spin_Up_Time            0x0027   168   136   021    Pre-fail  Always       -       8600
  4 Start_Stop_Count        0x0032   085   085   000    Old_age   Always       -       15629
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       51528
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       355
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       182
193 Load_Cycle_Count        0x0032   180   180   000    Old_age   Always       -       60875
194 Temperature_Celsius     0x0022   117   083   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       204
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       33
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   189   000    Old_age   Offline      -       88

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%     51524         405239576
# 2  Extended offline    Completed: read failure       90%     51509         274535448
# 3  Extended offline    Completed: read failure       90%     51500         274535448
# 4  Extended offline    Completed: read failure       90%     51477         369064784
# 5  Extended offline    Completed: read failure       90%     51454         420742128
# 6  Extended offline    Completed: read failure       60%     51432         1313692272
# 7  Extended offline    Completed: read failure       10%     51400         1313692272
# 8  Extended offline    Completed: read failure       90%     51355         441755920
# 9  Extended offline    Completed: read failure       70%     51332         1119472032
#10  Extended offline    Completed: read failure       90%     51308         420742128
#11  Extended offline    Completed: read failure       90%     51284         399998096
#12  Extended offline    Completed: read failure       90%     51260         399998096
#13  Extended offline    Completed: read failure       70%     51219         1313692272
#14  Extended offline    Completed: read failure       70%     51195         1313692272
#15  Extended offline    Completed: read failure       90%     51170         420742128
#16  Extended offline    Completed: read failure       70%     51147         1313692272
#17  Extended offline    Completed: read failure       90%     51122         420742128
#18  Extended offline    Completed: read failure       90%     51098         428595904
#19  Extended offline    Completed: read failure       10%     51056         420742128
#20  Extended offline    Completed: read failure       90%     51026         392303584
#21  Extended offline    Completed: read failure       90%     51002         420742128

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Both of your drives are failing SMART tests consistently (and FreeNAS should have been warning you of this if you set up email alerting properly), and have been for at least the last 500 hours of operating time, so yes, they definitely need to be replaced, sooner rather than later. And, though it really isn't directly relevant to the problems you're seeing, your SMART testing schedule is far more aggressive than needed. There's no reason at all to be running long tests daily (or even more often). Weekly is plenty, and even that is more often than most here run.
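To make the "last 500 hours" and "daily" figures concrete, here is a rough sketch that tallies the ada2 self-test log pasted above. The function name `summarize_selftests` and the regular expression are illustrative assumptions, not part of smartctl or FreeNAS; the pattern only assumes the row layout shown in this thread's smartctl 6.3 output.

```python
import re

# Self-test log rows for ada2, copied verbatim from the smartctl output above.
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure       80%     51524         405239576
# 2  Extended offline    Completed: read failure       90%     51509         274535448
# 3  Extended offline    Completed: read failure       90%     51500         274535448
# 4  Extended offline    Completed: read failure       90%     51477         369064784
# 5  Extended offline    Completed: read failure       90%     51454         420742128
# 6  Extended offline    Completed: read failure       60%     51432         1313692272
# 7  Extended offline    Completed: read failure       10%     51400         1313692272
# 8  Extended offline    Completed: read failure       90%     51355         441755920
# 9  Extended offline    Completed: read failure       70%     51332         1119472032
#10  Extended offline    Completed: read failure       90%     51308         420742128
#11  Extended offline    Completed: read failure       90%     51284         399998096
#12  Extended offline    Completed: read failure       90%     51260         399998096
#13  Extended offline    Completed: read failure       70%     51219         1313692272
#14  Extended offline    Completed: read failure       70%     51195         1313692272
#15  Extended offline    Completed: read failure       90%     51170         420742128
#16  Extended offline    Completed: read failure       70%     51147         1313692272
#17  Extended offline    Completed: read failure       90%     51122         420742128
#18  Extended offline    Completed: read failure       90%     51098         428595904
#19  Extended offline    Completed: read failure       10%     51056         420742128
#20  Extended offline    Completed: read failure       90%     51026         392303584
#21  Extended offline    Completed: read failure       90%     51002         420742128
"""

# Assumed row layout: "#N  <test>  <status>  <remaining>%  <hours>  <LBA or ->".
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def summarize_selftests(log_text):
    """Return (tests parsed, tests failed, hours spanned, mean gap in hours)."""
    rows = [m for m in map(ROW.match, log_text.splitlines()) if m]
    hours = sorted(int(m.group(5)) for m in rows)        # LifeTime(hours) column
    failed = sum("failure" in m.group(3) for m in rows)  # Status column
    span = hours[-1] - hours[0]
    mean_gap = span / (len(hours) - 1)
    return len(rows), failed, span, mean_gap


total, failed, span, mean_gap = summarize_selftests(SELFTEST_LOG)
print(f"{failed} of {total} tests failed over {span} power-on hours; "
      f"mean gap {mean_gap:.1f} h")
```

For ada2 this reports 21 of 21 tests failed over 522 power-on hours, with an average of about 26 hours between long tests (some gaps are as short as 9 hours). The other drive's log covers the same 522-hour window, from 18346 to 18868 hours.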
 