Checksum Errors and Degraded/Removed Drives

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
I am at my wit’s end with this, so I’m hoping that someone here might recognize my problem or at least have some idea of where to go next.

Here’s my setup:
Supermicro X10SLL-F-O
Intel i3-4360
32GB Crucial DDR3L PC3-12800 ECC (4x8GB)
4x WD Red 3TB (two sets of mirrored drives, RAID-10)
8GB SanDisk CZ36 (OS drive)
835W Raidmax 80 Bronze PSU
FreeNAS 9.10-STABLE
3 Jails (Plex, VirtualBox, Generic BSD)

My system has two apparent symptoms: (1) “zpool status” shows many checksum errors and occasionally read/write errors for individual disks (especially under load, like during resilvering), and (2) individual disks periodically show as REMOVED or DEGRADED, causing my pool to become DEGRADED.

The problem seems to affect one or two disks at a time, but (knock on wood) never more than two at once. It also hits different drives at different times; no one drive appears to be more culpable than any other.

All drives pass a SMART long test.

Memtest86 ran for 8 hours with no errors.

I replaced the PSU and drive power cables. No difference.

I tried replacing all the SATA cables – no difference.

Despite the SMART test, I tried replacing the drive with the most errors (at the time) with a brand-new drive (of the same type), and checksum errors immediately appeared for that drive.

I disconnected my drives from the on-board SATA controller, added an LSI SAS9211-8i, flashed it to p20-IT, and connected all my drives to it instead. The problem continues.

So, before I buy a new motherboard and CPU and/or reconfigure my system with a sledgehammer – does anyone have any idea what I might be dealing with here? Could it be settings/configuration instead of hardware? Happy to provide any diagnostic data that might be useful. An example of the kernel log while this is occurring is below.

Thanks

> (ada2:ata2:0:0:0): WRITE_DMA48. ACB: 35 00 18 29 17 40 43 00 00 00 58 00
> (ada2:ata2:0:0:0): CAM status: ATA Status Error
> (ada2:ata2:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 10 (IDNF )
> (ada2:ata2:0:0:0): RES: 51 10 18 29 17 43 43 00 00 58 00
> (ada2:ata2:0:0:0): Retrying command
> (ada2:ata2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada2:ata2:0:0:0): CAM status: Command timeout
> (ada2:ata2:0:0:0): Retrying command
> (ada2:ata2:0:0:0): WRITE_DMA. ACB: ca 00 c0 bb 2b 43 00 00 00 00 08 00
> (ada2:ata2:0:0:0): CAM status: ATA Status Error
> (ada2:ata2:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 10 (IDNF )
> (ada2:ata2:0:0:0): RES: 51 10 c0 bb 2b 03 03 00 00 08 00
> (ada2:ata2:0:0:0): Retrying command
> (ada3:ata3:0:0:0): WRITE_DMA. ACB: ca 00 08 07 2d 43 00 00 00 00 00 00
> (ada3:ata3:0:0:0): CAM status: ATA Status Error
> (ada3:ata3:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 10 (IDNF )
> (ada3:ata3:0:0:0): RES: 51 10 08 07 2d 03 03 00 00 30 00
> (ada3:ata3:0:0:0): Retrying command
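
For readers unfamiliar with the hex codes in the log above: the `ATA status` and `error` values are bit fields. Here is a small Python sketch that decodes them. The helper name `decode` is made up for illustration, and the labels for bits that do not appear in this thread are the traditional ATA names, which may differ from what a given tool prints.

```python
# Traditional ATA register bit names. Only DRDY, SERV, ERR, DF, IDNF and ABRT
# are confirmed by output shown in this thread; the other labels are an
# assumption based on the classic ATA naming.
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "SERV"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR"),
]
ERROR_BITS = [
    (0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
    (0x08, "MCR"), (0x04, "ABRT"), (0x02, "NM"), (0x01, "ILI"),
]


def decode(value, table):
    """Return the names of all bits set in an 8-bit register value."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))  # status byte from the kernel log
print(decode(0x10, ERROR_BITS))   # error byte from the kernel log
```

Decoding 0x51 gives DRDY, SERV and ERR, and 0x10 gives IDNF, which matches the labels the kernel prints next to the raw values.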

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
  1. When was the last time a scrub was run?
  2. What is the output of zpool status (In CODE Tags please)?
  3. Can you post the SMART results (In CODE Tags please)?

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
Thanks for replying. Outputs are below (and in the next post). The most recent scrub was May 27; the scrub scheduled for June 15 didn't happen because the pool was degraded at the scheduled time.

Code:
[root@freenas] ~# zpool status
  pool: WDRED3TBx2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 64.1G in 8h25m with 0 errors on Fri Jun 24 07:28:53 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        WDRED3TBx2                                      DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/ff71eac0-a8c2-11e5-bcc9-002590477413  DEGRADED     0     0     0  too many errors
            gptid/b222ddd0-363a-11e5-a23f-002590477413  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/821f355f-38e1-11e6-96ce-002590477413  ONLINE       0     0     0
            gptid/1fe18271-31aa-11e6-b193-002590477413  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Fri May 27 03:50:08 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/8fe17c7e-b50a-11e4-94cf-002590477413  ONLINE       0     0     0

errors: No known data errors
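
Since the affected drive keeps changing, it can help to log which devices `zpool status` flags over time. The following is only a rough Python sketch: the function name `suspect_devices` and the regular expression are assumptions based on the layout of the output shown above, not an official parser, and other ZFS versions may format rows differently.

```python
import re

# One device row: NAME  STATE  READ  WRITE  CKSUM  [optional note].
# Counters are kept as text because large values may be abbreviated (e.g. 1.2K).
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)"
    r"\s+(?P<read>\d[\d.]*[KMGTPE]?)"
    r"\s+(?P<write>\d[\d.]*[KMGTPE]?)"
    r"\s+(?P<cksum>\d[\d.]*[KMGTPE]?)"
    r"(?:\s+(?P<note>\S.*))?\s*$"
)


def suspect_devices(status_text):
    """Return (name, state, counters, note) for rows that look unhealthy.

    A row is reported if its state is not ONLINE or any counter is not "0".
    """
    suspects = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, blank line, or free-form status text
        counters = (m["read"], m["write"], m["cksum"])
        if m["state"] != "ONLINE" or any(c != "0" for c in counters):
            note = (m["note"] or "").strip()
            suspects.append((m["name"], m["state"], counters, note))
    return suspects
```

Fed the output above, this would report the pool, `mirror-0`, and the one `gptid/ff71eac0-...` member marked "too many errors". Saving the result with a timestamp on each run would show whether the errors really do move between drives.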



Code:
[root@freenas] ~# smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N4TU2UTF
LU WWN Device Id: 5 0014ee 260bdfb98
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 24 18:58:58 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40860) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 410) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   181   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9072
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3171
194 Temperature_Celsius     0x0022   112   108   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 4005 hours (166 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 58 02 40 40  Error: IDNF at LBA = 0x00400258 = 4194904

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 58 02 40 40 08   2d+22:40:30.905  WRITE DMA

Error 4 occurred at disk power-on lifetime: 3995 hours (166 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 c0 01 40 40  Error: IDNF at LBA = 0x004001c0 = 4194752

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 c0 01 40 40 08   2d+12:55:55.992  WRITE DMA

Error 3 occurred at disk power-on lifetime: 4 hours (0 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 00 02 40 40  Error: IDNF at LBA = 0x00400200 = 4194816

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 00 02 40 40 08      00:10:56.540  WRITE DMA
  ca 00 20 d0 22 40 40 00      00:10:53.067  WRITE DMA

Error 2 occurred at disk power-on lifetime: 4 hours (0 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 a8 01 40 40  Error: IDNF at LBA = 0x004001a8 = 4194728

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 a8 01 40 40 08      00:02:46.456  WRITE DMA

Error 1 occurred at disk power-on lifetime: 4 hours (0 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 e0 a0 00 40 40  Error: IDNF at LBA = 0x004000a0 = 4194464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 e0 a0 00 40 40 08      00:08:55.796  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9069         -
# 2  Extended offline    Interrupted (host reset)      90%      9009         -
# 3  Short offline       Interrupted (host reset)      80%      8984         -
# 4  Short offline       Completed without error       00%      8936         -
# 5  Short offline       Completed without error       00%      8888         -
# 6  Short offline       Completed without error       00%      8840         -
# 7  Short offline       Completed without error       00%      8792         -
# 8  Short offline       Completed without error       00%      8744         -
# 9  Short offline       Completed without error       00%      8696         -
#10  Extended offline    Completed without error       00%      8681         -
#11  Short offline       Completed without error       00%      8648         -
#12  Short offline       Completed without error       00%      8600         -
#13  Short offline       Completed without error       00%      8553         -
#14  Short offline       Completed without error       00%      8505         -
#15  Short offline       Completed without error       00%      8481         -
#16  Short offline       Completed without error       00%      8433         -
#17  Short offline       Completed without error       00%      8385         -
#18  Short offline       Completed without error       00%      8337         -
#19  Short offline       Completed without error       00%      8289         -
#20  Extended offline    Completed without error       00%      8274         -
#21  Short offline       Completed without error       00%      8241         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
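
A note on how the error log above arrives at its "LBA =" values: for these 28-bit WRITE DMA commands the address is assembled from the SN, CL, CH and DH register bytes. The Python sketch below reproduces the arithmetic; the function name `lba28` is just an illustrative choice.

```python
def lba28(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    # The low nibble of DH carries LBA bits 24-27; its high nibble holds flags.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# (SN, CL, CH, DH) copied from the five error entries above, newest first.
entries = [
    (0x58, 0x02, 0x40, 0x40),
    (0xC0, 0x01, 0x40, 0x40),
    (0x00, 0x02, 0x40, 0x40),
    (0xA8, 0x01, 0x40, 0x40),
    (0xA0, 0x00, 0x40, 0x40),
]
for regs in entries:
    lba = lba28(*regs)
    print(f"0x{lba:08x} = {lba}")
```

The results match the five addresses smartctl reports. All of them lie no more than 600 sectors past LBA 0x00400000 (4194304), which is 2 GiB into the disk at 512 bytes per logical sector.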


Code:
[root@freenas] ~# smartctl -a /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0K3H2J4
LU WWN Device Id: 5 0014ee 6b046394a
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 24 18:59:48 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 401) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   180   021    Pre-fail  Always       -       5958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7981
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       219
194 Temperature_Celsius     0x0022   111   108   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7979         -
# 2  Short offline       Completed without error       00%      7941         -
# 3  Extended offline    Completed without error       00%      7926         -
# 4  Short offline       Completed without error       00%      7894         -
# 5  Short offline       Completed without error       00%      7846         -
# 6  Short offline       Completed without error       00%      7798         -
# 7  Short offline       Completed without error       00%      7750         -
# 8  Short offline       Completed without error       00%      7702         -
# 9  Short offline       Completed without error       00%      7654         -
#10  Short offline       Completed without error       00%      7606         -
#11  Extended offline    Completed without error       00%      7591         -
#12  Short offline       Completed without error       00%      7558         -
#13  Short offline       Completed without error       00%      7510         -
#14  Short offline       Completed without error       00%      7462         -
#15  Short offline       Completed without error       00%      7414         -
#16  Short offline       Completed without error       00%      7390         -
#17  Short offline       Completed without error       00%      7342         -
#18  Short offline       Completed without error       00%      7294         -
#19  Short offline       Completed without error       00%      7246         -
#20  Short offline       Completed without error       00%      7198         -
#21  Extended offline    Completed without error       00%      7183         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
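
With four of these dumps to compare, it may be easier to pull out just the handful of attributes people usually look at first. This Python sketch extracts raw values from the attribute table; the shortlist in `WATCHED` (IDs 5, 197, 198, 199) is an assumption about what is worth watching, and the function name `watched_raw_values` is made up for the example.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Assumed shortlist: reallocated, pending and uncorrectable sectors, CRC errors.
WATCHED = {5, 197, 198, 199}


def watched_raw_values(smart_text):
    """Map attribute name to raw value for the watched attribute IDs."""
    found = {}
    for line in smart_text.splitlines():
        m = ATTR_ROW.match(line)
        if m and int(m["id"]) in WATCHED:
            found[m["name"]] = int(m["raw"])
    return found
```

Running it on each saved `smartctl -a` output gives one small dictionary per drive, which makes it quick to confirm that the counters shown as 0 above stay at 0 after the next round of errors.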
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
Code:
[root@freenas] ~# smartctl -a /dev/da2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N7AJY415
LU WWN Device Id: 5 0014ee 26270d2d2
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 24 19:01:04 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39720) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 399) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   176   021    Pre-fail  Always       -       6141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       42
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       31
194 Temperature_Celsius     0x0022   110   107   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 9 hours (0 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 45 00 00 00 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 45 00 00 00 e0 00      02:36:05.222  SET FEATURES [Set transfer mode]
  ef 03 45 00 00 00 e0 00      02:36:05.169  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 e0 00      02:36:05.051  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      02:35:45.809  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      02:35:45.808  IDENTIFY DEVICE

Error 7 occurred at disk power-on lifetime: 9 hours (0 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 45 00 00 00 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 45 00 00 00 e0 00      02:36:05.169  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 e0 00      02:36:05.051  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      02:35:45.809  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      02:35:45.808  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      02:35:45.806  IDENTIFY DEVICE

Error 6 occurred at disk power-on lifetime: 9 hours (0 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      02:35:45.809  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      02:35:45.808  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      02:35:45.806  IDENTIFY DEVICE
  ef 03 45 00 00 00 e0 00      00:02:26.403  SET FEATURES [Set transfer mode]
  ef 03 45 00 00 00 e0 00      00:02:26.350  SET FEATURES [Set transfer mode]

Error 5 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 45 00 00 00 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 45 00 00 00 e0 00      00:02:26.403  SET FEATURES [Set transfer mode]
  ef 03 45 00 00 00 e0 00      00:02:26.350  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 e0 00      00:02:26.233  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      00:02:07.081  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      00:02:07.080  IDENTIFY DEVICE

Error 4 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 45 00 00 00 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 45 00 00 00 e0 00      00:02:26.350  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 e0 00      00:02:26.233  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      00:02:07.081  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 00 00      00:02:07.080  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:02:07.078  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        40         -
# 2  Short offline       Completed without error       00%        32         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
[root@freenas] ~# smartctl -a /dev/da3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N4TU23EZ
LU WWN Device Id: 5 0014ee 20b689989
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 24 19:01:55 2016 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (42480) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 426) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   212   181   021    Pre-fail  Always       -       4400
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       45
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6022
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7531
194 Temperature_Celsius     0x0022   114   109   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 151 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 151 occurred at disk power-on lifetime: 5980 hours (249 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 70 32 6f ee  Error: UNC at LBA = 0x0e6f3270 = 242168432

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 70 32 6f ee 00      00:16:25.752  READ DMA
  c8 00 00 70 31 6f ee 00      00:16:25.750  READ DMA
  c8 00 a0 d0 30 6f ee 00      00:16:25.748  READ DMA
  c8 00 b0 20 30 6f ee 00      00:16:25.746  READ DMA
  c8 00 88 98 2f 6f ee 00      00:16:25.744  READ DMA

Error 150 occurred at disk power-on lifetime: 5980 hours (249 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 c0 bb fd ef  Error: UNC 64 sectors at LBA = 0x0ffdbbc0 = 268286912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 40 c0 bb fd ef 00      00:15:44.777  READ DMA

Error 149 occurred at disk power-on lifetime: 5980 hours (249 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 f8 1c 0b e2  Error: UNC 24 sectors at LBA = 0x020b1cf8 = 34282744

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 18 f8 1c 0b e2 00      00:04:36.373  READ DMA
  c8 00 08 e0 02 0b e2 00      00:04:36.373  READ DMA
  c8 00 08 80 02 0b e2 00      00:04:36.372  READ DMA
  c8 00 50 78 02 0b e2 00      00:04:36.372  READ DMA
  c8 00 08 c0 01 0b e2 00      00:04:36.372  READ DMA

Error 148 occurred at disk power-on lifetime: 5980 hours (249 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 5a fa ee  Error: UNC at LBA = 0x0efa5a18 = 251288088

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 18 5a fa ee 00      00:02:58.154  READ DMA

Error 147 occurred at disk power-on lifetime: 5980 hours (249 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d8 00 80 fe ee  Error: UNC 216 sectors at LBA = 0x0efe8000 = 251559936

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 d8 00 80 fe ee 00      00:02:49.980  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6020         -
# 2  Extended offline    Completed without error       00%      5969         -
# 3  Short offline       Completed without error       00%      5936         -
# 4  Short offline       Completed without error       00%      5888         -
# 5  Short offline       Completed without error       00%      5840         -
# 6  Short offline       Completed without error       00%      5792         -
# 7  Short offline       Completed without error       00%      5744         -
# 8  Short offline       Completed without error       00%      5696         -
# 9  Short offline       Completed without error       00%      5648         -
#10  Extended offline    Completed without error       00%      5633         -
#11  Short offline       Completed without error       00%      5600         -
#12  Short offline       Completed without error       00%      5552         -
#13  Short offline       Completed without error       00%      5504         -
#14  Short offline       Completed without error       00%      5456         -
#15  Short offline       Completed without error       00%      5432         -
#16  Short offline       Completed without error       00%      5384         -
#17  Short offline       Completed without error       00%      5336         -
#18  Short offline       Completed without error       00%      5288         -
#19  Short offline       Completed without error       00%      5240         -
#20  Extended offline    Completed without error       00%      5225         -
#21  Short offline       Completed without error       00%      6385         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
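
For anyone reading the error log above by hand: each register row can be decoded directly. The following is a minimal sketch, not part of smartctl, assuming 28-bit commands such as READ DMA (0xc8), where the LBA is spread across the SN, CL, and CH bytes plus the low nibble of DH. The function name `decode_error_registers` is made up for illustration.

```python
def decode_error_registers(line):
    """Decode the 'ER ST SC SN CL CH DH' bytes from one smartctl error-log row.

    Assumes a 28-bit command such as READ DMA (0xc8), where the LBA is spread
    over SN (bits 0-7), CL (8-15), CH (16-23) and the low nibble of DH (24-27).
    """
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return {
        "lba": ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn,
        "unc": bool(er & 0x40),   # bit 6: uncorrectable data error
        "abrt": bool(er & 0x04),  # bit 2: command aborted
    }


# Register rows copied from the da3 error log above.
row_151 = "40 51 00 70 32 6f ee  Error: UNC at LBA = 0x0e6f3270 = 242168432"
row_150 = "40 51 40 c0 bb fd ef  Error: UNC 64 sectors at LBA = 0x0ffdbbc0 = 268286912"

for row in (row_151, row_150):
    info = decode_error_registers(row)
    print(info["lba"], "UNC" if info["unc"] else "", "ABRT" if info["abrt"] else "")
```

The decoded values match the LBAs smartctl prints on the same rows, which is a quick sanity check that a row is being read correctly.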
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
How hot do your drives get during a scrub or under heavy load? At idle they are kind of warm, though not too bad, but when they are all working they might be overheating.
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
I have the system set to notify me if the drives go over 45 degrees, and I've never gotten a notice.
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
Haven't restarted the system since my original post. zpool status now outputs this:

Code:
[root@freenas] ~# zpool status
  pool: WDRED3TBx2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 64.1G in 8h25m with 0 errors on Fri Jun 24 07:28:53 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        WDRED3TBx2                                      DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/ff71eac0-a8c2-11e5-bcc9-002590477413  DEGRADED     0     0   714  too many errors
            gptid/b222ddd0-363a-11e5-a23f-002590477413  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/821f355f-38e1-11e6-96ce-002590477413  ONLINE       0     0     0
            gptid/1fe18271-31aa-11e6-b193-002590477413  ONLINE       0     0     5

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Fri May 27 03:50:08 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/8fe17c7e-b50a-11e4-94cf-002590477413  ONLINE       0     0     0

errors: No known data errors
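
When tracking which devices are accumulating errors over time, it can help to pull the nonzero rows out of `zpool status` automatically. The following is a rough sketch under stated assumptions: it reads the text of `zpool status`, treats any row whose READ/WRITE/CKSUM columns look like counts as a device row, and keeps the counts as strings because zpool abbreviates large values (for example 3.39K). The helper name `devices_with_errors` is hypothetical.

```python
import re

# Counts are plain integers, or abbreviated values such as 3.39K.
COUNT = re.compile(r"^\d+(\.\d+)?[KMGT]?$")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any nonzero count."""
    flagged = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue
        name, state, counts = parts[0], parts[1], parts[2:5]
        # Skip headers and prose lines whose columns are not counts.
        if not all(COUNT.match(c) for c in counts):
            continue
        if any(c != "0" for c in counts):
            flagged.append((name, state) + tuple(counts))
    return flagged


# Config rows copied from the zpool status output above.
SAMPLE = """
        NAME                                            STATE     READ WRITE CKSUM
        WDRED3TBx2                                      DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/ff71eac0-a8c2-11e5-bcc9-002590477413  DEGRADED     0     0   714  too many errors
            gptid/b222ddd0-363a-11e5-a23f-002590477413  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/821f355f-38e1-11e6-96ce-002590477413  ONLINE       0     0     0
            gptid/1fe18271-31aa-11e6-b193-002590477413  ONLINE       0     0     5
"""

for row in devices_with_errors(SAMPLE):
    print(row)
```

Keying on the gptid rather than the /dev/daN name keeps the history consistent across reboots and controller changes.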
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Just a shot in the dark here, but could a Firmware update for the Hard Drives be considered?
"WD Utility for RED drives with High Load Cycle Counts"

Since I don't use WDs, I am not too familiar with "widdle", "twiddle" or whatever it is called so that might be considered too?
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
I just ran WDIDLE3 and it reports all my drives are set to 300 seconds, which appears to be the correct setting. Appreciate the idea though.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Seems to me that since you went so far as to change the drive controller (From MB SATA to LSI 9211-8I) and the problem still persists, then it very well may just be the drive(s). If you have past logs, check to see if the errors all pertain to "ff71eac0-a8c2-11e5-bcc9-002590477413" and don't focus on "/dev/da#" since that can change.

Before you go using the sledgehammer technique, it may be worthwhile to actually try another drive (or two)...

P.S., check to see if your drives are under warranty and perhaps an RMA is in order?
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
Based on my email notifications, each of the following drives has had issues (most recent issue in parentheses):

(A) gptid/821f355f-38e1-11e6-96ce-002590477413 (6/23: UNAVAIL, no errors)*
(B) gptid/ff71eac0-a8c2-11e5-bcc9-002590477413 (6/25: DEGRADED, 708 CKSUM errors)
(C) gptid/1fe18271-31aa-11e6-b193-002590477413 (6/25: 5 CKSUM errors)
(D) gptid/1f1512f0-31aa-11e6-b193-002590477413 (6/22: REMOVED, 3.39K CKSUM errors)*

The only drive that has had no apparent issues is
(E) gptid/b222ddd0-363a-11e5-a23f-002590477413

*I physically replaced drive (D) with drive (A), a new drive, on 6/23.

All the drives are under warranty, but I'm having a hard time believing this many drives could go bad simultaneously, especially when they weren't all bought at the same time or from the same vendor. Perhaps it's my only option left, though.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Please confirm the following:
  1. Switched power supplies, eliminating likelihood of power issue.
  2. Used different data cables from SATA to drives.
  3. Used flashed/FW updated PCIe HBA with same issues as SATA interface.
  4. Replaced drive in pool with immediate same results on new drive.
If the SMART results shown are idle temps, your drives need more cooling.

Please post the results of # smartctl -l scttemp /dev/dax
for all four of your drives.


Only ran 8hrs. of MemTest86 :(:(:(:(

With 32GB of RAM, I'm guessing this was only 1 pass :rolleyes:

If your RAM can go 8 - 10 passes without error, my wildass guess
is the board is most likely the problem
OR the CPU
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
I can confirm #1-4.

Cooling may not be ideal, but like I said above, they've never been over 45 degrees. Requested smartctl output below.

I don't remember how many passes MemTest86 made, but it was enough that it popped up the "passed" message. I'll try to run it for a longer period when time permits.

Code:
[root@freenas] ~# smartctl -l scttemp /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    39 Celsius
Power Cycle Min/Max Temperature:     37/39 Celsius
Lifetime    Min/Max Temperature:      2/42 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (196)

Index    Estimated Time   Temperature Celsius
 197    2016-06-25 04:29    39  ********************
 ...    ..( 20 skipped).    ..  ********************
 218    2016-06-25 04:50    39  ********************
 219    2016-06-25 04:51    38  *******************
 ...    ..(343 skipped).    ..  *******************
  85    2016-06-25 10:35    38  *******************
  86    2016-06-25 10:36     ?  -
  87    2016-06-25 10:37    38  *******************
  88    2016-06-25 10:38    38  *******************
  89    2016-06-25 10:39     ?  -
  90    2016-06-25 10:40    38  *******************
  91    2016-06-25 10:41     ?  -
  92    2016-06-25 10:42    37  ******************
 ...    ..( 23 skipped).    ..  ******************
 116    2016-06-25 11:06    37  ******************
 117    2016-06-25 11:07    38  *******************
 ...    ..( 17 skipped).    ..  *******************
 135    2016-06-25 11:25    38  *******************
 136    2016-06-25 11:26    39  ********************
 ...    ..( 59 skipped).    ..  ********************
 196    2016-06-25 12:26    39  ********************

[root@freenas] ~# smartctl -l scttemp /dev/da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    40 Celsius
Power Cycle Min/Max Temperature:     37/40 Celsius
Lifetime    Min/Max Temperature:      2/42 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (156)

Index    Estimated Time   Temperature Celsius
 157    2016-06-25 04:29    39  ********************
 ...    ..( 78 skipped).    ..  ********************
 236    2016-06-25 05:48    39  ********************
 237    2016-06-25 05:49    38  *******************
 ...    ..(135 skipped).    ..  *******************
 373    2016-06-25 08:05    38  *******************
 374    2016-06-25 08:06     ?  -
 375    2016-06-25 08:07    39  ********************
 376    2016-06-25 08:08    38  *******************
 377    2016-06-25 08:09     ?  -
 378    2016-06-25 08:10    38  *******************
 379    2016-06-25 08:11     ?  -
 380    2016-06-25 08:12    37  ******************
 ...    ..( 14 skipped).    ..  ******************
 395    2016-06-25 08:27    37  ******************
 396    2016-06-25 08:28    38  *******************
 ...    ..(  6 skipped).    ..  *******************
 403    2016-06-25 08:35    38  *******************
 404    2016-06-25 08:36    39  ********************
 ...    ..( 11 skipped).    ..  ********************
 416    2016-06-25 08:48    39  ********************
 417    2016-06-25 08:49    40  *********************
 ...    ..( 88 skipped).    ..  *********************
  28    2016-06-25 10:18    40  *********************
  29    2016-06-25 10:19    39  ********************
 ...    ..(126 skipped).    ..  ********************
 156    2016-06-25 12:26    39  ********************

[root@freenas] ~# smartctl -l scttemp /dev/da2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    39 Celsius
Power Cycle Min/Max Temperature:     38/39 Celsius
Lifetime    Min/Max Temperature:     27/43 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (263)

Index    Estimated Time   Temperature Celsius
 264    2016-06-25 04:29    40  *********************
 ...    ..(252 skipped).    ..  *********************
  39    2016-06-25 08:42    40  *********************
  40    2016-06-25 08:43    39  ********************
 ...    ..( 16 skipped).    ..  ********************
  57    2016-06-25 09:00    39  ********************
  58    2016-06-25 09:01    40  *********************
 ...    ..( 50 skipped).    ..  *********************
 109    2016-06-25 09:52    40  *********************
 110    2016-06-25 09:53    39  ********************
 ...    ..( 19 skipped).    ..  ********************
 130    2016-06-25 10:13    39  ********************
 131    2016-06-25 10:14     ?  -
 132    2016-06-25 10:15    40  *********************
 133    2016-06-25 10:16    39  ********************
 134    2016-06-25 10:17     ?  -
 135    2016-06-25 10:18    39  ********************
 136    2016-06-25 10:19     ?  -
 137    2016-06-25 10:20    38  *******************
 ...    ..( 32 skipped).    ..  *******************
 170    2016-06-25 10:53    38  *******************
 171    2016-06-25 10:54    39  ********************
 ...    ..( 91 skipped).    ..  ********************
 263    2016-06-25 12:26    39  ********************

[root@freenas] ~# smartctl -l scttemp /dev/da3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    36 Celsius
Power Cycle Min/Max Temperature:     35/36 Celsius
Lifetime    Min/Max Temperature:      2/41 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (189)

Index    Estimated Time   Temperature Celsius
 190    2016-06-25 04:29    36  *****************
 ...    ..(365 skipped).    ..  *****************
  78    2016-06-25 10:35    36  *****************
  79    2016-06-25 10:36     ?  -
  80    2016-06-25 10:37    36  *****************
  81    2016-06-25 10:38    36  *****************
  82    2016-06-25 10:39     ?  -
  83    2016-06-25 10:40    36  *****************
  84    2016-06-25 10:41     ?  -
  85    2016-06-25 10:42    35  ****************
 ...    ..(  5 skipped).    ..  ****************
  91    2016-06-25 10:48    35  ****************
  92    2016-06-25 10:49    36  *****************
 ...    ..( 96 skipped).    ..  *****************
 189    2016-06-25 12:26    36  *****************
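
To check the "never over 45 degrees" claim against the drive's own records rather than the alert emails, the SCT output above can be scanned for its highest readings. This is only a sketch: the function names and the `ALERT_THRESHOLD_C` constant are made up for illustration, the 45 figure is the alert setting mentioned earlier in the thread, and rows shown as "skipped" are assumed to repeat the temperature of the row before them, so only explicit rows are read.

```python
import re

ALERT_THRESHOLD_C = 45  # alert setting mentioned in the thread

HISTORY_ROW = re.compile(
    r"^\s*\d+\s+\d{4}-\d{2}-\d{2} \d{2}:\d{2}\s+(\d+)\s", re.MULTILINE
)
LIFETIME = re.compile(r"Lifetime\s+Min/Max Temperature:\s+(-?\d+)/(-?\d+)\s+Celsius")


def max_logged_temp(scttemp_text):
    """Highest temperature among explicit history rows, or None if none parse."""
    temps = [int(t) for t in HISTORY_ROW.findall(scttemp_text)]
    return max(temps) if temps else None


def lifetime_max_temp(scttemp_text):
    """Lifetime maximum reported in the SCT status section, or None if absent."""
    match = LIFETIME.search(scttemp_text)
    return int(match.group(2)) if match else None


# Lines excerpted from the da1 output above.
SAMPLE = """
Lifetime    Min/Max Temperature:      2/42 Celsius
 373    2016-06-25 08:05    38  *******************
 374    2016-06-25 08:06     ?  -
 375    2016-06-25 08:07    39  ********************
 417    2016-06-25 08:49    40  *********************
 ...    ..( 88 skipped).    ..  *********************
  28    2016-06-25 10:18    40  *********************
"""

logged = max_logged_temp(SAMPLE)
lifetime = lifetime_max_temp(SAMPLE)
print(logged, lifetime, lifetime < ALERT_THRESHOLD_C)
```

The history buffer only covers roughly the last eight hours at one-minute sampling, so the lifetime maximum is the better figure for judging whether a drive ever ran hot.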
 

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
In case anyone ever comes across this thread with a similar problem, I replaced another drive in my pool (gptid/ff71eac0-a8c2-11e5-bcc9-002590477413 if you're following along), and that seems to have corrected the problems. Resilvering completed without issue, and I haven't seen any checksum errors on any drive in the 36 hours since, despite the system being under fairly heavy load.

So, it appears that my issue was a double drive failure, where both drives could still pass a SMART long test. The failing drives seem to have (inexplicably?) caused checksum errors on other healthy drives, leading me down many wrong paths before coming back to the obvious answer.
 