Unretryable error (One or more devices has experienced an unrecoverable error.)

m3ki · Jul 2, 2016

Hi all,

I am hoping someone can educate me about this problem. This is the second time I am getting this error, and this time it is on a new vdev. So the problem has happened twice, each time on a different drive.

I suspect this is just a failing drive or some bad sectors, since this system ran for about a year with Unraid before I started migrating to FreeNAS a week ago. (Obviously the HDDs were stress tested at one point or another before being added to the system, and memtests all pass.)

So this morning I got this message:

Code:

kernel log messages:
> mps2: SAS Address for SATA device = 4874463cffdcbe94
> mps2: SAS Address from SATA device = 4874463cffdcbe94
> da20 at mps2 bus 0 scbus2 target 11 lun 0
> da20: <ATA WDC WD50EFRX-68M 0A82> Fixed Direct Access SPC-4 SCSI device
> da20: Serial Number      WD-xxxxxxxxx
> da20: 600.000MB/s transfers
> da20: Command Queueing enabled
> da20: 4769307MB (9767541168 512 byte sectors)
> da20: quirks=0x8<4K>
> ahcich1: Timeout on slot 8 port 0
> ahcich1: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000 cmd 0004c817
> (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> (ada1:ahcich1:0:0:0): Retrying command
> ahcich0: Timeout on slot 7 port 0
> ahcich0: is 00000000 cs 00000080 ss 00000000 rs 00000080 tfd c0 serr 00000000 cmd 0004c717
> (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Retrying command
> (da18:mps0:0:3:0): WRITE(10). CDB: 2a 00 7b b7 df e8 00 00 08 00
> (da18:mps0:0:3:0): CAM status: SCSI Status Error
> (da18:mps0:0:3:0): SCSI status: Check Condition
> (da18:mps0:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
> (da18:mps0:0:3:0): Info: 0x7bb7dfe8
> (da18:mps0:0:3:0): Error 22, Unretryable error

-- End of security output --
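One thing that can be checked directly from the log above is whether the LBA in the failing WRITE(10) really lies beyond the end of the disk. The snippet below is only a sketch: `parse_cdb` and `in_range` are hypothetical helper names, and the sector count is the 9767541168 figure the kernel printed for da20, assumed to apply to da18 as well because both are the same 5 TB WD50EFRX model.

```python
# Sketch: decode the starting LBA and block count from a READ/WRITE CDB
# and compare them with the sector count reported by the kernel.

# Sector count printed for da20 in the log; assumed identical for da18
# (same WD50EFRX model). This is an assumption, not taken from da18 itself.
DISK_SECTORS = 9767541168

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x88: "READ(16)", 0x8A: "WRITE(16)"}


def parse_cdb(cdb_hex):
    """Return (opcode name, starting LBA, block count) for a 10- or 16-byte CDB."""
    b = bytes.fromhex(cdb_hex.replace(" ", ""))
    name = OPCODES.get(b[0])
    if name is None:
        raise ValueError("unsupported opcode 0x%02x" % b[0])
    if len(b) == 10:
        # 10-byte CDB: LBA in bytes 2-5, transfer length in bytes 7-8
        lba = int.from_bytes(b[2:6], "big")
        blocks = int.from_bytes(b[7:9], "big")
    elif len(b) == 16:
        # 16-byte CDB: LBA in bytes 2-9, transfer length in bytes 10-13
        lba = int.from_bytes(b[2:10], "big")
        blocks = int.from_bytes(b[10:14], "big")
    else:
        raise ValueError("unexpected CDB length %d" % len(b))
    return name, lba, blocks


def in_range(lba, blocks, total_sectors=DISK_SECTORS):
    """True if the whole transfer ends at or before the last sector."""
    return lba + blocks <= total_sectors


if __name__ == "__main__":
    name, lba, blocks = parse_cdb("2a 00 7b b7 df e8 00 00 08 00")
    print(name, hex(lba), blocks, "in range:", in_range(lba, blocks))
```

Under that assumption the request falls inside the disk, and the decoded LBA matches the `Info: 0x7bb7dfe8` line. That does not say what caused the error; it only suggests the "out of range" sense data should not be taken at face value.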

Status of the pool

Code:

Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  14.9G  1.09G  13.8G         -      -     7%  1.00x  ONLINE  -
zfast          460G   195G   265G         -    27%    42%  1.00x  ONLINE  /mnt
zroot         81.8T  38.3T  43.5T         -    24%    46%  1.00x  ONLINE  /mnt

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 68K in 0h0m with 0 errors on Sat Jul  2 02:53:31 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zroot                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/775bcae9-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/7861ae25-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/79076331-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/79b53996-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/7ac4a958-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/7bdf71d9-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/7cf57b75-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/7e037123-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/7f1e9fd8-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/80358d0f-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/814a11ec-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
            gptid/81fe2a3b-38d8-11e6-8455-0cc47a6b6816  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/063d5bbf-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     0     0
            gptid/0751dcd6-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     0     0
            gptid/080d0dc1-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     0     0
            gptid/08b9ac7e-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     0     0
            gptid/09630467-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     1     0
            gptid/0a0c005f-3ebe-11e6-8f50-0cc47a6b6816  ONLINE       0     0     0

errors: No known data errors

-- End of daily output --

I got a similar error on another vdev a day before, on a different drive, but I dismissed it as a bad sector and did zpool clear after running a SMART test.

This is today's status

The second highlighted rectangle had same error as first a day ago.

smartctl -a /dev/da18

Code:

smartctl -a /dev/da18
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD50EFRX-68MYMN1
Serial Number:    WD-xxxx
LU WWN Device Id: 5 0014ee 260b5e9b4
Firmware Version: 82.00A82
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul  2 08:59:33 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 248) Self-test routine in progress...
                                        80% of test remaining.
Total time to complete Offline
data collection:                (57960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 579) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   204   202   021    Pre-fail  Always       -       8791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       310
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       10207
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       381
194 Temperature_Celsius     0x0022   118   112   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5569         -
# 2  Short offline       Completed without error       00%      5554         -
# 3  Extended offline    Completed without error       00%      5542         -
# 4  Short offline       Completed without error       00%      5530         -
# 5  Short offline       Completed without error       00%      5506         -
# 6  Short offline       Completed without error       00%      5472         -
# 7  Short offline       Completed without error       00%      5448         -
# 8  Short offline       Completed without error       00%      5424         -
# 9  Short offline       Completed without error       00%      5400         -
#10  Short offline       Completed without error       00%      5376         -
#11  Short offline       Completed without error       00%      5353         -
#12  Short offline       Completed without error       00%      5329         -
#13  Short offline       Completed without error       00%      5305         -
#14  Short offline       Completed without error       00%      5281         -
#15  Short offline       Completed without error       00%      5257         -
#16  Extended offline    Completed without error       00%      5245         -
#17  Short offline       Completed without error       00%      5233         -
#18  Short offline       Completed without error       00%      5232         -
#19  Short offline       Completed without error       00%      5208         -
#20  Short offline       Completed without error       00%      5184         -
#21  Short offline       Completed without error       00%      5160         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.

This is the error that happened the other day with da11. I also added a new vdev that day, so I am assuming the corrupt GPT messages are OK.

Code:

anomaly.m3ki.net kernel log messages:
> mps2: SAS Address for SATA device = 4873463ef7c2c595
> mps2: SAS Address from SATA device = 4873463ef7c2c595
> da19 at mps2 bus 0 scbus2 target 2 lun 0
> da19: <ATA WDC WD50EFRX-68M 0A82> Fixed Direct Access SPC-4 SCSI device
> da19: Serial Number      WD----------
> da19: 600.000MB/s transfers
> da19: Command Queueing enabled
> da19: 4769307MB (9767541168 512 byte sectors)
> da19: quirks=0x8<4K>
> GEOM_ELI: Device da13p1.eli destroyed.
> GEOM_ELI: Detached da13p1.eli on last close.
> GEOM_ELI: Device da14p1.eli destroyed.
> GEOM_ELI: Detached da14p1.eli on last close.
> GEOM_ELI: Device da0p1.eli destroyed.
> GEOM_ELI: Detached da0p1.eli on last close.
> GEOM_ELI: Device da1p1.eli destroyed.
> GEOM_ELI: Detached da1p1.eli on last close.
> GEOM_ELI: Device da2p1.eli destroyed.
> GEOM_ELI: Detached da2p1.eli on last close.
> GEOM_ELI: Device da3p1.eli destroyed.
> GEOM_ELI: Detached da3p1.eli on last close.
> GEOM_ELI: Device da4p1.eli destroyed.
> GEOM_ELI: Detached da4p1.eli on last close.
> GEOM_ELI: Device da5p1.eli destroyed.
> GEOM_ELI: Detached da5p1.eli on last close.
> GEOM_ELI: Device da7p1.eli destroyed.
> GEOM_ELI: Detached da7p1.eli on last close.
> GEOM_ELI: Device da8p1.eli destroyed.
> GEOM_ELI: Detached da8p1.eli on last close.
> GEOM_ELI: Device da9p1.eli destroyed.
> GEOM_ELI: Detached da9p1.eli on last close.
> GEOM_ELI: Device da11p1.eli destroyed.
> GEOM_ELI: Detached da11p1.eli on last close.
> GEOM_ELI: Device ada0p1.eli destroyed.
> GEOM_ELI: Detached ada0p1.eli on last close.
> GEOM_ELI: Device ada1p1.eli destroyed.
> GEOM_ELI: Detached ada1p1.eli on last close.
> GEOM: da6: the primary GPT table is corrupt or invalid.
> GEOM: da6: using the secondary instead -- recovery strongly advised.
> GEOM: da10: the primary GPT table is corrupt or invalid.
> GEOM: da10: using the secondary instead -- recovery strongly advised.
> GEOM: da12: the primary GPT table is corrupt or invalid.
> GEOM: da12: using the secondary instead -- recovery strongly advised.
> GEOM: da15: the primary GPT table is corrupt or invalid.
> GEOM: da15: using the secondary instead -- recovery strongly advised.
> GEOM: da18: the primary GPT table is corrupt or invalid.
> GEOM: da18: using the secondary instead -- recovery strongly advised.
> GEOM: da19: the primary GPT table is corrupt or invalid.
> GEOM: da19: using the secondary instead -- recovery strongly advised.
> GEOM_ELI: Device da13p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da14p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da0p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da1p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da2p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da3p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da4p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da5p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da7p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da8p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da9p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da11p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da6p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da10p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da12p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da15p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da18p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device da19p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device ada0p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
> GEOM_ELI: Device ada1p1.eli created.
> GEOM_ELI: Encryption: AES-XTS 128
> GEOM_ELI:     Crypto: hardware
>       (da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 04 58 00 00 40 00 length 32768 SMID 965 terminated ioc 804b scsi 0 state 0 xfer 0
>       (da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 05 18 00 00 40 00 length 32768 SMID 744 terminated ioc 804b scsi 0 state 0 xfer(da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 04 58 00 00 40 00
>  0
> (da11:mps2:0:6:0): CAM status: CCB request completed with an error
>       (da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 04 98 00 00 40 00 length 32768 SMID 973 terminated ioc 804b scsi 0 state 0 xfer(da11: 0
> mps2:0:6:0): Retrying command
> (da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 05 18 00 00 40 00
> (da11:mps2:0:6:0): CAM status: CCB request completed with an error
> (da11:mps2:0:6:0): Retrying command
> (da11:mps2:0:6:0): READ(10). CDB: 28 00 40 56 04 98 00 00 40 00
> (da11:mps2:0:6:0): CAM status: CCB request completed with an error
> (da11:mps2:0:6:0): Retrying command
> (da11:mps2:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 01 2c 93 cd 60 00 00 00 40 00 00
> (da11:mps2:0:6:0): CAM status: SCSI Status Error
> (da11:mps2:0:6:0): SCSI status: Check Condition
> (da11:mps2:0:6:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
> (da11:mps2:0:6:0): Info: 0x12c93cd60
> (da11:mps2:0:6:0): Error 22, Unretryable error

-- End of security output --

smartctl -a /dev/da11

Code:

 smartctl -a /dev/da11
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD50EFRX-68MYMN1
Serial Number:    WD-xxx
LU WWN Device Id: 5 0014ee 260b5abef
Firmware Version: 82.00A82
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul  2 08:52:53 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (57660) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 576) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   203   202   021    Pre-fail  Always       -       8816
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       205
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7325
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       324
194 Temperature_Celsius     0x0022   116   109   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7303         -
# 2  Short offline       Completed without error       00%      7291         -
# 3  Short offline       Completed without error       00%      7255         -
# 4  Short offline       Completed without error       00%      7159         -
# 5  Short offline       Completed without error       00%      6475         -
# 6  Short offline       Completed without error       00%      6451         -
# 7  Short offline       Completed without error       00%      6427         -
# 8  Extended offline    Completed without error       00%      6404         -
# 9  Short offline       Completed without error       00%      6390         -
#10  Short offline       Completed without error       00%      6366         -
#11  Short offline       Completed without error       00%      6343         -
#12  Short offline       Completed without error       00%      6324         -
#13  Short offline       Completed without error       00%      6300         -
#14  Short offline       Completed without error       00%      6267         -
#15  Short offline       Completed without error       00%      6243         -
#16  Short offline       Completed without error       00%      6219         -
#17  Short offline       Completed without error       00%      6195         -
#18  Short offline       Completed without error       00%      6171         -
#19  Short offline       Completed without error       00%      6146         -
#20  Short offline       Completed without error       00%      6122         -
#21  Short offline       Completed without error       00%      6098         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Here are my firmware versions; I am assuming all is OK there as well:

Code:

> mps0: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
> mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> pcib2: <ACPI PCI-PCI bridge> irq 16 at device 1.1 on pci0
> pci2: <ACPI PCI bus> on pcib2
> mps1: <Avago Technologies (LSI) SAS2308> port 0xd000-0xd0ff mem 0xf7240000-0xf724ffff,0xf7200000-0xf723ffff irq 17 at device 0.0 on pci2
> mps1: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
> mps1: IOCCapabilities: 5285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> xhci0: <Intel Lynx Point USB 3.0 controller> mem 0xf7700000-0xf770ffff irq 16 at device 20.0 on pci0
> xhci0: 32 bytes context size, 64-bit DMA
> xhci0: Port routing mask set to 0xffffffff
> usbus0 on xhci0
> ehci0: <Intel Lynx Point USB 2.0 controller USB-B> mem 0xf7714000-0xf77143ff irq 16 at device 26.0 on pci0
> usbus1: EHCI version 1.0
> usbus1 on ehci0
> pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
> pci3: <ACPI PCI bus> on pcib3
> pcib4: <ACPI PCI-PCI bridge> at device 0.0 on pci3
> pci4: <ACPI PCI bus> on pcib4
> vgapci0: <VGA-compatible display> port 0xc000-0xc07f mem 0xf6000000-0xf6ffffff,0xf7000000-0xf701ffff irq 16 at device 0.0 on pci4
> vgapci0: Boot video device
> pcib5: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
> pci5: <ACPI PCI bus> on pcib5
> igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xb000-0xb01f mem 0xf7500000-0xf757ffff,0xf7580000-0xf7583fff irq 18 at device 0.0 on pci5
> igb0: Using MSIX interrupts with 5 vectors
> igb0: Ethernet address: 0c:c4:7a:6b:68:16
> igb0: Bound queue 0 to cpu 0
> igb0: Bound queue 1 to cpu 1
> igb0: Bound queue 2 to cpu 2
> igb0: Bound queue 3 to cpu 3
> pcib6: <ACPI PCI-PCI bridge> irq 19 at device 28.3 on pci0
> pci6: <ACPI PCI bus> on pcib6
> igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xa000-0xa01f mem 0xf7400000-0xf747ffff,0xf7480000-0xf7483fff irq 19 at device 0.0 on pci6
> igb1: Using MSIX interrupts with 5 vectors
> igb1: Ethernet address: 0c:c4:7a:6b:68:17
> igb1: Bound queue 0 to cpu 0
> igb1: Bound queue 1 to cpu 1
> igb1: Bound queue 2 to cpu 2
> igb1: Bound queue 3 to cpu 3
> pcib7: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
> pci7: <ACPI PCI bus> on pcib7
> mps2: <Avago Technologies (LSI) SAS2008> port 0x9000-0x90ff mem 0xf73c0000-0xf73c3fff,0xf7380000-0xf73bffff irq 16 at device 0.0 on pci7
> mps2: Firmware: 20.00.04.00, Driver: 20.00.00.00-fbsd
> mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
> ehci1: <Intel Lynx Point USB 2.0 controller USB-A> mem 0xf7713000-0xf77133ff irq 22 at device 29.0 on pci0
> usbus2: EHCI version 1.0
> usbus2 on ehci1
> isab0: <PCI-ISA bridge> at device 31.0 on pci0

I guess my questions are:

Am I right to assume the issue is with soon-to-be-failing HDDs? Is there anything else I can do to test?
Is the normal procedure just to ignore these errors for now, since it's only one error each, and do zpool clear?
And obviously, if errors persist, replace the drive?

Also, for the past week I have been transferring data into the system, saturating gigabit, with no freezes or any other issues.

edit: correct smart test

SweetAndLow · Jul 2, 2016

Hardware specs and freenas version, like the rules say should be added to every new thread.

Your drives look fine, so I suspect a bad cable or controller problem. Are you using a Marvell SATA controller by chance?

m3ki · Jul 2, 2016

SweetAndLow said:
Hardware specs and freenas version, like the rules say should be added to every new thread.

Your drives look fine, so I suspect a bad cable or controller problem. Are you using a Marvell SATA controller by chance?


It's all in my signature. Should I repost it ?

SweetAndLow · Jul 2, 2016

Yes, signatures are not visible to anyone on mobile devices.

m3ki · Jul 2, 2016

Ahh... sorry about that:

FreeNAS 9.10
MOBO: SuperMicro X10SL7-F
CPU: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
RAM: 32 GB ECC (KVR1333D3E9SK2/16G)
HBA: 2x IBM M1015 (IT)
STORAGE: 2x 500SSD (Mirror)
STORAGE: 24 x 5TB WD RED (4 x 6xRAID-Z2)
BOOT: 2x16GB Sandisk Cruzer Fit USB 2.0
PSU: Seasonic SS-660XP2
UPS: Cyberpower CP1500AVR

SweetAndLow · Jul 2, 2016

Are you using an expander to connect the disks, or do they all have SATA ports? What chassis?

m3ki · Jul 2, 2016

No expander. In the logs you can see 2x M1015 in IT mode on P20 firmware (same as the driver version), plus the onboard LSI 2308 HBA, again in IT mode on P20.
Chassis is a Norco 4224 with adequate cooling; drives never go over 40-41 C, usually hovering at ~35 C.

SweetAndLow · Jul 2, 2016

Thanks for all the info. Looks like you have a pretty good build. I can't come up with any more ideas as to why you are having these issues. They are something that isn't normal and shouldn't be ignored. Any similarities between the 2 drives that are having problems? Same controller or something else?

m3ki · Jul 2, 2016

That's what I am trying to figure out :( I agree the build is solid; it was originally built with FreeNAS in mind :) then I ventured into Unraid for a year :)
They seem to be on separate controllers: if you look at the logs you see mps0 vs mps2, which I think means separate controllers. Is there a way to check without opening up the machine? It's a bit of a pain to take out of the lack rack.

Also, ada0 has a timeout in the log but no errors (it is an SSD).

I looked at the server and the disks are on opposite sides of the case, on different backplanes AND, if I remember how I connected them inside, on different controllers.

m3ki · Jul 2, 2016

I found a pic from when I was assembling it: http://imgur.com/R7sjTpL. The drives in question would be on the 1st and 4th backplanes from the top, so those I would have connected to different controllers.

m3ki · Jul 2, 2016

Hmmm... actually I just found an error I got from unraid before I moved to freenas

Code:

Event: unRAID array errors
Subject: Warning [CORTEXIPHAN] - array has errors
Description: Array has 1 disk with read errors
Importance: warning

Disk 9 - WDC_WD50EFRX-68MYMN1_WD-WX-----2SH (sdi) (errors 128)

So that means this disk had some errors before, from SMART... strange that that doesn't show in the FreeNAS SMART report :(

m3ki · Jul 2, 2016

Yeah this is strange...
My reasoning is:
If it was a cable or something else I would have more errors than 2 in span of 2 weeks on separate disks.
Same probably goes for firmware.

I do have backups of important data, but I don't have backups of 40TB of pain-in-the-ass-to-replace data :)

So I would rather not screw around and figure this out now.

I am hoping that experienced users in the forum can shed some light or ideas as to what this could be.

My gut is telling me it is something to do with HDD having a read/write error and controller times out trying to talk to the HDD until hdd recovers.

Main question is WHY does smart show nothing?

SweetAndLow · Jul 2, 2016

What are the ssds used for? I wonder if it's possible we might want to remove them from the pool and do more testing to eliminate variables.

m3ki · Jul 2, 2016

SSDs are in a separate pool. So chances are not related.

SweetAndLow · Jul 2, 2016

Also I think for 24+ drives a 600watt PSU might be small but I'm not an expert on that and would need to do some research. I use a 1200watt PSU for my 24bay enclosure.

SweetAndLow · Jul 2, 2016

m3ki said:
SSDs are in a separate pool. So chances are not related.

OK, probably not related.

m3ki · Jul 2, 2016

I thought so too regarding power, but this is a high-quality PSU.
Also WD Reds take 5.3 watts on spinup and average ~3 watts, so x24 gives 120 watts; add the mobo, controllers and CPU and that doesn't come close to maxing out the PSU.
Theoretically that should be enough. I remember doing calculations before purchase and trying to size the system appropriately.
The fans on the PSU don't even spin up at load. I think the system as a whole never even breaks 180 watts while transcoding 4 streams on Plex and doing a scrub at the same time :)

Argh sorry :)

Ericloewe · Jul 3, 2016

m3ki said:
Also WD reds take 5.3 watts on spinup

Wrong. Try 30W. Easily.
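To put the two figures from this exchange side by side, here is a rough worked calculation. It is only a sketch using numbers quoted in the thread: 24 drives, 5.3 W per drive under load, and the "try 30W" spin-up estimate. The 660 W rating is assumed from the PSU model name (SS-660XP2), all drives are assumed to spin up at the same moment (no staggered spin-up), and motherboard, CPU and HBA draw are left out.

```python
# Rough drive power budget using figures quoted in the thread.
# All constants are assumptions taken from the posts, not measurements.

DRIVES = 24           # 24 x 5TB WD Red, per the hardware list
LOAD_WATTS = 5.3      # per drive under load (m3ki's figure)
SPINUP_WATTS = 30.0   # per drive at spin-up (Ericloewe's estimate)
PSU_WATTS = 660       # assumed from the PSU model name SS-660XP2


def total_watts(drive_count, watts_per_drive):
    """Total draw if every drive pulls watts_per_drive at the same time."""
    return drive_count * watts_per_drive


load_total = total_watts(DRIVES, LOAD_WATTS)
spinup_total = total_watts(DRIVES, SPINUP_WATTS)

if __name__ == "__main__":
    print("under load: %.1f W" % load_total)
    print("simultaneous spin-up: %.1f W" % spinup_total)
    print("spin-up exceeds PSU rating:", spinup_total > PSU_WATTS)
```

With these inputs the running load is small, but a simultaneous spin-up of all 24 drives would exceed the assumed PSU rating on paper. Whether that matters in practice depends on whether the backplanes or HBAs stagger spin-up, which the thread does not say.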

m3ki · Jul 3, 2016

Right 5.3 on load.... ok.
So does it mean I'm screwed?

System has been running solid for a year. I never had issues with startup. No delays, no hangs.

I haven't correlated this issue to the spin up. It seemed that it happened at night, when the system wasn't under much load.

I'm not being ignorant of power requirements and will replace the PSU if needed.

I can retest with kill-a-watt.

Does everyone think it is a PSU issue?


Ericloewe · Jul 3, 2016

m3ki said:
Does everyone think it is a PSU issue?

Doesn't seem particularly likely, to be honest.

It could be that the drives are spinning down during operation, but you would've had to have configured that.

You've had what, two or three errors so far? My advice is to keep an eye on things and keep track of where and when the errors show up. When you have more specific data, it'll be easier to determine a course of action.

Just stick with general best practices, like keeping a spare or two burned in and ready to go.
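One way to keep track of where the errors show up is to tally the CAM status lines from the daily security output by device and controller. The following is a minimal sketch, not a FreeNAS tool: `tally_errors` is a hypothetical helper, and the regular expression only assumes the `(device:controller:bus:target:lun): CAM status: ...` line format seen in the logs posted in this thread.

```python
import re
from collections import Counter

# Matches lines such as "(da18:mps0:0:3:0): CAM status: SCSI Status Error"
CAM_STATUS = re.compile(
    r"\((?P<dev>a?da\d+):(?P<ctrl>[a-z]+\d+):\d+:\d+:\d+\): CAM status: (?P<status>.+)"
)


def tally_errors(log_text):
    """Count CAM status messages per (device, controller, status)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            key = (match.group("dev"), match.group("ctrl"), match.group("status").strip())
            counts[key] += 1
    return counts


# A few lines copied from the logs posted earlier in the thread.
SAMPLE = """\
> (ada1:ahcich1:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (da18:mps0:0:3:0): CAM status: SCSI Status Error
> (da11:mps2:0:6:0): CAM status: CCB request completed with an error
> (da11:mps2:0:6:0): CAM status: SCSI Status Error
"""

if __name__ == "__main__":
    for (dev, ctrl, status), n in sorted(tally_errors(SAMPLE).items()):
        print(dev, ctrl, status, n)
```

Fed with the saved daily emails over a few weeks, a tally like this would show whether the errors cluster on one controller (mps0, mps2) or on particular drives. The same lines also show which controller each device sits on, without opening the case.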
