zpool degraded, disk unavailable, but SMART ok

Status
Not open for further replies.

Sloefke

Cadet
Joined
Jul 21, 2016
Messages
7
Hi,

After almost a year of lurking, reading a lot, researching, and finally building my own FreeNAS server, I've run into my first problem I can't seem to solve on my own :)
Just some background: I built a FreeNAS server based on a Xeon E3-1230v5, a Supermicro X11SSL-CF, supported Samsung ECC RAM (32GB), and two pools: a mirrored SSD pool for my test VMs (2x250GB) and a RAID-Z2 pool for my data (6x4TB WD Reds). One quirk: I'm running FreeNAS as a VM on ESXi, with all my disks passed through and all configuration done as described in these forums and in the 'official' blog post by the FreeNAS project. The build was done in July this year, so not too long ago.

Today I got some worrisome mails from my FreeNAS telling me one of the WD Reds was unavailable and the data zpool was degraded. And indeed:

Code:
[root@duvel] ~# zpool status zpool-data -v
cannot open '-v': name must begin with a letter
  pool: zpool-data
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h51m with 0 errors on Thu Sep 15 02:51:41 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool-data                                      DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/c3c0ff42-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c4e14e68-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            6054603385877469052                         UNAVAIL      3   152     0  was /dev/gptid/c5fc4597-4ac9-11e6-9a8e-0cc47a81f344
            gptid/c71aad36-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c83372aa-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c9514fab-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0

errors: No known data errors


(Don't mind me running as root; my infra is still a work in progress :p)

Apparently, SMART also started failing for this disk:

Code:
[root@duvel] ~# date
Mon Sep 26 17:49:28 CEST 2016
[root@duvel] ~# tail -20 /var/log/messages
Sep 26 07:43:49 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 232 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:49 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 769 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:50 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 223 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:50 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 338 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:50 duvel smartd[3455]: Device: /dev/da3 [SAT], failed to read SMART Attribute Data
Sep 26 07:43:50 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 638 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:50 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 895 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:51 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 373 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:51 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 153 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 07:43:51 duvel   (pass4:mpr0:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 372 terminated ioc 804b scsi 0 state c xfer 0
Sep 26 08:00:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=0, length=131072)]
Sep 26 08:00:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=262144, length=131072)]
Sep 26 08:00:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=2146959360, length=131072)]
Sep 26 08:00:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=2147221504, length=131072)]
Sep 26 12:44:00 duvel upsd[2571]: Data for UPS [ups] is stale - check driver
Sep 26 12:44:00 duvel upsd[2571]: UPS [ups] data is no longer stale
Sep 26 17:11:51 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=0, length=131072)]
Sep 26 17:11:51 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=262144, length=131072)]
Sep 26 17:11:51 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=2146959360, length=131072)]
Sep 26 17:11:51 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da3p1.eli[READ(offset=2147221504, length=131072)]


But the weird thing is, the smartctl output doesn't show anything wrong:

Code:
[root@duvel] ~# smartctl -a /dev/da3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E5NZXNCS
LU WWN Device Id: 5 0014ee 262b146ed
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 26 17:50:04 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 530) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   179   021    Pre-fail  Always       -       6675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1779
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       773
194 Temperature_Celsius     0x0022   119   111   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1689         -
# 2  Short offline       Completed without error       00%      1641         -
# 3  Short offline       Completed without error       00%      1593         -
# 4  Extended offline    Completed without error       00%      1558         -
# 5  Short offline       Completed without error       00%      1545         -
# 6  Short offline       Completed without error       00%      1497         -
# 7  Short offline       Completed without error       00%      1449         -
# 8  Short offline       Completed without error       00%      1401         -
# 9  Short offline       Completed without error       00%      1353         -
#10  Short offline       Completed without error       00%      1305         -
#11  Short offline       Completed without error       00%      1258         -
#12  Extended offline    Completed without error       00%      1222         -
#13  Short offline       Completed without error       00%      1210         -
#14  Short offline       Completed without error       00%      1162         -
#15  Short offline       Completed without error       00%      1138         -
#16  Short offline       Completed without error       00%      1090         -
#17  Short offline       Completed without error       00%      1042         -
#18  Short offline       Completed without error       00%       994         -
#19  Short offline       Completed without error       00%       946         -
#20  Short offline       Completed without error       00%       898         -
#21  Short offline       Completed without error       00%       850         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I checked all other disks to make sure they were not renumbered, but this one is clearly The One (the serial matches the GUI output for /dev/da3). All other disks have the same SMART output and no disk is missing. smartctl -x also shows that recent temperature readouts were taken successfully:

Code:
Index    Estimated Time   Temperature Celsius
477    2016-09-26 09:54    33  **************
...    ..(476 skipped).    ..  **************
476    2016-09-26 17:51    33  **************


Trying to bring it back online clearly fails though:

Code:
[root@duvel] ~# zpool online zpool-data 6054603385877469052
warning: device '6054603385877469052' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present


I'm a bit confused here. I fully accept that something is wrong with the disk and that I should replace it (a replacement is already on its way; I should have bought a spare disk at the start), but how can I verify what is going wrong with this disk if smartctl isn't showing anything (even though the errors are clear in the log)?
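One thing I might still try is reading from the raw device and watching whether fresh errors hit the console; a rough sketch, assuming the disk is still attached as /dev/da3:

Code:
[root@duvel] ~# dd if=/dev/da3 of=/dev/null bs=1m count=10000   # sequential read of the first ~10GB
[root@duvel] ~# tail -f /var/log/messages                       # in a second session, watch for new CAM/GEOM errors
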

My current guess is: the disk is b0rked, SMART failed a couple of times so FreeNAS is 'blocking' it, and it's only a matter of time before it completely dies on me. Possible?

Thanks!

Edit: rereading my post, I just noticed the g_eli_read_done errors. To be clear: I have no encryption enabled on any zpool.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Try onlining the disk using the gptid: zpool online zpool-data gptid/c5fc4597-4ac9-11e6-9a8e-0cc47a81f344.

Edit: SMART doesn't show any errors or questionable results on that disk, but it does show that the disk hasn't seen a SMART self-test for 5000 hours. You'll want to make sure your SMART self-test schedule includes da3 and is reasonable (short tests no less often than weekly, long tests no less often than monthly; daily shorts and weekly longs are the most frequent schedules usually recommended).
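If you want to kick one off by hand and read back the result, something like this should do (a sketch; adjust the device name to taste):

Code:
smartctl -t short /dev/da3      # queue a short self-test (~2 minutes on these drives)
smartctl -t long /dev/da3       # or an extended self-test (~9 hours on a 4TB Red)
smartctl -l selftest /dev/da3   # read back the self-test log once it finishes
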
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
My current guess is: the disk is b0rked, SMART failed a couple of times so FreeNAS is 'blocking' it, and it's only a matter of time before it completely dies on me. Possible?
SMART would show something, but it's possible FreeNAS perceived it as faulted. I can think of the following scenarios:
  • SMART reporting was borked (firmware bug on the HDD?) at the same time it started failing
  • HBA issues that ruin the channel's operability
  • Your server might be haunted. An exorcism might be in order.
In any case, I'd recommend wiping the drive, running a burn-in, and then seeing what happens. If it passes a few rounds of badblocks and shows good SMART data (as in valid and not indicative of a fault), it's hard to justify getting rid of it.
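As a sketch of the burn-in (destructive, so only once the disk is out of the pool, and triple-check the device name first):

Code:
badblocks -ws -b 4096 /dev/da3   # write-mode test: four full write+verify passes, wipes the disk
smartctl -a /dev/da3             # afterwards, re-check attributes 5, 196, 197 and 198
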
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
show that the disk hasn't seen a SMART self-test for 5000 hours
No, just 90 hours. I think you read the spin-up time.
 

Sloefke

Cadet
Joined
Jul 21, 2016
Messages
7
SMART would show something, but it's possible FreeNAS perceived it as faulted. I can think of the following scenarios:
  • SMART reporting was borked (firmware bug on the HDD?) at the same time it started failing
  • HBA issues that ruin the channel's operability
  • Your server might be haunted. An exorcism might be in order.
In any case, I'd recommend wiping the drive, running a burn-in and then see what happens. If it passes a few rounds of badblocks and shows good SMART data (as in valid and not indicative of a fault), it's hard to suggest getting rid of it.

Thanks, will do as soon as I have replaced it with the spare one :)

No, just 90 hours. I think you read the spin-up time.

Indeed, SMART tests are scheduled to run every x days (I don't know the exact interval by heart; I followed one of the best-practice guides around here).
 

Sloefke

Cadet
Joined
Jul 21, 2016
Messages
7
OK, now I'm worried: the GUI showed /dev/da3p2 as unavailable, so I retrieved the serial number from the "View disks" view and unplugged the disk from my server (AHCI is enabled, so hot-plugging is OK). As soon as I replaced it, FreeNAS started sending alerts about /dev/da4 being unavailable. My first thought was "I pulled the wrong disk", so I put it back in. Result: /dev/da3 is now back online and resilvered, and /dev/da4 is unavailable. Uhhh. Something tells me my drive isn't faulty after all ...

Current state:

Code:
[root@duvel] ~# zpool status zpool-data
  pool: zpool-data
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 1.22M in 0h0m with 0 errors on Wed Sep 28 18:50:58 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool-data                                      DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/c3c0ff42-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c4e14e68-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c5fc4597-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            12896125252193783266                        UNAVAIL      3     5     0  was /dev/gptid/c71aad36-4ac9-11e6-9a8e-0cc47a81f344
            gptid/c83372aa-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c9514fab-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0

errors: No known data errors


Code:
[root@duvel] ~# smartctl -a /dev/da4
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E4JRX39C
LU WWN Device Id: 5 0014ee 262b16901
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 28 19:03:53 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 530) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   184   021    Pre-fail  Always       -       3841
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1828
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       93
194 Temperature_Celsius     0x0022   122   114   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1785         -
# 2  Short offline       Completed without error       00%      1737         -
# 3  Short offline       Completed without error       00%      1689         -
# 4  Short offline       Completed without error       00%      1641         -
# 5  Short offline       Completed without error       00%      1593         -
# 6  Extended offline    Completed without error       00%      1558         -
# 7  Short offline       Completed without error       00%      1545         -
# 8  Short offline       Completed without error       00%      1497         -
# 9  Short offline       Completed without error       00%      1449         -
#10  Short offline       Completed without error       00%      1401         -
#11  Short offline       Completed without error       00%      1353         -
#12  Short offline       Completed without error       00%      1305         -
#13  Short offline       Completed without error       00%      1258         -
#14  Extended offline    Completed without error       00%      1222         -
#15  Short offline       Completed without error       00%      1210         -
#16  Short offline       Completed without error       00%      1162         -
#17  Short offline       Completed without error       00%      1138         -
#18  Short offline       Completed without error       00%      1090         -
#19  Short offline       Completed without error       00%      1042         -
#20  Short offline       Completed without error       00%       994         -
#21  Short offline       Completed without error       00%       946         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



And /var/log/messages is also filling up with those GEOM_ELI errors, now for /dev/da4:

Code:
Sep 28 18:50:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=0, length=131072)]
Sep 28 18:50:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=262144, length=131072)]
Sep 28 18:50:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=2146959360, length=131072)]
Sep 28 18:50:50 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=2147221504, length=131072)]
Sep 28 18:51:23 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=0, length=131072)]
Sep 28 18:51:23 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=262144, length=131072)]
Sep 28 18:51:23 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=2146959360, length=131072)]
Sep 28 18:51:23 duvel GEOM_ELI: g_eli_read_done() failed (error=6) da4p1.eli[READ(offset=2147221504, length=131072)]


/dev/da3 seems to be all fine now, both in the zpool and according to logs in messages ...

Am I correct to assume that if Storage > zpool-data > Volume status shows a disk ID instead of "da4p2", it corresponds to the /dev/da4 whose serial number is listed in Storage > View disks?
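I suppose I can double-check that mapping from the shell with something like this (a sketch, using the gptid from the zpool status above):

Code:
[root@duvel] ~# glabel status | grep c71aad36          # shows which daXp2 carries that gptid
[root@duvel] ~# smartctl -i /dev/da4 | grep -i serial  # serial to compare against "View disks"
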

Just to be complete: I reconnected the cables on /dev/da3 exactly as they were and didn't touch the ones going to /dev/da4 at all, which seems to rule out a faulty cable or controller?

I'm not sure what is happening here; it smells like a software malfunction rather than a hardware one.

Edit: I completely forgot to mention that I'm running FreeNAS-9.10.1 (d989edd). I noticed a patch was released yesterday, but is it safe to apply it, or even simply reboot, in the current state?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Don't bother with 9.10.1-U1; it doesn't have anything major related to this problem.
 

philhu

Patron
Joined
May 17, 2016
Messages
258
I had this exact thing happen with my da14.

I got a bunch of messages about it missing; zpool list showed it as unavailable. I had a RAID-Z3, so I left it for after work. When I got home, the system had rebooted (!) about 15 minutes after I left, all drives were online, and the resilver had gone fine without me there. It kind of scares me that this can happen.
 

Magius

Explorer
Joined
Sep 29, 2016
Messages
70
Fortunately it sounds like this problem resolved itself, but I know if it were me, I'd still have some worries about root cause, and whether this could happen again without warning. Turns out the system you built is almost identical to the one I'm planning, so I'm definitely curious what went wrong in the first place. Not trying to scare you, but you did have both da3 and da4 hop out of the vdev on you for "no reason" and if two or more had decided to hop out at the same time, at a particularly bad time, it could end in a really bad day...

So with that said, I thought I'd share my best guess at an interpretation of what the error messages are telling you. First, from your SMART test log, you've been running short tests every 48 hours and a long test every 2 weeks. Sounds good, except, as other posters pointed out, it's been 90 hours since your last self-test on da3 (it had only been 43 hours on da4; care to check it again now and make sure that test ran 5 hours later?). I would want to figure out why your self-test didn't run as scheduled. Did you power the server down that night? Could the HBA have been unable to command the drive to perform a self-test for some reason? Do you have logs from the last few days that you could skim through?

Second, you didn't mention which HBA you're using, but you had a couple of ATA PASS THROUGH errors. This is the command that allows an HBA to transport a raw SATA command across a SCSI (SAS) interface and get a response back from a native SATA disk. Back ~8 years ago this used to be a common problem when mixing certain HBAs with certain Linux distros, but I have no experience with BSD to know whether it has ever had similar problems. One way it used to manifest: when the server was under moderate I/O load, the HBA would be unable to complete ATA PASS THROUGH commands, and thus unable to query SMART attributes or initiate self-tests. It was almost always intermittent rather than a permanent "this no worky" situation, so it was really annoying to deal with. Is it possible your server was under heavy load a couple of days ago when this happened?

Anyway, to translate the pass-through commands themselves: the first word, '85', is the Opcode for ATA PASS THROUGH. The second-to-last word, 'b0', is the field for the command Opcode to be passed through, and in this case it is the Opcode for all SMART commands. To determine which SMART command, you have to look at the feature word, which is the 4th one after the 85. Your commands use, in order, the features 'd5', 'da', 'd0', then 'd5' again, and the sequence essentially translates to the list below (a way to watch these commands live follows after it):

SMART READ LOG # Log 1 = Summary SMART error log
SMART RETURN STATUS # Get log directory (this tells the requester which logs the drive supports)
SMART RETURN STATUS # Get log directory (identical command - retry?)
SMART READ DATA # Fairly self explanatory
smartd[3455]: Device: /dev/da3 [SAT], failed to read SMART Attribute Data # Confirmation that at least this one command failed
SMART READ DATA # identical command - retry?
SMART READ LOG # Log 6 = SMART self-test log
SMART READ LOG # Log 6 = SMART self-test log (identical command - retry?)
SMART READ LOG # Log 1 = Summary SMART error log
SMART READ LOG # Log 1 = Summary SMART error log (identical command - retry?)
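By the way, if you want to watch smartctl issue these pass-through commands yourself, it can dump its own ioctl traffic; a sketch (I believe the -r option behaves the same on FreeBSD as it does on Linux):

Code:
smartctl -r ataioctl,2 -a /dev/da3   # prints every ATA pass-through command and its response
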

Here's where my knowledge of SMART and syslog debugging ends, and the guessing begins. In my experience, under normal circumstances, I don't see systems log every single (successful) ATA PASS THROUGH command, so I believe each of the ones from your log is a command that failed (some failed twice in a row?). This is supported by the fact that they all say 'terminated'. I have seen similar log entries where that can say 'completed' or 'timed out' or 'aborted', but I'm not sure what 'terminated' means exactly.

Jumping subjects, I also don't know what 'error=6' means in your GELI read failures (zero BSD experience here), but assuming it's a legitimate error, it appears that you're having intermittent problems reading both actual data (GELI errors) and metadata (SMART errors) from your drives. To me the most likely culprits here are not the drives themselves, but either the HBA, the cabling, the backplanes, etc. Something that would cause commands to intermittently not make it through to the drives correctly... As I said though, the guessing began way back a paragraph or two ago :) Just for curiosity's sake, can you let me know what HBA you're using, and what chassis/backplane?

I hope you're all clear and don't see these gremlins pop up again!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Edit: just noticed the g_eli_read_done errors rereading my post, but to be clear: I have no encryption enabled on any zpool.

p1.eli is the swap partition. I believe swap is encrypted.
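A quick way to see which swap devices are active (a sketch, assuming the default FreeNAS swap layout):

Code:
swapinfo -h   # active swap devices; on FreeNAS they show up as daXp1.eli
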

Your drive failed. You had swap in use. Your system needed to be rebooted.

FreeBSD has a bug which causes ARC, VM and UMA to fight, and then you get swap. And swap will crash you if your drive fails or is removed.

See my page_in script.

Why did your drive go offline? Dunno. I'd begin worrying about it if it happens again.
 

Sloefke

Cadet
Joined
Jul 21, 2016
Messages
7
Thanks for the detailed answers. The problem has occurred 2 more times since I created this thread (and after I updated to 9.10 U2), each time with a different disk. So indeed, it doesn't look like faulty disks, but something underneath. I have a Supermicro X11SSL-CF motherboard and am using the SAS3 ports on its onboard LSI 3008 controller. The controller has been flashed to IT firmware (I don't recall the exact version, but it was the latest back in July). That shouldn't be exotic stuff for FreeNAS, from what I've read?
I connected the disks using an Adaptec SAS3-to-4xSATA cable, so maybe the problem sits there. From now on I'll write down the 'failure' sequences and see whether they are related to a single cable (though I'm not sure: the first 4 disks are patched with one cable and the other 2 with a second, and I have seen failures on da2, da3 and da5).
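To pin each failure to a physical port, I figure I can record the CAM target IDs together with the gptid mapping; a sketch of what I mean:

Code:
[root@duvel] ~# camcontrol devlist   # daX -> scbus/target IDs, which should map to the HBA ports
[root@duvel] ~# glabel status        # gptid -> daXp2, to match against the zpool status output
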
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
How are your disks supplied with power?
 

Sloefke

Cadet
Joined
Jul 21, 2016
Messages
7
OK, this is getting confusing. I updated to 9.10-U4 last weekend, and this morning I got a new alert telling me my pool is degraded. Now it's not a disk UNAVAILABLE but a disk FAULTED:

Code:
  pool: zpool-data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 1h1m with 0 errors on Tue Nov 15 03:01:43 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool-data                                      DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/c3c0ff42-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c4e14e68-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c5fc4597-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c71aad36-4ac9-11e6-9a8e-0cc47a81f344  FAULTED      6    79     0  too many errors
            gptid/c83372aa-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0
            gptid/c9514fab-4ac9-11e6-9a8e-0cc47a81f344  ONLINE       0     0     0


smartctl indeed shows *something* went wrong with the last long test of this disk, this morning:

Code:
[root@duvel] ~# smartctl -a /dev/da1 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3018
# 1  Short offline       Completed without error       00%      3009         -
[root@duvel] ~# smartctl -a /dev/da2 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3018
# 1  Short offline       Completed without error       00%      3009         -
[root@duvel] ~# smartctl -a /dev/da3 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3017
# 1  Short offline       Completed without error       00%      3008         -
[root@duvel] ~# smartctl -a /dev/da4 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3018
# 1  Extended offline    Interrupted (host reset)      50%      3015         -
[root@duvel] ~# smartctl -a /dev/da5 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3018
# 1  Short offline       Completed without error       00%      3009         -
[root@duvel] ~# smartctl -a /dev/da6 | grep -E "# 1|Power_On_Hours"
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3018
# 1  Short offline       Completed without error       00%      3009         -


I'm trying to find out whether the self-test interruption caused the FreeNAS errors, or whether errors on the disk caused both the self-test and the FreeNAS errors. In any case, I have now had zpool degradations caused by 4 different disks over the past 1.5 months ... I can accept faulty disks, but 4 out of 6 new disks seems a bit far-fetched, no?
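To correlate the two, I'm grepping the logs around the time the long test was interrupted; a sketch:

Code:
[root@duvel] ~# grep -E "da4|mpr0" /var/log/messages | tail -40
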

Edit: from messages:

Code:
Nov 17 06:06:08 duvel   (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00009716f0
Nov 17 06:06:08 duvel mpr0: Sending reset from mprsas_send_abort for target ID 3
Nov 17 06:06:08 duvel   (da4:mpr0:0:3:0): WRITE(10). CDB: 2a 00 7e 66 ad b8 00 00 08 00 length 4096 SMID 985 terminated ioc 804b scsi 0 state c xfer 0
Nov 17 06:06:08 duvel mpr0: Unfreezing devq for target ID 3
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): WRITE(10). CDB: 2a 00 7e 66 ad b8 00 00 08 00
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): CAM status: CCB request completed with an error
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): Retrying command
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): CAM status: Command timeout
Nov 17 06:06:08 duvel (da4:mpr0:0:3:0): Retrying command
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): CAM status: SCSI Status Error
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): SCSI status: Check Condition
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): Error 6, Retries exhausted
Nov 17 06:06:09 duvel (da4:mpr0:0:3:0): Invalidating pack


I'll try to open up my server one of these days and figure out how the disks are connected. With some luck da2-5 are connected with the same SAS->SATA cable ;)

Edit2: I did a short SMART test on the disk; it completed without error. I've now started a new long test; let's see in about 10 hours ...

Edit3: yes, well, stupid noob I guess :) It looks like my 10.x LSI firmware doesn't match the driver in FreeNAS 9.10, which FreeNAS has been trying to tell me since installation. So this weekend I'll update the controller firmware and cross my fingers that this thing is gone afterwards.
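For reference, this is roughly how to check the two versions (a sketch; sas3flash is LSI's flash utility for the SAS3008 and may need to be fetched separately):

Code:
[root@duvel] ~# dmesg | grep -i mpr   # the mpr driver logs its own version and the firmware it found
[root@duvel] ~# sas3flash -list       # controller firmware and BIOS versions
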
 

daniels7

Cadet
Joined
Oct 23, 2014
Messages
8
I have quite a similar problem and get the following messages:

Code:
MSZone.NAS kernel log messages:

>   (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 661 command timeout cm 0xfffffe0000b0d390 ccb 0xfffff803907aa800
>   (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000b0d390
> mps0: Sending reset from mpssas_send_abort for target ID 14
>   (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 801 command timeout cm 0xfffffe0000b18b50 ccb 0xfffff8041e97a800
>   (da3:mps0:0:14:0): WRITE(10). CDB: 2a 00 b1 a6 8f 38 00 00 08 00 length 4096 SMID 622 terminated ioc 804b scsi 0 state c xfer 0
>   (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 801 terminated ioc 804b scsi 0 sta(da3:mps0:0:14:0): WRITE(10). CDB: 2a 00 b1 a6 8f 38 00 00 08 00 te c xfer 0
> (da3:mps0:0:14:0): CAM status: CCB request completed with an error
> mps0: (da3:Unfreezing devq for target ID 14
> mps0:0:14:0): Retrying command
> (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da3:mps0:0:14:0): CAM status: Command timeout
> (da3:mps0:0:14:0): Retrying command
> (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da3:mps0:0:14:0): CAM status: CCB request completed with an error
> (da3:mps0:0:14:0): Retrying command
> (da3:mps0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da3:mps0:0:14:0): CAM status: SCSI Status Error
> (da3:mps0:0:14:0): SCSI status: Check Condition
> (da3:mps0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da3:mps0:0:14:0): Error 6, Retries exhausted
> (da3:mps0:0:14:0): Invalidating pack
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859685376, length=49152)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859898368, length=8192)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859869696, length=28672)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859918848, length=4096)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859677184, length=4096)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=1523859927040, length=16384)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[WRITE(offset=75208732672, length=32768)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
>   (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 814 command timeout cm 0xfffffe0000b19c60 ccb 0xfffff80056c11800
>   (noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xfffffe0000b19c60
> mps0: Sending reset from mpssas_send_abort for target ID 17
>   (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 653 command timeout cm 0xfffffe0000b0c910 ccb 0xfffff8039f4ea000
>   (da6:mps0:0:17:0): READ(10). CDB: 28 00 00 cd 0e a8 00 00 08 00 length 4096 SMID 307 terminated ioc 804b scsi 0 state c xfer 0
>   (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 9c cf 35 c0 00 00 00 08 00 00 length 4096 SMID 681 terminated ioc 804b sc(da6:mps0:0:17:0): READ(10). CDB: 28 00 00 cd 0e a8 00 00 08 00 si 0 state c xfer 0
>   (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 2f a7 69 f0 00 00 00 18 00 00 length 12288 SMID 410 terminated ioc 804b s(da6:csi 0 state c xfer 0
> mps0:0:  (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 2f a7 69 d8 00 00 00 18 00 00 length 12288 SMID 607 terminated ioc 804b s17:csi 0 state c xfer 0
> 0):   (da6:mps0:0:17:0): WRITE(10). CDB: 2a 00 b1 a6 af 70 00 00 08 00 length 4096 SMID 1023 terminated ioc 804b scsi 0 state c xfeRetrying command
> r 0
> (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 9c cf 35 c0 00 00 00 08 00 00
>   (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 653 terminated ioc 804b scsi 0 sta(da6:mps0:0:17:0): CAM status: CCB request completed with an error
> te c xfer 0
> (da6:mps0: mps0:0:Unfreezing devq for target ID 17
> 17:0): Retrying command
> (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 2f a7 69 f0 00 00 00 18 00 00
> (da6:mps0:0:17:0): CAM status: CCB request completed with an error
> (da6:mps0:0:17:0): Retrying command
> (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 2f a7 69 d8 00 00 00 18 00 00
> (da6:mps0:0:17:0): CAM status: CCB request completed with an error
> (da6:mps0:0:17:0): Retrying command
> (da6:mps0:0:17:0): WRITE(10). CDB: 2a 00 b1 a6 af 70 00 00 08 00
> (da6:mps0:0:17:0): CAM status: CCB request completed with an error
> (da6:mps0:0:17:0): Retrying command
> (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da6:mps0:0:17:0): CAM status: Command timeout
> (da6:mps0:0:17:0): Retrying command
> (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da6:mps0:0:17:0): CAM status: CCB request completed with an error
> (da6:mps0:0:17:0): Retrying command
> (da6:mps0:0:17:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da6:mps0:0:17:0): CAM status: SCSI Status Error
> (da6:mps0:0:17:0): SCSI status: Check Condition
> (da6:mps0:0:17:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da6:mps0:0:17:0): Error 6, Retries exhausted
> (da6:mps0:0:17:0): Invalidating pack
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[WRITE(offset=75210129408, length=131072)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[WRITE(offset=1523864121344, length=8192)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[WRITE(offset=1523863994368, length=126976)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[WRITE(offset=75210260480, length=53248)]
> GEOM_ELI: g_eli_write_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[WRITE(offset=1516022124544, length=32768)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/ea5ef253-bcc1-11e6-bd79-000c2957bde2.eli[READ(offset=9998683348992, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=270336, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=9998683086848, length=8192)]
> GEOM_ELI: g_eli_read_done() failed (error=6) gptid/b58d0fd5-be7a-11e6-a080-000c2957bde2.eli[READ(offset=99986833489


It's happening every 1-2 days, each time with one or two different hard drives. All of them are less than 3 weeks old (6x Seagate IronWolf 10TB in RAID-Z2) and their SMART data is perfectly fine (no reallocated sectors or anything like that).
My controller already has the newest firmware, and strangely these errors never happened with my old WD Red 6TB drives. Can anyone help me? I'm frightened that I could lose my complete pool if this strange error takes down three hard drives at the same time.

I attached all the debug stuff. If you need any additional info, I'll gladly provide it.

Thank you very much,

daniels7
 

Attachments

  • debug-20161223093027.tgz (570.8 KB)

daniels7

Cadet
Joined
Oct 23, 2014
Messages
8
Could anyone please respond?
The error is happening again and again, each time on different drives, and I'm seriously worried about it.
 
Joined
Jan 7, 2015
Messages
1,155
I was once getting a lot of similar jabber in my console and logs on new drives. I noticed it was happening randomly on the same three drives, which were all on the same PSU rail. I checked the IPMI event logs and saw that the 5V rail was throwing errors. I replaced the PSU, spread my drives more evenly over all rails, and did away with some old, inferior-quality Molex-to-SATA power adapters. I think I also bought new SFF SAS-to-SATA cables. If the errors happen on new drives, I think you should look toward the HBA, cabling, or faulty power of some sort, as has been pointed out. I have not had any scary emails in months, other than Plex Scanner dying on signal 11. This was happening to me after I had bought and burnt in 12 drives in batches of 5, then 2 more, so I knew the drives were OK. It started throwing errors once all 12 were spinning. I hope this helps. Good luck.
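If you have IPMI, checking the power side is quick; a sketch using ipmitool from the shell:

Code:
ipmitool sel list                # event log; look for voltage-related entries
ipmitool sensor | grep -i volt   # live voltage readings and their thresholds
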
 