I think a drive is bad. Please confirm

Fuganater · Nov 12, 2015

So I got an email from my FreeNAS box.

Code:

The volume Vol1 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

I checked the box and, sure enough, the little red alert light is blinking and shows the error.

Code:

 CRITICAL: The volume Vol1 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

So I did a 'zpool status'

Code:

[root@Rick_James] ~# zpool status
  pool: Vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 5.23M in 0h0m with 0 errors on Fri Nov 13 00:14:30 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f59a6e80-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f6a0ed65-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f7a889f1-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f8ad1426-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/fd5cb19c-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fe6f518a-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/ff728308-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/007e4b9a-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0189e083-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/028825dd-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     9     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
        spares
          gptid/051cb5fd-87c3-11e5-9257-0cc47a6bd0ac    AVAIL

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/8ead9abf-7f56-11e5-80f4-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/8ed85293-7f56-11e5-80f4-0cc47a6bd0ac  ONLINE       0     0     0

errors: No known data errors

gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac has '9' under WRITE, which I assume is bad since all the other drives show zeros.
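
Spotting the one non-zero counter by eye gets harder as the pool grows. The following is a minimal sketch, not part of the original post, that scans text in the shape of the `zpool status` output above and lists every row with a non-zero READ, WRITE, or CKSUM count. The function name `devices_with_errors` and the `SAMPLE` text are illustrative assumptions, and the sketch only handles plain integer counters.

```python
import re

# One device row of `zpool status`: NAME STATE READ WRITE CKSUM.
# Assumption: counters are plain integers; abbreviated counts are not handled.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank line, spare, or other non-device row
        counts = tuple(int(match.group(i)) for i in (3, 4, 5))
        if any(counts):
            flagged[match.group(1)] = counts
    return flagged

# Excerpt copied from the output above, used only as sample input.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     9     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
"""

for name, (read, write, cksum) in devices_with_errors(SAMPLE).items():
    print(name, "read:", read, "write:", write, "cksum:", cksum)
```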

Next I checked which drive that is and found it is 'da2'.

Code:

Geom name: da2
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 3907029134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da2p1
   Mediasize: 2147483648 (2.0G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   rawuuid: 0377b73d-87c3-11e5-9257-0cc47a6bd0ac
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: 1
   length: 2147483648
   offset: 65536
   type: freebsd-swap
   index: 1
   end: 4194431
   start: 128
2. Name: da2p2
   Mediasize: 1998251364352 (1.8T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 0385f501-87c3-11e5-9257-0cc47a6bd0ac
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: 1
   length: 1998251364352
   offset: 2147549184
   type: freebsd-zfs
   index: 2
   end: 3907029127
   start: 4194432
Consumers:
1. Name: da2
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e5
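
The match works because the `rawuuid` of partition da2p2 is the same string as the `gptid/...` label in the pool listing. As a rough sketch (the post does not show which command produced this listing; it looks like `gpart list`, which is an assumption), the lookup could be automated as below. `rawuuid_to_partition` and `SAMPLE` are hypothetical names introduced for illustration.

```python
def rawuuid_to_partition(gpart_text):
    """Map each partition's rawuuid to its provider name (for example 'da2p2')."""
    mapping = {}
    current = None
    for line in gpart_text.splitlines():
        stripped = line.strip()
        # Provider and consumer entries look like "2. Name: da2p2".
        if ". Name:" in stripped:
            current = stripped.split("Name:", 1)[1].strip()
        elif stripped.startswith("rawuuid:") and current:
            mapping[stripped.split(":", 1)[1].strip()] = current
    return mapping

# Shortened excerpt of the listing above, used only as sample input.
SAMPLE = """\
Geom name: da2
Providers:
1. Name: da2p1
   rawuuid: 0377b73d-87c3-11e5-9257-0cc47a6bd0ac
   type: freebsd-swap
2. Name: da2p2
   rawuuid: 0385f501-87c3-11e5-9257-0cc47a6bd0ac
   type: freebsd-zfs
Consumers:
1. Name: da2
"""

pool_label = "gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac"
uuid = pool_label.split("/", 1)[1]  # drop the "gptid/" prefix
print(rawuuid_to_partition(SAMPLE).get(uuid))
```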

I ran a short SMART test on it and got these results.

Code:

[root@Rick_James] ~# smartctl -A /dev/da2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   175   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       312
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       169
194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Nothing jumps out at me that it is a bad drive. Am I missing something?

Fuganater · Nov 12, 2015

Several of my drives are now showing errors under WRITE. I am currently copying data from my old Windows server to this server.

Code:

[root@Rick_James] ~# zpool status clear Vol1
cannot open 'clear': no such pool
  pool: Vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Nov 13 00:50:22 2015
        83.4G scanned out of 689G at 417M/s, 0h24m to go
        0 repaired, 12.11% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f59a6e80-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f6a0ed65-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     8     0
            gptid/f7a889f1-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f8ad1426-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/fd5cb19c-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
            gptid/fe6f518a-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     8     0
            gptid/ff728308-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
            gptid/007e4b9a-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0189e083-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     6     0
            gptid/028825dd-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     9     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
        spares
          gptid/051cb5fd-87c3-11e5-9257-0cc47a6bd0ac    AVAIL

errors: No known data errors

tvsjr · Nov 12, 2015

What type of drives? Did you burn them in prior to use? Tried a long SMART test? Seems odd that so many drives would start showing errors at once.

Fuganater · Nov 12, 2015

Yes, I did burn-in testing and all was good.

These are all WD 2TB Reds.

Here are the results of a long test.

Code:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   175   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       319
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       169
194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

solarisguy · Nov 12, 2015

What about the SATA cables? Are they firmly locked into drives and on the other end?

Any errors on the console or in /var/log/messages ?

Fuganater · Nov 12, 2015

solarisguy said:
What about the SATA cables? Are they firmly locked into drives and on the other end?

Any errors on the console or in /var/log/messages ?

My drives are in a Supermicro SC846-E1 and are all seated firmly in the bays. I'll check the log when I get home from work.

Robert Trevellyan · Nov 12, 2015

Fuganater said:
Here are the results of a long test.

This is a bit of a non-sequitur, but I think worth mentioning to dispel confusion. The smartctl command does not give the results of a short or long SMART test. What it does is display the values of various parameters and contents of logs that the drive maintains. A short or long SMART test is a completely separate operation, and it either passes or fails. The failure or success of each SMART test is logged and can be seen in the output from smartctl.
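
To make the distinction concrete: starting a self-test and reading its logged result are two separate smartctl invocations (`-t` to start a test, `-l selftest` to show the self-test log). The sketch below only builds and prints those two command lines; the helper names `selftest_commands` and `run` are hypothetical, and actually running them assumes smartctl is installed and the caller has root privileges.

```python
import subprocess

def selftest_commands(device, kind="long"):
    """Build two smartctl invocations: start a self-test, then read its log."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    start = ["smartctl", "-t", kind, device]          # kicks off the test
    read_log = ["smartctl", "-l", "selftest", device]  # shows pass/fail history
    return start, read_log

def run(cmd):
    """Run one command and return its stdout (needs smartctl and root)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

start, read_log = selftest_commands("/dev/da2")
print(" ".join(start))
print(" ".join(read_log))
```

A long test takes hours on a drive of this size, so the log is only worth reading after the test has had time to finish.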

Fuganater · Nov 12, 2015

I just got an email that my pool is now degraded. What in the world is going on... I'll post what I find when I get home from work, but this is so odd since all drives passed all burn-in testing. I had the system up for 2 weeks before throwing data at it, and that seems to be what is causing the errors.

solarisguy · Nov 12, 2015

Which drive has failed?

And again, although the hard drive failure is the most likely option, you cannot a priori exclude cabling and SATA port faults.

Fuganater · Nov 12, 2015

solarisguy said:
Which drive has failed?

And again, although the hard drive failure is the most likely option, you cannot a priori exclude cabling and SATA port faults.

I think da2 is the problem... I could swap out the SFF-8087 cable with another and/or move da2 to a different bay.

Fuganater · Nov 13, 2015

Since all of my drives are showing write errors it must be the cable. I am going to swap it out, do a 'zpool clear' and move more data and see if the errors come back. If they do then the backplane must be the issue.

Code:

[root@Rick_James] ~# zpool status Vol1
  pool: Vol1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 5.47M in 0h0m with 0 errors on Fri Nov 13 12:17:52 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f59a6e80-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    14     0
            gptid/f6a0ed65-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     6     0
            gptid/f7a889f1-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    22     0
            gptid/f8ad1426-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    27     0
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    34     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    20     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    39     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/fd5cb19c-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    14     0
            gptid/fe6f518a-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    21     0
            gptid/ff728308-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    20     0
            gptid/007e4b9a-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
            gptid/0189e083-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0    27     0
            gptid/028825dd-87c3-11e5-9257-0cc47a6bd0ac  DEGRADED     0    52     0  too many errors
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0    34     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0    21     0
        spares
          gptid/051cb5fd-87c3-11e5-9257-0cc47a6bd0ac    AVAIL

errors: No known data errors

Fuganater · Nov 13, 2015

New cable installed and the counters are all at zero. I am moving the last folder of 3.8TB to the FreeNAS box. We shall see if the fleebay cable is the issue.

Fuganater · Nov 13, 2015

130GB in and I am already starting to see errors. I am currently using the Supermicro SFF-8087 cable that came with the chassis.

Code:

[root@Rick_James] ~# zpool status Vol1
  pool: Vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 5.46M in 0h0m with 0 errors on Fri Nov 13 21:11:35 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f59a6e80-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f6a0ed65-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f7a889f1-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f8ad1426-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/fd5cb19c-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fe6f518a-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/ff728308-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/007e4b9a-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0189e083-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/028825dd-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
        spares
          gptid/051cb5fd-87c3-11e5-9257-0cc47a6bd0ac    AVAIL

errors: No known data errors

Fuganater · Nov 13, 2015

solarisguy said:
What about the SATA cables? Are they firmly locked into drives and on the other end?

Any errors on the console or in /var/log/messages ?

I checked /var/log/messages; here is a snippet of what it is full of:

Code:

Nov 13 21:55:21 Rick_James      (da16:mps0:0:25:0): WRITE(10). CDB: 2a 00 21 80 23 d0 00 00 e0 00 length 114688 SMID 349 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): WRITE(10). CDB: 2a 00 21 80 23 d0 00 00 e0 00
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): CAM status: SCSI Status Error
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): SCSI status: Check Condition
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): Retrying command (per sense data)
Nov 13 21:55:24 Rick_James      (da11:mps0:0:20:0): WRITE(10). CDB: 2a 00 21 81 3f e0 00 00 e8 00 length 118784 SMID 707 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): WRITE(10). CDB: 2a 00 21 81 3f e0 00 00 e8 00
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): CAM status: SCSI Status Error
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): SCSI status: Check Condition
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): Retrying command (per sense data)
Nov 13 21:55:27 Rick_James      (da9:mps0:0:18:0): WRITE(10). CDB: 2a 00 21 81 f9 80 00 00 e0 00 length 114688 SMID 662 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:28 Rick_James (da9:mps0:0:18:0): WRITE(10). CDB: 2a 00 21 81 f9 80 00 00 e0 00
Nov 13 21:55:28 Rick_James (da9:mps0:0:18:0): CAM status: SCSI Status Error
Nov 13 21:55:28 Rick_James (da9:mps0:0:18:0): SCSI status: Check Condition
Nov 13 21:55:28 Rick_James (da9:mps0:0:18:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:28 Rick_James (da9:mps0:0:18:0): Retrying command (per sense data)
Nov 13 21:55:28 Rick_James      (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 1f 81 fd 40 00 00 e8 00 length 118784 SMID 340 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:29 Rick_James (da1:mps0:0:9:0): WRITE(10). CDB: 2a 00 1f 81 fd 40 00 00 e8 00
Nov 13 21:55:29 Rick_James (da1:mps0:0:9:0): CAM status: SCSI Status Error
Nov 13 21:55:29 Rick_James (da1:mps0:0:9:0): SCSI status: Check Condition
Nov 13 21:55:29 Rick_James (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:29 Rick_James (da1:mps0:0:9:0): Retrying command (per sense data)
Nov 13 21:55:39 Rick_James      (da14:mps0:0:23:0): WRITE(10). CDB: 2a 00 1f 84 c3 c0 00 00 e0 00 length 114688 SMID 905 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:40 Rick_James (da14:mps0:0:23:0): WRITE(10). CDB: 2a 00 1f 84 c3 c0 00 00 e0 00
Nov 13 21:55:40 Rick_James (da14:mps0:0:23:0): CAM status: SCSI Status Error
Nov 13 21:55:40 Rick_James (da14:mps0:0:23:0): SCSI status: Check Condition
Nov 13 21:55:40 Rick_James (da14:mps0:0:23:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:40 Rick_James (da14:mps0:0:23:0): Retrying command (per sense data)
Nov 13 21:55:48 Rick_James      (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 1f 87 0f 48 00 00 d8 00 length 110592 SMID 919 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:49 Rick_James (da3:mps0:0:11:0): WRITE(10). CDB: 2a 00 1f 87 0f 48 00 00 d8 00
Nov 13 21:55:49 Rick_James (da3:mps0:0:11:0): CAM status: SCSI Status Error
Nov 13 21:55:49 Rick_James (da3:mps0:0:11:0): SCSI status: Check Condition
Nov 13 21:55:49 Rick_James (da3:mps0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:49 Rick_James (da3:mps0:0:11:0): Retrying command (per sense data)
Nov 13 21:55:54 Rick_James      (da6:mps0:0:15:0): WRITE(10). CDB: 2a 00 21 88 d2 c0 00 00 d8 00 length 110592 SMID 523 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:54 Rick_James      (da0:mps0:0:8:0): WRITE(10). CDB: 2a 00 1f 88 ac 80 00 00 e0 00 length 114688 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Nov 13 21:55:55 Rick_James (da0:mps0:0:8:0): READ(10). CDB: 28 00 09 c0 cf 70 00 00 08 00
Nov 13 21:55:55 Rick_James (da0:mps0:0:8:0): CAM status: SCSI Status Error
Nov 13 21:55:55 Rick_James (da0:mps0:0:8:0): SCSI status: Check Condition
Nov 13 21:55:55 Rick_James (da0:mps0:0:8:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:55 Rick_James (da0:mps0:0:8:0): Retrying command (per sense data)
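
With this many lines, a per-device tally shows at a glance whether the resets cluster on one drive or are spread across the controller. The sketch below counts the `UNIT ATTENTION` lines per `daN` device in text shaped like the excerpt above; `resets_per_device` and `SAMPLE` are hypothetical names, and the pattern assumes the `(daN:mpsN:...)` prefix seen in this log.

```python
import re
from collections import Counter

# Device prefix as it appears in the log, for example "(da16:mps0:0:25:0):".
DEVICE = re.compile(r"\((da\d+):mps\d+:")

def resets_per_device(log_text):
    """Count 'UNIT ATTENTION' sense lines per da device."""
    counts = Counter()
    for line in log_text.splitlines():
        if "UNIT ATTENTION" not in line:
            continue
        match = DEVICE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Three lines copied from the log above, used only as sample input.
SAMPLE = """\
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): CAM status: SCSI Status Error
Nov 13 21:55:22 Rick_James (da16:mps0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 13 21:55:25 Rick_James (da11:mps0:0:20:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
"""

for device, count in sorted(resets_per_device(SAMPLE).items()):
    print(device, count)
```

In the excerpt above the resets hit many different drives, all behind mps0, which points toward something they share rather than a single bad disk.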

Robert Trevellyan · Nov 13, 2015

Fuganater said:
SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

Could be a power issue.

If you post the output from smartctl -a or -x we can see if the drives are logging errors internally.

Have you double-checked that your HBA firmware version is correct?

EDIT: Please post full hardware specs too.

Bidule0hm · Nov 13, 2015

Sounds like a cable issue or a power issue. What's your PSU? Do you use any Molex-to-SATA adapters or SATA splitters?

Fuganater · Nov 13, 2015

Robert Trevellyan said:
Could be a power issue.

If you post the output from smartctl -a or -x we can see if the drives are logging errors internally.

Have you double-checked that your HBA firmware version is correct?

EDIT: Please post full hardware specs too.

Attached are the results for all 17 drives.

Supermicro SC-846E1-900B
Supermicro X10SRL-F
Xeon E5-1620 V3
32GB ECC Samsung RAM
2x 16GB Sandisk Ultra Fit (Mirrored install)
17x 2TB WD Reds (8x2 RAIDZ2, 8x2 RAIDZ2, 1 hot spare)

Bidule0hm said:
Sounds like a cable issue or a power issue. What's your PSU? do you use any molex to SATA adapter? or SATA splitter?

I'm on cable #2. I have 1 left I could try. PSU is the Supermicro 900B that comes with the chassis. No adapters. No splitters.

Fuganater · Nov 13, 2015

da12 has a crap ton of UDMA_CRC_Error_Count

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       43
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       14983
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   110   106   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   001   000    Old_age   Always       -       381159
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Robert Trevellyan · Nov 13, 2015

Fuganater said:
da12 has a crap ton of UDMA_CRC_Error_Count

Indeed. According to Wikipedia, #199 indicates "The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check)."

I don't see any errors logged in the SMART output, which to me indicates that the issue is not with the drives. But please do a smartctl -x on da12, just as a sanity check.

This isn't the cause of your problems, but in my opinion, running short SMART tests every 24 hours is way too often.
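
Checking attribute 199 across 17 drives by hand is tedious. As a minimal sketch, assuming the ten-column `smartctl -A` table layout posted earlier in this thread, the raw value of any attribute can be pulled out as below. `raw_value` and `SAMPLE` are hypothetical names, and only the first token of the RAW_VALUE column is read.

```python
def raw_value(attributes_text, attribute_id):
    """Return the RAW_VALUE (as int) for one attribute ID, or None if absent."""
    for line in attributes_text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None

# Two rows copied from the da12 table above, used only as sample input.
SAMPLE = """\
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   001   000    Old_age   Always       -       381159
"""

crc_errors = raw_value(SAMPLE, 199)
if crc_errors:
    print("UDMA_CRC_Error_Count is non-zero:", crc_errors)
```

The counter is cumulative over the drive's life, so comparing it before and after a large copy says more than a single reading.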

Fuganater · Nov 13, 2015

Code:

[root@Rick_James] ~# smartctl -x /dev/da12
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WMC4M1147045
LU WWN Device Id: 5 0014ee 058fc3759
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov 14 00:03:06 2015 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (27540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 278) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   175   174   021    -    4208
  4 Start_Stop_Count        -O--CK   100   100   000    -    43
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   080   080   000    -    14984
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    43
192 Power-Off_Retract_Count -O--CK   200   200   000    -    33
193 Load_Cycle_Count        -O--CK   200   200   000    -    9
194 Temperature_Celsius     -O---K   111   106   000    -    36
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   001   000    -    381159
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb7  GPL,SL  VS       1  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 3
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 [2] occurred at disk power-on lifetime: 14969 hours (623 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 13 4a b0 a0 40 00  Error: IDNF at LBA = 0x134ab0a0 = 323661984

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 e0 00 00 00 00 13 4a b0 a0 40 00 11d+17:05:52.953  WRITE FPDMA QUEUED
  61 00 e0 00 00 00 00 13 4a af c0 40 00 11d+17:05:52.948  WRITE FPDMA QUEUED
  61 00 e0 00 00 00 00 13 4a ae e0 40 00 11d+17:05:52.946  WRITE FPDMA QUEUED
  61 00 e8 00 00 00 00 13 4a ad f8 40 00 11d+17:05:52.945  WRITE FPDMA QUEUED
  61 01 00 00 00 00 00 13 4a ac f8 40 00 11d+17:05:52.943  WRITE FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 14961 hours (623 days + 9 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 5e fb f0 40 00  Error: IDNF at LBA = 0x005efbf0 = 6224880

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 e0 00 00 00 00 00 5e fb f0 40 00 11d+09:22:50.373  WRITE FPDMA QUEUED
  61 00 f8 00 00 00 00 00 5e fa f8 40 00 11d+09:22:50.371  WRITE FPDMA QUEUED
  61 01 00 00 00 00 00 00 5e f9 f8 40 00 11d+09:22:50.346  WRITE FPDMA QUEUED
  61 00 e8 00 00 00 00 00 5e f9 10 40 00 11d+09:22:50.324  WRITE FPDMA QUEUED
  61 00 f0 00 00 00 00 00 5e f8 20 40 00 11d+09:22:50.264  WRITE FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 14960 hours (623 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 07 5d 2a 98 40 00  Error: IDNF at LBA = 0x075d2a98 = 123546264

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 08 00 00 07 5c 9c 50 40 00 11d+08:32:54.098  READ FPDMA QUEUED
  61 00 e0 00 00 00 00 07 5d 2a 98 40 00 11d+08:32:53.476  WRITE FPDMA QUEUED
  61 00 e8 00 00 00 00 07 5d 29 b0 40 00 11d+08:32:53.472  WRITE FPDMA QUEUED
  61 00 e0 00 00 00 00 07 5d 28 d0 40 00 11d+08:32:53.470  WRITE FPDMA QUEUED
  61 00 e0 00 00 00 00 07 5d 27 f0 40 00 11d+08:32:53.468  WRITE FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14961         -
# 2  Short offline       Completed without error       00%     14937         -
# 3  Short offline       Completed without error       00%     14913         -
# 4  Short offline       Completed without error       00%     14889         -
# 5  Short offline       Completed without error       00%     14865         -
# 6  Short offline       Completed without error       00%     14841         -
# 7  Extended offline    Completed without error       00%     14824         -
# 8  Short offline       Completed without error       00%     14817         -
# 9  Short offline       Completed without error       00%     14793         -
#10  Conveyance offline  Completed without error       00%     14788         -
#11  Extended offline    Completed without error       00%     14773         -
#12  Short offline       Completed without error       00%     14764         -
#13  Extended offline    Completed without error       00%     14681         -
#14  Short offline       Completed without error       00%     14674         -
#15  Short offline       Completed without error       00%     14664         -
#16  Short offline       Completed without error       00%     14663         -
#17  Short offline       Completed without error       00%     14662         -
#18  Short offline       Completed without error       00%     14661         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    36 Celsius
Power Cycle Min/Max Temperature:     36/37 Celsius
Lifetime    Min/Max Temperature:     22/41 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (85)

Index    Estimated Time   Temperature Celsius
  86    2015-11-13 16:06    39  ********************
 ...    ..( 35 skipped).    ..  ********************
 122    2015-11-13 16:42    39  ********************
 123    2015-11-13 16:43     ?  -
 124    2015-11-13 16:44    36  *****************
 ...    ..(114 skipped).    ..  *****************
 239    2015-11-13 18:39    36  *****************
 240    2015-11-13 18:40    37  ******************
 ...    ..( 44 skipped).    ..  ******************
 285    2015-11-13 19:25    37  ******************
 286    2015-11-13 19:26    36  *****************
 ...    ..( 20 skipped).    ..  *****************
 307    2015-11-13 19:47    36  *****************
 308    2015-11-13 19:48    38  *******************
 ...    ..( 41 skipped).    ..  *******************
 350    2015-11-13 20:30    38  *******************
 351    2015-11-13 20:31    39  ********************
 ...    ..(211 skipped).    ..  ********************
  85    2015-11-14 00:03    39  ********************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2           10  R_ERR response for non-data FIS
0x0006  2           10  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2           49  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           50  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        12252  Vendor specific
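
For anyone reading a dump like the one above, here is a minimal Python sketch (not part of the original post) that pulls the raw values out of the attribute table and separates counters usually tied to the SATA link from counters usually tied to the media. The watchlists `LINK_ATTRS` and `MEDIA_ATTRS` and both function names are assumptions made for illustration; the sample rows are copied from the output above.

```python
import re

# One row of the attribute table as printed above:
# ID  NAME  FLAGS  VALUE  WORST  THRESH  FAIL  RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+([A-Z-]{6})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: raw_value} for every table row found."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = int(m.group(8))
    return attrs

# Hypothetical watchlists, chosen for illustration only.
LINK_ATTRS = ("UDMA_CRC_Error_Count",)
MEDIA_ATTRS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def nonzero(attrs, names):
    """Keep only the listed attributes whose raw value is above zero."""
    return {n: attrs[n] for n in names if attrs.get(n, 0) > 0}

# Rows copied from the smartctl output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   001   000    -    381159
"""

attrs = parse_attributes(SAMPLE)
print("link counters: ", nonzero(attrs, LINK_ATTRS))
print("media counters:", nonzero(attrs, MEDIA_ATTRS))
```

On this drive the only non-zero counter in either list is UDMA_CRC_Error_Count, which is commonly associated with cabling or connector trouble rather than with the platters. As far as I know the raw value is a running total that does not reset, so what matters is whether it keeps climbing after the cable is reseated or swapped.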

Do you think it could still be the cable? After 1.02 TB of data, the error counts are unchanged (see below).

Code:

[root@Rick_James] ~# zpool status Vol1
  pool: Vol1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Nov 13 22:09:32 2015
        2.91T scanned out of 4.12T at 441M/s, 0h47m to go
        0 repaired, 70.59% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/f59a6e80-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f6a0ed65-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f7a889f1-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f8ad1426-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    13     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     9     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/fd5cb19c-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fe6f518a-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/ff728308-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/007e4b9a-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0189e083-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/028825dd-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0385f501-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/0450883e-87c3-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
        spares
          gptid/051cb5fd-87c3-11e5-9257-0cc47a6bd0ac    AVAIL

errors: No known data errors
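
To check whether the counters are really holding steady, one option is to save `zpool status` before and after a load and compare the per-device numbers. The following is a rough Python sketch under that assumption, not a FreeNAS tool: the function names `error_counts` and `grown` are made up for illustration, and the snapshot rows are copied from the output above.

```python
import re

# Device rows in the config section: NAME STATE READ WRITE CKSUM.
# Assumes plain integer counters; abbreviated counts (for example
# with a K suffix) would not match and are skipped by this sketch.
DEV = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

def error_counts(text):
    """Return {device: (read, write, cksum)} for each matching row."""
    counts = {}
    for line in text.splitlines():
        m = DEV.match(line)
        if m:
            counts[m.group(1)] = tuple(int(m.group(i)) for i in (3, 4, 5))
    return counts

def grown(before, after):
    """Return devices whose counters differ between two snapshots."""
    zero = (0, 0, 0)
    return {
        dev: (before.get(dev, zero), now)
        for dev, now in after.items()
        if now != before.get(dev, zero)
    }

# Rows copied from the zpool status output in this thread.
SNAPSHOT = """\
            gptid/f9788d6d-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     0     0
            gptid/fa86dfdf-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0    13     0
            gptid/fb895231-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     9     0
            gptid/fc8ac078-87c2-11e5-9257-0cc47a6bd0ac  ONLINE       0     7     0
"""

before = error_counts(SNAPSHOT)
after = error_counts(SNAPSHOT)  # stand-in for a later capture
print("devices with errors:", {d: c for d, c in before.items() if any(c)})
print("changed since last capture:", grown(before, after))
```

With identical snapshots the second dictionary is empty, which matches the "same error count" observation. Three neighbouring disks showing write errors at once may be worth noting when deciding between a single bad drive and something they share, such as a cable, backplane, or controller port.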
