SCSI errors after drive replacement

DashKay

Cadet
Joined
Mar 19, 2020
Messages
2
Hi all,

I have a FreeNAS system with a degraded RAID-Z2 volume and I'm not sure of the best way to proceed; I'm hoping someone can advise me on how to stay safe. The system has been through various upgrades and hardware changes over its lifetime but currently consists of:

- MSI Z270M Mortar motherboard
- Core i7-7700K CPU
- 64GB DDR4 (not ECC)
- Raidmax RX-1000AE PSU
- Norco RPC-3216 chassis
- 2x LSI 9211-8i controllers (originally Dell M1015s) in IT mode
- Crucial CT120BX300SSD1 boot drive
- 8x 6TB WD Red data drives in raidz2:

Code:
[root@booboo ~]# for i in {0..7}; do smartctl -a /dev/da$i | egrep "Device Model|Power_On_Hours"; done
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40765
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   054   054   000    Old_age   Always       -       34144
Device Model:     WDC WD60EFRX-68L0BN1
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7981
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40765
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40765
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40765
Device Model:     WDC WD60EFRX-68MYMN1
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40712
Device Model:     WDC WD60EFAX-68JH4N0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       40


It had been running fine for quite a while until one of the data drives recently started throwing SMART errors, so I replaced it. The problem is that WD has "upgraded" the EFRX to the EFAX, which now uses SMR instead of PMR, and I cannot get the array to accept the new drive. Any time I attempt a resilver the new drive gets faulted, like this:

Code:
[root@booboo ~]# zpool status Storage
  pool: Storage
 state: DEGRADED
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub in progress since Thu Mar 19 13:29:09 2020
    6.14T scanned at 1.10G/s, 1.21T issued at 1.00G/s, 32.7T total
    0 repaired, 3.69% done, 0 days 08:56:29 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    Storage                                           DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/835dedeb-d1b8-11e4-afe0-6c626d4af35e    ONLINE       0     0     0
        gptid/4feeb1c8-a914-11e5-93c0-00259028b247    ONLINE       0     0     0
        gptid/84461997-d1b8-11e4-afe0-6c626d4af35e    ONLINE       0     0     0
        gptid/84b98196-d1b8-11e4-afe0-6c626d4af35e    ONLINE       0     0     0
        gptid/e39a940b-0569-11e9-bb96-000c2959bfd4    ONLINE       0     0     0
        gptid/85aae6f2-d1b8-11e4-afe0-6c626d4af35e    ONLINE       0     0     0
        replacing-6                                   UNAVAIL      0     0     6
          1065334894515511882                         UNAVAIL      0     0     0  was /dev/gptid/862982d0-d1b8-11e4-afe0-6c626d4af35e
          gptid/eac356b2-67ff-11ea-89d8-4ccc6ad6a297  FAULTED      0    69     0  too many errors
        gptid/86a13781-d1b8-11e4-afe0-6c626d4af35e    ONLINE       0     0     0

errors: No known data errors


and the dmesg fills with this:

Code:
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 b7 47 10 00 00 30 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x80b74710
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb aa 58 00 00 60 00 length 49152 SMID 1050 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb aa 58 00 00 60 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb a9 58 00 01 00 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x80bba958
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb ab c0 00 00 60 00 length 49152 SMID 1057 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 227 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 638 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb ab c0 00 00 60 00 
    (da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 760 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 bb aa c0 00 01 00 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x80bbaac0
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e2 90 00 00 60 00 length 49152 SMID 395 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e2 90 00 00 60 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e1 90 00 01 00 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x45e190
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00 length 8192 SMID 462 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 808 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 305 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 c2 d8 00 00 88 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x45c2d8
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 587 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e2 f8 00 01 00 00 length 131072 SMID 223 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e3 f8 00 00 60 00 length 49152 SMID 852 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e2 f8 00 01 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e3 f8 00 00 60 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x2baa0f090
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e3 f8 00 00 60 00 length 49152 SMID 981 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e3 f8 00 00 60 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 45 e2 f8 00 01 00 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x45e2f8
(da7:mps1:0:3:0): Error 22, Unretryable error
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 4a 31 98 00 00 b0 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x4a3198
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 length 8192 SMID 112 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 913 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
    (da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 1077 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): READ(16). CDB: 88 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 4a 33 50 00 00 60 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x4a3350
(da7:mps1:0:3:0): Error 22, Unretryable error
    (da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00 length 8192 SMID 326 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 length 8192 SMID 214 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
    (da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 length 8192 SMID 493 terminated ioc 804b loginfo 31080000 scsi 0 state 0 xfer 0
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 40 02 90 00 00 10 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f2 90 00 00 00 10 00 00 
(da7:mps1:0:3:0): CAM status: CCB request completed with an error
(da7:mps1:0:3:0): Retrying command
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 4a 2b 80 00 00 38 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x4a2b80
(da7:mps1:0:3:0): Error 22, Unretryable error
(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 00 4e 73 e8 00 00 b0 00 
(da7:mps1:0:3:0): CAM status: SCSI Status Error
(da7:mps1:0:3:0): SCSI status: Check Condition
(da7:mps1:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da7:mps1:0:3:0): Info: 0x4e73e8
(da7:mps1:0:3:0): Error 22, Unretryable error


Note: The new drive checks out fine according to SMART long self-test.
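
One way to sanity-check the "Logical block address out of range" sense data in the dmesg output above is to decode the LBA and transfer length from each CDB and compare them with the drive's capacity. Below is a minimal Python sketch of that check. The names `decode_cdb` and `ASSUMED_TOTAL_SECTORS` are made up for illustration, and the sector count is an assumption: it comes from the IDEMA capacity formula for a 6 TB drive with 512-byte logical sectors, so it should be checked against what the drive itself reports. The 512-byte block size at least matches the log, where a transfer length of 0x60 blocks is reported as 49152 bytes.

```python
import re

# Assumption: sector count from the IDEMA formula for a 6000 GB drive with
# 512-byte logical sectors. Confirm against the capacity the drive reports.
ASSUMED_TOTAL_SECTORS = 97_696_368 + 1_953_504 * (6000 - 50)

# Matches the command name and the hex bytes of the CDB in a CAM/mps log line.
CDB_RE = re.compile(r"(?:READ|WRITE)\((10|16)\)\. CDB: ((?:[0-9a-f]{2} ?)+)")


def decode_cdb(line):
    """Return (lba, block_count) for a READ/WRITE(10|16) log line, else None."""
    m = CDB_RE.search(line)
    if not m:
        return None
    cdb = bytes.fromhex(m.group(2).replace(" ", ""))
    if len(cdb) != int(m.group(1)):
        return None  # truncated or torn line
    if len(cdb) == 10:
        # 10-byte CDB: LBA in bytes 2-5, transfer length in bytes 7-8
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    # 16-byte CDB: LBA in bytes 2-9, transfer length in bytes 10-13
    return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")


if __name__ == "__main__":
    # Two failing writes copied from the dmesg output above.
    sample = [
        "(da7:mps1:0:3:0): WRITE(10). CDB: 2a 00 80 b7 47 10 00 00 30 00 ",
        "(da7:mps1:0:3:0): WRITE(16). CDB: 8a 00 00 00 00 02 ba a0 f0 90 00 00 00 10 00 00 ",
    ]
    for line in sample:
        lba, count = decode_cdb(line)
        in_range = lba + count <= ASSUMED_TOTAL_SECTORS
        print(f"LBA {lba:#x} + {count} blocks, within assumed capacity: {in_range}")
```

Under that assumed capacity, both sample writes address blocks inside the drive. If the real sector count matches, that would suggest the drive is rejecting valid requests rather than being asked to write past its end.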

So far I have tried:
- Moving the drive to a different slot
- Upgrading FreeNAS (was 11.2 U7, now 11.3 U1; this is why zpool status suggests a zpool upgrade above)
- Updating controller firmware and BIOS (was 20.00.04.00 / 07.39.00.00, now 20.00.07.00 / 07.39.02.00)

Nothing has made any difference. As far as I can tell my options now are:

- Find a WD60EFRX, accept that it's going to be pricey, and run degraded while I wait for it
- Swap for another type of HDD (WD Red Pro?)
- Swap the PSU (it's fairly recent so this feels like a bit of a Hail Mary)

I'm hesitant to keep stressing such old drives with resilvers without any confidence that it'll work. I can't find anything about the WD60EFAX that indicates why it would be an issue, so these all just seem like shots in the dark. I found some reports of problems with the WD60EFRX-68L0BN1 and I have one of those in the array, but it hasn't caused trouble at any point, and I'm skeptical that it could be affecting the new drive because they're on different controllers. I'm also hesitant to switch too far away from the EFRX, because if the switch to EFAX has caused issues, it seems like they would continue with other models, so I'm left without a convincing option.

Any suggestions?
 

DashKay

I fixed the issue so I figured I'd follow up here in case others run into the same problem.

Since the EFRX drives are now unreasonably expensive, I ended up replacing the EFAX drive with a WD Red Pro, WD6003FFBX-68MU3N0, after which the array resilvered perfectly on the first attempt: no read/write errors, no SCSI errors in dmesg, and no other problems since. It seems as though the WD60EFRX-68L0BN1, WD60EFRX-68MYMN1, and WD60EFAX-68JH4N0 are incompatible with each other in an 8-way array.

So:
- I'd recommend that people stay away from WD Red drives for now, at least in 8-way (or greater) arrays and/or with SAS2008 controllers. The "upgrades" that WD is introducing are clearly causing problems, and if you can't guarantee continued availability of spare drives then it's not worth the risk. Note that this is not the first such issue; see here for more info about the WD60EFRX-68L0BN1 faults.
- Putting my money where my mouth is, I'm planning to completely replace this array in the next few months and am willing to make this hardware (everything except MB/CPU/RAM) available to any FreeNAS or ZFS developers who might want to troubleshoot the problem on a known-bad configuration. Contact me here (particularly if you're in the SF Bay Area) if you want it.
 