SOLVED mps0: mpssas_prepare_remove: Sending reset for target ID 15 CAM status: CCB request aborted by the host

VioletDragon · Jan 16, 2021

Hi folks,

I have an issue with my server again. I'm getting these CAM errors. The log mentions the drive being re-attached, but zpool status shows otherwise: the array is not resilvering and the drive has not been removed from the array. I removed the drive and ran a SMART test on it yesterday, but I cannot see any issues. Would anyone have any ideas?

Code:

mps0: mpssas_prepare_remove: Sending reset for target ID 15
da10 at mps0 bus 0 scbus0 target 15 lun 0
mps0: da10: <ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): WRITE(10). CDB: 2a 00 02 2b 5a c0 00 00 60 00
Unfreezing devq for target ID 15
(da10:mps0:0:15:0): CAM status: CCB request aborted by the host
(da10:mps0:0:15:0): Error 5, Periph was invalidated
(da10:mps0:0:15:0): Periph destroyed
mps0: SAS Address for SATA device = 3c2f56516485484e
mps0: SAS Address from SATA device = 3c2f56516485484e
da10 at mps0 bus 0 scbus0 target 15 lun 0
da10: <ATA ST1000DM010-2EP1 CC43> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number Z9AR1MFP
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 953869MB (1953525168 512 byte sectors)
da10: quirks=0x8<4K>
ses0: da10,pass11: Element descriptor: 'ArrayDevice04'
ses0: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 4
ses0:  phy 0: SATA device
ses0:  phy 0: parent 500605b0000274bf addr 500605b0000274a4
mps0: mpssas_prepare_remove: Sending reset for target ID 15
da10 at mps0 bus 0 scbus0 target 15 lun 0
mps0: da10: Unfreezing devq for target ID 15
<ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): Periph destroyed
mps0: SAS Address for SATA device = 3c2f56516485484e
mps0: SAS Address from SATA device = 3c2f56516485484e
da10 at mps0 bus 0 scbus0 target 15 lun 0
da10: <ATA ST1000DM010-2EP1 CC43> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number Z9AR1MFP
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 953869MB (1953525168 512 byte sectors)
da10: quirks=0x8<4K>
ses0: da10,pass11: Element descriptor: 'ArrayDevice04'
ses0: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 4
ses0:  phy 0: SATA device
ses0:  phy 0: parent 500605b0000274bf addr 500605b0000274a4

Code:

% zpool status iSCSI
  pool: iSCSI
 state: ONLINE
  scan: resilvered 124M in 0 days 00:00:03 with 0 errors on Fri Jan 15 15:40:31 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    iSCSI                                           ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/7c39221d-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7cea36c6-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/470b7803-36c0-11eb-80e1-00074309b09a  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/7d87604a-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7e3eae42-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/f5b603a3-36d6-11eb-80e1-00074309b09a  ONLINE       0     0     0

errors: No known data errors

artlessknave · Jan 16, 2021

please God use the code tag.

VioletDragon · Jan 16, 2021

artlessknave said:
please God use the code tag.

Hi,

Just fixed that.

Samuel Tai · Jan 16, 2021

Is this drive on a backplane or just a cable connection?

VioletDragon · Jan 16, 2021

Samuel Tai said:
Is this drive on a backplane or just a cable connection?

SAS backplane with an expander built into it. GOOXI is what's printed on the board. My 2U chassis came with two backplanes, one with and one without a SAS expander. The problem doesn't appear to happen when there is high disk I/O; I'm currently running a scrub and there are no errors at the moment.

Samuel Tai · Jan 16, 2021

It’s not unknown for backplanes to go bad. Mine did, and I had to replace it, as my server wouldn’t even boot with any drives connected. In your case, it’s only a single slot that’s possibly going bad. If you have slots free, try moving that drive to another slot.

VioletDragon · Jan 16, 2021

Samuel Tai said:
It’s not unknown for backplanes to go bad. Mine did, and I had to replace it, as my server wouldn’t even boot with any drives connected. In your case, it’s only a single slot that’s possibly going bad. If you have slots free, try moving that drive to another slot.

All slots are full. I suppose I could put the other backplane in, but that means installing another SAS controller. What I don't get is why it is reporting that the drive is re-attached when it's still in the array. Scrubbing the array and high disk I/O don't trigger this, so I'm beginning to wonder if this is a bug more than anything.

If the drive is getting re-attached, surely the drive would be removed from the array and marked as degraded? Isn't it just the same as removing the drive and putting it back in again?

Samuel Tai · Jan 16, 2021

It’s no bug. The drive is disconnecting and reconnecting faster than the ZFS offline threshold.
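
To get a feel for how often the drive is dropping and coming back, the kernel messages can be tallied per drive. The following is only a minimal sketch: it assumes the detach and attach lines look like the ones quoted in the first post, and the names `count_flaps` and `SAMPLE` are made up for illustration. It says nothing about how long each outage lasted, since the quoted log carries no timestamps.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in the first post (an assumption
# about their shape, not a complete grammar of the mps/da driver messages).
DETACH_RE = re.compile(r"s/n (\S+) detached")
ATTACH_RE = re.compile(r"Serial Number (\S+)")


def count_flaps(log_text):
    """Count detach and attach messages per drive serial number."""
    detaches = Counter()
    attaches = Counter()
    for line in log_text.splitlines():
        match = DETACH_RE.search(line)
        if match:
            detaches[match.group(1)] += 1
            continue
        match = ATTACH_RE.search(line)
        if match:
            attaches[match.group(1)] += 1
    return detaches, attaches


# Lines copied from the log in the first post, trimmed to the relevant ones.
SAMPLE = """\
mps0: da10: <ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): Periph destroyed
da10: Serial Number Z9AR1MFP
<ATA ST1000DM010-2EP1 CC43> s/n Z9AR1MFP detached
(da10:mps0:0:15:0): Periph destroyed
da10: Serial Number Z9AR1MFP
"""

detaches, attaches = count_flaps(SAMPLE)
for serial in detaches:
    print(f"{serial}: {detaches[serial]} detach(es), {attaches[serial]} attach(es)")
```

Feeding it the full output of `dmesg` instead of `SAMPLE` would show whether the events are confined to one serial number or spread across several drives.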

VioletDragon · Jan 16, 2021

Samuel Tai said:
It’s no bug. The drive is disconnecting and reconnecting faster than the ZFS offline threshold.

I don't believe that is true. When you remove a drive and reinstall it, it takes a few seconds for the SAS expander to detect it: the light on the bay flashes for a while and then goes solid blue.

VioletDragon · Jan 16, 2021

Scrub passed without any issues.

Code:

% zpool status iSCSI
  pool: iSCSI
 state: ONLINE
  scan: scrub repaired 0 in 0 days 01:41:37 with 0 errors on Sat Jan 16 11:03:24 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    iSCSI                                           ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/7c39221d-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7cea36c6-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/470b7803-36c0-11eb-80e1-00074309b09a  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/7d87604a-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/7e3eae42-ea2c-11e9-b7f4-0017087d2e9a  ONLINE       0     0     0
        gptid/f5b603a3-36d6-11eb-80e1-00074309b09a  ONLINE       0     0     0

errors: No known data errors

VioletDragon · Jan 16, 2021

Urmm. Booted into Ubuntu Server and I get no such errors! I think FreeNAS is trolling me!

jgreco · Jan 16, 2021

And when you switched back, ....? Curious mostly because I've seen strange things cured by a reboot or power cycle.

VioletDragon · Jan 16, 2021

jgreco said:
And when you switched back, ....? Curious mostly because I've seen strange things cured by a reboot or power cycle.

Rebooting doesn't make much difference. I did a long SMART test but I don't see anything wrong here.

Code:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST1000DM010-2EP102
Serial Number:    Z9AR1MFP
LU WWN Device Id: 5 000c50 0b06317d9
Firmware Version: CC43
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jan 16 13:20:44 2021 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 113) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x1085)    SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   063   006    Pre-fail  Always       -       204347040
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   045    Pre-fail  Always       -       18956288
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1013
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   064   040    Old_age   Always       -       26 (Min/Max 23/28)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       65
194 Temperature_Celsius     0x0022   026   013   000    Old_age   Always       -       26 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   002   001   000    Old_age   Always       -       204347040
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1008h+14m+25.142s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       10051865969
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2903886859

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1013         -
# 2  Short offline       Completed without error       00%      1011         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
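
For anyone who would rather not eyeball the attribute table, here is a rough sketch that scans `smartctl` attribute output and reports any watched attribute whose raw value is not zero. The choice of attribute IDs in `WATCHED` is my own assumption about what is worth a look, and `nonzero_watched`, `THREAD_SAMPLE` and `HYPOTHETICAL_SAMPLE` are illustrative names. The last sample contains an invented CRC count purely to show what a flagged result looks like; it is not from this drive.

```python
# Attribute IDs to check; this selection is an assumption, not an official list.
WATCHED = {5, 187, 188, 197, 198, 199}


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smart_text.splitlines():
        # Ten columns: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw. The raw column may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) not in WATCHED:
            continue
        raw_first = fields[9].split()[0]
        if raw_first.isdigit() and int(raw_first) != 0:
            flagged[fields[1]] = int(raw_first)
    return flagged


# Three rows copied from the output above.
THREAD_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Invented row, for illustration only.
HYPOTHETICAL_SAMPLE = """\
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(nonzero_watched(THREAD_SAMPLE))
print(nonzero_watched(HYPOTHETICAL_SAMPLE))
```

On the rows posted here every watched raw value is zero, which agrees with the "No Errors Logged" line in the SMART error log.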

jgreco · Jan 16, 2021

Can you confirm what exact version of firmware is running on your HBA? Because if it's 20.00.07.00, then I am mostly out of easy ideas, except to say that it might be good to see if the SAS expander has a firmware update available.
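
If there are several HBAs to check, the comparison against 20.00.07.00 can be scripted. This is a sketch under assumptions: the shape of the sample boot message (`Firmware: ..., Driver: ...`) is assumed rather than taken from this thread, and `firmware_version` and `as_tuple` are hypothetical helper names.

```python
import re

EXPECTED = "20.00.07.00"  # version named in the post above

# Matches a four-part dotted version after "Firmware:"; the message layout is assumed.
FW_RE = re.compile(r"Firmware: (\d+(?:\.\d+){3})")


def firmware_version(line):
    """Extract the dotted firmware version from a log line, or None if absent."""
    match = FW_RE.search(line)
    return match.group(1) if match else None


def as_tuple(version):
    """Turn '20.00.07.00' into (20, 0, 7, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Assumed example of a driver boot message; substitute the real line from dmesg.
line = "mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd"

found = firmware_version(line)
if found is None:
    print("no firmware version found in line")
elif as_tuple(found) == as_tuple(EXPECTED):
    print(f"firmware {found} matches {EXPECTED}")
else:
    print(f"firmware {found} differs from {EXPECTED}")
```

Comparing integer tuples rather than raw strings avoids surprises if one source pads the fields differently.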

VioletDragon · Jan 16, 2021

jgreco said:
Can you confirm what exact version of firmware is running on your HBA? Because if it's 20.00.07.00, then I am mostly out of easy ideas, except to say that it might be good to see if the SAS expander has a firmware update available.

The one I recently bought is running 20.00.07.00, but I'm not sure about the other one I have. I have this issue even with no SAS expander at all: SAS forward breakout cables have the same problem. I reported this issue a while ago. The issues seem to happen when I use these Dell PERC H310 SAS controllers. I recently bought another one that is running 20.00.07.00, but I still get the same problems.

jgreco · Jan 16, 2021

Between the H200 and H310, I'm sure I have at least a dozen of each in service without issue.

These things run really hot and are really designed for front-to-rear airflow in a rackmount server. Do make sure that you have sufficient airflow flowing over the card, especially over the heatsink. The Dell cards come with solid PCI slot brackets which aren't doing anyone any favors IMHO, it may be worth testing removing the bracket to see if airflow is a problem. Servers often require some "interior design" work to make sure that things like brackets and cables are not impeding airflow, so consider both of those issues. I often get Supermicro 1U's that do not have any fans adequately blowing in the lefthand (PCIe slots) portion of the chassis, because the default configuration doesn't include them. The easy fix is to replace the fan blanks with actual fans...

VioletDragon · Jan 16, 2021

jgreco said:
Between the H200 and H310, I'm sure I have at least a dozen of each in service without issue.

These things run really hot and are really designed for front-to-rear airflow in a rackmount server. Do make sure that you have sufficient airflow flowing over the card, especially over the heatsink. The Dell cards come with solid PCI slot brackets which aren't doing anyone any favors IMHO, it may be worth testing removing the bracket to see if airflow is a problem. Servers often require some "interior design" work to make sure that things like brackets and cables are not impeding airflow, so consider both of those issues. I often get Supermicro 1U's that do not have any fans adequately blowing in the lefthand (PCIe slots) portion of the chassis, because the default configuration doesn't include them. The easy fix is to replace the fan blanks with actual fans...

I wondered whether the card might be overheating, but my 2U chassis has four 72mm fans blowing over the cards. I will add a fan to the heatsink and see if that helps. If it were heat, though, I'm pretty certain the card would either drop out or thermal throttle?

jgreco · Jan 16, 2021

No, the LSI cards are famous for starting to act flaky when too warm, and then when actually overheating, they may just start corrupting data, which makes for lots of ZFS excitement (and possibly trashed pools). The cards are 2009-2013 era cards and didn't include temperature sensors, later generations do have a temperature sensor, but still seem to go bonkers if they overheat.

VioletDragon · Jan 16, 2021

jgreco said:
No, the LSI cards are famous for starting to act flaky when too warm, and then when actually overheating, they may just start corrupting data, which makes for lots of ZFS excitement (and possibly trashed pools). The cards are 2009-2013 era cards and didn't include temperature sensors, later generations do have a temperature sensor, but still seem to go bonkers if they overheat.

OK, thanks. I will add a fan to the card later tonight, but I would have thought these four 72mm fans blowing air would be good enough. I'll give that a go and see what happens, but the other card acts up as well.

jgreco · Jan 16, 2021

Well that's quite likely sufficient, it tends to be more of a problem in tower cases. It's really the only other thing I could think of that would cause strange behaviours. Beyond that you move into the realm of things like undersized PSU's, bad connectors, etc.
