Scary messages and behavior about a disk

Glorious1 · Nov 7, 2020

I noticed recently in /var/log/messages a strange series of entries about my disk da0. It looks like there are some errors maybe, then the pool is taken apart and rebuilt.

Code:

Nov  6 19:42:28 Tabernacle     (da0:mps0:0:0:0): READ(10). CDB: 28 00 09 73 ac 18 00 00 58 00 length 45056 SMID 390 Aborting command 0xfffffe0001254fe0
Nov  6 19:42:28 Tabernacle mps0: Sending reset from mpssas_send_abort for target ID 0
Nov  6 19:42:28 Tabernacle     (da1:mps0:0:1:0): READ(10). CDB: 28 00 09 73 ac 18 00 00 58 00 length 45056 SMID 272 Aborting command 0xfffffe000124b500
Nov  6 19:42:28 Tabernacle mps0: Sending reset from mpssas_send_abort for target ID 1
Nov  6 19:42:28 Tabernacle     (da2:mps0:0:3:0): READ(10). CDB: 28 00 09 73 ac 18 00 00 50 00 length 40960 SMID 877 Aborting command 0xfffffe000127cf10
Nov  6 19:42:28 Tabernacle mps0: Sending reset from mpssas_send_abort for target ID 3
Nov  6 19:42:29 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:29 Tabernacle (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 4e f5 d6 e8 00 00 00 08 00 00
Nov  6 19:42:29 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:29 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:29 Tabernacle     (da3:mps0:0:4:0): READ(10). CDB: 28 00 09 73 ac c8 00 00 58 00 length 45056 SMID 169 Aborting command 0xfffffe0001242dd0
Nov  6 19:42:29 Tabernacle mps0: Sending reset from mpssas_send_abort for target ID 4
Nov  6 19:42:30 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 4e f5 d6 e8 00 00 00 08 00 00
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:30 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 4e f5 d6 e8 00 00 00 08 00 00
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:30 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:31 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 4e f5 d6 e8 00 00 00 08 00 00
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:31 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 4e f5 d6 e8 00 00 00 08 00 00
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): Error 5, Retries exhausted
Nov  6 19:42:31 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:31 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:32 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:32 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:32 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:33 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:33 Tabernacle (da0:mps0:0:0:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Nov  6 19:42:33 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:33 Tabernacle (da0:mps0:0:0:0): Retrying command
Nov  6 19:42:35 Tabernacle mps0: mpssas_action_scsiio: Freezing devq for target ID 0
Nov  6 19:42:35 Tabernacle     (xpt0:mps0:0:3:0): SMID 15 task mgmt 0xfffffe00012363b0 timed out
Nov  6 19:42:35 Tabernacle mps0: Reinitializing controller,
Nov  6 19:42:35 Tabernacle (da0:mps0:0:0:0): READ(10). CDB: 28 00 00 40 02 90 00 00 10 00
Nov  6 19:42:35 Tabernacle (da0:mps0:0:0:0): CAM status: CAM subsystem is busy
Nov  6 19:42:35 Tabernacle (da0:mps0:0:0:0): Error 5, Retries exhausted
Nov  6 19:42:35 Tabernacle mps0: Unfreezing devq for target ID 0
Nov  6 19:42:35 Tabernacle mps0: Unfreezing devq for target ID 1
Nov  6 19:42:35 Tabernacle mps0: Unfreezing devq for target ID 3
Nov  6 19:42:35 Tabernacle mps0: Unfreezing devq for target ID 4
Nov  6 19:42:35 Tabernacle mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Nov  6 19:42:35 Tabernacle mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Nov  6 19:42:35 Tabernacle mps0: mps_reinit finished sc 0xfffffe0001081000 post 4 free 3
Nov  6 19:42:35 Tabernacle (da0:mps0:0:0:0): Invalidating pack
Nov  6 19:42:35 Tabernacle da0 at mps0 bus 0 scbus0 target 0 lun 0
Nov  6 19:42:35 Tabernacle da0: <ATA WDC WD30EZRX-00S 0A80> s/n WD-WCC4E0749572 detached
Nov  6 19:42:36 Tabernacle mps0: SAS Address for SATA device = 495c3f45dcbcb26f
Nov  6 19:42:36 Tabernacle mps0: SAS Address from SATA device = 495c3f45dcbcb26f
Nov  6 19:42:36 Tabernacle mps0: SAS Address for SATA device = d2696859d9cecf80
Nov  6 19:42:36 Tabernacle mps0: SAS Address from SATA device = d2696859d9cecf80
Nov  6 19:42:37 Tabernacle mps0: SAS Address for SATA device = 495c3e47d6beb471
Nov  6 19:42:37 Tabernacle mps0: SAS Address from SATA device = 495c3e47d6beb471
Nov  6 19:42:37 Tabernacle mps0: SAS Address for SATA device = 495c3e45d8baae6f
Nov  6 19:42:37 Tabernacle mps0: SAS Address from SATA device = 495c3e45d8baae6f
Nov  6 19:42:43 Tabernacle GEOM_MIRROR: Device swap1: provider da0p1 disconnected.
Nov  6 19:42:43 Tabernacle ZFS: vdev state changed, pool_guid=4269423227442216231 vdev_guid=134269995022058455
Nov  6 19:42:43 Tabernacle ZFS: vdev is removed, pool_guid=4269423227442216231 vdev_guid=134269995022058455
Nov  6 19:42:43 Tabernacle (da0:mps0:0:0:0): Periph destroyed
Nov  6 19:42:43 Tabernacle da0 at mps0 bus 0 scbus0 target 0 lun 0
Nov  6 19:42:43 Tabernacle da0: <ATA WDC WD30EZRX-00S 0A80> Fixed Direct Access SPC-4 SCSI device
Nov  6 19:42:43 Tabernacle da0: Serial Number WD-WCC4E0749572
Nov  6 19:42:43 Tabernacle da0: 600.000MB/s transfers
Nov  6 19:42:43 Tabernacle da0: Command Queueing enabled
Nov  6 19:42:43 Tabernacle da0: 2861588MB (5860533168 512 byte sectors)
Nov  6 19:42:43 Tabernacle da0: quirks=0x8<4K>
Nov  6 19:42:43 Tabernacle ZFS: vdev state changed, pool_guid=4269423227442216231 vdev_guid=134269995022058455
Nov  6 19:42:53 Tabernacle GEOM_ELI: Device mirror/swap1.eli destroyed.
Nov  6 19:42:53 Tabernacle GEOM_MIRROR: Device swap1: provider destroyed.
Nov  6 19:42:53 Tabernacle GEOM_MIRROR: Device swap1 destroyed.
Nov  6 19:42:54 Tabernacle GEOM_MIRROR: Device mirror/swap1 launched (2/2).
Nov  6 19:42:55 Tabernacle GEOM_ELI: Device mirror/swap1.eli created.
Nov  6 19:42:55 Tabernacle GEOM_ELI: Encryption: AES-XTS 128
Nov  6 19:42:55 Tabernacle GEOM_ELI:     Crypto: hardware

zpool status reports all is fine, but the disk is labeled differently than the others:

Code:

    NAME                                            STATE     READ WRITE CKSUM
    Ark                                             ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        da0p2                                       ONLINE       0     0     0
        gptid/ecf00bc5-d70b-11e7-b968-0cc47a319aec  ONLINE       0     0     0
        gptid/b4f85047-89d1-11e4-8e67-0cc47a319aec  ONLINE       0     0     0
        gptid/b5ec52cc-89d1-11e4-8e67-0cc47a319aec  ONLINE       0     0     0

errors: No known data errors

I can't access it with smartctl as da0 or da0p2, it reports it doesn't exist.

It shows in camcontrol devlist, but look how the labels are reversed!

Code:

Tabernacle:/var/log$ sudo camcontrol devlist
Password:
<ATA WDC WD30EZRX-00S 0A80>        at scbus0 target 0 lun 0 (da0,pass0)
<ATA WDC WD30EZRX-00S 0A80>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD30EZRX-00S 0A80>        at scbus0 target 3 lun 0 (pass2,da2)
<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 4 lun 0 (pass3,da3)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus2 target 0 lun 0 (pass4,ada0)
<TS32GSSD370S P1225CH1>            at scbus4 target 0 lun 0 (pass5,ada1)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (pass6,ada2)
<WDC WD30EFRX-68EUZN0 82.00A82>    at scbus6 target 0 lun 0 (pass7,ada3)

What does this mean? Is something wrong with the disk? Or the HBA (LSI MegaRAID 9240-8i 8-port PCI-E 6Gb RAID Controller IBM M1015 46M0861)? I think the da# disks are on the HBA.

Glorious1 · Nov 8, 2020

Correction about smartctl - I was running the command without sudo (su-duh). It shows no errors in scheduled smart tests. I ran a short one and no errors. I started a long one but it will take 7-8 hours. The other data look like this - doesn't seem remarkable to me:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       8308
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3825
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48828
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       423
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       231
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       21075
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

And by the way it's a WD Green

Code:

Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00SPEB0
Serial Number:    WD-WCC4E0749572
LU WWN Device Id: 5 0014ee 2b4495161
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

Glorious1 · Nov 10, 2020

A long test was interrupted by a reset. I've read that can be just a cable needs resetting. So I shut down, reseated. When it rebooted I was hoping the weird label issues noted above would be resolved, but no. Restarted loooong smart test.

This time it finished without error. So I guess the drive is still OK, just the cable reseating issue?

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   173   021    Pre-fail  Always       -       8266
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3826
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48869
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       424
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       232
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       21075
194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     48857         -
# 2  Extended offline    Interrupted (host reset)      10%     48836         -
# 3  Short offline       Completed without error       00%     48828         -
# 4  Extended offline    Completed without error       00%     48691         -
# 5  Short offline       Completed without error       00%     48440         -

Strangely, the labels got switched back correctly in camcontrol devlist, but zpool status is still showing da0p2 instead of the gptid. It also shows a little resilvering:

Code:

Tabernacle:~$ zpool status
  pool: Ark
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: resilvered 20K in 0 days 00:00:01 with 0 errors on Fri Nov  6 19:42:49 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    Ark                                             ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        da0p2                                       ONLINE       0     0     0
        gptid/ecf00bc5-d70b-11e7-b968-0cc47a319aec  ONLINE       0     0     0
        gptid/b4f85047-89d1-11e4-8e67-0cc47a319aec  ONLINE       0     0     0
        gptid/b5ec52cc-89d1-11e4-8e67-0cc47a319aec  ONLINE       0     0     0

errors: No known data errors

Important Announcement for the TrueNAS Community.

Scary messages and behavior about a disk

Glorious1

Guru

Glorious1

Guru

Glorious1

Guru

Similar threads

Important Announcement for the TrueNAS Community.

Scary messages and behavior about a disk

Glorious1

Guru

Glorious1

Guru

Glorious1

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Scary messages and behavior about a disk"

Similar threads