Cam status: SCSI Status Error

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
Hi,

I built my TrueNAS a couple moinths ago, space is (temporarily) about 80% full, but the NAS is only used lighly with SMB.

I get this error about once a week:
Code:
May  9 22:23:57 truenas mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
May  9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
May  9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error
May  9 22:23:57 truenas (da5:mps0:0:5:0): Retrying command, 3 more tries remain
May  9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
May  9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error
May  9 22:23:57 truenas (da5:mps0:0:5:0): SCSI status: Check Condition
May  9 22:23:57 truenas (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May  9 22:23:57 truenas (da5:mps0:0:5:0): Info: 0x27b77df08
May  9 22:23:57 truenas (da5:mps0:0:5:0): Error 22, Unretryable error


smartctl -a /dev/da5:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   192   021    Pre-fail  Always       -       9133
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2814
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11901
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1133
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       102
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5975
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 8216 hours (342 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 f8 84 20 e3  Error: UNC 16 sectors at LBA = 0x032084f8 = 52462840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 f8 84 20 e3 08   1d+09:26:39.523  READ DMA
  e5 00 00 00 00 00 00 08   1d+09:26:39.392  CHECK POWER MODE
  ca 00 01 a0 2d 10 e0 08   1d+09:26:29.702  WRITE DMA

Error 1 occurred at disk power-on lifetime: 6070 hours (252 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 01 a0 2d 10 e0  Error: IDNF at LBA = 0x00102da0 = 1060256

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 01 a0 2d 10 e0 08   1d+13:04:01.653  WRITE DMA
  e5 00 00 00 00 00 00 08   1d+13:04:01.518  CHECK POWER MODE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11824         -
# 2  Short offline       Completed without error       00%     11787         -
# 3  Short offline       Completed without error       00%     11641         -
# 4  Short offline       Completed without error       00%     11473         -
# 5  Extended offline    Completed without error       00%     11463         -
# 6  Short offline       Completed without error       00%     11306         -
# 7  Short offline       Completed without error       00%     11181         -
# 8  Short offline       Completed without error       00%     11033         -
# 9  Conveyance offline  Completed without error       00%     10807         -
#10  Extended offline    Completed without error       00%     10765         -
#11  Conveyance offline  Completed without error       00%     10753         -
#12  Short offline       Completed without error       00%     10752         -
#13  Short offline       Completed without error       00%      9723         -
#14  Short offline       Completed without error       00%      9599         -
#15  Short offline       Completed without error       00%      9555         -
#16  Short offline       Completed without error       00%      9537         -
#17  Short offline       Completed without error       00%      9478         -
#18  Extended offline    Completed without error       00%      9396         -
#19  Short offline       Completed without error       00%      9175         -
#20  Short offline       Completed without error       00%      9104         -
#21  Short offline       Completed without error       00%      8897         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


zpool status tank:
Code:
  pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 108K in 00:00:00 with 0 errors on Sun May  9 22:24:02 2021
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/038117ff-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/0487cea5-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/04ae7f80-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/04b6861b-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/054422f3-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     1     0

errors: No known data errors


dmesg | grep mps:
Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xdfa40000-0xdfa4ffff,0xdfa00000-0xdfa3ffff irq 16 at device 0.0 on pci1
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 0 lun 0
da2 at mps0 bus 0 scbus0 target 2 lun 0
da1 at mps0 bus 0 scbus0 target 1 lun 0
da3 at mps0 bus 0 scbus0 target 3 lun 0
da4 at mps0 bus 0 scbus0 target 4 lun 0
da5 at mps0 bus 0 scbus0 target 5 lun 0
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
(da5:mps0:0:5:0): CAM status: CCB request completed with an error
(da5:mps0:0:5:0): Retrying command, 3 more tries remain
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
(da5:mps0:0:5:0): CAM status: SCSI Status Error
(da5:mps0:0:5:0): SCSI status: Check Condition
(da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da5:mps0:0:5:0): Info: 0x27b77df08
(da5:mps0:0:5:0): Error 22, Unretryable error



It is always disk5.
Read write cksum is either 0 0 0 or 0 1 0 - never more.
After a scrub it is fine - for a few days.

PSU and SAS controller (Dell H310) are new.

What I've tried:
- mounted a fan on the SAS controller
- cheked and swapped power and SATA connectors (still same cables) for disk5

I will change power and SATA cables next.

Could the SAS controller be faulty and lead to such an error?
Should I replace disk5? SMART has 2 stored errors for that disk.
 

PDM

Dabbler
Joined
Dec 17, 2011
Messages
24
I hope its just the SATA cable!
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
After replacing the SAS/SATA cable, the machine ran smooth for 12 days.
But now the error is back!

Code:
 May 21 00:18:28 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 01 c8 e8 08 00 00 00 08 00 00               
May 21 00:18:28 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error               
May 21 00:18:28 truenas (da5:mps0:0:5:0): SCSI status: Check Condition               
May 21 00:18:28 truenas (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)               
May 21 00:18:28 truenas (da5:mps0:0:5:0): Info: 0x101c8e808               
May 21 00:18:28 truenas (da5:mps0:0:5:0): Error 22, Unretryable error                


SMART results are good. zpool status has 1 Write error, resilvered 84K.

I just realized the following:
The last few times the error / command was very similar:
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 01 c8 e8 08 00 00 00 08 00 00
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 61 9e b0 00 00 00 08 00 00

Could it be a defect sector on the drive?
Can I leave the disk installed as long as it's just those single write errors? Or do you recommend replacing that drive?

Thanks
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Can I leave the disk installed as long as it's just those single write errors? Or do you recommend replacing that drive?
Which model drives are you using ?
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
I have an almost identical system as yours and it appears to be a HBA driver issue. I also had mine show that da5, even though another disk was swapped for da5. This looks to be the bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496
Thanks for the note!
There are some workarounds mentioned in the bug report for "slow" disks. Increasing the CAM timeout, adding L2ARC, decreasing memory clock... I'll try some of these and then report back.
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17

Gamji

Cadet
Joined
Jun 13, 2021
Messages
3
Thanks for the note!
There are some workarounds mentioned in the bug report for "slow" disks. Increasing the CAM timeout, adding L2ARC, decreasing memory clock... I'll try some of these and then report back.
I am going to jinx this, but I replaced the mini-SAS cable and I have not seen this error post in days! You might want to try that first before those changes in firmware.
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
I have replaced the SAS / SATA cable from disk da5 already, but I'll change the other one as well. Maybe it's not related to da5, like in your case.
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
@Gamji I recently swapped the other SAS ? SATA cable, but unfortunately the error came back. Same error, same disk.
Is yours still running fine?

I'll try those workarounds next...
 

Gamji

Cadet
Joined
Jun 13, 2021
Messages
3
Oh man, sorry to hear that. I still have not seen this issue since. Grasping at straws here...but one other thing is I'm currently on TrueNAS-12.0-U4. Are you still running TrueNAS-12.0-U2.1?
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
Yeah, I'm still on TrueNAS-12.0-U2.1. I was just checking the TrueNAS changelog now.
I'm currently running a disk burn-in with badblocks, but I'll update to TrueNAS-12.0-U4.1 as soon as that's finished. Maybe it makes a difference. Hopefully...
Thanks
 

DanP.

Dabbler
Joined
Apr 2, 2019
Messages
17
In case someone else has a similar problem: I got it (mostly) fixed!

I use TrueNAS for my MacOS time machine backups. I accidentally also enabled snapshots for the backup file. In my case that's a 1,4TB file. I guess TrueNAS got confused making an incremental snapshot of the incremental time machine backup file :)
As soon as I deactivated the snapshots for that folder, the errors disappeared (almost completely). Instead of getting the error every other week, it now happens only once every three months.
 
Top