Hi,
I built my TrueNAS a couple moinths ago, space is (temporarily) about 80% full, but the NAS is only used lighly with SMB.
I get this error about once a week:
smartctl -a /dev/da5:
zpool status tank:
dmesg | grep mps:
It is always disk5.
Read write cksum is either 0 0 0 or 0 1 0 - never more.
After a scrub it is fine - for a few days.
PSU and SAS controller (Dell H310) are new.
What I've tried:
- mounted a fan on the SAS controller
- cheked and swapped power and SATA connectors (still same cables) for disk5
I will change power and SATA cables next.
Could the SAS controller be faulty and lead to such an error?
Should I replace disk5? SMART has 2 stored errors for that disk.
	
		
			
		
		
	
			
			I built my TrueNAS a couple moinths ago, space is (temporarily) about 80% full, but the NAS is only used lighly with SMB.
I get this error about once a week:
Code:
May 9 22:23:57 truenas mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000 May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00 May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error May 9 22:23:57 truenas (da5:mps0:0:5:0): Retrying command, 3 more tries remain May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00 May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI status: Check Condition May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) May 9 22:23:57 truenas (da5:mps0:0:5:0): Info: 0x27b77df08 May 9 22:23:57 truenas (da5:mps0:0:5:0): Error 22, Unretryable error
smartctl -a /dev/da5:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   192   021    Pre-fail  Always       -       9133
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2814
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11901
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1133
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       102
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5975
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 2
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 8216 hours (342 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 f8 84 20 e3  Error: UNC 16 sectors at LBA = 0x032084f8 = 52462840
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 f8 84 20 e3 08   1d+09:26:39.523  READ DMA
  e5 00 00 00 00 00 00 08   1d+09:26:39.392  CHECK POWER MODE
  ca 00 01 a0 2d 10 e0 08   1d+09:26:29.702  WRITE DMA
Error 1 occurred at disk power-on lifetime: 6070 hours (252 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 01 a0 2d 10 e0  Error: IDNF at LBA = 0x00102da0 = 1060256
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 01 a0 2d 10 e0 08   1d+13:04:01.653  WRITE DMA
  e5 00 00 00 00 00 00 08   1d+13:04:01.518  CHECK POWER MODE
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11824         -
# 2  Short offline       Completed without error       00%     11787         -
# 3  Short offline       Completed without error       00%     11641         -
# 4  Short offline       Completed without error       00%     11473         -
# 5  Extended offline    Completed without error       00%     11463         -
# 6  Short offline       Completed without error       00%     11306         -
# 7  Short offline       Completed without error       00%     11181         -
# 8  Short offline       Completed without error       00%     11033         -
# 9  Conveyance offline  Completed without error       00%     10807         -
#10  Extended offline    Completed without error       00%     10765         -
#11  Conveyance offline  Completed without error       00%     10753         -
#12  Short offline       Completed without error       00%     10752         -
#13  Short offline       Completed without error       00%      9723         -
#14  Short offline       Completed without error       00%      9599         -
#15  Short offline       Completed without error       00%      9555         -
#16  Short offline       Completed without error       00%      9537         -
#17  Short offline       Completed without error       00%      9478         -
#18  Extended offline    Completed without error       00%      9396         -
#19  Short offline       Completed without error       00%      9175         -
#20  Short offline       Completed without error       00%      9104         -
#21  Short offline       Completed without error       00%      8897         -
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.zpool status tank:
Code:
  pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 108K in 00:00:00 with 0 errors on Sun May  9 22:24:02 2021
config:
    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/038117ff-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/0487cea5-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/04ae7f80-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/04b6861b-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/054422f3-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
        gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     1     0
errors: No known data errorsdmesg | grep mps:
Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xdfa40000-0xdfa4ffff,0xdfa00000-0xdfa3ffff irq 16 at device 0.0 on pci1 mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc> da0 at mps0 bus 0 scbus0 target 0 lun 0 da2 at mps0 bus 0 scbus0 target 2 lun 0 da1 at mps0 bus 0 scbus0 target 1 lun 0 da3 at mps0 bus 0 scbus0 target 3 lun 0 da4 at mps0 bus 0 scbus0 target 4 lun 0 da5 at mps0 bus 0 scbus0 target 5 lun 0 mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000 (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00 (da5:mps0:0:5:0): CAM status: CCB request completed with an error (da5:mps0:0:5:0): Retrying command, 3 more tries remain (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00 (da5:mps0:0:5:0): CAM status: SCSI Status Error (da5:mps0:0:5:0): SCSI status: Check Condition (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range) (da5:mps0:0:5:0): Info: 0x27b77df08 (da5:mps0:0:5:0): Error 22, Unretryable error
It is always disk5.
Read write cksum is either 0 0 0 or 0 1 0 - never more.
After a scrub it is fine - for a few days.
PSU and SAS controller (Dell H310) are new.
What I've tried:
- mounted a fan on the SAS controller
- cheked and swapped power and SATA connectors (still same cables) for disk5
I will change power and SATA cables next.
Could the SAS controller be faulty and lead to such an error?
Should I replace disk5? SMART has 2 stored errors for that disk.