Hi,
I built my TrueNAS a couple of months ago. The pool is (temporarily) about 80% full, but the NAS is only used lightly over SMB.
I get this error about once a week:
Code:
May 9 22:23:57 truenas mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error
May 9 22:23:57 truenas (da5:mps0:0:5:0): Retrying command, 3 more tries remain
May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error
May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI status: Check Condition
May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 9 22:23:57 truenas (da5:mps0:0:5:0): Info: 0x27b77df08
May 9 22:23:57 truenas (da5:mps0:0:5:0): Error 22, Unretryable error
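To check what the failing writes were actually addressing, the two CDBs from the log can be decoded by hand. The sketch below assumes the standard WRITE(16) layout (opcode 0x8a, bytes 2-9 = LBA, bytes 10-13 = transfer length, all big-endian) and, for the byte offset only, 512-byte logical sectors, which is an assumption and not something the log states. The function name is made up for illustration.

```python
def decode_write16_cdb(cdb_hex: str) -> dict:
    """Decode a WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    return {
        "lba": int.from_bytes(cdb[2:10], "big"),      # starting logical block
        "blocks": int.from_bytes(cdb[10:14], "big"),  # transfer length in blocks
    }


# The two CDBs copied from the log above.
CDBS = [
    "8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00",
    "8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00",
]

SECTOR_BYTES = 512  # assumption: logical sector size, check with smartctl -i

for cdb in CDBS:
    info = decode_write16_cdb(cdb)
    offset_bytes = info["lba"] * SECTOR_BYTES
    print(f"LBA {info['lba']:#x}, {info['blocks']} blocks, byte offset {offset_bytes}")
```

The second CDB decodes to the same LBA that the sense data reports in its Info field (0x27b77df08). Comparing that LBA with the sector count the drive reports would show whether the address really is outside the disk or whether the "out of range" sense is spurious.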
smartctl -a /dev/da5:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   192   021    Pre-fail  Always       -       9133
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2814
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11901
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1133
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       102
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5975
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 2 occurred at disk power-on lifetime: 8216 hours (342 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 10 f8 84 20 e3 Error: UNC 16 sectors at LBA = 0x032084f8 = 52462840
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 10 f8 84 20 e3 08 1d+09:26:39.523 READ DMA
e5 00 00 00 00 00 00 08 1d+09:26:39.392 CHECK POWER MODE
ca 00 01 a0 2d 10 e0 08 1d+09:26:29.702 WRITE DMA
Error 1 occurred at disk power-on lifetime: 6070 hours (252 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 a0 2d 10 e0 Error: IDNF at LBA = 0x00102da0 = 1060256
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 01 a0 2d 10 e0 08 1d+13:04:01.653 WRITE DMA
e5 00 00 00 00 00 00 08 1d+13:04:01.518 CHECK POWER MODE
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11824 -
# 2 Short offline Completed without error 00% 11787 -
# 3 Short offline Completed without error 00% 11641 -
# 4 Short offline Completed without error 00% 11473 -
# 5 Extended offline Completed without error 00% 11463 -
# 6 Short offline Completed without error 00% 11306 -
# 7 Short offline Completed without error 00% 11181 -
# 8 Short offline Completed without error 00% 11033 -
# 9 Conveyance offline Completed without error 00% 10807 -
#10 Extended offline Completed without error 00% 10765 -
#11 Conveyance offline Completed without error 00% 10753 -
#12 Short offline Completed without error 00% 10752 -
#13 Short offline Completed without error 00% 9723 -
#14 Short offline Completed without error 00% 9599 -
#15 Short offline Completed without error 00% 9555 -
#16 Short offline Completed without error 00% 9537 -
#17 Short offline Completed without error 00% 9478 -
#18 Extended offline Completed without error 00% 9396 -
#19 Short offline Completed without error 00% 9175 -
#20 Short offline Completed without error 00% 9104 -
#21 Short offline Completed without error 00% 8897 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
zpool status tank:
Code:
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 108K in 00:00:00 with 0 errors on Sun May 9 22:24:02 2021
config:
NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/038117ff-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
    gptid/0487cea5-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
    gptid/04ae7f80-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
    gptid/04b6861b-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
    gptid/054422f3-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
    gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     1     0
errors: No known data errors
dmesg | grep mps:
Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xdfa40000-0xdfa4ffff,0xdfa00000-0xdfa3ffff irq 16 at device 0.0 on pci1
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 0 lun 0
da2 at mps0 bus 0 scbus0 target 2 lun 0
da1 at mps0 bus 0 scbus0 target 1 lun 0
da3 at mps0 bus 0 scbus0 target 3 lun 0
da4 at mps0 bus 0 scbus0 target 4 lun 0
da5 at mps0 bus 0 scbus0 target 5 lun 0
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
(da5:mps0:0:5:0): CAM status: CCB request completed with an error
(da5:mps0:0:5:0): Retrying command, 3 more tries remain
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
(da5:mps0:0:5:0): CAM status: SCSI Status Error
(da5:mps0:0:5:0): SCSI status: Check Condition
(da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da5:mps0:0:5:0): Info: 0x27b77df08
(da5:mps0:0:5:0): Error 22, Unretryable error
It is always disk5.
The READ / WRITE / CKSUM counters are always either 0 0 0 or 0 1 0, never more.
After a scrub it is fine - for a few days.
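Since the error only shows up about once a week and disappears after a scrub, it may help to log the counters automatically. A minimal sketch, assuming the plain-integer counter format shown in the `zpool status` output above (larger counts that ZFS abbreviates, such as `1.2K`, are not handled). The function name and the shortened sample text are illustrative only.

```python
def devices_with_errors(status_text: str) -> list:
    """Return (name, read, write, cksum) for every row with a nonzero counter."""
    hits = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a NAME STATE READ WRITE CKSUM row
        name, _state, *counts = fields
        if not all(c.isdigit() for c in counts):
            continue  # header row or abbreviated counters
        read, write, cksum = map(int, counts)
        if read or write or cksum:
            hits.append((name, read, write, cksum))
    return hits


# Sample rows copied from the zpool status output in this post.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
gptid/054422f3-8ceb-11eb-bb67-002590aaacc7 ONLINE 0 0 0
gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7 ONLINE 0 1 0
errors: No known data errors
"""

for name, read, write, cksum in devices_with_errors(SAMPLE):
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

On a live system the text would come from running `zpool status tank`, for example from a cron job, so that each occurrence gets a timestamp that can be matched against the mps0 messages.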
PSU and SAS controller (Dell H310) are new.
What I've tried:
- mounted a fan on the SAS controller
- checked and swapped the power and SATA connectors for disk5 (still using the same cables)
I will change power and SATA cables next.
Could a faulty SAS controller cause this kind of error?
Should I replace disk5? SMART has 2 stored errors for that disk.
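For context on those two stored errors, the register dumps in the smartctl output can be decoded. The sketch below assumes 28-bit LBA addressing (low nibble of DH, then CH, CL, SN) and the usual ATA error register bit names. It also subtracts each error's power-on lifetime from the current Power_On_Hours value of 11901 to show how long ago the error was logged. The helper names are made up for illustration.

```python
# ATA error register bits (only the commonly seen ones are listed).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def error_flags(er: int) -> list:
    """Names of the bits set in the ER (error) register."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


POWER_ON_HOURS = 11901  # attribute 9 from the smartctl output

# (error number, ER, SN, CL, CH, DH, lifetime hours) copied from the error log.
ERRORS = [
    (2, 0x40, 0xF8, 0x84, 0x20, 0xE3, 8216),
    (1, 0x10, 0xA0, 0x2D, 0x10, 0xE0, 6070),
]

for num, er, sn, cl, ch, dh, hours in ERRORS:
    age = POWER_ON_HOURS - hours
    print(f"Error {num}: {error_flags(er)} at LBA {lba28(sn, cl, ch, dh)}, "
          f"logged {age} power-on hours ago")
```

Both errors were logged thousands of power-on hours before the current reading, so they are old entries and not records of the weekly write errors seen now.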