Hi,
I built my TrueNAS a couple of months ago. Space is (temporarily) about 80% full, but the NAS is only used lightly with SMB.
I get this error about once a week:
Code:
May 9 22:23:57 truenas mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: CCB request completed with an error
May 9 22:23:57 truenas (da5:mps0:0:5:0): Retrying command, 3 more tries remain
May 9 22:23:57 truenas (da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
May 9 22:23:57 truenas (da5:mps0:0:5:0): CAM status: SCSI Status Error
May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI status: Check Condition
May 9 22:23:57 truenas (da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
May 9 22:23:57 truenas (da5:mps0:0:5:0): Info: 0x27b77df08
May 9 22:23:57 truenas (da5:mps0:0:5:0): Error 22, Unretryable error
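To see which blocks the two failing writes targeted, the CDB bytes from the log can be decoded. This is a minimal sketch that assumes the standard WRITE(16) layout (opcode 0x8A, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13). The function name `decode_write16` is only illustrative.

```python
def decode_write16(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    return {
        # Bytes 2..9: logical block address, big-endian
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: number of blocks to transfer, big-endian
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# The two CDBs copied from the log above
first = decode_write16("8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00")
second = decode_write16("8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00")

print(hex(first["lba"]), first["blocks"])    # 0x268667d98 16
print(hex(second["lba"]), second["blocks"])  # 0x27b77df08 8
```

The LBA of the second command matches the `Info: 0x27b77df08` line, so the "Logical block address out of range" sense data refers to that retried 8-block write.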
smartctl -a /dev/da5:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   192   021    Pre-fail  Always       -       9133
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2814
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11901
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1133
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       102
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       5975
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 8216 hours (342 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 f8 84 20 e3  Error: UNC 16 sectors at LBA = 0x032084f8 = 52462840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 f8 84 20 e3 08   1d+09:26:39.523  READ DMA
  e5 00 00 00 00 00 00 08   1d+09:26:39.392  CHECK POWER MODE
  ca 00 01 a0 2d 10 e0 08   1d+09:26:29.702  WRITE DMA

Error 1 occurred at disk power-on lifetime: 6070 hours (252 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 01 a0 2d 10 e0  Error: IDNF at LBA = 0x00102da0 = 1060256

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 01 a0 2d 10 e0 08   1d+13:04:01.653  WRITE DMA
  e5 00 00 00 00 00 00 08   1d+13:04:01.518  CHECK POWER MODE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11824         -
# 2  Short offline       Completed without error       00%     11787         -
# 3  Short offline       Completed without error       00%     11641         -
# 4  Short offline       Completed without error       00%     11473         -
# 5  Extended offline    Completed without error       00%     11463         -
# 6  Short offline       Completed without error       00%     11306         -
# 7  Short offline       Completed without error       00%     11181         -
# 8  Short offline       Completed without error       00%     11033         -
# 9  Conveyance offline  Completed without error       00%     10807         -
#10  Extended offline    Completed without error       00%     10765         -
#11  Conveyance offline  Completed without error       00%     10753         -
#12  Short offline       Completed without error       00%     10752         -
#13  Short offline       Completed without error       00%      9723         -
#14  Short offline       Completed without error       00%      9599         -
#15  Short offline       Completed without error       00%      9555         -
#16  Short offline       Completed without error       00%      9537         -
#17  Short offline       Completed without error       00%      9478         -
#18  Extended offline    Completed without error       00%      9396         -
#19  Short offline       Completed without error       00%      9175         -
#20  Short offline       Completed without error       00%      9104         -
#21  Short offline       Completed without error       00%      8897         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
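The register dumps in the SMART error log can be cross-checked against the LBAs smartctl prints. This is a small sketch, assuming 28-bit LBA addressing, where the low nibble of DH holds LBA bits 24-27 and CH, CL, SN hold the remaining bytes. The helper names and the small error-bit table are illustrative, not part of smartctl.

```python
# Bits of the ATA error register (ER) relevant to the two logged errors
ERROR_BITS = {0x40: "UNC", 0x10: "IDNF"}


def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH task-file registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def error_names(er: int) -> list:
    """Return the names of the known error bits set in ER."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


# Error 2: ER=40, SN CL CH DH = f8 84 20 e3
lba_unc = lba28_from_registers(0xF8, 0x84, 0x20, 0xE3)
# Error 1: ER=10, SN CL CH DH = a0 2d 10 e0
lba_idnf = lba28_from_registers(0xA0, 0x2D, 0x10, 0xE0)

print(error_names(0x40), lba_unc)   # ['UNC'] 52462840
print(error_names(0x10), lba_idnf)  # ['IDNF'] 1060256
```

Both values agree with the `LBA =` figures in the log. Both addresses also fit in 28 bits, far below the LBAs of the failing WRITE(16) commands in the mps log.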
zpool status tank:
Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 108K in 00:00:00 with 0 errors on Sun May 9 22:24:02 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/038117ff-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
            gptid/0487cea5-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
            gptid/04ae7f80-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
            gptid/04b6861b-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
            gptid/054422f3-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     0     0
            gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7  ONLINE       0     1     0

errors: No known data errors
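Because the counter only shows up about once a week, it may help to check for it automatically. This is a rough sketch that scans `zpool status` text for devices with non-zero READ/WRITE/CKSUM counters. It assumes one device per line, as the command normally prints, and plain integer counters (abbreviated values such as `1.2K` are not handled). The function name and the embedded sample are illustrative.

```python
def devices_with_errors(status_text: str) -> list:
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    hits = []
    for line in status_text.splitlines():
        parts = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM
        if len(parts) == 5 and all(p.isdigit() for p in parts[2:]):
            name, state = parts[0], parts[1]
            read, write, cksum = (int(p) for p in parts[2:])
            if read or write or cksum:
                hits.append((name, state, read, write, cksum))
    return hits


# Two device rows taken from the pool status in this post
sample = """
    gptid/054422f3-8ceb-11eb-bb67-002590aaacc7  ONLINE  0  0  0
    gptid/055f51ed-8ceb-11eb-bb67-002590aaacc7  ONLINE  0  1  0
"""

for row in devices_with_errors(sample):
    print(row)
```

On a live system the text would come from running `zpool status tank` instead of the embedded sample.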
dmesg | grep mps:
Code:
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xdfa40000-0xdfa4ffff,0xdfa00000-0xdfa3ffff irq 16 at device 0.0 on pci1
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 0 lun 0
da2 at mps0 bus 0 scbus0 target 2 lun 0
da1 at mps0 bus 0 scbus0 target 1 lun 0
da3 at mps0 bus 0 scbus0 target 3 lun 0
da4 at mps0 bus 0 scbus0 target 4 lun 0
da5 at mps0 bus 0 scbus0 target 5 lun 0
mps0: Controller reported scsi ioc terminated tgt 5 SMID 1143 loginfo 31080000
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 68 66 7d 98 00 00 00 10 00 00
(da5:mps0:0:5:0): CAM status: CCB request completed with an error
(da5:mps0:0:5:0): Retrying command, 3 more tries remain
(da5:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 02 7b 77 df 08 00 00 00 08 00 00
(da5:mps0:0:5:0): CAM status: SCSI Status Error
(da5:mps0:0:5:0): SCSI status: Check Condition
(da5:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da5:mps0:0:5:0): Info: 0x27b77df08
(da5:mps0:0:5:0): Error 22, Unretryable error
It is always disk5.
The READ/WRITE/CKSUM counters are always either 0 0 0 or 0 1 0 - never more.
After a scrub it is fine - for a few days.
PSU and SAS controller (Dell H310) are new.
What I've tried:
- mounted a fan on the SAS controller
- checked and swapped power and SATA connectors (still same cables) for disk5
I will change power and SATA cables next.
Could the SAS controller be faulty and lead to such an error?
Should I replace disk5? SMART has 2 stored errors for that disk.
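One data point for the disk question is how old the two stored SMART errors are compared with the NAS itself. This is a back-of-the-envelope sketch using the Power_On_Hours value and the error timestamps from the smartctl output. It assumes the drive's hour counter is accurate and that "a couple of months" means roughly 60 days of continuous uptime; that figure is a guess, not a measured value.

```python
POWER_ON_HOURS_NOW = 11901         # attribute 9 raw value from smartctl
ERROR_HOURS = {2: 8216, 1: 6070}   # power-on lifetime of each logged error

# ASSUMPTION: NAS built about 60 days ago and running continuously since
BUILD_AGE_HOURS_ASSUMED = 60 * 24

# How many power-on hours ago each error was logged
hours_since_error = {
    num: POWER_ON_HOURS_NOW - at for num, at in ERROR_HOURS.items()
}

for num, hours in hours_since_error.items():
    predates_build = hours > BUILD_AGE_HOURS_ASSUMED
    print(f"Error {num}: {hours} power-on hours ago, "
          f"predates assumed build: {predates_build}")
```

Under these assumptions both errors were logged 3685 and 5831 power-on hours ago, well before the NAS was built, so they would not be the same events as the weekly write errors.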