SOLVED Various SCSI sense errors during scrubbing

jgreco · Jun 23, 2017

tobiasbp said:
This makes me think I really should be OK, when running 24 drives on brand new redundant 1400W PSUs.

Yeah, if you look at the Supermicro 24 drive systems, they typically offer a chassis option with a smaller set of redundant PSUs (think 920W currently) and then a larger one. This roughly corresponds to "single CPU and not so complex" and "dual CPU with extra cards".

The PSU sizing guidance is more for people who don't really have the engineering background or resources to understand where these "big" numbers come from, and who've used some craptacular Internet website calculator for PSU sizing that only provides a handful of watts per drive. There were a number of insipid debates by people who were convinced that undersizing your PSU to save a few bucks and putting thousands of dollars of electronics and drives at risk is a "good thing" or somehow "energy saving". I don't actually care if you want to do that, but, damn it, I want people to make educated choices, not just listen to some handwavey stuff, so that's why I banged out all the reasoning and how the numbers were derived.

Other large system manufacturers like 45Drives also talk about power design, and I encourage people to read that too.

Anyways, the rules can be relaxed as the drive count and PSU size increase, but it is kind of complex and more than a little specialized as an area of knowledge. For the most part, your chassis manufacturer will probably ballpark you into the right place if you let them, and if you're running the 1.4k's on a 24 drive dual CPU system, you should have plenty of power.

It isn't clear to me what's wrong here, but you seem to be taking reasonable steps to isolate and resolve. I expect you'll find an issue sooner or later.

tobiasbp · Jun 23, 2017

jgreco said:
Yeah, if you look at the Supermicro 24 drive systems, they typically offer a chassis option with a smaller set of redundant PSUs (think 920W currently) and then a larger one. This roughly corresponds to "single CPU and not so complex" and "dual CPU with extra cards".

Great, so I'm not underpowered (if the PSUs do what they are supposed to do; they are brand new).

jgreco said:
It isn't clear to me what's wrong here, but you seem to be taking reasonable steps to isolate and resolve. I expect you'll find an issue sooner or later.

I get SCSI errors in the log every time I scrub my pool. People have suggested bad PSUs as the most likely culprit, so I replaced both PSUs, but that did not resolve the problem.

I have previously been able to scrub the pool without errors after degrading all 12 mirrors (so only 12 disks were connected, down from 24). That would point to a power issue, but since I replaced both PSUs, and I'm not underpowered, I'm not sure what to think.

NZ_JJ · Jun 25, 2017

May still be a power issue. Could be "dirty" power from the wall; try either moving the server to an alternative circuit in the building or using an online UPS that filters the incoming power.
Also, may be degraded power cables internally - did you change these when you replaced the PSUs?

tobiasbp · Jul 3, 2017

NZ_JJ said:
May still be a power issue. Could be "dirty" power from the wall; try either moving the server to an alternative circuit in the building or using an online UPS that filters the incoming power.

I will run the machine off a UPS and see what happens.

NZ_JJ said:
Also, may be degraded power cables internally - did you change these when you replaced the PSUs?

No, I did not. The only cables I have replaced are the SAS ones (SFF-8087 connectors).

tobiasbp · Jul 3, 2017

I have moved 12 disks to the smaller backplane. I now have 12 disks on each backplane for a total of 24 disks connected. Each backplane is directly connected to an IBM ServeRAID M1015.

I have started a scrub of the pool.

tobiasbp · Jul 4, 2017

Running with 12 disks on each backplane did not resolve the issue. I still get SCSI errors, and scrubbing still repairs data in the pool. This time, however, none of the error counters were incremented, even though data was repaired, as seen below. Does that make sense? I would expect the READ counter to be incremented for the disks where SCSI errors occurred.

Code:

  pool: ultraman
 state: ONLINE
  scan: scrub repaired 512K in 14h58m with 0 errors on Tue Jul  4 02:46:15 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors

Splitting the pool across the two backplanes (each directly connected to the HBA) reduced the scrub time from 22 hours to 16 hours.

The log shows that 5 SCSI errors occurred during the scrub:

Code:

Jul  3 19:24:05 ultraman	 (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 0a 20 00 01 00 00 length 131072 SMID 951 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 0a 20 00 01 00 00
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 09 20 00 01 00 00
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Info: 0xd02e0920
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  3 20:55:50 ultraman	 (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d8 68 00 00 00 b0 00 00 length 90112 SMID 157 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d8 68 00 00 00 b0 00 00
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d7 b8 00 00 00 b0 00 00
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Info: 0x11827d7b8
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  3 22:04:53 ultraman	 (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 b0 60 00 00 c8 00 length 102400 SMID 970 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 b0 60 00 00 c8 00
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): CAM status: CCB request completed with an error
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Retrying command
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 af 98 00 00 c8 00
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): CAM status: SCSI Status Error
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): SCSI status: Check Condition
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Info: 0xfc29af98
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
Jul  3 22:19:33 ultraman	 (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 2a 90 00 00 00 c8 00 00 length 102400 SMID 67 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 2a 90 00 00 00 c8 00 00
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 29 c8 00 00 00 c8 00 00
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Info: 0x1383629c8
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  4 00:35:53 ultraman		(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 54 38 00 00 00 c8 00 00 length 102400 SMID 679 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 54 38 00 00 00 c8 00 00
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): CAM status: CCB request completed with an error
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Retrying command
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 53 78 00 00 00 c0 00 00
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): CAM status: SCSI Status Error
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): SCSI status: Check Condition
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Info: 0x12d9d5378
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
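
For anyone who wants to tally these per disk instead of counting by eye, here is a minimal Python sketch. It assumes each failed request ends with an "Error N, Unretryable error" line in the format shown above; the function name and the regular expression are illustrative and not part of any FreeBSD tooling.

```python
import re
from collections import Counter

# Matches the last line that CAM logs for each failed request, e.g.
# "(da6:mps0:0:17:0): Error 5, Unretryable error"
ERROR_LINE = re.compile(
    r"\((da\d+):mps\d+:\d+:\d+:\d+\): Error \d+, Unretryable error"
)

def count_errors_per_device(log_text: str) -> Counter:
    """Return a Counter mapping device name (e.g. 'da6') to failed requests."""
    return Counter(m.group(1) for m in ERROR_LINE.finditer(log_text))

# The five terminal error lines from the log above.
sample = """\
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
"""

counts = count_errors_per_device(sample)
print(counts.most_common())  # [('da6', 3), ('da0', 2)]
```

Run against the full messages file, this gives a per-disk count for each scrub, which makes it easier to see whether the errors follow particular disks or move around.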

SMART data for disk da6 (3 of 5 errors occurred on this disk):

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   202   202   021	Pre-fail  Always	   -	   8875
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   13
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   1174
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   13
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   11
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   43
194 Temperature_Celsius	 0x0022   115   106   000	Old_age   Always	   -	   37
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART data for disk da0 (2 of 5 errors occurred on this disk):

Code:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   202   202   021	Pre-fail  Always	   -	   8858
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   13
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   1102
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   13
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   11
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   38
194 Temperature_Celsius	 0x0022   116   107   000	Old_age   Always	   -	   36
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
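
To check many disks quickly, the attribute table can be scanned for non-zero raw values. This is a rough Python sketch that assumes the ten-column layout shown above; the set of attribute IDs to watch (5, 197, 198, 199) is an assumption about which counters matter most for this kind of fault, not an official list.

```python
# Attribute IDs assumed to be worth watching: reallocated sectors,
# pending sectors, offline uncorrectable sectors, and CRC (link) errors.
WATCHED_IDS = {5, 197, 198, 199}

def nonzero_watched_attributes(smart_table: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes that are non-zero."""
    flagged = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged

# Rows copied from the da6 table above.
da6_table = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched_attributes(da6_table))  # {}
```

An empty result matches what the tables show: neither disk reports reallocated or pending sectors, or CRC errors.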

Both disks (da0 & da6) are on the 24-slot backplane. I have started SMART long tests on both disks (EDIT: They both passed).

tobiasbp · Jul 4, 2017

I started a new scrub of the pool. Scrubbing aborted with the zpool having the following status:

Code:

  pool: ultraman
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 132K in 0h0m with 0 errors on Tue Jul  4 14:00:59 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors

The log shows a SCSI write error:

Code:

(da13:mps0:0:24:0): WRITE(10). CDB: 2a 00 01 0a 5a f0 00 00 08 00
(da13:mps0:0:24:0): CAM status: SCSI Status Error
(da13:mps0:0:24:0): SCSI status: Check Condition
(da13:mps0:0:24:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da13:mps0:0:24:0): Info: 0x10a5af0
(da13:mps0:0:24:0): Error 22, Unretryable error

The disk with the error is in the 24-slot backplane. The errors in the previous scrub were also on disks in the 24-slot backplane.

Starting a new scrub now. Will run the machine off a UPS tomorrow.

tobiasbp · Jul 5, 2017

Errors on 7 different disks during the last scrub. 1 disk is in the 12-slot backplane, the other 6 are in the 24-slot one.

Code:

Jul  4 16:02:29 ultraman	 (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 21 68 00 00 01 00 00 00 length 131072 SMID 983 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 21 68 00 00 01 00 00 00
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): CAM status: CCB request completed with an error
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Retrying command
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 20 68 00 00 01 00 00 00
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): CAM status: SCSI Status Error
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): SCSI status: Check Condition
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Info: 0x19c502068
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Error 5, Unretryable error
...
...
Jul  4 18:05:12 ultraman	 (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0d a8 00 01 00 00 length 131072 SMID 263 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0d a8 00 01 00 00
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): CAM status: CCB request completed with an error
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Retrying command
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0b a8 00 01 00 00
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): CAM status: SCSI Status Error
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): SCSI status: Check Condition
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Info: 0x6f680ba8
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Error 5, Unretryable error
...
...
Jul  4 21:06:08 ultraman	 (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 68 50 00 00 c0 00 length 98304 SMID 107 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 68 50 00 00 c0 00
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Retrying command
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 67 88 00 00 c8 00
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Info: 0xf82d6788
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  4 23:10:41 ultraman	 (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb f0 00 00 b0 00 length 90112 SMID 270 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb f0 00 00 b0 00
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): CAM status: CCB request completed with an error
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Retrying command
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb 40 00 00 b0 00
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): CAM status: SCSI Status Error
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): SCSI status: Check Condition
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Info: 0xd4b1eb40
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Error 5, Unretryable error
...
...
Jul  5 01:12:20 ultraman	 (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7f a8 00 00 01 00 00 00 length 131072 SMID 667 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7f a8 00 00 01 00 00 00
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): CAM status: CCB request completed with an error
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Retrying command
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7e a8 00 00 01 00 00 00
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): CAM status: SCSI Status Error
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): SCSI status: Check Condition
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Info: 0x134637ea8
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Error 5, Unretryable error
...
...
Jul  5 02:14:19 ultraman	 (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c9 78 00 00 01 00 00 00 length 131072 SMID 88 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c9 78 00 00 01 00 00 00
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): CAM status: CCB request completed with an error
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Retrying command
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c7 78 00 00 01 00 00 00
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): CAM status: SCSI Status Error
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): SCSI status: Check Condition
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Info: 0x1aa52c778
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Error 5, Unretryable error
...
...
Jul  5 03:19:25 ultraman	 (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 06 70 00 00 01 00 00 00 length 131072 SMID 499 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 06 70 00 00 01 00 00 00
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): CAM status: CCB request completed with an error
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Retrying command
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 08 50 00 00 01 00 00 00
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): CAM status: SCSI Status Error
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): SCSI status: Check Condition
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Info: 0x143550850
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Error 5, Unretryable error

Pool status after the scrub:

Code:

  scan: scrub repaired 828K in 14h51m with 0 errors on Wed Jul  5 05:51:34 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors
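
To compare scrubs over time, the "scan:" line can be reduced to numbers. The sketch below is Python and assumes the exact wording printed by this pool ("scrub repaired 512K in 14h58m with 0 errors"); other ZFS versions format this line differently, so the pattern is an assumption that may need adjusting.

```python
import re

# Pattern for the scan line as printed in the status output above.
SCAN_LINE = re.compile(
    r"scrub repaired (\d+(?:\.\d+)?)([KMGT]?) in (\d+)h(\d+)m with (\d+) errors"
)
UNIT_BYTES = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_scrub_line(line: str):
    """Return repaired bytes, duration in minutes and error count, or None."""
    m = SCAN_LINE.search(line)
    if m is None:
        return None
    amount, unit, hours, minutes, errors = m.groups()
    return {
        "repaired_bytes": int(float(amount) * UNIT_BYTES[unit]),
        "minutes": int(hours) * 60 + int(minutes),
        "errors": int(errors),
    }

first = parse_scrub_line(
    "scan: scrub repaired 512K in 14h58m with 0 errors on Tue Jul  4 02:46:15 2017"
)
second = parse_scrub_line(
    "scan: scrub repaired 828K in 14h51m with 0 errors on Wed Jul  5 05:51:34 2017"
)
print(first)   # {'repaired_bytes': 524288, 'minutes': 898, 'errors': 0}
print(second)  # {'repaired_bytes': 847872, 'minutes': 891, 'errors': 0}
```

Both scrubs ran for just under 15 hours, and the amount repaired went from 512K to 828K.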

In conclusion:

Changing both PSUs did not resolve the problem.
Changing the HBA-to-backplane cables did not resolve the problem.
Errors occur randomly on "all" disks.
Errors occur on disks on either backplane.
Running on only 12 disks (down from 24) allowed me to scrub with no errors.

tobiasbp · Jul 5, 2017

I have now moved the machine to an air-conditioned room and am running it off a UPS. I have started another scrub.

tobiasbp · Jul 6, 2017

Running the machine off a UPS in an air-conditioned room did not resolve the issue. I still get SCSI errors on every scrub of the pool.

10 errors occurred, all on the same two disks (da0 & da6). Both disks are on the 24-slot backplane. This would make one think these disks are the problem, but on the previous scrub, the errors were spread across 7 different disks. Also, both disks passed a SMART long test, and the SMART data shows no problems (see previous post).

Code:

(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 f0 00 00 e0 00 length 114688 SMID 316 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 f0 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 10 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x6436e410
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cd 18 00 00 a0 00 length 81920 SMID 594 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cd 18 00 00 a0 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cc 70 00 00 a8 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x7c32cc70
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b7 48 00 00 98 00 length 77824 SMID 780 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b7 48 00 00 98 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b6 b8 00 00 90 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xfc2eb6b8
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c3 70 00 00 d8 00 length 110592 SMID 395 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c3 70 00 00 d8 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c2 90 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xd032c290
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f df a8 00 00 00 c8 00 00 length 102400 SMID 863 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f df a8 00 00 00 c8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f de d0 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x1103fded0
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 e0 00 00 00 b8 00 00 length 94208 SMID 770 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 e0 00 00 00 b8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 28 00 00 00 b8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x10c2e8428
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3b 70 00 00 00 b8 00 00 length 94208 SMID 225 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3b 70 00 00 00 b8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3a b0 00 00 00 c0 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x1603d3ab0
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00 length 110592 SMID 510 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 08 00 00 00 d8 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x152b05108
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 86 90 00 00 00 d8 00 00 length 110592 SMID 99 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 86 90 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 85 b8 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x12fe485b8
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e3 50 00 00 00 c8 00 00 length 102400 SMID 103 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e3 50 00 00 00 c8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e2 90 00 00 00 c0 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x15c23e290
(da6:mps0:0:17:0): Error 5, Unretryable error
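
For anyone reading along, the CDB bytes in these log lines can be decoded by hand to see which LBA and how many blocks each read covered. A minimal sketch in Python, assuming the standard READ(10)/READ(16) CDB layouts and 512-byte logical blocks (the block size is an assumption, although it agrees with the `length` values printed by mps); `decode_read_cdb` is a hypothetical helper name, not part of any existing tool:

```python
def decode_read_cdb(cdb_hex, block_size=512):
    """Decode a READ(10) or READ(16) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks).
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    opcode = cdb[0]
    if opcode == 0x88 and len(cdb) == 16:    # READ(16)
        lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: LBA
        blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    elif opcode == 0x28 and len(cdb) == 10:  # READ(10)
        lba = int.from_bytes(cdb[2:6], "big")       # bytes 2-5: LBA
        blocks = int.from_bytes(cdb[7:9], "big")    # bytes 7-8: transfer length
    else:
        raise ValueError("not a READ(10)/READ(16) CDB: " + cdb_hex)
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


if __name__ == "__main__":
    info = decode_read_cdb("88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00")
    print(hex(info["lba"]), info["blocks"], info["bytes"])
```

For the CDB `88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00` this gives LBA 0x152b051e0 and 216 blocks, i.e. 110592 bytes, which matches the `length 110592` mps reported for that command in the log above.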

Status of the pool after scrubbing:

Code:

  pool: ultraman
 state: ONLINE
  scan: scrub repaired 988K in 15h2m with 0 errors on Thu Jul  6 01:51:53 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors

Stux · Jul 6, 2017

tobiasbp said:

You're seeing read errors... relatively persistently. Either they're being caused by faulty cabling, faulty power, faulty hardware, or faulty software.

You might see one once in a blue moon, but you shouldn't be seeing them regularly. It's a bit like seeing ECC errors. ECC errors could be caused by cosmic rays, but they're normally caused by faulty hardware ;)

I know you've tried everything except replacing the backplanes and/or updating their software. This is the type of thing that can be caused by buggy/mismatched firmware.

Unfortunately, I can't provide any more advice than this.

The good news is that ZFS does provide very powerful integrity control, and is probably capable of dealing with these issues.

tobiasbp · Jul 7, 2017

As expected, I get errors on every scrub. But, recently, the errors have only occurred on the disks da0 & da6 (I have written down the serial numbers). I will disconnect them, and scrub.

All errors on same two disks:

Code:

	(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be e0 00 01 00 00 length 131072 SMID 332 terminated ioc 804b scsi 0 state 0 xfer 0
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 4e ad f4 68 00 00 08 00 length 4096 SMID 441 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be e0 00 01 00 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 4e ad f4 68 00 00 08 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be 38 00 00 a8 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x5036be38
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 65 20 00 00 a8 00 length 86016 SMID 914 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 65 20 00 00 a8 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 64 70 00 00 b0 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xb42c6470
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 67 78 00 00 c0 00 length 98304 SMID 906 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 67 78 00 00 c0 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 66 d8 00 00 a0 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xc42b66d8
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 09 10 00 00 b8 00 length 94208 SMID 671 terminated ioc 804b scsi 0 state 0 xfer 0
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 d9 95 3c 70 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 09 10 00 00 b8 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d9 95 3c 70 00 00 08 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 08 58 00 00 b8 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xdc280858
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 41 38 00 00 b8 00 length 94208 SMID 229 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 41 38 00 00 b8 00
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 40 88 00 00 b0 00
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0xf4344088
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 40 88 00 00 f8 00 length 126976 SMID 888 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 40 88 00 00 f8 00
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 3f 90 00 00 f8 00
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0xd83f3f90
(da0:mps0:0:9:0): Error 5, Unretryable error
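
To check whether the errors really cluster on two disks, the sense lines can be tallied per device. A rough sketch in Python, assuming the console log has been saved as text in the format shown above; the function name and the regular expression are illustrative only and only match `daN` devices behind an `mps` controller:

```python
import re
from collections import Counter

# Matches e.g. "(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (...)"
SENSE_RE = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\): SCSI sense: (.+?) asc:")


def tally_sense_errors(log_text):
    """Count SCSI sense lines per (device, sense key) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SENSE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)\n"
        "(da6:mps0:0:17:0): Info: 0xc42b66d8\n"
        "(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)\n"
    )
    for (device, sense), n in sorted(tally_sense_errors(sample).items()):
        print(device, sense, n)
```

On the excerpt above this counts four MEDIUM ERROR entries for da6 and two for da0. Running it over logs from several scrubs would show whether the affected devices stay the same or move around.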

Morpheus187 · Jul 7, 2017

That seems to be a really frustrating problem. I have also had machines with an error that I was unable to track down.

By now I think the problem is maybe the motherboard itself.

Good luck on resolving that issue, I hope you get it :)

BigDave · Jul 7, 2017

Just kinda glanced through the (now 3 pages long) thread and have not seen any mention of changing out (or at least inspecting) the Power Distribution Backplane. I've changed out mine before and know there are caps on the PCB that could be failing. You have changed the redundant PSUs, but was the PDU checked? If I missed it, just ignore me...

tobiasbp · Jul 7, 2017

Morpheus187 said:
By now I think the problem is maybe the motherboard itself.

Me too. I actually have another motherboard lying around, which I'm considering swapping out with the one currently in the machine.

BigDave said:
You have changed the redundant PSUs, but was the PDU checked?

No, I'm unaware of such a thing. I'll go look at the chassis manual. Thank you for the suggestion.

BigDave · Jul 7, 2017

tobiasbp said:
No, I'm unaware of such a thing. I'll go look at the chassis manual. Thank you for the suggestion.

I think this is what @NZ_JJ was referring to in his response.

NZ_JJ said:
Also, may be degraded power cables internally - did you change these when you replaced the PSUs?

Cougar014 · Jul 7, 2017

I have the same issue. I'm just a little overwhelmed by how many things you have already tried compared to me!

tobiasbp · Jul 17, 2017

I was seeing a lot of errors on the disks da0 and da6. I have previously seen that errors seem to appear on all disks at random. To check, I have disconnected da0 & da6. As expected, this has not resolved the issue. SCSI errors seem to occur on all disks (on both backplanes) at random.

Code:

  pool: ultraman
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 248K in 14h57m with 0 errors on Sun Jul 16 14:57:53 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										DEGRADED	 0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 1	 0
	  mirror-2									  DEGRADED	 0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		11172241336233623768						REMOVED	  0	 0	 0  was /dev/gptid/3d414933-3a05-11e7-af73-0025901ef244
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 2
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 DEGRADED	 0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		3534930770157901362						 REMOVED	  0	 0	 0  was /dev/gptid/83d890bc-3a08-11e7-af73-0025901ef244

errors: No known data errors
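
With this many vdevs it is easy to miss a nonzero counter. A small sketch in Python that lists only the rows whose READ/WRITE/CKSUM columns are not all zero, assuming the `zpool status` output has been captured as text and that the counters are plain integers (abbreviated counts such as `1.2K` are simply skipped by this sketch); `devices_with_errors` is a made-up name:

```python
def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for rows with any nonzero counter."""
    hits = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [notes...]
        if len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        if not all(c.isdigit() for c in counters):
            continue  # header line, prose, or abbreviated counts
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            hits.append((name, read, write, cksum))
    return hits


if __name__ == "__main__":
    sample = (
        "\tNAME  STATE  READ WRITE CKSUM\n"
        "\tultraman  DEGRADED  0  0  0\n"
        "\t\tgptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE  0  1  0\n"
        "\t\tgptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE  0  0  2\n"
    )
    for row in devices_with_errors(sample):
        print(row)
```

Fed the status output above, it would report the five devices with a single write error and the one device with two checksum errors, spread across mirror-1, mirror-5, mirror-6 and mirror-9.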

tvsjr · Jul 17, 2017

You have a systemic problem. It sounds like the one common thread is the backplane. At this point, I'd either invest in a new backplane or buy/borrow sufficient breakout cables and power adapters (don't go cheap here) to let you get power and data to the drives without the backplane involved.

tobiasbp · Jul 17, 2017

tvsjr said:
You have a systemic problem. It sounds like the one common thread is the backplane. At this point, I'd either invest in a new backplane or buy/borrow sufficient breakout cables and power adapters (don't go cheap here) to let you get power and data to the drives without the backplane involved.

The thing is, errors occur on disks attached to either backplane (a 12-slot and a 24-slot), which seems to indicate the problem is not the backplane. Others have suggested I inspect the power distribution unit. Looking at my chassis, it seems to be the PDB-PT847-8820. I cannot find a page on it at supermicro.com.
