SOLVED Various SCSI sense errors during scrubbing


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This makes me think I really should be OK, when running 24 drives on brand new redundant 1400W PSUs.

Yeah, if you look at the Supermicro 24-drive systems, they typically offer a chassis option with a smaller set of redundant PSUs (think 920W currently) and then a larger one. This roughly corresponds to "single CPU and not so complex" and "dual CPU with extra cards".

The PSU sizing guidance is more for people who don't really have the engineering background or resources to understand where these "big" numbers come from, and who've used some craptacular Internet website calculator for PSU sizing that only provides a handful of watts per drive. There were a number of insipid debates by people who were convinced that undersizing your PSU to save a few bucks and putting thousands of dollars of electronics and drives at risk is a "good thing" or somehow "energy saving". I don't actually care if you want to do that, but, damn it, I want people to make educated choices, not just listen to some handwavey stuff, so that's why I banged out all the reasoning and how the numbers were derived.

Other large system manufacturers like 45Drives also talk about power design, and I encourage people to read that too.

Anyways, the rules can be relaxed as the drive count and PSU size increase, but it is kind of complex and more than a little specialized as an area of knowledge. For the most part, your chassis manufacturer will probably ballpark you into the right place if you let them, and if you're running the 1.4k's on a 24 drive dual CPU system, you should have plenty of power.

It isn't clear to me what's wrong here, but you seem to be taking reasonable steps to isolate and resolve. I expect you'll find an issue sooner or later.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Yeah, if you look at the Supermicro 24-drive systems, they typically offer a chassis option with a smaller set of redundant PSUs (think 920W currently) and then a larger one. This roughly corresponds to "single CPU and not so complex" and "dual CPU with extra cards".

Great, so I'm not underpowered (assuming the PSUs do what they are supposed to do; they are brand new).

It isn't clear to me what's wrong here, but you seem to be taking reasonable steps to isolate and resolve. I expect you'll find an issue sooner or later.

I get SCSI errors in the log every time I scrub my pool. People have suggested bad PSUs as the most likely culprit, so I replaced both PSUs. That did not resolve the problem.

I have previously been able to scrub the pool without errors when I had degraded all 12 mirrors (so only 12 disks were connected, down from 24). That would point to a power issue, but since I replaced both PSUs, and I'm not underpowered, I'm not sure what to think.
 

NZ_JJ

Dabbler
Joined
May 25, 2017
Messages
28
May still be a power issue. Could be "dirty" power from the wall. Try either moving the server to a different circuit in the building or using an online UPS that filters the incoming power.
Also, it may be degraded internal power cables. Did you change these when you replaced the PSUs?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
May still be a power issue. Could be "dirty" power from the wall. Try either moving the server to a different circuit in the building or using an online UPS that filters the incoming power.

I will run the machine off a UPS and see what happens.

Also, may be degraded power cables internally - did you change these when you replaced the PSUs?

No, I did not. The only cables I have replaced are the SAS ones (SFF-8087 connectors).
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have moved 12 disks to the smaller backplane. I now have 12 disks on each backplane for a total of 24 disks connected. Each backplane is directly connected to an IBM ServeRAID M1015.

I have started a scrub of the pool.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Running with 12 disks on each backplane did not resolve the issue. I still get SCSI errors, and scrubbing still repairs data in the pool. This time, however, none of the error counters were incremented, even though data was repaired, as seen below. Does that make sense? I would expect the READ counter to be incremented for the disks where SCSI errors occurred.
Code:
  pool: ultraman
state: ONLINE
  scan: scrub repaired 512K in 14h58m with 0 errors on Tue Jul  4 02:46:15 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors


Splitting the pool across the two backplanes (each directly connected to an HBA) reduced the scrub time from 22 hours to 16 hours.


The log shows that 5 SCSI errors occurred during the scrub:
Code:
Jul  3 19:24:05 ultraman	 (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 0a 20 00 01 00 00 length 131072 SMID 951 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 0a 20 00 01 00 00
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 2e 09 20 00 01 00 00
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Info: 0xd02e0920
Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  3 20:55:50 ultraman	 (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d8 68 00 00 00 b0 00 00 length 90112 SMID 157 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d8 68 00 00 00 b0 00 00
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 18 27 d7 b8 00 00 00 b0 00 00
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Info: 0x11827d7b8
Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  3 22:04:53 ultraman	 (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 b0 60 00 00 c8 00 length 102400 SMID 970 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 b0 60 00 00 c8 00
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): CAM status: CCB request completed with an error
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Retrying command
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): READ(10). CDB: 28 00 fc 29 af 98 00 00 c8 00
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): CAM status: SCSI Status Error
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): SCSI status: Check Condition
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Info: 0xfc29af98
Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
Jul  3 22:19:33 ultraman	 (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 2a 90 00 00 00 c8 00 00 length 102400 SMID 67 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 2a 90 00 00 00 c8 00 00
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Retrying command
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 38 36 29 c8 00 00 00 c8 00 00
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Info: 0x1383629c8
Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  4 00:35:53 ultraman		(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 54 38 00 00 00 c8 00 00 length 102400 SMID 679 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 54 38 00 00 00 c8 00 00
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): CAM status: CCB request completed with an error
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Retrying command
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2d 9d 53 78 00 00 00 c0 00 00
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): CAM status: SCSI Status Error
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): SCSI status: Check Condition
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Info: 0x12d9d5378
Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): Error 5, Unretryable error
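
To keep track of which disks report errors from scrub to scrub, the sense lines in the log above can be tallied per device. The following is only a rough sketch: the function name `count_sense_errors` and the regular expression are my own, and the pattern assumes the FreeBSD CAM log format exactly as it appears in this thread.

```python
import re
from collections import Counter

# Matches CAM sense lines as they appear in the log above, for example:
#   (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
# The pattern is an assumption based on these log lines only.
SENSE_RE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: ([A-Z ]+?) asc:")


def count_sense_errors(lines):
    """Return a Counter keyed by (device, sense key), e.g. ('da6', 'MEDIUM ERROR')."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The five sense lines from the scrub above, plus one unrelated line.
sample = [
    "Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)",
    "Jul  3 20:55:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)",
    "Jul  3 22:04:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)",
    "Jul  3 22:19:33 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)",
    "Jul  4 00:35:53 ultraman (da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)",
    "Jul  3 19:24:05 ultraman (da6:mps0:0:17:0): Retrying command",
]

counts = count_sense_errors(sample)
for (device, sense), n in sorted(counts.items()):
    print(device, sense, n)
```

In practice the lines would come from the system log file rather than a hard-coded list; the file path is left out here because it depends on the installation.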



SMART data for disk da6 (3 of 5 errors occurred on this disk):
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   202   202   021	Pre-fail  Always	   -	   8875
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   13
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   1174
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   13
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   11
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   43
194 Temperature_Celsius	 0x0022   115   106   000	Old_age   Always	   -	   37
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0


SMART data for disk da0 (2 of 5 errors occurred on this disk):
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0027   202   202   021	Pre-fail  Always	   -	   8858
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   13
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   1102
10 Spin_Retry_Count		0x0032   100   253   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   253   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   13
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   11
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   38
194 Temperature_Celsius	 0x0022   116   107   000	Old_age   Always	   -	   36
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   100   253   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
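
Reading two attribute tables by eye is error-prone, so here is a small sketch that flags non-zero raw values for a handful of attributes. The selection in `WATCHED_IDS` is my own assumption about which attributes are worth watching, not an official list, and the parser assumes the ten-column layout shown in the tables above.

```python
# Assumption: attributes commonly checked for media or link trouble.
# 5 = Reallocated_Sector_Ct, 196 = Reallocated_Event_Count,
# 197 = Current_Pending_Sector, 198 = Offline_Uncorrectable,
# 199 = UDMA_CRC_Error_Count
WATCHED_IDS = {5, 196, 197, 198, 199}


def nonzero_watched(smart_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # the header row starts with "ID#" and is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


# A few rows copied from the da6 table above.
da6_rows = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(da6_rows))
```

For the rows copied from da6 nothing is flagged, which matches what the tables show: every watched raw value is 0 on both disks.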


Both disks (da0 & da6) are on the 24-slot backplane. I have started SMART long tests on both disks (EDIT: They both passed).
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I started a new scrub of the pool. Scrubbing aborted with the zpool having the following status:

Code:
  pool: ultraman
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 132K in 0h0m with 0 errors on Tue Jul  4 14:00:59 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors


The log shows a SCSI write error:
Code:
(da13:mps0:0:24:0): WRITE(10). CDB: 2a 00 01 0a 5a f0 00 00 08 00
(da13:mps0:0:24:0): CAM status: SCSI Status Error
(da13:mps0:0:24:0): SCSI status: Check Condition
(da13:mps0:0:24:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(da13:mps0:0:24:0): Info: 0x10a5af0
(da13:mps0:0:24:0): Error 22, Unretryable error
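
The CDB bytes in these log lines can be decoded to see which block address and how much data each command asked for. The sketch below follows the standard READ/WRITE(10) and READ/WRITE(16) layouts; the function name `decode_cdb` is my own, and the 512-byte block size is an assumption (it is consistent with the "length" values printed in the read errors, but the logs do not state it).

```python
def decode_cdb(cdb_hex, block_size=512):
    """Decode the LBA and transfer length from a 10- or 16-byte READ/WRITE CDB.

    block_size=512 is an assumption; the logs do not state the sector size.
    """
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode in (0x28, 0x2A):    # READ(10), WRITE(10)
        lba = int.from_bytes(cdb[2:6], "big")       # bytes 2-5: 32-bit LBA
        blocks = int.from_bytes(cdb[7:9], "big")    # bytes 7-8: 16-bit length
    elif opcode in (0x88, 0x8A):  # READ(16), WRITE(16)
        lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: 64-bit LBA
        blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: 32-bit length
    else:
        raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    return {"opcode": opcode, "lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# CDBs copied from the logs in this thread.
write10 = decode_cdb("2a 00 01 0a 5a f0 00 00 08 00")
read10 = decode_cdb("28 00 d0 2e 0a 20 00 01 00 00")
read16 = decode_cdb("88 00 00 00 00 01 18 27 d8 68 00 00 00 b0 00 00")

for name, cmd in (("WRITE(10)", write10), ("READ(10)", read10), ("READ(16)", read16)):
    print(f"{name}: lba=0x{cmd['lba']:x} blocks={cmd['blocks']} bytes={cmd['bytes']}")
```

For the WRITE(10) above this gives LBA 0x10a5af0, the same value as the "Info" line, and a length of 8 blocks. For the two read examples the byte counts come out as 131072 and 90112, matching the "length" fields the mps driver printed for those commands.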


The disk with the error is in the 24-slot backplane. The errors in the previous scrub were also on disks in the 24-slot backplane.

Starting a new scrub now. Will run the machine off a UPS tomorrow.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Errors on 7 different disks during the last scrub. 1 disk is in the 12-slot backplane; the other 6 are in the 24-slot one.

Code:
Jul  4 16:02:29 ultraman	 (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 21 68 00 00 01 00 00 00 length 131072 SMID 983 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 21 68 00 00 01 00 00 00
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): CAM status: CCB request completed with an error
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Retrying command
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 9c 50 20 68 00 00 01 00 00 00
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): CAM status: SCSI Status Error
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): SCSI status: Check Condition
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Info: 0x19c502068
Jul  4 16:02:29 ultraman (da11:mps0:0:22:0): Error 5, Unretryable error
...
...
Jul  4 18:05:12 ultraman	 (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0d a8 00 01 00 00 length 131072 SMID 263 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0d a8 00 01 00 00
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): CAM status: CCB request completed with an error
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Retrying command
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): READ(10). CDB: 28 00 6f 68 0b a8 00 01 00 00
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): CAM status: SCSI Status Error
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): SCSI status: Check Condition
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Info: 0x6f680ba8
Jul  4 18:05:12 ultraman (da23:mps0:0:34:0): Error 5, Unretryable error
...
...
Jul  4 21:06:08 ultraman	 (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 68 50 00 00 c0 00 length 98304 SMID 107 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 68 50 00 00 c0 00
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Retrying command
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 f8 2d 67 88 00 00 c8 00
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Info: 0xf82d6788
Jul  4 21:06:08 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
...
...
Jul  4 23:10:41 ultraman	 (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb f0 00 00 b0 00 length 90112 SMID 270 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb f0 00 00 b0 00
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): CAM status: CCB request completed with an error
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Retrying command
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): READ(10). CDB: 28 00 d4 b1 eb 40 00 00 b0 00
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): CAM status: SCSI Status Error
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): SCSI status: Check Condition
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Info: 0xd4b1eb40
Jul  4 23:10:41 ultraman (da21:mps0:0:32:0): Error 5, Unretryable error
...
...
Jul  5 01:12:20 ultraman	 (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7f a8 00 00 01 00 00 00 length 131072 SMID 667 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7f a8 00 00 01 00 00 00
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): CAM status: CCB request completed with an error
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Retrying command
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): READ(16). CDB: 88 00 00 00 00 01 34 63 7e a8 00 00 01 00 00 00
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): CAM status: SCSI Status Error
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): SCSI status: Check Condition
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Info: 0x134637ea8
Jul  5 01:12:20 ultraman (da20:mps0:0:31:0): Error 5, Unretryable error
...
...
Jul  5 02:14:19 ultraman	 (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c9 78 00 00 01 00 00 00 length 131072 SMID 88 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c9 78 00 00 01 00 00 00
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): CAM status: CCB request completed with an error
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Retrying command
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): READ(16). CDB: 88 00 00 00 00 01 aa 52 c7 78 00 00 01 00 00 00
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): CAM status: SCSI Status Error
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): SCSI status: Check Condition
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Info: 0x1aa52c778
Jul  5 02:14:19 ultraman (da18:mps0:0:29:0): Error 5, Unretryable error
...
...
Jul  5 03:19:25 ultraman	 (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 06 70 00 00 01 00 00 00 length 131072 SMID 499 terminated ioc 804b scsi 0 state 0 xfer 0
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 06 70 00 00 01 00 00 00
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): CAM status: CCB request completed with an error
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Retrying command
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): READ(16). CDB: 88 00 00 00 00 01 43 55 08 50 00 00 01 00 00 00
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): CAM status: SCSI Status Error
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): SCSI status: Check Condition
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Info: 0x143550850
Jul  5 03:19:25 ultraman (da12:mps0:0:23:0): Error 5, Unretryable error


Pool status after the scrub:
Code:
  scan: scrub repaired 828K in 14h51m with 0 errors on Wed Jul  5 05:51:34 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors


In conclusion:
  • Changing both PSUs did not resolve the problem
  • Changing the HBA-to-backplane cables did not resolve the problem
  • Errors occur randomly on "all" disks
  • Errors occur on disks on either backplane
  • Running on only 12 (down from 24) disks allowed me to scrub with no errors.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have now moved the machine to an air-conditioned room and am running it off a UPS. I have started another scrub.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Running the machine off a UPS in an air-conditioned room did not resolve the issue. I still get SCSI errors on every scrub of the pool.

10 errors occurred, all on the same two disks (da0 & da6). Both disks are on the 24-slot backplane. This would make one think these disks are the problem, but on the previous scrub, the errors were spread across 7 different disks. Also, both disks passed a SMART long test, and the SMART data shows no problems (see previous post).

Code:
(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 f0 00 00 e0 00 length 114688 SMID 316 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 f0 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 64 36 e4 10 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x6436e410
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cd 18 00 00 a0 00 length 81920 SMID 594 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cd 18 00 00 a0 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 7c 32 cc 70 00 00 a8 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x7c32cc70
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b7 48 00 00 98 00 length 77824 SMID 780 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b7 48 00 00 98 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 fc 2e b6 b8 00 00 90 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xfc2eb6b8
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c3 70 00 00 d8 00 length 110592 SMID 395 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c3 70 00 00 d8 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d0 32 c2 90 00 00 e0 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xd032c290
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f df a8 00 00 00 c8 00 00 length 102400 SMID 863 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f df a8 00 00 00 c8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 10 3f de d0 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x1103fded0
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 e0 00 00 00 b8 00 00 length 94208 SMID 770 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 e0 00 00 00 b8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 0c 2e 84 28 00 00 00 b8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x10c2e8428
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3b 70 00 00 00 b8 00 00 length 94208 SMID 225 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3b 70 00 00 00 b8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 60 3d 3a b0 00 00 00 c0 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x1603d3ab0
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00 length 110592 SMID 510 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 e0 00 00 00 d8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 52 b0 51 08 00 00 00 d8 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x152b05108
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 86 90 00 00 00 d8 00 00 length 110592 SMID 99 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 86 90 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(16). CDB: 88 00 00 00 00 01 2f e4 85 b8 00 00 00 d8 00 00 
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0x12fe485b8
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e3 50 00 00 00 c8 00 00 length 102400 SMID 103 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e3 50 00 00 00 c8 00 00 
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 5c 23 e2 90 00 00 00 c0 00 00 
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x15c23e290
(da6:mps0:0:17:0): Error 5, Unretryable error
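
The CDB bytes in these lines can be decoded by hand to see which blocks each read covered. Below is a minimal Python sketch, assuming 512-byte logical blocks (this matches the logged lengths, e.g. 0xc8 = 200 blocks = 102400 bytes). `decode_read_cdb` is a hypothetical helper name, not part of FreeNAS or any tool.

```python
def decode_read_cdb(cdb_hex, block_size=512):
    """Decode a READ(10) or READ(16) CDB given as space-separated hex bytes.

    block_size=512 is an assumption; it is consistent with the logged
    'length' values but should be checked against the actual drives.
    """
    cdb = bytes.fromhex(cdb_hex.strip())
    if len(cdb) == 10 and cdb[0] == 0x28:
        # READ(10): bytes 2-5 hold the LBA, bytes 7-8 the block count
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
    elif len(cdb) == 16 and cdb[0] == 0x88:
        # READ(16): bytes 2-9 hold the LBA, bytes 10-13 the block count
        lba = int.from_bytes(cdb[2:10], "big")
        blocks = int.from_bytes(cdb[10:14], "big")
    else:
        raise ValueError("not a READ(10) or READ(16) CDB")
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


if __name__ == "__main__":
    # First da0 entry above: the terminated command, then the retried one
    first = decode_read_cdb("88 00 00 00 00 01 10 3f df a8 00 00 00 c8 00 00")
    retry = decode_read_cdb("88 00 00 00 00 01 10 3f de d0 00 00 00 d8 00 00")
    print(first["blocks"], first["bytes"])
    print(hex(retry["lba"]))
```

Decoding the retried da0 command gives LBA 0x1103fded0, the same value the kernel prints on the `Info:` line that follows it.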


Status of the pool after scrubbing:
Code:
  pool: ultraman
 state: ONLINE
  scan: scrub repaired 988K in 15h2m with 0 errors on Thu Jul  6 01:51:53 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										ONLINE	   0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-2									  ONLINE	   0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/3d414933-3a05-11e7-af73-0025901ef244  ONLINE	   0	 0	 0
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 ONLINE	   0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/83d890bc-3a08-11e7-af73-0025901ef244  ONLINE	   0	 0	 0

errors: No known data errors
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Installing two brand new PSUs has not solved the problem. I still get SCSI errors during scrubbing/resilvering.

My understanding is that SCSI errors should never happen (do people agree on this?).

The latest error
Code:
Jun 22 16:46:50 ultraman	 (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00 length 131072 SMID 781 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Retrying command
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 85 b8 00 00 c8 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Info: 0x800885b8
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error

You're seeing read errors... relatively persistently. Either they're being caused by faulty cabling, faulty power, faulty hardware, or faulty software.

You might see one once in a blue moon, but you shouldn't be seeing them regularly. It's a bit like seeing ECC errors. ECC errors could be caused by cosmic rays, but they're normally caused by faulty hardware ;)

I know you've tried everything except replacing the backplanes and/or updating their software. This is the type of thing that can be caused by buggy/mismatched firmware.

Unfortunately, I can't provide any more advice than this.

The good news is that ZFS does provide very powerful integrity control, and is probably capable of dealing with these issues.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
As expected, I get errors on every scrub. But, recently, the errors have only occurred on the disks da0 & da6 (I have written down the serial numbers). I will disconnect them, and scrub.

All errors on the same two disks:
Code:
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be e0 00 01 00 00 length 131072 SMID 332 terminated ioc 804b scsi 0 state 0 xfer 0
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 4e ad f4 68 00 00 08 00 length 4096 SMID 441 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be e0 00 01 00 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 4e ad f4 68 00 00 08 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 50 36 be 38 00 00 a8 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x5036be38
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 65 20 00 00 a8 00 length 86016 SMID 914 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 65 20 00 00 a8 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 b4 2c 64 70 00 00 b0 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xb42c6470
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 67 78 00 00 c0 00 length 98304 SMID 906 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 67 78 00 00 c0 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 c4 2b 66 d8 00 00 a0 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xc42b66d8
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 09 10 00 00 b8 00 length 94208 SMID 671 terminated ioc 804b scsi 0 state 0 xfer 0
	(da6:mps0:0:17:0): READ(10). CDB: 28 00 d9 95 3c 70 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state 0 xfer 0
(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 09 10 00 00 b8 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 d9 95 3c 70 00 00 08 00
(da6:mps0:0:17:0): CAM status: CCB request completed with an error
(da6:mps0:0:17:0): Retrying command
(da6:mps0:0:17:0): READ(10). CDB: 28 00 dc 28 08 58 00 00 b8 00
(da6:mps0:0:17:0): CAM status: SCSI Status Error
(da6:mps0:0:17:0): SCSI status: Check Condition
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0xdc280858
(da6:mps0:0:17:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 41 38 00 00 b8 00 length 94208 SMID 229 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 41 38 00 00 b8 00
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(10). CDB: 28 00 f4 34 40 88 00 00 b0 00
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0xf4344088
(da0:mps0:0:9:0): Error 5, Unretryable error
	(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 40 88 00 00 f8 00 length 126976 SMID 888 terminated ioc 804b scsi 0 state 0 xfer 0
(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 40 88 00 00 f8 00
(da0:mps0:0:9:0): CAM status: CCB request completed with an error
(da0:mps0:0:9:0): Retrying command
(da0:mps0:0:9:0): READ(10). CDB: 28 00 d8 3f 3f 90 00 00 f8 00
(da0:mps0:0:9:0): CAM status: SCSI Status Error
(da0:mps0:0:9:0): SCSI status: Check Condition
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da0:mps0:0:9:0): Info: 0xd83f3f90
(da0:mps0:0:9:0): Error 5, Unretryable error
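
To check whether errors really cluster on a couple of disks, the sense lines can be tallied per device. This is a rough Python sketch, assuming the console output has been saved as text in the format shown above. `count_sense_errors` and the regular expression are illustrative, not an existing tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
SENSE_RE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: (.+?) asc:")


def count_sense_errors(log_text):
    """Count SCSI sense reports per (device, sense key) pair."""
    counts = Counter()
    for match in SENSE_RE.finditer(log_text):
        device, sense_key = match.group(1), match.group(2)
        counts[(device, sense_key)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the log above; real use would read a saved file
    sample = """\
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): Info: 0x5036be38
(da0:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
"""
    for (device, sense_key), n in count_sense_errors(sample).most_common():
        print(device, sense_key, n)
```

Run over the whole block above, this counts four MEDIUM ERROR reports for da6 and two for da0. Comparing such tallies across several scrubs shows whether the affected disks stay the same or move around.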

 

Morpheus187

Explorer
Joined
Mar 11, 2016
Messages
61
That seems to be a really frustrating problem. I have also had machines like that, where I had an error that I was unable to track down.

By now I think the problem is maybe the motherboard itself.

Good luck on resolving that issue, I hope you get it :)
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Just kinda glanced through the (now 3 pages long) thread and have not seen any mention of changing out (or at least inspecting) the Power Distribution Backplane. I've changed out mine before and know there are caps on the PCB that could be failing.
You have changed the redundant PSUs, but was the PDU checked?
If I missed it, just ignore me...
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
By now I think the problem is maybe the motherboard itself.

Me too. I actually have another motherboard lying around, which I'm considering swapping in for the one currently in the machine.

You have changed the redundant PSUs, but was the PDU checked?

No, I'm unaware of such a thing. I'll go look at the chassis manual. Thank you for the suggestion.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

Cougar014

Explorer
Joined
Oct 30, 2016
Messages
57
I have the same issue. I'm just a little overwhelmed by how many things you have already tried compared to me!
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I was seeing a lot of errors on the disks da0 and da6. I have previously seen that errors seem to appear on all disks at random. To check, I have disconnected da0 & da6. As expected, this has not resolved the issue. SCSI errors seem to occur on all disks (on both backplanes) at random.

Code:
  pool: ultraman
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 248K in 14h57m with 0 errors on Sun Jul 16 14:57:53 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	ultraman										DEGRADED	 0	 0	 0
	  mirror-0									  ONLINE	   0	 0	 0
		gptid/2e00ac23-183d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6e71919e-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
	  mirror-1									  ONLINE	   0	 0	 0
		gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 1	 0
	  mirror-2									  DEGRADED	 0	 0	 0
		gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		11172241336233623768						REMOVED	  0	 0	 0  was /dev/gptid/3d414933-3a05-11e7-af73-0025901ef244
	  mirror-3									  ONLINE	   0	 0	 0
		gptid/7ad0f185-1619-11e7-a3b7-0025901ef244  ONLINE	   0	 0	 0
		gptid/9e899578-183c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-4									  ONLINE	   0	 0	 0
		gptid/427c2189-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/4342a98c-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-5									  ONLINE	   0	 0	 0
		gptid/a1ab9a69-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
		gptid/a2851364-184c-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-6									  ONLINE	   0	 0	 0
		gptid/0dcbcccd-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/0e9ed582-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 2
	  mirror-7									  ONLINE	   0	 0	 0
		gptid/2b56cf1c-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/2c2a1c62-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-8									  ONLINE	   0	 0	 0
		gptid/69718320-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/6a4e6afa-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-9									  ONLINE	   0	 0	 0
		gptid/8fccb2c6-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
		gptid/90b48d70-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 1	 0
	  mirror-10									 ONLINE	   0	 0	 0
		gptid/aeda4f88-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		gptid/afb5bfc3-184d-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
	  mirror-11									 DEGRADED	 0	 0	 0
		gptid/bca6e4f0-1b71-11e7-ae9d-0025901ef244  ONLINE	   0	 0	 0
		3534930770157901362						 REMOVED	  0	 0	 0  was /dev/gptid/83d890bc-3a08-11e7-af73-0025901ef244

errors: No known data errors
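
With 24 rows it is easy to miss a non-zero counter. A small Python sketch that lists only the rows that are not clean, assuming the `zpool status` text layout shown above (NAME, STATE, READ, WRITE, CKSUM columns). `flagged_vdevs` is a hypothetical helper, not a ZFS command.

```python
def flagged_vdevs(status_text):
    """Return (name, state, read, write, cksum) for rows that are not clean.

    A row is flagged when its state is not ONLINE or any counter is not "0".
    Counters are kept as strings, so abbreviated values are still flagged.
    """
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # device rows start after this header
            continue
        if fields[0] == "errors:":
            in_config = False         # end of the device table for this pool
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged


if __name__ == "__main__":
    # Rows copied from the status output above, whitespace simplified
    sample = """
    NAME                                            STATE     READ WRITE CKSUM
    ultraman                                        DEGRADED     0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/6f22c98c-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
        gptid/6fe54bfe-1618-11e7-a3b7-0025901ef244  ONLINE       0     1     0
      mirror-2                                      DEGRADED     0     0     0
        gptid/70bfd5c6-1618-11e7-a3b7-0025901ef244  ONLINE       0     0     0
        11172241336233623768                        REMOVED      0     0     0  was /dev/gptid/3d414933-3a05-11e7-af73-0025901ef244

    errors: No known data errors
    """
    for row in flagged_vdevs(sample):
        print(*row)
```

On the sample rows this flags the pool itself, the disk with a write error in mirror-1, the degraded mirror-2 and its removed member.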

 

tvsjr

Guru
Joined
Aug 29, 2015
Messages
959
You have a systemic problem. It sounds like the one common thread is the backplane. At this point, I'd either invest in a new backplane or buy/borrow sufficient breakout cables and power adapters (don't go cheap here) to let you get power and data to the drives without the backplane involved.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
You have a systemic problem. It sounds like the one common thread is the backplane. At this point, I'd either invest in a new backplane or buy/borrow sufficient breakout cables and power adapters (don't go cheap here) to let you get power and data to the drives without the backplane involved.

The thing is, errors occur on disks attached to either backplane (a 12-slot and a 24-slot), which seems to indicate the problem is not the backplane. Others have suggested I inspect the power distribution unit. Looking at my chassis, it seems to be the PDB-PT847-8820. I cannot find a page on it at supermicro.com.
 