SOLVED Various SCSI sense errors during scrubbing

Status
Not open for further replies.

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
You put all your drives back yet?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
The 4th scrub of the degraded 12-mirror pool finished with no SCSI errors while running only on PSU 2. I have concluded that PSU 1 was the cause of the random SCSI errors I was seeing.

I have now re-added the missing 12 disks to the pool. Thus, the pool is back to 24 6TB disks in 12 mirrors. PSU 2 should be able to handle the load without SCSI errors occurring.

The pool is resilvering.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
During resilvering of the 24-disk pool, running only on PSU 2, a SCSI error occurred:
Code:
Jun 13 10:39:35 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_17769119138064296099.case.
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): WRITE(16). CDB: 8a 00 00 00 00 01 64 be e7 60 00 00 00 c0 00 00 
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): CAM status: SCSI Status Error
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): SCSI status: Check Condition
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): Error 22, Unretryable error


Also, ZFS started resilvering the entire pool again as soon as the original resilvering was done. What could be the reason for this?

I will wait for the resilvering to complete on PSU 2 only. I wonder if it will start yet another resilvering once it's done.

I'm now thinking that both of my PSUs need to be replaced. I have bought a new one, which I will try out once resilvering is complete.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Did you look into updating the backplanes' firmware?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
More SCSI errors are occurring during what looks like a never-ending resilvering process.
Code:
Jun 13 16:43:31 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_13826095765754444936.case.
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): WRITE(10). CDB: 2a 00 43 45 ac 60 00 01 00 00
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): CAM status: SCSI Status Error
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI status: Check Condition
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
...
...
Jun 14 01:47:52 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_9506731245968758733.case.
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): WRITE(10). CDB: 2a 00 a9 42 84 08 00 00 c0 00
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): CAM status: SCSI Status Error
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI status: Check Condition
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): Error 22, Unretryable error
...
...Jun 14 05:54:06 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_15352642184540883744.case.
Jun 14 05:54:06 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_15352642184540883744.case.
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): WRITE(10). CDB: 2a 00 06 c6 58 90 00 00 90 00
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): CAM status: SCSI Status Error
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI status: Check Condition
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): Error 22, Unretryable error
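With the errors now spread across several devices (da11, da18, da20, da22), it may help to tally which drives are affected before pulling hardware. A minimal sketch, assuming the syslog format shown above; the script and its helper name are my own, not an existing tool:

```python
import re
from collections import Counter

# Matches the device name in CAM sense lines such as:
# (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff ...
SENSE_LINE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: (.+)")

def tally_sense_errors(log_lines):
    """Count SCSI sense errors per device from syslog lines."""
    counts = Counter()
    for line in log_lines:
        m = SENSE_LINE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Sample lines copied from the log excerpts above:
sample = [
    "Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
    "Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
    "Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
]
print(tally_sense_errors(sample))
```

If the counts cluster on one backplane's drive slots, that would point at that backplane or its power cabling rather than the PSUs.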
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have now added my brand new PSU to the server. I'm now running with PSU 2 (which was OK with 12 disks, but apparently not with 24) and my brand new PSU 3.

So, all 12 mirrors are now resilvering with two PSUs attached: one older (PSU 2) and one brand new (PSU 3).
 
Last edited:

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
A SCSI error occurred during resilvering. This was using the brand new PSU 3 and the older PSU 2. Using only PSU 2, I was able to scrub the pool when it was degraded to 12 disks (down from 24). That made me think that PSU 2 was good, and that the previously removed PSU 1 was the cause of the errors. Now I guess that's not the case :(

I will remove PSU 2, and run the machine only on the brand new PSU 3. Resilvering continues.

Code:
Jun 14 12:52:16 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_7446801437149806016.case.
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): WRITE(10). CDB: 2a 00 91 76 91 c8 00 01 00 00 
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): CAM status: SCSI Status Error
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): SCSI status: Check Condition
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): Error 22, Unretryable error
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Resilvering finished without errors occurring after I had switched to running only on the new PSU 3. I have now started a scrub of the full 24-disk pool running only on PSU 3 (the brand new PSU). So now I'm thinking both original PSUs could be bad. I will report back once the current scrub is done.
 
Last edited:

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
3 SCSI read errors occurred while scrubbing the 24-disk pool using only the brand new PSU 3. ZFS reported no read/write errors in the pool.

I will scrub again to see if the only errors I now get are read errors. I'm used to seeing a mix of read and write errors.

I'm unsure what my next step should be.

If it is true that I should never get SCSI read errors, something is wrong.

So:
A: All 3 PSUs, including the brand new one, are bad.
B: Something is wrong with the backplanes (I have seen errors on disks attached to both). Should I update the firmware? How is it done?
C: The motherboard is bad.
D: The power cabling is bad.

Any suggestions?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have shut down the machine and connected both backplanes to my IBM HBA. I have ordered a second brand new PSU. I will scrub with two brand new PSUs.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Installing two brand new PSUs has not solved the problem. I still get SCSI errors during scrubbing/resilvering.

My understanding is that SCSI errors should never happen (do people agree on this?).

The latest error:
Code:
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00 length 131072 SMID 781 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Retrying command
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 85 b8 00 00 c8 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Info: 0x800885b8
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
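Worth noting: this latest error is different in kind from the earlier ones. `asc:11,0` is a valid sense code ("Unrecovered read error") reported by the drive itself, i.e. a genuine media error on da6, whereas the earlier `asc:ffffffff,ffffffff` is not a legal 8-bit ASC/ASCQ pair at all, which points at the controller or transport rather than the disks. A small sketch to illustrate the distinction (`describe_sense` is a hypothetical helper; the code table is a tiny excerpt from the SCSI spec's assignments):

```python
# A couple of ASC/ASCQ pairs from the SCSI spec, for illustration only:
KNOWN_ASC = {
    (0x11, 0x00): "Unrecovered read error",  # reported by the drive: real media error
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
}

def describe_sense(asc, ascq):
    """Map an ASC/ASCQ pair to text; flag values outside the 8-bit range."""
    if not (0 <= asc <= 0xFF and 0 <= ascq <= 0xFF):
        # asc:ffffffff,ffffffff as seen in the earlier logs cannot be a real
        # sense code -- the data likely never came from the drive at all.
        return "invalid/garbled sense data (suspect controller or transport)"
    return KNOWN_ASC.get((asc, ascq), f"unknown ASC/ASCQ {asc:#04x},{ascq:#04x}")

print(describe_sense(0x11, 0x00))  # Unrecovered read error
print(describe_sense(0xFFFFFFFF, 0xFFFFFFFF))
```

Under that reading, the da6 medium error is a candidate for a failing disk (worth a `smartctl -a /dev/da6` check), independent of whatever caused the garbled sense data before.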
 

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Have you done the calculations on how much power you need from the PSUs for all your drives and components? The maximum power needed should not exceed 80% of the rated PSU power.
You can also try adding one drive at a time, starting from the 12-disk config that you tested without errors.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Have you done the calculations on how much power you need from the PSUs for all your drives and components? The maximum power needed should not exceed 80% of the rated PSU power.

I have two 1400W PSUs. Looking at this guide (from this forum), I have come up with these numbers, based on the most conservative figures I could find:

My drives are WD Red 6TB. According to the datasheet, they peak at 12V * 1.79A = 21.5W.

PSU: 0.8 * 1400W = 1120W

Drives: 24 * 21.5W = 516W
Fans: 7 * 30W = 210W
RAM: 8 * 6W = 48W
CPU: 2 * 80W = 160W
HBA: 1 * 10W = 10W
Motherboard: 1 * 25W = 25W
Backplane: 2 * 50W = 100W (I'm just guessing here).

516W + 210W + 48W + 160W + 10W + 25W + 100W= 1069W

The guide suggests a PSU rated for 1069W * 1.25 ≈ 1336W.

So, it looks like the PSUs should be adequate. Did I miss something?
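For what it's worth, the arithmetic above can be double-checked in a few lines (all figures copied from this post; the backplane draw is, as noted, a guess):

```python
# Power budget from the post; the 80% derating and 1.25x sizing rule
# come from the forum guide cited above.
loads_w = {
    "drives (24 x 21.5W peak, WD Red 6TB)": 24 * 21.5,
    "fans (7 x 30W)": 7 * 30,
    "RAM (8 x 6W)": 8 * 6,
    "CPUs (2 x 80W)": 2 * 80,
    "HBA": 10,
    "motherboard": 25,
    "backplanes (2 x 50W, guessed)": 2 * 50,
}
total = sum(loads_w.values())    # 1069W
usable = 0.8 * 1400              # 80% of one 1400W PSU: 1120W
recommended = 1.25 * total       # guide's suggested PSU rating

print(f"total: {total:.0f}W, usable per PSU: {usable:.0f}W, "
      f"suggested rating: {recommended:.0f}W")
```

Since the total stays under the 1120W usable figure even on a single PSU, the numbers support the conclusion that a 1400W supply should be adequate on paper.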

You can also try adding one drive at a time, starting from the 12-disk config that you tested without errors.

Yes, it would be nice to reach a point where adding a single drive makes the SCSI errors occur, and removing a drive makes them go away.

I will remove 6 drives and run the pool with 18 drives.
 
Last edited:

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Can you post the make/model of PSUs?
 

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Are you on 110V or a 220V grid?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Usually for something beyond 12 drives, you want to get a redundant power supply. The fact of the matter is that 12 drives are unlikely to all spin up at exactly the same time and actually consume all your power. You'll notice that in most rackmount designs that there are two power supplies, either of which are capable of holding the load, often just barely, but when teamed they are both just lazily feeding power.

This makes me think I really should be OK running 24 drives on brand new redundant 1400W PSUs.
 