SOLVED Various SCSI sense errors during scrubbing

Status
Not open for further replies.

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
You put all your drives back yet?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
The 4th scrub of the degraded 12-mirror pool finished with no SCSI errors while running only on PSU 2. I have concluded that PSU 1 was the cause of the random SCSI errors I was seeing.

I have now re-added the missing 12 disks to the pool. Thus, the pool is back to 24 6TB disks in 12 mirrors. PSU 2 should be able to handle the load without SCSI errors occurring.

The pool is resilvering.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
During resilvering of the 24-disk pool, running only on PSU 2, a SCSI error occurred:
Code:
Jun 13 10:39:35 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_17769119138064296099.case.
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): WRITE(16). CDB: 8a 00 00 00 00 01 64 be e7 60 00 00 00 c0 00 00 
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): CAM status: SCSI Status Error
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): SCSI status: Check Condition
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 13 10:39:35 ultraman (da11:isci0:0:33:0): Error 22, Unretryable error


Also, ZFS started resilvering the entire pool again as soon as the original resilvering was done. What could be the reason for this?

I will wait for the resilvering to complete on PSU 2 only. I wonder if it will start yet another resilvering once it's done.

I'm now thinking that both of my PSUs need to be replaced. I have bought a new one, which I will try out once resilvering is complete.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Did you look into updating the backplanes' firmware?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
More SCSI errors are occurring during what looks like a never-ending resilvering process.
Code:
Jun 13 16:43:31 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_13826095765754444936.case.
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): WRITE(10). CDB: 2a 00 43 45 ac 60 00 01 00 00
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): CAM status: SCSI Status Error
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI status: Check Condition
Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
...
...
Jun 14 01:47:52 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_9506731245968758733.case.
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): WRITE(10). CDB: 2a 00 a9 42 84 08 00 00 c0 00
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): CAM status: SCSI Status Error
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI status: Check Condition
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): Error 22, Unretryable error
...
...Jun 14 05:54:06 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_15352642184540883744.case.
Jun 14 05:54:06 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_15352642184540883744.case.
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): WRITE(10). CDB: 2a 00 06 c6 58 90 00 00 90 00
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): CAM status: SCSI Status Error
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI status: Check Condition
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): Error 22, Unretryable error
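With the errors now spread across several devices (da11, da18, da20, da22), it may help to tally which drives are affected before pulling hardware. A minimal sketch, assuming the syslog format shown above; the script and its helper name are my own, not an existing tool:

```python
import re
from collections import Counter

# Matches the device name in CAM sense lines such as:
# (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff ...
SENSE_LINE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: (.+)")

def tally_sense_errors(log_lines):
    """Count SCSI sense errors per device from syslog lines."""
    counts = Counter()
    for line in log_lines:
        m = SENSE_LINE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Sample lines copied from the log excerpts above:
sample = [
    "Jun 13 16:43:31 ultraman (da18:isci0:0:37:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
    "Jun 14 01:47:52 ultraman (da20:isci0:0:38:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
    "Jun 14 05:54:06 ultraman (da22:isci0:0:39:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)",
]
print(tally_sense_errors(sample))
```

If the counts cluster on one backplane's drive slots, that would point at that backplane or its power cabling rather than the PSUs.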
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have now added my brand new PSU to the server. I'm now running with PSU 2 (which was OK with 12 disks, but apparently not with 24) and my brand new PSU 3.

So, all 12 mirrors are now resilvering with two PSUs attached: one older (PSU 2) and one brand new (PSU 3).
 
Last edited:

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
A SCSI error occurred during resilvering. This was using the brand new PSU 3 and the older PSU 2. Using only PSU 2, I was able to scrub the pool when it was degraded to 12 disks (down from 24). That made me think that PSU 2 was good, and that the previously removed PSU 1 was the cause of the errors. Now I guess that's not the case :(

I will remove PSU 2, and run the machine only on the brand new PSU 3. Resilvering continues.

Code:
Jun 14 12:52:16 ultraman zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_16006326459371220184_vdev_7446801437149806016.case.
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): WRITE(10). CDB: 2a 00 91 76 91 c8 00 01 00 00 
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): CAM status: SCSI Status Error
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): SCSI status: Check Condition
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): SCSI sense: ILLEGAL REQUEST asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
Jun 14 12:52:16 ultraman (da4:isci0:0:30:0): Error 22, Unretryable error
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Resilvering finished without errors occurring after I had switched to running only on the new PSU 3. I have now started a scrub of the full 24-disk pool running only on PSU 3 (the brand new PSU). So now I'm thinking both original PSUs could be bad. I will report back once the current scrub is done.
 
Last edited:

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
3 SCSI read errors occurred while scrubbing the 24-disk pool using only the brand new PSU 3. ZFS reported no read/write errors in the pool.

I will scrub again to see if the only errors I now get are read errors. I'm used to seeing a mix of read and write errors.

I'm unsure what my next step should be.

If it is true that I should never get SCSI read errors, something is wrong.

So:
A: All 3 PSUs, including the brand new one, are bad.
B: Something is wrong with the backplanes (I have seen errors on disks attached to both). Should I update the firmware? How is it done?
C: The motherboard is bad.
D: The power cabling is bad.

Any suggestions?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
I have shut down the machine and connected both backplanes to my IBM HBA. I have ordered a second brand new PSU. I will scrub with two brand new PSUs.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Installing two brand new PSUs has not solved the problem. I still get SCSI errors during scrubbing/resilvering.

My understanding is that SCSI errors should never happen (do people agree on this?).

The latest error:
Code:
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00 length 131072 SMID 781 terminated ioc 804b scsi 0 state 0 xfer 0
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 86 80 00 01 00 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: CCB request completed with an error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Retrying command
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): READ(10). CDB: 28 00 80 08 85 b8 00 00 c8 00
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): CAM status: SCSI Status Error
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI status: Check Condition
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Info: 0x800885b8
Jun 22 16:46:50 ultraman (da6:mps0:0:17:0): Error 5, Unretryable error
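Worth noting: this latest error is different in kind from the earlier ones. `asc:11,0` is a valid sense code ("Unrecovered read error") reported by the drive itself, i.e. a genuine media error on da6, whereas the earlier `asc:ffffffff,ffffffff` is not a legal 8-bit ASC/ASCQ pair at all, which points at the controller or transport rather than the disks. A small sketch to illustrate the distinction (`describe_sense` is a hypothetical helper; the code table is a tiny excerpt from the SCSI spec's assignments):

```python
# A couple of ASC/ASCQ pairs from the SCSI spec, for illustration only:
KNOWN_ASC = {
    (0x11, 0x00): "Unrecovered read error",  # reported by the drive: real media error
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
}

def describe_sense(asc, ascq):
    """Map an ASC/ASCQ pair to text; flag values outside the 8-bit range."""
    if not (0 <= asc <= 0xFF and 0 <= ascq <= 0xFF):
        # asc:ffffffff,ffffffff as seen in the earlier logs cannot be a real
        # sense code -- the data likely never came from the drive at all.
        return "invalid/garbled sense data (suspect controller or transport)"
    return KNOWN_ASC.get((asc, ascq), f"unknown ASC/ASCQ {asc:#04x},{ascq:#04x}")

print(describe_sense(0x11, 0x00))  # Unrecovered read error
print(describe_sense(0xFFFFFFFF, 0xFFFFFFFF))
```

Under that reading, the da6 medium error is a candidate for a failing disk (worth a `smartctl -a /dev/da6` check), independent of whatever caused the garbled sense data before.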
 

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Have you done the calculations on how much power you need from the PSUs for all your drives and components? The maximum power needed should not exceed 80% of the rated PSU power.
You can also try adding one drive at a time, starting from the 12-disk config that you tested without errors.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Have you done the calculations on how much power you need from the PSUs for all your drives and components? The maximum power needed should not exceed 80% of the rated PSU power.

I have two 1400W PSUs. Looking at this guide (from this forum), I have come up with these numbers, based on the most conservative figures I could find:

My drives are WD Red 6TB. According to the datasheet, they peak at 12V * 1.79A = 21.5W.

PSU: 0.8 * 1400W = 1120W

Drives: 24 * 21.5W = 516W
Fans: 7 * 30W = 210W
RAM: 8 * 6W = 48W
CPU: 2 * 80W = 160W
HBA: 1 * 10W = 10W
Motherboard: 1 * 25W = 25W
Backplane: 2 * 50W = 100W (I'm just guessing here).

516W + 210W + 48W + 160W + 10W + 25W + 100W= 1069W

The guide suggests a PSU rated for 1069W * 1.25 ≈ 1336W.

So, it looks like the PSUs should be adequate. Did I miss something?
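For what it's worth, the arithmetic above can be double-checked in a few lines (all figures copied from this post; the backplane draw is, as noted, a guess):

```python
# Power budget from the post; the 80% derating and 1.25x sizing rule
# come from the forum guide cited above.
loads_w = {
    "drives (24 x 21.5W peak, WD Red 6TB)": 24 * 21.5,
    "fans (7 x 30W)": 7 * 30,
    "RAM (8 x 6W)": 8 * 6,
    "CPUs (2 x 80W)": 2 * 80,
    "HBA": 10,
    "motherboard": 25,
    "backplanes (2 x 50W, guessed)": 2 * 50,
}
total = sum(loads_w.values())    # 1069W
usable = 0.8 * 1400              # 80% of one 1400W PSU: 1120W
recommended = 1.25 * total       # guide's suggested PSU rating

print(f"total: {total:.0f}W, usable per PSU: {usable:.0f}W, "
      f"suggested rating: {recommended:.0f}W")
```

Since the total stays under the 1120W usable figure even on a single PSU, the numbers support the conclusion that a 1400W supply should be adequate on paper.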

You can also try adding one drive at a time, starting from the 12-disk config that you tested without errors.

Yes, it would be nice to reach a point where adding a single drive makes the SCSI errors occur, and removing a drive makes them go away.

I will remove 6 drives and run the pool with 18 drives.
 
Last edited:

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Can you post the make/model of PSUs?
 

Artion

Patron
Joined
Feb 12, 2016
Messages
331
Are you on 110V or a 220V grid?
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Usually for something beyond 12 drives, you want to get a redundant power supply. The fact of the matter is that 12 drives are unlikely to all spin up at exactly the same time and actually consume all your power. You'll notice that in most rackmount designs that there are two power supplies, either of which are capable of holding the load, often just barely, but when teamed they are both just lazily feeding power.

This makes me think I really should be OK running 24 drives on brand new redundant 1400W PSUs.
 