Random Drive Faults with new and tested Hardware

Cokehero · May 10, 2019

Yes

The new board and CPU will arrive next week.

The thing is, everything works fine as long as there is only write IO or only read IO. But that is simply not a realistic case.

Last night I was able to write about 25 TB to the disks, and as soon as the first jobs started to verify, the whole thing collapsed because there was still 7 Gbit/s of writing going on.

alexra · May 18, 2019

Cokehero, you are not the only one having those issues...

I experience the same with an X10SDV-TLN4F and an LSI-9305-16i, with 3 pools:
1 with 4 x 6TB WD Reds
2 with 4 x 1TB Seagate Barracudas
3 with 3 x 1.92TB Intel DC S4500

The "failures" of the drives are random. Below are the logs showing how it starts:

Code:

May 18 02:07:49 seel-stor02 #011(pass8: mpr0:0:9:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 278 Aborting command 0xfffffe0001620fa0
May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 9
May 18 02:07:49 seel-stor02 #011(pass7: mpr0:0:8:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 260 Aborting command 0xfffffe000161f5c0
May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 8
May 18 02:07:49 seel-stor02 mpr0: Unfreezing devq for target ID 9
May 18 02:07:49 seel-stor02 mpr0: Unfreezing devq for target ID 8
May 18 02:07:49 seel-stor02 #011(pass9: mpr0:0:10:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 814 Aborting command 0xfffffe0001651220
May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 10
May 18 02:07:49 seel-stor02 #011(pass10: mpr0:0:11:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 487 Aborting command 0xfffffe0001633c10
May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 11
May 18 02:07:49 seel-stor02 mpr0: Unfreezing devq for target ID 10
May 18 02:07:49 seel-stor02 mpr0: Unfreezing devq for target ID 11
May 18 02:07:49 seel-stor02 #011(pass5: mpr0:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00 length 0 SMID 501 Aborting command 0xfffffe0001635030
May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 5
May 18 02:07:49 seel-stor02 #011(pass4: mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 960 Aborting command 0xfffffe000165e400
.......
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): CAM status: SCSI Status Error
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): SCSI status: Check Condition
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): Error 6, Retries exhausted
May 18 02:23:50 seel-stor02 (da3: mpr0:0:3:0): Invalidating pack
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): CAM status: SCSI Status Error
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): SCSI status: Check Condition
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): Error 6, Retries exhausted
May 18 02:23:50 seel-stor02 (da2: mpr0:0:2:0): Invalidating pack
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): CAM status: SCSI Status Error
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): SCSI status: Check Condition
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): Error 6, Retries exhausted
May 18 02:23:50 seel-stor02 (da0: mpr0:0:0:0): Invalidating pack
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): CAM status: SCSI Status Error
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): SCSI status: Check Condition
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): Error 6, Retries exhausted
May 18 02:23:50 seel-stor02 (da1: mpr0:0:1:0): Invalidating pack
May 18 02:23:53 seel-stor02 #011(pass9: mpr0:0:10:0): INQUIRY. CDB: 12 00 00 00 40 00 length 64 SMID 342 Aborting command 0xfffffe0001626ba0
May 18 02:23:53 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 10
May 18 02:23:53 seel-stor02 mpr0: Unfreezing devq for target ID 10
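
To see which targets and command types are hit, the "Aborting command" lines can be tallied with a short script. This is only a minimal sketch, assuming the log lines look like the excerpt above; `ABORT_RE` and `tally_aborts` are names made up for illustration, and the sample lines are copied from the log.

```python
import re
from collections import Counter

# Matches mpr abort lines such as:
#   (pass8: mpr0:0:9:0): INQUIRY. CDB: ... Aborting command 0x...
# Assumption: the format is the same as in the excerpt above.
ABORT_RE = re.compile(
    r"\((?P<dev>\w+): mpr\d+:\d+:(?P<target>\d+):\d+\): "
    r"(?P<command>.+?)\. CDB: .*Aborting command"
)


def tally_aborts(lines):
    """Count aborted commands per (target ID, SCSI command name)."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[(int(match["target"]), match["command"])] += 1
    return counts


# Sample lines copied from the log above (one of them is not an abort line).
SAMPLE = [
    "May 18 02:07:49 seel-stor02 #011(pass8: mpr0:0:9:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 278 Aborting command 0xfffffe0001620fa0",
    "May 18 02:07:49 seel-stor02 mpr0: Sending reset from mprsas_send_abort for target ID 9",
    "May 18 02:07:49 seel-stor02 #011(pass7: mpr0:0:8:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 260 Aborting command 0xfffffe000161f5c0",
    "May 18 02:07:49 seel-stor02 #011(pass5: mpr0:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00 length 0 SMID 501 Aborting command 0xfffffe0001635030",
]

if __name__ == "__main__":
    for (target, command), n in sorted(tally_aborts(SAMPLE).items()):
        print(f"target {target}: {n} x {command}")
```

To run it against a full log, pass the lines of the saved log file to `tally_aborts` instead of `SAMPLE`. In the excerpt above, every aborted command was issued through a pass device (INQUIRY or ATA pass-through), while the da devices only show errors later, at SYNCHRONIZE CACHE.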

In my first round of investigation I could not read the drives properly, and the controller did not respond properly to mprutil, but sas3flash, for example, manages to communicate with it without any issues.

One funny thing I need to mention: one drive "failure" related to the ZIL was actually just one of the partitions, the one used in the pool with the 1TB Seagate drives. I doubt that one partition could fail while the other one works fine in the other pool.
Unfortunately I can't copy and paste from the console output, hence the attached screenshot.

The motherboard BIOS is up to date, as is the controller firmware (16.00.01.00 / BSD 18.00.00.00), and everything sits in a Supermicro 3U case, SC836A-R1200B.

Also, this issue came along with 11.2, so I suspect something related to the FreeBSD driver.
