9305-24i Upgrade and issues after

VlaDeMaN · Jun 4, 2018

Hi everyone.

I got an email from my FreeNAS server and I'm not sure what is going on. Backstory below. Could this be a bad cable or two? A bad HBA? The 12 drives this is happening on are the original 12 I built this array with 2 years ago. They're Seagate 6TB enterprise drives, all from the same batch (I know).

serverfqdn.com kernel log messages:
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0d c3 8e a8 00 00 00 18 00 00 length 12288 SMID 95 Aborting command 0xfffffe00011bf890
> mpr0: Sending reset from mprsas_send_abort for target ID 9
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0d c3 88 b8 00 00 00 18 00 00 length 12288 SMID 115 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> (da9:mpr0:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 981 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
> mpr0: Unfreezing devq for target ID 9
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0d c3 88 b8 00 00 00 18 00 00
> (da9:mpr0:0:9:0): CAM status: CCB request completed with an error
> (da9:mpr0:0:9:0): Retrying command
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0d c3 8e a8 00 00 00 18 00 00
> (da9:mpr0:0:9:0): CAM status: Command timeout
> (da9:mpr0:0:9:0): Retrying command
> (da9:mpr0:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da9:mpr0:0:9:0): CAM status: CCB request completed with an error
> (da9:mpr0:0:9:0): Retrying command
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0d c3 8e a8 00 00 00 18 00 00
> (da9:mpr0:0:9:0): CAM status: SCSI Status Error
> (da9:mpr0:0:9:0): SCSI status: Check Condition
> (da9:mpr0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da9:mpr0:0:9:0): Retrying command (per sense data)
> (da9:mpr0:0:9:0): WRITE(16). CDB: 8a 00 00 00 00 01 0a f5 25 20 00 00 00 40 00 00
> (da9:mpr0:0:9:0): CAM status: SCSI Status Error
> (da9:mpr0:0:9:0): SCSI status: Check Condition
> (da9:mpr0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da9:mpr0:0:9:0): Retrying command (per sense data)
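For anyone trying to read the CDB bytes in the log above: a WRITE(16) CDB carries an 8-byte big-endian LBA in bytes 2-9 and a 4-byte transfer length (in blocks) in bytes 10-13. Below is a minimal Python sketch that pulls those out. The function name `decode_rw16` is made up for illustration, and the 512-byte block size is an assumption; it happens to match the `length 12288` the driver printed for a 0x18-block write.

```python
def decode_rw16(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode the LBA and transfer length from a READ(16)/WRITE(16) CDB.

    cdb_hex is the space-separated hex string printed after "CDB:".
    block_size is an assumption (512-byte logical sectors).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")

    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"unsupported opcode 0x{cdb[0]:02x}")

    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length in blocks
    return {
        "command": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    # First aborted write from the log above.
    info = decode_rw16("8a 00 00 00 00 01 0d c3 8e a8 00 00 00 18 00 00")
    print(info["command"], hex(info["lba"]), info["blocks"], info["bytes"])
```

Decoding a few of the failing commands this way shows where on the disk they land, which helps tell one bad region on one drive apart from timeouts on unrelated writes.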

Last week I upgraded one of our 12-drive FreeNAS servers (two striped raidz2 vdevs, I'll call them #1 and #2) with a 9305-24i, the appropriate cables (SFF-8643 to SFF-8087) and an additional 12 drives, to fill up the whole array. It had a 9201-16i before (SFF-8087 to SFF-8087). The chassis is a 24-port Norco with an 800W Athena Power dual PSU. We're on FreeNAS-11.1-U4.

The swap went great: everything worked, the new drives were recognized and set up, and the new 9305 card was updated to the latest firmware. Two days later, one of the drives in the striped array, drive number 3 in raidz2 #2, started getting a ton of read errors. I wasn't around for 2 more days, so I couldn't swap it out until Friday. We have backups, so whatever. Friday comes along and another drive is having the same issue, but kinda luckily in raidz2 #1 this time. It ends up being drive #9, on a completely separate row, cable and port on the card. I only had 2 spares, so that worked out fine; the drives rebuilt in a day and we're cooking again. The additional 12 drives are running flawlessly.

VlaDeMaN · Jun 8, 2018

bump?

Borja Marcos · Jun 8, 2018

No clue from the errors, but a couple of years ago I suffered a vaguely similar issue. In our case it was a SAS backplane with SATA SSDs and a SAS3 HBA (LSI 3008). Now and then ZFS registered an error or two, with some error bursts in /var/log/messages.

Having enough redundancy we never lost any data, a scrub resilvered the troublesome files and everything was fine.

I blamed the backplane (even had quite a fight with IBM, who refused to do anything because their diagnostics didn't detect anything) and it turned out it was the HBA. I swapped it with an older LSI2008 card and, voilà, fixed!

It was a defective card; I had other identical systems with the LSI3008 working like a charm. It just happened that I had an LSI2008 card immediately available.

I guess that the HBA had some random timing error that made commands fail under high load. I was even able to reproduce it reliably by running several bonnie++ processes in parallel. Of course I don't have a multi-GHz oscilloscope with SAS probes to look at it ;)

Code:

Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0:	(da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)
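The `log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)` lines above suggest how the 32-bit loginfo word is split up, and the same split can be applied to the bare `loginfo 31130000` in the first post. Here is a rough Python sketch of that decoding. The bit layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit sub-code) and the originator names IOP/PL/IR are my assumption about how the driver formats that line, so check them against the mpr driver source before relying on them; `decode_loginfo` is just an illustrative name.

```python
# Assumed originator numbering; only "PL" is confirmed by the log above.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value: int) -> dict:
    """Split a 32-bit loginfo word into its assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # top nibble (3 in both logs)
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    for raw in (0x31120440, 0x31130000):
        f = decode_loginfo(raw)
        print(f"0x{raw:08x}: originator({f['originator']}), "
              f"code(0x{f['code']:02x}), sub_code(0x{f['sub_code']:04x})")
```

With this split, the value from the first post comes out as originator PL with code 0x13 and sub-code 0x0000, so it is a different code from the 0x12 in my logs. The sketch does not say what either code means.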
