SOLVED LSI/Avago SAS HBA problem possibly leading to disconnected drives

Status
Not open for further replies.

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Hi all,

My installation is stock FreeBSD, but since so many people here know a lot about storage, I hope you will permit me to ask in this forum, too ;)

I just got the third SSD drive failing in the last couple of weeks. For the first two drives we just noticed this:
Code:
	NAME        STATE     READ WRITE CKSUM
	zdata       DEGRADED     0     0     0
	  mirror-0  DEGRADED     0     0     0
	    da2p1   FAULTED     15   248     0  too many errors
	    da3p1   ONLINE       0     0     0

We then replaced the supposedly failed drive and resilvered, and everything was OK.

Now, with the third drive failing, I dug a little deeper. First, SMART suggests the drive is perfectly healthy, if I'm not mistaken:
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3646
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       28
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       10
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   079   066   000    Old_age   Always       -       21
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       11929282916

SMART Error Log Version: 1
No Errors Logged
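
A quick way to repeat this check on other drives is to filter the attribute table for error counters with a non-zero raw value. This is only a minimal sketch: the function name smart_nonzero_errors and the choice of attribute IDs (5, 181, 182, 183, 187, 199) are my own assumptions, not something the drive vendor prescribes. It expects the ten-column table that smartctl -A prints.

```sh
#!/bin/sh
# Hypothetical helper: list error-related SMART attributes whose RAW_VALUE
# (column 10 of the attribute table) is non-zero.
# The ID list is an assumption; adjust it for your drive model.
smart_nonzero_errors() {
    awk '$1 ~ /^(5|181|182|183|187|199)$/ && $10 + 0 != 0 { print $1, $2, $10 }'
}

# Usage (device name taken from this thread):
#   smartctl -A /dev/da2 | smart_nonzero_errors
```

For the table above this prints nothing, since all of those counters are 0. A non-zero attribute 199 (UDMA_CRC_Error_Count) is commonly read as a hint at link or cabling trouble rather than worn flash.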


So what's going on? This:
Code:
	(da2:mpr0:0:4:0): READ(10). CDB: 28 00 07 db d0 21 00 00 07 00 length 3584 SMID 197 Aborting command 0xfffffe000107ab30
mpr0: Sending reset from mprsas_send_abort for target ID 4
	(da2:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 883 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
	(da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 13 4c 65 7d 00 00 01 00 length 512 SMID 828 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
	(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 919 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
(da2:mpr0:0:4:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
mpr0: Unfreezing devq for target ID 4
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 13 4c 65 7d 00 00 01 00 
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): READ(10). CDB: 28 00 07 db d0 21 00 00 07 00 
(da2:mpr0:0:4:0): CAM status: Command timeout
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): READ(10). CDB: 28 00 07 db d0 21 00 00 07 00 
(da2:mpr0:0:4:0): CAM status: SCSI Status Error
(da2:mpr0:0:4:0): SCSI status: Check Condition
(da2:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpr0:0:4:0): Retrying command (per sense data)
	(da2:mpr0:0:4:0): READ(10). CDB: 28 00 07 db c1 b5 00 00 07 00 length 3584 SMID 995 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
	(da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 27 01 0d a0 00 00 10 00 length 8192 SMID 348 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da2:mpr0:0:4:0): READ(10). CDB: 28 00 07 db c1 b5 00 00 07 00 
	(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 304 terminated ioc 804b loginfo 31110e03 scsi 0 state c xfer 0
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): WRITE(10). CDB: 2a 00 27 01 0d a0 00 00 10 00 
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mpr0:0:4:0): CAM status: CCB request completed with an error
(da2:mpr0:0:4:0): Retrying command
(da2:mpr0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da2:mpr0:0:4:0): CAM status: SCSI Status Error
(da2:mpr0:0:4:0): SCSI status: Check Condition
(da2:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpr0:0:4:0): Error 6, Retries exhausted
(da2:mpr0:0:4:0): Invalidating pack
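
To see at a glance which devices are affected and how often, the "CAM status" lines of a log like this one can be tallied. A minimal sketch, assuming the kernel messages arrive on stdin (for example from dmesg); the function name cam_status_summary is made up for illustration:

```sh
#!/bin/sh
# Hypothetical helper: count "CAM status" messages per device and status text.
cam_status_summary() {
    # Turn lines such as
    #   (da2:mpr0:0:4:0): CAM status: Command timeout
    # into "da2 Command timeout". The leading .* also matches when another
    # message was printed in front of it on the same line.
    sed -n 's/.*(\([a-z]*[0-9]*\):[^)]*): CAM status: \(.*\)$/\1 \2/p' |
        sort | uniq -c |
        awk '{ $1 = $1; print }'   # squeeze the padding that uniq -c adds
}

# Usage:
#   dmesg | cam_status_summary
```

Each output line has the form "count device status", so a single misbehaving target stands out from a controller-wide problem.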


This is what the kernel has to tell me about the controller:
Code:
mpr0: <Avago Technologies (LSI) SAS3008> port 0xe000-0xe0ff mem 0xdf240000-0xdf24ffff,0xdf200000-0xdf23ffff irq 16 at device 0.0 on pci1
mpr0: Firmware: 10.00.03.00, Driver: 15.03.00.00-fbsd
mpr0: IOCCapabilities: 6985c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR,MSIXIndex,FastPath,RDPQArray>
mpr0: SAS Address for SATA device = a6949033fdccc5a0
mpr0: SAS Address from SAS device page0 = 4433221100000000
mpr0: SAS Address from SATA device = a6949033fdccc5a0
mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 0
mpr0: At enclosure level 0 and connector name (	)
mpr0: SAS Address for SATA device = a08d8d33fdccc5a1
mpr0: SAS Address from SAS device page0 = 4433221101000000
mpr0: SAS Address from SATA device = a08d8d33fdccc5a1
mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000a> enclosureHandle<0x0001> slot 1
mpr0: At enclosure level 0 and connector name (	)
mpr0: SAS Address for SATA device = a4b8b10cdbcbc695
mpr0: SAS Address from SAS device page0 = 4433221102000000
mpr0: SAS Address from SATA device = a4b8b10cdbcbc695
mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000b> enclosureHandle<0x0001> slot 2
mpr0: At enclosure level 0 and connector name (	)
mpr0: SAS Address for SATA device = a5b1a50cdbcbc695
mpr0: SAS Address from SAS device page0 = 4433221103000000
mpr0: SAS Address from SATA device = a5b1a50cdbcbc695
mpr0: Found device <881<SataDev,Direct>,End Device> <6.0Gbps> handle<0x000c> enclosureHandle<0x0001> slot 3
mpr0: At enclosure level 0 and connector name (	)


Any ideas?

Thanks,
Patrick
 

Patrick M. Hausen

And as expected, camcontrol reset followed by zpool offline and zpool online restored everything to working order. I wonder for how long ...
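
For anyone wanting to repeat that sequence, here is a minimal sketch that only prints the commands so they can be reviewed before anything is run. The function name print_recovery_cmds is hypothetical, and the arguments in the usage line (0:4:0, zdata, da2p1) are taken from the log and pool status in this thread. Note that the bus number camcontrol expects is the scbus number shown by camcontrol devlist, which I am assuming to be 0 here.

```sh
#!/bin/sh
# Hypothetical helper: print (do not run) the recovery commands.
#   $1 = bus:target:lun as listed by `camcontrol devlist`
#   $2 = pool name
#   $3 = vdev that was faulted
print_recovery_cmds() {
    printf 'camcontrol reset %s\n' "$1"        # reset the affected target
    printf 'zpool offline %s %s\n' "$2" "$3"   # take the vdev offline ...
    printf 'zpool online %s %s\n' "$2" "$3"    # ... and bring it back
    printf 'zpool status %s\n' "$2"            # check the result
}

# Usage: review the output first, then pipe it to sh if it looks right.
#   print_recovery_cmds 0:4:0 zdata da2p1
#   print_recovery_cmds 0:4:0 zdata da2p1 | sh -x
```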
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would check for an update on the firmware first.

 

Patrick M. Hausen

The drives' or the controller's? Or both? ;)

Thanks
Patrick
 

Chris Moore

I was thinking about the controller, but it might be a reasonable idea to check the drive also.

 

Patrick M. Hausen

There's wayyyy more recent firmware available, and I think I should flash it to IT mode while I'm at it. Going to schedule a maintenance window for a non-critical system and let you know how it goes.

Thanks
Patrick
 

Chris Moore

Code:
mpr0: <Avago Technologies (LSI) SAS3008> port 0xe000-0xe0ff mem 0xdf240000-0xdf24ffff,0xdf200000-0xdf23ffff irq 16 at device 0.0 on pci1
mpr0: Firmware: 10.00.03.00, Driver: 15.03.00.00-fbsd
mpr0: IOCCapabilities: 6985c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR,MSIXIndex,FastPath,RDPQArray>
I thought the firmware looked a little old. Often, but not always, the firmware and driver come out together.
Patrick M. Hausen said: "There's wayyyy more recent firmware available and I think I should flash it to IT mode"
IT mode firmware is certainly a good choice. I didn't notice that when I was looking at it the first time, from my phone.
 

Patrick M. Hausen

OK, now I don't have a perfect match since the firmware offered for download by Supermicro is phase 16 while the FreeBSD driver seems to be phase 15, but that's hopefully still better than phase 10 ...
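
To compare the two numbers across several machines, the leading field of each version can be pulled out of the boot messages. This is a rough sketch under two assumptions of mine: that the first field of the version string is the "phase" (which is how I read it), and that the line has the same format as in the dmesg output earlier in the thread. The function name mpr_phases is made up.

```sh
#!/bin/sh
# Hypothetical helper: print firmware and driver phase for each mpr controller.
# Reads kernel messages on stdin and looks for lines like
#   mpr0: Firmware: 10.00.03.00, Driver: 15.03.00.00-fbsd
mpr_phases() {
    sed -n 's/^\(mpr[0-9]*\): Firmware: \([0-9]*\)\.[0-9.]*, Driver: \([0-9]*\)\..*/\1 firmware=\2 driver=\3/p'
}

# Usage:
#   dmesg | mpr_phases
```

For the controller as it was before flashing, this prints "mpr0 firmware=10 driver=15".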

Flashing and changing to IT mode went perfectly smoothly from a FreeDOS USB drive - remember to read and note your SAS HBA MAC address before you flash ;-) There's a sticker on the mainboard, but that's pretty inconvenient for a production machine already mounted in a rack.

One question, Chris, if you happen to know - is this supposed to work from a running FreeBSD system?
Code:
mprutil flash update firmware 3008IT16.ROM


When trying to flash IR firmware on a controller that is in IT mode, or vice versa, I get an error message like this:
Code:
root# mprutil flash update firmware 3008IR16.ROM
mprutil: Invalid image:
mprutil:   Expected Product ID: 2221
mprutil:   Image Product ID: 2721


When trying to flash matching firmware, I get this error instead:
Code:
root# mprutil flash update firmware 3008IT16.ROM
Updating firmware...
mprutil: Fail to update firmware
root# dmesg
[...]
mpr0: mpr_user_pass_thru: user reply buffer (64) smaller than returned buffer (68)
mpr0: mpr_user_command: unsupported parameter or unsupported function in request (function = 0x9)


Thanks and kind regards,
Patrick
 

Patrick M. Hausen

I changed the thread title to better match the topic, in case somebody stumbles upon this looking for instructions on flashing LSI controllers ...

Viewed from a running FreeBSD/FreeNAS system, the SAS MAC you are looking for is the Logical ID shown here:
Code:
root@freebsd:~ # mprutil show all
Adapter:
mpr0 Adapter:
       Board Name: LSI3008-IT
   Board Assembly:
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 8.37.00.00
Firmware Revision: 16.00.01.00
  Integrated RAID: no
[...]
Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500304801e285d01               0001     Direct Attached SGPIO
[...]


Although the heading "Enclosures" suggests something different, this is the HBA ID that is also printed on the mainboard. You need the lower 9 digits for flashing (01e285d01 in the output above). So look them up, take a note, then reboot the system into DOS to flash.
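
Picking out those digits can be scripted so that nothing gets mistyped. A minimal sketch, assuming the output layout shown above, with the ID in the second column of the first data row under "Enclosures:"; the function name hba_id_suffix is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: print the last 9 hex digits of the first Logical ID
# listed in the Enclosures section of `mprutil show all` (read from stdin).
hba_id_suffix() {
    awk '
        /^Enclosures:/ { enc = 1; next }      # start of the Enclosures section
        enc && length($2) == 16 && $2 ~ /^[0-9a-fA-F]+$/ {
            print substr($2, length($2) - 8)  # characters 8 to 16 of the ID
            exit
        }'
}

# Usage:
#   mprutil show all | hba_id_suffix
```

With the output shown above this prints 01e285d01.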

HTH,
Patrick
 

Chris Moore

Patrick M. Hausen said: "One question, Chris, if you happen to know - is this supposed to work from a running FreeBSD system?"

I always boot from the FreeDOS USB drive. Flashing from within FreeNAS has never worked for me.

 

Touche

Explorer
Joined
Nov 26, 2016
Messages
55
Patrick M. Hausen said: "I just got the third SSD drive failing in the last couple of weeks. For the first two drives we just noticed this..."

Unfortunately, this problem has been around for years with LSI controllers and FreeBSD. You can see my issues with it here.
How do you find FW 16? Did it resolve the issue?
 

Patrick M. Hausen

Flashed 5 servers without problems, but the systems are not yet under load. I'll report in a couple of weeks.

Kind regards
Patrick
 

Patrick M. Hausen

Flashed the first two production systems with heavy load. Looks good so far. There is still a firmware/driver mismatch: FreeBSD 11.2 seems to come with an updated driver, and there is nothing newer than firmware 16 on Supermicro's FTP server:
Code:
mpr0: Firmware: 16.00.01.00, Driver: 18.03.00.00-fbsd


If I don't get a disconnect of a drive for a week, I'll report again. Sooner, if I do ;)

Patrick
 

Patrick M. Hausen

No more disconnects so far.

Patrick
 

Patrick M. Hausen

FreeBSD 11.2, not FreeNAS.

Patrick
 