CAM status: SCSI Status Error, Does this mean my HDD controller has failed?

pnunn

Dabbler
Joined
Jan 31, 2015
Messages
39
Hi guys, I have one disk in my array that keeps failing.

I replaced the HDD and that seemed to fix the issue at first, but then the SCSI errors started again and the disk dropped out of the array.

I've reseated the cables on the controller and rebooted, but the SCSI errors came so fast that the machine could not boot. Once I pulled the disk, the machine booted normally.

The host is a Dell R520 with the HBA swapped for one reflashed to be non-RAID.

smartctl -x /dev/da7 gives

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

root@freenas2[~]# smartctl -x /dev/da7
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: EG0600JETKA
Revision: HPD2
Compliance: SPC-4
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x50000397281b32e9
Serial number: 76C0A1D7FUYB1628
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Mar 12 19:24:01 2022 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 27 C
Drive Trip Temperature: 60 C

Manufactured in week 28 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 89
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 89
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1607         0         0          0    1207745.270           0
write:         0     1227         0         0          0      38174.543           0

Non-medium error count: 12409

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   39613                 - [-   -    -]
# 2  Background short  Completed                   -   39573                 - [-   -    -]

Long (extended) Self-test duration: 5089 seconds [84.8 minutes]

Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 39708:51 [2382531 minutes]
Number of background scans performed: 1539, scan progress: 9.91%
Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 2
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: loss of dword synchronization
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x50000397281b32ea
attached SAS address = 0x5c81f660e19e1c06
attached phy identifier = 2
Invalid DWORD count = 8
Running disparity error count = 8
Loss of DWORD synchronization = 2
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 8
Running disparity error count: 8
Loss of dword synchronization count: 2
Phy reset problem count: 0
Elasticity buffer overflow count: 0
Received abandon-class OPEN_REJECT count: 0
Transmitted BREAK count: 0
Received BREAK count: 0
Transmitted SSP frame error count: 195
Received SSP frame error count: 0
relative target port id = 2
generation code = 2
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x50000397281b32eb
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0

Should I change the HBA? Is it the disk, do I give up and replace the R520?

Peter.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I would post a hardware spec first
 

pnunn

Dell R520
2 x Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz (2400.05-MHz K8-class CPU)
The problem disk
da7 at mps0 bus 0 scbus0 target 14 lun 0
da7: <HP EG0600JETKA HPD2> Fixed Direct Access SPC-4 SCSI device
da7: Serial Number 76C0A1D7FUYB1628
da7: 600.000MB/s transfers
da7: Command Queueing enabled
Dell H310 MiNi Monolithic K09CJ with LSI 9211-8i P20 IT Mode ZFS FreeNAS unRAID
TrueNAS-12.0-U8

Does that help?
 

pnunn

I've just ordered another HBA. Hopefully that fixes the issue??
 

pnunn

OK, the new HBA didn't fix it... but a replacement HDD backplane did. So now I've got a spare HBA in IT mode and one in RAID mode if anyone wants them :)
 

NugentS

Spares are always worth keeping
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
OK, HBA didn't fix it... but a replacement hdd back plane did.

Yeah, that's kinda the nature of the business. We've been slowly upgrading from LSI 2008 to LSI 3008, and after upgrading one host, it developed a weird problem. After I temporarily replaced the controller and cable on the bench, the weird problems remained, and not having spare backplanes for that particular chassis, I ordered one for $25 on eBay from one of the usual suspects. I swapped it in... and the problem remained. On top of that, the replacement had one HDD LED that didn't work. It turned out that there was something weird about the HP 8643-8087 cable I had used for four of the ports, and the alternate cable I had tried was the exact same part. Ugh. Fortunately, I did not feel bad returning the backplane because it did have a bad LED, and, yay, free eBay returns.

Home hobbyists look at me with a pained expression when I say that debugging is often a process of swapping things out until you find the actual culprit. Yes I know it sucks.
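
One way to make the swap-and-retest loop a little less blind is to snapshot the drive's SAS phy counters (the "Invalid DWORD count", "Running disparity error count" and similar lines in the smartctl -x output earlier in this thread) before and after each swap, and see which ones are still climbing. Below is a minimal Python sketch, assuming the "name = value" layout shown in that output. The names parse_phy_counters, counter_increases and SAMPLE are made up for illustration and are not part of smartmontools.

```python
import re

# Counter names as they appear in the "Protocol Specific port log page for
# SAS SSP" section of the smartctl -x output quoted in this thread.
LINK_COUNTERS = (
    "Invalid DWORD count",
    "Running disparity error count",
    "Loss of DWORD synchronization",
    "Phy reset problem",
)

# Hypothetical sample, trimmed from the output in the first post.
SAMPLE = """\
relative target port id = 1
  phy identifier = 0
  attached phy identifier = 2
  Invalid DWORD count = 8
  Running disparity error count = 8
  Loss of DWORD synchronization = 2
  Phy reset problem = 0
relative target port id = 2
  phy identifier = 1
  attached phy identifier = 0
  Invalid DWORD count = 0
  Running disparity error count = 0
  Loss of DWORD synchronization = 0
  Phy reset problem = 0
"""


def parse_phy_counters(smartctl_text):
    """Return {phy_identifier: {counter_name: value}} from smartctl -x text."""
    phys = {}
    current = None
    for raw in smartctl_text.splitlines():
        line = raw.strip()
        # A "phy identifier = N" line starts a new phy; the "attached phy
        # identifier" line does not match because re.match anchors at the start.
        m = re.match(r"phy identifier = (\d+)$", line)
        if m:
            current = phys.setdefault(int(m.group(1)), {})
            continue
        m = re.match(r"(.+?) = (\d+)$", line)
        if m and current is not None and m.group(1) in LINK_COUNTERS:
            current[m.group(1)] = int(m.group(2))
    return phys


def counter_increases(before, after):
    """Return {(phy, counter_name): delta} for counters that grew."""
    deltas = {}
    for phy, counters in after.items():
        for name, value in counters.items():
            delta = value - before.get(phy, {}).get(name, 0)
            if delta > 0:
                deltas[(phy, name)] = delta
    return deltas


if __name__ == "__main__":
    before = parse_phy_counters(SAMPLE)
    # Simulate a later snapshot in which one counter has kept climbing.
    later_text = SAMPLE.replace("Invalid DWORD count = 8", "Invalid DWORD count = 20")
    print(counter_increases(before, parse_phy_counters(later_text)))
```

These counters are reported by the drive itself, so the comparison only makes sense when the same drive stays in place while the cable, backplane, or HBA is swapped. Counts that keep rising tend to point at the link rather than the platters, but they don't say which part of the link is at fault.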
 