SCSI disk failure ABORTED COMMAND asc:4b,4 (NAK received)

pnunn

Dabbler
Joined
Jan 31, 2015
Messages
39
A few weeks ago, I was notified of a disk failure in my TrueNAS setup.

I replaced the drive, without looking too hard, with a SATA drive, which joined the pool OK but eventually crapped out, justifiably, because it was a slower drive than the 7 SAS drives that make up the rest of the pool.

I then went and purchased a second-hand SAS drive and tried to join this to the pool, but was greeted with a bunch of errors:

Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 45 dd 2d 80 00 01 00 00
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): SCSI status: Check Condition
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): SCSI sense: ABORTED COMMAND asc:4b,4 (NAK received)
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): Error 5, Retries exhausted

I reseated all of the cables, rebooted, etc., but nothing changed.

I figured that putting the SATA drive in may have damaged the HBA (an IT mode HBA in the Dell R520), so I purchased another one and have just replaced the existing one with the new one. Unfortunately, nothing has changed.

The hardware: Dell R520, Dell H310 mini monolithic K09CJ with LSI 9211-8i P20 IT Mode ZFS FreeNAS unRAID, TrueNAS 12.0-U8, dual Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz, 32GB RAM, 8 x 300GB SAS drives.

Not sure what else is relevant.

Any suggestions on what to try next? Back plane? Another disk?

Any other diagnostics anyone can suggest?

Thanks.

Peter.
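
For anyone decoding similar CAM errors, here is a minimal Python sketch that pulls the device, the READ(10) target, and the sense code out of log lines shaped like the ones above. The regular expressions and the `summarize` and `decode_read10` names are my own assumptions for illustration, fitted only to this log excerpt; they are not part of FreeBSD or TrueNAS.

```python
import re

# Assumed line shapes, taken only from the excerpt in this post.
CDB_RE = re.compile(
    r"\((?P<dev>\w+):(?P<hba>\w+):[\d:]+\): \S+ CDB: (?P<cdb>[0-9a-f ]+)"
)
SENSE_RE = re.compile(
    r"SCSI sense: (?P<key>[A-Z ]+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
    r" \((?P<text>[^)]*)\)"
)

SAMPLE = """\
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 45 dd 2d 80 00 01 00 00
Apr 10 14:13:17 freenas2 (da2:mps0:0:2:0): SCSI sense: ABORTED COMMAND asc:4b,4 (NAK received)
"""


def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB: byte 0 is the opcode (0x28),
    bytes 2-5 the big-endian LBA, bytes 7-8 the transfer length in blocks."""
    cdb = bytes.fromhex(cdb_hex.strip())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


def summarize(log_text: str) -> dict:
    """Collect device, decoded CDB fields, and sense data from the log text."""
    summary = {}
    for line in log_text.splitlines():
        cdb_match = CDB_RE.search(line)
        if cdb_match:
            summary["device"] = cdb_match.group("dev")
            summary["hba"] = cdb_match.group("hba")
            summary.update(decode_read10(cdb_match.group("cdb")))
        sense_match = SENSE_RE.search(line)
        if sense_match:
            summary["sense_key"] = sense_match.group("key")
            summary["asc"] = int(sense_match.group("asc"), 16)
            summary["ascq"] = int(sense_match.group("ascq"), 16)
            summary["sense_text"] = sense_match.group("text")
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the excerpt above, the CDB decodes to a 256-block read starting at LBA 1,172,123,008 on da2 behind mps0, and the sense data is ASC 0x4b, ASCQ 0x04.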
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
pnunn said: "I then went and purchased a second hand SAS drive and tried to join this to the pool, but was greeted with a bunch of errors"
Can you dump a smartctl -a of that drive?

I suspect it came from a storage array that uses 520-byte sectors.
 

pnunn

Absolutely....

Here is the data


root@freenas2[~]# smartctl -a /dev/da2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: EG0600JETKA
Revision: HPD2
Compliance: SPC-4
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 10000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x50000397281b32e9
Serial number: 76C0A1D7FUYB1628
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Apr 13 12:34:09 2022 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 25 C
Drive Trip Temperature: 60 C

Accumulated power on time, hours:minutes 40470:38
Manufactured in week 28 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 91
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 91
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0     1631         0         0          0    1226312.556           0
write:         0     1227         0         0          0      38197.467           0

Non-medium error count: 13767

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   40413                 - [-   -    -]
# 2  Background short  Completed                   -   40245                 - [-   -    -]
# 3  Background short  Completed                   -   40076                 - [-   -    -]
# 4  Background short  Completed                   -   39908                 - [-   -    -]
# 5  Background long   Completed                   -   39765                 - [-   -    -]
# 6  Background short  Completed                   -   39740                 - [-   -    -]
# 7  Background long   Completed                   -   39613

and it looks as though you are quite correct.
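
As a quick way to pull the health counters out of a dump like the one above, here is a rough Python sketch. The `parse_counters` name and the regular expressions are assumptions fitted to this particular smartctl output, not an official parser. As a rule of thumb only, a large non-medium error count alongside zero grown defects and zero uncorrected errors is often read as pointing at the link (cables, backplane, HBA) more than the platters, but that is not a diagnosis.

```python
import re

# Trimmed copy of the relevant lines from the smartctl output above.
SAMPLE = """\
Elements in grown defect list: 0
read: 0 1631 0 0 0 1226312.556 0
write: 0 1227 0 0 0 38197.467 0
Non-medium error count: 13767
"""


def parse_counters(smart_text: str) -> dict:
    """Extract a few SAS health counters from `smartctl -a` text output."""
    patterns = {
        "grown_defects": r"Elements in grown defect list:\s*(\d+)",
        "non_medium_errors": r"Non-medium error count:\s*(\d+)",
    }
    counters = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, smart_text)
        counters[name] = int(match.group(1)) if match else None
    # In the error counter log, the last column of each row is the
    # total of uncorrected errors for that direction.
    for row in re.finditer(r"^(read|write):\s+(.*)$", smart_text, re.MULTILINE):
        columns = row.group(2).split()
        counters[f"{row.group(1)}_uncorrected"] = int(columns[-1])
    return counters


if __name__ == "__main__":
    print(parse_counters(SAMPLE))
```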
 

pnunn

However, other drives seem to have the same 512 block size if that's what I should be looking at.
 

HoneyBadger

pnunn said: "However, other drives seem to have the same 512 block size if that's what I should be looking at."
It is, but the other factor is whether the output has a "formatted with protection information" line, which accounts for the other 8 bytes in the 512+8=520 formula. I don't see that line, though.

I will say that I've never seen a SAS/SATA mismatch damage an HBA. Hardware RAID controllers will often throw them out for incompatibility, but software RAID doesn't care.

Check your cables and possibly the slot itself; a bent or dirty pin, or debris in the backplane, could cause intermittent contact.
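
The 512+8=520 check mentioned above can be sketched in Python roughly as follows. This is only an illustration: the `sector_format` name is made up, and the exact wording smartctl uses for protection information is an assumption here, so the sketch just looks for any line mentioning "protection". The second and third samples are hypothetical and do not come from this thread.

```python
import re

# From the drive in this thread.
FROM_THREAD = "Logical block size: 512 bytes\nRotation Rate: 10000 rpm\n"
# Hypothetical outputs, for illustration only.
HYPOTHETICAL_PI = (
    "Logical block size: 512 bytes\n"
    "Formatted with type 2 protection\n"
    "8 bytes of protection information per logical block\n"
)
HYPOTHETICAL_520 = "Logical block size: 520 bytes\n"


def sector_format(smart_text: str) -> dict:
    """Report block size and any protection-information lines in smartctl text."""
    size = re.search(r"Logical block size:\s*(\d+)\s*bytes", smart_text)
    block = int(size.group(1)) if size else None
    # Assumption: any line mentioning "protection" relates to protection info.
    pi_lines = [
        line.strip()
        for line in smart_text.splitlines()
        if "protection" in line.lower()
    ]
    if block is None:
        verdict = "unknown"
    elif block not in (512, 4096):
        verdict = "non-standard block size"
    elif pi_lines:
        verdict = "standard block size with protection information"
    else:
        verdict = "standard block size, no protection information reported"
    return {"block_size": block, "protection_lines": pi_lines, "verdict": verdict}


if __name__ == "__main__":
    for sample in (FROM_THREAD, HYPOTHETICAL_PI, HYPOTHETICAL_520):
        print(sector_format(sample))
```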
 

pnunn

OK, thanks @HoneyBadger, at least it makes me feel better about plugging in the wrong drive :). I have had a look at the cables, but I'll have another go at the backplane and see what I can see in there.
 

pnunn

OK, after much mucking around, and swapping out the HBA for a new one, I have now replaced the backplane for the drives and lo, everything came back to life, so it would seem one of the sockets on the original backplane died. YAY. Now I have a spare HBA with IT mode and the original HBA with RAID mode on it, so I guess I should sell them. :)
 