Kernel SCSI errors


c32767a

Patron
Joined
Dec 13, 2012
Messages
371
System:

reenas.local
Build FreeNAS-8.3.0-RELEASE-x64 (r12701M)
Platform Six-Core AMD Opteron(tm) Processor 2419 EE
Memory 32751MB
System Time Fri Dec 21 00:20:32 EST 2012
Uptime 12:20AM up 54 mins, 2 users
Load Average 2.29, 2.29, 2.55
Connected through 192.168.1.196





I'm experimenting with a new build and am getting the following kernel errors when the disks are under heavy write load:

Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 91 88 0 1 0 0 length 131072 SMID 378 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8e 88 0 1 0 0 length 131072 SMID 849 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8d 88 0 1 0 0 length 131072 SMID 415 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8b 88 0 1 0 0 length 131072 SMID 601 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8c 88 0 1 0 0 length 131072 SMID 409 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8f 88 0 1 0 0 length 131072 SMID 172 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 88 88 0 1 0 0 length 131072 SMID 903 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 8a 88 0 1 0 0 length 131072 SMID 715 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 89 88 0 1 0 0 length 131072 SMID 880 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 90 88 0 1 0 0 length 131072 SMID 402 terminated ioc 804b scsi 0 state c xfer 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): WRITE(10). CDB: 2a 0 2b f1 88 88 0 1 0 0
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): CAM status: SCSI Status Error
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): SCSI status: Check Condition
Dec 21 00:13:03 freenas kernel: (da4:mps1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec 21 00:13:04 freenas kernel: (da4:mps1:0:3:0): WRITE(6). CDB: a 0 1 b8 8 0
Dec 21 00:13:04 freenas kernel: (da4:mps1:0:3:0): CAM status: SCSI Status Error
Dec 21 00:13:04 freenas kernel: (da4:mps1:0:3:0): SCSI status: Check Condition
Dec 21 00:13:04 freenas kernel: (da4:mps1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

SMARTD doesn't turn anything up on any of the drives.
zpool scrub returns a clean bill of health, no errors or repairs.
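
For reference, the checks were roughly along these lines ("tank" is just a placeholder for the actual pool name, and da4 is the disk from the log above):

smartctl -a /dev/da4     # SMART health, attributes and error log for the complaining disk
zpool scrub tank         # force a full read/verify of the pool
zpool status -v tank     # look for read/write/checksum errors once the scrub finishes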

The case is a 12 bay case and I've moved the drives around between bays with no effect. The problem seems to follow the drives, subject to the fact that it's 'sticky' to one drive per reboot.
Drive location doesn't seem to matter, nor does drive type.
The system is well capable of powering the full 12 bays. I'm only experimenting with 3 drives at the moment. No other bays are populated.


There are 2 controllers, both LSI SAS9211-8i.
Disks are a couple of new 3TB drives I had lying around as spares for a production NAS.
2 are:
<ATA ST3000DM001-1CH1 CC43> at scbus0 target 2 lun 0 (pass0,da0)
<ATA ST3000DM001-1CH1 CC43> at scbus1 target 2 lun 0 (pass3,da3)

and one is:
<ATA WDC WD30EZRX-00M 0A80> at scbus1 target 3 lun 0 (pass4,da4)

All are/were new out of the packaging today.

The LSI controllers have version 14 firmware on them.
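
(For anyone checking their own cards: the firmware revision shows up in dmesg when the mps driver attaches, and LSI's sas2flash utility will report it as well, assuming it's installed. The exact output wording varies by driver and firmware version.)

dmesg | grep -i 'mps.*firmware'    # mps(4) prints the firmware revision at attach time
sas2flash -listall                 # LSI utility; lists each controller and its firmware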

Any thoughts as to where I should look? Is this a driver bug? Other than the syslog messages, I haven't been able to find any evidence of a failure anywhere.
With these 3 drives (yeah, I know... one's a 5400 RPM drive and all 3 are desktop drives... I have cheap users to support) the system can sustain 100MB/s writes, even with the syslog messages.
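
(For what it's worth, a streaming write along these lines is enough to generate that kind of sustained load; /mnt/tank/testfile is just a placeholder path on the pool, and compression should be off on the dataset or the number won't mean much:)

dd if=/dev/zero of=/mnt/tank/testfile bs=1m count=8192    # ~8GB streaming write; dd reports throughput when it finishes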


Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I think this is the problem I had with my old Seagates. There was a super long thread about this on the Seagate website, but I can't find it now so I'll give you the basic info I can remember. This cost me over $2500 and is the sole reason I will never buy Seagates ever again. I wasn't using FreeBSD/FreeNAS when I had these drives, but I was getting a "reset" error which is exactly what you are getting.

Seagate drives have an issue where the firmware on the drive will return a HALT command when the disk's on-board cache is full. This is supposed to tell the SATA/RAID controller to stop sending data until the disk sends another command (I believe it is RESUME). Some controllers continue to send data anyway, and in that case the hard disk sends a much more forceful RESET command. This forces the SATA/RAID controller to stop sending data; the disk is basically saying "I'm busy and if you send me anything I won't respond while I do internal things". Normally that means diagnostics, bootup, and similar housekeeping. The disk replies with READY after it has gone idle. No data is actually lost. On most other brands this behavior would be a "ZOMG this drive is failing" sign, but for certain firmware versions on some Seagate drives it is normal and not, by itself, a cause for concern.

Seagate does not always follow the industry standard, and that is why you are getting those messages. Normally they would be a cause for concern, but not in your case. Seagate seems to love doing their own thing. If I boot up one of my FreeNAS servers, run a scrub with nobody accessing the server, and run a "smartctl -a /dev/ada0" against ALL of the Seagate drives I have, I get a SMART ID#1 (Raw_Read_Error_Rate) that is increasing by about 20 million per hour. What do the WD drives in the same zpool do? They all show a value of 0. Seagate's ID#1 simply doesn't mean the same thing as it does for every other manufacturer I've dealt with. I called Seagate and they said that ID#1 going up is not an indication of anything wrong (but it is a very good indication of something wrong for just about every OTHER manufacturer out there). I get the impression that the value actually shows the number of sectors read that had to be corrected using the hard drive's ECC data; in effect, at least 1 bit was in error on that sector when read and was corrected by the drive hardware. Another good example: a brand new Seagate drive I was playing with yesterday showed a raw value for "Head_Flying_Hours" of 59644210839851. I'm pretty sure that the 3TB drive has not been powered on for 6.8 billion years.
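
If you want to compare that attribute across your own drives, something like this works (ada0, ada1 and da4 are just example device nodes; substitute whatever camcontrol devlist shows for yours):

for d in ada0 ada1 da4; do
  echo -n "$d: "
  smartctl -A /dev/$d | awk '$1 == 1 { print $2, "raw =", $10 }'   # attribute ID 1, raw value is column 10
done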

Some RAID controllers will take a RESET command from a hard drive and freak out. My Highpoint RocketRAID 3560, 3520 and 4320 all freak out if any of the drives on the controller send a RESET command. Additionally, most RAID controllers (note I do not say SATA controllers) will drop a drive from an array if a RESET command is received. This meant that home users with Intel SATA/RAID had zero problems, but if you bought any aftermarket hardware RAID controller this "issue" would crop up. If you called Seagate they'd tell you "sorry, that drive isn't meant for NAS use so we won't help you". The drive was sold for "Desktops and Desktop RAID". What is "Desktop RAID", you ask? Seagate says "Desktop RAID" is using the Intel RAID controller in a RAID0 or RAID1. This was a sneaky (and, I assume, deliberate) way to prevent people from buying consumer drives and using them in places where Seagate would have preferred to sell a MUCH more expensive Enterprise class drive. It worked really well at getting people to stop using Seagate consumer drives; they switched to WDs and other brands.

There was also something wrong with the Highpoint controllers that I could never pin down: sometimes the entire controller would lock up (the device would disappear from the Device Manager in Windows Server 2008 R2). If left idle it could run indefinitely with no problems, except that sometimes hard drives would randomly drop because they'd start doing their own background idle tasks and the HALT command would be sent. It really sucks, because it makes it damn near impossible to get your data off the RAID after you've figured this out. You're left freaking out and sweating bullets when it happens to be the only copy of your data, drives are randomly dropping from the array, and the RAID keeps going completely away because the controller locks up.

To add to my frustration, I had spent days researching the exact hard drive to buy, and my hard drives and firmware version were on the Highpoint compatibility list (lots of help those are...). I turned around and spent something like $2k+ on Seagates, only to buy another $2k+ worth of WDs less than 3 months later when I was migrating data around and found this issue. The REALLY REALLY shitty thing about it is that the drives will work perfectly when the RAID is empty. It's not until you start loading it with data and the drives start performing more slowly that the issue really starts happening. In my case the drives ran for about 70 days with zero issues; then suddenly I started having the issue every other day with a different disk every time. I ended up putting my Seagates in a box and never trusted them with any data ever again.

TL;DR - Drives are fine. Ignore the error and trust your SMART tests and scrubs, which are showing the disk is fine. When your SMART tests and scrubs start showing tons of errors, then you should start worrying. Also, do not EVER use those drives in a hardware RAID. They may not be trustworthy.

Remember that I am not sure if this is your issue, but it sounds exactly like the problem I had.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
I buy the flow control/bandwidth issue. This machine has the most RAM of any machine I've worked with, so there is ample RAM to cache and flush. There are also only 3 drives, so the write load is concentrated on just 3 spindles instead of 6 or 10.

But I'm seeing this problem on both WD and Seagate drives. So, though it sounds similar, I'm not sure I can blame Seagate just yet.

It looks like there are some tunables for the driver and ZFS that I can play with to see if the problem is load related. Lots of reboots and test cases to play with.
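
For anyone following along, the kind of knobs I mean go in /boot/loader.conf. The two below are just examples from the FreeBSD 8.x era, not a recommendation; verify they exist on your build before leaning on them, and reboot after changing them:

vfs.zfs.vdev.max_pending="4"    # lower the per-vdev queue depth to see if the resets are load related
hw.mps.max_chains="4096"        # give the mps(4) driver more DMA chain frames under heavy I/O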
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Oh damn. I just realized those messages were for da4 and not da3. I got nothing then. I have 8 of those in a testbed right now and they've had zero problems. I'd try a SMART long test and see what happens there.
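
Something along these lines, if you haven't already (da4 being the disk throwing the errors; a long test on a 3TB drive can take several hours):

smartctl -t long /dev/da4      # kick off the long self-test; it runs in the background on the drive
smartctl -l selftest /dev/da4  # check the self-test log once it has finished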
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Yeah, I've run all the SMART tests on the drives and nothing comes up.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Ignore protosd's suggestion at your own peril. IIRC, this looks just like a few threads I've seen on the FreeBSD mailing lists. The solution at the time was to upgrade to a new firmware revision that LSI released to address the problem. If an older firmware corrects the problem, why wouldn't you run it?
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
I've been around IT long enough to know that vendors occasionally rev chipsets and components without necessarily backporting firmware support, and that can cause problems too.

The discussion in the link that was posted, and on other fora, talks about version 12 firmware and upgrading to 13 vs 14. It wasn't clear if the card needed to be at exactly 13, or at 13 or later.

I'm not going to flash older firmware on the cards until I'm convinced that's the problem.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Just an update,

After a great deal of experimentation, it's looking more and more like the WD disk is part of the problem.

Building zpools with just the Seagate disks results in excellent performance and no errors.

I also condensed everything down onto one controller. Now it's time to go back and distribute the disks across 2 controllers again.
 

c32767a

Patron
Joined
Dec 13, 2012
Messages
371
Ignore protosd's suggestion at your own peril. IIRC, this looks just like a few threads I've seen on the FreeBSD mailing lists. The solution at the time was to upgrade to a new firmware revision that LSI released to address the problem. If an older firmware corrects the problem, why wouldn't you run it?

It turns out that phase 13 firmware is a 'minimum' version. The driver is working fine with the phase 14 firmware. I didn't downgrade the firmware that shipped with the card because that was not the problem.

After extensive testing, the problem turned out to be a bad SAS backplane, not the controller firmware or driver. With the backplane replaced, the server platform now passes heavy load testing.
 