Strange RAIDZ Behavior

Hobbes · Aug 30, 2016

I've got 8, 3tb WD Red drives which all have a really high number of UDMA_CRC error count. They all have 5-6 million when I do a smartctl -a. I previously had these in a UNAS 800 enclosure with a Jetway JNF9A-Q67 mobo, core i3 CPU, 16GB, non ECC RAM, IBM M1015 HBA. I suspect I have an issue with either the SFF 8087 to sata cables or the M1015 card. to cause those UDMA CRC errors. When running a scrub on a pool, I would see checksum errors as well, so could be related to non ECC RAM and/or issues with cable or HBA. I'm running FreeNAS 9.10.1 stable.

I've run short and long smartctl tests on the drives and all of them show 0 for reallocated sector, current pending sector and offlined uncorrectable. So I think the drives are OK, except for the exorbitantly high value of UDMA_CRC errors.

I recently bought a Supermicro SC846 24 bay enclosure with a Supermicro X8DT3-F, Xeon E5506 CPU, and 48GB of ECC RAM. 8 Bays are connected to the mobo via SFF8087 for onboard HBA running in IT mode. I've also got 2, LSI 9211-8i HBA's running FW ver.20 that connect the other 16 bays.

I've dettached and wiped the drives and put them into the SC846. All 8 drives are being managed by 1 LSI controller. I created an 8 disk RAIDZ2 array. Once I started testing it, I found that I was seeing a lot of these errors,

CAM status: SCSI Status Error
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI status: Check Condition
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): Retrying command (per sense data)

These errors will appear when I do a scrub on the pool. Once the scrub is done, these errors can occur when data is read from the pool.

I am seeing these across all drives. The UDMA_CRC errors are also increasing when I see these SCSI errors.

Through more testing, it seems if I create a RAIDZ2 array with 5 disks, I don't have any errors. Once I go to 6 or more disks, these errors start to occur.

I'm going to try spreading these disk across both LSI controllers and see if the errors occur when I spread out the disks across both HBA's.

This seems really strange, so I've got into as much detail as I can think of. I've done as much searching as I can and can't find anything that points to the issue. I wish I had 8 new drives to test with, but that's not in the cards right now.

When these drives were in the UNAS800, I would see checksum errors after a scrub. So it appears that the ECC RAM is helping as I don't see the checksum errors, but I see the SCSI status errors and the scrub performance is terrible.

When I have the 5 drive RAIDZ2 array, I will see 500-600mb/sec when running a scrub. But when I have the 8 drive RAIDZ2 array with a lot of those SCSI status errors the performance is dismal at 11-14mb/sec and would take over a week to complete.

Could I have 2 bad cards? I can't say for sure it's the cables. I can't use the onboard HBA as it'll only support 2tb drives.

Sorry for this long post, but hoping the details I provided will be helpful..

Any other suggestions to narrow this down?

rs225 · Aug 30, 2016

So you have 8 drives that throw large numbers of UDMA CRC errors when used in two completely separate systems that have nothing else in common? Not even the same source of SATA cables?

rs225 · Aug 30, 2016

What is your electrical setup, and do you have any testing tools for it?

Hobbes · Aug 30, 2016

Correct, the only thing common between the two systems are the drives. I had to look twice at the number of UDMA CRC errors. I thought I had issues with the non ECC system and just moved the drives across to the SC846. The SC846 has redundant power supplies and I had just plugged in one. But for the last day I plugged in both. I don't have any tools to measure power, if that's what you're asking.

Mirfster · Aug 31, 2016

What version of FreeNAS?

I think that "UDMA_CRC errors" are mainly due to communication between the disk and system; say like faulty cabling. However, you appeared to have tested on two entirely different systems.

Another thought is to check and see if there is a Firmware update for the WD Drives?

rs225 · Aug 31, 2016

Well, you either have eight hard drives with broken SATA chips (not impossible, but unlikely) or you have an electrical problem in your building wiring.

Stux · Aug 31, 2016

rs225 said:
Well, you either have eight hard drives with broken SATA chips (not impossible, but unlikely) or you have an electrical problem in your building wiring.

A good ups would be good for that, if that were the problem.

rs225 · Aug 31, 2016

Stux said:
A good ups would be good for that, if that were the problem.

No, that would be ignoring the problem.

Hobbes · Aug 31, 2016

Mirfster said:
What version of FreeNAS?

I think that "UDMA_CRC errors" are mainly due to communication between the disk and system; say like faulty cabling. However, you appeared to have tested on two entirely different systems.

Another thought is to check and see if there is a Firmware update for the WD Drives?

I'm running 9.10.1 stable, latest at this point. I'll check into FW. These are circa 2013-2014 drives all running ver. 80.00A80. Was hoping changing system would have fixed this.

Hobbes · Aug 31, 2016

rs225 said:
No, that would be ignoring the problem.

I think power is OK, but this is without measuring anything of course. This NAS is plugged into a SmartUPS 1400, so power should be clean to the system.

Hobbes · Aug 31, 2016

Just to update, spread the 8 drives across the two LSI controllers (5 on 1, 3 on the other) and am still seeing those CAM SCSI status errors. Ugh. Can't see there being an actual problem with all 8 drives, could there?

rs225 · Aug 31, 2016

First, whatever the problem is, it does seem to be of the unusual variety.

If it is electrical, it isn't juice that might be the problem. It is the wiring. And I don't know if UPS can do anything about wiring except run on battery. I'm pretty sure a good electrician with some testing tools could do some checks at your outlet and the mains and rule it in or out. Bad grounds, weak neutrals, reverse current flows, something that would potentially be showing up on your systems internal power and maybe putting the SATA bus voltages (or the ground pins) out of spec or incapable of serving >6 devices within spec.

Also if you have any of those 2-pin to 3-pin adapters anywhere on the circuit.(If you do, whatever device is plugged in to it potentially is a shock-you-dead hazard, and you better not mess with it.)

rs225 · Aug 31, 2016

Another wild guess: Do those WD drives have a setting for Spread Spectrum Clocking (SSC?) If so, is it enabled or disabled and have you tried swapping settings?

Edit: Looking at the drive SATA connector head-on, the pin block of eight to the right, if the right-most upper and lower are jumpered together, SSC is enabled. If so, that could be the problem. Pretty unlikely though.

Hobbes · Aug 31, 2016

rs225 said:
Another wild guess: Do those WD drives have a setting for Spread Spectrum Clocking (SSC?) If so, is it enabled or disabled and have you tried swapping settings?

Edit: Looking at the drive SATA connector head-on, the pin block of eight to the right, if the right-most upper and lower are jumpered together, SSC is enabled. If so, that could be the problem. Pretty unlikely though.

No unfortunately those pins aren't jumpered on the drives. You're certainly correct, this is a really strange one.

Is there anything to validate health of the HBA's?

rs225 · Sep 1, 2016

If you still have support/warranty from LSI, you could try contacting them. Explain the strange situation, and they might have some ideas.

Hobbes · Sep 1, 2016

I found that the FW on my HBA's is not most current. I'm running 20.00.00.00 on both, but just checking I see that there is 20.00.07.00 released Feb 2016. Hopefully it's just a FW bug that's been fixed. Considering 20.00.00.00 was from 2014, I'm hopeful this will fix it. Fingers crossed!

If it'll help anyone, here's the link to the latest 20.00.07.00 FW from Avago tech.

http://www.avagotech.com/support/do...olution+Brief;User+Guide;&area4=&dnd-keyword=

Hobbes · Sep 1, 2016

I'm confident that the updated FW, 20.00.07.00 for my 2 9211-8i HBA's has fixed my strange issue. Once both were flashed, I re-created my 8 x 3tb disk RAIDZ2 pool, copied about 150gb of data to it and did a scrub. With the old firmware, almost immediately, I would see the CAM status scsi errors and the scrub throughput was really slow. Now it runs right through with no scsi errors at the console and I'm seeing 450M/s to 600M/s for the scrub.

Moral of this is to make sure FW is current and double check. I thought I was current having flashed to 20.00.00.00 when I upgraded to 9.10.

Thanks rs225 for offering suggestions. I was out of ideas until I checked the Avagotech site again.

rs225 · Sep 2, 2016

That is good to hear.

Will it fix your M1015 as well?

Hobbes · Sep 2, 2016

haha, we're thinking alike. Flashed it last night and the M1015 isn't throwing up the scsi errors, but I do get some checksum errors when I run a scrub on some test data. Moved those 8 drives over to the system with the M1015, built a new pool, copied data and scrubbed. More than 1 drive has a checksum error in single digits with about 130gb of data. Would you chalk that up to not having ECC RAM?

Mirfster · Sep 2, 2016

Hobbes said:
Would you chalk that up to not having ECC RAM?

That is where I would bet my money. Hopefully that damage is not too bad, but then again...

Important Announcement for the TrueNAS Community.

Strange RAIDZ Behavior

Dabbler

Guru

Guru

Dabbler

Doesn't know what he's talking about

Guru

MVP

Guru

Dabbler

Dabbler

Dabbler

Guru

Guru

Dabbler

Guru

Dabbler

Dabbler

Guru

Dabbler

Doesn't know what he's talking about

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Strange RAIDZ Behavior"

Similar threads