I've got eight 3TB WD Red drives that all show a really high UDMA_CRC_Error_Count. Each one reports 5-6 million when I run smartctl -a. I previously had them in a UNAS 800 enclosure with a Jetway JNF9A-Q67 motherboard, Core i3 CPU, 16GB of non-ECC RAM, and an IBM M1015 HBA. I suspect that either the SFF-8087 to SATA cables or the M1015 card caused those UDMA CRC errors. When running a scrub on a pool I would see checksum errors as well, which could be related to the non-ECC RAM and/or the same cable or HBA problem. I'm running FreeNAS 9.10.1 stable.
I've run short and long smartctl tests on the drives and all of them show 0 for reallocated sector, current pending sector and offlined uncorrectable. So I think the drives are OK, except for the exorbitantly high value of UDMA_CRC errors.
I recently bought a Supermicro SC846 24-bay enclosure with a Supermicro X8DT3-F, Xeon E5506 CPU, and 48GB of ECC RAM. Eight bays are connected via SFF-8087 to the motherboard's onboard HBA, which is running in IT mode. I've also got two LSI 9211-8i HBAs running firmware version 20 that connect the other 16 bays.
I've detached and wiped the drives and put them into the SC846. All 8 drives are being managed by one LSI controller. I created an 8-disk RAIDZ2 array. Once I started testing it, I found that I was seeing a lot of these errors:
CAM status: SCSI Status Error
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI status: Check Condition
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
Aug 30 13:50:37 echelon (da1:mps1:0:1:0): Retrying command (per sense data)
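To see whether these errors cluster on particular disks or on one controller, a rough tally of the log could help. The following is only a sketch: the regular expression is an assumption based on the lines pasted above, and `tally_crc_errors` is a made-up helper name, not part of any FreeNAS tool.

```python
import re
from collections import Counter

# Pattern assumed from the log lines above, e.g.
# "(da1:mps1:0:1:0): SCSI sense: ABORTED COMMAND asc:47,3 (...)"
CRC_SENSE = re.compile(
    r"\((da\d+):(mps\d+):[\d:]+\): SCSI sense: ABORTED COMMAND asc:47,3"
)


def tally_crc_errors(lines):
    """Count iuCRC sense errors per disk and per controller (hypothetical helper)."""
    per_disk, per_hba = Counter(), Counter()
    for line in lines:
        match = CRC_SENSE.search(line)
        if match:
            per_disk[match.group(1)] += 1  # e.g. "da1"
            per_hba[match.group(2)] += 1   # e.g. "mps1"
    return per_disk, per_hba


# Two lines copied from the excerpt above; a real run would iterate over the log file.
sample = [
    "Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI status: Check Condition",
    "Aug 30 13:50:37 echelon (da1:mps1:0:1:0): SCSI sense: ABORTED COMMAND asc:47,3 "
    "(Information unit iuCRC error detected)",
]
per_disk, per_hba = tally_crc_errors(sample)
print(dict(per_disk), dict(per_hba))
```

Only the "SCSI sense" line is counted, so each failed command is tallied once rather than once per log line it produces.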
These errors will appear when I do a scrub on the pool. Once the scrub is done, these errors can occur when data is read from the pool.
I am seeing these across all drives. The UDMA_CRC errors are also increasing when I see these SCSI errors.
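One way to confirm that the counters really move with the SCSI errors would be to snapshot the raw UDMA_CRC_Error_Count before and after a scrub and compare. This is a minimal sketch, assuming the usual ten-column attribute table printed by `smartctl -A`; the function names are hypothetical and the device paths passed in (for example /dev/da0 through /dev/da7) are assumptions about how the disks are named.

```python
import subprocess


def parse_crc_count(smart_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


def read_crc_count(device):
    """Run smartctl on one device (needs root and smartmontools installed)."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_crc_count(result.stdout)


def crc_deltas(before, after):
    """Given two {device: count} snapshots, return only the devices whose count rose."""
    return {
        dev: after[dev] - before[dev]
        for dev in before
        if dev in after and after[dev] > before[dev]
    }
```

Calling `read_crc_count` for each disk before the scrub and again afterwards, then passing both dictionaries to `crc_deltas`, would show which drives picked up new CRC errors during that run.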
Through more testing, it seems if I create a RAIDZ2 array with 5 disks, I don't have any errors. Once I go to 6 or more disks, these errors start to occur.
I'm going to try spreading these disks across both LSI controllers and see if the errors still occur.
This seems really strange, so I've gone into as much detail as I can think of. I've done as much searching as I can and can't find anything that points to the issue. I wish I had 8 new drives to test with, but that's not in the cards right now.
When these drives were in the UNAS 800, I would see checksum errors after a scrub. So it appears that the ECC RAM is helping, as I no longer see the checksum errors, but I do see the SCSI status errors and the scrub performance is terrible.
With the 5-drive RAIDZ2 array, I see 500-600 MB/s when running a scrub. With the 8-drive RAIDZ2 array and a lot of those SCSI status errors, performance is dismal at 11-14 MB/s, and the scrub would take over a week to complete.
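For a sense of scale, the arithmetic behind that estimate can be sketched as below. The 10 TB of pool data is a purely hypothetical figure (the actual amount is not stated in this post), and `scrub_days` is just an illustrative helper.

```python
def scrub_days(data_tb, rate_mb_s):
    """Days needed to read data_tb terabytes at rate_mb_s megabytes per second."""
    seconds = (data_tb * 1e12) / (rate_mb_s * 1e6)
    return seconds / 86400


# 10 TB is a hypothetical amount of pool data, not a figure from this post.
for rate in (550, 12):
    print(f"{rate} MB/s -> {scrub_days(10, rate):.2f} days")
```

Under that assumption the scrub would take about five hours at 550 MB/s and well over a week at 12 MB/s.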
Could I have two bad cards? I can't say for sure that it's the cables. I can't use the onboard HBA, as it only supports drives up to 2TB.
Sorry for the long post, but I'm hoping the details I provided will be helpful.
Any other suggestions to narrow this down?