High Checksum Error Rate

dev246 · Mar 29, 2016

Hi,

I have a rather strange problem with checksum errors on my disks.
First, here is my configuration:

Supermicro X10SLH-F Motherboard
Xeon® Processor E3-1240 v3
Supermicro 4x8GB DDR3 1600MHz ECC
Supermicro SuperChassis 846BE1C-R1K28B (with SAS expander)
N2215 - HBA
10x 4TB WD RED
1 volume using RAID-Z2 (on 10x HDD) + 1x 250GB SSD as cache + 1x 120GB SSD as log

This configuration is almost brand new and worked well for 2 months (the only difference was that I used 10 older HDDs). After the two-month test period I destroyed that RAID-Z2 volume, upgraded FreeNAS to the newest version (FreeNAS-9.3-STABLE-201602031011), swapped the old used HDDs for brand-new 4TB WD Reds, and created a new volume with the same settings: RAID-Z2 on 10 HDDs + 1 SSD as log + 1 SSD as cache. Since then I have had a high checksum error rate on all the HDDs (but not on the SSDs working as log and cache).
I checked SMART after running short and long tests, and none of the 10 disks shows any errors.

Does anyone have an idea what else I can check to solve the problem without destroying the volume and creating it again? Or is there a chance that a bug in the latest version causes this behavior?

Ericloewe · Mar 29, 2016

Probably a cable issue. Try replacing the cables to the backplane.

It could also be bad power, but that's a lot less likely.

dev246 · Mar 29, 2016

Hi
I'm pretty sure it's not a power issue, because the server is connected to two powerful server UPSes (like a few other servers).
I will try replacing the cables from the HBA to the backplane in the next service window, but as I wrote earlier, only 2 things were done before the problem appeared: the FreeNAS update and replacing the HDDs on the backplane. The chassis was not even opened, so the cables should be intact.
I would rather suspect a problem with the disks or the update, but I don't know what I can check apart from the SMART status of the disks. If it helps, I can paste the results from smartctl.

hugovsky · Mar 29, 2016

dev246 said:
power issue

Power issue as in power supply (PSU), not UPS.

dev246 · Mar 29, 2016

Yes, but sometimes poor mains power quality causes power problems even with a good PSU.
This chassis has dual PSUs and voltage monitoring on the motherboard. All power rails look good, both under heavy load and at idle.

Mlovelace · Mar 29, 2016

You may have had a DIMM go bad. Has the pool been scrubbed recently? Run a memtest to make sure the memory is okay.

dev246 · Mar 30, 2016

I may be wrong, but I think a problem with a registered ECC DIMM should be visible in the motherboard health status. Of course I will run a memory test in the next service window, but maybe there is something else I can check on the production server without stopping FreeNAS?
I manually start a pool scrub every 2 days to repair errors, because every scrub finds checksum errors (10-120 checksum errors per disk).

# zpool status -x
  pool: RZ2-10x6T
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.19M in 21h35m with 0 errors on Wed Mar 30 06:45:11 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        RZ2-10x6T                                       DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/d890a277-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    27
            gptid/d939581b-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    26  too many errors
            gptid/d9eb597a-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    32
            gptid/da905ca3-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    17  too many errors
            gptid/db36ed34-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    28
            gptid/dbdde9aa-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    29
            gptid/dc8f5066-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    22
            gptid/dd40f193-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    27  too many errors
            gptid/dde7c0e8-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    20  too many errors
            gptid/de8f0046-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    40  too many errors
        logs
          gptid/defa2526-e144-11e5-a2aa-a0369f6c3304    ONLINE       0     0     0
        cache
          gptid/ded277dd-e144-11e5-a2aa-a0369f6c3304    ONLINE       0     0     0
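
For anyone who wants to track these counters between scrubs, here is a minimal Python sketch that tallies the CKSUM column from the device lines above. The function names (parse_cksum, devices_with_errors) are made up for illustration, and the parser assumes every device name starts with "gptid/" as in this pool; it is not a general zpool parser.

```python
# Device lines copied from the zpool status output in this post.
DEVICE_LINES = """\
gptid/d890a277-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 27
gptid/d939581b-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 26 too many errors
gptid/d9eb597a-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 32
gptid/da905ca3-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 17 too many errors
gptid/db36ed34-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 28
gptid/dbdde9aa-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 29
gptid/dc8f5066-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 22
gptid/dd40f193-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 27 too many errors
gptid/dde7c0e8-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 20 too many errors
gptid/de8f0046-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 40 too many errors
gptid/defa2526-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 0
gptid/ded277dd-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 0
"""


def parse_cksum(text):
    """Return {device: cksum} for lines shaped like NAME STATE READ WRITE CKSUM.

    Assumption: device names start with "gptid/"; other lines are skipped.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 5
            and fields[0].startswith("gptid/")
            and all(f.isdigit() for f in fields[2:5])
        ):
            counts[fields[0]] = int(fields[4])
    return counts


def devices_with_errors(counts):
    """Return the sorted names of devices whose CKSUM counter is non-zero."""
    return sorted(name for name, n in counts.items() if n > 0)


counts = parse_cksum(DEVICE_LINES)
bad = devices_with_errors(counts)
print(f"{len(bad)} of {len(counts)} devices report checksum errors")
print(f"total CKSUM errors: {sum(counts.values())}")
```

Because the parser splits on whitespace, it should work on the live output of zpool status as well as on the pasted text, as long as the device naming assumption holds.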

Arwen · Mar 30, 2016

It sounds like the old disks might have been SATA II, (3Gbps), or even SATA I, (1.5Gbps).
And the new disks SATA III, (6Gbps). If that was the case, and the disk data cables were of
lower quality, then the higher speed of the new disks causes problems.

That would explain why it worked before with only 2 things changed.

Of course, there was a FreeNAS update too...

dev246 · Mar 31, 2016

Unfortunately, all of the removed disks are SATA III (I double-checked all the old disks).

I will try replacing the HBA cable and running memtest on Sunday, but if anyone has other ideas about what I can check on a production system, I would be grateful.

dev246 · Apr 5, 2016

Hi. I replaced the HBA cables and ran memtest86+ for 6-7 hours. Memtest didn't show any errors. After a scrub, FreeNAS found a few CKSUM errors on all disks, and all of them were repaired. So I started the scrub again to verify whether replacing the HBA cables resolved the problem, but I still get checksum errors. Any other ideas about what I can do?

Maybe something is wrong with my tunables after the upgrade. Can someone verify them:
kern.ipc.maxsockbuf 2097152
net.inet.tcp.delayed_ack 0
net.inet.tcp.recvbuf_max 2097152
net.inet.tcp.sendbuf_max 2097152
vfs.zfs.arc_max 15569256448
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_noprefetch 0
vfs.zfs.l2arc_norw 0
vfs.zfs.l2arc_write_boost 40000000
vfs.zfs.l2arc_write_max 10000000
vm.kmem_size 31138512896

I have 32GB of registered ECC RAM.

Mlovelace · Apr 5, 2016

Well, you have 32GB of RAM and a max ARC of 14.5GB... yeah, that's a problem. Uncheck autotune, delete those settings, and reboot. Autotune isn't giving you the checksum errors, but it's hamstringing your system pretty badly.
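
To show where the 14.5 figure comes from, here is a small Python sketch that converts the two byte-valued tunables from the post above into GiB. The helper name to_gib is made up for illustration; the values are copied from the tunables list.

```python
GIB = 1024 ** 3  # bytes per GiB

# Values copied from the tunables list posted above.
tunables = {
    "vfs.zfs.arc_max": 15569256448,
    "vm.kmem_size": 31138512896,
}


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


for name, value in tunables.items():
    print(f"{name} = {to_gib(value):.1f} GiB")
```

This prints 14.5 GiB for vfs.zfs.arc_max and 29.0 GiB for vm.kmem_size, so the ARC cap is a bit under half of the installed 32GB.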

dev246 · Apr 8, 2016

Enabling autotune and upgrading to 9.3.1 had no effect on the CKSUM errors.
If writing to the disks is a problem for the HBA, is there some way to slow down the write speed to the disks?
Or is there some other way to resolve the problem?

rs225 · Apr 8, 2016

You will probably have to swap things until you find the solution. Most likely in this case is a bad cable to the expander, a bad expander, or a bad connector/controller that the expander connects to. This might be a case where contacting SuperMicro or the seller might help.
