High Checksum Error Rate

Status
Not open for further replies.

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Hi,

I have a rather strange problem with checksum errors on my disks.
First, here is my configuration:

Supermicro X10SLH-F Motherboard
Xeon® Processor E3-1240 v3
Supermicro 4x8GB DDR3 1600MHz ECC
Supermicro SuperChassis 846BE1C-R1K28B (with SAS expander)
N2215 - HBA
10x 4TB WD RED
1 Volume Using Raid Z2 (on 10x HDD) + 1 SSD 250GB as cache + 1 SSD 120GB as LOG

This configuration is almost brand new and worked well for 2 months (the only difference was that I was using 10 old HDDs). After the 2-month test period I destroyed that RAID Z2 volume, upgraded FreeNAS to the newest version (FreeNAS-9.3-STABLE-201602031011), swapped the old used HDDs for brand new 4TB WD Reds, and created a new volume using the same settings: RAID Z2 on 10 HDDs + 1 SSD as LOG + 1 SSD as cache. Since then I have had a high checksum error rate on all HDDs (but not on the SSDs working as log and cache).
I checked SMART after short and long tests, and none of the 10 disks shows any errors.

If anyone has an idea, what else can I check to solve the problem without destroying the volume and creating it again? Or is there a chance that a bug in the latest version causes this behavior?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Probably a cable issue. Try replacing the cables to the backplane.

It could also be bad power, but that's a lot less likely.
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Hi,
I'm pretty sure it's not a power issue, because the server is connected to two powerful server UPSes (like a few other servers).
I will try replacing the cables from the HBA to the backplane in the next service window, but as I wrote earlier, only 2 things were done before the problem appeared: updating FreeNAS and replacing the HDDs on the backplane. The chassis was not even opened, so the cables should be intact.
I would sooner suspect the disks or the update, but I don't know what I can check apart from the SMART status of the disks. If it helps, I can paste the results from smartctl.
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Yes, but sometimes poor power quality from the town supply = a power problem even on a good PSU.
This chassis has dual PSUs and voltage monitoring on the motherboard. All power lines look good, both under heavy load and at idle.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
You may have had a DIMM go bad. Has the pool been scrubbed recently? Run a memtest to make sure the memory is okay.
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
I may be wrong, but I think a problem with a registered ECC DIMM should be visible in the motherboard health status. Of course I will run a memtest in the next service window, but maybe there is something else I can check on the production server without stopping FreeNAS?
I manually start a pool scrub every 2 days to repair errors, because on every scrub I get checksum errors (10-120 checksum errors per disk).

# zpool status -x
  pool: RZ2-10x6T
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.19M in 21h35m with 0 errors on Wed Mar 30 06:45:11 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        RZ2-10x6T                                       DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/d890a277-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    27
            gptid/d939581b-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    26  too many errors
            gptid/d9eb597a-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    32
            gptid/da905ca3-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    17  too many errors
            gptid/db36ed34-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    28
            gptid/dbdde9aa-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    29
            gptid/dc8f5066-e144-11e5-a2aa-a0369f6c3304  ONLINE       0     0    22
            gptid/dd40f193-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    27  too many errors
            gptid/dde7c0e8-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    20  too many errors
            gptid/de8f0046-e144-11e5-a2aa-a0369f6c3304  DEGRADED     0     0    40  too many errors
        logs
          gptid/defa2526-e144-11e5-a2aa-a0369f6c3304    ONLINE       0     0     0
        cache
          gptid/ded277dd-e144-11e5-a2aa-a0369f6c3304    ONLINE       0     0     0
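
For anyone who wants to track these counters between scrubs, the device rows of the status output can be tallied with a short script. This is only a rough sketch: the function name `cksum_by_device` and the regular expression are illustrative assumptions, the sample is three rows copied from the output above, and the pattern only handles plain integer counters (it would miss abbreviated values such as `1.2K`).

```python
import re

# Matches a device row: name, state, then the READ, WRITE and CKSUM counters.
# Any trailing note such as "too many errors" is ignored.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_by_device(status_text):
    """Return {device_name: cksum_count} for the gptid/... rows only."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and m.group(1).startswith("gptid/"):
            counts[m.group(1)] = int(m.group(5))
    return counts

# Three rows copied from the status output in this post.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
RZ2-10x6T DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/d890a277-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 27
gptid/d939581b-e144-11e5-a2aa-a0369f6c3304 DEGRADED 0 0 26 too many errors
logs
gptid/defa2526-e144-11e5-a2aa-a0369f6c3304 ONLINE 0 0 0
"""

counts = cksum_by_device(SAMPLE)
affected = [dev for dev, n in counts.items() if n > 0]
print(f"{len(affected)} of {len(counts)} devices report checksum errors")
```

Run against the full output, a tally like this makes the pattern easy to see: every disk in the raidz2 vdev has errors while the log and cache SSDs have none, which tends to point at something the data disks share rather than at any single drive.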
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
It sounds like the old disks might have been SATA II, (3Gbps), or even SATA I, (1.5Gbps).
And the new disks SATA III, (6Gbps). If that was the case, and the disk data cables were of
lower quality, then the higher speed of the new disks causes problems.

That would explain why it worked before with only 2 things changed.

Of course, there was a FreeNAS update too...
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Unfortunately all of the removed disks are SATA3 (I double-checked all the old disks).

I will try replacing the HBA cable and running memtest on Sunday, but if someone has any other ideas about what I can check in production, I will be grateful.
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Hi. I replaced the HBA cables and ran memtest86+ for 6-7 hours. Memtest didn't show any errors. After a scrub, FreeNAS found a few CKSUM errors on all disks, and all errors were repaired. So I started another scrub to verify whether replacing the HBA cables resolved the problem, but I still get checksum errors. Any other ideas what I can do?

Maybe something is wrong with my tunables after the upgrade. Can someone verify them:
kern.ipc.maxsockbuf 2097152
net.inet.tcp.delayed_ack 0
net.inet.tcp.recvbuf_max 2097152
net.inet.tcp.sendbuf_max 2097152
vfs.zfs.arc_max 15569256448
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_noprefetch 0
vfs.zfs.l2arc_norw 0
vfs.zfs.l2arc_write_boost 40000000
vfs.zfs.l2arc_write_max 10000000
vm.kmem_size 31138512896

I have 32GB of registered ECC RAM.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Well, you have 32GB of RAM and a max ARC of 14.5GB... yeah, that's a problem. Uncheck autotune, delete those settings and reboot. Autotune isn't giving you the checksum errors, but it's hamstringing your system pretty badly.
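
As a quick check of that figure, the byte values from the tunables list can be converted to GiB. A minimal sketch; the helper name `to_gib` is just for illustration, and the two values are copied from the list posted above.

```python
GIB = 1024 ** 3  # bytes per GiB

# Values copied from the tunables list posted in this thread.
tunables = {
    "vfs.zfs.arc_max": 15569256448,
    "vm.kmem_size": 31138512896,
}

def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB

for name, value in tunables.items():
    print(f"{name} = {to_gib(value):.1f} GiB")
# prints:
# vfs.zfs.arc_max = 14.5 GiB
# vm.kmem_size = 29.0 GiB
```

So the ARC is capped at a little under half of the installed 32GB.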
 

dev246

Dabbler
Joined
May 15, 2014
Messages
16
Enabling autotune and upgrading to 9.3.1 had no effect on the CKSUM errors.
If writing to disk is problematic for the HBA, maybe there is some way to slow down the write speed to the disks?
Or some other way to resolve the problem?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You will probably have to swap things until you find the solution. Most likely in this case is a bad cable to the expander, a bad expander, or a bad connector/controller that the expander connects to. This might be a case where contacting SuperMicro or the seller might help.
 