NFS Writes over Load Balanced aggr cause ZFS wait/lock/deadlock

Status
Not open for further replies.

hcamacho

Cadet
Joined
Apr 30, 2013
Messages
2
Over the weekend I upgraded from 8.3.1-RELEASE to 8.3.1-P2. After the upgrade I added a load-balanced aggr (2x gigabit ethernet going to a Cisco 3550) and then started performing Storage VMotions from VMware. After a few minutes of high I/O, the ZFS file system would stop serving data both over the network and locally. After cd /mnt/zfsstore, a subsequent ls -la hung.

zpool status showed state ONLINE for all devices:

[root@freenas] /mnt/prod# zpool status
  pool: prod
 state: ONLINE
  scan: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	prod                                            ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/e5e94455-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	    gptid/e68d3895-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	    gptid/e71dba22-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	    gptid/e7b8261b-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	    gptid/e848a5b7-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	logs
	  mirror-1                                      ONLINE       0     0     0
	    gptid/e89063ae-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	    gptid/e8c4fb8a-9f32-11e2-8748-000eb6285fb8  ONLINE       0     0     0
	cache
	  gptid/e8fbeb72-9f32-11e2-8748-000eb6285fb8    ONLINE       0     0     0

errors: No known data errors

I could only recover by rebooting the system. The system came up without errors.

I thought perhaps VMware was doing something funny, so I mounted a file system from my Mac and was able to replicate the problem using several instances of iozone -A.

To determine whether the problem was network-related or purely local, I ran several iozone instances from the FreeNAS shell: 160+ MB/s for 10 hours, and the problem did not appear.
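A sketch of that local run, assuming iozone is installed on the FreeNAS box and the pool is mounted at /mnt/prod (file names and instance count are illustrative):

```shell
# Several concurrent full-auto iozone passes against the pool;
# each instance gets its own test file so they don't collide.
for i in 1 2 3 4; do
    iozone -A -f /mnt/prod/iozone.$i &
done
wait    # block until all background iozone instances finish
```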

I added the Mac NFS client back into the mix, and shortly thereafter the problem returned.

I shut down one leg of the load-balanced aggr, and I have not been able to replicate the failure since, despite running many Storage VMotions and Mac NFS iozone instances.

Since I have a setup that can reliably replicate the problem, if a developer would like to ask me questions, I'd be happy to assist in figuring out whether this is a bug in FreeNAS or FreeBSD, or something else entirely.

System Configuration:
5x 1TB SATA drives
2x Samsung 840 128GB (mirrored ZIL)
1x Samsung 840 128GB (L2ARC)
8GB memory
8GB USB flash boot device
2x AMD Opteron 250 @ 2.4GHz
Supermicro H8DAE
LSI SATA 300-8X
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Your info is lacking the specifics needed to identify the problem (you are asking for help in this area), but my first guess is that it has to do with ESXi.

This is one of the long list of reasons why ZFS + ESXi = major problems. Even problems that would either correct themselves or be minor on a FreeNAS machine without virtualization suddenly become major show-stoppers with virtualization. Even small things like a single bad sector in a zpool are minor on bare metal, but add in virtualization and you can have a zpool that randomly freezes indefinitely because of the virtualization layer.

If you search the forums, the ESXi + FreeNAS + ZFS = bad joo joo combination has been discussed to death.

Maybe the ESXi geniuses on the forum will have some good ideas... I know that link aggregation doesn't work the way 99% of people think it works, and it could be that you are just asking too much of ESXi and LA. I don't know.
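To illustrate the link-aggregation point: a lagg doesn't stripe a single connection across both links. The switch and host hash each frame's addresses to pick one member port, so one client/server pair is typically pinned to a single gigE no matter how many links are in the bundle. A toy sketch of such a hash (the MAC values are purely illustrative, and real hash policies vary by switch and lagg protocol):

```shell
# Toy layer-2 hash, in the style an etherchannel/lagg uses to pick a
# member port: XOR the source and destination MAC addresses, then take
# the result modulo the number of links. The same MAC pair always maps
# to the same port, so a single NFS flow never exceeds one link.
src=0x000eb6285fb8   # example source MAC (illustrative)
dst=0x001122334455   # example destination MAC (illustrative)
nlinks=2
port=$(( (src ^ dst) % nlinks ))
echo "flow pinned to lagg port $port"
```

Traffic only spreads across links when there are many distinct address pairs in flight.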
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
See bug 1531 and also the discussions of NFS sync writes on the forum. Your pool is likely just too slow/unresponsive under heavy load.
 

hcamacho

Cadet
Joined
Apr 30, 2013
Messages
2
See bug 1531 and also the discussions of NFS sync writes on the forum. Your pool is likely just too slow/unresponsive under heavy load.

That is very interesting. I do have a mirrored ZIL on solid-state drives, so I am not sure why my pool would be slow for writes. The ZIL never got over 1GB used. With one leg of the aggr shut, I have not had an ounce of problems, and I've been hitting the thing pretty hard.

Additionally, I think bug 1531 was about delays and performance; in my case the ZFS file system stops all operations entirely, and it stays that way until I reboot the system.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The ZIL has very little to do with write throughput; "dd if=/dev/zero of=/mnt/pool/file bs=1048576" basically won't touch your ZIL device, but it will be a firehose of pool writes. If you can catch the system getting hung up in the kernel in a txg flush (see 1531 and the use of Ctrl-T while dd'ing), even for short periods, ZFS may be overestimating your pool's write throughput.
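Concretely, a sketch of that test, assuming the pool from the first post (the exact Ctrl-T wait-channel strings vary; the comment describes typical FreeBSD output, not guaranteed text):

```shell
# Firehose of asynchronous pool writes; these bypass the SLOG because
# dd issues ordinary (non-synchronous) writes. 20480 MiB total here.
dd if=/dev/zero of=/mnt/prod/ddtest bs=1048576 count=20480
# While it runs, press Ctrl-T to send SIGINFO. FreeBSD prints the
# process state and wait channel; seeing dd stuck in a txg/tx_sync
# wait for long stretches suggests it is blocked on a
# transaction-group flush, i.e. the pool can't keep up.
rm /mnt/prod/ddtest   # clean up the test file afterwards
```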

See, the solution to that is to fix ZFS's estimate of your pool's write throughput, so if things work at a slower speed (single gigE) but fail at a higher speed (dual gigE), then shutting one leg looks like it could be an accidental version of my fix.

Locking up entirely? That sounds more like a hardware/interrupt/etc. problem. So what I suggest is this: log in on the console and run a dd locally while also pegging a single gigE, and see what happens. If that works, shut down that interface and try the same experiment with the other. If that also works, stop all pool I/O and run some netperf instances to see whether there's a networking IRQ conflict. And so on.
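The isolation steps above might look something like this; the interface names (em0/em1, lagg0) are assumptions, so substitute your own:

```shell
# 1. Local write load from the console, in the background:
dd if=/dev/zero of=/mnt/prod/ddtest bs=1048576 count=20480 &

# 2. Peg one gigE at a time by detaching the other leg from the lagg
#    (FreeBSD ifconfig syntax), driving NFS/iperf load from a client:
ifconfig lagg0 -laggport em1     # test with em0 only
# ...repeat the load test, then swap legs:
ifconfig lagg0 laggport em1
ifconfig lagg0 -laggport em0     # test with em1 only

# 3. With all pool I/O stopped, drive the network alone (e.g. netperf
#    from a client) to see whether the lockup follows the NIC/IRQ
#    rather than ZFS.
```

Whichever combination reproduces the hang narrows the fault to that layer.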
 

pbucher

Contributor
Joined
Oct 15, 2012
Messages
180
On the dd test: if you set sync=always on the ZFS dataset you are testing, then dd will hit your ZIL for testing purposes.
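For example (the dataset name prod/test is an assumption, substitute your own):

```shell
# Force every write on the dataset through the ZIL/SLOG, so ordinary
# dd traffic exercises the log devices:
zfs set sync=always prod/test
dd if=/dev/zero of=/mnt/prod/test/zilfile bs=8192 count=100000
zfs set sync=standard prod/test   # restore the default afterwards
rm /mnt/prod/test/zilfile
```

A small block size matters here: sync-heavy workloads are dominated by log latency, not streaming bandwidth.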

I have seen NFS get hung in some kind of deadlock when hit with Oracle Database's directNFS client, so it's possible that load balancing triggers the same bug I've run into. I haven't taken the time to go back and track down what exactly hangs it, not to mention that it takes some load on Oracle to trigger it in the first place. In the meantime I've let Oracle use the host OS (RHEL 6) NFS-mounted drive without issue, with many months of uptime now.
 