So I found an old thread with some great answers from Jgreco talking specifically about tuning ZFS for VMware. I don't want to post to a really old thread, and a search did not find a more current thread. If there is one, please help an old grey-haired man find it...;)
In that thread from 2012, there was discussion of VMware losing its connection to the iSCSI datastore because ZFS flushing caused the iSCSI initiator to drop. This was under FreeNAS 8.0 and 8.2 back in 2012. My question is basically to revive that discussion for ESXi 6.0 and FreeNAS 9.3 and 9.10. I have seen one post from someone stating that 9.10 has been causing this same problem since the update... VMware works for a while, then drops the datastore.
The reason I am asking is that I am in the process of migrating my VMware systems off of a failing, really old SAN onto a FreeNAS iSCSI "SAN". I moved one server over, and I really like the performance and the flexibility it is giving me. It was a big "win". But based on the other user's complaint, it seems like I should spend some time getting the system properly tuned... or at least researching the issue so that I don't create a problem for myself later.
My VMware systems do not have a huge amount of disk I/O. They are mostly utility systems such as DNS, DHCP, and other mostly idle applications. So I'm not creating a performance bottleneck, and I don't care if there is some contention for resources. I've got multiple 10G switch blades in large chassis layer-3 switches, and the 10G ports are mostly idle, carrying perhaps 2 Gb/s of total traffic across all of the switches combined. Most of the network is used for router QA, where the routers typically run at speeds below 100 Mb/s. It is a fast, mostly idle network, with fast but mostly idle VMware servers.
So I was trying to research the tunables that Jgreco cited in 2012. One of them I found, and it seems to default to the recommended value; the other I can't find at all.
Jgreco said "By far the most important caveat is that your ZFS *must* be tuned to be responsive, which almost certainly involves setting vfs.zfs.txg.timeout to 5, and setting write_limit_override to a value substantially lower than you might expect, to limit the amount of stuff ZFS has to flush out to disk. When ZFS is flushing, it may be unresponsive, and that can cause an iSCSI initiator to drop. "
My main FreeNAS system has 64 GB of RAM and 30 TB of disk in a half-filled Supermicro server chassis, with a RAIDZ3 running on 11 drives. I have tried to do this one "right". This is not my first "big stack of drives" under FreeNAS, and I've done it in the distant past under FreeBSD... so I'm pretty comfortable with ZFS in general. However, this IS my first FreeNAS iSCSI SAN. I have no budget for going out and buying a commercial SAN of a size comparable to what I can afford with FreeNAS. So once again I want to do it right, and I humbly submit a request for guidance. I want to do this before I cause a problem.