Current iSCSI recommendations with large RAM?


Stilez

Guru
Joined
Apr 8, 2016
Messages
529
TL;DR:

I think I'm hitting the well-known ZFS issue where iSCSI tanks because a large amount of NAS RAM leads to very large transaction groups, which in turn cause server-side timeouts on iSCSI. The most recent advice on this in these forums is quite old and not really suited to 11.0/11.1. I also use my NAS for other things where tiny transaction groups would be a blow.

I understand that iSCSI might need manual tuning of the transaction group size, and perhaps other config, to prevent timeouts, and that this can conflict with settings used for high performance elsewhere. The system has 96GB ECC, a fast Xeon on Supermicro, a mirrored pool (3 x disks, each around 140MB/s steady), a P3700 ZIL + NVMe L2ARC, and 10G networking - all factors that would suggest it might be prone to the transaction group issue in that bug report. The iSCSI initiator's Windows Event logs are stuffed with thousands of iSCSI timeout errors, while Samba on the same client is 100% reliable (although LAN speeds aren't optimised yet), which adds to my reasons for thinking this is the issue I'm hitting. But the step-by-step "what to do as at FreeNAS 11.x" information is very minimal, and I don't want to kill my other performance by randomly fiddling with "magic settings" I don't fully understand.

On the assumption that my issue is iSCSI timeouts, and that these are due to transaction group size / other ZFS config:
  1. How do I approach fixing the problem these days, and, beyond that, what general approaches will help me get optimal performance along with reliably working iSCSI? (Set a txg TIME rather than a txg SIZE?)
  2. What kinds of stats / CLI commands / tunables will be most relevant to identify the bottleneck and get a balanced tune for my workload? (The sketch after this list shows the sort of thing I mean.)
  3. Should I manually let the Chelsios offload iSCSI to help the system? Or run it as an NFS/Samba share instead?
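
To illustrate question 2, this is roughly the kind of monitoring I have in mind - standard FreeBSD commands run over SSH while reproducing the stall; "tank" is a placeholder pool name, not my actual one:

    # per-vdev bandwidth and operations, refreshed every second
    zpool iostat -v tank 1
    # per-disk busy %, queue depth and latency (physical providers only)
    gstat -p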

OTHER DETAIL STUFF (IF USEFUL):

My NAS is mainly used in 3 ways - file server/archive for a small Windows LAN, VM store for ESXi, and iSCSI store for a Windows workstation whose data I want directly on the NAS itself.

The hardware is fine - Supermicro, 3.5GHz Xeon, Chelsio 10G, 96GB 2400 ECC, Intel P3700 ZIL, 250GB NVMe L2ARC, and 16 x 6TB Enterprise HDDs in mirror vdevs - and I'm getting write-to-NAS-and-read-back-to-verify speeds between 100 MB/s and 500 MB/s depending on the exact task, with 490 - 510 MB/s steady state on Samba for a wide range of file size mixes. (I'd expect to get closer to 1GB/s than 500 but can't complain!)

When testing the Windows iSCSI store to check its performance, and how well ARC/L2ARC helps it maintain decent speeds, I hit an issue where iSCSI flies along reporting implausible speeds of 1GB/s+ for a few tens of seconds, then crashes to zero for minutes. Sometimes it eventually crawls back up to a few tens of MB/s, sometimes it doesn't seem to. The Windows logs also show thousands of iSCSI timeout-related issues.

My gut feeling is that it's the iSCSI transaction group issue: Samba isn't hitting such timeouts, whereas iSCSI stalls after some time in a way that suggests some kind of cache is filling and then traffic halts, accompanied by timeout-related events, on a system likely to be prone to the known large-RAM + txg issue. That said, it could also be MTU issues, misconfigured NICs (I set them all to "maximum buffers" etc. on the Windows side), some registry stuff or aux settings on the NAS - or something completely different. Other details: the Windows station I'm testing on has an octo-core Xeon v3 and over 64GB free, so it should have enough cores/threads; jumbo frames are in use on all NICs and on my 10G managed switch; VLANs aren't in use but will be in future; all HW is 9000+ MTU capable (some NICs are specifically 9014 rather than 9000 capable); the pool is about 50% full (roughly 22TB capacity with just under 10TB free); the iSCSI zvol is 2TB and empty for this speed testing, so it's not forcing fragmentation; and the network is otherwise quiet.
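
For what it's worth, one standard way to rule out an MTU mismatch somewhere on the path (not something specific to my setup - the client IP below is a placeholder) is a don't-fragment ping at full jumbo size from the NAS shell:

    # 8972 bytes of payload + 28 bytes of ICMP/IP headers = a 9000-byte frame
    # -D sets the don't-fragment bit, so this fails if any hop can't pass jumbo frames
    ping -D -s 8972 192.168.1.50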

I've tried to find an update on the iSCSI txg recommendations for FreeNAS 11 or 11.1, but I can't; the most recent firm advice on the forums is quite out of date.
 

mav@

iXsystems
Joined
Sep 29, 2011
Messages
1,428
If you think the problem you hit is caused by transaction group size, you can try to tune it via a set of sysctls/loader tunables:
vfs.zfs.dirty_data_max_percent -- maximum percentage of RAM to be used for dirty data;
vfs.zfs.dirty_data_max -- the real result of the above, in bytes;
vfs.zfs.dirty_data_sync -- minimal amount of dirty data to start TXG syncing;
vfs.zfs.txg.timeout -- maximum TXG timeout, in seconds.
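
For illustration only - the values below are placeholders, not a recommendation - these can be read, and the runtime-writable ones changed, from the FreeNAS shell; dirty_data_max_percent, as far as I know, only takes effect as a loader tunable, and anything you want to persist across reboots should go in System -> Tunables:

    # read the current values
    sysctl vfs.zfs.dirty_data_max_percent vfs.zfs.dirty_data_max \
           vfs.zfs.dirty_data_sync vfs.zfs.txg.timeout

    # example only: cap dirty data at 2 GiB and sync TXGs at least every 5 seconds
    sysctl vfs.zfs.dirty_data_max=2147483648
    sysctl vfs.zfs.txg.timeout=5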

Reducing those sizes and the timeout will make ZFS sync transactions more often, spending proportionally more of each transaction on metadata and so increasing write overhead, but it can indeed make the traffic less spiky. Unfortunately those tunables are global, so they may not be perfectly tuned for different workloads at the same time.

On the other side, at some point ZFS got a write throttling mechanism that should partially equalize write throughput, hopefully making this problem less significant. There are some other tunables to control the throttle logic, but that needs deeper understanding.
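
For reference, the throttle knobs in question on FreeBSD 11-era ZFS appear to be the two below (an assumption based on the stock sysctl names; reading them is harmless even if changing them needs care):

    # delaying of writes begins once dirty data passes this percentage of dirty_data_max
    sysctl vfs.zfs.delay_min_dirty_percent
    # scale factor controlling how sharply the delay grows beyond that point
    sysctl vfs.zfs.delay_scale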

Speaking about other possible reasons, I am not sure I understand your pool configuration correctly. In one place you say "mirrored pool (3 x disks)", but later "16 x 6TB Enterprise HDDs in mirror vdevs". Which of those is true? Or do you have several pools? For iSCSI operation it is usually good to have as many disks and top-level vdevs as possible to maximize IOPS. Numbers like the 140MB/s you've mentioned are rarely achievable in practice due to fragmentation and the resulting head seeks.

Also, you may use the `ctladm dumpooa` command in the command line to see exactly which commands are blocked in your case.
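
A minimal way to use it - the loop is just a generic shell idiom, nothing FreeNAS-specific - is to run it repeatedly while reproducing a stall and watch for commands that stay queued for a long time:

    # print the CTL Order Of Arrival queue once a second during the stall
    while true; do ctladm dumpooa; sleep 1; done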

Speaking about networking, if you have a decent Chelsio NIC, then using jumbo frames should not give you much additional performance, so I would not insist on using them; jumbo frames can cause some complications for OS memory management. Speaking about iSCSI offload, the latest FreeNAS does include TCP and iSCSI offload support for Chelsio NICs and you may experiment with them, but they are not enabled by default since random issues still appear there, while the benefit is usually not huge, since the NAS CPU is rarely enough of a bottleneck to make heavy offload worthwhile.
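
As a rough sketch only (module names are assumed from the stock FreeBSD cxgbe(4)/cxgbei drivers - check the man pages before relying on this), you can at least see whether the offload modules are present before experimenting:

    # t4_tom provides TCP offload for Chelsio T4/T5 NICs, cxgbei the iSCSI offload on top of it
    kldstat | grep -E 't4_tom|cxgbei'
    # load them by hand for a test; for persistence they would normally go in loader tunables
    kldload t4_tom cxgbei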
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
That's really good, useful information, @mav@. I'll digest it and try it out - obviously it'll take a few days to see where it gets me.

As for the pool - the main pool just had some disks added. The current config is 4 vdevs, each a 3-way mirror of 3 x 6TB HDDs. The backup server has the same layout but non-redundant (I figure it's extremely unlikely I'll lose a backup disk at the precise moment some disaster kills all the mirrors in the master, and if that happens it'll probably take out the backup disks as well!). So it's 4 x 3-way vdevs (12 disks) in the master plus 4 single-disk vdevs in the backup; I should have described just the master, but I counted the backup disks too by mistake, which is where the 16 came from. The mirrored vdevs do give a nice speed (as do decent RAM and L2ARC/ZIL) - I just reinstalled my workstation and I'm currently getting > 1GB/sec across Samba, so I should do okay on iSCSI... we shall see.
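
For clarity, the master pool layout is equivalent to something like this (pool and device names are placeholders, not my actual ones):

    # 4 top-level vdevs, each a 3-way mirror of 6TB drives
    zpool create tank \
        mirror da0 da1 da2 \
        mirror da3 da4 da5 \
        mirror da6 da7 da8 \
        mirror da9 da10 da11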

Once the new disks finish resilvering, I'll retry iSCSI and see if anything's improved, and if necessary experiment with the info + sysctls above. Then I'll post back with the results, but it might take a bit of time to be sure.
 