SOLVED Poor iSCSI performance after about a day


Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
All,

I have two servers with identical setups:
Dell PowerEdge 2950 Servers
Dual E5450 3.0 GHz CPU
32 GB ECC RAM
6x 2TB SATA drives in RAIDZ2
Onboard dual Broadcom NICs (for WebGUI, CIFS, and NFS)
Intel Pro/1000 quad-port card (separated with VLANs for iSCSI)
FreeNAS 11.1

I am using one specifically for backups. I had a single portal (all four ports) feeding a VMware datastore (with four connections) and a Windows iSCSI initiator (with a single connection). I've had a few issues where I needed to reboot to regain performance, but nothing horrible.

I recently split the portal into two (each with two ports) to separate the iSCSI traffic, and gave the Windows box two connections with MPIO. Now the whole iSCSI service has a heart attack after about a day (my CIFS and NFS shares seem completely unaffected by this). The machine behaves normally, with CPU barely over 10%, up until the interrupt usage goes crazy. Here are the logs it generates when the problem starts:
Code:
2018-02-01 09:31:27 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): tasks terminated
2018-02-01 09:31:27 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): waiting for CTL to terminate 1 tasks
2018-02-01 09:31:27 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): no ping reply (NOP-Out) after 5 seconds; dropping connection
2018-02-01 09:31:21 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): tasks terminated
2018-02-01 09:31:21 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): waiting for CTL to terminate 9 tasks
2018-02-01 09:31:21 Notification (5) WARNING 10.10.81.32 (iqn.1991-05.com.microsoft:acronis): no ping reply (NOP-Out) after 5 seconds; dropping connection
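
If it helps anyone hitting the same messages: I believe the 5-second figure is the CTL iSCSI ping timeout, which can be inspected (and temporarily raised while debugging) from the shell. A sketch only, assuming the OID exists on your build (check with sysctl -a | grep ctl.iscsi):
Code:
# Current NOP-Out ping timeout, in seconds
sysctl kern.cam.ctl.iscsi.ping_timeout
# Temporarily raise it while chasing the root cause (not a fix)
sysctl kern.cam.ctl.iscsi.ping_timeout=30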

I have tried disabling ARC compression and tweaking the interrupt thresholds per other forum entries I found, but no luck. It takes about 12 to 24 hours for the problem to come back, and it purrs like a kitten until then. I just bumped the jumbo frame MTU to 9216 to see if that makes a difference. I've tried watching with systat -v 1 and top, but I can't find anything conclusive.
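
For reference, here is roughly what I've been watching with, all stock FreeBSD tools (nothing FreeNAS-specific):
Code:
# Per-second view of CPU, interrupts, and disk activity
systat -vmstat 1
# Cumulative interrupt counts per device; run it twice and compare to find the noisy NIC
vmstat -i
# Include system threads so the kernel iSCSI/CTL threads show up
top -SH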

My other box is set up similarly, but it doesn't carry any significant load.

Anyone have any suggestions or tips?
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Search the forum for iSCSI and @jgreco.

He has had similar discussions with other users in the last couple of days. You will find the answers to your questions in his responses.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Search the forum for iSCSI and @jgreco.

He has had similar discussions with other users in the last couple of days. You will find the answers to your questions in his responses.
@jgreco has posted on every thread in every topic, so finding the one situation that matches mine has been like looking for a needle in a haystack (which is why I opened this thread).

The only specific entries I could find are about people trying to use 16GB of RAM or putting all the NICs on one subnet. The more generic entries are about disabling jumbo frames and the inherent problems of RAIDZ with iSCSI (but then other people seem to endorse RAIDZ2, so I don't know).

I was able to find one person who had a failing NIC, but I can trigger the problem on the other server too. I can also try disabling MPIO completely and see how that fares.

Update: Apparently my "Automatically check for updates" is not doing its thing, and 11.1-U1 is available. I will update and get back to you.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Here's a forum search, based on the parameters I suggested in my original message - https://forums.freenas.org/index.ph...post&o=date&c[date]=1517115600&c[user][0]=115

There are only 3 message threads in that search. Please read them all. There's a lot of good information that applies to your environment.

so finding the one situation that matches mine has been like looking for a needle in a haystack

Here's a link to a specific message in one of those threads - https://forums.freenas.org/index.php?threads/iscsi-using-twice-the-space.61123/#post-434542

How full is your pool?
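
For reference, occupancy and fragmentation are easy to check from the shell. A sketch - substitute your pool's name for tank:
Code:
# Size, allocated, free, percent full, and fragmentation for the pool
zpool list -o name,size,alloc,free,cap,frag tank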
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
How full is your pool?

I'm only about 40% full, but if there is that much contention between RAIDZ2 and iSCSI, they need to put a disclaimer in the wizard when you are setting everything up. I already knew that the mirrored vdevs would be more performant, but I wanted the additional space and safety (like if both drives in a vdev die).

I've been using it as a dedup location for Acronis for about a year without this many problems; it has seemingly only been since reconfiguring the portal/target that iSCSI dies on me. My CIFS and NFS shares are completely unaffected and keep doing what they do.

It's possible that the machine is too old to keep up, but it seems completely fine up until some invisible line I keep tripping over.

By the way, the update to 11.1-U1 didn't quite help, but memory is clearly being released immediately after use now. I just removed the MPIO/round robin and will let it run tonight. The next thing would be to remove the tunables (yes, I know a lot of people will criticize me for having them in). Worst-case scenario, I zfs snapshot everything off and rebuild with mirror vdevs. I also opened a support ticket; maybe they can tell me exactly where the holdup is.
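
For the worst case, the plan would be along these lines (just a sketch; tank and backup are placeholder pool names):
Code:
# Snapshot everything and copy it off to a second pool
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F backup/tank
# Then destroy tank, recreate it with mirror vdevs,
# and send the data back the same way in reverse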
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
"Dedupe location" - are you saying that you enabled deduce on FreeNAS. We don't recommend doing so, since it requires lots of RAM. The rule of thumb is 5GB of RAM per 1 TB storage.

Please post a link to your support ticket here.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Update:
I removed the autotune-generated tunables (which I had in place for over a year) and disabled MPIO. It ran smoothly for five days, so I turned MPIO back on. It still seems to be running fine after another day. I'm leaning towards the tunables being the source of this problem.

I'm willing to accept that something in the tunables interfered after upgrading to 11.1, but I find it odd that it only affected the iSCSI/Intel scenario (CIFS, NFS, and FTP work fine on the Broadcom ports). While I know feelings toward the auto-tune feature are mixed, is it possible to suggest a prompt/nag while upgrading that says "Hey, your tunables may be deprecated or no longer supported"?
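
For anyone else cleaning up after autotune: before deleting a tunable, it's worth comparing it against what the running system actually uses. A sketch; these OIDs are just examples of the kind of thing autotune sets:
Code:
# Spot-check values autotune commonly overrides
sysctl vfs.zfs.arc_max
sysctl kern.ipc.maxsockbuf
sysctl net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max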
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
I am marking this as resolved. I lean toward it having been the tunables for memory/network performance, but I'd rather have a stable system than keep trying to induce the fault. I currently have just two tunables, one to ignore MAC changes and the other to ignore SNMP messages.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
inherent problems of RAIDZ with iSCSI (but then other people seem to endorse RAIDZ2, so I don't know).

Well, people endorse all sorts of bad things; that doesn't make them *right*. iSCSI on RAIDZ2 might be okay for certain types of things, especially those guys who want to fake out their PC into thinking that the storage is local so it passes some security check in their games. But then you're likely to run into the next problem: block allocation is complicated by RAIDZ's variable stripe sizes, so you have to get your block sizes "right", and even then you're losing much or most of RAIDZ's speed potential.
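
If you do stay on RAIDZ2 for block storage, the zvol block size is the main knob that interacts with that variable stripe width. A sketch only; the name and sizes are illustrative, not a recommendation:
Code:
# volblocksize can only be set at creation time; larger values waste less
# space to parity/padding on RAIDZ2 but amplify small random writes
zfs create -V 500G -o volblocksize=32K tank/iscsi-extent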

I'm only about 40% full, but if there is that much contention between RAIDZ2 and iSCSI, they need to put a disclaimer in the wizard when you are setting everything up.

It's not contention. It's that it becomes more difficult to allocate space over time, because ZFS is a copy-on-write filesystem. It is totally possible to make things fly. The speed will be related to the number of vdevs you have, but most people creating RAIDZn pools are creating a single vdev, while mirror users usually have one vdev per two or three HDDs. I just finished setting up a filer with 12 HDDs and set up five two-drive mirrors. I know I can get similar performance out of a RAIDZ2 system with five vdevs, but since you really need at least six disks per RAIDZ2 vdev, that turns into a 30-drive pool.
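
A five-mirror pool like that one is created along these lines (da0 through da9 are placeholder device names):
Code:
# Five two-way mirrors = five vdevs; ZFS stripes writes across all of them
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7 \
  mirror da8 da9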

I already knew that the mirrored vdevs would be more performant, but I wanted the additional space and safety (like if both drives in a vdev die).

So move up to three-way mirrors.

I'm willing to accept that something in the tunables interfered after upgrading to 11.1, but I find it odd that it only affected the iSCSI/Intel scenario (CIFS, NFS, and FTP work fine on the Broadcom ports). While I know feelings toward the auto-tune feature are mixed, is it possible to suggest a prompt/nag while upgrading that says "Hey, your tunables may be deprecated or no longer supported"?

autotune probably deserves to be deprecated. It was handier back when ZFS wasn't as well integrated as it is now. ZFS picks up more reasonable defaults than it did five years ago, and FreeNAS ships with better defaults now that something like 8GB is the base memory configuration.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
I have two identical servers. I removed the pool on one and redid it with mirrors, then filled it up with VMs and left it alone with VDP running for a month, and not a single "Waiting for CTL" error has returned on that machine. The other machine I left pretty much as-is, with the Acronis dedupe via iSCSI, and that one will run for about a week or two before CTL chokes. Like before, NFS and CIFS keep working just fine at the same time. I'm torn between just removing iSCSI from my second NAS and rebuilding it with mirrors.

It's not contention.
I don't have any swap usage when CTL crashes, so I don't think I'm too low on memory. It has to be either a problem with running multiple sharing protocols on one box or with using RAIDZ2 for iSCSI. If contention is the wrong word, please let me know what I should call it.

Also, I ran MemTest for 18 hours plus other diagnostics and nothing came up, and everything looks good in iDRAC (IPMI). Temps and power are golden.
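
For anyone repeating these checks, per-drive health is worth reading too. A sketch; da0 is a placeholder, repeat for each disk:
Code:
# Full SMART attributes and error log for one drive
smartctl -a /dev/da0
# Quick overall pass/fail verdict only
smartctl -H /dev/da0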
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Update:
As I was changing the pool on my second box, my first box started having issues again (a new error, but the same problems). I've opened a new bug ticket, and they are steering me toward turning off jumbo frames. I tried this before briefly, but without rebooting to clear the existing memory fragmentation.

This seems to be a problem with the Intel drivers as noted on other FreeBSD sites, so fingers crossed this remedies the issue.

For those following along, changing from RAIDZ2 did stretch out how long it took for this problem to rear its ugly head. But it was very likely a driver change somewhere in the past few updates that broke compatibility with my existing tunables and jumbo frames. I'm not pointing blame, though; this is free software, and I know full well that open source projects can revise and remove code at any time.
 

Tim Sievers

Dabbler
Joined
Jul 30, 2015
Messages
14
Update:
Well over two weeks now and no issues observed. Removing the MTU setting on the interfaces resolved the issue.
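
For anyone landing here from a search, "removing the MTU setting" just means letting the interfaces fall back to the default of 1500. From the shell that is roughly the following (em0 is a placeholder for your Intel port; on FreeNAS, change it in the interface options in the GUI so it survives a reboot):
Code:
# Drop back from jumbo frames to the stock MTU
ifconfig em0 mtu 1500
# Verify the change took
ifconfig em0 | grep mtu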
 