MS 2K8R2 cluster and iSCSI issues


tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
Hello All,

We currently have a 3-node MS 2K8R2 cluster backed by FreeNAS 8.3 P1. This cluster has been in use for 3-4 years without any issues.
Since yesterday we have been having serious problems with the cluster going on and off, and some SQL instances won't run at all. At first I thought it was the iSCSI switches, and they were replaced, but we still have the same issues. I checked all connections and settings and nothing seems to have changed. I do see iSCSI errors on the FreeNAS console, but I really don't have a clue what could be causing them.
I really need help with this; any advice would be appreciated. I have attached a screenshot from the FreeNAS console.
Please advise.
Thanks
[Attachment: 2013-03-20 12.12.33.jpg - FreeNAS console showing the iSCSI errors]
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
I'm seeing many of the same errors. I was on 8.3 and am now on 8.3.1. Eventually, they cause all the VMs on the Hyper-V server to crash. I've been researching this for quite some time, but haven't come across any solutions. I think I might have to move away from FreeNAS and go back to a Solaris-based OS and napp-it :(. I ran it that way for over a year without issues.

Riley
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Your initiators are closing the connection. Perhaps your FreeNAS box is responding slowly or performing poorly. This has been discussed extensively over the years. There is a lot of useful information on evaluation strategies and system tuning in the comments on bug 1531, which may or may not be relevant here.
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
Well, I'm not *sure* yet, but I seem to have resolved it, or at least made the issue significantly less prevalent. I changed the QueueDepth to 64 on the target and that seems to have done something. Previously, just copying a large 30GB file via SMB from/to a VM running on the iSCSI target would cause errors almost immediately. I came across another individual having similar issues who indicated it was resolved by setting QueueDepth to 64.

It's possible that the real factor is another variable derived from the QueueDepth, because if I change the depth to 128 I see the issues again.

Currently, I have a 30GB file transfer running that saturates a 1Gb link, plus ATTO running on the Hyper-V host (10Gb link) against the target, with no issues.
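For reference, the knob ends up in the LogicalUnit section of istgt.conf. A trimmed, illustrative snippet (the zvol path is a placeholder; the target name is taken from my logs):

Code:
# /usr/local/etc/istgt/istgt.conf (excerpt)
[LogicalUnit1]
  TargetName target-san01
  Mapping PortalGroup1 InitiatorGroup1
  UnitType Disk
  QueueDepth 64
  LUN0 Storage /dev/zvol/tank/iscsi-hv01 Auto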

Riley
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Make it sufficiently busy and you are likely to run into issues. The trick is to tune everything appropriately to handle a load larger than your busiest load.
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
To stress test, I've been concurrently:

- Running Anandtech's IOMeter benchmark against the box over the 10Gb iSCSI link.
- Copying a 3.4TB file from one ZFS filesystem to another on the same pool.
- Copying some 30GB files to/from the VMs running over the iSCSI link.

So far, it's been rock stable. Performance leaves a little to be desired for uncached data with so much hitting the spindles, but once the L2ARC warms up it's super fast! After the 25GB IOMeter test file went into the L2ARC I was seeing over 15,000 IOPS even with all of the other stuff going on.
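If you want to check whether the L2ARC is actually being hit, the ARC kstats are exposed via sysctl (a quick sketch; counter names can vary a little between ZFS versions):

Code:
sysctl kstat.zfs.misc.arcstats.l2_size      # bytes currently held in L2ARC
sysctl kstat.zfs.misc.arcstats.l2_hits      # reads served from the L2ARC
sysctl kstat.zfs.misc.arcstats.l2_misses    # reads that fell through to the pool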

Here is the system:
Xeon E5-2609 (2.4GHz)
20GB DDR3 ECC (I know, unbalanced config, but I need capacity over bandwidth)
Supermicro X9SRH-7TF (2 x Intel 10Gb onboard)
Norco 4220
1 x LSI 9211 with IT firmware (15.xx I believe)
8 x Hitachi 2TB 5K3000
2 x 256GB OCZ Agility 2 for L2ARC
1 x 100GB (provisioned to 20GB) for ZIL

Riley
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
Well, I may have spoken too soon. Around 2:30 AM last night everything crashed again. The logs are full of errors for approximately 60 minutes straight, until the Hyper-V host finally gave up and shut down the VMs:

Code:
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:1261:istgt_iscsi_write_pdu_internal: ***ERROR*** writev() failed (errno=22,iqn.1991-05.com.microsoft:hv01,time=0)
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:3484:istgt_iscsi_transfer_in_internal: ***ERROR*** iscsi_write_pdu() failed
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:3853:istgt_iscsi_task_response: ***ERROR*** iscsi_transfer_in() failed
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:5392:sender: ***ERROR*** iscsi_task_response() CmdSN=1680056 failed on iqn.2011-03.istgt:target-san01,t,0x0001(iqn.1991-05.com.microsoft:hv01.
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:1261:istgt_iscsi_write_pdu_internal: ***ERROR*** writev() failed (errno=22,iqn.1991-05.com.microsoft:hv01,time=0)
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:3484:istgt_iscsi_transfer_in_internal: ***ERROR*** iscsi_write_pdu() failed
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:3853:istgt_iscsi_task_response: ***ERROR*** iscsi_transfer_in() failed
Mar 29 02:26:11 san01 istgt[54099]: istgt_iscsi.c:5392:sender: ***ERROR*** iscsi_task_response() CmdSN=1680057 failed on iqn.2011-03.istgt:target-san01,t,0x0001(iqn.1991-05.com.microsoft:hv01.
Mar 29 02:26:12 san01 istgt[54099]: Login from iqn.1991-05.com.microsoft:hv01 (192.168.99.2) on iqn.2011-03.istgt:target-san01 LU1 (192.168.99.1:3260,1), ISID=400001370000, TSIH=2, CI$
Mar 29 02:26:18 san01 istgt[54099]: istgt_iscsi.c:1261:istgt_iscsi_write_pdu_internal: ***ERROR*** writev() failed (errno=22,iqn.1991-05.com.microsoft:hv01,time=0)
Mar 29 02:26:18 san01 istgt[54099]: istgt_iscsi.c:3484:istgt_iscsi_transfer_in_internal: ***ERROR*** iscsi_write_pdu() failed
Mar 29 02:26:18 san01 istgt[54099]: istgt_iscsi.c:3853:istgt_iscsi_task_response: ***ERROR*** iscsi_transfer_in() failed


What's strange is that there was no heavy I/O during this time - everything was pretty much idle. Nothing like the load testing I performed earlier. At this point, I don't know what's happening. After ~60 minutes everything goes back to normal again and the Hyper-V host is able to log back in.

Code:
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:1261:istgt_iscsi_write_pdu_internal: ***ERROR*** writev() failed (errno=22,iqn.1991-05.com.microsoft:hv01,time=0)
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:3484:istgt_iscsi_transfer_in_internal: ***ERROR*** iscsi_write_pdu() failed
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:3853:istgt_iscsi_task_response: ***ERROR*** iscsi_transfer_in() failed
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:5392:sender: ***ERROR*** iscsi_task_response() CmdSN=1919968 failed on iqn.2011-03.istgt:target-san01,t,0x0001(iqn.1991-05.com.microsoft:hv01
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:1261:istgt_iscsi_write_pdu_internal: ***ERROR*** writev() failed (errno=22,iqn.1991-05.com.microsoft:hv01,time=0)
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:3484:istgt_iscsi_transfer_in_internal: ***ERROR*** iscsi_write_pdu() failed
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:3853:istgt_iscsi_task_response: ***ERROR*** iscsi_transfer_in() failed
Mar 29 03:29:52 san01 istgt[54099]: istgt_iscsi.c:5392:sender: ***ERROR*** iscsi_task_response() CmdSN=1919969 failed on iqn.2011-03.istgt:target-san01,t,0x0001(iqn.1991-05.com.microsoft:hv01
Mar 29 03:30:52 san01 istgt[54099]: Login from iqn.1991-05.com.microsoft:hv01 (192.168.99.2) on iqn.2011-03.istgt:target-san01 LU1 (192.168.99.1:3260,1), ISID=400001370000, TSIH=214, $
Mar 29 03:38:31 san01 ntpd[2101]: kernel time sync status change 6001
Mar 29 03:55:34 san01 ntpd[2101]: kernel time sync status change 2001


I've read over ticket #1531 and had vfs.zfs.write_limit_shift set to 6 since my box went in. I'm going to try and set vfs.zfs.txg.synctime_ms to 200 and see what happens.
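For reference, both knobs can be set as loader tunables (a sketch, assuming your ZFS version exposes them; some builds also allow changing them at runtime via sysctl):

Code:
# /boot/loader.conf (or the FreeNAS GUI: System -> Tunables)
vfs.zfs.write_limit_shift="6"
vfs.zfs.txg.synctime_ms="200"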

Riley
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I don't recommend running iSCSI over ZFS. I know, that's not something you want to hear, especially since it's been running fine for years. Unfortunately ZFS slowly fragments the heck out of iSCSI devices and there is no way to defrag ZFS. Also, per the manual, you should max out the RAM on your motherboard before considering an L2ARC. Adding an L2ARC robs the system of ARC memory and can therefore result in a performance decrease.

Have you run any SMART tests on the hard drives or examined the SMART data? If one of the drives is starting to fail and having problems reading data, it can bring down the whole zpool. If you don't know how to interpret the SMART data, you can post it for your hard drives and one of us senior guys can take a look. If you post it, please put it in code tags.
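A quick way to pull that data with smartctl (device names are examples; yours depend on the controller):

Code:
smartctl -a /dev/ada0        # full SMART attributes and error log for one disk
smartctl -t long /dev/ada0   # start a long self-test; re-run -a later to see the result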
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
Thanks for the quick reply. I had a failing disk last week, but that has since been replaced, and no other disks have any pending or reallocated sectors.

I do understand the implications of running a copy-on-write filesystem as back-end iSCSI storage, but I don't believe I should be hitting any sort of performance limit that would cause these issues. Like you mentioned, it had been running for years on OpenIndiana/napp-it with no issues like this.

The thing that really confuses me is that the iSCSI drop-outs seem to happen at random times, even when there is little to no load. I suppose it could be a network issue, but nothing indicates that, and the connection is a direct Cat6a cable between the two Intel X540 NICs.

Thanks!!
Riley
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
When testing, I suggest being very aggressive and, among other things you dream up yourself, setting the pool to do a scrub. Since you note how I "resolved" 1531, do be aware that given sufficient load, one can make a spinning-disk-based iSCSI system unresponsive... so the first thing to do is to make sure you're getting it to be as responsive as possible under stressy conditions. Once you've done that, if you find it dying off more, careful analysis of what's going on at that time is called for (sorry, I know that's "not helpful", but it's the truth).
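If it helps, a scrub is easy to kick off and watch by hand (the pool name here is just a placeholder):

Code:
zpool scrub tank      # start a scrub on the pool
zpool status tank     # shows scrub progress, speed, and any errors found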
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
It just crashed again. At the time I was:

- Running a scrub, it's still going at 350MB/s
- Deleting a 3TB file (successful)
- Running a Hyper-V replication task to another, smaller (4 x 2TB RAIDZ, AMD e-350, 4GB RAM) box.

Really, the only difference I've seen in *this* case is the scrub. I thought that maybe when it crashed in the middle of last night a scrub had kicked off, but that wasn't the case.

So, it's crashed under lots of load and not under load. In either case, nothing is shown in the logs except for istgt errors.

I've taken some screenshots of the reporting graphs. The first shows when the iSCSI connections were dropped. According to the logs it happened at 12:56pm, so the trouble probably began a minute or two before that. You can see that for some reason CPU usage increases after the crash. The only thing I can think of is that the scrub runs faster, so more CPU is used. ix0 is the 10Gb link to the Hyper-V server and em0 is the management interface. Curiously, em0 shows a steady increase in traffic after iSCSI is lost.
[Attachment: stats1.jpg - reporting graphs around the time the iSCSI connections dropped]


Image #2 shows when the scrub was started (12:10pm). There seems to be a curious decrease in traffic on ix0 (iSCSI) and increase on em0 (management) at around 12:20pm. Load also shoots up just before 12:30, but I wonder if that's the scrub "speeding up". I've often noticed that scrubs and resilvers start out extremely slow and then take off.

[Attachment: stats2.jpg - reporting graphs around the time the scrub started]


Riley
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Since this has been running for years without any issues and is probably very fragmented, is there any chance you could back up the iSCSI devices to another drive, delete them from the zpool, then copy them back? This would clean up the fragmentation and may help.

Aside from fragmentation (I know, you said it sometimes crashes under no load) I don't have any other good advice to provide. Obviously, with a copy-on-write filesystem and no defrag tool, a zpool will only get slower over time, and maybe you're hitting the threshold of what your server can do?
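If you go that route, a zfs send/receive round-trip is one way to rewrite the blocks (a rough sketch; the pool and zvol names are made up, and you'd want the initiators logged out first):

Code:
# copy the iSCSI zvol to a second pool, then bring it back
zfs snapshot tank/iscsi-hv01@move
zfs send tank/iscsi-hv01@move | zfs receive backup/iscsi-hv01
# after verifying the copy, destroy the original and send it back
zfs destroy -r tank/iscsi-hv01
zfs send backup/iscsi-hv01@move | zfs receive tank/iscsi-hv01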
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
Well, if I'm hitting the limit of what the pool is capable of, then it is what it is and I will look into other solutions. However, iSCSI connections have been lost with the system just basically sitting there doing nothing. I can hit the pool as hard as I want with no issues (other than a bit of slowness) and it's rock stable. Under OI/napp-it it was the same way.

I think there is some underlying issue with istgt and the Windows iSCSI initiator and/or a specific traffic flow. I've watched it through one of these episodes and the Windows host is unable to even reconnect to istgt for up to 60 seconds at a time. It just hangs there at "reconnecting". It will reconnect for a few seconds and then the connection is lost again. This will happen over and over for a while before going back to normal (last night it went on for an hour). During this time, disk usage is 20-40% according to gstat and iostat -xn.

I've monitored istgt using "top -m io" and istgt is just sitting there, doing nothing. During normal operation everything moves along nicely, and then the istgt process just flatlines with 0s across the board.
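For anyone wanting to watch the same things, these are roughly the commands involved (intervals are arbitrary; on FreeBSD the extended iostat view is -x):

Code:
gstat                 # live per-provider busy % and latency
iostat -x 1           # extended per-device stats, refreshed every second
top -m io -o total    # per-process I/O, sorted by total operations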

Riley
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I'm not sure how the 20-40% is being calculated. In reality you could be doing less than 1Mb/sec but be doing all the zpool can because of the seek time of hard drives. This is where fragmentation is a killer. If ZFS "breathes" and it has 1000 seeks of 1kbyte writes to do, the zpool will be locked out for quite a while while it writes all the data. From what I've seen, once ZFS starts breathing you can't really do anything else until it finishes. This is where a ZIL saves the day, because it allows ZFS to breathe to the ZIL, saving the zpool from being busy for extended periods of time. At least, that's the theory :P
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
Right - with a ZIL all writes to the pool then become sequential. The ZIL "buffers" all the random writes into a transaction group and then sequentially writes that out to the pool. I have a 100GB Intel S3700 provisioned to 20GB for a ZIL.

I'm not sure how the usage is calculated either. However, I can load up the pool so that the disks are almost always over 90% busy and not have issues. Then, out of the blue, istgt will start reporting errors while the pool shows as idle with very few reads or writes per second.

Riley
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Right - with a ZIL all writes to the pool then become sequential. The ZIL "buffers" all the random writes into a transaction group and then sequentially writes that out to the pool.

You have a fundamental misunderstanding of what the ZIL is and how it works. The ZIL is not in the data path and does none of the things you say.
 

Skud

Dabbler
Joined
Mar 26, 2013
Messages
15
You have a fundamental misunderstanding of what the ZIL is and how it works. The ZIL is not in the data path and does none of the things you say.

Why would you say that? I think you're confusing the ZIL with a log device. They are not the same thing. I admit, I am guilty of perpetuating some confusion by saying that I have an SSD for a ZIL. I should have said "I have an SSD dedicated for the ZIL".

The ZIL is just a transaction log. When a write comes in, to maintain POSIX-compliant synchronous writes, the data is buffered in memory and written to the ZIL. Normally, without a dedicated log device, the ZIL is part of the pool. Data comes in, and once the transaction log is full or is due to be flushed, the data is read out of memory and put onto stable storage. Only during power loss or other failure is the ZIL queried and that data read out to the pool.

With ZFS being a copy-on-write filesystem, data is never updated in place, so when a transaction group is committed to disk it is read from memory (or the ZIL in a rare case) and written out sequentially to disk. That is why adding an SSD as a dedicated log device makes such a performance difference in synchronous write situations. ZFS can just put it to the fast log device and then "forget about it" rather than making sure it gets to disk before moving on.

Riley
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, then, look at what you said, and notice that there is nothing correct in the bit I quoted.

Even without a ZIL, ZFS is a CoW filesystem and with the design of transaction groups, blocks within files aren't being overwritten, but rather new space is allocated and data laid down. The ZIL does not change the way that block allocation happens on the pool. If it happened to be sequential with a dedicated SLOG, then it is probably sequential without the dedicated SLOG.

The ZIL is never in the data path except possibly during pool import, to ensure that all sync changes are flushed out to the pool (POSIX compliance). It certainly doesn't "buffer" all the random writes into a transaction group and then sequentially write them out to the pool. It isn't buffering writes, and it has nothing to do with random writes. The ZIL is simply a log of sync writes that ought to have been written to the pool, and might or might not have been; because ZFS has promised some consumer that they've been written, the mechanism exists to make good on that promise.
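To put that in command terms (pool and device names below are placeholders): a dedicated SLOG is only where the intent log lives, and the sync property controls which writes ever touch it; the data itself still reaches the pool through normal transaction groups.

Code:
zpool add tank log da8    # put the ZIL on a dedicated SLOG device
zpool status tank         # the device now appears under "logs"
zfs get sync tank         # standard / always / disabled governs sync-write handling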
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
In effect, your drive will become more fragmented regardless of whether you use a ZIL, and it won't get worse at a different rate because you chose to use one. Also, your fragmentation will only get worse without wiping the zpool clean and starting over. For most everyone, I'd consider that far from trivial to do regularly just to keep iSCSI performance in tip-top shape on ZFS.

I do believe that the writes to the ZIL itself are sequential, but that has no bearing on how or where the data is stored on the zpool itself.
 