VMware + iSCSI hourly ctl_datamove issues


paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
Previously my system had some faulty memory, which caused it to reboot and then hang during the boot process.

Now things seem to be working much better, but recently I've been seeing ctl_datamove errors show up at 5 minutes past the hour, every hour, in bursts of about three within a few seconds - like so:

> ctl_datamove: tag 0x166c1 on (1:4:0:0) aborted
> ctl_datamove: tag 0x594e7 on (3:4:0:0) aborted
> ctl_datamove: tag 0xe5cfd8 on (0:4:0:1) aborted

Also, I'm getting additional SAS errors about once a month:
> ses0: da0,pass0: Element descriptor: 'Slot 01'
> ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 0
> ses0: phy 0: SAS device type 1 id 0
> ses0: phy 0: protocols: Initiator( None ) Target( SSP )
> ses0: phy 0: parent 50030480013ccd3f addr 50000c0f01ed1402

... across all slots in the array. I'm running a Supermicro 6027R-E1R12L with 192GB of RAM, a 10Gb Chelsio NIC (with Chelsio optics), an Intel P3700 ZIL, and 12 WD 2TB RE SAS disks. I've got iSCSI running over those two 10Gb connections to my ESXi hosts, round-robin, and on their own network. Not running de-dupe, very vanilla ESXi installs (haven't toyed with any drivers in that regard; they're using Intel X520-DA2s). FreeNAS is hosting the VMDK storage for the ESXi hosts, and the RPM speed has been set to 7200.

The regularity of the ctl_datamove errors makes me think that the ESXi hosts (there are two) are "checking in" or some such thing. I've only got the one VM guest server running now, and it's not doing anything. Accessing and performing tasks on the guest server doesn't cause errors either.

When the ctl_datamove errors occur, the VMs become unresponsive for about a minute, but then pop right back up to working normally.

My guess is that there's something I should look at in the ESXi network settings? I'm just drawing a blank as to which ones to check.
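
To double-check the hourly pattern, here's a minimal sketch (assuming the aborts end up in /var/log/messages with standard syslog timestamps - adjust the path if yours differs) that tallies the ctl_datamove aborts by minute-of-hour:

#!/usr/bin/env python
# Tally ctl_datamove aborts by minute-of-hour from the system log.
# Assumption: standard syslog timestamps ("Aug  9 01:05:12 ...") and the
# default log location; change LOGFILE if your setup differs.
import re
from collections import Counter

LOGFILE = "/var/log/messages"

pattern = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):\d{2}.*ctl_datamove.*aborted")
minute_counts = Counter()

with open(LOGFILE) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            minute_counts[m.group(2)] += 1

for minute, count in sorted(minute_counts.items()):
    print("minute :%s -> %d aborts" % (minute, count))

If everything lands on :05, the next step would be finding out what the ESXi hosts (or FreeNAS itself) kick off at that minute - hourly cron jobs, snapshot or replication tasks, and so on.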
 

jpaetzel

Guest
The datamove errors could be a result of the networking going out to lunch. Is there any chance you could open a ticket at bugs.freenas.org and attach the output of Save Debug? That will give the CTL author a chance to look at the errors as well.

The ses0 messages are the enclosure management driver reattaching to the enclosure. No idea why that happens but it's an innocuous bug that doesn't involve the data path at all. Presumably while it is reattaching you'd be unable to read the enclosure stats and so forth, although I've never been lucky enough to catch one in the act to find out if that is the case.
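
For anyone who wants to catch it in the act, here's a minimal watcher sketch (again assuming the ses0 messages go to /var/log/messages) that tails the log and flags the start of a reattach burst so you can immediately try reading the enclosure status:

#!/usr/bin/env python
# Tail the system log and print an alert when ses0 reattach messages appear,
# so enclosure status can be checked while the reattach is in progress.
# Assumption: ses0 messages are written to /var/log/messages.
import time

LOGFILE = "/var/log/messages"

with open(LOGFILE) as f:
    f.seek(0, 2)  # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if "ses0:" in line and "Element descriptor" in line:
            print("%s ses0 reattach: %s" % (time.ctime(), line.strip()))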
 

bbytes

Cadet
Joined
Aug 8, 2015
Messages
1
I can report the same problem with the latest FreeNAS 9.3 release. At 01:00 this evening a file backup started. At 01:05 FreeNAS showed:

Aug 9 00:00:06 ...... syslog-ng[2685]: Configuration reload request received, reloading configuration;
Aug 9 01:05:12 ....... ctl_datamove: tag 0xe2483f on (3:4:0:0) aborted
Aug 9 01:05:18 ....... ctl_datamove: tag 0x2181397 on (1:4:0:0) aborted
Aug 9 01:05:18 ....... ctl_datamove: tag 0x1deae94 on (0:4:0:0) aborted

What do the numbers between the brackets mean? I just hope my data is safe and without errors.
The same error showed up when data was moved between datastores (vSphere).

I use a Dell R710, 32GB RAM, H200 (LSI 9211-8i IT mode), Intel Pro/1000 quad NIC (2x2 LAGG), an Intel 256GB SSD split between ZIL (32GB) and L2ARC (200GB), and 4x Hitachi 3TB SAS 7200 rpm drives (RAID 1+0).
I have 2 ESX hosts connected through iSCSI MPIO with round robin enabled. I'm currently running around 10 VPS servers.
 

paulatmig

Dabbler
Joined
Jul 14, 2014
Messages
41
In my case it was two things. First, heavy read/write traffic, even with a ZIL, was still slower than the system expected since I was running 7200 rpm drives. It's not harmful to your data, but performance sinks - probably to the degree that it'll be noticed by whatever software you're using.

Second, my ZIL (Intel P3700) was doing TRIM every hour - so if a slow read overlapped with a TRIM run, my systems would hang for about 4-10 seconds. It was brutal. Fixed that by disabling TRIM on my ZIL, and fixed the read lag on my array by replacing my 7200 rpm SAS disks with SSDs (Intel S3500). Expensive, yes, but the responsiveness is so much better!
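
If you want to confirm the overlap before changing anything, here's a minimal sketch (assuming the FreeBSD ZFS TRIM counters are exposed via sysctl as kstat.zfs.misc.zio_trim.success - adjust the OID if your build names it differently) that polls the counter and timestamps each burst, so you can line TRIM activity up against the ctl_datamove aborts:

#!/usr/bin/env python
# Poll the ZFS TRIM success counter and timestamp every increase, so TRIM
# bursts can be compared against the times of the ctl_datamove aborts.
# Assumption: the counter is exposed as kstat.zfs.misc.zio_trim.success;
# change OID if your FreeBSD/FreeNAS build uses a different name.
import subprocess
import time

OID = "kstat.zfs.misc.zio_trim.success"

def read_counter():
    out = subprocess.check_output(["sysctl", "-n", OID])
    return int(out.decode().strip())

last = read_counter()
while True:
    time.sleep(10)
    current = read_counter()
    if current != last:
        print("%s TRIM activity: counter %d -> %d" % (time.ctime(), last, current))
        last = current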
 