Slow iSCSI Read Performance

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
Any ideas?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I'm afraid I don't know what's going on. There almost seems to be a pathological issue impacting iSCSI performance in the latest release(s), as performance "out of the gate" on TN12 was very promising; but the potential data corruption issue necessitated a quick follow-up fix.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
I'm afraid I don't know what's going on. There almost seems to be a pathological issue impacting iSCSI performance in the latest release(s), as performance "out of the gate" on TN12 was very promising; but the potential data corruption issue necessitated a quick follow-up fix.

Do you think it would be worth testing bypassing the switch and going directly to the SAN on one host to see if that makes a difference?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Do you think it would be worth testing bypassing the switch and going directly to the SAN on one host to see if that makes a difference?
It's worth a shot certainly, just to eliminate the switch as a potential bottleneck.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
I'll see if I can find some time this weekend to do some testing and update this thread.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
It's worth a shot certainly, just to eliminate the switch as a potential bottleneck.

So I did some more tests, and when I run that dd test on my test SSD pool (just a single Samsung 850 PRO SATA III SSD), I am getting speeds of 3GB/sec. I am not sure how this is possible if it is only SATA III speeds. I even bumped it up to 2M to make sure it's not hitting cache, and it was still doing the same thing.
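
For reference, the dd test I'm running is roughly this (the path is just my test pool's mountpoint and the count is arbitrary; bs is the value I bumped to 2M):

Code:
# write a test file, then read it back
dd if=/dev/zero of=/mnt/SSDTest/ddtest.bin bs=2M count=10000
dd if=/mnt/SSDTest/ddtest.bin of=/dev/null bs=2M
# note: lz4 compression and ARC can massively inflate these numbers when
# the input is all zeroes, which may be where the 3GB/sec is coming from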

Performing the dd test on my standard iSCSI pool, I am seeing the same speeds as my iSCSI performance. It bounces around a little bit at 10-50MB/sec before reaching higher speeds of 120-150MB/sec or so, with a few spikes here and there to 200-300MB/sec. Disk busy doesn't really have any spikes above 50% on any single drive at a time; the average is probably 5-20% most of the time.

I also tried a direct network connection between the host and TrueNAS for iSCSI on a separate network, and I had the same speeds as before. svMotion-to-host read speeds were 50-80MB/sec with a DAVG/cmd of 30. svMotion from the host to iSCSI was 600-800MB/sec.

I also believe I can do svMotion between the local host's storage at about 300MB/sec or so. (I think that's the limit of the M.2 NVMe drive I have. I performed local speed tests on it before and only got around that speed.)

I also svMotioned a test VM to the SSDTest pool, gave it about 15-30 minutes, and then performed an svMotion to the local M.2; it transferred at 300MB/sec. I can perform another test transferring it to the SSDTest pool, waiting a day to make sure the cache is completely cleared out, and then trying an svMotion back to the M.2.

At this time, it seems to be more of an issue with the pool itself than hardware. Maybe it is the disks I have, but I wouldn't think it would be.

Also, when doing a standard vMotion and changing compute only, switch usage shows 10Gbps and it transfers pretty darn quickly, but I also think that's normal for a compute-only vMotion.
 
Last edited:

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
It may very well be my primary iSCSI pool. After leaving the test VM running on the test SSD Pool for a while and then svMotioning it locally, it transferred at full speed for that connection. I didn't think it would be, but it might be the mismatched disks, or at least something going on in my primary pool.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
Also still unsure. While running Veeam backups, I am seeing DAVG/cmd of 300-900. Disk busy is on average 30%. Very slow transfer speeds.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
There is definitely latency at the kernel level of the host, although I don't fully understand how to interpret this. This is while performing a backup on my VMs.
[Screenshot: esxtop output captured during the backup]
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
Alright, I am at a full loss here. I ran a CrystalDiskMark benchmark on a VM that is stored on the iSCSI pool and ran it with a 32GB test size. I was monitoring gstat -dp and esxtop. During sequential reads, disk busy in gstat didn't get above 30% and reads were somewhat slow. DAVG was also anywhere from 30-90. However, 4KiB Q8T8, 4KiB Q32T1, and 4KiB Q1T1 all ran at 100% disk busy and DAVG was lower, around 10-30. All write tests showed high disk busy as well. So it really seems to be sequential reads that are giving me problems.
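
For reference, the monitoring on the TrueNAS side was roughly:

Code:
# per-disk busy %, IOPS, and delete (TRIM) stats, refreshed every second
gstat -dp -I 1s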

I also rebooted all devices and VMs today. I upgraded the switch firmware and swapped out the cable to the SAN. I also added the tunables mentioned on the first page from Samuel Tai and updated TrueNAS to 12.0-U2.1.

I then tried the same test again, but on the test SSD pool, and the initial write speeds for the 32GB file were terrible, around 40MB/sec, with disk usage at 100% for the SSD. Final sequential reads were 130MB/sec while final sequential writes were 450MB/sec. To me this doesn't make any sense.

Anyone have any other ideas?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
DQLEN dropping on your iSCSI LUNs is indicative of either VMware adaptive queueing or SIOC kicking in (and I'd wager you haven't got access to or haven't enabled the latter) - this happens when VMware sees the SCSI sense codes for BUSY or QUEUE FULL.

At this point I'm guessing your disks are just getting bogged down trying to randomly seek around when asked to piece together that large VM for an svMotion. I also wonder if maybe those WD EARS disks are choking things up; did you ever use the wdidle tool to stop their aggressive head-parking?

Can you pull SMART stats on everything and attach as a .txt? If you've got a disk or two that's choking hard and throwing errors that will drag the whole thing down.
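
Something along these lines from the TrueNAS shell (run under sh) will dump it all to text files you can zip up - the ada glob assumes FreeBSD-style device names, so adjust it to match your controller:

Code:
# dump full SMART output for each disk into /tmp as individual .txt files
for d in /dev/ada?; do
  smartctl -a "$d" > /tmp/smart_$(basename "$d").txt
done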

Another question: how old is the pool itself, as in when was it created? Over time, copy-on-write will fragment the data.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
DQLEN dropping on your iSCSI LUNs is indicative of either VMware adaptive queueing or SIOC kicking in (and I'd wager you haven't got access to or haven't enabled the latter) - this happens when VMware sees the SCSI sense codes for BUSY or QUEUE FULL.

I am not sure what that is or how to change it. I don't think I have configured that in ESXi on my hosts. I do know that my two hosts are showing different values for DQLEN when viewing esxtop on each host.

At this point I'm guessing your disks are just getting bogged down trying to randomly seek around when asked to piece together that large VM for an svMotion. I also wonder if maybe those WD EARS disks are choking things up; did you ever use the wdidle tool to stop their aggressive head-parking?

I have not used this tool. Can you provide more info on it and how to use it? Will it do anything to my pool?

Can you pull SMART stats on everything and attach as a .txt? If you've got a disk or two that's choking hard and throwing errors that will drag the whole thing down.

SMART info is attached in a zip file. It includes all drives in my system. I am missing SMART tests on some of my drives because I replaced the drives, and I guess this does not add them to the SMART test cycle; I noticed that today when I was poking around. I originally set up my pool with all drives performing SMART tests regularly.

Another question: how old is the pool itself, as in when was it created? Over time, copy-on-write will fragment the data.

I recreated the pool back in the summer of 2020. I originally made it in 2019 with two RAIDZ2 vdevs of 6 drives each, but was trying to get better performance and went to 6 mirrors. I backed up the data VMs with Veeam, recreated the pool, and then restored the VMs.
 

Attachments

  • smartctl.zip
    29.8 KB · Views: 209

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I am not sure what that is or how to change it. I don't think I have configured that in ESXi on my hosts. I do know that my two hosts are showing different values for DQLEN when viewing esxtop on each host.

Adaptive queue length is enabled by default, and this is likely what's causing the queue depth to get chopped. If the storage array is overwhelmed or fills its own device queue it sends back SCSI codes stating it's BUSY or QUEUE FULL, which VMware responds to by trying to throttle back how much data it's sending.

SIOC (Storage I/O Control) requires you to manually switch it on and is basically "Quality of Service" for storage, it tries to prioritize "fairness" and prevents any one VM or workload from stomping all over the others.
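
If you want to see where the adaptive throttling thresholds sit on your hosts, the relevant advanced settings are Disk.QFullSampleSize and Disk.QFullThreshold - shown here just for reference, I'm not suggesting you change them:

Code:
# view the adaptive queue depth throttling parameters on an ESXi host
esxcli system settings advanced list -o /Disk/QFullSampleSize
esxcli system settings advanced list -o /Disk/QFullThreshold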

I have not used this tool. Can you provide more info on it and how to use it? Will it do anything to my pool?

It's designed to stop the overly-aggressive head-parking timer on the WD Greens - but if you've never heard of it, the proverbial damage is probably already done on the Green drives.

Check this very old thread by @cyberjock for its usage - https://www.truenas.com/community/threads/hacking-wd-greens-and-reds-with-wdidle3-exe.18171/ - the "Ultimate Boot CD" still apparently contains it. Shouldn't do anything to your pool, but again, I say "shouldn't" - if your Green drives decide to give up the ghost, then your pool would likely be damaged.
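
As I recall, usage from the bootable DOS environment goes something like this - double-check against the thread above before running anything:

Code:
REM report the current idle3 (head-parking) timer
WDIDLE3 /R
REM set the timer to 300 seconds (commonly cited as the maximum)
WDIDLE3 /S300
REM or disable the timer entirely
WDIDLE3 /D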

SMART info is attached in a zip file. It includes all drives in my system. I am missing SMART tests on some of my drives because I replaced the drives, and I guess this does not add them to the SMART test cycle; I noticed that today when I was poking around. I originally set up my pool with all drives performing SMART tests regularly.

I took a quick look at the SMART data; only one drive had a single reallocated sector, which is good, but your load cycle counts are enormous. I didn't see a total count under 5 figures for any one drive, most of your Seagates were in the 50K range, and the WD Greens ranged from about 180K on the low end to these two, which are downright narcoleptic with how often they've tried to go to sleep on you:

Code:
mirror0_drive2:
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48702
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1058439

mirror5_drive2:
  9 Power_On_Hours          0x0032   024   024   000    Old_age   Always       -       56135
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       708539


For comparison, I have a drive over nine years spinning with 290 load cycles. Not 290,000 - 290.

Is ditching the Green drives an option, or at least replacing the two called out above?

I recreated the pool back in the summer of 2020. I originally made it in 2019 with two RAIDZ2 vdevs of 6 drives each, but was trying to get better performance and went to 6 mirrors. I backed up the data VMs with Veeam, recreated the pool, and then restored the VMs.

Is there a lot of activity on the datastore itself (deletes, overwrites, changes) that would fragment it? But right now I'm calling "green sus"
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
Adaptive queue length is enabled by default, and this is likely what's causing the queue depth to get chopped. If the storage array is overwhelmed or fills its own device queue it sends back SCSI codes stating it's BUSY or QUEUE FULL, which VMware responds to by trying to throttle back how much data it's sending.

SIOC (Storage I/O Control) requires you to manually switch it on and is basically "Quality of Service" for storage, it tries to prioritize "fairness" and prevents any one VM or workload from stomping all over the others.



It's designed to stop the overly-aggressive head-parking timer on the WD Greens - but if you've never heard of it, the proverbial damage is probably already done on the Green drives.

Check this very old thread by @cyberjock for its usage - https://www.truenas.com/community/threads/hacking-wd-greens-and-reds-with-wdidle3-exe.18171/ - the "Ultimate Boot CD" still apparently contains it. Shouldn't do anything to your pool, but again, I say "shouldn't" - if your Green drives decide to give up the ghost, then your pool would likely be damaged.



I took a quick look at the SMART data; only one drive had a single reallocated sector, which is good, but your load cycle counts are enormous. I didn't see a total count under 5 figures for any one drive, most of your Seagates were in the 50K range, and the WD Greens ranged from about 180K on the low end to these two, which are downright narcoleptic with how often they've tried to go to sleep on you:

Code:
mirror0_drive2:
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48702
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1058439

mirror5_drive2:
  9 Power_On_Hours          0x0032   024   024   000    Old_age   Always       -       56135
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       708539


For comparison, I have a drive over nine years spinning with 290 load cycles. Not 290,000 - 290.

Is ditching the Green drives an option, or at least replacing the two called out above?



Is there a lot of activity on the datastore itself (deletes, overwrites, changes) that would fragment it? But right now I'm calling "green sus"


The main datastore changes would be a media server that transcodes to H.265 and a Minecraft server. The Minecraft server is only 80GB. The media server is 6TB, and I zero out its free space every now and then because my backups get enlarged otherwise.

I do have plans to replace all drives with shucked 8TB Elements, but I have mostly been waiting for drive failures. I got all my drives for free and they are pretty darn old. I have one spare drive left, plus one 8TB that I have already shucked, and I have started buying more. I suspect over the next year I will start replacing the drives. (Going with 8TB Elements because WD has no SMR at 8TB or above, and it's cheaper to buy Elements than 4TB Red Pros. I would rather upgrade now than buy 2TB drives and upgrade again in a year or two.)

Should I leave adaptive queue length enabled in the ESXi config?

As for wdidle, I would think this shouldn't be having an impact on drive performance during sequential reads and writes, right? It should only kick in if the pool goes idle for some amount of time?
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
The main datastore changes would be a media server that transcodes to H.265 and a Minecraft server. The Minecraft server is only 80GB. The media server is 6TB, and I zero out its free space every now and then because my backups get enlarged otherwise.

How often does the content on the media server change, and what method are you using to zero free space?

From esxtop, go to u for disk device stats, then hit f for field changes and make it so that only A and O (NAME and VAAI Stats) are visible. Do you have values under the ZERO or DELETE columns? A proper VMFS setup should pass the UNMAP commands through and register them as DELETE - if you're seeing ZERO it might not be configured for UNMAP.
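
You can also confirm what the host sees for VAAI support with the command below; add -d plus the device's naa identifier if you only want a single LUN:

Code:
# list VAAI primitive support (ATS/Clone/Zero/Delete) per storage device
esxcli storage core device vaai status get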

I do have plans to replace all drives with shucked 8TB Elements, but I have mostly been waiting for drive failures. I got all my drives for free and they are pretty darn old. I have one spare drive left, plus one 8TB that I have already shucked, and I have started buying more. I suspect over the next year I will start replacing the drives. (Going with 8TB Elements because WD has no SMR at 8TB or above, and it's cheaper to buy Elements than 4TB Red Pros. I would rather upgrade now than buy 2TB drives and upgrade again in a year or two.)

See if you can accelerate the replacement, even if you can find another couple of cheap 2TB Seagates to fill in for now. If you have recent backups (although that's the source of these woes, so I'm not banking on it) you could even manually offline the two drives with very high LCC values and see if that changes things; if they're being particularly slow to respond, that could drag your whole pool down.

Should I leave adaptive queue length enabled in the ESXi config?

Yes; the queue depth throttling is a symptom, not a cause.

As for wdidle, I would think this shouldn't be having an impact on drive performance during sequential reads and writes, right? It should only kick in if the pool goes idle for some amount of time?
The Green drives specifically like to park their heads after only a few seconds of idle time, so it's entirely possible they're parking when you're trying to drive I/O to them. Or they've done it so often in the past that the actuator motor for the disk arm is wearing out.

Re: sequential I/O - a virtualization workload is rarely, if ever, sequential. Logically you're reading block 1, 2, 3, 4 from the VMFS level for the given VMDK, but that might be arranged all over your physical disks which forces them to seek back and forth. And that's assuming no other read/writes are occurring, which on a filesystem with multiple running VMs isn't likely.

There's a significant amount of overhead here with your Plex server being a VM on a remote hypervisor. Is it possible and/or have you considered making the media server share an SMB mount point on the TrueNAS/FreeNAS machine itself, and having the Plex VM connect to/index it remotely?
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
How often does the content on the media server change, and what method are you using to zero free space?

I run Tdarr, which automatically transcodes all items imported into the server. I haven't had any major imports recently, mostly about 5-10GB a week. Currently, my backup disk is smaller than the full disk of my VM, but the actual used space should still fit on the backup. I can trick Veeam into only backing up the used space and it will fit. I currently have 1TB free on the backup RAID.

I zero out the free space by using the zerofree tool in Ubuntu. I boot the main Plex VM to a live Ubuntu install and run the command to zero out unused space.
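
The zerofree step looks roughly like this from the live session - /dev/sda1 is just a placeholder for the VM's data partition, which has to be unmounted (or mounted read-only) at the time:

Code:
# zero all unused ext filesystem blocks on the guest disk
sudo apt install zerofree
sudo zerofree -v /dev/sda1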


From esxtop, go to u for disk device stats, then hit f for field changes and make it so that only A and O (NAME and VAAI Stats) are visible. Do you have values under the ZERO or DELETE columns? A proper VMFS setup should pass the UNMAP commands through and register them as DELETE - if you're seeing ZERO it might not be configured for UNMAP.

I believe the main LUN is going to be the middle one. All VMs and VMDKs are thick provisioned.

[Screenshot: esxtop VAAI stats for the iSCSI LUNs]


See if you can accelerate the replacement, even if you can find another couple of cheap 2TB Seagates to fill in for now. If you have recent backups (although that's the source of these woes, so I'm not banking on it) you could even manually offline the two drives with very high LCC values and see if that changes things; if they're being particularly slow to respond, that could drag your whole pool down.

I will do my best. I will probably be buying a drive every two weeks.


There's a significant amount of overhead here with your Plex server being a VM on a remote hypervisor. Is it possible and/or have you considered making the media server share an SMB mount point on the TrueNAS/FreeNAS machine itself, and having the Plex VM connect to/index it remotely?

I have thought about it, but I was trying to go for a setup that allowed me to learn about vCenter, ESXi, and iSCSI. I work as a sysadmin so being able to play around with the technologies is a little more important. Yeah, it would probably be better to use it as an SMB share rather than iSCSI, but I also like having iSCSI because of being able to use HA resources with vCenter.

I do have current backups. The system works and I am not getting any performance issues within the VMs themselves yet. Really the bottlenecks seem to be when doing svMotions and backups. I will work on replacing my drives and hopefully that will help with it.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I run Tdarr, which automatically transcodes all items imported into the server. I haven't had any major imports recently, mostly about 5-10GB a week. Currently, my backup disk is smaller than the full disk of my VM, but the actual used space should still fit on the backup. I can trick Veeam into only backing up the used space and it will fit. I currently have 1TB free on the backup RAID.

I zero out the free space by using the zerofree tool in Ubuntu. I boot the main Plex VM to a live Ubuntu install and run the command to zero out unused space.

I believe the main LUN is going to be the middle one. All VMs and VMDKs are thick provisioned.

[Quoted screenshot: esxtop VAAI stats]

It's sending DELETE commands through so the UNMAP is being properly sent. You're reclaiming space at least but right now I believe there's a lot of fragmentation of your media server's VMDK - it might think it's writing to LBA 1, 2, 3, 4 in the guest, but that might be getting scattered across the full range of disk space on the ZFS pool depending on what gets written where.

I will do my best. I will probably be buying a drive every two weeks.

Let's see what happens with the removal/replacement of the Greens. Do you have an extra/available slot to replace the drive without degrading the mirror (e.g. make it a 3-way mirror first, then remove the Green to return to a 2-way mirror)?
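
The ZFS side of that swap would be roughly the following - pool and device names are placeholders, and you'd likely do this through the UI anyway:

Code:
# attach the spare alongside an existing mirror member (3-way mirror)
zpool attach tank gptid/existing-member gptid/new-spare
# wait for the resilver to finish
zpool status tank
# then drop the worn-out Green, returning to a 2-way mirror
zpool detach tank gptid/old-green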

I have thought about it, but I was trying to go for a setup that allowed me to learn about vCenter, ESXi, and iSCSI. I work as a sysadmin so being able to play around with the technologies is a little more important. Yeah, it would probably be better to use it as an SMB share rather than iSCSI, but I also like having iSCSI because of being able to use HA resources with vCenter.

The reason I ask here is because the iSCSI overhead is chopping those media files up into tiny little pieces (16K max volblocksize unless you've adjusted it) rather than letting them be larger (128K or even 1M if adjusted) chunks if accessed directly over SMB. You could still use an iSCSI ZVOL to hold the VM's boot device, benefit from being able to migrate the VM around live between hosts, and further learn about the VMware technology stack, but let the large media files sit on the SMB share for efficiency reasons. VMFS is also another layer of abstraction that could cause things to fragment around more under read/write/modify.

I do have current backups. The system works and I am not getting any performance issues within the VMs themselves yet. Really the bottlenecks seem to be when doing svMotions and backups. I will work on replacing my drives and hopefully that will help with it.

Keep us updated.
 

clifford64

Explorer
Joined
Aug 18, 2019
Messages
87
It's sending DELETE commands through so the UNMAP is being properly sent. You're reclaiming space at least but right now I believe there's a lot of fragmentation of your media server's VMDK - it might think it's writing to LBA 1, 2, 3, 4 in the guest, but that might be getting scattered across the full range of disk space on the ZFS pool depending on what gets written where.

Would there be a good way to defrag it? At the business I work at, we have quite a few DBs and such on datastores; wouldn't those get fragmented as well?


Let's see what happens with the removal/replacement of the Greens. Do you have an extra/available slot to replace the drive without degrading the mirror (e.g. make it a 3-way mirror first, then remove the Green to return to a 2-way mirror)?

I'll begin working on replacing the greens first, but it might take some time. Unfortunately, I have all drive slots used up. I have a few offline cold spares that I keep around for when I need to do replacements.


The reason I ask here is because the iSCSI overhead is chopping those media files up into tiny little pieces (16K max volblocksize unless you've adjusted it) rather than letting them be larger (128K or even 1M if adjusted) chunks if accessed directly over SMB. You could still use an iSCSI ZVOL to hold the VM's boot device, benefit from being able to migrate the VM around live between hosts, and further learn about the VMware technology stack, but let the large media files sit on the SMB share for efficiency reasons. VMFS is also another layer of abstraction that could cause things to fragment around more under read/write/modify.
Keep us updated.

I take it there is no way to change the volblocksize without re-doing the pool? If I need to re-do it in the future, then I would probably set it up that way first.

I don't have much experience with FreeNAS as a NAS; would you recommend two different pools for that, or an SMB share for Plex and a zvol for iSCSI?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Would there be a good way to defrag it? At the business I work at, we have quite a few DBs and such on datastores; wouldn't those get fragmented as well?

ZFS has no defrag functionality. The only way to do it is to send the data to another pool (svMotion will work here) and then back.
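
If you ever wanted to do that shuffle at the ZFS level instead of via svMotion, it's a replication round-trip along these lines - names are placeholders, and verify the copy before destroying anything:

Code:
# copy the zvol to a scratch pool
zfs snapshot tank/vmware-zvol@defrag
zfs send tank/vmware-zvol@defrag | zfs recv scratch/vmware-zvol
# after verifying the copy, destroy the fragmented original and send it back
zfs destroy -r tank/vmware-zvol
zfs send scratch/vmware-zvol@defrag | zfs recv tank/vmware-zvol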

I'll begin working on replacing the greens first, but it might take some time. Unfortunately, I have all drive slots used up. I have a few offline cold spares that I keep around for when I need to do replacements.

I would consider putting one of the cold spares in place of the 1M-load-cycle Green.

I take it there is no way to change the volblocksize without re-doing the pool? If I need to re-do it in the future, then I would probably set it up that way first.

You can't change volblocksize after a zvol has been created - you'd have to create a new zvol, but even then, don't use a 1M volblocksize as it will really hurt random I/O. You can change recordsize on datasets (eg: SMB mounts) and it will take effect on newly written data.
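
For example - dataset/zvol names and sizes here are placeholders:

Code:
# volblocksize is fixed at creation time for a zvol
zfs create -V 2T -o volblocksize=64K tank/vmware-zvol2
# recordsize on a dataset can be changed any time, but only affects new writes
zfs set recordsize=1M tank/media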

I don't have much experience with FreeNAS as a NAS; would you recommend two different pools for that, or an SMB share for Plex and a zvol for iSCSI?

Two separate pools would be ideal here because it's a very different I/O pattern. The Plex data isn't latency-sensitive, and would do well on RAIDZ2 with a large recordsize. The zvols for iSCSI are latency-sensitive, and should be on mirrors with SLOG and potentially L2ARC. If you can't arrange for two pools though, you could do SMB for the Plex data and zvol for iSCSI, and then control L2ARC usage with the secondarycache property (set to "metadata" for Plex, and "all" for the zvol)
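
The property side of that would be something like this, with dataset/zvol names assumed:

Code:
# L2ARC caches only metadata for the bulk media dataset
zfs set secondarycache=metadata tank/media
# L2ARC caches both data and metadata for the latency-sensitive zvol
zfs set secondarycache=all tank/vmware-zvol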
 