Failed drive causes iSCSI disconnect and high latency

nkoconno

Cadet
Joined
Oct 2, 2023
Messages
3
We've seen this issue twice now in production. When a disk in pool X fails, iSCSI latency skyrockets to 2 seconds and/or we see iSCSI disconnects on both the ESXi and KVM hosts (CentOS 7 and Ubuntu 20.04). The VMs sometimes recover, sometimes they bluescreen (Windows), and sometimes they end up with a read-only filesystem requiring fsck to resolve (Linux).

When the HDD failed in Pool 1, we saw these errors on the NVMe disks, although the NVMe disks were never marked offline in zpool status:

Code:
root@truenas01[~]#
2023 Sep 30 09:37:15 truenas01.sec1 md/raid1:md123: Disk failure on sdz4, disabling device. md/raid1:md123: Operation continuing on 1 devices.
2023 Sep 30 09:37:16 truenas01.sec1 md/raid1:md125: Disk failure on nvme7n1p1, disabling device. md/raid1:md125: Operation continuing on 1 devices.
2023 Sep 30 09:37:16 truenas01.sec1 md/raid1:md124: Disk failure on nvme4n1p1, disabling device. md/raid1:md124: Operation continuing on 1 devices.
2023 Sep 30 09:37:17 truenas01.sec1 md/raid1:md126: Disk failure on nvme0n1p1, disabling device. md/raid1:md126: Operation continuing on 1 devices.
2023 Sep 30 09:37:17 truenas01.sec1 md/raid1:md127: Disk failure on nvme3n1p1, disabling device. md/raid1:md127: Operation continuing on 1 devices.
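In case it helps anyone dig into this, a rough sketch for checking the state and membership of those md devices (device names are taken from the log above and will differ on other systems):

Code:
# Show all md arrays and their current member state
cat /proc/mdstat
# Detail on one of the degraded mirrors named in the log
mdadm --detail /dev/md123
# See which partitions back them - on SCALE these md mirrors are typically the swap
# partitions created on the data disks, but verify this on your own system
lsblk -o NAME,SIZE,TYPE /dev/sdz /dev/nvme7n1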

The same issue occurred when we were running a POC on old test hardware, on both CORE and SCALE. The same thing happened in my homelab with some old WD Red drives when one of them failed, again on both CORE and SCALE.

This past weekend a disk failed with 2 read errors. TrueNAS marked the disk as faulted, as expected; however, the iSCSI mounts on the ESXi and KVM hosts experienced intermittent issues for about 4 hours, starting at the time TrueNAS marked the disk as faulted. By "issues" I mean path evaluation messages on the ESXi side, performance-deterioration entries in vmkernel.log, and so on. The issues eventually resolved themselves.

When the drive failed, all of our zvols, whether on the HDD pool or the NVMe pool, experienced the same issues.

What we can't understand is why a hard drive failing in pool X causes problems with our NVMe pool Y and with iSCSI access in general, and why a single failed disk would cause this much disruption to every zvol consumed by ESXi and KVM and hosted on this TrueNAS SCALE instance.

Has anyone else experienced this? What can we do to ensure that a failed disk doesn't result in a production outage?

TrueNAS version:
TrueNAS-SCALE-22.12.0

Hardware:
Manufacturer: Supermicro
Product Name: SYS-220U-TNR
2x Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
256GB DDR4 memory
2x 128GB Optane persistent memory for log on HDD pool
NVMe Controllers - 2x AOC-SLG4-4E4T-O
SAS controller - LSI SAS3008
Boot volume: 2x SATA SSDs connected to motherboard SATA ports
Intel X710 quad-port 10GbE network cards

Pool 0 - NVMe (in server chassis)
RAIDZ1 - 2x 6-drive vdevs
One pool spare

Pool 1 - HDD (Supermicro SAS3 JBOD)
RAIDZ2 - 3x 8-drive vdevs (no spare)
Log vdev on a mirrored pair of 128GB Optane persistent memory modules

Cross-posting here as I put the original in a legacy location...
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey @nkoconno

I scrubbed the RGB coding from your first paragraph as it was really hard to read on the dark theme. Let me jump in with a couple questions.

1. I see that you're still running the initial release of Bluefin (22.12.0) - is this correct? We've made a lot of updates since then (22.12.3.3) so I'd have to ask if there's an opportunity for a maintenance window to upgrade SCALE to the latest release.

2. You're using iSCSI to ESXi - can you describe the network topology? I see a quad-port X710 but it's also described as "cards" - how many IP addresses/subnets are in use for the IP SAN? Is MPIO being used?

3. Have you manually set sync=always on the zvols being used for your iSCSI extents? This is a question about data safety here, as the default VMware behavior for iSCSI doesn't enforce it on the client side.

Gathering a debug file (System -> Advanced -> Save Debug) may be helpful as well when reporting a bug. However, don't post the debug publicly; either attach it to the bug report after it's been generated (you'll get a reply with a link for attaching it privately) or DM me if you can't get it attached there.
 

nkoconno

Cadet
Joined
Oct 2, 2023
Messages
3
Thanks @HoneyBadger
1. I see that you're still running the initial release of Bluefin (22.12.0) - is this correct? We've made a lot of updates since then (22.12.3.3) so I'd have to ask if there's an opportunity for a maintenance window to upgrade SCALE to the latest release.
Yes, we're still running the initial release. We can attempt to schedule a maintenance window, but I'd need a very good reason, i.e., a known bug in the current version that has since been fixed.

2. You're using iSCSI to ESXi - can you describe the network topology? I see a quad-port X710 but it's also described as "cards" - how many IP addresses/subnets are in use for the IP SAN? Is MPIO being used?
Sorry, I didn't include that.
We have two iSCSI subnets (10.x.10.50 and 10.x.11.50) on TrueNAS.
Each ESXi host has 2 physical 10GbE connections for iSCSI only, one per subnet. Each ESXi host has a vmk adapter bound to the iSCSI storage adapter.
We are using MPIO/Round Robin

3. Have you manually set sync=always on the zvols being used for your iSCSI extents? This is a question about data safety here, as the default VMware behavior for iSCSI doesn't enforce it on the client side.
We have not changed this; it's currently set to standard/default.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks @HoneyBadger

Yes, we're still running the initial release. We can attempt to schedule a maintenance window, but I'd need a very good reason, i.e., a known bug in the current version that has since been fixed.

There have been over 800 bugs fixed since the initial release of SCALE - I'm looking through the changelogs to see if any of them related to SCST or ZFS might account for this directly.

https://www.truenas.com/docs/scale/22.12/gettingstarted/scalereleasenotes/

Sorry, I didn't include that.
We have two iSCSI subnets (10.x.10.50 and 10.x.11.50) on TrueNAS.
Each ESXi host has 2 physical 10GbE connections for iSCSI only, one per subnet. Each ESXi host has a vmk adapter bound to the iSCSI storage adapter.
We are using MPIO/Round Robin
Separate-subnet design is correct for TrueNAS, but are you using the explicit VMware "VMkernel port binding" in this setup? That's intended for single-subnet designs and shouldn't be used with TrueNAS - but if you've just assigned a single vmnic to each network (and/or explicitly set the failover order for the VMkernel port groups), then that's different.
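A quick way to check from the CLI whether port binding is in effect, as a sketch - the adapter name vmhba64 is an assumption here, so substitute your software iSCSI adapter:

Code:
# List VMkernel ports bound to the software iSCSI adapter
# (should return nothing if port binding isn't configured)
esxcli iscsi networkportal list --adapter=vmhba64
# Show the vmkernel interfaces and their subnets for comparison
esxcli network ip interface ipv4 get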

Additional questions to "are you using VMkernel port binding?"
- Are you separating by VLAN as well?
- Do you have an SATP claim rule set up for the devices? If so, can you show me that (esxcli storage nmp satp rule list) - if you haven't, can you show a screencap in vCenter for the rules and ensure they're consistent across all hosts?
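For reference, a claim rule for TrueNAS iSCSI devices would look roughly like the sketch below. The vendor/model strings are assumptions, so verify them against the output of esxcli storage core device list before adding the rule, and note that a claim rule only applies to devices claimed after it exists (so existing LUNs need a reclaim or reboot):

Code:
# Show existing SATP claim rules
esxcli storage nmp satp rule list
# Example rule: claim TrueNAS iSCSI LUNs with round-robin and a 1-IO path switch
# (vendor/model strings are assumptions - confirm with esxcli storage core device list)
esxcli storage nmp satp rule add \
  --satp=VMW_SATP_DEFAULT_AA \
  --vendor="TrueNAS" \
  --model="iSCSI Disk" \
  --psp=VMW_PSP_RR \
  --psp-option="iops=1" \
  --description="TrueNAS iSCSI round-robin"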

We have not changed this; it's currently set to standard/default.
If the zvols are set to standard/default, then your Optane pmem devices aren't being properly used - and there's potentially a risk of data loss in the event of sudden power loss. Switching your volumes to sync=always will likely cause some reduction in performance compared to the current setting (which buffers writes in memory only), so I recommend making this change gradually, one zvol at a time.
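As a minimal sketch of that per-zvol change (the pool/zvol names below are placeholders):

Code:
# Check the current sync setting on a zvol (placeholder name)
zfs get sync pool1/iscsi-zvol-01
# Force synchronous semantics so writes hit the Optane SLOG before being acknowledged
zfs set sync=always pool1/iscsi-zvol-01
# Revert if needed
zfs set sync=standard pool1/iscsi-zvol-01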
 
Last edited:

nkoconno

Cadet
Joined
Oct 2, 2023
Messages
3
Separate-subnet design is correct for TrueNAS, but are you using the explicit VMware "VMkernel port binding" in this setup? That's intended for single-subnet designs and shouldn't be used with TrueNAS - but if you've just assigned a single vmnic to each network (and/or explicitly set the failover order for the VMkernel port groups), then that's different.

Additional questions to "are you using VMkernel port binding?"
- Are you separating by VLAN as well?
- Do you have an SATP claim rule set up for the devices? If so, can you show me that (esxcli storage nmp satp rule list) - if you haven't, can you show a screencap in vCenter for the rules and ensure they're consistent across all hosts?
Sorry, I should have been clearer: I'm not using the "Network Port Binding" option on the software iSCSI adapter; I modified the failover order instead.
1696367059487.png


I do not have a claim rule set, but we're using host profiles for all 3 nodes in the impacted cluster, so they're all compliant and configured the same. The end result is the same, IMO, but I can configure a claim rule if that's the cause.

This is the configuration of all of the iSCSI volumes in ESXi:


Code:
naa.6589cfc0000005248b47b1132be34639
   Device Display Name: TrueNAS iSCSI Disk (naa.6589cfc0000005248b47b1132be34639)
   Storage Array Type: VMW_SATP_DEFAULT_AA
   Storage Array Type Device Config: {action_OnRetryErrors=off}
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1,bytes=10485760,useANO=0; lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba64:C0:T0:L0, vmhba64:C1:T0:L0
   Is USB: false
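For reference, that's the sort of per-device detail esxcli storage nmp device list reports; a sketch for pulling it for a single device (device ID from the output above):

Code:
# Multipathing configuration for one TrueNAS iSCSI LUN
esxcli storage nmp device list -d naa.6589cfc0000005248b47b1132be34639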


If the zvols are set to standard/default, then your Optane pmem devices aren't being properly used - and there's potentially a risk of data loss in the event of sudden power loss. Switching your volumes to sync=always will likely
It looks like your reply was cut off. All of the VMs on both KVM and ESXi are served from the NVMe pool, which did NOT have a disk failure. But in the future, if we do host VMs off the HDD pool, should we configure sync=always for any zvols served from it? The NVMe pool does not have a separate log vdev; I thought it was slower to use the Optane PMEM as a log vdev on an NVMe pool?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Yes, we're still running the initial release. We can attempt to schedule a maintenance window, but I'd need a very good reason, IE known bug in the current version that's been fixed.
While not relevant to the topic of this thread, I must admit that, to me, running SCALE seems a less than optimal choice in such an environment. Having to request maintenance windows, and in addition needing sign-off on the reason, indicates that IT is highly critical to running a sizable business. If that assumption is correct, it is at odds with the current level of maturity and stability of SCALE. Running CORE instead would, for me, be the obvious path.

Anyway, good luck with your problem! (Not meant in a sarcastic or ironic way at all.)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Sorry, I should have been clearer: I'm not using the "Network Port Binding" option on the software iSCSI adapter; I modified the failover order instead.
View attachment 70849

I do not have a claim rule set, but we're using host profiles for all 3 nodes in the impacted cluster, so they're all compliant and configured the same. The end result is the same, IMO, but I can configure a claim rule if that's the cause.

This is the configuration of all of the iSCSI volumes in ESXi:


Code:
naa.6589cfc0000005248b47b1132be34639
   Device Display Name: TrueNAS iSCSI Disk (naa.6589cfc0000005248b47b1132be34639)
   Storage Array Type: VMW_SATP_DEFAULT_AA
   Storage Array Type Device Config: {action_OnRetryErrors=off}
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1,bytes=10485760,useANO=0; lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba64:C0:T0:L0, vmhba64:C1:T0:L0
   Is USB: false



It looks like your reply was cut off. All of the VMs on both KVM and ESXi are served from the NVMe pool, which did NOT have a disk failure. But in the future, if we do host VMs off the HDD pool, should we configure sync=always for any zvols served from it? The NVMe pool does not have a separate log vdev; I thought it was slower to use the Optane PMEM as a log vdev on an NVMe pool?
Good to know that VMkernel port binding isn't in play; it looks like you've got everything sorted, and using Host Profiles (and a dvSwitch) will keep things consistent across your cluster. Your pathing is correct in defaulting to SATP_AA rather than ALUA (ALUA is arriving with Cobia).

Re: the sync=always questions - yes, it looks like part of my reply got eaten. The missing piece: there's potentially a risk of data loss in the event of sudden power loss. Switching your volumes to sync=always will likely cause some reduction in performance compared to the current setting (which buffers writes in memory only), so I recommend making this change gradually, one zvol at a time.

It's about the safety of data - if you're mapping the iSCSI extents directly into the VMs themselves (RDM under vSphere or direct assignment in KVM) then the guest OS should handle cache flushing and coherency, but if they're holding a VMFS datastore it's likely best to have it enforced there even if there aren't live VMs running.

Performance-wise, your Optane DC pmem devices are obviously much faster than the HDDs, so they'll be fine to handle sync writes for that pool. As to whether they'll be faster than your collective NVMe writes, that will depend on what the NVMe devices themselves are. Good SLOG devices have low write latency and high endurance - both of which the Optane devices have in spades - so using them to accelerate the NVMe pool may also have validity. Since you can tweak the sync setting per zvol, you can set it on a single zvol from the NVMe pool and measure the impact on VM performance.
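A rough sketch of that per-zvol test, with placeholder pool/zvol names - watch latency while the change is in effect:

Code:
# Enable sync on one test zvol from the NVMe pool (placeholder names)
zfs set sync=always pool0/test-zvol
# Watch per-vdev throughput and average latency while VMs run against that zvol
zpool iostat -vl pool0 5
# Roll back if the latency impact is unacceptable
zfs set sync=standard pool0/test-zvol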
 