We've seen an issue twice now in production. When a disk in pool X fails, iSCSI latency skyrockets to 2 seconds and/or we see iSCSI disconnects on both ESXi and KVM hosts (CentOS 7 and Ubuntu 20.04). The VMs sometimes recover, sometimes they blue-screen (Windows), and sometimes they end up with a read-only filesystem that requires fsck to resolve (Linux).
When the HDD failed in pool1, we saw errors on the NVMe disks (log excerpt below), although those disks were never marked offline in zpool status.
The same thing happened when we were running a POC on old test hardware, on both CORE and SCALE, and again in my homelab with some old WD Red drives when one of them failed.
This past weekend a disk failed with 2 read errors. TrueNAS marked the disk as FAULTED as expected; however, the iSCSI mounts on the ESXi and KVM hosts experienced intermittent issues for about 4 hours, starting at the moment TrueNAS faulted the disk. By issues, I mean path-evaluation messages on the ESXi side, performance-deterioration entries in vmkernel.log, and so on. The issues eventually resolved themselves.
When the drive failed, every zvol on either the HDD pool or the NVMe pool experienced the same issues.
What we can't seem to understand is why a hard drive failing in pool X causes problems with our NVMe pool Y and with iSCSI access in general, and why a single disk failure causes this much disruption to every zvol consumed by ESXi and KVM hosts from this TrueNAS SCALE instance.
Has anyone else experienced this? What can we do to ensure that a failed disk doesn't result in a production outage?
TrueNAS version:
TrueNAS-SCALE-22.12.0
Hardware:
Manufacturer: Supermicro
Product Name: SYS-220U-TNR
2x Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
256GB DDR4 memory
2x 128GB Optane persistent memory for log on HDD pool
NVMe Controllers - 2x AOC-SLG4-4E4T-O
SAS controller - LSI SAS3008
Boot volume: 2x SATA SSDs connected to motherboard SATA ports
Intel X710 quad-port 10GbE network cards
Pool 0 - NVMe (in server chassis)
RAIDZ1 - 2x 6-drive vdevs
One pool spare
Pool 1 - HDD (Supermicro SAS3 JBOD)
RAIDZ2 - 3x 8-drive vdevs (no spare)
Log vdev on a mirrored pair of 128GB Optane persistent memory
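For anyone wanting to sanity-check the topology, the layout above corresponds roughly to the following zpool commands. This is purely illustrative; all device names are hypothetical placeholders, not our actual devices:

```shell
# Pool 0 - NVMe: two 6-wide RAIDZ1 vdevs plus one pool spare
# (hypothetical device names)
zpool create pool0 \
  raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 \
  raidz1 nvme6n1 nvme7n1 nvme8n1 nvme9n1 nvme10n1 nvme11n1 \
  spare nvme12n1

# Pool 1 - HDD: three 8-wide RAIDZ2 vdevs plus a mirrored Optane SLOG
zpool create pool1 \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh \
  raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
  raidz2 sdq sdr sds sdt sdu sdv sdw sdx \
  log mirror pmem0 pmem1
```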
Cross-posting here as I put the original in a legacy location...
Log excerpt from when the HDD failed in pool1 (note the NVMe disks were never marked offline in zpool status):
root@truenas01[~]#
2023 Sep 30 09:37:15 truenas01.sec1 md/raid1:md123: Disk failure on sdz4, disabling device.
md/raid1:md123: Operation continuing on 1 devices.
2023 Sep 30 09:37:16 truenas01.sec1 md/raid1:md125: Disk failure on nvme7n1p1, disabling device.
md/raid1:md125: Operation continuing on 1 devices.
2023 Sep 30 09:37:16 truenas01.sec1 md/raid1:md124: Disk failure on nvme4n1p1, disabling device.
md/raid1:md124: Operation continuing on 1 devices.
2023 Sep 30 09:37:17 truenas01.sec1 md/raid1:md126: Disk failure on nvme0n1p1, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
2023 Sep 30 09:37:17 truenas01.sec1 md/raid1:md127: Disk failure on nvme3n1p1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
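For anyone trying to correlate the same symptom, the md arrays named in the log (on SCALE these are typically the small mirrored partitions, e.g. swap, that the installer creates on data disks) can be mapped back to their member partitions with standard Linux tooling. A minimal sketch, assuming a Linux host; md123 is the array name from our box and will differ on yours:

```shell
# Sketch: map the md arrays from the log back to their member partitions.
if [ -r /proc/mdstat ]; then
    # Lists every md array with its member partitions and [UU]/[U_] health
    cat /proc/mdstat
else
    echo "no md arrays on this host"
fi

# Full state of a single array (needs root and mdadm installed), e.g.:
#   mdadm --detail /dev/md123
```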