I'm at the limit of what I believe I can see here, and while there are a couple of additional things I still need to do to rule out or isolate hardware, my intent is to gain insight into things I didn't know I could look at, or at least get some additional sets of eyes on this.
I've posted a few things in the past, but I want to get into the full detail here, as something doesn't quite line up.
With that said, here's the HW I'm using.
-ESXi 8
-TrueNAS CORE 13.0-U6 (previous versions behaved no differently)
-EPYC 7543, 16 vCPUs assigned to the VM
-256 GB RAM
-E810-XXV via SR-IOV at 25Gb
-2x 9305-24i passed through. FW 16.00.12.00. Driver 23.00.00.00-fbsd.
-22x Seagate EXOS X12 12TB SAS as 11x mirror VDEVs split across the two HBAs (Model ST12000NM0037)
-3x mirror NVMe as special/metadata
-1M recordsize for media dataset (relevant dataset for this discussion)
-2x HDD and 1x NVMe as hot spares
Client->Server
-10Gb copper to Unifi USW-Enterprise-8-PoE
-10Gb copper from above to USW Pro Aggregation
-25Gb MMF from above direct to NIC on server
-All on same L2
-No jumbo frames
-No flow control
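One quick check I still want to run against the path above is a raw iperf3 run between the Win11 client and the VM, to take SMB and the pool out of the picture entirely (the IP below is just a placeholder):

Code:
# On the TrueNAS VM:
iperf3 -s

# On the Win11 client: 4 parallel streams for 30s, then the reverse direction
iperf3 -c 192.168.1.10 -P 4 -t 30
iperf3 -c 192.168.1.10 -P 4 -t 30 -R

If that can't sustain roughly line rate in both directions, at least part of the client-side numbers further down is a network problem rather than a pool problem.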
This is the testing I've done thus far; ARC and compression were disabled throughout.
NAS Disks.xlsx (on docs.google.com)
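For clarity, "ARC and compression disabled" means roughly the following on the dataset being tested (the dd runs below go against the pool root; primarycache=none is what I mean by no ARC):

Code:
# Sketch of the dataset properties used for the testing below
zfs set compression=off TheEndHouse
zfs set primarycache=none TheEndHouse      # no ARC caching of data or metadata
zfs get compression,primarycache,recordsize TheEndHouse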
And some additional, initial testing of the pool itself:
Write performance
root@truenas[/mnt/TheEndHouse]# dd if=/dev/zero of=/mnt/TheEndHouse/ddfile bs=2048k count=100000
100000+0 records in
100000+0 records out
209715200000 bytes transferred in 91.586038 secs (2289816280 bytes/sec)
Read performance with the cache disabled (cancelled as it was taking forever)
root@truenas[/mnt/TheEndHouse]# dd of=/dev/null if=/mnt/TheEndHouse/ddfile bs=2048k count=100000
30813+0 records in
30813+0 records out
64619544576 bytes transferred in 771.877023 secs (83717409 bytes/sec)
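I didn't capture per-vdev numbers while those dd runs were going; on the next pass I plan to watch the pool at the same time, something along these lines:

Code:
# Per-vdev read/write throughput and IOPS, sampled every 5 seconds while dd runs
zpool iostat -v TheEndHouse 5

If one mirror, or everything behind one of the two HBAs, is sitting well below the rest, it should stand out here.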
Now, I'm cognizant of these disks and their history/known issues. One commonly mentioned issue is that they slow down as they get full; the pool is only about 55% full, so I don't expect that to be a factor here. Given the metrics in the spreadsheet above, my assumption is that, individually, they perform as expected.
Additionally, this...
Code:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2136549412        0         0  2136549412          0      26358.286           0
write:           0        0         0           0          0        517.500           0

Non-medium error count:        0
...is supposedly also known and normal for Seagate. While I neither expect nor believe that, I'm taking it with a large grain of anecdotal salt.
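One way to take some of the anecdote out of that claim is to snapshot the counter on a couple of drives before and after a large transfer and see whether it's actively climbing (the device name below is just an example):

Code:
# Read-error counter row for one drive, before and after a big copy
smartctl -l error /dev/da5 | grep '^read:'
# ...run the transfer, then repeat...
smartctl -l error /dev/da5 | grep '^read:'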
That said, the underlying issue here is the actual, general R/W performance. From my Win11 client, I get ~100-150MBps write and ~350MBps read. A local VM gets about 450MBps in both directions. With the previous setup, where the main differences were one less HBA, no special vdev, and 7x 3-disk RAIDZ1 vdevs of 6TB WD Red Plus drives, I was getting significantly better performance, which is to say I was hitting the 10Gb ceiling of the client's link.
I'd split the disks across two HBAs because, in a previous iteration with these disks, I was getting R/W/V failures all over the place. In some additional testing, writing zeros across multiple disks at once also produced some odd errors. Removing some disks and retesting suggested the HBA didn't like doing that many things at once, so I split the load. With all the WD disks in that previous setup, things were fine. This is the one additional thing I still need to test: removing one HBA/splitting the pool to make sure one of them isn't bad and causing this issue. While the pool will be degraded, if there is a problem my expectation is that I'll see a performance difference between the two HBAs, even though I'm only leveraging 50% of the disks. Eleven of these should still give great read performance, and if there is an issue, I'd expect to see it on the writes.
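Roughly how I expect to run that isolation test: map which disks sit behind which controller, offline the mirror members on the suspect HBA, rerun the same dd/SMB tests, then bring them back and repeat against the other controller (device names below are placeholders):

Code:
# Map disks to controllers; the two 9305-24i's show up as separate mpr instances
camcontrol devlist -v

# Offline the mirror members behind the suspect HBA, then retest
zpool offline TheEndHouse da11
zpool offline TheEndHouse da12
# ...and so on for the rest of that HBA's disks

# Bring them back online afterwards
zpool online TheEndHouse da11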
So, other than my to-do list above, I'm very open to any suggestions for other testing that could shed some light on what the issue is. I don't think there are any metrics to glean from the HBAs themselves; it would be great to see their load, but I don't believe anything like that is exposed.
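The closest proxy I know of for HBA load is per-disk latency/busy while the pool is under load; if every disk behind one controller consistently looks worse than the disks behind the other, that points at the HBA even without controller-level counters:

Code:
# Per-disk busy% and latency, physical devices only; run while a transfer is in flight
gstat -p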
TIA