What is your ZFS recordsize set to? With the recommended 128K, performance seemed to be capped, just as you found. I ended up dropping to a 16K recordsize to get decent performance.
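For reference, checking and changing the property looks like this (pool/dataset names here are placeholders); note that recordsize only applies to newly written blocks, so recreate the test file after changing it:

```shell
# Check the current recordsize (hypothetical pool/dataset names)
zfs get recordsize tank/bench

# Drop it to 16K for small-block workloads; existing files keep
# their old block size until rewritten
zfs set recordsize=16K tank/bench
```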
With a 16K recordsize:
fio --filename=test --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=8 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=4k
4ktest: Laying out IO file (1 file / 4096MiB)
Jobs: 8 (f=8): [r(8)][100.0%][r=2822MiB/s][r=722k IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=8): err= 0: pid=10750: Mon Apr 10 07:23:27 2023
read: IOPS=719k, BW=2810MiB/s (2947MB/s)(32.0GiB/11660msec)
clat (usec): min=2, max=2670, avg=10.65, stdev=21.41
lat (usec): min=2, max=2670, avg=10.68, stdev=21.41
clat percentiles (usec):
| 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 7],
| 30.00th=[ 7], 40.00th=[ 7], 50.00th=[ 8], 60.00th=[ 8],
| 70.00th=[ 9], 80.00th=[ 10], 90.00th=[ 12], 95.00th=[ 18],
| 99.00th=[ 118], 99.50th=[ 157], 99.90th=[ 306], 99.95th=[ 375],
| 99.99th=[ 494]
bw ( MiB/s): min= 2749, max= 2902, per=100.00%, avg=2814.20, stdev= 5.04, samples=176
iops : min=703811, max=743039, avg=720430.77, stdev=1291.25, samples=176
lat (usec) : 4=0.13%, 10=84.69%, 20=11.20%, 50=2.38%, 100=0.59%
lat (usec) : 250=0.83%, 500=0.17%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=5.81%, sys=94.14%, ctx=4195, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8388608,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=2810MiB/s (2947MB/s), 2810MiB/s-2810MiB/s (2947MB/s-2947MB/s), io=32.0GiB (34.4GB), run=11660-11660msec
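One thing worth noting in the output above: with --ioengine=psync, the --iodepth=128 setting is effectively ignored, since each job submits one synchronous I/O at a time (which is why the report shows "IO depths : 1=100.0%"). A variant that actually drives queue depth would use an async engine, something like:

```shell
# Same random-read test, but with libaio so --iodepth=128 is honored.
# --rw=randread replaces the original randrw + rwmixread=100, which
# amounts to the same 100%-read workload.
fio --filename=test --direct=1 --rw=randread --randrepeat=0 \
    --iodepth=128 --numjobs=8 --runtime=60 --group_reporting \
    --name=4ktest-aio --ioengine=libaio --size=4G --bs=4k
```

With direct=1 on a ZFS dataset, results can differ from buffered runs, so it's worth comparing both.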
After that, NFS performance was still terrible compared to identical settings on the R720xd. iSCSI performance, on the other hand, was excellent, so now we're running iSCSI.
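If the iSCSI extents are zvol-backed, the analogous tuning knob is volblocksize rather than recordsize, and unlike recordsize it can only be set at creation time. A sketch, with a hypothetical pool/zvol name and size:

```shell
# volblocksize is fixed when the zvol is created and cannot be
# changed afterwards (hypothetical names/size)
zfs create -V 100G -o volblocksize=16K tank/iscsi-vol
```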