[Performance] ZFS underperforming against UFS

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
I am considering setting up a NAS on a Dell R710 with a two-tiered cache (48 GB RAM -> 500 GB NVMe SSD -> 4x 8TB spinning disks). Before that, I am familiarizing myself with OpenZFS and TrueNAS. Today I set up a VM on Hyper-V, created 4 virtual disks to simulate the members of a zpool's data vdev, and passed them through just to get a feel for the functional and logical setup. I understand this is not a setup to test performance per se, but at least I can observe relative performance, all virtual disk block files being equal.

If I set up a zpool with only 1 data disk (remember, virtual) and a separate UFS partition on another disk (also virtual), I observe double the performance on the UFS partition. Both underlying devices should offer basically the same performance, however low due to going through the entire Hyper-V virtualization stack, but relatively equal; yet the performance on ZFS is half of that on a disk formatted with UFS. Is this expected due to some angle I am missing about ZFS? (I also tried creating a pool with more members, but in that case, since all the virtual disk block files are hosted on the same drive of the host, I would actually expect less performance.)

Setup


Code:
root@truenas[/mnt/data3]# zpool status
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: data-alone
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        data-alone                                    ONLINE       0     0     0
          gptid/7c105c5b-5a19-11ec-a99c-00155d00b402  ONLINE       0     0     0

errors: No known data errors

root@truenas[/mnt/data3]# gpart show
=>      40  67108784  da0  GPT  (32G)
        40    532480    1  efi  (260M)
    532520  66551808    2  freebsd-zfs  (32G)
  67084328     24496       - free -  (12M)

=>      40  33554352  da1  GPT  (16G)
        40        88       - free -  (44K)
       128  33554264    1  freebsd-zfs  (16G)

=>       40  209715120  da6  GPT  (100G)
         40  209715120    1  freebsd-ufs  (100G)

=>       40  209715120  da3  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

=>       40  209715120  da4  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

=>       40  209715120  da2  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

root@truenas[/mnt/data3]#
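For anyone reproducing the comparison: the device names below follow the gpart output above, but the exact steps used here are assumed. The single-disk pool would normally be created from the TrueNAS UI, and the UFS side prepared by hand, roughly like this:

Code:
# single-disk pool on one virtual disk (da1 assumed from the gpart output)
zpool create data-alone da1

# GPT label plus a UFS2 filesystem on another virtual disk (da6), mounted where the second fio run points
gpart create -s gpt da6
gpart add -t freebsd-ufs da6
newfs -U /dev/da6p1
mkdir -p /mnt/data3 && mount /dev/da6p1 /mnt/data3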


Results



Code:
root@truenas[/mnt/data3]# fio --name=write_throughput --directory=/mnt/data-alone --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=write \
--group_reporting=1
write_throughput: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=64
fio-3.27
Starting 1 process
write_throughput: Laying out IO file (1 file / 10240MiB)
^Cbs: 1 (f=1): [W(1)][29.0%][w=1401MiB/s][w=350 IOPS][eta 00m:44s]
fio: terminating on signal 2

write_throughput: (groupid=0, jobs=1): err= 0: pid=4294: Fri Dec 10 17:03:39 2021
  write: IOPS=350, BW=1404MiB/s (1472MB/s)(22.6GiB/16484msec); 0 zone resets
    clat (usec): min=606, max=43728, avg=2693.70, stdev=2015.40
     lat (usec): min=626, max=43743, avg=2843.31, stdev=2163.25
    clat percentiles (usec):
     |  1.00th=[  668],  5.00th=[  742], 10.00th=[  807], 20.00th=[ 1287],
     | 30.00th=[ 1942], 40.00th=[ 2180], 50.00th=[ 2376], 60.00th=[ 2573],
     | 70.00th=[ 2900], 80.00th=[ 3425], 90.00th=[ 4424], 95.00th=[ 5800],
     | 99.00th=[10552], 99.50th=[12911], 99.90th=[19530], 99.95th=[29492],
     | 99.99th=[43779]
   bw (  MiB/s): min= 1086, max= 1571, per=100.00%, avg=1408.52, stdev=98.63, samples=32
   iops        : min=  271, max=  392, avg=351.66, stdev=24.70, samples=32
  lat (usec)   : 750=5.83%, 1000=11.34%
  lat (msec)   : 2=14.97%, 4=54.07%, 10=12.67%, 20=1.04%, 50=0.09%
  cpu          : usr=5.07%, sys=87.74%, ctx=15850, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5785,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1404MiB/s (1472MB/s), 1404MiB/s-1404MiB/s (1472MB/s-1472MB/s), io=22.6GiB (24.3GB), run=16484-16484msec
root@truenas[/mnt/data3]# fio --name=write_throughput --directory=/mnt/data3 --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=write \
--group_reporting=1
write_throughput: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=64
fio-3.27
Starting 1 process
write_throughput: Laying out IO file (1 file / 10240MiB)
fio: io_u error on file /mnt/data3/write_throughput.0.0: No space left on device: write offset=8568963072, buflen=4194304
fio: pid=4304, err=28/file:io_u.c:1841, func=io_u error, error=No space left on device

write_throughput: (groupid=0, jobs=1): err=28 (file:io_u.c:1841, func=io_u error, error=No space left on device): pid=4304: Fri Dec 10 17:04:05 2021
  write: IOPS=710, BW=2839MiB/s (2977MB/s)(3464MiB/1220msec); 0 zone resets
    clat (usec): min=829, max=24092, avg=1374.16, stdev=1531.18
     lat (usec): min=848, max=24130, avg=1398.46, stdev=1531.79
    clat percentiles (usec):
     |  1.00th=[  873],  5.00th=[  914], 10.00th=[  938], 20.00th=[  971],
     | 30.00th=[ 1012], 40.00th=[ 1045], 50.00th=[ 1090], 60.00th=[ 1139],
     | 70.00th=[ 1188], 80.00th=[ 1287], 90.00th=[ 1500], 95.00th=[ 2671],
     | 99.00th=[11600], 99.50th=[13304], 99.90th=[23987], 99.95th=[23987],
     | 99.99th=[23987]
   bw (  MiB/s): min= 2858, max= 2921, per=100.00%, avg=2889.63, stdev=44.37, samples=2
   iops        : min=  714, max=  730, avg=722.00, stdev=11.31, samples=2
  lat (usec)   : 1000=26.07%
  lat (msec)   : 2=66.78%, 4=5.42%, 10=0.35%, 20=1.15%, 50=0.12%
  cpu          : usr=1.07%, sys=83.76%, ctx=65, majf=0, minf=4
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,867,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=2839MiB/s (2977MB/s), 2839MiB/s-2839MiB/s (2977MB/s-2977MB/s), io=3464MiB (3632MB), run=1220-1220msec
root@truenas[/mnt/data3]#
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
How much RAM did you allocate to the VM?
 

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
I doubled the RAM and still observed the same issue: the performance of a pool with one single-disk data vdev is half that of the same disk hosting a UFS partition.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's what you'd expect, because your one-disk pool has to perform two writes for each write: once to the ZIL on the disk, and then again from the ZIL to the disk itself.

Since you're just experimenting, you can disable this behavior by running zfs set sync=disabled <name of your root dataset>. In production this isn't recommended, as the design of ZFS prioritizes data safety over performance.
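For example, assuming the pool's root dataset is data-alone as in the zpool status output above, the setting can be checked, disabled for the experiment, and reverted like so:

Code:
# check the current sync behavior (dataset name assumed from the output above)
zfs get sync data-alone

# disable synchronous write semantics, for experimentation only
zfs set sync=disabled data-alone

# revert to the default when done
zfs set sync=standard data-alone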
 
Last edited:

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
Interesting point; however, disabling it does not produce any noticeable improvement. Also, intuitively I would expect read performance not to be affected by log writes, yet on reads I still see the same issue; in fact it is worse, about 2.5x lower performance.
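For context, the read test is just the counterpart of the write test above, something like this (the exact flags are assumed, mirroring the write run with only the job name, target directory, and --rw changed):

Code:
fio --name=read_throughput --directory=/mnt/data-alone --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=read \
--group_reporting=1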
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Since you're using virtual disks instead of passing through a physical HBA, I'm inclined to suspect Microsoft optimization for UFS.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Yes, same here. Hyper-V is just about the worst for I/O-intensive applications. ESXi *with hardware passthrough* will far surpass the I/O performance you're seeing, to the point that making this comparison inside Hyper-V is somewhat pointless. It is vitally important to read up on the issues that will hit you when running TrueNAS as a VM.

Also, a single-disk vdev is really not representative... So much depends on the actual pool architecture that this again makes it a comparison of little value. Yes, it's fun and educational for a first play, but do try at least a few different vdev layouts and see what happens.
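For instance, with four equal virtual disks the layouts differ like this; pool and device names are placeholders, each command is an alternative layout rather than steps to run together, and in TrueNAS you would normally build the pool from the UI:

Code:
# 4-wide stripe: maximum space and streaming throughput, no redundancy
zpool create tank da2 da3 da4 da5

# two 2-way mirrors: half the space, best random I/O, survives one disk failure per mirror
zpool create tank mirror da2 da3 mirror da4 da5

# single RAIDZ1 vdev: roughly three disks of usable space, tolerates one disk failure
zpool create tank raidz da2 da3 da4 da5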

Kai.
 

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
Agreed. So, a little update, folks. I now have a Dell R710 with a JBOD controller serving 5x 8TB rotating disks, cached by 64 GB RAM (ARC) plus a 500 GB NVMe on x8 PCIe acting as L2ARC. The NAS has an 8-NIC LACP LAG, and the two Proxmox VE hosts each have 4 NICs bonded in LACP layer3+4 mode (although the switch only supports layer2+3, so it is hashing at the IP level only). I have about 10 client VMs mounting this NFSv4 share, and all of them can read and write at the limit of the network (1GbE), even when running in parallel, so it looks like the cache is doing wonders. I also set up ZFS over iSCSI from the Proxmox VE hosts and it works fine (I had to use a client plugin that somebody posted online; I will look up the URL later and post it here for anybody trying to do the same).
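For reference, the client side is just a plain NFSv4 mount, something like the following; the hostname and export path are placeholders:

Code:
# on a Linux client VM; server name and export path are placeholders
mount -t nfs -o vers=4.1,rw truenas.lan:/mnt/tank/share /mnt/share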

This is a tad off topic, but I also have a slim-resourced FreeIPA server in a VM on top of TrueNAS, which is already authenticating users on the NAS. Now I am trying to set up automount for the users' home directories across the network. If you happen to have done this in the past or can point me to a good guide, I would appreciate it.
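What I am after is essentially the classic autofs wildcard map for home directories, ideally served from FreeIPA rather than flat files; the paths and hostname below are placeholders:

Code:
# /etc/auto.master entry (or the equivalent FreeIPA automount location)
/home   /etc/auto.home   --timeout=300

# /etc/auto.home: one NFSv4 home per user via a wildcard key
*   -fstype=nfs4,rw   truenas.lan:/mnt/tank/homes/&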
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Erm, are you sure you need the L2ARC device? With 5 disks you'll easily saturate a 1Gb link. If anything, I would personally stay away from iSCSI (your data is not written sync by default), use NFSv4 with sync on, and if that hurts your performance, add an SLOG device. This of course assumes you have ample RAM for an appropriately sized ARC at any time (it depends on the workload, but it can easily be checked using the graphs in TrueNAS).
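For what it's worth, the ARC counters can also be eyeballed from the shell, and an SLOG (or the removal of the L2ARC device) is a one-liner later on; the pool and device names below are placeholders:

Code:
# raw ARC hit/miss counters on FreeBSD-based TrueNAS CORE
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# add a dedicated SLOG device if sync writes become the bottleneck
zpool add tank log nvd1

# log and cache devices can be removed again without harming the pool
zpool remove tank nvd0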

Kai.
 