[Performance] ZFS underperforming against UFS

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
I am considering setting up a NAS on a Dell R710 with a two-tiered cache (48 GB RAM -> 500 GB NVMe SSD -> 4x 8TB spinning disks). Before that, I am familiarizing myself with OpenZFS and TrueNAS. Today I set up a VM on Hyper-V, created 4 virtual disks to simulate the members of a zpool's data vdev, and passed them through just to get a feel for the functional and logical setup. I understand this is not a setup to test performance per se, but at least I can observe relative performance, all virtual disk block files being equal.

If I set up a zpool with only 1 data disk (remember, virtual) and a separate UFS partition on another disk (also virtual), I observe double the performance on the UFS partition. Both underlying devices should offer basically the same performance, however low due to going through the entire Hyper-V virtualization stack, but relatively equal; yet the performance on ZFS is half of that on a disk formatted with UFS. Is this expected due to some angle I am missing about ZFS? (I also tried creating a pool with more members, but in that case, since all the virtual disk block files are hosted on the same drive of the host, I would actually expect less performance.)

Setup


Code:
root@truenas[/mnt/data3]# zpool status
  pool: boot-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: data-alone
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        data-alone                                    ONLINE       0     0     0
          gptid/7c105c5b-5a19-11ec-a99c-00155d00b402  ONLINE       0     0     0

errors: No known data errors

root@truenas[/mnt/data3]# gpart show
=>      40  67108784  da0  GPT  (32G)
        40    532480    1  efi  (260M)
    532520  66551808    2  freebsd-zfs  (32G)
  67084328     24496       - free -  (12M)

=>      40  33554352  da1  GPT  (16G)
        40        88       - free -  (44K)
       128  33554264    1  freebsd-zfs  (16G)

=>       40  209715120  da6  GPT  (100G)
         40  209715120    1  freebsd-ufs  (100G)

=>       40  209715120  da3  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

=>       40  209715120  da4  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

=>       40  209715120  da2  GPT  (100G)
         40         88       - free -  (44K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  205520728    2  freebsd-zfs  (98G)

root@truenas[/mnt/data3]#
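For anyone reproducing the comparison: the device names below follow the gpart output above, but the exact steps used here are assumed. The single-disk pool would normally be created from the TrueNAS UI, and the UFS side prepared by hand, roughly like this:

Code:
# single-disk pool on one virtual disk (da1 assumed from the gpart output)
zpool create data-alone da1

# GPT label plus a UFS2 filesystem on another virtual disk (da6), mounted where the second fio run points
gpart create -s gpt da6
gpart add -t freebsd-ufs da6
newfs -U /dev/da6p1
mkdir -p /mnt/data3 && mount /dev/da6p1 /mnt/data3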


Results



Code:
root@truenas[/mnt/data3]# fio --name=write_throughput --directory=/mnt/data-alone --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=write \
--group_reporting=1
write_throughput: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=64
fio-3.27
Starting 1 process
write_throughput: Laying out IO file (1 file / 10240MiB)
^Cbs: 1 (f=1): [W(1)][29.0%][w=1401MiB/s][w=350 IOPS][eta 00m:44s]
fio: terminating on signal 2

write_throughput: (groupid=0, jobs=1): err= 0: pid=4294: Fri Dec 10 17:03:39 2021
  write: IOPS=350, BW=1404MiB/s (1472MB/s)(22.6GiB/16484msec); 0 zone resets
    clat (usec): min=606, max=43728, avg=2693.70, stdev=2015.40
     lat (usec): min=626, max=43743, avg=2843.31, stdev=2163.25
    clat percentiles (usec):
     |  1.00th=[  668],  5.00th=[  742], 10.00th=[  807], 20.00th=[ 1287],
     | 30.00th=[ 1942], 40.00th=[ 2180], 50.00th=[ 2376], 60.00th=[ 2573],
     | 70.00th=[ 2900], 80.00th=[ 3425], 90.00th=[ 4424], 95.00th=[ 5800],
     | 99.00th=[10552], 99.50th=[12911], 99.90th=[19530], 99.95th=[29492],
     | 99.99th=[43779]
   bw (  MiB/s): min= 1086, max= 1571, per=100.00%, avg=1408.52, stdev=98.63, samples=32
   iops        : min=  271, max=  392, avg=351.66, stdev=24.70, samples=32
  lat (usec)   : 750=5.83%, 1000=11.34%
  lat (msec)   : 2=14.97%, 4=54.07%, 10=12.67%, 20=1.04%, 50=0.09%
  cpu          : usr=5.07%, sys=87.74%, ctx=15850, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5785,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1404MiB/s (1472MB/s), 1404MiB/s-1404MiB/s (1472MB/s-1472MB/s), io=22.6GiB (24.3GB), run=16484-16484msec
root@truenas[/mnt/data3]# fio --name=write_throughput --directory=/mnt/data3 --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=write \
--group_reporting=1
write_throughput: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=64
fio-3.27
Starting 1 process
write_throughput: Laying out IO file (1 file / 10240MiB)
fio: io_u error on file /mnt/data3/write_throughput.0.0: No space left on device: write offset=8568963072, buflen=4194304
fio: pid=4304, err=28/file:io_u.c:1841, func=io_u error, error=No space left on device

write_throughput: (groupid=0, jobs=1): err=28 (file:io_u.c:1841, func=io_u error, error=No space left on device): pid=4304: Fri Dec 10 17:04:05 2021
  write: IOPS=710, BW=2839MiB/s (2977MB/s)(3464MiB/1220msec); 0 zone resets
    clat (usec): min=829, max=24092, avg=1374.16, stdev=1531.18
     lat (usec): min=848, max=24130, avg=1398.46, stdev=1531.79
    clat percentiles (usec):
     |  1.00th=[  873],  5.00th=[  914], 10.00th=[  938], 20.00th=[  971],
     | 30.00th=[ 1012], 40.00th=[ 1045], 50.00th=[ 1090], 60.00th=[ 1139],
     | 70.00th=[ 1188], 80.00th=[ 1287], 90.00th=[ 1500], 95.00th=[ 2671],
     | 99.00th=[11600], 99.50th=[13304], 99.90th=[23987], 99.95th=[23987],
     | 99.99th=[23987]
   bw (  MiB/s): min= 2858, max= 2921, per=100.00%, avg=2889.63, stdev=44.37, samples=2
   iops        : min=  714, max=  730, avg=722.00, stdev=11.31, samples=2
  lat (usec)   : 1000=26.07%
  lat (msec)   : 2=66.78%, 4=5.42%, 10=0.35%, 20=1.15%, 50=0.12%
  cpu          : usr=1.07%, sys=83.76%, ctx=65, majf=0, minf=4
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.1%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,867,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=2839MiB/s (2977MB/s), 2839MiB/s-2839MiB/s (2977MB/s-2977MB/s), io=3464MiB (3632MB), run=1220-1220msec
root@truenas[/mnt/data3]#
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
How much RAM did you allocate to the VM?
 

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
I doubled the RAM and still observed the same issue: the performance of a pool with one single-disk data vdev is half that of the same disk hosting a UFS partition.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
That's what you'd expect, because your one-disk pool has to perform two writes for each write: once to the ZIL on the disk, and then again from the ZIL to the disk itself.

Since you're just experimenting, you can disable this behavior by running zfs set sync=disabled <name of your root dataset>. In production this isn't recommended, as the design of ZFS prioritizes data safety over performance.
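For example, assuming the pool's root dataset is data-alone as in the zpool status output above, the setting can be checked, disabled for the experiment, and reverted like so:

Code:
# check the current sync behavior (dataset name assumed from the output above)
zfs get sync data-alone

# disable synchronous write semantics, for experimentation only
zfs set sync=disabled data-alone

# revert to the default when done
zfs set sync=standard data-alone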
 
Last edited:

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
Interesting point; however, disabling it does not produce any noticeable improvement. Also, intuitively I would expect read performance not to be affected by log writes, yet on reads I still see the same issue; in fact it is worse, about 2.5x lower performance.
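For context, the read test is just the counterpart of the write test above, something like this (the exact flags are assumed, mirroring the write run with only the job name, target directory, and --rw changed):

Code:
fio --name=read_throughput --directory=/mnt/data-alone --numjobs=1 \
--size=10G --time_based --runtime=60s --ramp_time=2s \
--direct=0 --verify=0 --bs=4M --iodepth=64 --rw=read \
--group_reporting=1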
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Since you're using virtual disks instead of passing through a physical HBA, I'm inclined to suspect Microsoft optimization for UFS.
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Yes, same here. Hyper-V is just about the worst for I/O-intensive applications. ESXi *with hardware passthrough* will far surpass the I/O performance you're seeing, to the point that making this comparison inside Hyper-V is somewhat pointless. It is vitally important to read up on the issues that will hit you when running TrueNAS as a VM.

Also, a single-disk vdev is really not representative... So much depends on the actual pool architecture that this again makes it a comparison of little value. Yes, it's fun and educational for a first play, but do try at least a few different vdev layouts and see what happens.
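For instance, with four equal virtual disks the layouts differ like this; pool and device names are placeholders, each command is an alternative layout rather than steps to run together, and in TrueNAS you would normally build the pool from the UI:

Code:
# 4-wide stripe: maximum space and streaming throughput, no redundancy
zpool create tank da2 da3 da4 da5

# two 2-way mirrors: half the space, best random I/O, survives one disk failure per mirror
zpool create tank mirror da2 da3 mirror da4 da5

# single RAIDZ1 vdev: roughly three disks of usable space, tolerates one disk failure
zpool create tank raidz da2 da3 da4 da5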

Kai.
 

dacabdi

Cadet
Joined
Dec 10, 2021
Messages
6
Agreed. So, a little update, folks. I now have a Dell R710 with a JBOD controller serving 5x 8TB rotating disks, cached by 64 GB RAM (ARC) plus a 500 GB NVMe on x8 PCIe acting as L2ARC. The NAS has an 8-NIC LACP LAG, and the two Proxmox VE hosts each have 4 NICs bonded in LACP layer3+4 mode (although the switch only supports layer2+3, so it is hashing at the IP level only). I have about 10 client VMs mounting this NFSv4 share, and all of them can read and write at the limit of the network (1GbE), even when running in parallel, so it looks like the cache is doing wonders. I also set up ZFS over iSCSI from the Proxmox VE hosts and it works fine (I had to use a client plugin that somebody posted online; I will look up the URL later and post it here for anybody trying to do the same).
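For reference, the client side is just a plain NFSv4 mount, something like the following; the hostname and export path are placeholders:

Code:
# on a Linux client VM; server name and export path are placeholders
mount -t nfs -o vers=4.1,rw truenas.lan:/mnt/tank/share /mnt/share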

This is a tad off topic, but I also have a slim-resourced FreeIPA server in a VM on top of TrueNAS, which is already authenticating users on the NAS. Now I am trying to set up automount for the users' home directories across the network. If you happen to have done this in the past or can point me to a good guide, I would appreciate it.
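What I am after is essentially the classic autofs wildcard map for home directories, ideally served from FreeIPA rather than flat files; the paths and hostname below are placeholders:

Code:
# /etc/auto.master entry (or the equivalent FreeIPA automount location)
/home   /etc/auto.home   --timeout=300

# /etc/auto.home: one NFSv4 home per user via a wildcard key
*   -fstype=nfs4,rw   truenas.lan:/mnt/tank/homes/&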
 

Kailee71

Contributor
Joined
Jul 8, 2018
Messages
110
Erm, are you sure you need the L2ARC device? With 5 disks you'll easily saturate a 1Gb link. If anything, I would personally stay away from iSCSI (your data is not written sync by default), use NFSv4 with sync on, and if that hurts your performance, add an SLOG device. This of course assumes you have ample RAM for an appropriately sized ARC at any time (it depends on the workload, but it can easily be checked using the graphs in TrueNAS).
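For what it's worth, the ARC counters can also be eyeballed from the shell, and an SLOG (or the removal of the L2ARC device) is a one-liner later on; the pool and device names below are placeholders:

Code:
# raw ARC hit/miss counters on FreeBSD-based TrueNAS CORE
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# add a dedicated SLOG device if sync writes become the bottleneck
zpool add tank log nvd1

# log and cache devices can be removed again without harming the pool
zpool remove tank nvd0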

Kai.
 