Remove or Migrate Special VDEV

ferret · Jul 26, 2022

Hi All,

Being a new to TrueNAS and ZFS I would like to apologise in advance if I am posting a topic that has already been covered, I was unable to find anything similar with search the forum.

I have created a pool with 3 x raidz2 vdev (5 x 3TB SAS 7.2k), special mirror vdev (2 x nvme 500gb) , log mirror vdev (2 x SSD SATA 500GB) & cache stripe vdev (2 x SSD SAS 200GB). With zvol via iscsi to Proxmox cluster. So only being used for vm's.

To be honest I am not seeing the benefit of the special vdev using nvme's plus I think I would be better of swapping the log & cahce vdev devices.

Is there a way to migrate special vdev metadata to another ssd vdev so to remove the nvme's?

Cheers

sretalla · Jul 27, 2022

ferret said:
a pool with 3 x raidz2 vdev

ferret said:
Is there a way to migrate special vdev metadata to another ssd vdev so to remove the nvme's?

Since your pool has at least one RAIDZ VDEV, no can do.

Recreate the pool is your only option to change the geometry at this point.

If the "replacement" drives were the same size or larger than the existing ones, you could just resilver new ones in and remove the old drives.

As a general comment, you're doing it all completely wrong.

You're doing block storage with RAIDZ2 and doing SLOG on SATA SSDs.

You should probably read these if you want good performance:

The path to success for block storage

It seems like I haven't written a sticky for awhile, but just in the last week I've had to cover this topic several times. ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle...

www.truenas.com

Some insights into SLOG/ZIL with ZFS on FreeNAS

What is the ZIL? POSIX provides a facility for the system or an application to make sure that data requested to be written is actually committed to stable storage: a synchronous write request. Upon completion of a sync write request, the underlying filesystem is supposed to guarantee that a...

www.truenas.com

SLOG benchmarking and finding the best SLOG

I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic since this is not well understood by many people. In short, there's 3 things that you need for a "great" SLOG: 1. Fast throughput 2...

www.truenas.com

ferret · Jul 27, 2022

sretalla said:
Since your pool has at least one RAIDZ VDEV, no can do.

Recreate the pool is your only option to change the geometry at this point.

If the "replacement" drives were the same size or larger than the existing ones, you could just resilver new ones in and remove the old drives.

As a general comment, you're doing it all completely wrong.

You're doing block storage with RAIDZ2 and doing SLOG on SATA SSDs.

You should probably read these if you want good performance:

The path to success for block storage

It seems like I haven't written a sticky for awhile, but just in the last week I've had to cover this topic several times. ZFS does two different things very well. One is storage of large sequentially-written files, such as archives, logs, or data files, where the file does not have the middle...

www.truenas.com

Some insights into SLOG/ZIL with ZFS on FreeNAS

What is the ZIL? POSIX provides a facility for the system or an application to make sure that data requested to be written is actually committed to stable storage: a synchronous write request. Upon completion of a sync write request, the underlying filesystem is supposed to guarantee that a...

www.truenas.com

SLOG benchmarking and finding the best SLOG

I'd like to take a few minutes to talk about SLOG devices and what makes good ones versus bad ones. I have no doubt that this will be a controversial topic since this is not well understood by many people. In short, there's 3 things that you need for a "great" SLOG: 1. Fast throughput 2...

www.truenas.com

Hi,

Many than ks for you kind response.

As I mentioned, I am new to TrueNAS & ZFS.

I will read the articles.

So are you suggesting creating new pool with say 7 x mirror vdev (aka raid10) with hot spare instead for block storage?

Cheers

Patrick M. Hausen · Jul 27, 2022

ferret said:
So are you suggesting creating new pool with say 7 x mirror vdev (aka raid10) with hot spare instead for block storage?

If you read the first article he linked you will find that the answer is "yes"

Mirrors for block storage. Highly recommended ...

ferret · Jul 27, 2022

Patrick M. Hausen said:
If you read the first article he linked you will find that the answer is "yes"

Mirrors for block storage. Highly recommended ...

Have now read the first 2 articles and noted that recommendation for mirrored vdevs for block storage and NVMe for SLOG, which is why my question to change the pool mix.

Am I correct in thinking that special vdev is of no real value for zvol only pool?

Can I used SATA SSD's for cache?

Before building TrueNAS server I read lots of articles (clearly the wrong ones) and watcedh lots of youtube's and followed the advice of 1 tutorial from 45 Drives (https://www.youtube.com/watch?v=T0frbOzBlHI) which I can confirm doesn't work in real work environment.

Many thanks for your kind assistance.

Patrick M. Hausen · Jul 27, 2022

ferret said:
Am I correct in thinking that special vdev is of no real value for zvol only pool?

I would think so but I am not 100% sure. Maybe someone can confirm/deny?

ferret said:
Can I used SATA SSD's for cache?

Only when you already maxed out your memory and you are still getting too many ARC misses, does an L2ARC make any sense. But in that case, yes.

sretalla · Jul 27, 2022

Patrick M. Hausen said:
I would think so but I am not 100% sure. Maybe someone can confirm/deny?

I have seen @jgreco and maybe a few of the other more experienced forum lurkers saying it would help even for ZVOLs.

ferret · Jul 28, 2022

Hi All,

I made the geometry change and it's made almost zero different.

Reading 600's and writing 1100's

6 x Mirror VDEV's with 1 x Host Spare
2 x Mirror VDEV 200GB Enterprise SAS SSD for Metadata
2 x Mirror VDEV 500GB SATA SSD for Log
2 x Stripe VDEV 500GB NVME SSD via 2 x PCIe 4x Cards

Although the new pool seems to look better but something still doesn't feel correct. CPU work load never get's above 1% and 10GB Nic's are doing almost nothing. It doesn't make sense.

Any assistance would be very much appreciated.

Cheers

sretalla · Jul 29, 2022

ferret said:
I made the geometry change and it's made almost zero different.

Reading 600's and writing 1100's

How are you testing... some types of test don't really show the potential as they don't parallelize the workload.

ferret · Jul 29, 2022

sretalla said:
How are you testing... some types of test don't really show the potential as they don't parallelize the workload.

Using CrystalDiskMark and zpool iostat -v ....

sretalla · Jul 29, 2022

ferret said:
Using CrystalDiskMark and zpool iostat -v ....

OK, so that will be suffering from the points I mentioned... you should use fio on the host to get the raw pool performance and use something that generates more load in parallel (simulating multiple clients) from the network.

ferret · Jul 29, 2022

Many thanks, I've run the following;

fio --name=test --size=5g --rw=write --ioengine=posixaio --direct=1 --bs=1m

WRITE: bw=1238MiB/s (1298MB/s), 1238MiB/s-1238MiB/s (1298MB/s-1298MB/s), io=5120MiB (5369MB), run=4137-4137msec

fio --name=test --size=5g --rw=read --ioengine=posixaio --direct=1 --bs=1m

READ: bw=1914MiB/s (2007MB/s), 1914MiB/s-1914MiB/s (2007MB/s-2007MB/s), io=5120MiB (5369MB), run=2675-2675msec

I'm noticing slight lagging inside VM compared to running same VM on IBM V7000 Gen1 SAN (no ssd's nor tiering). Anyway numbers above look great. For several years I've been fighting the decision to implement CEPH or ZFS on paper I just can't see how CEPH could truely compete with ZFS on any level.

sretalla · Jul 29, 2022

You'll want to set the arguments a bit more for fio to get a good test:

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite --size=500g --io_size=1500g --blocksize=10m --iodepth=1 --direct=1 --numjobs=1 --runtime=3600 --group_reporting

you can play with the numjobs and iodepth to really test harder and reduce the size of the size and io_size to force more IOPS.

With the --rw set to randwrite, that's going to likely bottleneck on random IO creation, so doing as you did above with a separate read and write cycle will most likely give you the results that matter.

HoneyBadger · Jul 29, 2022

sretalla said:
I have seen @jgreco and maybe a few of the other more experienced forum lurkers saying it would help even for ZVOLs.

You won't be able to make use of the special_small_blocks property (I believe the allocator only considers file blocks for this option) but you can leverage it for metadata (and dedup tables, but you really want to read the resource from @Stilez about the type of hardware necessary and level of performance to expect!)

@ferret Apologies if I missed it earlier, but are you able to post the make/model of your SSDs? You're also likely benchmarking asynchronous performance which is why your writes are very high (line speed) but that can expose you to a risk of data loss if the ProxMox VMs aren't able to send/handle sync write requests.

ferret · Aug 15, 2022

Apologies for the delay in responding, I wanted to do more analysis of the issues.

The issue is related to IO Delays in Proxmox Cluster on all 5 Hosts from the single TrueNAS server with shared ZVOL block storage via iSCSI (LACP 10GB), in particular when restoring VM's from Proxmox Backup Server (average size is 100GB) and again when copying VM images from local ZFS RAIDZ (5 x 500GB Crucial MX500's) to TrueNAS server.

TrueNAS Server Setup;
IBM x3650 M3
- dual X5690 @ 3.47GHz
- 72GB RAM
- 2 x 300GB 10K HDD for OS
- 2 x Dell 200GB SAS SSD -> Special VDEV Mirror
- 2 x Crucial MX500 500GB SATA SSD -> Log VDEV Mirror
- 2 x PCIe Cards with Samsung 970 EVO 500GB NVMe's -> Cache VDEV Stripe
- IT HBA (SAS2008) with dual SAS Cables
- Intel 10GB Dual Port -> LACP to NEXUS 5500 with vPC

EMC Enclosure
- Dual Controllers
- 15 x 3TB 7.2K -> 7 x VDEV Mirrors (aka RAID10) & 1 hot spare

Proxmox Hosts are all IBM X3550 M4's with approx. 20 VMs.

The above environment on paper should be fine with ample capacity. Clearly I have doing something wrong?

HoneyBadger · Aug 15, 2022

I'll try to hit the main points, apologies as it's late here. Headings in bold.

Fast writes, slow reads: Can you show the results of zfs get sync? I imagine it's all "standard" which means odds are good that ProxMox isn't sending sync writes down the line, making your writes artificially fast (and making your log vdevs not get used)

Log devices: Although with your log devices not getting used, you haven't hit the impact of your log devices being non-optimal for SLOG duties. While they're fantastic consumer SSDs, they aren't really suited for a ZFS write log as shown in the performance metrics here:

SLOG benchmarking and finding the best SLOG

LOL. First, I'd have to either switch to SSD-only or add a lot more disks before I'd even get close to saturating 10GbE. But, Optane did help speed things along for the HDDs and I hope that my special VDEV will help improve responsiveness with small files and the metadata when I switch to...

www.truenas.com

Code:

Synchronous random writes:
         0.5 kbytes:    508.1 usec/IO =      1.0 Mbytes/s
           1 kbytes:    536.5 usec/IO =      1.8 Mbytes/s
           2 kbytes:    505.6 usec/IO =      3.9 Mbytes/s
           4 kbytes:    444.5 usec/IO =      8.8 Mbytes/s
           8 kbytes:    454.1 usec/IO =     17.2 Mbytes/s
          16 kbytes:    473.9 usec/IO =     33.0 Mbytes/s
          32 kbytes:    503.4 usec/IO =     62.1 Mbytes/s
          64 kbytes:    612.7 usec/IO =    102.0 Mbytes/s
         128 kbytes:    821.5 usec/IO =    152.2 Mbytes/s

If you're hoping to get anywhere near the capacity of your 10Gbps line, you're going to need to invest in something like Intel P3600/P3700 or Optane 900/905/P4800X, and even then you aren't likely to see that type of speed all the time - perhaps during large operations or migrations but day-to-day use on a per-VM basis, it will probably be in the 3-5Gbps range using one of the aforementioned cards.

Dell SSDs: Do you have a vendor model (HGST/Toshiba?) or a Dell part number? These might be better options for SLOG for now if they're tagged as "write intensive" eg: HGST HUSMH-series or Toshiba PX02SS but I'm still going to make a strong push for Intel datacenter-grade NVMe if you want to leverage that 10Gbps network.

Connection method: I see you've got an LSI HBA listed, is that for the internal devices? How are the external ones in the EMC DAE attached, another LSI HBA? If you're using an internal IBM/Lenovo ServeRAID you may need/want to replace or reflash it if it's in IR mode instead of IT.

LACP and iSCSI: These don't mix. I'd suggest migrating to two non-overlapping subnets and setting up iSCSI MPIO instead as that will give you potential to burst up to 20Gbps (doable on reads, not likely on writes)

ISCSI Multipath - Proxmox VE

pve.proxmox.com

RAM: DDR3 from that era is cheap, 72GB is good but 144GB is better. If you've got empty slots, fill 'em with the largest-capacity DIMMs that your server supports.

Hope this isn't too much too fast.

ferret · Aug 15, 2022

Many thanks for your valued response.

Fast writes, slow reads: See attached results and you are 100% correct, all standard. Will investigate Proxmox to ensure it sends sync writes. That explain everything, years ago I had an environment with Solaris 11.3 then 11.4 with ZFS Appliance and XEN Server farm using iSCSI never had any performance issues. Sadly XEN changed licensing so went to Ovirt then to Proxmox as Ovirt didn't have a simple backup solution at the time.

Log devices: Will need time to digest this information and fix the Proxmox to send sync writes.

Dell SSDs: Pliant LB206M

Connection method: LSI HBA is external only and connecting to the EMC DAE. I am using international M1015 for OS, Special VDEV & Log VDEV, all configure as RAID 0 VD na letting TrueNAS to do mirroring an d stripping.

LACP and iSCSI: Will do, would NFS be a better option?

RAM: Will double the RAM as suggested, although did think that with only a few TB of storage I didn't need anymore RAM.

Absolutely fabulous feedback and very much appreciated. I've always loved ZFS just can't get it working nicely with Proxmox and really do not want to go down the CEPH path because something doesn't feel write with the architecture.

Cheers

HoneyBadger · Aug 16, 2022

Fast writes, slow reads: Most hypervisors that I've tested have the behaviour of assuming the iSCSI array will provide data safety (non-volatile writes) on its own. NFS has that "sync" behaviour built into the protocol itself and tends to be "sync by default" - you can enforce this easily by doing zfs set sync=always poolname/zvolname but be aware that you will bottleneck heavily on your writes without an appropriate log device, leading us to ...

Log devices: I believe the correct method for ProxMox is to enforce sync=always on the ZFS ZVOLs and set to ProxMox as writethrough (use the host page cache for reads if desired, don't cache writes, and the storage will treat all writes as needing to be on non-volatile storage when received)

Dell SSDs: I see both the Pliant LB206M (aka Sandisk Lightning) and a Samsung SM1625 in there. The LB206M seems to have good sync-write performance but the SM1625 I recall having some users report oddly poor performance that doesn't line up with the spec sheet. They might be better SLOG devices in the short-term until you can get some NVMe devices, but they should be fine for the special/meta drives.

Connection method: Sounds like you haven't flashed the internal M1015 into IT mode yet. The RAID0 virtual-disk method isn't recommended. External LSI is fine of course.

LACP and iSCSI: NFS and LACP do "play nice" together, yes. You'll "only" get 10Gbps throughput to a given single endpoint but I would expect you to bottleneck elsewhere first under most workloads.

RAM: ZFS will use all of your available RAM for a first-level read cache in the form of the ARC (Adaptive Read Cache) which makes it a very good way to improve performance. The more of your hot/active data that can be served from RAM, the fewer reads have to hit the back-end vdevs, which frees them up for more write I/O.

Important Announcement for the TrueNAS Community.

Remove or Migrate Special VDEV

Cadet

Attachments

Powered by Neutrality

Cadet

Hall of Famer

Cadet

Hall of Famer

Powered by Neutrality

Cadet

Attachments

Powered by Neutrality

Cadet

Powered by Neutrality

Cadet

Powered by Neutrality

actually does care

Cadet

Attachments

actually does care

Cadet

Attachments

actually does care

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Remove or Migrate Special VDEV"

Similar threads