Optimizing workload for write throughput on a 2 x 6-drive RAIDZ2 vdev pool

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
setup : TrueNAS-12.0-U5|Mobo: HP DL380e G8|CPU: 2 x E5-2450|RAM: 128GB|boot HDD: 1 x 3TB IBM-ESXS|data HDD: 12 x 10TB Seagate IronWolf Pro|SSD cache: 1 x 2TB Samsung 860 Pro|HBA: LSI SAS 9207-8i PCIe 3|NIC: onboard 1GbE + 10GbE SolarFlare SFN7002F SFP+ dual-port Flareon PCIe 3.0|1 pool consisting of 2 x 6-drive vdevs, dedupe off, recordsize=1MB|NFS v4

I have a TrueNAS server built as above. It is dedicated to a specific workload which is mostly read heavy and has benefited reasonably from the RAM and SSD caching so far. However, there is also a write-heavy workload, and since optimizing the client CPU efficiency of that workload, the TrueNAS write throughput is now a serious bottleneck. When running 100+ writer processes they all block severely on writing to the NAS; in this state each of the 12 HDDs reports a write rate of about 10MB/sec, so 120MB/sec in total, averaging 1-5 pending disk operations.

The shape of the pool - two vdevs, each with 6 drives in RAIDZ2 - was chosen to get pretty good storage capacity, accepting a bit of a compromise on IO throughput. However, 120MB/sec is worse than I was expecting, and simple tests, either via NFS or locally on the server, demonstrate the server is capable of better IO. Just copying 6 files concurrently over NFS tops out at about 500MB/sec on the writes (the files were already cached, so no reads). A dd test locally on the TrueNAS server yields 2.5GB/sec write and 5GB/sec read on a 400GB file.
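For anyone wanting to repeat the local test, here's a rough Python equivalent of that dd sequential-write benchmark (a sketch only - the file size is scaled way down here, and the temp path is just illustrative):

```python
import os
import tempfile
import time

def write_throughput(path, total_mb=64, block_kb=1024):
    """Sequentially write total_mb MiB in block_kb KiB chunks; return MB/s."""
    block = b"\0" * (block_kb * 1024)
    n_blocks = total_mb * 1024 // block_kb
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data out of the page cache, like dd conv=fsync
    return total_mb / (time.monotonic() - start)

with tempfile.TemporaryDirectory() as d:
    rate = write_throughput(os.path.join(d, "testfile"))
    print(f"{rate:.0f} MB/s")
```

Run it against a dataset on the pool (not /tmp) to get a number comparable to the dd figures above.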

To find out the write requirements of my workload I temporarily pointed the workload's writes at a ramdisk instead of TrueNAS, to see how fast it needed to write when blocked on CPU rather than disk. In this test the workload was, sure enough, CPU bound and wrote to the ramdisk at a rate of 0.3MB/sec across 48 separate files. Hence to run, say, 100 processes doing the same thing I calculate it needs a write throughput of only 30MB/sec - but of course across 4,800 files. Hmm.
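The shape of that ramdisk measurement can be sketched like this - round-robin small appends across 48 files, reporting the aggregate rate (point the directory at a tmpfs mount to take the disks out of the picture; the chunk size and round count here are illustrative, not my workload's actual numbers):

```python
import os
import tempfile
import time

def aggregate_append_rate(dirpath, n_files=48, append_kb=16, rounds=20):
    """Append small chunks round-robin across n_files; return aggregate MB/s."""
    chunk = b"\0" * (append_kb * 1024)
    files = [open(os.path.join(dirpath, f"table_{i}.h5"), "ab")
             for i in range(n_files)]
    start = time.monotonic()
    for _ in range(rounds):
        for f in files:
            f.write(chunk)  # many tiny writes scattered across many files
    elapsed = time.monotonic() - start
    for f in files:
        f.close()
    total_mb = n_files * rounds * append_kb / 1024
    return total_mb / elapsed

with tempfile.TemporaryDirectory() as d:  # substitute a tmpfs path for a real test
    agg = aggregate_append_rate(d)
    print(f"{agg:.1f} MB/s")
```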

Is this just an obvious case of thrashing the heads with random writes? The individual writes themselves should be sequential, but that is likely ruined by the concurrency.

The question then is how to refactor my workload to use the TrueNAS throughput more efficiently.
The options I am thinking about are introducing a middle tier that only writes concurrently across, say, 4 threads, or redesigning the file layout so that each writer only writes to one file, or some mix of both. The files are HDF5 tables - I'm not sure whether it would achieve anything if the 48 tables were contained in one large HDF5 file. If required, I could consider a NAS rebuild if there were a strong case for it.
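The middle-tier option would look roughly like this: a bounded queue in front of a fixed pool of writer threads, so however many producer processes there are, the NAS only ever sees a handful of concurrent writers (just a sketch - file names and chunk sizes are made up):

```python
import os
import queue
import tempfile
import threading

def start_writer_pool(n_writers=4):
    """Cap concurrent file writes at n_writers regardless of producer count."""
    q = queue.Queue(maxsize=256)  # bounded: applies back-pressure to the 100+ producers

    def writer():
        while True:
            item = q.get()
            if item is None:          # sentinel: shut this writer down
                q.task_done()
                return
            path, data = item
            with open(path, "ab") as f:  # each writer appends to one file at a time
                f.write(data)
            q.task_done()

    threads = [threading.Thread(target=writer) for _ in range(n_writers)]
    for t in threads:
        t.start()
    return q, threads

def stop_writer_pool(q, threads):
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()

# Demo: 48 producers' worth of chunks funnelled through 4 writer threads.
tmp = tempfile.mkdtemp()
q, threads = start_writer_pool(n_writers=4)
for i in range(48):
    q.put((os.path.join(tmp, f"table_{i % 8}.bin"), b"x" * 1024))
q.join()                 # wait until every queued chunk has been written
stop_writer_pool(q, threads)
written = sum(os.path.getsize(os.path.join(tmp, n)) for n in os.listdir(tmp))
print(written)
```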

thanks for any suggestions.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The shape of the pool consisting of two vdevs each with 6 drives on raidz2 was chosen to get pretty good storage capacity - accepting a bit of a compromise on IO throughput.
I think you're massively understating the compromise made on IOPS here...

If your intention is to have a pool that handles the IOPS you're throwing at it, having 12 disks restrict themselves to the IOPS of only 2 disks isn't the way forward. A RAIDZ vdev has the IOPS of a single member disk. Throughput can still be decent if the IOPS demand is low, but there's no getting around low write throughput if the IOPS demand is high.

You will need to change your pool topology to mirrors in order to improve the situation.

The options I am thinking about are introducing middle tier to only write concurrently across say 4 threads
There may be something to this idea if you want to stay with your existing pool layout. Adding an NVMe pool, or something else with small capacity but high IOPS, as an initial write location to absorb the IOPS there, and later using a single "writer" to put that content into your tables, might be an alternative way forward.
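That staging idea could be sketched like this - producers write into a fast staging area and rename to *.done when finished, and a single "writer" sweeps completed files into the main pool as a few large sequential copies (the .done convention, paths and names are just illustrative, nothing TrueNAS provides):

```python
import os
import shutil
import tempfile

def sweep_staging(staging_dir, pool_dir, done_suffix=".done"):
    """One sweep: move every completed file from staging into the pool.
    Each move is one large sequential write, which RAIDZ copes with well."""
    moved = []
    for name in sorted(os.listdir(staging_dir)):
        if name.endswith(done_suffix):
            src = os.path.join(staging_dir, name)
            dst = os.path.join(pool_dir, name[: -len(done_suffix)])
            shutil.move(src, dst)  # copies across filesystems, then removes src
            moved.append(dst)
    return moved

# Demo: one finished file gets swept, one still-in-progress file is left alone.
staging = tempfile.mkdtemp()
pool = tempfile.mkdtemp()
with open(os.path.join(staging, "a.h5.done"), "wb") as f:
    f.write(b"finished table")
with open(os.path.join(staging, "b.h5"), "wb") as f:
    f.write(b"still being written")
moved = sweep_staging(staging, pool)
print(moved)
```

A cron job or small daemon calling this in a loop would be the single low-concurrency "writer".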
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
Thanks @sretalla for your response. Unfortunately the server board won't support NVMe and all my disk slots are occupied now. I've got a similar solution going currently though: I've set up a ramdisk on the client to do the high-frequency writes, then when they are complete the file(s) are moved in bulk to the server - i.e. at much lower frequency. Testing this now...
 

rich1

Dabbler
Joined
Aug 20, 2021
Messages
18
Yes, it works - by staging the bulk copies and avoiding continuous concurrent writes to the NAS, it can easily cope with the throughput.
 