TrueNAS server disk IO performance

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
This is a question concerning disk IO performance for a zfs pool.

I have a server, main hardware components are:
- 4U Supermicro Chassis, 36 drive bays (24 front, 12 rear), front and rear SAS expanders
- 36 Seagate Exos 12TB SAS drives
- AVAGO Invader SAS Controller, AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd, in JBOD mode
- Motherboard is Supermicro X11 series, dual 12-core Xeon Scalable Silver processors, 196GB DDR4 ECC RAM

This server was running Windows Server 2016 with the drive array controlled through the MegaRAID card (RAID 60). I have now repurposed it as a NAS running TrueNAS CORE 13.0-U1.

I switched the SAS RAID card to JBOD mode so the drives are passed directly to TrueNAS. I configured one storage pool composed of three RAIDZ2 vdevs of 12 drives each.

Running fio I get around 2 GiB/s write speed to the pool. The disk I/O graphs show this as around 70 MiB/s per drive. 70 MiB/s times 30 drives = 2.1 GiB/s. Seems to add up.

If I bench an individual drive I get 250 MiB/s on the outer tracks, 200 MiB/s in the middle, and 120 MiB/s on the inner tracks. Multiplying 200 MiB/s by 30 drives gives roughly 6 GiB/s. Is this the performance I should be seeing?

If so, any recommendations as to what I can do to get closer to this?

Regards, Sean

PS: This is my first post here, if I am missing some important data to advise on this, please let me know.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I switched the SAS RAID card to JBOD mode so the drives are passed directly to TrueNAS.

Well, that's not going to work well and may eventually cause you problems, including possible pool corruption.


Predicting pool performance for arbitrary pool configurations is a bit hit-or-miss. Optimal speeds come from using mirrors, while RAIDZ configurations tend to be dragged down a bit by their inherent overhead. For IOPS, a RAIDZ pool typically has something resembling the IOPS of the slowest component device in each vdev, meaning a pool of 3 vdevs of 12-disk RAIDZ, built from conventional HDDs with maybe 250 IOPS each, might tend to peak out around 600-800 IOPS. By comparison, a mirror system of 18 vdevs of 2-disk mirrors will see something closer to 4500 IOPS write / 9000 IOPS read. This ties into actual MB/sec in rather byzantine ways, which makes it hard to predict.
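
That IOPS ceiling shows up in small-block random I/O rather than big sequential writes. A rough sketch of the kind of test that exposes it (the mountpoint, sizes and job counts here are only placeholders, not anything from this thread):
Code:
# Hypothetical small-block random-write test; point --directory at a dataset on the pool.
fio --name=randwrite --directory=/mnt/tank/fiotest --ioengine=posixaio \
    --rw=randwrite --bs=4k --size=4g --numjobs=8 --iodepth=16 \
    --runtime=60 --time_based --group_reporting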
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd also add that JBOD mode on RAID controllers often has surprisingly poor performance (presumably because nobody cared enough to implement it properly). Just one more reason to use a real HBA.
 

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
Yes, understood on the RAID card in JBOD mode vs. an HBA. I have ordered a SAS3 HBA, which should arrive soon. Once it's here, I will swap it in and run new tests. In the meantime, I think I will configure the pool as striped mirrors and see how that compares on performance.
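
For reference, a striped-mirror layout is just many two-disk mirror vdevs in one pool. A rough CLI sketch of the idea (pool and device names are made up, and on TrueNAS the pool would normally be built through the GUI):
Code:
# Sketch only: each "mirror daX daY" pair is one vdev; all vdevs are striped together.
zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5
# ...and so on for the remaining pairs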

I will post the results. Thank you for your data and advice.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Can you post the actual fio command(s) you're using? If it's only a single stream of I/O, you likely won't be able to leverage the wide, multi-vdev setup you have. You'll have similar challenges extracting maximum performance from small records in RAIDZ.

It's unfortunately not as simple as "individual disk speed times number of disks" - the vdev geometry has a big impact, as does the workload you're putting on it. Wider vdevs also need to "coordinate" a bit more, in that all members have to work together to complete a write.

Since it seems as if you can reconfigure the array now without impacting data, this is a good time to change things and figure out the proper layout.

What kind of data do you intend to load on this system once it's built?
 

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
Can you post the actual fio command(s) you're using? If it's only a single stream of I/O, you likely won't be able to leverage the wide, multi-vdev setup you have. You'll have similar challenges extracting maximum performance from small records in RAIDZ.
fio command:
Code:
fio --name=write --ioengine=posixaio --rw=write --bs=128m --size=32g --overwrite=0 --numjobs=16 --iodepth=16 --runtime=120 --time_based --end_fsync=1


I am not a master of fio; I just modified this from another post. I have run some variations with smaller block sizes and smaller file sizes.

As discussed above, I went ahead and reconfigured the pool as 18 striped mirrors and got only marginally better overall write speed: 2.1 GiB/s versus 2 GiB/s. With effectively only 18 drives being written to, the per-drive write speed went up from 70 MiB/s to 120 MiB/s, but the overall write performance did not change much.
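
(The per-drive numbers above come from the GUI graphs; per-vdev and per-disk throughput can also be watched live from the shell during a run, e.g. with something like the following, assuming the pool is named tank:)
Code:
# Print per-vdev and per-disk bandwidth every second while the benchmark runs.
zpool iostat -v tank 1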

This server will be used to host media files for a film editorial department: everything from image sequences at 6-8 MB per frame up to 100+ GB MOV and MXF files for real-time playback.
 

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
Additionally, as a test and not for production, I went ahead and just striped all 36 drives with no protection, ran the fio command above, and got about 4.2 GiB/s. That graphs out at about 115-120 MiB/s per drive across 36 drives.

Switching the pool back to 3 x 12-drive RAIDZ2 now.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
A 12-wide RAIDZ2 requires at least 40KB of data to fill a single row, and a few times more than that to reach its target space efficiency. With 128KB blocks it will write only about 12KB per drive per I/O, which is OK only if perfectly aggregated, but even then it creates CPU overhead. From your choice of pool topology I guess you wish to store some large data. If so, you may try increasing both the recordsize and the test I/O size from 128KB to 1MB to see whether that improves the results.
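
E.g., a minimal sketch of that change (pool and dataset names are placeholders; recordsize only affects blocks written after the change):
Code:
# Set 1MB records on the test dataset, then re-run fio with a matching block size.
zfs set recordsize=1M tank/bench
fio --name=write --directory=/mnt/tank/bench --ioengine=posixaio --rw=write \
    --bs=1m --size=32g --numjobs=16 --iodepth=16 --runtime=120 --time_based --end_fsync=1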

PS: The first thing I would look at is `top -SHIz` output during the test, to see total CPU usage and possible single-thread bottlenecks.

PPS: Considering the PCIe Gen3 motherboard and a likely x8 HBA, I'd estimate maximum HBA throughput at about 6GB/s (an x8 Gen3 link tops out just under 8GB/s before protocol overhead), so the 4.2GB/s of the stripe is not that far from it. Mirrors obviously provide half of that at the application level due to doubled disk write traffic. 12-wide RAIDZ2 adds less (about 20% extra traffic), but many more small disk I/Os to aggregate and more parity to calculate. I also hope your backplanes are SAS3 and connected separately, so they won't cause additional bottlenecks at the SAS level.

PPPS: If you end up with two HBAs, it would be interesting to try using both at the same time, one connected with two cables to the front backplane and the other with two cables to the rear backplane, to minimize both HBA and SAS bottlenecks.
 
Last edited:

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
A 12-wide RAIDZ2 requires at least 40KB of data to fill a single row, and a few times more than that to reach its target space efficiency. With 128KB blocks it will write only about 12KB per drive per I/O, which is OK only if perfectly aggregated, but even then it creates CPU overhead. From your choice of pool topology I guess you wish to store some large data. If so, you may try increasing both the recordsize and the test I/O size from 128KB to 1MB to see whether that improves the results.

PS: The first thing I would look at is `top -SHIz` output during the test, to see total CPU usage and possible single-thread bottlenecks.
Thank you, mav. I will check that out and test with the larger record size.
 

Sean M

Cadet
Joined
Mar 1, 2021
Messages
6
A 12-wide RAIDZ2 requires at least 40KB of data to fill a single row, and a few times more than that to reach its target space efficiency. With 128KB blocks it will write only about 12KB per drive per I/O, which is OK only if perfectly aggregated, but even then it creates CPU overhead. From your choice of pool topology I guess you wish to store some large data. If so, you may try increasing both the recordsize and the test I/O size from 128KB to 1MB to see whether that improves the results.

PS: The first thing I would look at is `top -SHIz` output during the test, to see total CPU usage and possible single-thread bottlenecks.

PPS: Considering the PCIe Gen3 motherboard and a likely x8 HBA, I'd estimate maximum HBA throughput at about 6GB/s (an x8 Gen3 link tops out just under 8GB/s before protocol overhead), so the 4.2GB/s of the stripe is not that far from it. Mirrors obviously provide half of that at the application level due to doubled disk write traffic. 12-wide RAIDZ2 adds less (about 20% extra traffic), but many more small disk I/Os to aggregate and more parity to calculate. I also hope your backplanes are SAS3 and connected separately, so they won't cause additional bottlenecks at the SAS level.

PPPS: If you end up with two HBAs, it would be interesting to try using both at the same time, one connected with two cables to the front backplane and the other with two cables to the rear backplane, to minimize both HBA and SAS bottlenecks.
OK, so I made just the 1M recordsize change, and for the fio test with bs=128m and numjobs=16 I see a remarkable performance increase: it reports nearly 4 GiB/s (3.8) on the write test. This is with the pool of 3 x 12-drive RAIDZ2 vdevs.

Yes, I am running a SAS3 HBA in an x8 slot with two SAS3 expander backplanes. The front has 24 drives, the rear has 12. Each backplane is fed from a separate port on the HBA.

Thank you very much for your insight.
 