Should I be making multiple dRAID vdevs?

Sawtaytoes

Patron
New 60-drive HDD array (same drives) with just 1 dRAID vdev:
1 x draid2:4d:60c:2s
Code:
# iodepth=1
   READ: bw=1616MiB/s (1694MB/s), 1616MiB/s-1616MiB/s (1694MB/s-1694MB/s), io=15.8GiB (16.9GB), run=10003-10003msec
  WRITE: bw=1703MiB/s (1786MB/s), 1703MiB/s-1703MiB/s (1786MB/s-1786MB/s), io=16.6GiB (17.9GB), run=10003-10003msec

# no iodepth
   READ: bw=1627MiB/s (1706MB/s), 1627MiB/s-1627MiB/s (1706MB/s-1706MB/s), io=15.9GiB (17.1GB), run=10004-10004msec
  WRITE: bw=1716MiB/s (1799MB/s), 1716MiB/s-1716MiB/s (1799MB/s-1799MB/s), io=16.8GiB (18.0GB), run=10004-10004msec

# time cp -a
Copied 20 GiB at 1489.90 MB/s.

The only difference here is 2 hot spares total rather than 1 per vdev (4 total) in the earlier layout. I'm not sure the spares make a huge difference, since they're just distributed "gap" space; completely unused.

To me, this is definitive proof that dRAID isn't usable for large datasets if bandwidth or IOPS is even remotely important. If all you care about is storing data and never looking at it again, you should be fine.

I'm wondering if it'd perform better with SSDs, but I can't test that until I transfer my data off. It seems like dRAID is only good if you break it up into separate vdevs, just like RAID-Z :(. It still has tons of advantages over RAID-Z, but 1 vdev isn't gonna cut it!

EDIT: Oh shoot! I just realized I used 4 data disks in this set and 5 data disks in the other one. But 4 data disks should be faster, right?

This layout is 9 redundancy groups, whereas before, with 4 vdevs, I had only 8 redundancy groups: 2 per dRAID vdev.
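For reference, here's roughly how these two layouts compare from the CLI (a sketch: `Temp` is the pool name, and `$DISKS` is a hypothetical variable holding all 60 drive paths):

Code:
# All 60 drive paths in one variable (the by-id pattern is hypothetical;
# match it to your own drives)
DISKS=$(ls /dev/disk/by-id/scsi-*HGST* | grep -v part)

# This post's layout: one draid2 vdev across all 60 drives, 2 distributed spares
zpool create Temp draid2:4d:60c:2s $DISKS

# The earlier layout: 4 separate draid2 vdevs of 15 drives, 1 spare each
set -- $DISKS
zpool create Temp \
  draid2:5d:15c:1s "${@:1:15}" \
  draid2:5d:15c:1s "${@:16:15}" \
  draid2:5d:15c:1s "${@:31:15}" \
  draid2:5d:15c:1s "${@:46:15}"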
 

Sawtaytoes

Patron
HDD draid1:1d:60c:0s
This is the fastest dRAID can go. It's essentially all mirrors.
Code:
   READ: bw=1863MiB/s (1954MB/s), 1863MiB/s-1863MiB/s (1954MB/s-1954MB/s), io=18.2GiB (19.5GB), run=10003-10003msec
  WRITE: bw=1940MiB/s (2034MB/s), 1940MiB/s-1940MiB/s (2034MB/s-2034MB/s), io=19.0GiB (20.3GB), run=10003-10003msec

1-data is definitely faster than 4-data, but not fast enough.

But then I tried 30 mirrors:
Code:
   READ: bw=974MiB/s (1022MB/s), 974MiB/s-974MiB/s (1022MB/s-1022MB/s), io=9744MiB (10.2GB), run=10002-10002msec
  WRITE: bw=1043MiB/s (1094MB/s), 1043MiB/s-1043MiB/s (1094MB/s-1094MB/s), io=10.2GiB (10.9GB), run=10002-10002msec

# With cache enabled just as a sanity check
   READ: bw=3110MiB/s (3261MB/s), 3110MiB/s-3110MiB/s (3261MB/s-3261MB/s), io=30.4GiB (32.6GB), run=10001-10001msec
  WRITE: bw=3276MiB/s (3436MB/s), 3276MiB/s-3276MiB/s (3436MB/s-3436MB/s), io=32.0GiB (34.4GB), run=10001-10001msec

# Back to no cache
   READ: bw=1367MiB/s (1434MB/s), 1367MiB/s-1367MiB/s (1434MB/s-1434MB/s), io=13.5GiB (14.4GB), run=10076-10076msec
  WRITE: bw=1448MiB/s (1519MB/s), 1448MiB/s-1448MiB/s (1519MB/s-1519MB/s), io=14.2GiB (15.3GB), run=10076-10076msec

# time cp -a /mnt/Bunnies/performanceTest /mnt/Temp/
20 GiB transferred at 2092.41 MB/s.

Conclusion: mirrors are also bad?! The `cp -a` was a lot faster than `fio` for once, but note that the cache gets re-enabled after `fio` testing, so the copy may have benefited from it.

Also, the `fio` test I'm doing does reads and writes at the same time, which is going to be slower.
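(For reference, a 30-mirror pool like the one above can be built from the CLI along these lines; a sketch, again assuming `$DISKS` holds the 60 drive paths:)

Code:
# Pair the 60 drives into 30 x 2-way mirrors
set -- $DISKS
ARGS=""
while [ "$#" -ge 2 ]; do
  ARGS="$ARGS mirror $1 $2"
  shift 2
done
zpool create Temp $ARGS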
 

Sawtaytoes

Patron
I thought about this some more. My `fio` command was writing a 5GB file but transferring more than 5GB of data, so I figured a larger file would exercise more of each drive.

Another thing I remembered: while RAID-Z and dRAID pools default to a 1M recordsize, mirrors default to 128K.

Code:
# HDD: 30 x mirrors w/ 1M recordsize
zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=50G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

# --bs=16M | recordsize 1M
   READ: bw=3121MiB/s (3273MB/s), 3121MiB/s-3121MiB/s (3273MB/s-3273MB/s), io=30.5GiB (32.7GB), run=10001-10001msec
  WRITE: bw=3280MiB/s (3439MB/s), 3280MiB/s-3280MiB/s (3439MB/s-3439MB/s), io=32.0GiB (34.4GB), run=10001-10001msec

# --bs=1M | recordsize 1M (seems like the block size doesn't matter once it's at or above the recordsize)
   READ: bw=3052MiB/s (3200MB/s), 3052MiB/s-3052MiB/s (3200MB/s-3200MB/s), io=29.8GiB (32.0GB), run=10002-10002msec
  WRITE: bw=3207MiB/s (3363MB/s), 3207MiB/s-3207MiB/s (3363MB/s-3363MB/s), io=31.3GiB (33.6GB), run=10002-10002msec

# --bs=1M | recordsize 128K
   READ: bw=1925MiB/s (2019MB/s), 1925MiB/s-1925MiB/s (2019MB/s-2019MB/s), io=18.8GiB (20.2GB), run=10017-10017msec
  WRITE: bw=1925MiB/s (2019MB/s), 1925MiB/s-1925MiB/s (2019MB/s-2019MB/s), io=18.8GiB (20.2GB), run=10017-10017msec

# --bs=1M | recordsize 16M ⛔ (I guess block size does matter when it's below the recordsize)
   READ: bw=199MiB/s (208MB/s), 199MiB/s-199MiB/s (208MB/s-208MB/s), io=1987MiB (2084MB), run=10006-10006msec
  WRITE: bw=210MiB/s (220MB/s), 210MiB/s-210MiB/s (220MB/s-220MB/s), io=2097MiB (2199MB), run=10006-10006msec
# Second run done after other tests just to make sure I'm not crazy
   READ: bw=201MiB/s (210MB/s), 201MiB/s-201MiB/s (210MB/s-210MB/s), io=2014MiB (2112MB), run=10037-10037msec
  WRITE: bw=212MiB/s (222MB/s), 212MiB/s-212MiB/s (222MB/s-222MB/s), io=2129MiB (2232MB), run=10037-10037msec

# --bs=16M | recordsize 16M
   READ: bw=2463MiB/s (2583MB/s), 2463MiB/s-2463MiB/s (2583MB/s-2583MB/s), io=24.2GiB (25.9GB), run=10041-10041msec
  WRITE: bw=2580MiB/s (2705MB/s), 2580MiB/s-2580MiB/s (2705MB/s-2705MB/s), io=25.3GiB (27.2GB), run=10041-10041msec

# --bs=32M | recordsize 16M (as expected, block size doesn't matter once it's over the recordsize)
   READ: bw=2236MiB/s (2344MB/s), 2236MiB/s-2236MiB/s (2344MB/s-2344MB/s), io=21.8GiB (23.5GB), run=10005-10005msec
  WRITE: bw=2354MiB/s (2468MB/s), 2354MiB/s-2354MiB/s (2468MB/s-2468MB/s), io=23.0GiB (24.7GB), run=10005-10005msec

I think this needs more testing. I wish I could script the creation of zpools, but TrueNAS doesn't see pools created that way. It might work fine from the CLI without TrueNAS, but I want to simulate what it'd be like if I were actually running the thing.

1M is definitely better for mirrors on this machine. That means my SSD pool, which is at 128K, might also need a larger [max] recordsize. I can't test the SSDs until I figure this out, though.
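Since recordsize is a per-dataset property, switching it between runs is just a `zfs set` (a sketch; it only applies to newly written blocks, so the test file has to be recreated after each change, and 16M records need large-block support, which on older OpenZFS means raising the `zfs_max_recordsize` module parameter):

Code:
zfs set recordsize=1M Temp
zfs get recordsize Temp       # verify
rm /mnt/Temp/performanceTest  # recreate the test file so it uses the new recordsize

zfs set recordsize=16M Temp   # needs large-block support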
 

Sawtaytoes

Patron
Code:
# HDD: 1 x draid1:3d:60c:1s w/ 1M recordsize
zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=50G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

# --bs=1M | recordsize 1M | --rw=readwrite
   READ: bw=1588MiB/s (1665MB/s), 1588MiB/s-1588MiB/s (1665MB/s-1665MB/s), io=15.5GiB (16.7GB), run=10001-10001msec
  WRITE: bw=1583MiB/s (1660MB/s), 1583MiB/s-1583MiB/s (1660MB/s-1660MB/s), io=15.5GiB (16.6GB), run=10001-10001msec

# --bs=16M | recordsize 1M | --rw=readwrite
   READ: bw=1740MiB/s (1824MB/s), 1740MiB/s-1740MiB/s (1824MB/s-1824MB/s), io=17.0GiB (18.3GB), run=10016-10016msec
  WRITE: bw=1821MiB/s (1910MB/s), 1821MiB/s-1821MiB/s (1910MB/s-1910MB/s), io=17.8GiB (19.1GB), run=10016-10016msec

# --bs=1M | recordsize 128K | --rw=readwrite (yes, recordsize 128K is slower. 100% confirmed!)
   READ: bw=661MiB/s (693MB/s), 661MiB/s-661MiB/s (693MB/s-693MB/s), io=6614MiB (6935MB), run=10001-10001msec
  WRITE: bw=664MiB/s (696MB/s), 664MiB/s-664MiB/s (696MB/s-696MB/s), io=6642MiB (6965MB), run=10001-10001msec

# --bs=1M | recordsize 1M | --rw=read
   READ: bw=1443MiB/s (1513MB/s), 1443MiB/s-1443MiB/s (1513MB/s-1513MB/s), io=14.1GiB (15.1GB), run=10001-10001msec

# --bs=1M | recordsize 1M | --rw=write
  WRITE: bw=1801MiB/s (1888MB/s), 1801MiB/s-1801MiB/s (1888MB/s-1888MB/s), io=17.6GiB (18.9GB), run=10001-10001msec

Doing separate read and write runs suggests the `--rw=readwrite` numbers are at least roughly accurate.
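(The separate runs were just the same command with the `--rw` mode swapped:)

Code:
# Read-only pass, same parameters as the readwrite run above
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 \
    --rw=read --bs=1M --runtime=10 --size=50G --time_based --name=fio

# Write-only pass
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 \
    --rw=write --bs=1M --runtime=10 --size=50G --time_based --name=fio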

After these tests, it's clear my previous performance tests were valid. The only data stored badly is what's on my SSDs, since it's all 128K records, the default for mirrors. Ugh...

Conclusion: dRAID also needs more vdevs.
 

Sawtaytoes

Patron
Code:
# HDD: 15 x draid1:2d:4c:1s w/ 1M recordsize
zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=1M --runtime=10 --size=50G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

# --bs=1M | recordsize 1M | --rw=readwrite
   READ: bw=2305MiB/s (2417MB/s), 2305MiB/s-2305MiB/s (2417MB/s-2417MB/s), io=22.5GiB (24.2GB), run=10001-10001msec
  WRITE: bw=2288MiB/s (2399MB/s), 2288MiB/s-2288MiB/s (2399MB/s-2399MB/s), io=22.3GiB (24.0GB), run=10001-10001msec
# Second run done after other runs as a sanity check
   READ: bw=2298MiB/s (2409MB/s), 2298MiB/s-2298MiB/s (2409MB/s-2409MB/s), io=22.6GiB (24.3GB), run=10083-10083msec
  WRITE: bw=2282MiB/s (2393MB/s), 2282MiB/s-2282MiB/s (2393MB/s-2393MB/s), io=22.5GiB (24.1GB), run=10083-10083msec

# --bs=1M | recordsize 1M | --rw=read
   READ: bw=2921MiB/s (3063MB/s), 2921MiB/s-2921MiB/s (3063MB/s-3063MB/s), io=28.5GiB (30.6GB), run=10001-10001msec

# --bs=1M | recordsize 1M | --rw=write
  WRITE: bw=3699MiB/s (3879MB/s), 3699MiB/s-3699MiB/s (3879MB/s-3879MB/s), io=36.1GiB (38.8GB), run=10001-10001msec

Doing separate read and write runs in this one gave me vastly different numbers. I guess there's more work to be done.
 

Sawtaytoes

Patron
Code:
# HDD: 4 x draid2:5d:15c:1s w/ 1M recordsize
zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=1M --runtime=10 --size=50G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

# --bs=1M | recordsize 1M | --rw=readwrite
   READ: bw=2328MiB/s (2441MB/s), 2328MiB/s-2328MiB/s (2441MB/s-2441MB/s), io=22.8GiB (24.4GB), run=10009-10009msec
  WRITE: bw=2311MiB/s (2423MB/s), 2311MiB/s-2311MiB/s (2423MB/s-2423MB/s), io=22.6GiB (24.3GB), run=10009-10009msec

# --bs=1M | recordsize 16M | --rw=readwrite
   READ: bw=2920MiB/s (3062MB/s), 2920MiB/s-2920MiB/s (3062MB/s-3062MB/s), io=28.5GiB (30.6GB), run=10001-10001msec
  WRITE: bw=3051MiB/s (3199MB/s), 3051MiB/s-3051MiB/s (3199MB/s-3199MB/s), io=29.8GiB (32.0GB), run=10001-10001msec

# --bs=16M | recordsize 16M | --rw=readwrite (wanted to try this again. Still doesn't make a difference)
   READ: bw=3116MiB/s (3268MB/s), 3116MiB/s-3116MiB/s (3268MB/s-3268MB/s), io=30.5GiB (32.7GB), run=10017-10017msec
  WRITE: bw=3273MiB/s (3432MB/s), 3273MiB/s-3273MiB/s (3432MB/s-3432MB/s), io=32.0GiB (34.4GB), run=10017-10017msec

# --bs=1M | recordsize 1M | --rw=read
   READ: bw=3053MiB/s (3201MB/s), 3053MiB/s-3053MiB/s (3201MB/s-3201MB/s), io=29.8GiB (32.0GB), run=10001-10001msec

# --bs=1M | recordsize 1M | --rw=write
  WRITE: bw=4003MiB/s (4197MB/s), 4003MiB/s-4003MiB/s (4197MB/s-4197MB/s), io=39.1GiB (42.0GB), run=10001-10001msec

Back to the original lineup that I'm probably gonna use. I'm surprised this is so much better!
 

sfatula

Guru
I said "no iodepth" and "no numdepth," but I meant no numjobs and no iodepth. Oh well; it looks like you skipped both later anyway. The results make no sense to me. Sorry, I can't explain them.
 

Sawtaytoes

Patron
sfatula said:
I said "no iodepth" and "no numdepth," but I meant no numjobs and no iodepth. Oh well; it looks like you skipped both later anyway. The results make no sense to me. Sorry, I can't explain them.
Yeah, I realized what you were saying later, but I have absolutely no clue why the results are bad in places where they should be good.

I noticed my mirrored SSDs are slower than I'd expect as well. I'm guessing part of that is the 128K recordsize, but again, I can't test the SSDs until I move that data over to my HDDs. Then I can finally delete the SSD pool and try more dRAID tests.

---

Maybe my system is underpowered?

EPYC 7313P (16 cores; roughly a Ryzen 9 5950X equivalent).
256GB of DDR4-3200.
All on a Supermicro H12SSL-NT motherboard.

The HDDs are all HGST H7210A520SUN010T 10TB drives.
The SSDs are all Crucial MX500s, either 2TB or 4TB models.

The NIC is a dual-port 25GbE Mellanox ConnectX-6.

My HBAs are all LSI 9305s: five 9305-24i and one 9305-16e, all PCIe 3.0 x8 cards.
All four ports on the 16e connect to 4 separate Adaptec 82885T SAS expander cards in another Storinator XL60 case housing all 60 HDDs.

---

I'm assuming this just isn't enough of a system to get adequate performance with the configurations I've been trying. I don't know much about server hardware, and I may have bought these older parts without realizing they're unsuitable for good ZFS performance.
 

Sawtaytoes

Patron
I reformatted all these drives from 512e to 4Kn. That didn't change the read and write speeds at all, but adding two SSD metadata drives increased them significantly!

Code:
# 10 TiB HDD 4 vdev x draid2:5d:15c:1s
#  2 TiB SSD 1 vdev x 2-drive mirror

zfs set primarycache=none Temp
fio --ioengine=libaio --filename=/mnt/Temp/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=50G --time_based --name=fio
rm /mnt/Temp/performanceTest
zfs set primarycache=all Temp

   READ: bw=3968MiB/s (4161MB/s), 3968MiB/s-3968MiB/s (4161MB/s-4161MB/s), io=38.8GiB (41.6GB), run=10003-10003msec
  WRITE: bw=4119MiB/s (4319MB/s), 4119MiB/s-4119MiB/s (4319MB/s-4319MB/s), io=40.2GiB (43.2GB), run=10003-10003msec

This configuration is what I'm gonna use going forward as I copy over all my SSD zpool data. I'm eventually going to reconfigure my SSD zpool as dRAID and copy all that data back.

This is why I need the speed: I don't want my system down for any long amount of time, and I may have to run off the HDD zpool temporarily.

After starting the transfer, it's gone from taking over 4 days to about half a day. That's a HUGE difference.
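For anyone wanting to do the same, the steps look roughly like this (a sketch: device names are placeholders, and `sg_format` from sg3_utils is one way to do the 512e-to-4Kn reformat, not necessarily the only one):

Code:
# Reformat one drive from 512e to 4Kn (destroys all data on that drive)
sg_format --format --size=4096 /dev/sgN

# Add the two SSDs as a mirrored special (metadata) vdev.
# The pool depends on this vdev, so it should always be mirrored.
zpool add Temp special mirror /dev/sdX /dev/sdY

# Optional: also push small file blocks onto the SSDs
# (value must be below the dataset recordsize; 0 = metadata only)
zfs set special_small_blocks=64K Temp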
 

claw

Cadet
Sawtaytoes said:
I reformatted all these drives from 512e to 4Kn. That didn't change the read and write speeds at all, but adding two SSD metadata drives increased them significantly! [full post quoted above]

Interesting thread, and thanks for posting your findings and results. I'm preparing to set up a ~45-drive dRAID and do some tests of my own. When you write that you're testing, e.g., 4 x draid, 5 x draid, or "10 TiB HDD 4 vdev x draid2," what exactly are you doing to create such pools? Are you building them manually in a CLI shell first? In the TrueNAS create-pool GUI, I don't see any way to create multiple separate dRAID vdevs.

Also, what was the final dRAID pool configuration you preferred? Was it the one from your prior post, "10 TiB HDD 4 vdev x draid2:5d:15c:1s"? So is that 60 disks total, spread across four separate draid2 vdevs?

Thanks
 

Sawtaytoes

Patron
I used the GUI, but I've also done it from the CLI.

When using the CLI, I needed to manually set ashift, compression, and I think one other property.

I settled on 4 vdevs of draid2:5d:15c:1s once I had all 60 x 10TB drives.
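From the CLI, that layout looks roughly like this (a sketch: `$DISKS` is a placeholder for the 60 drive paths, and the third property is my guess at `atime`, since I don't remember exactly which one it was):

Code:
# 4 x draid2:5d:15c:1s, with the properties the GUI would otherwise set for you
set -- $DISKS
zpool create -o ashift=12 -O compression=lz4 -O atime=off Temp \
  draid2:5d:15c:1s "${@:1:15}" \
  draid2:5d:15c:1s "${@:16:15}" \
  draid2:5d:15c:1s "${@:31:15}" \
  draid2:5d:15c:1s "${@:46:15}"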

My other offsite NAS with 36 bays uses 3 vdev x draid1:4d:12c:1s with the same drives.

I noticed the low-power Atom C2750 CPU wasn't fast enough to keep up. It barely reads and writes past 700MB/s, yet the same test did 2700MB/s on my main NAS when I plugged directly into its SAS expanders.

But it's only using a gigabit Ethernet connection anyway, so I don't care.

My SSD array of 112 drives ended up with 7 x draid2:5d:16c:1s of 2TB and 4TB SSDs (different sizes in different vdevs).

I found I could greatly improve read and write speeds by changing the test slightly: doing read-only and write-only runs, and running one fio job per core. My previous fio tests were all single-threaded.
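In `fio` terms, that change looks something like this (a sketch: `--directory` gives each job its own file instead of sharing one, and the per-job `--size` here is a guess):

Code:
# One job per CPU core, read-only pass; repeat with --rw=write
fio --ioengine=libaio --directory=/mnt/Temp --direct=1 --sync=0 \
    --rw=read --bs=1M --runtime=10 --size=5G --time_based \
    --numjobs=$(nproc) --group_reporting --name=fio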
 