NVMe speeds not increasing in stripe

digity

Contributor
Joined
Apr 24, 2016
Messages
156
I have 3 x 1.6 TB NVMe U.2 SSDs, each in a U.2 NVMe to PCI-e 3.0 x4 adapter. I performed some quick benchmarks with dd (sync disabled, compression off) and each drive returns 1700 MB/s read and write (each is rated for 3100 MB/s read, 1900 MB/s write, but whatever). I then put all three in a stripe pool together, but I get basically the same 1700 MB/s read and write for the pool. Why am I not getting something like 5100 MB/s (1700 x 3), or ~3700 MB/s (a likely real-world throughput for the ~3900 MB/s theoretical bandwidth of PCI-e 3.0 x4)?


P.S. - The mobo is Gigabyte X79-UP4 w/ XEON E5-4627 v2 & 16 GB DDR3 RAM. I verified all 3 PCI-e slots are operating at 3.0 x4.

P.P.S. - Another test I did was run the dd benchmark on all three individual drives simultaneously. They didn't return 1700 MB/s each, but combined they did return 2900 MB/s read and 1700 MB/s write. Not sure what to make of this, but the read number is at least closer to the quoted performance of a single drive.
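For reference, the simultaneous test was scripted so all the dd readers really start together. A rough sketch of it (the device paths, block size, and count here are placeholders, not the exact values I used; `/dev/zero` stands in so the snippet runs anywhere):

```shell
# Launch one dd reader per target at the same time, then wait for all of them.
# On the real box the targets would be the raw NVMe device nodes
# (e.g. /dev/nvd0 /dev/nvd1 /dev/nvd2 on FreeBSD) -- substitute your own.
run_parallel() {
  for dev in "$@"; do
    dd if="$dev" of=/dev/null bs=1M count=64 2>/dev/null &
  done
  wait                      # block until every background reader finishes
  echo "readers done: $#"   # $# = number of targets passed in
}
```

Usage on real hardware would be `run_parallel /dev/nvd0 /dev/nvd1 /dev/nvd2`, then dividing each drive's bytes read by the elapsed time to get per-drive and aggregate throughput.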
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"...not increasing in stripe".

So, you mean, you have three vdevs that consist of one NVMe disk each?

ZFS does not interleave vdevs the way a RAID controller might. You can get high speeds on a RAID controller with a tight interleave in the manner you're suggesting.

ZFS writes (and therefore reads) a block to a single vdev, and a block can be very large. So when you are reading the stuff back, you aren't likely to be getting any meaningful readahead, and are being limited to the speed at which it is reading from the device, then the other device, then the third device, and then back 'round...
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
jgreco said:
"...not increasing in stripe".

So, you mean, you have three vdevs that consist of one NVMe disk each?

ZFS does not interleave vdevs the way a RAID controller might. You can get high speeds on a RAID controller with a tight interleave in the manner you're suggesting.

ZFS writes (and therefore reads) a block to a single vdev, and a block can be very large. So when you are reading the stuff back, you aren't likely to be getting any meaningful readahead, and are being limited to the speed at which it is reading from the device, then the other device, then the third device, and then back 'round...

So no stripe/RAID0 performance benefits when the disks span multiple NVMe PCI-e cards? The same goes for multiple SAS/SATA HBA and RAID PCI-e cards?

If I install a 4 port U.2 HBA or RAID card, physically attach the 3 drives to that, then I'll get stripe/RAID0 performance benefits?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
digity said:
So no stripe/RAID0 performance benefits when the disks span multiple NVMe PCI-e cards? The same goes for multiple SAS/SATA HBA and RAID PCI-e cards?
What @jgreco said was that it depends on the block size and the size of the files you're working with. Saying that there's no advantage would be incorrect.
 

digity

Contributor
Joined
Apr 24, 2016
Messages
156
sretalla said:
What @jgreco said was that it depends on the block size and the size of the files you're working with. Saying that there's no advantage would be incorrect.

Oh, okay. So the block size was left at the default 128k when creating this stripe pool. I just want to see the max this setup can achieve, I definitely don't want to put these in RAID if I only get single disk performance.

Sounds like my benchmark methods aren't the best - what's the best way to benchmark this setup?
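If it helps anyone, one common alternative to dd here is fio with several concurrent jobs, since a stripe only shows its gains under parallel I/O. A sketch of a job file (the directory, size, runtime, and job count are arbitrary placeholders, not recommendations):

```ini
[global]
directory=/mnt/tank/bench   ; placeholder: a dataset on the striped pool
rw=read
bs=1M
size=4g
direct=1
ioengine=posixaio           ; available on FreeBSD; libaio is the usual choice on Linux
runtime=30
time_based

[readers]
numjobs=4                   ; parallel streams are where a stripe pays off
```

Run with `fio jobfile.fio` and compare the aggregate bandwidth at numjobs=1 versus numjobs=4.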
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, let me ask you a different question. How fast is your network? 1Gbps can manage about 120MBytes/sec. 10Gbps can manage about 1200MBytes/sec. Both those numbers are smaller than your SSD read speed.

You want a large block size with ZFS because that is the basis on which things like compression work.

Your problem is related to a concept I had to explain on freebsd-hackers a quarter century ago, which is that small interleave sizes necessarily mean that two or more HDD's get involved in almost every filesystem transaction. If you really only have one thing going on at a time, that's fine, you might get faster reads and writes. However, if you have, for example, eight drives at a 64KByte interleave, and you read a 1MB file, each drive in the pool will need to seek to the same portion of the disk and read just two 64K segments, and if you have a bunch of clients trying to access data, that will happen in a serialized format, first client #1, then #2, then #3... By comparison, if you use a large interleave, one disk can be fulfilling the entire 1MB request of client #1, while simultaneously another disk is fulfilling the entire 1MB request of client #2, etc.
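The arithmetic in that example, spelled out as a toy calculation (not a measurement):

```shell
# A 1 MiB read over 8 drives striped at 64 KiB drags every drive into the request.
FILE_KB=1024; DRIVES=8; INTERLEAVE_KB=64
SEGMENTS=$((FILE_KB / INTERLEAVE_KB))   # 16 segments in the whole file
PER_DRIVE=$((SEGMENTS / DRIVES))        # 2 segments, i.e. two 64K reads per drive
echo "$DRIVES drives x $PER_DRIVE segments of ${INTERLEAVE_KB}K each"
```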

This is related to, but not the same as, your issue. There is a related issue where when you really do want to be able to read a single thread of activity very quickly. In order to do this, you need to be reading ahead, AND you need the optimal situation where multiple vdevs had the same amounts of free space, so that block writes were interleaved between vdevs. This is not that common in my experience, but if it happens, block #1 gets written to vdev 1, block #2 gets written to vdev 2, block #3 gets written to vdev 1, and back and forth between two vdevs (or cycling between three or more). That sets up a situation where when you read blocks, you are alternating between devices, and in that case, if you can get ZFS to read ahead sufficiently far, then you will see the sort of boost you were imagining.

The problem is that this is unlikely to happen under normal conditions. However, you can get multiple clients simultaneously accessing the pool, keeping the I/O rates of the aggregate pool at a high rate. That is what ZFS is geared towards.
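A quick way to feel that difference without special tools is to compare one reader against several readers on ordinary files. A sketch using throwaway files (the paths and sizes are arbitrary; on a real test you'd point it at a dataset on the striped pool and use files larger than RAM so ARC caching doesn't dominate):

```shell
# One sequential stream vs. three concurrent streams over the same files.
BENCH=$(mktemp -d)
for i in 1 2 3; do
  dd if=/dev/zero of="$BENCH/f$i" bs=1M count=32 2>/dev/null
done
# Single client: one stream at a time.
dd if="$BENCH/f1" of=/dev/null bs=1M 2>/dev/null
# Multiple clients: three streams at once -- the pattern a striped pool rewards.
for i in 1 2 3; do
  dd if="$BENCH/f$i" of=/dev/null bs=1M 2>/dev/null &
done
wait
echo "streams complete"
rm -rf "$BENCH"
```

Time the single-stream and multi-stream passes separately; the aggregate of the parallel pass is where the extra vdevs get a chance to contribute.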
 