Help troubleshooting performance with NVMe RAID

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
So I have a PCIe card with 4x 2TB NVMe drives on it (Samsung 970 EVO). The card is an OWC and presents each drive separately. It all seems to work. I have set up a RAIDZ1, so one drive of parity and three of data. It's a scratch pool designed to be accessed over 10GbE, mostly for video working files.

I am seeing pretty poor performance from various machines and when I do a dd locally I get

WRITE: 107374182400 bytes transferred in 29.555392 secs (3,632,981,203 bytes/sec)
READ: 107374182400 bytes transferred in 56.144439 secs (1,912,463,349 bytes/sec)

By comparison, this TrueNAS box also has a RAIDZ1 of 3x 4TB SSDs and a 9-disk spinning array, and I get better performance from both of those.

So I know someone is likely to say "oh, it's probably this or that", but what I would love to know is *how* you would go about troubleshooting this?

So one thing I would try is to destroy this pool, start with just a single NVMe drive, compare, and then try some other combinations.

But is using dd a good idea? Is there a better way to benchmark? The dd command is basically dd if=/dev/zero of='/mnt/Blaze/Blaze Drive/tmp.zero' bs=2048k count=50k
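One thing I'm aware might skew this: the pool will have lz4 compression on by default, and /dev/zero is infinitely compressible, so the dd write number may be flattering. Something like this might be a fairer test (a rough sketch - 'Blaze/Blaze Drive' is my dataset name, and the 1GB random file would need looping over for a sustained run):

zfs get compression 'Blaze/Blaze Drive'
dd if=/dev/urandom of=/tmp/random.bin bs=1M count=1024
dd if=/tmp/random.bin of='/mnt/Blaze/Blaze Drive/tmp.rnd' bs=2048k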

I don't believe the card should be an issue - it handles the PCIe lanes correctly, with the switching done on board.

So any pointers really welcome...

thanks
Paul
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
dd is a single write (or read) process - NVMe drives are designed for high performance with multiple parallel access patterns or higher queue depths.
The high performance you see from regular NVMe drives in single-threaded operations is usually cache driven; the only real exception is (was) Optane NVMe drives, which have very good single-threaded performance.

If your use case will include multiple parallel write processes (multiple video editors, for example), try to recreate that scenario with something like fio, using realistic block sizes, parallel streams and potentially higher queue depths (multiple, ideally independent, activities triggered by a single client).
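For example, something along these lines (just a sketch - adjust the directory, block size and job count to whatever your editors actually do; posixaio is there so the queue depth actually applies):

fio --name=paralleltest --directory='/mnt/Blaze/Blaze Drive' --rw=write --blocksize=1m --size=10g --numjobs=4 --ioengine=posixaio --iodepth=8 --group_reporting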

Bear in mind that accessing this over the network will be slower than the local test as well.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Another vote for fio here
 

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
dd is a single write (or read) process - NVMe drives are designed for high performance with multiple parallel access patterns or higher queue depths.
The high performance you see from regular NVMe drives in single-threaded operations is usually cache driven; the only real exception is (was) Optane NVMe drives, which have very good single-threaded performance.

If your use case will include multiple parallel write processes (multiple video editors, for example), try to recreate that scenario with something like fio, using realistic block sizes, parallel streams and potentially higher queue depths (multiple, ideally independent, activities triggered by a single client).

Bear in mind that accessing this over the network will be slower than the local test as well.
Thank you!

Well, I noticed abysmal performance just reading and writing from a single machine - reads of 50 to 100MB/s. I used dd locally to rule out networking. If I run dd against all my pools, then in relative terms the NVMe pool is the poorest performer, which tells me that something isn't right.

I think it's a case of expectations. I am naively thinking that I want my NVMe pool to saturate a 10GbE connection, and I don't see why it can't. Whilst the drives may be rated for 3GB/s, I don't need that. In a RAIDZ1 config I would expect three of them to be able to do 1GB/s, not 100MB/s.

fio is a good call, but I think I want to get local dd performance sorted before stepping back out to the network - wouldn't that make sense?

I am going to destroy that pool and recreate in a number of ways to compare first.

The PCIe card is expensive and presents as 4 drives; there's no RAID on board. There is a PCIe switch, but it's an x8 card, and what they claim is that if only one drive is being accessed it can run at x4 speed, rather than the x2 each drive gets when all four are active. But x2 on PCIe 3.0 is about 2GB/s, so really that would be plenty for my use case. So I don't think the card is a problem, but of course who knows...
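For reference, my back-of-envelope numbers: PCIe 3.0 runs at 8 GT/s per lane, which after 128b/130b encoding works out to roughly 985 MB/s per lane. So x2 is about 1.97GB/s per drive with all four active, x4 is about 3.9GB/s for a single drive, and the x8 uplink is around 7.9GB/s shared across the whole card.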

Kindest
Paul
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
fio is a good call, but I think I want to get local dd performance sorted before stepping back out to the network - wouldn't that make sense?
fio tests local performance (and does it properly, unlike dd).

cd to somewhere in the pool you want to test and run this:

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=50g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting

Remember to remove fio-tempfile.dat at the end (or use it in a read-oriented test later... I assume read is less interesting to test for your scenario)
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
The problem is that things don't scale up as one would expect (which I learned the hard way some time back).

I am not sure why - one would think that caching in memory should speed things up - but nowadays I think of it as follows
(I am happy to provide the countless discussions I started here and on STH as a reference for why I think that way):

1 write process writes to a single disk with up to that one disk's IOPS.
1 write process writes to a single mirror with up to one disk's IOPS (as it's mirrored).
1 write process writes to a pair of mirrors with up to one disk's IOPS (it's still a single process writing; it might write to both pairs in round-robin fashion, but that does not speed things up much).
2 write processes write to a single mirror with up to one disk's IOPS (as it's limited by the single vdev's IOPS).
2 write processes write to a pair of mirrors with up to two disks' IOPS (as it's basically a single process writing to each pair).
And so on.

Basically you need a write process per vdev/mirror pair to get that vdev's single-threaded IOPS performance.
Now if you look at SSDs, and even more so at NVMe, they are made for parallel access.
So imagine you need 16 write processes to saturate an NVMe drive.
If you have 2 mirror pairs you actually need 32 processes to saturate the pool; if you have 8 pairs you need 128 processes...

Of course this is vastly simplified - I am sure memory streamlining the writes helps some; writes are almost never only writes but almost always involve reads too; and block size, network latency and more have their impact as well - but it's a start towards a more realistic expectation level.
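If you want to see where your pool stops scaling, a quick sweep over the job count shows it (a sketch - run it from a directory on the pool, watch where the aggregate bandwidth levels off, and delete the sweep* files afterwards):

for j in 1 2 4 8 16 32; do fio --name=sweep$j --rw=write --blocksize=128k --size=2g --runtime=60 --time_based --numjobs=$j --ioengine=posixaio --iodepth=8 --group_reporting; done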
 

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
So getting closer.

First, I will have a go with fio, but I was using dd as it offered a comparison to the other pools I have.

I rebuilt the pool several times to test

Single 2TB NVMe: Write 2581MB/s, Read 3092MB/s
Quad stripe: Write 3859MB/s, Read 2406MB/s
RAIDZ1, 128k recordsize: Write 3599MB/s, Read 1911MB/s
RAIDZ1, 512k recordsize: Write 3982MB/s, Read 2874MB/s

I presume the read on the quad stripe is held back by single-threaded performance, as suggested above.

So RAIDZ1 with a 512k recordsize performed well enough for me. Moving back to my workstation I tested the speed (Blackmagic Disk Speed Test) and got 950MB/s write and 850MB/s read. So I am happy with that.

But the first time around it was RAIDZ1 with 128k and I was getting reads of 100MB/s, so this tells me that something else was going on, because the setup now isn't that different from what I had before. So I will need to keep an eye on it to see how it varies, and whether there is actually a networking thing going on.
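To separate the network from the pool, I'll also run iperf3 between the workstation and the TrueNAS box (I believe iperf3 is already on the box; <truenas-ip> is whatever yours is):

iperf3 -s (on the TrueNAS box)
iperf3 -c <truenas-ip> -t 30 (from the workstation)

If that reports something close to 9.4Gb/s then the 10GbE path is fine and the variation is in the pool.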

I wonder if there are issues as the drive fills up, so that's the next test too.

@Rand - the IOPS aspect is interesting and I need to dig in and understand what you're saying. IOPS isn't so important for me; the pool is storing large proxy video files and frames. But even so, I like to understand, so thank you for your reply. You say I need a write process per vdev - as I understand it, my RAIDZ1 is a single vdev? And from my experiment above on a single NVMe, am I getting close to the drive's real-world single-threaded performance through dd?

@sretalla So I'm going to have a go with fio. Read performance is important to me, so I will need to work out what the read version of that command is too.

Thank you all
Paul
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
Reads are different from writes, as they can be filled from multiple drives at the same time (each drive reads a certain number of blocks), so more drives help with reads even for single-threaded reads.

Yes, a RAIDZ is a single vdev for writes (not for reads, as mentioned above). If you have many parallel users then moving to a set of mirrors might speed up writes, but usually NVMe drives are so good at multithreaded writes that it might not even show with your user/process base.

fio read simply uses --rw=read

If you only have large files then you can also move to a 1M recordsize without issues. You might lose a tiny bit of space but will reduce overhead.
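Setting it is a single property on the dataset (a sketch - dataset name taken from your earlier post; note it only applies to files written after the change, existing files keep their old record size):

zfs set recordsize=1M 'Blaze/Blaze Drive'
zfs get recordsize 'Blaze/Blaze Drive'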
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
@sretalla So I'm going to have a go with fio. Read performance is important to me, so I will need to work out what the read version of that command is too.
fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=read --size=50g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting
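One caveat on the read side: if the 50GB test file fits in ARC you will partly be measuring RAM rather than the drives. If the box has a lot of memory, bump --size above the RAM size first, e.g. (assuming around 64GB of RAM - adjust to yours):

fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=read --size=100g --io_size=1500g --blocksize=128k --iodepth=16 --direct=1 --numjobs=16 --runtime=120 --group_reporting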
 

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
As this is a scratch drive, why not stripe?
I was going to originally, but then if I lost everything mid-project it would be a pain to restore/recreate. I tested various RAID layouts and found that with RAIDZ1 I can now saturate a 10GbE link on both read and write. So I thought that extra level of safety might be worth it.

But that's balanced by the fact that I've actually never had an NVMe drive fail on me (once it's all working), and I understand that they have redundancy and RAID-like structures internally.

So I may just be wasting one of them on parity.

I'd be interested in what you think.

cheers
Paul
 

Rand

Guru
Joined
Dec 30, 2013
Messages
906
But that's balanced by the fact that I've actually never had an NVMe drive fail on me (once it's all working), and I understand that they have redundancy and RAID-like structures internally.
I would be very careful with that when exposing NVMe drives to heavy writes, especially with consumer drives like the EVO line...
Sure, SSDs tend to last much longer than expected, and usually you tax them less than you think, but if you really hammer them day in and day out they die like flies.
If you want to read up, search Reddit for people killing NVMe drives in the early Chia days. ;)
 

paulinventome

Explorer
Joined
May 18, 2015
Messages
62
I would be very careful with that when exposing NVMe drives to heavy writes, especially with consumer drives like the EVO line...
Sure, SSDs tend to last much longer than expected, and usually you tax them less than you think, but if you really hammer them day in and day out they die like flies.
If you want to read up, search Reddit for people killing NVMe drives in the early Chia days. ;)
I suppose it comes down to the definition of "heavy writes".

They're not hammered day in and day out, and as I understand it, it's more about the total amount written than the speed at which you write it - there's an overall life expectancy based on lifetime data transferred?

I'd imagine these will last a few years, and then I'd move to whatever the next step up is by then. I think!

Cheers
Paul
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
They're not hammered day in and day out, and as I understand it, it's more about the total amount written than the speed at which you write it - there's an overall life expectancy based on lifetime data transferred?
Correct - most drives carry a lifetime endurance rating, usually expressed as TBW (TeraBytes Written) or DWPD (Drive Writes Per Day).

Some vendors will tie their warranty directly to this rating - for example, if you use a consumer disk for write-logging and proceed to massively exceed the TBW metric, the vendor may use that as grounds to deny the RMA.
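If you want to track where your drives sit against that rating, smartctl will report it for NVMe (substitute your device node; "Data Units Written" counts in units of 512,000 bytes, and "Percentage Used" is the drive's own estimate of endurance consumed):

smartctl -a /dev/nvme0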
 