Dell R7415 EPYC Gen-1, 24x NVMe, slower RAIDz perf than the SSDs comprising it

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
Dell PowerEdge R7415
AMD Epyc 7351P (1st-Gen)
16c 2.4GHz | 2.9GHz 64M Cache

256GB DDR4-2400 ECC
24x SFF NVMe slots

4x 7.68TB Micron 7300 Pro NVMe x4
8x 7.68TB Micron 9300 Pro NVMe x4 (while they perform well, I've limited their use to testing only, as the fans ramp up because these weren't originally sold with the R7415)

Both synthetic benchmarks and real read/write tests give the same numbers.
All drives were tested individually under both Windows and Ubuntu (see the sketch after the figures below) ... getting:

2GB/s - Write - Micron 7300 Pro NVMe x4
3GB/s - Read - Micron 7300 Pro NVMe x4

3GB/s - Write - Micron 9300 Pro NVMe x4
3GB/s - Read - Micron 9300 Pro NVMe x4
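
A raw-device sequential test along these lines is one way to reproduce numbers like that under Ubuntu (a sketch only, not necessarily the exact command used; /dev/nvme0n1 is a placeholder for whichever drive is under test, and --readonly keeps fio from writing to it):
Code:
fio --name=rawread --filename=/dev/nvme0n1 --readonly --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --time_based --runtime=30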


A RAIDz1 of either 3 or 4 of these SSDs, in both TrueNAS CORE (TNC) and TrueNAS SCALE (TNS), performs worse than a single drive.

I get approximately the same performance from 1x 7300 Pro under TNC or TNS as I get from either 3 or 4 of them in TNC or TNS.
I get approximately the same performance from 8x 9300 Pro as I get from 4x 7300 Pro, again in both TNC and TNS.
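
(For clarity, the layouts in question are plain RAIDz1 vdevs; a minimal sketch of the 4-wide case, with placeholder pool and device names:)
Code:
zpool create fastpool raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1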

I had originally thought this was a TrueNAS / ZFS issue ... but then I tested a software RAID of 3x 7300 Pro under Ubuntu...
and got approximately the same performance as under TNC / TNS, just ~100MB/s faster on reads and writes (likely the absence of ZFS's checksumming overhead).
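
(A comparison like that can be set up with mdadm, the standard Linux software-RAID tool; a minimal sketch with placeholder device names, not necessarily the exact layout used:)
Code:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt/mdtest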

When I tested a RAIDz1 of 4 NVMe drives, ZFS I/O in the performance reporting showed ~125MB/s per NVMe ... awesome, eh?
When I tested a RAIDz2 of 8 NVMe drives, it showed ~87MB/s per NVMe ... awesome, eh?
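
(Those per-drive figures can also be watched live from the shell; a sketch, with a placeholder pool name:)
Code:
zpool iostat -v fastpool 5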

Of course, these drives are connected directly to the R7415 motherboard, which the manual says has 128 PCIe 3.0 lanes.
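
(One sanity check worth doing at this point, to rule out link-training problems: confirm that each NVMe controller actually negotiated a PCIe 3.0 x4 link. A sketch; the grep just pulls the controller lines and their link status:)
Code:
lspci -vv | grep -E 'Non-Volatile memory controller|LnkSta:'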

In fact, I added an HBA330 and 4x EVO 870 (each ~500MB/s R/W) and, while still rather unimpressive, got only
500MB/s - Write
600MB/s - Read
As in, almost(!) as good as the pool of drives that are individually at least 4x as fast.


Hoping Ericloewe might see this and make some suggestions ...

Yeah, of course, I can test this about 100 more ways to further confirm there's really a problem.
What no one has yet offered are candidate solutions (ideally free ones first).

In another thread on STH, despite my having said that CPU utilization (as if anyone really thought an EPYC CPU was the problem??) reached all of 5%, and only for about 0.5s of the tests I ran, the default, thoughtless suggestion was of course: "Oh, it's the CPU." Even after I'd benchmarked the array locally with fio under Ubuntu, meaning SMB wasn't even in the picture, they STILL suggested I buy, if not a CPU, another computer, because hey, maybe I just need a 3rd-gen EPYC to get more than the 500MB/s my SPINNING array gets.

I know this isn't just a TrueNAS issue, if it's a TrueNAS issue at all. It seems like it might be a backplane issue..? But what..? Dell sold this as an NVMe-ONLY configuration (until I later added the HBA330 to even get SAS or SATA access) ... yet addressing more than 1 drive at a time limits them all to less than 1GB/s in aggregate??? Where are the thousands of complaints, then? Because this would be rather intolerable to any customer who purchased this configuration and installed more than 1 NVMe drive. Presumably you couldn't even copy from one NVMe drive to another without reducing their performance to at best 1/4th (or ~700MB/s).

Granted, the crippled NVMe speeds are still somewhat faster than my spinning-rust array, because at least they're spectacularly consistent. But my spinning array always outperforms the individual drives it's built from. In fact, I have literally seen it hit 1200MB/s ... and HGST drives aren't particularly fast compared to most drives, call it ~200MB/s max per drive. That means the RAIDz2 array occasionally exceeds (N - P) x per-drive speed.

In contrast ... if this array achieved even 50% scaling, it'd reach 4GB/s for the 4-drive config (obviously not with IOPS-limited data; my tests use large media files of 1GB+), or 12GB/s with the 8x 9300 Pro (3GB/s each). I know something will cap that eventually (although my older, consumer i7-8700K with 4x PM983 managed to get over 10GB/s ... so, is it really that crazy!??).

I just hope I can get this machine's array performance to equal or exceed that of a single drive it's built from.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Performance may not mean only what you think it means.

Testing methods are important. What exactly are you using (hence what are you testing in reality)?

It seems your definition of performance is throughput (MB/s)... performance is also tied up in IOPS and workload (sync or async writing).

If you're testing remotely and writing to/from a share, you're also testing your network and the performance of SMB (maybe single-threaded).

What recordsize have you set for the dataset you're testing with? And does your testing workload suit that?

Have you used fio from the TrueNAS console/SSH? I would suggest starting there if you really want to test storage performance.
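
For example, something along these lines run directly against a dataset on the pool (a sketch only; adjust the directory, block size, numjobs and iodepth to match your real workload; posixaio is used here because it's available on both CORE and SCALE):
Code:
fio --name=seqwrite --directory=/mnt/pool/dataset --rw=write --bs=1M --size=10G --numjobs=4 --iodepth=16 --ioengine=posixaio --end_fsync=1 --group_reporting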
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
The same testing I did with my spinning array. While I sincerely appreciate your time & attention, I'm giving up.

I'm going to explore some different approaches: ZFS on Linux, or even macOS (BSD)? I'll keep it synced to my spinning array (to limit risk) with incremental + differential backups (that array is RAIDz2, for some fault tolerance), and I'll pick up the largest LTO drive I can afford/justify. As for the high-speed (NVMe) array? I'm just not going to accept performance slower than the drives it's built from; otherwise I'll resort to a concatenated layout or something.
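
(The sync itself is the easy part; incremental replication to the spinning pool looks something like this, a sketch with placeholder pool, dataset and snapshot names:)
Code:
zfs snapshot nvmepool/data@now
zfs send -i nvmepool/data@prev nvmepool/data@now | zfs recv spinpool/backup/data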


It does frustrate me that, until someone even acknowledges a problem exists, no one's going to look into fixing it.

There's no shortage of threads titled "Slow NVMe performance" in the TN category...
in which we've never seen, and don't expect to ever see, a thread's OP say:
"Thanks, I now get performance that scales with the number of NVMe drives in the array."

Despite this pattern, rather than even entertaining TN limitations, people's expectations are BERATED.
Which, to me, ultimately concedes the point I've grown to accept:
TN just doesn't [yet?] extract performance from NVMe arrays analogous to what it gets from spinning arrays.

Honestly, I expect someone will eventually post a hostile or condescending response (I'm not saying you).

With spinning arrays, you get performance that's roughly commensurate with the HDs the array is comprised of:
[(per-HD throughput) x (N - P)]

Yet, with SSD ... I got very similar performance from
- 8x ~500MB/s SSD (SATA EVO 870)
-as I got from-
- 8x ~3GB/s SSD (Micron 9300 Pro)

Usually, we expect performance to be extracted up to the threshold of some hardware limitation (except! for TN+NVMe).
And there's a cavalcade of self-appointed guardians of the status quo committed to pretending no problem exists.

Yet, I get either unfalsifiable, unfounded, or poorly supported claims:
- It's your CPU
- You must be benchmarking it wrong
- Your unreasonable expectations are the problem. (my personal fav)

I wouldn't be so surprised if the people here weren't so exceptionally knowledgeable & intelligent.
I guess this is just some weird dogma (shibboleth) ... one that I hope we eventually blaspheme against.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Before you get too upset, let's talk about this a bit more objectively. In both FreeBSD and Linux, a lot of work has been going on to optimize performance for NVMe arrays. The problem is that when your CPU and RAM can't keep up with the compression overhead or the memory requirements, performance backs off to keep the system stable. I really did try to help you in your previous post, and I don't appreciate the characterization that we didn't try to help.

Now let's talk about some general advice on ZFS with NVMe (you can do all of these things in TrueNAS).

Try disabling compression:
Code:
zfs set compression=off pool/dataset


Try only caching metadata:
Code:
zfs set primarycache=metadata pool/dataset


Try increasing the record size of your dataset:
Code:
zfs set recordsize=1M pool/dataset


You can try disabling prefetch on your system:
Code:
vfs.zfs.prefetch_disable="1"
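
(That spelling is the FreeBSD/CORE sysctl tunable; on SCALE the equivalent OpenZFS module parameter can be flipped at runtime, e.g.:)
Code:
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable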


The point here is that you should see a trend. All of the above are designed to reduce system bottlenecks: either by not trying to cache data in RAM for a pool that's nearly as fast as RAM, or by not compressing data, which hits single-threaded performance hard for single-client streams.

If you want to color outside the lines a bit, switch to SCALE and try polling:
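One way to experiment with that on SCALE/Linux, strictly as a sketch (the queue count is a placeholder, the module option needs a reboot to apply, and only polling-aware I/O paths benefit):
Code:
# give the nvme driver dedicated poll queues (takes effect after reboot)
echo "options nvme poll_queues=4" > /etc/modprobe.d/nvme-poll.conf
# then enable polled I/O per device
echo 1 > /sys/block/nvme0n1/queue/io_poll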
 