Optane P5800X

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Hello,
I am looking for some help optimizing a new single-server TrueNAS setup. The hardware currently available includes the following:
1. Intel P5800X 800GB Optane drive
2. Samsung PM1735 3.2TB PCIe 4.0 NVMe
3. 2 x Seagate IronWolf Pro 10TB
4. Server with 2 x AMD EPYC 7742 (Rome) 64-core CPUs and 512GB DDR4-3200 RAM
5. Mellanox ConnectX-5 CDAT dual-port 100GbE PCIe 4.0 NIC

The goal is to optimize the speed of the storage. We are not sure how best to utilize the Optane drive, or the Samsung PM1735, and we need to make sure that the two Seagate HDDs don't slow everything down. Please note that I only have one Optane drive (the 800GB version) and, for now, only one Samsung PM1735 3.2TB SSD.

All of the relevant computers are connected to the same 100G Arista switch, so I should be able to leverage jumbo frames. The server will also have a couple of second-generation PCIe 4.0 M.2 drives; they are not enterprise M.2 SSDs, but they are fairly fast at roughly 7 GB/s read/write. I may have the option of getting a second Samsung PM1735 3.2TB drive so that tier can at least be mirrored, if that would be helpful. If I go that route I can eliminate the Seagate HDDs altogether.

This project is mainly for benchmarking purposes, which is why resiliency is not the top priority. I would like to configure the storage to keep pace with the network. Thanks for any suggestions or pointers on where to best research the details.
jp
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
The goal is to optimize the speed of the storage.
What is the target workload? With only two spinning disks, your sustained read/write rates are going to be very low in comparison to the EPYC + 100GbE hardware.

Intel P5800X 800GB Optane drive
Excellent SLOG candidate, has a good chance of being the fastest one outside of NVDIMMs.

Samsung PM1735 3.2TB PCIe 4.0 NVMe
Mirrored vdev for a dedicated high performance pool.
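If it helps to visualize both of those, here is a rough sketch from the CLI (device names are placeholders, the second PM1735 is assumed, and the TrueNAS web UI is the usual way to do this):

  # Attach the P5800X as a SLOG (log vdev) to an existing pool named "tank".
  zpool add tank log nvd0
  # Create a separate high-performance pool from two PM1735s as a mirror.
  zpool create fastpool mirror nvd1 nvd2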
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
What is the target workload? With only two spinning disks, your sustained read/write rates are going to be very low in comparison to the EPYC + 100GbE hardware.


Excellent SLOG candidate, has a good chance of being the fastest one outside of NVDIMMs.


Mirrored vdev for a dedicated high performance pool.
Thanks for the quick response. Some replies to try and fill in the blanks.

The workload is synthetic, generated by proprietary benchmarking software. If the spinning disks are a bottleneck, I can remove them from the equation. If the concern is that there are just two, I can add more as needed (up to 12 if that helps).

With regard to using the P5800X as a SLOG, it seems it would only be utilized in the event of a power failure, and that would probably not happen during our benchmarks. Is that the only time the P5800X would be used if configured only as a SLOG? Also, since I only have one, is that an issue? I was thinking that I might be able to take better advantage of this drive by giving it multiple functions, e.g. L2ARC plus SLOG plus something else. I understand that it would not be mirrored, but it still might be a better way to take advantage of this drive's performance.

Your last suggestion would require that I purchase a second pm1735 drive, correct?

Thanks again for your help and insight,
Jp
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
The workload is synthetic, generated by proprietary benchmarking software. If the spinning disks are a bottleneck, I can remove them from the equation. If the concern is that there are just two, I can add more as needed (up to 12 if that helps).
As a general rule, the sustained throughput to the pool (non-cached reads, longer bursts of writes) scales with the number of vdevs in the pool. If you're seeking maximum performance, that is generally found with large numbers of disks configured as mirrored pairs - so your 12 drives would become six two-way mirrors.
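As a rough sketch of that layout (disk names below are placeholders, and in practice you would build this through the web UI):

  # Twelve disks as six striped two-way mirrors in a single pool.
  zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9 \
    mirror da10 da11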

For the workload itself - can you provide greater detail regarding the size of the files, the granularity of access, and the pattern of CRUD ("Create, Read, Update, Delete") operations that will happen? Example statements:

"Our software works with files for diagnostic imaging. They are usually saved as large sets of hundreds of files, approximately 2-3MB each. These files are read on a set-by-set basis, so a given client will open a few hundred files at a time, and load them into their local application. Once files are written, they are rarely if ever changed, and deleted after seven years."

"We need to host a database on a shared iSCSI disk. This database is largely reads, but will periodically have a burst of writes, approximately every 60 seconds. The database software works with 8K records. The data itself is highly compressible, and frequently overwritten. It is an online DB and requests cannot be resubmitted."

With regard to using the P5800X as a SLOG, it seems it would only be utilized in the event of a power failure, and that would probably not happen during our benchmarks. Is that the only time the P5800X would be used if configured only as a SLOG? Also, since I only have one, is that an issue?
An SLOG will be utilized anytime that synchronous ("safe") writes are requested - either by the client in a situation where the protocol understands sync writes (e.g. NFS), or when explicitly forced in ZFS (zfs set sync=always poolname/datasetname). This ensures that the data is on stable storage (not just RAM) before the server returns acknowledgement of the write. That is critical for things like the second example statement (online DB, live virtual machines) but less so for situations like a generic file server, where the file exists on a client machine and can be "saved again" in another location in case of an outage. The SLOG will only be read from in case of a power failure or other unexpected shutdown.
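For illustration, that behavior is controlled per dataset (pool and dataset names here are placeholders):

  # Treat every write as synchronous, so it lands on the SLOG before being acknowledged.
  zfs set sync=always tank/benchmarks
  # Default behavior: honor whatever the client or protocol requests.
  zfs set sync=standard tank/benchmarks
  # Check the current setting.
  zfs get sync tank/benchmarks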

I was thinking that I might be able to take better advantage of this drive by giving it multiple functions, e.g. L2ARC plus SLOG plus something else. I understand that it would not be mirrored, but it still might be a better way to take advantage of this drive's performance.
Optane is an exception to the "don't use a single drive for multiple functions" rule because it doesn't suffer performance loss from a mixed read/write workload. You can easily split this device into an L2ARC/SLOG; however, with 512GB of RAM, your active set of "hot data" would need to exceed 512GB before the L2ARC would be used. The DC P5800X is probably the fastest drive out there right now.
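If you do split it, a hedged sketch using TrueNAS CORE (FreeBSD) tooling, with a placeholder device name and example partition sizes:

  # Carve the P5800X into a small SLOG partition and use the rest as L2ARC.
  gpart create -s gpt nvd0
  gpart add -t freebsd-zfs -s 32G -l optane-slog nvd0
  gpart add -t freebsd-zfs -l optane-l2arc nvd0
  # Attach the partitions to an existing pool as log and cache devices.
  zpool add tank log gpt/optane-slog
  zpool add tank cache gpt/optane-l2arc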

Your last suggestion would require that I purchase a second pm1735 drive, correct?
Correct, but whether a second high-performance pool is necessary depends on the workload. By itself the single PM1735 could be a fine L2ARC device, but again you'll need more than 512GB of actively hot data before it starts getting valuable use.

I know I've said "depends on workload" a lot - but it really does depend on it.
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Thanks again for all of your help. I realize that I haven't been very clear about the workload/use case, and without knowing those two variables it is hard to recommend a specific config. I will try and clarify/simplify at least an initial use case/workload.
1. Large (>3GB) .iso files stored on a TrueNAS share using the SMB protocol. I will be retrieving these large files from other Windows clients.
2. Both the TrueNAS server and the Windows client will be on the same 100Gbps network. Both machines will have dual-port 100GbE NICs using PCIe 4.0.
3. I would also like to try to set up a LAGG on both.
4. I would also try to use jumbo frames (MTU 9000).
My game plan at this point (subject to change several times a day!) is to test the Optane drive by itself as a standalone vdev. I won't really need a SLOG at this phase, since the SMB traffic will be async, and it seems I won't need an L2ARC either: reads will come from the 3200MHz RAM (ARC) and go to the Optane only when the data is not already in the ARC. I can also benchmark the same setup with the Samsung PM1735. Obviously there is no redundancy in this config, but as mentioned, this is initially just for testing purposes.

I was told there is a way to use the HDDs for cold-type storage by either using rsync or snapshots with replication. I can dig into that later, but for now I am trying to push the limits of the storage so it keeps up with the available bandwidth of the dual 100Gbps network. I was also told to pay attention to the disks' block size, which can impact performance in a similar way that jumbo frames do for network performance. The next step would be to add drives in striped mode to theoretically double the raw performance.
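Roughly what I have in mind for that first test, as a sketch (placeholder names; I would actually build it through the web UI):

  # Pool with the single Optane drive as its only vdev (no redundancy, benchmarking only).
  zpool create optpool nvd0
  # Dataset tuned for large sequential files, to be shared out over SMB.
  zfs create -o recordsize=1M optpool/iso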
Thanks again for your help on this new pet project!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Trying to sustain 100Gbps writes (10GB/s) will require massive levels of performance - I wouldn't expect hard drives to come anywhere near sustaining this in practice. Realistically you may hit another bottleneck first, possibly at a CPU or driver level. I would start by testing with iperf to see what your raw network throughput numbers look like. Reads from ARC (RAM) should get as close as possible to the network speed.
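Something along these lines with iperf3, for example (IP address is a placeholder):

  # On the TrueNAS server:
  iperf3 -s
  # On the Windows client, with several parallel streams to help saturate 100GbE:
  iperf3 -c 192.168.1.10 -P 8 -t 30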

Once you start missing cache and going to disk, the workflow of "large files on an SMB share" is normally something that would be fine with a RAIDZ2 configuration of vdevs, but the desire for "really fast speeds" here might suggest mirrors. Certainly a comparison test is in order. The effect of a 100% cache miss rate (all reads come from disk) can be simulated by setting primarycache=metadata on the dataset; you could set it to none, but that's probably artificially bad, since it means you don't get any of the metadata in RAM either.
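For example (dataset name is a placeholder):

  # Keep only metadata in ARC so file data reads have to hit the disks.
  zfs set primarycache=metadata tank/bench
  # Restore the default once the test is done.
  zfs set primarycache=all tank/bench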

LAGG is certainly suggested but more from a perspective of redundancy - single-client performance will still be limited to the link speed of a single path.

For the block/record sizes, likely what was suggested is to set the recordsize to 1M instead of the default 128K.
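That is a per-dataset property, e.g. (dataset name is a placeholder):

  # 1M records suit large sequential files; only newly written data is affected.
  zfs set recordsize=1M tank/isos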

I was told that there is a way to utilize the hdd for cold type storage by either using rsync or snapshots with replication.
Both of those methods will work - this would be a good way to have the data read from the Optane or NVMe SSDs while still keeping a copy on a second (protected) pool of HDDs for redundancy.
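A rough sketch of the snapshot/replication route (pool and dataset names are placeholders; TrueNAS can also schedule this with periodic snapshot and replication tasks):

  # Snapshot the fast NVMe dataset and replicate it to the HDD pool.
  zfs snapshot fastpool/data@backup1
  zfs send fastpool/data@backup1 | zfs recv -u coldpool/data
  # Subsequent snapshots can be sent incrementally.
  zfs snapshot fastpool/data@backup2
  zfs send -i @backup1 fastpool/data@backup2 | zfs recv -u coldpool/data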
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
[..] I will try and clarify/simplify at least an initial use case/workload.
1. Large (>3GB) .iso files stored on a TrueNAS share using the SMB protocol. I will be retrieving these large files from other Windows clients.
Frankly, I am still not clear on the use-case. Are you really saying that you want to transfer large files for the sake of it? Also, you are implying, at least to me, that there are other workloads (potentially with very different requirements) that are either undefined right now or that you don't want to mention.
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Frankly, I am still not clear on the use-case. Are you really saying that you want to transfer large files for the sake of it? Also, you are implying, at least to me, that there are other workloads (potentially with very different requirements) that are either undefined right now or that you don't want to mention.
I seem to be confusing folks, or at least not being very clear. The reason that I mentioned an initial use case is that I have some proprietary testing software that benchmarks storage with a variety of workloads. I understand that optimizing any configuration can be very dependent on the specific use case. To make things clearer, I specified SMB with large file transfers. Let me try and ask some specific questions that may help.

1. Have you used any PCIe 4.0 SSDs with TrueNAS? If so, please give details.
2. Which motherboard and CPU did you use?
3. Which NIC card/cards did you use?
4. How did you set up the storage and shares?

If you have experience with any of the above, it would be very helpful to learn from it. Even if you haven't, your feedback has already been very helpful and has helped me begin to understand this software much better than a few weeks ago :smile:
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Trying to sustain 100Gbps writes (10GB/s) will require massive levels of performance - I wouldn't expect hard drives to come anywhere near sustaining this in practice. Realistically you may hit another bottleneck first, possibly at a CPU or driver level. I would start by testing with iperf to see what your raw network throughput numbers look like. Reads from ARC (RAM) should get as close as possible to the network speed.

Once you start missing cache and going to disk, the workflow of "large files on an SMB share" is normally something that would be fine with a RAIDZ2 configuration of vdevs, but the desire for "really fast speeds" here might suggest mirrors. Certainly a comparison test is in order. The effect of a 100% cache miss rate (all reads come from disk) can be simulated by setting primarycache=metadata on the dataset; you could set it to none, but that's probably artificially bad, since it means you don't get any of the metadata in RAM either.

LAGG is certainly suggested but more from a perspective of redundancy - single-client performance will still be limited to the link speed of a single path.

For the block/record sizes, likely what was suggested is to set the recordsize to 1M instead of the default 128K.


Both of those methods will work - this would be a good way to have the data read from the Optane or NVMe SSDs while still keeping a copy on a second (protected) pool of HDDs for redundancy.
It seems like, for pure benchmarking purposes, a striped set of multiple (at least 4 of the fastest, 7 GB/s read/write) drives/vdevs in the same pool and share should be able to avoid being the bottleneck for 100Gbps from the networking perspective. I have verified that the network will not be the bottleneck with some testing software I have access to; the software also allows ramping up the storage I/O from either a single-user or multi-user/parallel-process perspective. A real-world example of what I am trying to sort out is the transfer (SMB) of large files back and forth between the TrueNAS server and one or more Windows clients.

I am not sure what the configuration options are at this point, regardless of any further details of this use case. Would you propose something other than multiple striped fast NVMe SSDs? I already mentioned that resiliency is not a concern for this use case, and I understand that resiliency is one of the main tenets of TrueNAS and ZFS in general, so it may be something most folks are hesitant to recommend.

I went into this project thinking that it might be good to use a tiered strategy with the Optane as the tier just after the RAM, then a traditional fast NVMe SSD (PM1735), then HDD. I was then told that TrueNAS doesn't do tiering in the proper sense, just caching. I was also told that L2ARC and SLOG would not be beneficial given my (late to the party) definition of the use case (SMB with large files to/from a share), because of the amount of RAM I have, the rather small Optane drive, and more importantly the async nature of SMB. Finally, I was told that with just a single Optane drive, the storage I/O would be the bottleneck on the 100Gbps network. That is how I arrived at the suggestion of aggregating multiple fast (7/7 GB/s R/W) NVMe SSDs in a single pool to see if that does the trick!

Please let me know if you think, or know, there may be a better way to achieve this type of storage I/O throughput. Keep in mind that this is all being tested on PCIe 4.0 end to end. My initial thought is that TrueNAS may still have some quirks with this PCIe 4.0 configuration, but then again, I am speaking from one week of TrueNAS experience :smile: (although with the very nice help of a lot of folks with a bit more!)
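For reference, the striped test pool I have in mind would look roughly like this (device names are placeholders and there is no redundancy at all, so this is purely for benchmarking):

  # Four NVMe drives striped into one pool; losing any drive loses the pool.
  zpool create benchpool nvd0 nvd1 nvd2 nvd3
  zfs create -o recordsize=1M benchpool/smbtest
  # Quick local sequential-read sanity check before testing over SMB.
  fio --name=seqread --directory=/mnt/benchpool/smbtest --rw=read \
      --bs=1M --size=20G --numjobs=4 --ioengine=posixaio --group_reporting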
 

jpmomo

Dabbler
Joined
Apr 13, 2021
Messages
12
Frankly, I am still not clear on the use-case. Are you really saying that you want to transfer large files for the sake of it? Also, you are implying, at least to me, that there are other workloads (potentially with very different requirements) that are either undefined right now or that you don't want to mention.
Hello Chris, do you know if there are any issues with TrueNAS on PCIe 4.0 systems? I have been searching for a while and haven't been able to find any reference regarding TrueNAS and PCIe 4.0. Thanks for your help.
 