10 Gigabit - Not as fast as expected?

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
I just got a pair of ConnectX-2 cards and fiber installed in my PC and homelab. It is working, and once something is in ARC I am seeing full (~9.3 Gbps) read speeds, but if something is not in ARC I am only seeing about 2.6-2.8 Gbps.

Is this expected with a 10 drive wide Z2 array with 5400rpm drives? I figured each drive can do a pretty consistent 100 MB/s (800 Mbps), so even with overhead I was hoping to be more in the 5 Gbps+ range.

Any ideas on anything I could potentially tune, or is this just expected behavior? My TrueNAS Core VM only has 24 GB of RAM, so it is light on RAM I admit. But just copying a Windows 10 ISO as a test (one large file...) I figured I would see better performance.

Happy to share any info about the setup/array/settings, I am just not sure where to start...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Is this expected with a 10 drive wide Z2 array with 5400rpm drives?

Absolutely. The ConnectX-2 cards are not particularly good, and a RAIDZ array with only a single vdev is not super fast. You can try the 10 Gig tuning article under Resources to see if you can get it better.

My TrueNAS Core VM only has 24 GB of RAM, so it is light on RAM I admit.

That's a good amount of RAM for a 1GbE filer, but a bit light for 10Gbps. When stuff is not already in ARC, the filer has to pull it in from the pool, which involves discovering that it isn't in ARC, finding the metadata for the relevant data, then fetching that data from the pool, and only afterwards can it start transmitting it down the wire to you. 10G performance is reliant on the transmit queue being constantly filled with data, so what really wins here are large amounts of RAM, large TCP/IP buffers (see tuning article), low latency (your HDDs add lots of latency), and an efficient protocol (sometimes CIFS isn't).
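To give a flavor of what the tuning article is driving at, the knobs are FreeBSD sysctls along these lines (set under System -> Tunables, or tested live from a shell). The values below are placeholders for illustration only; defer to the article for the actual recommendations:

# Illustrative only -- placeholder values, not the article's exact numbers
sysctl kern.ipc.maxsockbuf=16777216       # allow larger socket buffers overall
sysctl net.inet.tcp.sendbuf_max=16777216  # let TCP send buffers auto-grow further
sysctl net.inet.tcp.recvbuf_max=16777216  # same for TCP receive buffers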
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Absolutely. The ConnectX-2 cards are not particularly good, and a RAIDZ array with only a single vdev is not super fast. You can try the 10 Gig tuning article under Resources to see if you can get it better.



That's a good amount of RAM for a 1GbE filer, but a bit light for 10Gbps. When stuff is not already in ARC, the filer has to pull it in from the pool, which involves discovering that it isn't in ARC, finding the metadata for the relevant data, then fetching that data from the pool, and only afterwards can it start transmitting it down the wire to you. 10G performance is reliant on the transmit queue being constantly filled with data, so what really wins here are large amounts of RAM, large TCP/IP buffers (see tuning article), low latency (your HDDs add lots of latency), and an efficient protocol (sometimes CIFS isn't).
Thanks for the info! I am just a single home user who really doesn't "need" 10 Gig; I just had the ConnectX cards laying around, so I figured spending 30 bucks on transceivers and fiber just to play with was a pretty cheap way to have an entertaining weekend.

I can pretty affordably add another 64 GB of RAM to this box, which I could fully dedicate to TrueNAS (bringing it from 24 GB up to 88 GB; maybe I'll do that just to free up some resources for other VMs).

I have considered adding a couple of Optane drives for special metadata after seeing Wendell talk about it a good bit; again, not because I need this at all, simply because I enjoy tinkering with it... homelab and all. That would significantly reduce latency since metadata would be on Optane, but I doubt it would make much of a noticeable difference. I am not hitting the array with DB reads or writes; I literally just use it for Plex 90% of the time, and the other 10% I am dumping RAW photos or video to it, which is all write, not read, anyways.

What I have found interesting, though: I seem to get slightly faster write speeds than reads, not by much, maybe 200-300 Mbps faster. Is it dumping writes into RAM? I have only tried to write ~4-5 GB files, so I have not exceeded what it has available to it in RAM.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I have considered adding a couple of Optane drives for special metadata after seeing Wendell talk about it a good bit; again, not because I need this at all, simply because I enjoy tinkering with it... homelab and all.

This is generally not appropriate for home users; special metadata devices are NOT removable from the pool once added. Not sure who the heck "Wendell" is. But beware that this could be bad if you change your mind.

What I have found interesting, though: I seem to get slightly faster write speeds than reads, not by much, maybe 200-300 Mbps faster. Is it dumping writes into RAM? I have only tried to write ~4-5 GB files, so I have not exceeded what it has available to it in RAM.

ZFS transaction groups are gathered up in system RAM, and you might think of them as "the write cache." This can be gigabytes large and it is very common for ZFS to appear to write at ferocious speeds for several gigabytes and then jump off a cliff, which is the point at which the cache is full and ZFS is FIFO'ing the data out to the pool at the speed that the pool is actually capable of.
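If you want to see how big that in-RAM staging area is on your own box, the relevant OpenZFS knobs are visible via sysctl on CORE. Exact names shift a bit between OpenZFS versions, so treat this as a sketch:

sysctl vfs.zfs.dirty_data_max          # max dirty (not-yet-written) data held in RAM, in bytes
sysctl vfs.zfs.dirty_data_max_percent  # cap expressed as a percentage of physical memory
sysctl vfs.zfs.txg.timeout             # seconds before a transaction group is pushed out regardless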
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
This is generally not appropriate for home users; special metadata devices are NOT removable from the pool once added. Not sure who the heck "Wendell" is. But beware that this could be bad if you change your mind.
Wendell at Level1Techs... He obviously isn't saying this is important for home use cases, since his use cases are not home use cases, but he has had some interesting articles about it recently.

ZFS transaction groups are gathered up in system RAM, and you might think of them as "the write cache." This can be gigabytes large and it is very common for ZFS to appear to write at ferocious speeds for several gigabytes and then jump off a cliff, which is the point at which the cache is full and ZFS is FIFO'ing the data out to the pool at the speed that the pool is actually capable of.
Gotcha. Thanks for the info. I just read through the tunables article. I will give some of those a try and see what happens.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
@jgreco I have not looked at tunables since I stood this array up back in 2015... I vaguely remember maybe setting some, but boy, it's been a long time...

I see these are not "generated by autotune". Are any of these hurting anything, or should any be adjusted further?

[screenshot: existing tunables (not set by autotune)]



Autotune has these:

[screenshot: tunables created by autotune]



I will add the ones in your 10 Gig+ guide, but just wanted to confirm these all look fine.

Thanks for the help!
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
You can try the 10 Gig tuning article under Resources to see if you can get it better.
Hmm, I stopped at the net.inet.ip.intr_queue_maxlen: 2048 step, so I did everything before that... and I think my read speeds went down ~50-100 Mbps. Write speeds seemingly went up? Maybe this is a result of having just rebooted and ARC not being filled at all yet? Although I rebooted the machine earlier this morning for unrelated reasons and have not touched any of these files since, so I am not sure how anything would have been populated into ARC on the previous test either.

Any thoughts?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I would jettison most of that and make sure autotune is disabled. At one time, someone thought (sort of correctly) that FreeBSD didn't suitably tune for the ZFS and heavy fileserver environment. I'm actually trying to collect some of the more obscure tunables that have been recommended in the past so that I can do a walkthru to determine value in the modern era. You might notice I have a cryptic comment about per-device tunables; I think a lot of tunables get blindly parroted from forum posts, such as hint.isp stuff, which would be for Qlogic fiber channel controllers.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
I would jettison most of that and make sure autotune is disabled. At one time, someone thought (sort of correctly) that FreeBSD didn't suitably tune for the ZFS and heavy fileserver environment. I'm actually trying to collect some of the more obscure tunables that have been recommended in the past so that I can do a walkthru to determine value in the modern era. You might notice I have a cryptic comment about per-device tunables; I think a lot of tunables get blindly parroted from forum posts, such as hint.isp stuff, which would be for Qlogic fiber channel controllers.
I will give that a try. Should I chuck literally everything? I am entirely unsure what is needed vs not-needed here.

I did just up my TrueNAS VM from 24 to 30 GB of RAM and went from 8 to 14 vCPUs, as I saw it was getting very high on CPU utilization during writes. So... I admit this is no longer going to be straight apples-to-apples testing. But I figure I should throw a little more resources at it since I have them.
 
Joined
Oct 22, 2019
Messages
3,641
I think a lot of tunables get blindly parroted from forum posts, such as hint.isp stuff, which would be for Qlogic fiber channel controllers.
Those are default out-of-the-box for TrueNAS Core (fresh installations).

To this day I still don't know why they are populated by default on new TrueNAS Core systems.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Hmm, well, I disabled all the auto-tunables, set the options in 10 Gigabit tuning, and it didn't seem to make much difference at all; if anything it's slightly slower :/. I suppose the good thing is I mostly just write to the NAS (I don't bother trying to photo edit directly off it; I was hoping maybe this would give me a shot at it, but honestly Adobe Lightroom is not even "fast" when reading from local SSD...). But at least I can write to the box at ~3 gigabit-ish. Too bad I am only seemingly able to read from it at 1-2.5 Gbps depending on the files.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
To this day I still don't know why they are populated by default on new TrueNAS Core systems.

Probably because FC is "sorta" somewhat supported; I can't remember when that happened, but you need the correct role in an FC topology.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hmm, well, I disabled all the auto-tunables, set the options in 10 Gigabit tuning, and it didn't seem to make much difference at all; if anything it's slightly slower :/. I suppose the good thing is I mostly just write to the NAS (I don't bother trying to photo edit directly off it; I was hoping maybe this would give me a shot at it, but honestly Adobe Lightroom is not even "fast" when reading from local SSD...). But at least I can write to the box at ~3 gigabit-ish. Too bad I am only seemingly able to read from it at 1-2.5 Gbps depending on the files.

In order to get the fastest speeds, the NAS has to be able to prefetch into ARC the stuff that you're about to need. However, this requires a type of prefetch prescience that just doesn't exist. Even plain ol' sequential readahead doesn't always perform the way you'd want. It does better with lots of RAM and more vdevs.
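If you're curious whether prefetch is actually helping your workload, you can peek at the ARC prefetch counters. Sysctl names vary slightly across OpenZFS versions, so take this as a sketch rather than gospel:

sysctl vfs.zfs.prefetch_disable                      # 0 = prefetch enabled (the default)
sysctl kstat.zfs.misc.arcstats.prefetch_data_hits    # sequential reads satisfied from prefetched data
sysctl kstat.zfs.misc.arcstats.prefetch_data_misses  # prefetch reads that still had to wait on the pool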
 

Dash_Ripone

Cadet
Joined
Feb 18, 2023
Messages
5
I just got a pair of ConnectX-2 cards and fiber installed in my PC and homelab. It is working, and once something is in ARC I am seeing full (~9.3 Gbps) read speeds, but if something is not in ARC I am only seeing about 2.6-2.8 Gbps.

Is this expected with a 10 drive wide Z2 array with 5400rpm drives? I figured each drive can do a pretty consistent 100 MB/s (800 Mbps), so even with overhead I was hoping to be more in the 5 Gbps+ range.

Any ideas on anything I could potentially tune, or is this just expected behavior? My TrueNAS Core VM only has 24 GB of RAM, so it is light on RAM I admit. But just copying a Windows 10 ISO as a test (one large file...) I figured I would see better performance.

Happy to share any info about the setup/array/settings, I am just not sure where to start...
I got a pretty big performance boost by enabling autotune
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This is generally not appropriate for home users; special metadata devices are NOT removable from the pool once added.
Minor correction here; special VDEVs are removable under the same constraints as regular device removal, with the most common sticking point being "there cannot be any RAIDZ VDEVs in the pool" - so this caution does apply to your situation, @LIGISTX.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
In order to get the fastest speeds, the NAS has to be able to prefetch into ARC the stuff that you're about to need. However, this requires a type of prefetch prescience that just doesn't exist. Even plain ol' sequential readahead doesn't always perform the way you'd want. It does better with lots of RAM and more vdevs.
Do you think upping it to ~70-80 GB of RAM would make any appreciable difference for my mostly basic use case?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Do you think upping it to ~70-80 GB of RAM would make any appreciable difference for my mostly basic use case?

The general problem is that for it to make a difference, the workload has to be such that the ARC would be primed with the data in a timely fashion, such as by having recently written the data to the pool, or readahead. Using large record sizes such as 1M helps if the system has to retrieve the data off the pool, because it has to read the entirety of the record up front, and then has the remainder of the record in ARC for rapid access. My basic gut says that adding RAM will be helpful but also disappointing.
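If you want to try the larger recordsize on your media dataset, it is a one-liner; note it only affects files written after the change, and "tank/media" below is just a placeholder for your actual dataset:

zfs get recordsize tank/media     # check the current value (the default is 128K)
zfs set recordsize=1M tank/media  # new writes use up to 1M records; existing files are untouched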

For some additional "light reading" of the deeper dive variety, you can find out more about prefetch:


(which may be too old to be relevant)


 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
The general problem is that for it to make a difference, the workload has to be such that the ARC would be primed with the data in a timely fashion, such as by having recently written the data to the pool, or readahead. Using large record sizes such as 1M helps if the system has to retrieve the data off the pool, because it has to read the entirety of the record up front, and then has the remainder of the record in ARC for rapid access. My basic gut says that adding RAM will be helpful but also disappointing.

For some additional "light reading" of the deeper dive variety, you can find out more about prefetch:


(which may be too old to be relevant)


Trying to fiddle around and get smarter with all of this, I came across this on another forum. I was thinking of something along the same lines and it looks like it exists…

I have multiple datasets, the largest being my Plex collection, which has zero benefit from being cached. Things are written to that dataset, rarely ever read, and when they are, they are read at whatever bitrate the file is (typically 4K, but still, that's easy for a hard drive to spit out). So I figured I should try to optimize my ARC for the datasets I DO actually interact with on my PC.

From serverfault:

That said, you can specify how ARC/L2ARC should be used on a dataset-by-dataset basis by using the primarycache and secondarycache options:

  • zfs set primarycache=none <dataset1> ; zfs set secondarycache=none <dataset1> will disable any ARC/L2ARC caching for the dataset. You can also issue zfs set logbias=throughput <dataset1> to privilege throughput rather than latency during write operations;
  • zfs set primarycache=metadata <dataset2> will enable metadata-only caching for the second dataset. Please note that L2ARC is fed by the ARC; this means that if ARC is caching metadata only, the same will be true for L2ARC;
  • leave ARC/L2ARC default option for the third dataset.
Finally, you can set your ZFS instance to use more than (the default of) 50% of your RAM for ARC (look for zfs_arc_max in the module man page)


First question here… any idea what sort of impact this may have? Theoretically, I'd reduce the size of what I care about in ARC down from ~14 TB to 4.5 TB if I removed the Plex library. And is the default ARC only 50% on BSD-based TrueNAS CORE? If so, I'd happily allocate more to ARC since I run nothing inside of TrueNAS - it is purely just a file handler. I run all services externally in other VMs, all under Proxmox. I used to run this system on 16 GB of RAM; now that it has 30… I can't see why dedicating ~22 GB specifically to ARC (or more?) would be an issue. Any idea what the "minimum" would be to leave outside of ARC? I don't run iSCSI, I don't have dedupe, and I have the basic LZ4 compression. I have 1 NFS mount and some SMB shares… pretty basic setup.

*Edit: doing some more reading, I saw this note on "zfs set primarycache": When these properties are set on existing file systems, only new I/O is cached based on the values of these properties.

Since I am trying to remove a dataset from ARC,
1: is that persistent across reboots?
2: Do I need to somehow "force" the old ARC data from that dataset out? Or would a simple reboot of the system force it to repopulate ARC based on usage, and it won't attempt to fill the ARC up with any data from the dataset I excluded?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That said, you can specify how ARC/L2ARC should be used on a dataset-by-dataset basis by using the primarycache and secondarycache options:

Yes, but it generally does a pretty good job if you just leave it alone too. There's a lot of people who want to try to force the system to behave in certain ways thinking that they know better, but this usually isn't as true as they'd like to think. You can end up with some weird behaviours if you forcibly disable cache and you aren't actually correct.

Finally, you can set your ZFS instance to use more than (the default of) 50% of your RAM for ARC (look for zfs_arc_max in the module man page)

This is bullchips; it is only 50% on the crappy Linux implementation. Solaris targets a size slightly less than physmem and FreeBSD will use free memory for ARC and release it if userland demand exists, which is really the best implementation in my opinion. Having to reserve memory for ARC really sucks; a system should be able to dynamically adjust to allocation demands on the fly. ARC is a great candidate for that.
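You can watch FreeBSD doing this on a CORE box by looking at the ARC kstats; a quick sketch (names may differ slightly by OpenZFS version):

sysctl kstat.zfs.misc.arcstats.size   # current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.c      # the ARC's current target size
sysctl kstat.zfs.misc.arcstats.c_max  # the ceiling the ARC is allowed to grow to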

Since I am trying to remove a dataset from ARC,
1: is that persistent across reboots?

It isn't clear what you're asking. ARC is naturally cleared by a reboot; anything that was in ARC goes away.

2: Do I need to somehow "force" the old ARC data from that dataset out? Or would a simple reboot of the system force it to repopulate ARC based on usage, and it won't attempt to fill the ARC up with any data from the dataset I excluded?

You can't actually force ARC not to cache anything for a dataset. You can really only suggest the likelihood of value in the cached data, effectively hinting that what is cached is low value. Some stuff like metadata will be cached regardless.
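If you do set the hint anyway, it is easy to confirm what a dataset is currently advertising; "tank/media" here is hypothetical, substitute your own dataset:

zfs get primarycache,secondarycache tank/media  # show the current cache hints for the dataset
zfs set primarycache=metadata tank/media        # hint: keep metadata in ARC, skip caching file data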
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Yes, but it generally does a pretty good job if you just leave it alone too. There's a lot of people who want to try to force the system to behave in certain ways thinking that they know better, but this usually isn't as true as they'd like to think. You can end up with some weird behaviours if you forcibly disable cache and you aren't actually correct.
Understood. I went ahead and gave it a shot anyways, leaving it to cache just metadata for my /media directory. This stuff is likely "accessed" the most, as I do watch movies and TV shows more than anything else, but I have zero need for it to be in ARC. We shall see if it makes any appreciable difference.

This is bullchips; it is only 50% on the crappy Linux implementation. Solaris targets a size slightly less than physmem and FreeBSD will use free memory for ARC and release it if userland demand exists, which is really the best implementation in my opinion. Having to reserve memory for ARC really sucks; a system should be able to dynamically adjust to allocation demands on the fly. ARC is a great candidate for that.

Yes, I came across this shortly after as well. I ran… was it zfs_summary or arc_summary (wow, I already forget), and noticed what it had as targets, minimums, and the high-water max. I do like this implementation as well, and I wonder why the Linux version is different?

Also, thanks again for all the help!
 