NAS upgrade plan

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
Hello guys, in our small company we have a NAS for storing huge multimedia files, 44TB total usable space. The file types are huge raw professional photos (from 100MB to 1GB for a single picture) and 8K video recordings (tens to hundreds of GB), plus some 3D editing files. The recent working data set is about 200 to 500GB. The NAS is connected with 4x 1Gbit/s ports to a managed switch (with LACP), and 6-10 clients use it, primarily to read data: they work on their workstations and save back to the NAS, so there are concurrent reads/writes of very big files.

Current hardware configuration:
Xeon E3 Skylake 4c/8t, 32GB ECC DDR4, Asus P10S-WS motherboard, 8x IronWolf 8TB 7200rpm in RAIDZ2, a single 128GB NVMe SSD for L2ARC, 4x 1Gbps ports.
128KB recordsize (thinking of increasing it a lot)
l2arc_noprefetch=0 and an increased l2arc_write_max
TrueNAS Core 12

Now I'm planning a hardware upgrade because there is not enough space (84% of the pool is used) and not enough speed when 4-5 clients read and write concurrently, and to increase the client-server connection speed from the current 1Gbps to 2.5Gbps.

Planned first step:
64GB RAM, another set of 8x 8TB IronWolf 7200rpm to add a second RAIDZ2 vdev to the same pool, 3x 256GB NVMe SSDs for L2ARC, a dual-port 10GbE SFP+ card, 2x LSI 8-port SATA HBAs, and a QNAP managed switch with 16x 2.5GbE and 2x 10GbE SFP+.
Increasing recordsize to 1M for the 3D dataset and 2M for videos and photos (see the sketch below).
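For reference, this is roughly how I'd apply the per-dataset recordsize changes from the shell (dataset names here are just placeholders, not our real layout). As far as I know it only affects newly written blocks, and values above 1M may need the max recordsize tunable raised first:

Code:
# per-dataset recordsize (names are examples)
zfs set recordsize=1M tank/3d
zfs set recordsize=2M tank/video     # >1M may require raising vfs.zfs.max_recordsize first
zfs set recordsize=2M tank/photo
zfs get recordsize tank/3d tank/video tank/photo   # verify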

I'd like to know if this is a good start for improving things. Is a 12:1 L2ARC:RAM ratio too much, or can I push it even higher considering the huge recordsize?
Obviously this is a last desperate upgrade for this motherboard; the next step will be swapping the board for a Xeon E5/Scalable platform with a lot of RAM.

Thank you in advance for your help!
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
You are putting a lot of eggs in one basket, i.e. no redundancy for mission-critical systems.
I see nothing regarding backups; that is a lot of data to have at risk.
2.5Gb is a silly upgrade, go 10Gb.
L2ARC:ARC should aim for about a 5:1 ratio. Get more RAM first (lots more).

Suggestion:
Get new hardware, preferably with a 4-24 hour support response time, and/or use redundant hardware systems.

Alternate Suggestion:

Buy two of something like this (assuming you are in the US): server.

Buy two of something like this: switch.

Server
That gives you two identical platforms, so that if one goes down you can limp along on the other until you are back up (honestly I would suggest three).
Increase the RAM to 128GB in each production system, set the ARC metadata minimum to an appropriate size, and split your pools/datasets between the systems. Think about using a third system as live replication/backup for critical data.

Switch
That gives you two identical switches; make them fully independent paths for storage, with backup gigabit networking as well.
The 10Gb ports are all active with newer firmware (see the ServeTheHome forum thread).
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
Thank you for the reply. I'm OK with the current redundancy: we have another NAS for the important source files, which are also (at least the most important ones) backed up to archive spare drives. We also have spare parts for everything (not necessarily with the same performance, but enough to get at the data within an hour or two in case of disaster).

I don't want to go directly to 10GbE client-server, because I think it's a higher price just to discover there are many other bottlenecks: 2x 10GbE on the NAS against a combined 6-8x 10GbE on the clients, local RAID storage on the workstations that sometimes can't match 10GbE speed, and a lot of reads/writes going to single external spare drives limited to 100-150MB/s. I prefer a middle-ground but reliable speed that can be sustained in all conditions. An upgrade to 10GbE can still be made in the future.
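Rough numbers I have in mind (payload estimates, not measurements):

Code:
# approximate usable throughput per client link
1GbE    ~ 110-118 MB/s
2.5GbE  ~ 280-295 MB/s
10GbE   ~ 1100-1180 MB/s
# vs. single external spare drives at ~100-150 MB/s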

For the new NAS performance I definitely agree that the Xeon E3 is not enough :( Even the PCIe lanes can't support all that hardware. I think I will go with a Supermicro C612 motherboard, a Xeon E5-1660 v4 (8c/16t at a good frequency) and 128/192GB of RDIMMs. On the L2ARC:RAM ratio, there are many opinions online that 5:1 is fine for a 128K recordsize, but that with 1-2M records you can go even higher; I don't know how much higher, and I don't know if it's true.
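My rough reasoning on why recordsize should matter for the ratio (the per-record header size is my assumption, roughly 70-90 bytes each in current OpenZFS, so take the math as an estimate):

Code:
# RAM consumed by L2ARC headers, assuming ~80 bytes per cached record
768 GB L2ARC / 128 KB records = ~6.3M records  -> ~0.5 GB of ARC
768 GB L2ARC /   1 MB records = ~786k records  -> ~63 MB of ARC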
 
Last edited:

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
128KB recordsize (thinking of increasing it a lot)
In my opinion, changing the recordsize won't make a ton of difference in your application. With large files, sequential reads/writes dominate the workload, and blocksize doesn't matter much in those cases. However, if you have data with lots of smallish files, then this might be worth tweaking. I'd still recommend testing, but the cost of the couple of hours of labor it would take to properly test this setting would likely buy you a better L2ARC or a few more GB of RAM, which would very likely improve performance even more.

In any event, it's really write loads that see performance improvements from increasing recordsizes. It sounds like your use case is mostly reading, which is why I'm not sure it's worth your time to go down that rabbit hole.

The biggest place you'll see improvements from increasing recordsize is when you are constantly overwriting slightly changed data. A larger blocksize means that your data spans fewer blocks, and therefore there are fewer checksums to compare for changes.
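To put a rough number on that (file size picked just for illustration):

Code:
# block pointers/checksums needed for a 1 GiB file
1 GiB / 128 KiB recordsize = 8192 records
1 GiB /   1 MiB recordsize = 1024 records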
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Hello,
I understand you're facing performance degradation that is prompting an upgrade.
Xeon E3 Skylake 4c/8t, 32GB ECC
and not enough speed when there are 4-5 clients reading and writing concurrently

I'd like to challenge the assumption that the lack of performance is due to needing a higher-horsepower CPU/motherboard. Do you really see CPU usage spike very high?
I wonder what metric you're looking at to determine this is not an adequate CPU/RAM combination for the task?

I believe the lack of performance comes from the pool layout/space utilization, and thus a significant boost in performance will be seen once a 2nd vdev is added.

RAM 64GB, another set of 8x 8TB Ironwolf 7200rpm to add another raidz2 vdev to the same pool, 3x 256GB nvme SSDs for L2arc, 2x 10gbe sfp+ card, 2x LSI 8-port sata controller
See, this is quite close to what I run in my box, except I run no L2ARC at the moment.
I've got 32GB RAM, a dual-port 10GbE MLX-3, and 2x LSI 9211-8i.

The worst stress I've been able to put on the CPU was during 'synthetic copies': large files of random content (from /dev/urandom) sent internally between two pools, to basically stress the system twice.
I've run synthetics and pool-to-pool copies/writes averaging around 500-650 MB/s, and that didn't push my E3-1230 v1 past 50% CPU load.
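If it's useful, the test was basically something like this (pool names are just examples from my box):

Code:
# write a big file of random data, then push it pool-to-pool while watching throughput
dd if=/dev/urandom of=/mnt/pool1/stress.bin bs=1M count=50000    # ~50 GB
cp /mnt/pool1/stress.bin /mnt/pool2/
zpool iostat 5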

For sure, in the longer run NVMe drives will be eating PCIe lanes for breakfast. I'm in the process of upgrading due to PCIe lane/slot starvation in my system and my ambitions for it.

Anyhow, hope you can get something out of this.
I'd try adding drives first, plus a 10GbE card, to see where things are going, and have a good look at the statistics for a few weeks.
If it still doesn't work out, then you could proceed to get the additional hardware and do the full upgrade.
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12

I never said I want to upgrade because of a lack of CPU power. My E3 Xeon is perfectly capable and sits around 0-5% during file transfers.
I want to upgrade primarily because of the lack of disk space, and to improve concurrent reads/writes from 4-6 clients on 2.5GbE (instead of the current 1GbE). I think the limiting factors for this use case are the single RAIDZ2 vdev and the low ARC size compared to the working data set they typically use.
I don't know how it would perform with 64GB, a bigger L2ARC and a 2-vdev pool (probably well), but for future-proofing I think it's better to go with a Xeon E5, primarily for RAM and PCIe lanes (if I want to add a third vdev I'm out of PCIe lanes on the E3).
 
Last edited:

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
I never said I want to upgrade because of a lack of CPU power. My E3 Xeon is perfectly capable and sits around 0-5% during file transfers.
Good to have that confirmed :)

I think the limiting factors for this use case are the single RAIDZ2 vdev
yup

don't know how it could perform with 64GB, a bigger L2arc,
Nowhere near the improvement you'll see with a 2nd vdev and more free space on the pool.
What you can do is explore arc_summary (there are plenty of threads on the topic), where you can observe and validate the need for L2ARC or even more RAM.

The 'rule of thumb' probably varies, but seeing anything close to or above, let's say, a 90% ARC hit rate would indicate you're already really well off.
If you're 'saucing around in the trenches' of, say, 60% generally, then the case is strong for more RAM and L2ARC.
This can also be found in the GUI -> Reports -> ZFS.
Note that these 'thresholds' are really vague.
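From the shell it's basically this (exact tool/field names can differ a bit between versions, so treat it as a pointer rather than gospel):

Code:
arc_summary | grep -A 4 "ARC total accesses"
# or the raw counters behind it
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses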
I think it's better to go with a Xeon E5, primarily for RAM and PCIe lanes (if I want to add a third vdev I'm out of PCIe lanes on the E3)
I agree.
I do see the argument for doing the upgrade, yet at this point I fail to see how you are constrained on PCIe lanes?

Here's my attempt at working out where you're at:
Do you have an additional dual-port NIC for a total of 4x 1GbE ports?
Are you running the NVMe drive in a PCIe adapter instead of the onboard M.2 slot?
I see the HBA occupying one slot.
So I gather you could have 2 slots free at this point?
Some avenues to explore:
There are HBAs that give you more than the 8 ports of the 9211-8i; for example I've used the full-height 9201-16i (just the first one that comes to mind), and newer versions probably have better throughput. Maybe that is something to look into, in case economizing on PCIe slots/lanes is a thing.
BTW - what enclosure do you use?

If you have a NIC occupying a PCIe slot, I'd have a look at swapping that out for a 10GbE card, and get (if not already available) a switch with at least 2x 10GbE ports.
That would future-proof the networking department for quite a while.

Don't feel 'assaulted' by these questions; I'm curious and interested in discussing the options for your use case :)
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
Right now I'm not constrained on PCIe lanes: I only have 2x 1Gb/s NICs plus the 2x onboard ports, and a PCIe adapter for the M.2 SSD. I built this machine 4 years ago and I didn't remember why I chose to put the NVMe SSD on an adapter instead of the onboard M.2 slot; now I realize the onboard M.2 slot hangs off the PCH, sharing bandwidth with the onboard SATA controller. I was smart, lol.

If I add 16 more SATA ports, a dual SFP+ card and 2x M.2 SSDs in total, I'll be at the very limit of the PCIe lanes of a consumer platform.
It should be possible, however, by connecting the dual SFP+ card to one of the x4 PCH slots, sharing 4000MB/s (4 lanes) with the onboard SATA controller (1600MB/s + 2400MB/s), and putting a dual-M.2-to-x8 PCIe card and a 16-port SATA HBA on the two x8 slots connected to the CPU.
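My rough lane budget, in round numbers (my own estimates, not taken from the manual):

Code:
CPU : 16x PCIe 3.0 lanes -> two x8 slots (16-port SATA HBA + dual-M.2 adapter)
PCH : x4 uplink ~ 4000 MB/s, shared by onboard SATA and the dual SFP+ card
      dual 10GbE ~ 2 x 1250 MB/s = 2500 MB/s of that ~4000 MB/s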

But as I said, we'd be at the very limit; for future-proofing, an LGA 2011-3 platform that can take RDIMMs up to 1TB and a high number of PCIe devices is better. In the future I can reuse the E3 to upgrade another, less important NAS that's currently on an i3.

For the chassis, I'm on a generic 4U 8-bay; with this upgrade I'm planning to go to a 24-bay 4U (priced about the same as a 3U 16-bay).

For the switch, I've planned a QNAP managed model with 16x 2.5Gb and 4x 10Gb ports (2x RJ45 + 2x SFP+); it's about 800€.

I agree that the main problems now are low free space and having only 1 vdev. For two months I've been telling the team once a week to free up some space and archive things, but they don't listen to me :(
 
Last edited:

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
This is my arc_summary
P.S. In the first post I didn't remember the configuration of this machine correctly: it's 48GB RAM, 256GB L2ARC, and the recordsize is 256K.
 

Attachments

  • arc_summary.txt
    29 KB · Views: 119

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Here's some breakdown, the way I read it.
I encourage any other forum member to critically review my conclusions.

Code:
L2ARC breakdown:                                                  350.8M
        Hit ratio:                                     11.3 %      39.7M
        Miss ratio:                                    88.7 %     311.1M
        Feeds:                                                      9.8M


This indicates your current L2ARC only adds value about 11% of the time; the other ~89% of L2ARC lookups are misses. To me this suggests there is no point hoping for improved performance from a larger L2ARC.

Code:

ARC total accesses (hits + misses):                                59.2G
        Cache hit ratio:                               99.4 %      58.8G
        Cache miss ratio:                               0.6 %     350.8M
        Actual hit ratio (MFU + MRU hits):             99.4 %      58.8G
        Data demand efficiency:                        82.0 %     794.1M
        Data prefetch efficiency:                       6.0 %     205.7M

A cache hit ratio of 99.4% indicates that the system is super healthy in its current configuration, and would likely not benefit greatly from additional RAM or L2ARC.

These may be boring conclusions for the likes of me, who enjoy "hoping to find reasons to upgrade"; unfortunately, the ARC doesn't tell that story.
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12

But with a larger L2ARC, wouldn't there be a bigger chance that requested data is in the L2ARC instead of on the array? My main goal is to lower array usage, so it's more available for non-sequential reads from many different clients...
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
My main goal is to lower array usage, so it's more available for non-sequential reads from many different clients...
Understood.

But with a larger L2ARC, wouldn't there be a bigger chance that requested data is in the L2ARC instead of on the array?
I agree this seems intuitively the case.
To my understanding of how things actually work, that's not really what happens.
Anything that ends up in the L2ARC is basically evicted from ARC, in RAM.
In your present use case, interpreting your ARC stats, merely 11% of what ends up in the L2ARC ever gets used.

Currently you have an excellent hit rate on the ARC.
It would make a lot of sense to look at adding L2ARC if your ARC had a very low hit rate; your case doesn't look that way at all at this point.
The large sequential reads you're talking about don't seem to leverage the L2ARC in your case.
Most of the time, that's to be expected.
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
I found that l2arc_noprefetch is actually still 1; I think I placed this setting in the wrong place, not so smart :) This could be the reason for the low L2ARC usage...
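For anyone following along, I believe on Core the right way is a sysctl tunable; something like this from the shell to check and test it, then persist it in the GUI (my understanding of the UI path, so double-check):

Code:
sysctl vfs.zfs.l2arc_noprefetch          # check the current value
sysctl vfs.zfs.l2arc_noprefetch=0        # set it live for testing
# persist: System -> Tunables -> Add, Variable vfs.zfs.l2arc_noprefetch, Value 0, Type sysctl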
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
How long has your system been up? That will dramatically impact the hit and miss ratios. This article does a pretty good job explaining that: https://blog.chaospixel.com/linux/2016/09/zfs-why-is-l2arc-hit-ratio-so-low.html

A larger L2ARC won't necessarily lower array usage. Depending on how frequently any one piece of data is used, you may need an L2ARC that is a significant percentage of your array to actually get enough data on there to be useful. For applications like yours (huge files), I recommend more vdevs. Sequential reads are generally very efficient, and more vdevs give you more random I/O.

Your l2arc_noprefetch=0 flag can actually degrade performance. noprefetch=0 means that the system will attempt to cache sequentially prefetched data in the ARC and L2ARC as well. However, the ARC and L2ARC are generally much better used for metadata or other small files, which are what cause high I/O load. By crowding out this smaller data with large sequential data, you gain very little performance benefit (sequential data can easily saturate an Ethernet link anyway).

EDIT: this Github discussion does it more justice than I can: https://github.com/openzfs/zfs/issues/10464

Anything that ends up in the L2ARC is basically evicted from ARC, in RAM.
This is not strictly true. As data in the ARC is determined to be "old", the system will write a copy of that data to L2ARC. However, it won't actually flush that data from ARC until it needs additional space in ARC. I'd roughly guess that this overlap is about 25% of the ARC, but that number is dynamically adjusted by the system (and tunables can obviously impact it as well). It can easily be the case that 100% of the data in the L2ARC is also in the ARC.
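If you want to poke at this on your own system, the L2ARC size and the RAM held by its headers are exposed as kstats (field names from memory, so verify on your version):

Code:
sysctl kstat.zfs.misc.arcstats.l2_size      # logical size of data on the L2ARC device
sysctl kstat.zfs.misc.arcstats.l2_asize     # allocated (compressed) size on the device
sysctl kstat.zfs.misc.arcstats.l2_hdr_size  # RAM used by L2ARC headers in the ARC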
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
This is not strictly true. As data in the ARC is determined to be "old", the system will write a copy of that data to L2ARC. However, it won't actually flush that data from ARC until it needs additional space in ARC. I'd roughly guess that this overlap is about 25% of the ARC, but that number is dynamically adjusted by the system (and tunables can obviously impact it as well). It can easily be the case that 100% of the data in the L2ARC is also in the ARC.
Excellent contribution!

I slipped up a little there on the copy part.
It makes a lot of sense for the ARC not to "move" entries to the L2ARC, but rather to "copy" them and evict when needed. This also explains a bit the various tunables related to metadata and eviction in the thread "is ARC smart?". I recall jumping on tunables where the speed of replenishing the L2ARC could be tuned, too.

This further reinforces the overall conclusion to avoid focusing on L2ARC in this scenario.
 

mirko80

Dabbler
Joined
Mar 23, 2018
Messages
12
So it's clear that I can expand storage and network speed without needing a new platform. I will propose both solutions: a new NAS with an E5/128GB, and an upgraded one with the E3/64GB, both with a 24-bay chassis, 2 vdevs and 20Gb/s networking.
Anyway, the difference in cost is only about 600-700€, not so much.
 