400TB+ ZFS File server with large L2ARC

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Greetings to members of the forum!!

We are an NGO looking to create a file server to store media files as well as provide a low-latency, high-throughput workspace for our video editors and content creators.
Currently we have settled on the following specs:

Dual socket server with 16 core, 32 thread CPUs
256 GB RAM
60 Bay WD JBOD with 36 x 18 TB SAS 7200rpm HDDs
NIC : Need suggestions for 40/100 GbE

Most clients are Windows workstations and PCs, plus a couple of 2013 Mac Pros.

The server will keep 1 copy of our entire media archive (approx 230 TB and growing) and about 80-100 TB of working data. There are 24 drive slots free in the JBOD for future growth.

What we need to figure out is how to spec out the L2ARC to help boost performance for the editing and ingest teams. Editors work on current video projects, and the ingest team works on digitizing tape and organizing data before ingesting it into our asset management solution, which will then keep a copy on this file server as well as on LTO tape in a separate library.

We plan to have a server attached to this JBOD with 24 NVMe drive bays (PCIe Gen 4). With the new Intel PCIe Gen 4 SSDs launching, what would be a good choice - 4/8 x 3.84/7.68 TB P5510 NVMe SSDs? Or some combination of Optane P5800X SSDs?

What sort of NIC can we use? Higher bandwidth is preferred, like 40/100 GbE. We have a 40 GbE port switch and clients are connected to 10 GbE ports; 100 GbE would be with a future network upgrade in mind.

Also, are AMD Epyc CPUs OK for a ZFS storage implementation? Any loss of performance/functionality or gotchas when it comes to Epyc? The thing is that even with the new Ice Lake CPUs from Intel, AMD is just so far ahead in price-to-performance and features that it's hard to overlook.

Grateful to the community for your time and advice!
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Can't answer most of your questions, but I'm building my next TrueNAS with AMD Epyc. Much more modest configuration, with just 8 cores / 16 threads. But, the board does have 4 x 16 lane PCIe slots, 2 x 10Gbps Ethernet ports & 6 memory channels, (1 DIMM per memory channel). All the parts are here, just need time to put it together. So I do have confidence that TrueNAS, (Core or Scale), will work on AMD Epyc, otherwise I would not have paid that much money for a "real" server. Will see if that assumption is correct soon.
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
516
There are issues when you have many NVMe SSDs:
For the moment, it is safer to stay with SATA or SAS SSDs even if they are slower than NVMe SSDs.

For the NIC, pick only Intel NICs; they just work well on any operating system. This is useful if you repurpose the server later for another workload (especially for VMware ESXi).
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Arwen said:
Can't answer most of your questions, but I'm building my next TrueNAS with AMD Epyc. Much more modest configuration, with just 8 cores / 16 threads. But, the board does have 4 x 16 lane PCIe slots, 2 x 10Gbps Ethernet ports & 6 memory channels, (1 DIMM per memory channel). All the parts are here, just need time to put it together. So I do have confidence that TrueNAS, (Core or Scale), will work on AMD Epyc, otherwise I would not have paid that much money for a "real" server. Will see if that assumption is correct soon.
Thanks Arwen, that does help build confidence in Epyc. The Gen 4 PCIe lanes, and so many of them, do make a difference with new storage devices.

Do you have any experience using Mellanox NICs?
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
blanchet said:
There are issues when you have many NVMe SSDs:
For the moment, it is safer to stay with SATA or SAS SSDs even if they are slower than NVMe SSDs.

For the NIC, pick only Intel NICs; they just work well on any operating system. This is useful if you repurpose the server later for another workload (especially for VMware ESXi).
Thanks Blanchet!

The LTT NVMe incident was alarming; some research shows that they have managed to improve upon things, as seen in this video:
https://www.youtube.com/watch?v=9-Xvthp9l-8
They say it was specific to those SSDs on that platform, but it is something to watch out for.

Memory bandwidth is one of the reasons for choosing a dual-socket platform over a single-socket one, and populating 1 DIMM per memory channel should be a good start (16 DIMMs of 16 GB each).

The Supermicro server - https://www.supermicro.com/en/Aplus/system/2U/2124/AS-2124US-TNRP.cfm -
has 24 drive bays and one PCIe x16 Gen 4 slot; it is possible to disable 4 drive bays and free up another PCIe x16 Gen 4 slot, which is what we are doing. That gives us 20 usable drive bays, of which 2 will go to boot drives, leaving 18 for data drives/L2ARC/ZIL.

What I'm understanding is that using too many NVMe drives in a single pool tends to overwhelm the CPU, but with high-capacity (3.84/7.68 TB or more) PCIe Gen 4 NVMe SSDs, performance should not be an issue while keeping the number of drives low. Any thoughts?

Am trying to decide what drive configuration to use for the L2ARC. We have a video archival and post-production workflow, low end - mostly HD. I was wondering whether having a large L2ARC (14 or 28 TB) and a high feed rate will help to basically cache everything that is read from the server in L2ARC (including very large video files), so that every second access to a block comes from L2ARC. Would this work?

What's the highest feed rate for an L2ARC? The Intel P5510 3.84/7.68 TB SSDs have a write speed at 128K of about 4 GB/s. Having 4 of these in the L2ARC should give a very high theoretical write speed; would it be practical to set the feed rate to something crazy like 8 GB/s? Is this the correct line of thought?
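From what I can tell so far, the feed rate seems to be controlled by a couple of OpenZFS tunables that show up as sysctls on TrueNAS Core - a rough sketch of what I mean below (values are examples only, and the exact sysctl names may differ between releases, so please correct me if I have this wrong):

Code:
# find the exact names on the running system first
sysctl -a | grep -i l2arc_write

# OpenZFS calls them l2arc_write_max / l2arc_write_boost (bytes written per feed interval)
# example only: allow roughly 1 GiB per interval instead of the small default
sysctl vfs.zfs.l2arc_write_max=1073741824
sysctl vfs.zfs.l2arc_write_boost=1073741824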
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Jyo said:
Thanks Arwen, that does help build confidence in Epyc. The Gen 4 PCIe lanes, and so many of them, do make a difference with new storage devices.

Do you have any experience using Mellanox NICs?
Yes, AMD Epycs, (and to a much smaller extent, Ryzen), are making Intel look like they artificially restricted PCIe lanes in their CPUs to force people to either buy higher end CPUs or buy dual socket CPU boards, WITH 2 CPUs.

No, I do not have any experience with Mellanox NICs. My Epyc board has Intel NICs on-board, (which, unlike Intel CPUs, are price competitive and good performers).
Jyo said:
...
The Supermicro server - https://www.supermicro.com/en/Aplus/system/2U/2124/AS-2124US-TNRP.cfm -
has 24 drive bays and one PCIe x16 Gen 4 slot; it is possible to disable 4 drive bays and free up another PCIe x16 Gen 4 slot, which is what we are doing. That gives us 20 usable drive bays, of which 2 will go to boot drives, leaving 18 for data drives/L2ARC/ZIL.
...
The board does seem to have 2 dedicated SATA ports, which possibly could be used for SATA SSD boot drives. This would make more sense, (for the boot drives), than NVMe drives.
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Arwen said:
Yes, AMD Epycs, (and to a much smaller extent, Ryzen), are making Intel look like they artificially restricted PCIe lanes in their CPUs to force people to either buy higher end CPUs or buy dual socket CPU boards, WITH 2 CPUs.

No, I do not have any experience with Mellanox NICs. My Epyc board has Intel NICs on-board, (which, unlike Intel CPUs, are price competitive and good performers).

The board does seem to have 2 dedicated SATA ports, which possibly could be used for SATA SSD boot drives. This would make more sense, (for the boot drives), than NVMe drives.

AMD being so disruptive has really helped push performance along at a higher rate.

Will see if there's something from Intel we can use; I'm sure Supermicro will have Intel-based NICs on offer. Thanks for the tip!

Is it the SATADOM ports you refer to?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Jyo said:
...
Is it the SATADOM ports you refer to?
Sort of. They are basically normal SATA ports which you could use for SATA DOMs, or just plain SATA disks / SSDs.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Arwen said:
They are basically normal SATA ports which you could use for SATA DOMs, or just plain SATA disks / SSDs.
The main difference is that they (the orange coloured ones) also provide power to the SATA DOM. Most Supermicro boards either have two of them right away or an additional power connector near the SATA ports, so you can plug one DOM into the "SuperDOM" port and another one into a regular SATA port with a separate power connection. Besides the power issue, "SuperDOM" ports are plain SATA ports as @Arwen already said, and DOMs are just plain SATA-interface flash drives.
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Thank you Arwen, Patrick. I'll check with SMC if they can connect 2 front drive bays to the orange SATA ports. I believe there is a rear drive bay add-on, but it may cost PCIe expansion slots in this server.

I had a few basic questions: are volumes expandable/shrinkable on a zpool? If we have one volume for archive data and another for a working file share, will it be possible to expand the volumes as the data grows? And also to add more drives to the zpool? Any best practices?

Is it as simple as adding another vdev's worth of drives to the zpool and then expanding the volumes?
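Roughly what I imagine doing (the pool name, layout and disk names below are just placeholders, not a plan):

Code:
# current pool, e.g. six 6-wide RAIDZ2 vdevs built from the 36 x 18 TB disks
zpool status tank

# dry-run first, then actually grow the pool by one more RAIDZ2 vdev of the same width
zpool add -n tank raidz2 da36 da37 da38 da39 da40 da41
zpool add tank raidz2 da36 da37 da38 da39 da40 da41

# datasets see the extra capacity immediately; there is no separate volume-resize step
zfs list tank/archive tank/working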

Also, since there will be a very large number of files in the archive data volume, is compression or deduplication beneficial? I see that there is a way to improve filesystem performance by keeping ZFS metadata on an SSD vdev/volume. What are the best practices on that? Please do guide.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Datasets (volumes) are by default unlimited in size and will store whatever you put there until the pool fills up. You would need to actively set a quota to limit them in size and of course that quota can be changed as desired any time.
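A minimal sketch - the dataset names are made up and the numbers arbitrary:

Code:
# no quota by default - the dataset can grow until the pool is full
zfs get quota tank/archive

# cap it, and simply raise the cap later as the archive grows
zfs set quota=250T tank/archive
zfs set quota=300T tank/archive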

If you have a really large number of files and you are going to serve them via SMB, then a separate metadata vdev on SSDs definitely does help. It should have the same level of redundancy as your other vdevs. It is not a cache - if the metadata vdev is lost, your entire data is as well.
So if you go with RAIDZ2 vdevs for your spinning disks, use a three-way mirror for your metadata. Four-way mirror if your HDDs are in RAIDZ3. Etc.
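For illustration only - the device names are placeholders, and keep in mind that a special vdev cannot be removed again from a pool that contains RAIDZ vdevs, so size it properly from the start:

Code:
# add a three-way mirrored special (metadata) vdev to an existing pool
zpool add tank special mirror nvd0 nvd1 nvd2

# optionally also steer small file blocks (not just metadata) onto the SSDs, per dataset
zfs set special_small_blocks=64K tank/archive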

You should probably read the ZFS primer:
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
blanchet said:
There are issues when you have many NVMe SSDs
Thank you very much for those links. We were hitting our heads against the next available wall about a year ago when we deployed our first all-NVMe systems. Going to disable compressed ARC, probably.
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Patrick M. Hausen said:
Datasets (volumes) are by default unlimited in size and will store whatever you put there until the pool fills up. You would need to actively set a quota to limit them in size and of course that quota can be changed as desired any time.

If you have a really large number of files and you are going to serve them via SMB, then a separate metadata vdev on SSDs definitely does help. It should have the same level of redundancy as your other vdevs. It is not a cache - if the metadata vdev is lost, your entire data is as well.
So if you go with RAIDZ2 vdevs for your spinning disks, use a three-way mirror for your metadata. Four-way mirror if your HDDs are in RAIDZ3. Etc.

You should probably read the ZFS primer:

Thanks Patrick,

I did go through the ZFS Primer and it helps, thanks for the brief on datasets and metadata pool.

Will set aside a pool of SSDs for metadata. I was trying to size them, but the data on that in forum posts varies a lot. Is there a way to make a ballpark estimate? We have mostly videos, audio and photos, almost all of which are GB+ in size, and a total of 230+ TB at the moment; we want to be able to grow to 700-800 TB with the current setup. Please guide.

Would it make sense to also back up the metadata pool? Would that relax the triple or quadruple mirror resiliency requirement? Or is there a way to accelerate or prioritize metadata caching in such a way that all metadata is cached in the L2ARC upon boot? Any tuning possible along those lines?
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Patrick M. Hausen said:
Thank you very much for those links. We were hitting our heads against the next available wall about a year ago when we deployed our first all-NVMe systems. Going to disable compressed ARC, probably.
We faced some issues with our first all-NVMe server as well. It was on Windows with VROC for RAID, and large sequential writes would every once in a while tank for a few seconds before resuming at full speed. This happens once every few minutes in multi-hundred-GB transfers, but overall the higher level of performance is quite amazing. Moving terabytes has never been so much fun! :wink:
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Jyo said:
Will set aside a pool of SSDs for metadata. I was trying to size them, but the data on that in forum posts varies a lot. Is there a way to make a ballpark estimate? We have mostly videos, audio and photos, almost all of which are GB+ in size, and a total of 230+ TB at the moment; we want to be able to grow to 700-800 TB with the current setup.
I can't, sorry. I don't have experience with setups similar to yours. According to this thread, a ballpark estimate is 1% of your data. I suggested that, with current prices of storage, why not go for 5%, be grossly oversized and not have to worry? Of course, I was thinking of common pool sizes in my environment, not 400 TB or more. The good thing: you can always add more. Start with a three-way mirror of 4 TB drives and add another three if necessary (and your case permits, of course).
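If you want to sanity-check the 1% figure against real data, zdb can reportedly break an existing pool down by block type; run it against a pool already holding similar material (the pool name is just an example, and this can take a while on a large pool):

Code:
# block statistics by type; the metadata categories give a rough idea of
# how much a special vdev would have to hold for this kind of data
zdb -Lbbbs tank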

Jyo said:
Would it make sense to also back up the metadata pool? Would that relax the triple or quadruple mirror resiliency requirement? Or is there a way to accelerate or prioritize metadata caching in such a way that all metadata is cached in the L2ARC upon boot? Any tuning possible along those lines?
First, the special/metadata vdev is a vdev in your pool, not a separate pool. A backup will always include data and metadata all the same. If the backup machine has a different topology (i.e. without a special vdev), the backup will still be the same semantically; only the performance of the backup system will be different.
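As a rough sketch of what I mean (pool, dataset and host names are invented) - a replication stream carries data and metadata together, and the receiving pool does not need a special vdev:

Code:
zfs snapshot -r tank/archive@backup-2021-05
zfs send -R tank/archive@backup-2021-05 | ssh backuphost zfs receive -uF backuppool/archive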

Anecdotal evidence from another forum member says the special vdev by far outperforms an L2ARC in file sharing use cases. As always, your mileage may vary ...
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Patrick M. Hausen said:
I can't, sorry. I don't have experience with setups similar to yours. According to this thread, a ballpark estimate is 1% of your data. I suggested that, with current prices of storage, why not go for 5%, be grossly oversized and not have to worry? Of course, I was thinking of common pool sizes in my environment, not 400 TB or more. The good thing: you can always add more. Start with a three-way mirror of 4 TB drives and add another three if necessary (and your case permits, of course).

First, the special/metadata vdev is a vdev in your pool, not a separate pool. A backup will always include data and metadata all the same. If the backup machine has a different topology (i.e. without a special vdev), the backup will still be the same semantically; only the performance of the backup system will be different.

Anecdotal evidence from another forum member says the special vdev by far outperforms an L2ARC in file sharing use cases. As always, your mileage may vary ...
Thank you Patrick!

Special vdevs are really interesting. So, from what you're suggesting, it may be better to go with 4-20 TB (1-5% of 400 TB) of usable space on a special vdev for metadata and bypass an L2ARC altogether? I do understand that we will have to test things out before committing to them, but I want to make an informed purchase so as to avoid wastage or a shortfall of space. One thing that wasn't clear was whether RAIDZ2 would work for the special vdev; if so, I could go in for 3.84 TB SSDs that could be in an 8-wide RAIDZ2 special vdev, or repurposed for a 4-wide L2ARC and a 4-wide RAIDZ2 if that much space isn't needed for metadata.

Am still searching the documentation to see if there's a way to prime the L2ARC at boot with certain metadata/ZFS-specific bits that would speed up all of these operations with just a large L2ARC. Is that the correct line of thought?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Jyo said:
it may be better to go with 4-20 TB (1-5% of 400 TB) of usable space on a special vdev for metadata and bypass an L2ARC altogether?
That's what I am suggesting.

With 256 G of RAM why do you think you need an L2ARC at all? This is huge and will be mostly cache (ARC). RAM is always better than L2ARC regardless of the media and topology.
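Before spending money on L2ARC devices it might be worth watching how the RAM-based ARC behaves under your real workload, e.g. with the bundled tools (output fields vary a bit between releases):

Code:
# summary of ARC size, hit ratio and demand/prefetch breakdown
arc_summary | less

# live view, one line every 10 seconds
arcstat 10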

About the RAIDZ2 for metadata - sorry, no idea. I would go with a mirrored vdev, but that is just a gut feeling combined with anecdotal evidence - I simply have not seen anything but mirror vdevs in the wild.
I think it can be justified to go for a three-way mirror even if you have RAIDZ3 for your data drives. This will nominally be less redundancy, but the smaller SSDs will resilver so much more quickly in case one fails. Again, there is probably no hard data (failure statistics) available just yet.
 

Jyo

Dabbler
Joined
May 12, 2021
Messages
14
Patrick M. Hausen said:
With 256 G of RAM why do you think you need an L2ARC at all?

Well, the idea is to be able to cache in L2ARC whatever video projects and their media the team is working on. That way I do not need a special vdev dedicated to it; the L2ARC can automatically and aggressively (if possible) cache the files the team needs, and when they move to another project the unused cached blocks get evicted in favour of the newer ones.

Since they will edit the files over the network it helps to have low latency and high bandwidth. Is it a feasible use case?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Of course it is. The question is whether the working set of your team will be way larger than the 256 G you already have as primary cache. Plus, the data needs to be copied to and from the L2ARC. For everything that is in RAM the L2ARC is of absolutely no use. Once RAM is full, the L2ARC will start to fill up from read operations. So the only way to warm up the L2ARC is by reading data.
There is a persistence option now - it's off by default, and I don't know if it is supposed to work reliably just yet. It is quite new, too - just like special vdevs.
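If you do end up adding an L2ARC, these are the knobs I would look at first - tunable and property names as I understand them for OpenZFS 2.0 on TrueNAS 12, so please verify them on your release:

Code:
# persistent L2ARC: rebuild the cache contents from the cache device after a reboot
sysctl -a | grep -i l2arc | grep -i rebuild   # confirm the exact name first
sysctl vfs.zfs.l2arc.rebuild_enabled=1

# per-dataset control over what is eligible for L2ARC
zfs set secondarycache=all tank/working        # data + metadata (the default)
zfs set secondarycache=metadata tank/archive   # cache only metadata for the archive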

All caches (ARC and L2ARC) work on a block level, so they cache metadata and data all the same. If one and the same video file (and its blocks) is repeatedly read - probably good. If this is all about streaming and different users work on different files all the time, or if the application is rather write-heavy - probably not improving things much.

I cannot argue pro or contra L2ARC; I was just referring to a forum post about a large setup where the author measured huge improvements with special vdevs and none, or not nearly as large ones, with L2ARC. Specifically, if you have hundreds to thousands of files in directories and "listing directories" is a common client operation - because users use graphical desktop environments and file browsers (Explorer, Finder, ...) - then I would definitely implement a special vdev.

L2ARC really depends on whether the same blocks will be read multiple times.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Is it your intention to serve up the video files to the clients and have them edit the files on their local machines, or to have them edit the content on the file server directly? Also, the size of the files to be worked on is of course a major consideration. Just curious.
 