TrueNAS storage server configuration question: 16-core 3.0GHz or 24-core 2.3GHz?

jena

Cadet
Joined
Jul 20, 2021
Messages
8
Hi all,

Our research lab is considering a TrueNAS storage server to store and feed bulk data (too expensive to run all-SSD) to our computation node via dual 10G SFP+ or a single 40G QSFP.
It will be 8 x 3.5inch 18TB HDDs in RAID-Z2 with lz4 compression; future expansion would be another 8-HDD RAID-Z2 vdev.

Workload
  • SMB
    • A Proxmox VM reads data from the storage server, computes, and writes data back.
    • Most file reads are ~10MB image files or other large binary data files, plus a couple of small 1-100kB ASCII configuration text files.
  • Syncthing and a few other jail plugins
  • Maybe a few Docker containers
(PS: I really want to use TrueNAS SCALE if its main features become stable in October 2021, so that I don't need to migrate from Core to SCALE later.)

It most likely will be a Lenovo SR655 2U server with 12 x 3.5in SAS front bays + 4 x 3.5in SAS mid bays + 4 x 2.5in rear SAS bays (mirrored boot drives).
Question:
1. CPU
The vendor can only configure CPUs up to 155W TDP in this chassis, so the choices are:
EPYC Milan 7313 16 Core 3.0GHz
EPYC Rome 7352 24 Core 2.3GHz
There isn't a huge difference in price.

Which one is more suitable?

2. RAM
128GB at 3200MHz; the server board has 16 DIMM slots (if more than 8 DIMMs are used, only 2933MHz):
32GB x 4 or
16GB x 8

3. OCP NIC
Is the Mellanox ConnectX-4 Lx 10/25GbE SFP28 supported by TrueNAS Core?

4. HBA
Are the LSI SAS3408 and LSI SAS3416 supported by TrueNAS Core?
Those are the only two HBA options in that chassis.

5. Slog necessary?
It looks like SMB is async by default, so I guess there is no performance benefit to having a SLOG.

The $1300 P5800X 400GB is tempting but hard to justify.

6. L2ARC beneficial?
I have read that L2ARC also caches metadata, which might speed up reads.
Since L2ARC is a copy of cached data that also exists on the data pool, can I just use a high-endurance NVMe drive like the Seagate FireCuda 520 2TB (3600TBW)?
Enterprise SSDs are just too expensive ($1000+).

7. Misc.
Mirrored Intel S4610 240GB as boot drives
Dual 1100W PSUs
Mirrored 1TB M.2 NVMe (might go with the consumer Samsung 980, not that mission-critical) for plugins and Docker to be installed on

8. Any other suggestion in hardware?

9. Support

Could we get a paid support subscription from iXsystems for the initial configuration and future troubleshooting?

Thank you very much for helping.

PS: Sorry that I didn't have enough time to figure out all of these questions by reading through lots of posts.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
With such a use-case I am inclined to suggest that you go for a commercial vendor (iXsystems or other). They will deliver support and also translate your requirements into a matching hardware specification.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
At least investigate the option @ChrisRJ suggested.

One note about L2ARC: the suggestion is to max out your RAM first. In the default L2ARC configuration, it caches data & metadata, and requires RAM for the index / catalog of what is in L2ARC. Thus, adding a large L2ARC to a smaller-memory system could actually hurt performance.

That said, I do plan on testing a L2ARC on my new system. But, it will be for metadata only. And if it works better, I'll leave it enabled. (One of my use cases is backing up 3 clients, so I might have millions of files on the NAS...)
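
As a rough back-of-the-envelope check (I'm assuming ~100 bytes of ARC header per cached record here; the exact overhead varies by OpenZFS version), you can estimate how much RAM a given L2ARC would eat just for its index:

# Hypothetical example: a 2 TiB L2ARC filled with 128 KiB records,
# assuming ~100 bytes of ARC header per cached record
echo $(( 2 * 1024**4 / (128 * 1024) * 100 / 1024**2 )) MiB
# => 1600 MiB of RAM spent just indexing the L2ARC

So a big L2ARC on a small-memory machine can cannibalize a meaningful chunk of the primary ARC.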
 

jena

Cadet
Joined
Jul 20, 2021
Messages
8
With such a use-case I am inclined to suggest that you go for a commercial vendor (iXsystems or other). They will deliver support and also translate your requirements into a matching hardware specification.

Currently we are interested in the Lenovo SR655.
We can get a very good institutional discount if we purchase from Dell or Lenovo.
I doubt iXsystems will provide a discount close to the big vendors'.
That's why I was asking whether iXsystems provides a software-only support subscription.
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
That said, I do plan on testing a L2ARC on my new system. But, it will be for metadata only. And if it works better, I'll leave it enabled. (One of my use cases is backing up 3 clients, so I might have millions of files on the NAS...)

How are you getting the L2ARC to cache metadata? I thought you would configure a fusion pool and have a dedicated vdev for metadata?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
I doubt iXsystems will provide a discount close to the big vendors'.
You never know if you don't ask, so it's worth the question. ;)

That's why I was asking whether iXsystems provides a software-only support subscription.
Providing commercial software support for a system they didn't design would be a challenging ask. Paying an experienced contractor (iX themselves or otherwise) for a few hours of time to handle the design/build process would make them significantly more likely to take you on as a client.

How are you getting the L2ARC to cache metadata? I thought you would configure a fusion pool and have a dedicated vdev for metadata
zfs set secondarycache=metadata pool/dataset will do that. The advantage is that an L2ARC can be removed or lost without compromising pool integrity, losing a special vdev kills the pool.
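
For example (pool/dataset names here are just placeholders), from a shell:

# Cache only metadata from this dataset in the L2ARC
zfs set secondarycache=metadata tank/research
# Confirm the value and whether it is set locally or inherited
zfs get -o name,value,source secondarycache tank/research
# Revert to the default behaviour (cache data + metadata)
zfs inherit secondarycache tank/research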
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
zfs set secondarycache=metadata pool/dataset will do that. The advantage is that an L2ARC can be removed or lost without compromising pool integrity, losing a special vdev kills the pool.

Thanks. Does this mean the L2ARC is set to ONLY store Metadata, or will it also serve as an L2ARC and then when full just spill over to the regular pool for metadata like the special vdev?

If I have multiple datasets, do I need to set it for each one? And as I add datasets, do I need to set it each time?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
So let's go through the original post @jena

0. Workload
Is there going to be more than one ProxMox VM/guest accessing the SMB share concurrently? The more you trend towards random access, the worse RAIDZ2 will perform compared to mirrors. That said, the files you're reading and presumably writing are fairly large (10MB+) so you'll have good space efficiency and likely will have enough of a delay between "read-compute-write" to make Z2 viable.

1. CPU
16c @ 3GHz. Your primary workload is SMB which is poorly threaded, so prefer speed to core count.

2. RAM
32GB x 4. Always buy the highest-density DIMMs you can afford, because this gives you room for more RAM in the future. ZFS loves RAM.

3. OCP NIC
The ConnectX-4 should be supported by the mlx5 driver - ensure that it is set to Ethernet mode and not IB.

4. HBA
Both are supported under the mpr driver but you will likely have to flash them from the OEM firmware to HBA or IT firmware. Ensure that this is possible and available from your OEM, or there are instructions on how to flash your specific OEM card into its LSI equivalent (LSI 9400-8i or 9400-16i I believe)

5. Slog necessary?
No, and even if your client system wanted to use strict SMB sync I'd suggest disabling it for this "batch processing" use case (a quick sketch follows at the end of this post).

6. L2ARC beneficial?
Potentially. Once a given "batch" of data has been processed, will it ever be read again? How big is a given "batch" that is accessed before "going cold?" Ideally, you should equip the system with enough RAM to put the entire "batch" into primary ARC, at which point the L2ARC SSD would be useless, unless it's been set to metadata-only as @Arwen suggested. A mixed-workload SSD like the S4610s you're using for boot would be fine for this.

7. Misc.
All options looking good. Your board should support PCIe bifurcation but verify before buying a 2-in-1 M.2 to PCIe card.

8. Any other suggestion in hardware?
Consider reserving a slot for an external HBA should you need to expand beyond the 16x 3.5" bays in your current system. Also ensure that your cooling is up to snuff as you'll be putting 4x HDDs in the exhaust path of the 12x front ones. Expect a loud system.

9. Support
See my previous reply above. :)
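
To expand on #5: if sync writes do sneak in from the client side, a minimal sketch of forcing async on the share's dataset (dataset name hypothetical) would be:

# Check how the dataset currently honours sync requests (default is "standard")
zfs get sync tank/research
# Treat every write on this dataset as async
zfs set sync=disabled tank/research
# Caveat: with sync=disabled the last few seconds of acknowledged writes can be
# lost on a crash or power failure, so only do this for reproducible batch data.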
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thanks. Does this mean the L2ARC is set to ONLY store Metadata, or will it also serve as an L2ARC and then when full just spill over to the regular pool for metadata like the special vdev?

If set with that flag the L2ARC will only hold metadata, and not any actual ARC data. This reduces the amount of writes/churn for workloads that don't benefit from the cached data (or are random enough to make L2ARC hitrate terrible) and instead causes metadata reads to come from SSD instead of spinning disk. For workloads with lots of small file access (backups, user home drives) that metadata read traffic can hurt the performance of the back-end disks, so having those small reads hit an SSD speeds it up significantly.

As far as it becoming full goes, L2ARC is a ring buffer (first in, first out), so when an L2ARC gets full, it simply starts dropping the oldest data it has. All data in the L2ARC is already safely on disk, so the next read that misses ARC and misses L2ARC just goes to the data vdevs. Making it a metadata-only device significantly reduces the amount of data on the SSD (metadata is usually 0.1% of most common file workloads, but can be as big as 1% for block storage), so this makes it less likely to "overflow" in that manner.

@jena as mentioned in #6 above, if your files are "batch processed" and then they "go cold" and aren't really read from again, a metadata-only L2ARC makes much more sense. Having that cold data flow through your L2ARC would probably just result in a very poor hit-rate and a lot of wear.
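
If you do try a (metadata-only) L2ARC, a couple of quick ways to see whether it is earning its keep (pool name hypothetical; the sysctl path is the FreeBSD/TrueNAS CORE one):

# How full the cache device is and how busy it gets
zpool iostat -v tank 5
# Raw L2ARC counters: hits, misses and current size
sysctl kstat.zfs.misc.arcstats | grep -E 'l2_(hits|misses|size)'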

If I have multiple datasets, do I need to do it for each? And as I add datasets, I need to add it each time?
The value is per-dataset, but you can set it at the parent level (a parent dataset or the pool itself) and the secondarycache value of child datasets should inherit it. I'd double-check though just in case it's explicitly set to cache-all by the TN middleware.
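
A quick way to check that across the whole pool (pool name is just an example):

# Set it once at the root dataset...
zfs set secondarycache=metadata tank
# ...then list every dataset's value and where it came from; SOURCE should read
# "inherited from tank" unless something has set it locally on a child
zfs get -r -o name,value,source secondarycache tank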
 

HarambeLives

Contributor
Joined
Jul 19, 2021
Messages
153
If set with that flag the L2ARC will only hold metadata, and not any actual ARC data. This reduces the amount of writes/churn for workloads that don't benefit from the cached data (or are random enough to make L2ARC hitrate terrible) and instead causes metadata reads to come from SSD instead of spinning disk. For workloads with lots of small file access (backups, user home drives) that metadata read traffic can hurt the performance of the back-end disks, so having those small reads hit an SSD speeds it up significantly.

As far as it becoming full goes, L2ARC is a ring buffer (first in, first out), so when an L2ARC gets full, it simply starts dropping the oldest data it has. All data in the L2ARC is already safely on disk, so the next read that misses ARC and misses L2ARC just goes to the data vdevs. Making it a metadata-only device significantly reduces the amount of data on the SSD (metadata is usually 0.1% of most common file workloads, but can be as big as 1% for block storage), so this makes it less likely to "overflow" in that manner.

@jena as mentioned in #6 above, if your files are "batch processed" and then they "go cold" and aren't really read from again, a metadata-only L2ARC makes much more sense. Having that cold data flow through your L2ARC would probably just result in a very poor hit-rate and a lot of wear.


The value is per-dataset, but you can set it at the parent level (a parent dataset or the pool itself) and the secondarycache value of child datasets should inherit it. I'd double-check though just in case it's explicitly set to cache-all by the TN middleware.

Thanks, hopefully I'm not derailing OP's thread here

I'd like full L2ARC functionality + Metadata on SSD's. Am I still better off just having an SSD mirror for metadata in this case?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,110
Thanks, hopefully I'm not derailing OP's thread here

I'd like full L2ARC functionality + Metadata on SSD's. Am I still better off just having an SSD mirror for metadata in this case?

I consider the L2ARC discussion still relevant to OP's interests as well as others reading, hopefully the forum admins agree if they're lurking. ;)

In your case I would stick with the default secondarycache=all - special vdevs for metadata are more about accelerating metadata writes than reads, such as users with lots of small, frequently-updating files or using block workloads over NFS/iSCSI. But as I'm now going to be asking more workload-specific questions of you, perhaps I'll find your thread and comment in there (tomorrow, it's late.)
 

jena

Cadet
Joined
Jul 20, 2021
Messages
8
You never know if you don't ask, so it's worth the question. ;)

Providing commercial software support for a system they didn't design would be a challenging ask. Paying an experienced contractor (iX themselves or otherwise) for a few hours of time to handle the design/build process would make them significantly more likely to take you on as a client.

zfs set secondarycache=metadata pool/dataset will do that. The advantage is that an L2ARC can be removed or lost without compromising pool integrity, losing a special vdev kills the pool.
Sure. I will get a quote from them.

PS: One thing that iXsystems could improve is pricing transparency.
Currently there is no configurator for the R, X, or M series, and the quote form only asks for a budget.

"Paying an experienced contractor" — we would consider this and are in some cases willing to pay for support. However, we don't have as much budget allocated to this as a business would, and there are various limitations and regulations on spending at a university.

In my case, I read and watched Level1Techs' post and video about the special metadata device and decided to avoid it.
Only a three-way mirror for such a critical role would let me sleep better.
But 5 years down the road, what if we expand the data pool and its metadata approaches or exceeds the capacity of the current special metadata vdev?
Sounds pretty complicated, with a lot of room for things to go south.
 

jena

Cadet
Joined
Jul 20, 2021
Messages
8
So let's go through the original post @jena

0. Workload
Is there going to be more than one ProxMox VM/guest accessing the SMB share concurrently? The more you trend towards random access, the worse RAIDZ2 will perform compared to mirrors. That said, the files you're reading and presumably writing are fairly large (10MB+) so you'll have good space efficiency and likely will have enough of a delay between "read-compute-write" to make Z2 viable.

1. CPU
16c @ 3GHz. Your primary workload is SMB which is poorly threaded, so prefer speed to core count.

2. RAM
32GB x 4. Always buy the highest-density DIMMs you can afford, because this gives you room for more RAM in the future. ZFS loves RAM.

3. OCP NIC
The ConnectX-4 should be supported by the mlx5 driver - ensure that it is set to Ethernet mode and not IB.

4. HBA
Both are supported under the mpr driver but you will likely have to flash them from the OEM firmware to HBA or IT firmware. Ensure that this is possible and available from your OEM, or there are instructions on how to flash your specific OEM card into its LSI equivalent (LSI 9400-8i or 9400-16i I believe)

5. Slog necessary?
No, and even if your client system wanted to use strict SMB sync I'd suggest disabling it for this "batch processing" use case.

6. L2ARC beneficial?
Potentially. Once a given "batch" of data has been processed, will it ever be read again? How big is a given "batch" that is accessed before "going cold?" Ideally, you should equip the system with enough RAM to put the entire "batch" into primary ARC, at which point the L2ARC SSD would be useless, unless it's been set to metadata-only as @Arwen suggested. A mixed-workload SSD like the S4610s you're using for boot would be fine for this.

7. Misc.
All options looking good. Your board should support PCIe bifurcation but verify before buying a 2-in-1 M.2 to PCIe card.

8. Any other suggestion in hardware?
Consider reserving a slot for an external HBA should you need to expand beyond the 16x 3.5" bays in your current system. Also ensure that your cooling is up to snuff as you'll be putting 4x HDDs in the exhaust path of the 12x front ones. Expect a loud system.

9. Support
See my previous reply above. :)

A huge thank you for the detailed reply.
Really appreciated.

4. HBA
If TrueNAS already sees the individual drives, can I assume that I don't need a firmware flash?
I know Dell and HPE have their own proprietary HBA firmware.

Lenovo says: 430-8i HBA, Non-RAID (JBOD mode) support for SAS and SATA HDDs and SSDs (RAID not supported)

6. L2ARC beneficial?
Once a given "batch" of data has been processed, will it ever be read again?
We plan to have 128GB of RAM. I am not sure if the ARC will be big enough.

Some use case scenarios:
Feeding data to the Proxmox node
1. One medical imaging dataset (<1GB), which might be run once; check the results, adjust parameters, and rerun.
2. A batch of medical imaging datasets (maybe 10-ish); parameters were already tuned in step #1, just get all the results.
3. A folder of 200-ish tiled JPG images (5-10MB each; split into tiles in order to process one at a time within 64GB of RAM during computation). After a run, check the results; likely need to tune parameters and rerun.
4. For the same case as step #3 (maybe 10-ish such folders), run with previously optimized parameters.

For other bulk data storage
Just like a NAS, most data doesn't get used often, or ever. We have about 12TB currently.
Occasionally, other members might download a few hundred GB to their local PC to do some analysis and computation.

Backing up important data on all members' PCs using Syncthing
That data is usually generated or computed on the members' local PCs. A backup reduces the risk of data loss from a single disk failure in those PCs.

7. Misc.
A Lenovo engineer said that it doesn't support bifurcation in the BIOS, but they sell a PCIe switch card that runs at Gen3 speed.
The M.2 2-in-1 card is provided by Lenovo.

8. Any other suggestion in hardware?
Consider reserving a slot for an external HBA
Sure. The risers are not expensive; we plan to populate all available riser cages.
Lenovo also upgrades to the performance fan package in their configurator and limits the CPU TDP to 155W in this chassis.

I like the Lenovo SR655 because of its 16 x 3.5in drive bays (there is also a similar Supermicro model with 16 bays).
The first vdev of 8 x 18TB drives in RAID-Z2 offers 90-ish TB of storage (rough arithmetic at the end of this post), which should be sufficient for the next 3-5 years.
In the next 5-7 years, adding the second 8-drive vdev won't need a JBOD disk shelf.
Then, at around 8-10 years, it might be time to buy a new storage server for warranty coverage, or maybe an all-flash server for more "hot" data; who knows what the storage trend will be 10 years from now.
These are just my thoughts.
Feel free to provide your suggestion.
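
(Rough arithmetic behind the "90-ish TB" figure above; this ignores RAID-Z2 padding/allocation overhead and the usual advice to leave free space, so real usable capacity will be somewhat lower:)

# 8 drives in RAID-Z2 => 6 data drives of 18 TB (decimal) each
echo $(( 6 * 18 * 10**12 / 2**40 )) TiB
# => 98 TiB of raw data capacity before overhead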
 

jena

Cadet
Joined
Jul 20, 2021
Messages
8
Thanks, hopefully I'm not derailing OP's thread here

I'd like full L2ARC functionality + Metadata on SSD's. Am I still better off just having an SSD mirror for metadata in this case?
No problem.
It is kind of related to some of my use cases.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,906
4. HBA
If TrueNAS already sees the individual drives, can I assume that I don't need a firmware flash?
Unfortunately not. The problem with unsuitable storage controllers is not so much that they "simply don't work"; it is the edge-case situations where things may go south. For more details, please check "Don't use RAID" from the recommended readings in my signature.
 