Fusion pool metadata vdev size for a 45 drive Supermicro box

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
I've seen estimates for anywhere from 0.3% to 1.6% for metadata, so I'm just trying to get the sizing nailed down before ordering. The other concern is the metadata is only mirrored with this provisional setup. Thoughts about risk?

We have been using 36-bay Supermicro boxes for a while and are happy with them, but for the latest pair of FreeNAS boxes we are looking at the 6049P-E1CR45L+ due to its larger capacity and optional NVMe storage. My big question is with the hybrid storage pools available with TrueNAS CORE, how much SSD space will I need for a metadata VDEV if I have 5 9-wide RAIDZ2 VDEVs? These systems will primarily be used for Veeam backups but will also be hosting misc NFS and iSCSI storage. Thanks in advance!

Provisional storage layout:
45 x 16tb Nearline SAS
768gb memory
2 x 240gb s4510 SATA drives (OS)
2 x 2tb p4510 NVMe (L2ARC)
2 x 375gb Optane NVMe (SLOG)
2 x 7.6tb NVMe (metadata)
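For concreteness, here's roughly how I'm picturing the pool being put together (a sketch only - device names are placeholders and the exact grouping is still up in the air):

# 5 x 9-wide RAIDZ2 data vdevs plus the support vdevs (placeholder device names)
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 \
  raidz2 da9 da10 da11 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23 da24 da25 da26 \
  raidz2 da27 da28 da29 da30 da31 da32 da33 da34 da35 \
  raidz2 da36 da37 da38 da39 da40 da41 da42 da43 da44 \
  special mirror nvd0 nvd1 \
  log mirror nvd2 nvd3 \
  cache nvd4 nvd5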
 
Last edited:

jasonsansone

Explorer
Joined
Jul 18, 2019
Messages
79
Run "zdb -Lbbbs POOLNAME" on the existing pool to determine the metadata size. This will show you what your pool actually uses. You can plan around that.

Regarding the reliability of a mirror vs a three-way mirror, that depends on your own system requirements. If this is just a backup system, I think a mirror is fine. If the data is lost, it can be repopulated on the next Veeam backup. The statistical odds of both mirror drives failing AND your primary system suffering a data loss at the same time are pretty remote, but only you can make that call. For my system, I have three identical P3605 drives, but I am using two of them as a mirror for metadata and one as L2ARC. My logic is that I can remove the L2ARC at any time and use it to rebuild the metadata mirror if there were a failure. I get more value from the drive as L2ARC as opposed to sitting in a three-way mirror. There is the rebuild time to consider, but how long does it take an NVMe mirror to resilver? I also have a complete backup of the system.
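If it helps to picture the shuffle, it would go something like this (pool and device names made up):

zpool remove tank nvd2        # free up the L2ARC device (cache devices can be removed at any time)
zpool replace tank nvd1 nvd2  # swap it in for the failed special-mirror member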
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I'm a bit confused about the 2TB S4510 NVMe L2ARC - Isn't that by definition / SKU a SATA drive? Also, what is the specific data you hope to host there if you're already using a metadata special VDEV? Just curious, as I'm planning to continue to use my L2ARC, retiring it from metadata-only use and graduating it to general data use but mostly because it's already there.

Re: metadata @ 1.6% - 5*7*16*0.8*.016 = 7.2TB, so that should be enough by a generous margin. However, @jasonsansone has it right - better to measure, if possible. Whether 2 mirrored NVMes is enough is a different question. I'm planning on using three partitioned S3610s for my 8-drive Z3 pool (500GB for metadata, 1.1TB for small files per disk), but my use case is different and I have a qualified cold spare sitting on standby, if needed.
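Spelled out, the terms in that back-of-envelope figure are (assuming roughly 80% fill):

5 vdevs x 7 data disks per 9-wide Z2 (2 disks go to parity) x 16TB per disk x 0.8 fill x 0.016 metadata fraction = ~7.2TB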
 
Last edited:

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
Run "zdb -Lbbbs POOLNAME" on the existing pool to determine the metadata size. This will show you what your pool actually uses. You can plan around that.

Regarding the reliability of a mirror vs a three-way mirror, that depends on your own system requirements. If this is just a backup system, I think a mirror is fine. If the data is lost, it can be repopulated on the next Veeam backup. The statistical odds of both mirror drives failing AND your primary system suffering a data loss at the same time are pretty remote, but only you can make that call. For my system, I have three identical P3605 drives, but I am using two of them as a mirror for metadata and one as L2ARC. My logic is that I can remove the L2ARC at any time and use it to rebuild the metadata mirror if there were a failure. I get more value from the drive as L2ARC as opposed to sitting in a three-way mirror. There is the rebuild time to consider, but how long does it take an NVMe mirror to resilver? I also have a complete backup of the system.
That's the thing--I'm trying to do the sizing before ordering.
About the expected use, it's budgeted for Veeam storage but there does tend to be mission creep whenever tier 1 storage gets tight. For that reason I'd like to take advantage of these new fusion pools just in case.
I'm also thinking of migrating the L2ARC from 2 2tb drives to 1 4tb drive so I can have a 3-way mirror for metadata.
Practically speaking, with 720tb raw I would consider the box to be getting full by 300tb with 5 9-way raidz2 vdevs. I'm just trying to spitball how much metadata space that would mean.
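Just bracketing it with the 0.3% to 1.6% estimates I mentioned up top (spitballing, not measured):

300TB x 0.003 = ~0.9TB of metadata
300TB x 0.016 = ~4.8TB of metadata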
 

jasonsansone

Explorer
Joined
Jul 18, 2019
Messages
79
That's the thing--I'm trying to do the sizing before ordering.
About the expected use, it's budgeted for Veeam storage but there does tend to be mission creep whenever tier 1 storage gets tight. For that reason I'd like to take advantage of these new fusion pools just in case.
I'm also thinking of migrating the L2ARC from 2 2tb drives to 1 4tb drive so I can have a 3-way mirror for metadata.
Practically speaking, with 720tb raw I would consider the box to be getting full by 300tb with 5 9-way raidz2 vdevs. I'm just trying to spitball how much metadata space that would mean.

Don’t you already have a pool, somewhere, with this data? If not, I would calculate on the high side, but there is no way to know for certain without measuring your specific data.
 

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
Don’t you already have a pool, somewhere, with this data? If not, I would calculate on the high side, but there is no way to know for certain without measuring your specific data.
I tried "zdb -Lbbbs POOLNAME" but it only worked on freenas-boot. After a bit of googling and tinkering I got "zdb -Lbbb -U /data/zfs/zpool.cache poolname" to run (attached). Using the reddit link you posted above, I should be looking at the asize column of the L1, L2 and L3 ZFS plain file rows (we're not using zvols) for a total of less than 125gb. Even with a much larger pool, 8tb for metadata should be more than enough, which is what I want.
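For anyone else who runs into the same thing, this is roughly what ended up working (placeholder pool name; full output attached):

zdb -Lbbb -U /data/zfs/zpool.cache tank > zdb_output.txt
grep 'ZFS plain file' zdb_output.txt   # the L1/L2/L3 rows are indirect (metadata) blocks; the L0 row is the file data itself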

Thank you for this.
 

Attachments

  • freenas_zdb_output-091520.txt
    10.8 KB · Views: 319

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
I'm a bit confused about the 2TB S4510 NVMe L2ARC - Isn't that by definition / SKU a SATA drive? Also, what is the specific data you hope to host there if you're already using a metadata special VDEV? Just curious, as I'm planning to continue to use my L2ARC, retiring it from metadata-only use and graduating it to general data use but mostly because it's already there.

Re: metadata @ 1.6% - 5*7*16*0.8*.016 = 7.2TB, so that should be enough by a generous margin. However, @jasonsansone has it right - better to measure, if possible. Whether 2 mirrored NVMes is enough is a different question. I'm planning on using three partitioned S3610s for my 8-drive Z3 pool (500GB for metadata, 1.4TB for small files per disk), but my use case is different and I have a qualified cold spare sitting on standby, if needed.
My bad. The quote for the 2tb drives lists Intel SSDPE2KX020T8 as the model, which I believe is the p4510. The two sata drives for the back of the box are s4510 drives.

I've been thinking about the safety issue for the past couple of days and I think I would rather have a 3-way mirror for the metadata, just so it would match the redundancy of the rest of the pool (raidz2). I wouldn't technically need this for the backup box, but the bucket of money only comes by once. Also we set the backup boxes up to be able to take over for the production boxes if need be.

As for the intended use, my assumption was that the metadata special vdev is different from the L2ARC in that the L2ARC is essentially a file cache, while the special vdev contains system metadata like file and directory metadata, back-end database stuff, etc. I've used FreeNAS since 8.3 but I am by no means an expert in ZFS internals. Am I wrong about this?
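Put in terms of the actual knobs, my (possibly wrong) mental model is something like this, with a made-up dataset name:

zfs get special_small_blocks tank/misc      # 0 by default: only pool metadata lands on the special vdev
zfs set special_small_blocks=32K tank/misc  # putting a dataset's small data blocks on the special vdev is a separate, per-dataset opt-in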
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
I've seen estimates for anywhere from 0.3% to 1.6% for metadata, so I'm just trying to get the sizing nailed down before ordering. The other concern is the metadata is only mirrored with this provisional setup. Thoughts about risk?

We have been using 36-bay Supermicro boxes for a while and are happy with them, but for the latest pair of FreeNAS boxes we are looking at the 6049P-E1CR45L+ due to its larger capacity and optional NVMe storage. My big question is with the hybrid storage pools available with TrueNAS CORE, how much SSD space will I need for a metadata VDEV if I have 5 9-wide RAIDZ2 VDEVs? These systems will primarily be used for Veeam backups but will also be hosting misc NFS and iSCSI storage. Thanks in advance!

Provisional storage layout:
45 x 16tb Nearline SAS
768gb memory
2 x 240gb s4510 SATA drives (OS)
2 x 2tb s4510 NVMe (L2ARC)
2 x 375gb Optane NVMe (SLOG)
2 x 7.6tb NVMe (metadata)

Consider that Veeam data is almost entirely sequential. Also that compression and dedupe, with Veeam, are per job; and from what I've seen, Veeam does a better job of it than FreeNAS. Also consider: are you doing this as a LUN provisioned to a VM storage box? I can tell you first hand, you're far, far better off getting a RAID controller and doing it as a Windows box with the array formatted ReFS 64K; run that all natively.

All the rest of what you have there, save for the boot drives, straight up, isn't needed. Forget storage spaces too. Get a RAID controller.

I have a 36 bay x10 gen SM box as our backup storage; loaded with 4TB drives. It's acting as the proxy and storage in our environment. Did it as a 3-span RAID 60 (12R6 x3). I get absolutely ridiculous speeds out of Veeam, whereas when this box was running FreeNAS 11.3 I could barely crack 50Mbit/sec on the iSCSI interfaces for throughput. I get, frequently, 1GB/sec out of a backup job and Veeam is complaining the source is the problem (Nimble HF array). That's bytes...not bits, nor just streaming rates with TCP overhead; that's the Windows OS showing it, as well as the backup jobs reporting it. Granted our Nimble array is under a fairly moderate load constantly, but I was using the Nimble as a backup repository because performance was so pisspoor with FreeNAS as an iSCSI target (which is far from any sort of best practice, but I had to do something; one server took 84 hours to complete using the FreeNAS, took 12 with the Nimble!). It was fine at first. Don't get me wrong...but once we got FreeNAS beyond around 50% disk consumed, and seriously by 75%...forget it. It will suck. We spent the money on a SM 3108 controller and the capacitor module and we are not looking back; in fact, we're doing the same at another of our datacenters for the same reason.

As for the other thoughts, misc NFS and iSCSI storage; just want to say that if your backup system performance is important, dedicate the hardware. In fact, for what you're spending on the NVMe and Optane, get another box to act as your NFS/iSCSI misc needs. You can also cut that RAM down to 128GB and call it a day.
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
l2arc is a read-cache overflow from RAM's read-cache. With Veeam, you'll just be burning up SSDs as you'll never hit the L2ARC for reading data, but FreeNAS will think you might, so it'll just churn data into it over and over for that possibility. Writes go to RAM then get flushed to disk. Unless you have sync=on, then it's writes to RAM AND ZIL before reporting to the source that it has the data, then flushed to disk.
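(For reference, the knob here is the per-dataset sync property; pool/dataset names below are made up:)

zfs get sync tank/veeam          # standard / always / disabled
zfs set sync=always tank/veeam   # force every write through the ZIL/SLOG before it is acknowledged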

At no point with FreeNAS/TrueNAS is there a case where adding NVMe/SSD anywhere, save for going entirely flash, will make a system write faster. What's Veeam's primary thing? Writing lots and lots of data.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I’ve played round with veeam and caching. You are better off with a 16 bay qnap, throw in 4 s3700 ssd for a raid 10 write/read. It soaks up the iops for merges really well. Plus with ext 4 you don’t have the issues of zfs and COW. You really don’t want to go over 50% disk space...it’s not ideal. We are actually moving all our file servers off freenas and onto a big qnap and keeping freenas for terminal servers only. That’s where it shines.
 

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
Regarding the concerns about performance: I fully understand that FreeNAS is not the fastest storage solution on the planet. On the other hand we have never lost data in the years that we've been using FreeNAS, which is something we cannot say about traditional RAID storage solutions. The worst one was mostly self-inflicted, as these things always are...

We had a RAMSAN, an early all-flash fibre channel array. It was very fast, especially for the time, but one Sunday evening the storage disconnected and a number of our critical virtual machines went offline. We got on the phone with IBM support and started the array rebuild. We got all the way to 99%+ when the rebuild failed. After some investigation the engineer apologized and said that, due to a known bug in the firmware we were running, the array would not be recoverable and we would need to restore from backup. This is when we found out that our Tivoli guy had put the Tivoli index on the RAMSAN, and all the Tivoli backups were now offline as well. By this time it was around midnight on a Sunday, and we had to rebuild a big chunk of our environment by 8 am the next morning.

The worst issue we've seen with FreeNAS was back in the 9.3 days when snapshots failed to delete as scheduled. But we have never lost data.
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
It's not about being the fastest...it's the purpose.

If you plan to run 16tb drives and keep utilization under 80% (or the recommended 50%), you will be fine.

You aren't going to see huge benefits with the caching. I'm honestly getting better results with QNAP and its SSD caching for my file servers, but FreeNAS blows them away with the new fusion drives...just make sure you understand what you are moving to and why.
 

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
It's not about being the fastest...it's the purpose.

If you plan to run 16tb drives and keep utilization under 80% (or the recommended 50%), you will be fine.

You aren't going to see huge benefits with the caching. I'm honestly getting better results with QNAP and its SSD caching for my file servers, but FreeNAS blows them away with the new fusion drives...just make sure you understand what you are moving to and why.
The boxes we're using for Veeam storage are at 55% or so, so I am considering them effectively full. We have been providing 16tb iscsi luns for Veeam which we're formatting to ReFS with 64k clusters. I've been thinking of going to 32tb drives for the new box.

About the overall layout, the cost for the NVMe drives is not very high compared to 45 x 16tb sas drives, and I'd still like these boxes to be at least halfway performant for other workloads in case we have some sort of emergency and need to host production data on them.

Thanks everyone for your insights.
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
My thoughts still lean away from the use of any SSD; it's just going to get burned up in the process. Backups as a topology lend themselves well to large sequential writes (which spindles still outperform, save for NVMe), and the only read function is to verify the write was completed.

Having an L2ARC would just be silly; that data will get churned in/through RAM, then pushed off to L2ARC, and never get read enough to justify the write to the L2ARC; the read-back verify Veeam would be looking for would come from RAM, whereas later reads would need to come from the data VDEV anyway. Metadata writes would be pretty minimal as well, and by making it a fusion pool you're actually introducing an additional point of failure, as loss of that VDEV means the loss of the pool. Bonus...if you run out of metadata vdev space, writes start happening on the data vdev. The only way to correct that is to flush the pool and write the data back; something I'd not look forward to with backup data that has retention figures associated.
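(If you do go the fusion pool route anyway, the per-vdev allocation is at least easy to keep an eye on; placeholder pool name:)

zpool list -v tank   # shows ALLOC/FREE per vdev, including the special mirror, so you can catch it filling before writes spill onto the data vdevs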

I would also caution against using TrueNAS 12 in any type of non-GA Release, perhaps holding back until update1. Let others be your testers and deal with issues that may or may not break your ability to backup/restore data. You haven't yet faced a data loss, but you can face a capability loss due to instability.

Hardware RAID, specifically the Supermicro 3108 controller, with Windows, is what I would recommend for this system given its integration/functionality with Veeam. I was literally in your shoes, save for an older box / already acquired...originally it was running Ubuntu with OpenZFS and it sucked. We went with FreeNAS and performance was great until it wasn't and sucked. We blew the box up to start new again for Veeam, and performance never got to acceptable levels. Blew it up again as WS2019 and Storage Spaces w/ double parity, and it sucked more. Blew that out and swapped the 3008-IT controller for a 3108, RAID60 12d*3spans (same topology used with FreeNAS Z2 x 3vdev) - and it screams...

To that, you could run Hyper-V on the box and run a Linux VM to get your NFS / iSCSI kicks and still come out way ahead.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
The boxes we're using for Veeam storage are at 55% or so, so I am considering them effectively full. We have been providing 16tb iscsi luns for Veeam which we're formatting to ReFS with 64k clusters. I've been thinking of going to 32tb drives for the new box.

About the overall layout, the cost for the NVMe drives is not very high compared to 45 x 16tb sas drives, and I'd still like these boxes to be at least halfway performant for other workloads in case we have some sort of emergency and need to host production data on them.

Thanks everyone for your insights.
I like your thinking. ZFS will protect your data unless you REALLY do something to cause it not to. The price for this is speed, of course. I have seen other valid suggestions (ext4 and others) but they trade data reliability for speed...that's the trade-off. Right now I would not run v12 on anything production. According to the schedule iX put out themselves, they do not recommend 12 for anything critical (like backups IMO) until U1. I will probably wait until U2 (depending on the regressions that come up)..:)
 

kspare

Guru
Joined
Feb 19, 2015
Messages
508
I like your thinking. ZFS will protect your data unless you REALLY do something to cause it not to. The price for this is speed, of course. I have seen other valid suggestions (ext4 and others) but they trade data reliability for speed...that's the trade-off. Right now I would not run v12 on anything production. According to the schedule iX put out themselves, they do not recommend 12 for anything critical (like backups IMO) until U1. I will probably wait until U2 (depending on the regressions that come up)..:)

I'm running beta.21 in production, *but* all I use is NFS. I do not use *any* other function. If you are simply using it for NFS storage, I think you are pretty safe...it's been very successful for me.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I'm not an NFS user, but SMB and iSCSI..:)
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
Use of beta anything in production can and sometimes will lead to an unplanned resume updating event.
 

KevinM

Contributor
Joined
Apr 23, 2013
Messages
106
Use of beta anything in production can and sometimes will lead to an unplanned resume updating event.
In our case, version 12, if not 12.1, will be out by the time the boxes actually get here and are ready to be used. But for accounting reasons we have to submit the P.O. by the end of the month.

In regards to your other comments, you make an excellent case in support of hardware raid and I have no doubt the performance increase would be dramatic. But to us the more important issue is the safety of the data, which includes ease of replication to the off-site replication partner.

IMO ZFS performance could be improved if BPR (block pointer rewrite) were ever implemented, as fragmentation is really the Achilles heel of ZFS. I really hope the OpenZFS guys get around to developing this some day.
 

jenksdrummer

Patron
Joined
Jun 7, 2011
Messages
250
"But to us the more important issue is the safety of the data, which includes ease of replication to the off-site replication partner. "

There's a solid use case :)

I will still mention that I don't think L2ARC is going to help you; that, and 768GB of RAM is a lot of cache to burn through before anything hits L2ARC. You'll still get data written to L2ARC, but it will be aged out pretty far by that point.

What I wish could be done is, at the dataset level, being able to flag what can or can't be L2ARC'd once it falls out of RAM, and that way save that cache for something useful.
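(Skimming the zfs man page while writing this, the per-dataset secondarycache property may already get partway there; I haven't tuned it myself, so treat this as a sketch with a made-up dataset name:)

zfs set secondarycache=none tank/veeam        # keep this dataset out of L2ARC entirely
zfs set secondarycache=metadata tank/veeam    # or allow only its metadata into L2ARC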

I'm also not that sure that SSD for metadata vdevs is going to help much either. That's also assuming small-block I/O isn't part of the mix; you'd probably see a benefit with small blocks, but then once you run out of that vdev it spills over to the data vdev and the benefit is lost, while also increasing failure potential.

My statement was more akin to a general approach, not really towards anyone in particular. Just if one chooses to use beta anything in a production environment it's best to be prepared for that possible outcome :)
 