New to FreeNAS - Need Help Assessing what I have inherited


onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
Unfortunately it is not within our current budget to purchase new equipment, even though I priced out a JBOD to expand Gemini, so increasing the IOPS by adding another pool or expanding the existing one is out of the question right now. We do have a second server from 45Drives that is a mirror image (hardware-wise) of what is in Gemini. After we are done with some data cleanup and basic housekeeping, we are going to evaluate the space available to us. At this point, my boss, who is the CEO and makes all financial decisions, would like to see all the data on Gemini replicated to Apollo so that if Gemini stops working for any reason, production can continue. I am not sure that this is the best use of that piece of equipment, but I am in a situation where I need to use what we have, and I constantly have to work around production since they are processing something 24/7, which makes changes more difficult.

I also changed the MTU size to 9000, and when it was configured down to the workstation level (endpoint, switch, and server all set to 9000) the performance actually degraded. The following two screenshots were taken just a minute ago during production, which is under heavy load right now.

Jumbo Frames Enabled (9014):
[screenshot]


Jumbo Frames Disabled:
[screenshot]


(Also, nice naming convention. Just clicked that it's the spaceflight programs.)
Hah, yeah. We are located in Huntsville, AL, the 'Rocket City'.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Unfortunately it is not within our current budget to purchase new equipment, even though I priced out a JBOD to expand Gemini, so increasing the IOPS by adding another pool or expanding the existing one is out of the question right now.

Hi, Mr. Onigiri's CEO;

Please consider the ongoing impact of having your production runs take an unreasonably long time versus the one-time capital cost of increasing your back-end storage performance. "Throw hardware at it" is often the least expensive option compared with the cost of employee salaries, meeting contractual obligations with clients, or even being able to tighten up your SLA/deliverable dates to be more competitive in your field. A smaller (50TB?) array purpose-built to hold only a single active project might deliver significant ROI if it cuts the processing time by the large factor these benchmarks imply.

We do have a second server from 45Drives that is a mirror image (hardware-wise) of what is in Gemini.
That's Apollo, correct? Where's Mercury?

At this point, my boss, who is the CEO and makes all financial decisions, would like to see all the data on Gemini replicated to Apollo so that if Gemini stops working for any reason, production can continue. I am not sure that this is the best use of that piece of equipment, but I am in a situation where I need to use what we have, and I constantly have to work around production since they are processing something 24/7, which makes changes more difficult.

Question here is "what is the cost of the lost time" - that should determine how much is replicated and at what stage(s) it's done. A catastrophic failure of the array mid-production would likely result in having to "restart" one step of this processing - what's the most expensive step, and how "expensive" from a storage/capacity perspective is it to mirror the data before or after these steps?

I also changed the MTU size to 9000, and when it was configured down to the workstation level (endpoint, switch, and server all set to 9000) the performance actually degraded. The following two screenshots were taken just a minute ago during production, which is under heavy load right now.

If you're getting congestion, bad connections, or otherwise dropping packets, jumbo frames might hurt more than help. Can you monitor throughput against a production load in real-time, such as at the endpoint? This would tell you what the actual workload is seeing as a result, so that you optimize for that case.
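
Even something quick and dirty run from one of the workstations would do. Here's a minimal Python sketch of what I mean - the share path and test file are placeholders for whatever your endpoints actually map, so treat it as a starting point rather than a finished tool:

```python
import os
import time

# Placeholder paths - point this at the share your workstations map and at a
# representative multi-GB file that already lives on Gemini.
SHARE_FILE = "/mnt/gemini/benchmarks/sample_project_tile.bin"
BLOCK_SIZE = 1024 * 1024  # read in 1 MiB chunks

def measure_read_throughput(path, block_size=BLOCK_SIZE):
    """Sequentially read `path` and return the average throughput in MB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    return (total / (1024 * 1024)) / elapsed

if __name__ == "__main__":
    size_gb = os.path.getsize(SHARE_FILE) / 1e9
    print("Read {:.1f} GB at {:.1f} MB/s".format(size_gb, measure_read_throughput(SHARE_FILE)))
```

Use a file larger than the workstation's RAM (or a freshly copied one) so you're measuring the network and the array rather than the local cache, and run it once with jumbo frames on and once with them off while production is hammering the pool.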
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
That's Apollo, correct? Where's Mercury?
Hah, that is the next iteration. Either AWS or something like a Nimble all-flash array.

Question here is "what is the cost of the lost time" - that should determine how much is replicated and at what stage(s) it's done. A catastrophic failure of the array mid-production would likely result in having to "restart" one step of this processing - what's the most expensive step, and how "expensive" from a storage/capacity perspective is it to mirror the data before or after these steps?
The most important things to archive are the raw data from flight, because the cost to re-fly is astronomical and flights have to be done during fall-winter, in what is known as 'leaf-off', as well as the final deliverable data. As far as he is concerned, everything in the middle can be recreated; it just costs man-hours. It was not in the original plan to replicate all of the active production data until we sat down and started the discussion on how to balance between the three servers we have (the third being a 2-year-old EMC Isilon that houses our VMs and corporate data). And once we clean up the data, if Gemini has enough space, we may wind up replicating Gemini to Apollo for a basic DR scenario in the event Gemini dies.

Here is a proposed data workflow. Hydra is the EMC Isilon that only has 170TB available. Apollo and Gemini are the two FreeNAS boxes that we have been discussing here. The theory is to split the load between Hydra and Gemini: heavy processing items (SimActive and GeoCue) on Hydra, as it has been handling those processes for the past two years (until the purchase of Gemini and Apollo in Dec '17) like a champ with no hiccups (the only downside is available space), and the rest of production on Gemini. We don't quite know if this is possible yet, as we just had a meeting yesterday and flagged a ton of data for archival and deletion that is months to years old and has never been cleaned up.
[workflow diagram]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hah, that is the next iteration. Either AWS or something like a Nimble all-flash array.

Talk to Tegile and iXsystems as well if you're considering commercially supported ZFS or ZFS-like arrays.

The most important things to archive are the raw data from flight, because the cost to re-fly is astronomical ... as well as the final deliverable data. As far as he is concerned, everything in the middle can be recreated; it just costs man-hours.

That's definitely the most critical step if it involves planes in the air, and I imagine the deliverable data is small enough in size to be a non-issue compared to the raw grabs. Mirror the hell out of all of that data on both Gemini & Apollo.
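
If you do end up scripting that replication yourself rather than using the GUI, the mechanism underneath is just zfs snapshot/send/receive. A rough Python sketch of the idea - every dataset name and the Apollo hostname here are placeholders, not your actual layout:

```python
import subprocess
from datetime import datetime, timezone

# Placeholder names - substitute your real source dataset, the SSH-reachable
# hostname for Apollo, and the destination dataset on it.
SRC_DATASET = "gemini/production"
DST_HOST = "apollo"
DST_DATASET = "apollo/production-replica"

def take_snapshot(dataset):
    """Create a timestamped snapshot of `dataset` and return its full name."""
    snap = "{}@repl-{:%Y%m%d-%H%M%S}".format(dataset, datetime.now(timezone.utc))
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def replicate(new_snap, prev_snap=None):
    """Send `new_snap` to Apollo, incrementally against `prev_snap` if given."""
    send_cmd = ["zfs", "send"]
    if prev_snap:
        send_cmd += ["-i", prev_snap]  # only send blocks changed since prev_snap
    send_cmd.append(new_snap)
    recv_cmd = ["ssh", DST_HOST, "zfs", "receive", "-F", DST_DATASET]

    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(recv_cmd, stdin=sender.stdout, check=True)
    sender.stdout.close()
    if sender.wait() != 0:
        raise RuntimeError("zfs send failed")

if __name__ == "__main__":
    snap = take_snapshot(SRC_DATASET)
    replicate(snap)  # pass the previous snapshot name on later runs for incrementals
```

In practice the Periodic Snapshot Tasks and Replication Tasks in the FreeNAS GUI do exactly this on a schedule, so a script like the above is mostly useful for understanding (or ad-hoc testing) what the GUI is doing for you.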

Here is a proposed data workflow.

I like it. You've got the crucial separation between the "output of preprocessing" and "output of GeoCue/SimActive" so that you won't have the same disks trying to service both workloads. I'd strongly suggest that whichever of Gemini/Apollo is going to be the "primary" for getting hammered by Ortho/LiDAR/GIS be set up with mirror vdevs - so if that means "clean off Apollo, redo the pool as 30x2 mirrors" then Apollo becomes primary. That's the best possible random performance you'll get from spinning disks.
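
The back-of-the-envelope math behind that: for small random I/O, each vdev behaves roughly like a single disk, so pool IOPS scales with the number of vdevs rather than the number of drives. A quick sketch - the per-disk figure is just a ballpark guess for 7200rpm SATA drives, not a measured number:

```python
# Rough random-IOPS comparison: each vdev handles roughly one random write at
# a time, so pool IOPS ~ number of vdevs * per-disk IOPS.
PER_DISK_IOPS = 75  # ballpark for a 7200rpm SATA drive

layouts = {
    "10 x 6-wide RAIDZ2 (60 drives)": 10,  # 10 vdevs
    "30 x 2-way mirrors (60 drives)": 30,  # 30 vdevs
}

for name, vdevs in layouts.items():
    print("{}: ~{} random write IOPS".format(name, vdevs * PER_DISK_IOPS))
```

Mirrors can also service reads from either side of each vdev, so random reads scale better still.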
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
Mirror the hell out of all of that data on both Gemini & Apollo.
We actually archive this data at the appropriate step in the workflow.

set up with mirror vdevs - so if that means "clean off Apollo, redo the pool as 30x2 mirrors"
I thought about this, but as it stands, in a 10x6 RAIDZ2 we have 351.6TB available. Dropping down another 100TB will be a tough pill to swallow, as we already lost almost 100TB from reconfiguring from 4x15 to 10x6.

Gemini Current Status:
[screenshot]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
We actually archive this data at the appropriate step in the workflow.

Considering the cost of recreating that data I'd still almost be paranoid enough to mirror it immediately myself, following the 3-2-1 backup rule:
  • 3 copies of the data
  • 2 different types of media
  • 1 copy is offsite
I thought about [mirror vdevs], but as it stands, in a 10x6 RAIDZ2 we have 351.6TB available. Dropping down another 100TB will be a tough pill to swallow, as we already lost almost 100TB from reconfiguring from 4x15 to 10x6.

I'd still say "try it" if you're at a stage where you're about to clear out the array and go from 4x15 to 10x6. Do it as 24x2, leaving yourself with twelve unused drives - if the increase in performance isn't enough to justify the loss of space, build a 2x6 from the twelve remaining drives (giving you about 80TB usable), migrate the project over there, then destroy the mirror pool and extend with the 48 drives again in an 8x6 to get your original intent of 10x6.
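
To put rough numbers on the space trade-off, here's a sketch assuming 10TB drives (about what 351.6TB usable on the 10x6 Z2 implies); it ignores ZFS metadata and slop overhead, so the real figures come in a bit lower:

```python
# Back-of-envelope usable space for the layouts discussed, assuming 10TB
# drives and ignoring ZFS overhead (real numbers land somewhat lower).
DRIVE_TB = 10

def raidz2(vdevs, width):
    """Usable TB for `vdevs` RAIDZ2 vdevs of `width` drives (2 parity each)."""
    return vdevs * (width - 2) * DRIVE_TB

def mirror2(vdevs):
    """Usable TB for `vdevs` two-way mirror vdevs."""
    return vdevs * DRIVE_TB

print("10 x 6-wide RAIDZ2 (60 drives):", raidz2(10, 6), "TB")
print("24 x 2-way mirrors (48 drives):", mirror2(24), "TB")
print(" 2 x 6-wide RAIDZ2 (12 drives):", raidz2(2, 6), "TB")
print(" 8 x 6-wide RAIDZ2 (48 drives):", raidz2(8, 6), "TB")
```

Real usable numbers will be lower across the board (the 400TB line is your actual 351.6TB), but the ratios between the layouts hold.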

I'm harping on about mirrors for good reason - "more vdevs" is how you drive up pool performance, and this would be a good case study to say "look, prod throughput went up X-fold as a result, but we need to spend ($reasonable-amount) to maintain space parity."
 