New to FreeNAS - Need Help Assessing what I have inherited

Status
Not open for further replies.

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
Other than that, I don't really have much to offer here beyond your workload being a poster child for solid state.
Is this something that can be improved by using L2ARC and / or SLOG? Any advice for how to improve the situation above and beyond the pool of mirrors?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks for your response, HoneyBadger.

Here is a screengrab of gstat -p
It is constantly changing but there are always 1-2 drives in the red during refresh.
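For anyone wanting to watch the same thing, the invocation is basically just the line below; the -I refresh interval is optional and 1s is only an example value, not necessarily what I used:
Code:
gstat -p -I 1s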

Do your drives ever get a chance to "settle down" at any point, or are they basically always serving requests?

To respond to your question about how much data is in the pools: right now we are moving data from one server to another to reconfigure the drives from a 4x15 to a 10x6 vdev config. All of the data, roughly 380TB, does not fit on one server, so we have to put some of it on another server, which is an EMC Isilon. The data living on the Isilon that shouldn't be there right now is about 60TB.

If you configured both servers with the 30x2 mirror setup, you'd get a total of 263*2 = 526TB usable between them, and each server would be about 72% full (roughly 380TB of data against 526TB of usable space). That's a bit close to the "20% free space" guideline for my liking, and ZFS performs worse as the pool gets full.
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
If you configured both servers with the 30x2 mirror setup, you'd get a total of 263*2 = 526TB usable between them, and each server would be about 72% full (roughly 380TB of data against 526TB of usable space). That's a bit close to the "20% free space" guideline for my liking, and ZFS performs worse as the pool gets full.

There is no way to go back and reconfigure the one server at this point. Also, I am just referring to current data; we have anywhere from 15-30 projects going at a time that are constantly revolving, and raw data from the field is 10-50TB a pop, depending on what we are flying and what sensors are being used.

Do your drives ever get a chance to "settle down" at any point, or are they basically always serving requests?

The answer to this is no. We run two applications: 'GeoCue', which processes LiDAR imaging, and another called 'SimActive', which does a process called orthophotogrammetry, processing raw images, sometimes upwards of 3000 (300MB) .tif files, and laying them on top of a surface. These two processes, depending on the processing power associated with them, can take anywhere from days to weeks, and when one is done, another is almost immediately started.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Is this something that can be improved by using L2ARC and / or SLOG? Any advice for how to improve the situation above and beyond the pool of mirrors?

L2ARC could reduce the impact but it ultimately depends on how big the working set is. Since we're talking about 380TB of GIS data, if even 1/10th of it is being accessed during each "production run" we're talking about 38TB. You might be able to add something like 4TB of total L2ARC, and get 1/10th of that data - in essence, saving 10% of the workload from your disks.
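If you do want to test the theory, adding and removing a cache device is non-destructive. A rough sketch, assuming a pool named tank and an NVMe device at nvd0 (substitute your own names):
Code:
zpool add tank cache nvd0      # "tank" and "nvd0" are placeholder names
zpool iostat -v tank 5         # watch how much read traffic actually lands on the cache section
zpool remove tank nvd0         # cache devices can be removed again without harming the pool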

SLOG won't help - he's already writing async, so he's able to ingest into a txg at line-speed. The problem comes when his system is trying to flush the outstanding txg out to the back-end vdevs. If there was an SLOG in play, it would choke the incoming reads sooner, but ultimately your pool needs to have the ability to completely flush out the txg faster than it can fill, or the write-throttle kicks in.
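If you want to see where the throttle starts to bite on that system, the relevant tunables can be read back with something like the below - names are from memory, so double-check them on your build:
Code:
sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent vfs.zfs.delay_min_dirty_percent vfs.zfs.txg.timeout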
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
If there was an SLOG in play, it would choke the incoming reads sooner, but ultimately your pool needs to have the ability to completely flush out the txg faster than it can fill, or the write-throttle kicks in.
So an L2ARC would help some, but SLOG not at all. Other than that, more vdevs to allow more transactions in parallel.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There is no way to go back and reconfigure the one server at this point. Also, I am just referring to current data; we have anywhere from 15-30 projects going at a time that are constantly revolving, and raw data from the field is 10-50TB a pop, depending on what we are flying and what sensors are being used.

The answer to this is no. We run two applications: 'GeoCue', which processes LiDAR imaging, and another called 'SimActive', which does a process called orthophotogrammetry, processing raw images, sometimes upwards of 3000 (300MB) .tif files, and laying them on top of a surface. These two processes, depending on the processing power associated with them, can take anywhere from days to weeks, and when one is done, another is almost immediately started.

So 380TB right now, and 15-30 projects ongoing that generate between 10-50TB each.

Do these projects have relatively predictable timelines, and have you trended out your average TB/month growth rate? How much of this data is actively used on a day-to-day, and how much can be warehoused or moved to archival-tier storage? Any chance of being able to split the workload into multiple pools, or multiple systems?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
the system was purchased with 64GB RAM each and then they decided to upgrade to 256GB because of the terrible performance. Again, my predecessor just pointed to 45Drives and said 'it should have come configured right the first time'. It was then that they upgraded the memory. I will look into what it will take to upgrade that RAM, because if it is what I think it is (8x32GB sticks), then we will be sitting on a lot of VERY expensive RAM that we cannot re-appropriate and would need to replace with 8x64 or 8x128.
I can't remember if we talked about it, but what is the CPU utilization on the system?
 
Joined
Feb 2, 2016
Messages
574
Thanks, HoneyBadger.

Sounds like this is a problem that can only be solved with SSDs, spindles or both. The quickest way to get performance up to where it needs to be is to install an HBA with external ports, a drive enclosure and a bunch more disks configured as a stripe of mirrors. That'll greatly increase IOPS and bandwidth without having to replace the entire server.

Cheers,
Matt
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So an L2ARC would help some, but SLOG not at all. Other than that, more vdevs to allow more transactions in parallel.
L2ARC would only help as much of the workload as it can fit, and only then if it can stay loaded with hot data that isn't invalidated by new writes.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Thanks, HoneyBadger.

Sounds like this is a problem that can only be solved with SSDs, spindles or both. The quickest way to get performance up to where it needs to be is to install an HBA with external ports, a drive enclosure and a bunch more disks configured as a stripe of mirrors. That'll greatly increase IOPS and bandwidth without having to replace the entire server.

Cheers,
Matt

If you do this (HBA+enclosure), you'll want to create a new pool configured with mirror vdevs, pull the data off the old drives, then destroy the old pool and use the drives to make more mirror vdevs for "NewPool".

Expanding the current pool will still cause writes to be limited to the speed of the slowest vdev - in this case, an overloaded RAIDZ2. ZFS will try to balance out the writes to favor the vdevs with the most free space, but it doesn't totally ignore the other ones.
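As a rough sketch of the sequence (pool name and da numbers are placeholders, not your actual device list):
Code:
zpool create NewPool mirror da0 da1 mirror da2 da3 mirror da4 da5   # the new HBA/enclosure disks
# ...copy the data over, verify it, destroy the old pool, then grow with the freed drives:
zpool add NewPool mirror da20 da21 mirror da22 da23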
 
Joined
Feb 2, 2016
Messages
574
Agreed... Create new striped mirror pool. Move data. Redo old pool as striped mirror. Balance IO across the two pools.

Or wipe the old pool drives then mirror them and add them to the new pool for additional space. At no point would you add the RAIDZ2 drives back into the new pool as RAIDZ2 or you'd tank the wonderful performance you just established.

As nifty as it would be if a magic performance tuning trick could solve this problem, this is looking more and more like a spend-money-and-throw-hardware-at-it situation.

Cheers,
Matt
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
Perhaps configuring my 2nd server (apollo) with this new config, rsyncing between the two, then moving production to the 2nd server and reconfiguring the 1st server (gemini).
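Something like the below is what I have in mind for the copy itself (the paths are made up, not the real dataset names):
Code:
rsync -avhP /mnt/gemini/projects/ root@apollo:/mnt/apollo/projects/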

Also, in regards to the data management: we archive onto tape every chance we get. I wish it was something better, but alas, BackupExec for us :(

Raw Data > Archived after pre-processing
Output of pre-processed data > archived after it is run through 2 specific applications
production data > straight up deleted after project is delivered and paid for and the client's period to request changes expires
deliverable data > archived after project is paid for and the client's period to request changes expired.
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
In reading another thread, for testing purposes I reduced the block size to 8k and my performance increased 1000% on random block writes but halved my sequential. I made this change after reading into what block size my EMC Isilon uses (8k) and comparing it to the advice in the other thread to reduce it from 128k to 16k.

Now it remains to be seen how it holds up in long-term production, as this was just benchmarking.
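For reference, the change itself was just the dataset property, roughly like the below (dataset name is a placeholder); it only applies to blocks written after the change:
Code:
zfs set recordsize=8K gemini/projects   # placeholder dataset name
zfs get recordsize gemini/projects      # confirm the property took effect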

Here is during production:
[screenshot]


Here is after hours, as you can see, not much difference:
[screenshot]


In comparison, here is the Dell EMC Isilon (all HDD) not Hybrid:
[screenshot]


Now, I normally see an 8k block size on databases. What long-term negative impact will I see by leaving the block size changed?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
Now, I normally see an 8k block size on databases. What long-term negative impact will I see by leaving the block size changed?
This actually makes a bit of sense. The "File Geo Database" in ArcGIS is a database that is made of individual files, and many of them are small files. The 8k blocks are probably more applicable to this type of data.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Smashing results. I admit I'm not familiar with ArcGIS but if it's generated a single large file when you imported the data, the default recordsize of 128K would mean that a small write (8K) inside that file would necessitate a partial record rewrite, which sucks performance-wise. If it always writes in a fixed size to that "database file" then definitely match your recordsize to equal it.

More importantly, is the benchmark performance being borne out by real-world results as well? You might not see much of an improvement if it still has to chew through the 128K records that already exist. Doing a send/recv or even manually cp/mv'ing the data would rewrite it with the smaller blocks though.
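A minimal sketch of that rewrite, with made-up dataset names - note that a plain send doesn't carry properties, so set recordsize=8K on (or above) the destination first and let it inherit:
Code:
zfs snapshot gemini/projects@rewrite
zfs send gemini/projects@rewrite | zfs recv gemini/projects-8k   # data is rewritten with the new recordsize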

As far as what the long term implications would be - more blocks means more metadata for ZFS to keep track of, and that means more space used by things that aren't your actual data. It also hurts the larger sequential transfers, as you saw from your benchmarking, but likely the additional small-block performance is much more valuable here.

A bit of searching on the ArcGIS software didn't say much other than that the storage should be "optimized for random small I/O" - a recordsize would be really neat to have, if it uses a fixed one.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
A bit of searching on the ArcGIS software didn't say much other than that the storage should be "optimized for random small I/O" - a recordsize would be really neat to have, if it uses a fixed one.
We deal with these ArcGIS File GeoDatabase things where I work, but apparently not as much as @onigiri, and I know that a significant number of the files are as small as 1k while other files can be megabytes. It is the worst collection of data I have had to deal with.
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
A bit of searching on the ArcGIS software didn't say much other than that the storage should be "optimized for random small I/O" - a recordsize would be really neat to have, if it uses a fixed one.

ESRI, the developers of the software, want everyone to move towards the Enterprise version, where the metadata is entered into an actual PostgreSQL database. The cost of that Enterprise software is astronomical for what we use ArcDesktop and ArcMap for. Like I mentioned earlier, our heaviest IO applications are SimActive, which drapes flight images over a surface layer, and GeoCue, which classifies GBs of point cloud LiDAR data.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's not a huge leap to assume that ArcDesktop/ArcMap will treat that "File Database" in a similar manner to a PostgreSQL DB - but I don't know this directly. (There is a 21-day trial of ArcDesktop Pro, maybe I should grab that and monkey with it.)

If you create a new dataset (or have set recordsize=8K on an existing one) and then create/copy a new "filedb" then it should be written aligned.

recordsize is an upper bound; your lower bound is related to the ashift value*, so smaller files will only get padded up to the size of 2^ashift and not the recordsize. Overhead from smaller files shouldn't change much, but you'll generate a little more metadata, as something that was previously a single 128K record on disk is now sixteen 8K records.

The other thing to ask is how the reduced sequential writes will impact your initial raw data load speeds if you copy them into a recordsize=8K filesystem. If your sequential write speeds are halved then your time to do the initial raw data load could double.

It's a matter of understanding how the data is written and read at each step. Sequential access in big chunks works better with the larger recordsize, random access in small chunks works better with a smaller one. You'll just have to pick where you want to experience the pain when you hit a workload where "ideal write" is "big sequential" but "ideal read" is "small random" and throw your best hardware at that bottleneck.

* we should probably check, do
Code:
zdb -U /data/zfs/zpool.cache | grep ashift
and it will give a list of rows - they should all show 12 in your case I believe
 

onigiri

Dabbler
Joined
Jul 17, 2018
Messages
32
zdb -U /data/zfs/zpool.cache | grep ashift

Thanks for the continued replies, HoneyBadger.

As requested:
[screenshot: zdb ashift output]

Also, the raw data brought in is a mixture of large and small files, but they are not compressed (zipped) and imported as one file. The raw data comes from the plane sensors and goes through a process to be converted into usable imagery, at which point it becomes .tif images.


Here is a typical example of a folder that contains the raw images to be processed with our application. For every 1GB image there is a 1KB 'world' file. This one folder is ~2.4TB and there are 10 of these folders for this project.
[screenshot: folder listing of raw images and world files]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
That's a lot of data. With them being large TIFFs the read pattern should be "large sequential" - the TFW file is pretty much inconsequential based on relative size, so this stage of the data is probably best served with large recordsize - possibly even bigger than the default 128K.

This is a situation where I'd want the output to go to an entirely different zpool (or array, even) so that you aren't having reads and writes fighting for time on the same vdevs and thrashing the disks. Is it possible to have the data shuffle back and forth between them at each stage, or maybe add a third array with a performance focus?

Going off your earlier flowchart for the data, you had this:

Raw Data > Archived after pre-processing
Output of pre-processed data > archived after it is run through 2 specific applications
production data > straight up deleted after project is delivered and paid for and the client's period to request changes expires
deliverable data > archived after project is paid for and the client's period to request changes expired.

Raw data comes in from the field and gets dumped into a pool on your first array with the default recordsize (128K) since it's a mix of small and large files. (Archive raw data to tape)

Pre-processing occurs and the TIFFs end up in a pool on a different array with a large recordsize (1M) because they're 1GB files. (Archive TIFFs to tape)

GeoCue and SimActive read from this pool (in big chunks) and write their "database-style" output to a pool/dataset (on a new third array?) with recordsize (8K) - this is your "production dataset" and where you'd want to throw your best hardware, so that your ArcGIS crew can work on the production data with max performance and generate the deliverable results. (Delete when done)

Where the deliverables end up probably isn't as critical since I assume they're not edited after generation, just read and reviewed, so a general file server is probably fine. (Archive to tape)

That would result in a minimum of "read and write to same array" - but you'd need to be mindful of how much data is in flight at any given time. If the "production dataset" is small then you might actually be able to look at using SSD for the third new array; if not, then definitely go with mirrors if you're using spinning disks.
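If it helps to see it on the ZFS side, the per-stage datasets might look something like this - pool/dataset names are invented, and the 1M recordsize needs the large_blocks feature enabled on that pool:
Code:
zfs create -o recordsize=128K ingest/raw        # field dumps, mixed file sizes (128K is the default anyway)
zfs create -o recordsize=1M staging/tiffs       # 1GB TIFFs, big sequential reads
zfs create -o recordsize=8K production/working  # GeoCue/SimActive "database-style" output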

(Also, nice naming convention. Just clicked that it's the spaceflight programs.)
 