ZFS configuration options for storing millions of images

Status
Not open for further replies.

c4st

Cadet
Joined
Nov 27, 2018
Messages
1
I am designing a system capable of working with 15 million (and growing) image files ranging from 100 KB to 10 MB. After some preliminary research, our first choice is ZFS on FreeNAS. I was hoping I could get the community's take on our proposed hardware and ZFS configuration. We recognize we're dealing with a few non-optional/odd configurations and requirements.

Disclaimer: I'm a software engineer; I typically do not work at the filesystem level, so I apologize in advance if anything mentioned below is woefully inaccurate in relation to ZFS.

Environment:
  • Concurrent Connections: This is an internal application that will only be utilized by a few folks. It's safe to say there will never be more than 5 concurrent connections accessing this data. In most (95%) cases, only one or two users may be accessing this data concurrently.
  • Network: This data will only be accessed internally on a 1GbE network, though a 10GbE network is in the works for systems that will read this data.
  • Hardware: The hardware we have allocated for this project is a mixture of older enterprise grade hardware and consumer grade storage (great combination, I know).

    Dell R720xd w/ 24x 2.5” bays
    RAM: 128GB RAM (more can be allocated if needed)
    CPU: 2x E5-2620 @ 2.20GHz
    Storage:
    8x2TB SSDs local storage (Crucial MX500)
    1x500GB SSD for OS and Database (Crucial MX500)
    RAID: H310 (IT Mode)
The Data
  • In almost all cases, the data will be permanently stored on the drive after the initial write.
  • The data will not be modified (edited, compressed, resized, etc) after the initial write.
  • The directory structure of the data is non-optimal [1]; due to the design of the application pulling this data, it is more or less immutable.
  • The data should be read optimized which includes, but may not be limited to: random/sequential reads, directory listings, etc.
  • There will be new images written to the file structure on a fairly regular basis, but the write performance is not much of a concern.
  • Data will be processed (image hashing, facial recognition vectors, etc) and stored in a MySQL database via a few Docker Containers on the same box to maximize processing performance.
  • Data will be read/consumed by the user via SMB (windows) or NFS (linux) mounts.
  • Data will be written via SMB (windows) or NFS (linux) mounts.
  • There is a significant number of identical files (a conservative estimate is 20%, but it could be considerably more); due to the design of the application pulling this data, we can't delete the duplicate filenames.
  • We have about 10TB worth of data currently.
  • We do not intend for this data to grow rapidly, maybe 1TB a year.
  • We typically do a full directory backup of the data every month.

Concerns:
  • Since we're planning on using consumer-grade SSDs (unless an argument can be made to use 2.5" spinners), we're concerned about the longevity of the drives (particularly with metadata thrashing).
  • Is the hardware sufficient for this use case?
  • We'd like to dedupe the data if it's feasible (more on this in the configuration section below) and if it will result in significant space savings.

High Level Configuration

Based on my introductory research, I think this may be a good starting point (again, apologies if I'm woefully inaccurate on anything below); a rough sketch of the commands I'm picturing follows the list:
  • 1 pool
  • 1 raidz1 vdev (8x2TB SSD) - this may eventually expand to 3 raidz1 vdevs of the same size (24 bay enclosure)
  • compression=on - with lz4 compression, if this isn't enabled by default
  • ashift=12 (4K block size)
  • dedupe=on - this may be a pipe dream, though. With 128GB RAM I think I can make this work (assuming ~5GB of dedup table per 1TB of storage). I may be better off writing a script that simply creates (potentially millions of) hard links for identical images instead.
  • sync=disabled - from what I understand, disabling sync means the ZIL/SLOG is effectively bypassed, so I may lose up to ~5 seconds (configurable?) of newly written data in the event of a hardware failure. If this is the case, I am OK with a few seconds of new data loss as long as existing data on the system is safe.
  • no SLOG - see sync=disabled explanation above
  • no ZIL
  • no L2ARC - I don't see any reason why this would be needed with SSDs?
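
To make the above concrete, here is roughly what I'm picturing at the command line. The pool, dataset, and disk names are placeholders, and I understand FreeNAS would normally do all of this through the GUI, so please treat it as a sketch of intent rather than the exact plan:

Code:
# Hypothetical pool: a single 8-wide raidz1 vdev (device names are placeholders)
# (I believe FreeNAS handles 4K alignment on its own; with plain OpenZFS you could add -o ashift=12)
zpool create tank raidz1 da0 da1 da2 da3 da4 da5 da6 da7

# Dataset for the image tree, tuned for write-once/read-many data
zfs create tank/images
zfs set compression=lz4 tank/images
zfs set sync=disabled tank/images   # accepting the loss of a few seconds of new writes

# After the data is loaded, simulate dedup without enabling it: zdb prints the
# expected dedup ratio and table size, which should tell us whether dedupe=on
# is worth the RAM
zdb -S tank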



I would love to hear the community's feedback/suggestions on the above proposal. I am happy to provide any additional information that may be needed.

Thank you!

[1]

Example directory structure - none of the directory/filenames are normalized in any way.

Code:
+ root directory 1
   - sub directory 1
      - image 1
      - image 2
      - image 3
      - ...
      - image n (where n is between 1 and 1,000+)
   - sub directory 2
      - image 1
      - image 2
      - image 3
      - ...
      - image n
   - ...
   - sub directory n (where n is between 1,000 and 30,000)
      - image 1
      - image 2
      - image 3
      - ...
      - image n
+ root directory 2
+ ...
+ root directory 15
 
Joined
Feb 2, 2016
Messages
574
In a past life, I was the image archive guy for a major market newspaper.

To answer your questions directly...

* If your data is mostly write once, any SSD should be fine. Writes wear out SSDs while reads are effectively free. (We use whatever is the least expensive SSD available and haven't yet burned one out in two years of use with a higher write load than you're estimating. You can also keep an eye on wear with SMART; see the note after this list.)

* Your hardware sounds plenty up to the task as does your configuration for file services. (Not knowing about your applications, I can't suggest how it will perform with the applications.)

* Dedupe is scary. I've yet to see a situation where it is a good idea and effective. If you're only saving 20%, the cost to make up the difference with additional storage is under $500. I'd gladly pay $500 not to have the complexity and ticking time bomb of dedupe.
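
On that wear point, SMART data gives a rough gauge of how the drives are holding up. Something like the following from a shell (the device name is a placeholder and the exact attribute names vary by SSD vendor):

Code:
# Full SMART report for one drive
smartctl -a /dev/ada0

# The wear-related attributes are the ones to watch; on many SATA SSDs:
smartctl -A /dev/ada0 | grep -i -e wear -e lifetime -e percent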

Other thoughts...

* A mirrored stripe would be SIGNIFICANTLY faster than RAIDZn at the cost of storage efficiency. Knowing nothing about your application, RAIDZ2 may be fast enough on its own using SSDs. Another advantage of a striped mirror is that you can add two drives at a time to expand instead of having to add, as you propose, eight drives at a time. (We can hide two drives in the supply budget and no one cares. If we went to buy eight at a time, we'd need a meeting, project plan and budget line-item.)

* Databases very much prefer striped mirrors to RAIDZn for the additional IOPS but that may not be necessary since you're running on SSDs.

* Use RAIDZ2, not RAIDZ1: it is the sweet spot between capacity and reliability (when not using mirrors).

* If you're only adding 1TB a year, I'd leave write sync on. It'll still be plenty fast and improves reliability and you're not concerned about write speeds.

* If you're running applications on the FreeNAS server itself, you may wish to have multiple storage pools so your largely static data and your actively written application/database data can have distinct storage and replication policies. You may also choose a higher compression level for your static data (gzip-9, if your images are compressible) and lz4 for your active application data; see the sketch after this list.

* Adding a second (much lower powered, much lower cost) FreeNAS server would allow you to replicate your data hourly (or even more often) with little if any performance impact. Depending on the value of your data, it might be worth it compared to monthly backups. Snapshotting and replication are base FreeNAS features and highly recommended (the sketch after this list shows what the underlying send/receive looks like).

* I'm a big fan of separating storage from applications and don't typically use FreeNAS for anything other than storage. This is personal preference and there is nothing really wrong with doing that as you suggest. Just not my cup of tea.

* You can always easily add SLOG/L2ARC later if needed. I'd go without first. (ZIL always exists, by the way. The only question is if it lives on the data disks or on its own storage.)

* Looks like you've put a lot of thought and design into your project. Seems solid.
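
For what it's worth, here's a rough sketch of what the dataset split and the replication boil down to at the ZFS level. Pool, dataset, host, and snapshot names below are placeholders; in practice FreeNAS exposes all of this through the GUI's snapshot and replication tasks:

Code:
# Separate datasets (or pools) so static images and active app/database data
# can carry different compression and snapshot policies
zfs set compression=gzip-9 tank/images   # only pays off if the images are compressible
zfs set compression=lz4 tank/app

# Hourly replication is essentially incremental snapshot send/receive
zfs snapshot tank/images@hourly-1500
zfs send -i tank/images@hourly-1400 tank/images@hourly-1500 | \
    ssh backup-nas zfs receive backuppool/images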

Cheers,
Matt
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
Good information provided, should help considerably in getting you squared away.

First:
  • Disclaimer: for a business use case you should go with a supported device, i.e. an iXsystems TrueNAS or similar. You don't want to be the primary support for such a device if it isn't your primary job function.
Second:
  • I would suggest Samsung over Crucial, as there have been firmware issues recently with MX500 devices. Prices are similar. Go with Samsung Pros if you want more write endurance.
Pool structure:
  • Standard practice would be Z2, as Z1 is deprecated due to rebuild times and the chance of a second drive failure. With 24 bays I would suggest an 8-wide Z2, adding further 8-wide Z2 vdevs as needed for pool expansion.
  • Z2 with 8x2TB gives ~9TB usable once you account for formatting overhead and the 80% maximum pool usage guideline.
  • You could go with a single 12x2TB Z2 vdev if you want more upfront storage space, or two 6x2TB Z2 vdevs for more performance.

Dedupe I'll leave to others to comment on, I have no direct experience with it on FreeNAS.

You always have a ZIL; you can add a SLOG, but it's likely not needed if you aren't worried about sync=always and write performance.

Backups should be daily; with block-level change backups and your use case, they would be very small, quick backups. Or use snapshot replication. RAID, whether ZFS or hardware, is not a backup, and once a month is a recipe for misery later.

You need to price and spec a UPS and set up automatic shutdown in case of power failure; any business-critical system requires this as standard.

Compression:
  • An idea I've seen on the forum is to start with maximum compression before you copy data to the pool. When all the old data has been copied, change compression to lz4 for new incoming data. This should offer significant space savings on the initial data seed (example below).
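
A minimal illustration of that, assuming a placeholder dataset called tank/images (compression settings only affect blocks written after the property is changed, which is what makes this trick work):

Code:
# Before the initial bulk copy: maximum compression for the one-time data seed
zfs set compression=gzip-9 tank/images

# ...copy the existing 10TB in...

# Afterwards: switch to lz4 for new incoming data
zfs set compression=lz4 tank/images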
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Standard practice would be Z2 as Z1 is deprecated due to rebuild times and chances of a second drive failure.
I was about to say this, but does it really hold for SSDs?
you can add a SLOG, likely not needed if you aren't worried about sync=always and write performance.
...and if you're using an all-SSD pool anyway, it seems unlikely it would do much for you even with sync enabled.
 
Joined
Feb 2, 2016
Messages
574
I was about to say this, but does it really hold for SSDs?

To be safe, I'd say that is still the right advice to give.

But, you're right, I've searched for answers and haven't found anything especially satisfying. The way SSDs fail is different than the way conventional drives fail. From everything I've read, they don't typically die outright but tend to become read-only. I'm not sure how that will play out with ZFS. I'm going to guess that, in another two or three years, we'll have better answers. Today, however, I don't think enough SSDs have failed to have a clear idea as to what happens.

Or, maybe not. We're still in the phase where SSDs keep getting better in terms of both speed and size. My gut tells me that people are still upgrading before SSDs are wearing out. So, maybe it won't be two or three years but five years?

Cheers,
Matt
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
I was about to say this, but does it really hold for SSDs?

...and if you're using an all-SSD pool anyway, it seems unlikely it would do much for you even with sync enabled.

Agreed. While most SSDs are more reliable than HDDs, have a better uncorrectable bit error rate, and resilver faster, they have a much greater chance of things like firmware issues.

Overall I would stand by greater redundancy vs. less.
 

RickH

Explorer
Joined
Oct 31, 2014
Messages
61
I run 6 FreeNAS servers that are used to store imaging data (my company actually scans/converts 30 million plus pages a year and provides hosting for document management software), so I have quite a bit of experience with similar types of data. ZFS/FreeNAS is definitely up to the task (I have one server with more than 250 million individual files), but there are some considerations...

In addition to all of the advice everyone else has given, I would offer the following thoughts:

  • As others have stated - don't even consider using Z1, the capacity hit you'll take to move to Z2 is well worth the added redundancy and peace of mind.

  • Don't touch dedupe! 20% is nowhere near a large enough gain to justify the extra headaches.

  • I don't have exact data on the H310 with the backplane that's in the R720 servers, but I have used a crossflashed H310 with the backplane in another Dell server and wasn't able to get more than 200 MiB/sec of throughput from any single drive under load, and that number dropped when all drives were accessed in parallel. As you're planning on using all-SSD storage, your major throughput bottleneck is going to be the H310/backplane combo; you're going to be paying a lot of money for SSDs that you won't see the full performance benefit from. (There are still other gains, but it's just something to consider.)

  • You mention that the folder structure isn't optimized, and the example you show indicates that there may be some folders with 30K+ items within. I've run into similar scenarios, and this is going to severely affect your performance, especially over SMB. There are some SMB and ZFS parameters that will help, but they come with some trade-offs:

    For SMB you can disable extended attributes and legacy DOS attributes if your client applications aren't using them. In my experience this can significantly increase performance on directories with 5K+ items. To do this, add the following lines to your SMB config:
    Code:
    ea support=no
    store dos attributes=no


    For ZFS tuning I would definitely disable atime and possibly configure your cache to store only metadata (this made a pretty big difference on my arrays with spinning drives; I'm not sure how it would translate to an all-SSD array). Atime is controllable through the GUI; to set the caching option, run the following from a shell (see also the note at the end of this post):
    Code:
    zfs set primarycache=metadata poolname/datasetname
  • You indicate that you're planning on sharing over NFS and SMB. If you're considering sharing the same files/dataset over both protocols, I would warn against it. NFS and SMB use different methods of locking files, and you're asking for trouble by enabling both for the same files...
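
For reference, the atime setting mentioned above can also be done from the shell (the dataset name is a placeholder; the GUI accomplishes the same thing):

Code:
# Disable access-time updates on the image dataset
zfs set atime=off poolname/datasetname

# Verify the tuning properties afterwards
zfs get atime,primarycache poolname/datasetname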
 