Is fresh written data immediately in the READ cache?


Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
Hi All,

I tried to search the forum, but maybe I simply did not use the right search terms, so I am asking via this new thread. As I am new to ZFS, I want to understand the following before the final installation:

Is fresh written data immediately in the read cache?

As we are working in content creation, we produce a big load of fresh data per day that is usually required to be read immediately after writing (think of image sequences being rendered, then played back for review). For this scenario it's very helpful if any freshly written data is automatically available from the read cache.

Is that the case with ZFS?

If not, is there an easy way to accomplish this?

Best regards
Axel
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The question is too ambiguous to provide an answer. It will be cached, as long as other requirements don't result in it being discarded, such as the file being so big it doesn't fit in RAM, etc. etc.
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
Well, then I'll try to be more precise:

Say we have a 32 TByte pool made out of 16 TByte vDevs.
Add 4 TByte for ZIL write SSD cache.
Add 4 TByte of L2ARC read SSD cache.
Add 96+ GByte of RAM.

I write an image sequence of say 50 GByte and want to play it afterwards. The SSD cache is sized to be at least 2-3 times bigger than all aggregated reads and writes of an average day put together. The idea is that it should feel like we never run from disks, only from SSD.

The point is:
I understand that the ZIL and L2ARC caches are completely independent systems, so data landing in the write cache isn't necessarily in the read cache as well.
On the other hand, I understand that ZFS employs a technique called COW (copy on write) to keep the disks as fragmentation-free as possible. That would mean a freshly written file is copied by ZFS itself to another target position after being closed. I have not found many details on this process, so I am not sure I understood it correctly. To do such a COW, I think it would have to read the file first before writing it again. But will that cause it to become available in the read cache?

Or is there another trick to make a file available in the read cache after writing, without any user intervention?

I used e.g. PrimoCache on Windows Server before, which actually does what I want: any block written to a disk set is cached. Applying a last-use algorithm then keeps freshly written data in the cache for reading. And it uses just one cache, not two independent ones, which makes this easier.

And how would you balance the sizes of the SSD read and write caches in the above scenario?
Would a 2 TByte ZIL write cache and a 6 TByte L2ARC read cache be more efficient?
Of course we have more reads than writes, many times more reads.


Best regards
Axel
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Say we have a 32 TByte pool made out of 16 TByte vDevs.
Add 4 TByte for ZIL write SSD cache.
Add 4 TByte of L2ARC read SSD cache.
Add 96+ GByte of RAM.

Most SLOG devices are 8-10GB in size, because they're only as big as what a few seconds of sustained writes will occupy. The rest of that 4TB will be completely unused. 4TB of L2ARC will also require some pretty phenomenal amounts of RAM to index.

If you're at the point of considering 4TB of L2ARC, you should probably just be looking at a pure-SSD pool, and a process to cycle older/stale data to a separate, slower pool.
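
To put rough numbers on the SLOG sizing point above, here is a back-of-the-envelope sketch in Python. The transaction group interval and the safety margin are illustrative assumptions, not ZFS defaults to rely on:

Code:
# Rough SLOG sizing sketch: the SLOG only needs to hold the sync writes
# that arrive between transaction group commits, not days of data.
# Assumptions (illustrative only): ~5 s between commits, room for 3 groups.

def slog_size_gb(link_gbit_per_s: float, txg_interval_s: float = 5.0,
                 txg_margin: int = 3) -> float:
    """Rough upper bound on useful SLOG size for a given ingest rate."""
    bytes_per_s = link_gbit_per_s / 8 * 1e9   # line rate in bytes per second
    return bytes_per_s * txg_interval_s * txg_margin / 1e9

print(f"{slog_size_gb(10):.1f} GB")   # even a saturated 10 GbE link -> ~18.8 GB

Anything beyond that order of magnitude on a 4 TB device would simply sit unused, which is the point being made above.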
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
Most SLOG devices are 8-10GB in size, because they're only as big as what a few seconds of sustained writes will occupy. The rest of that 4TB will be completely unused. 4TB of L2ARC will also require some pretty phenomenal amounts of RAM to index.

If you're at the point of considering 4TB of L2ARC, you should probably just be looking at a pure-SSD pool, and a process to cycle older/stale data to a separate, slower pool.

OK, so the write cache can be fairly small, I understand. Current hardware RAID controllers often have something like 32-64 GBytes of RAM cache. Maybe we can rely on that alone?

I don't think that a pure SSD pool would be much better, as we want this kind of automatic "tiering" by SSD-caching an HDD-based disk pool. If we have to move data around manually, it just complicates things and causes administrative trouble.
Given the size of 64+ TBytes for our main pool (plus a 1:1 mirror), that is going to be a very expensive pure SSD pool IMHO.

I read here: http://mags.acm.org/communications/200807/?pg=49#pg49 that we will need about 1/50th of the size of the SSD cache as indexing RAM.
That works out to 4000 GByte / 50 = 80 GByte of RAM. That is not "phenomenal" nowadays, is it?
Or do you have other, preferably "real world", values for me?

Best regards,
Axel
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
That article is from 2008 and looks like it has nothing to do with ZFS.

The L2ARC-to-ARC usage ratio varies greatly with workload; it's about 200 bytes per record, if I recall.

A "safe" bet for 4TB of l2ARC would be in the 256GB-386GB range in RAM.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
OK, so the write cache can be fairly small, I understand. Current hardware RAID controllers often have something like 32-64 GBytes of RAM cache. Maybe we can rely on that alone?

A RAID card with 32GB of RAM that can be used for stable write logging? I very much doubt that. Most RAID cards are holding maybe 2GB of battery-backed or flash-backed RAM.

I don't think that a pure SSD pool would be much better, as we want this kind of automatic "tiering" by SSD-caching an HDD-based disk pool. If we have to move data around manually, it just complicates things and causes administrative trouble.
Given the size of 64+ TBytes for our main pool (plus a 1:1 mirror), that is going to be a very expensive pure SSD pool IMHO.

If you're going to be writing an estimated ~1TB a day, you could write that to a smaller 4TB SSD-only pool, then have a batch job to copy the files over to the massive array of spinning disks during off-peak hours. Once the files are verified to be successfully copied, you free up the space on the SSD pool.
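
As a rough illustration of that workflow, here is a minimal sketch of such a batch job in Python. The mount points are hypothetical placeholders, and a real job would add logging, locking and error handling:

Code:
# Hypothetical nightly migration job: copy finished files from a fast SSD
# staging pool to the big HDD pool, verify the copy, then free the SSD space.
# Paths below are placeholders, not FreeNAS defaults.
import hashlib, os, shutil

SRC = "/mnt/ssd-pool/staging"   # assumed SSD staging dataset
DST = "/mnt/hdd-pool/archive"   # assumed bulk HDD dataset

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for root, _, files in os.walk(SRC):
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(DST, os.path.relpath(src, SRC))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)              # copy with timestamps
        if sha256(src) == sha256(dst):      # verify before deleting
            os.remove(src)                  # free up the SSD pool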

I read here: http://mags.acm.org/communications/200807/?pg=49#pg49 that we will need about 1/50th of the size of the SSD cache as indexing RAM.
That works out to 4000 GByte / 50 = 80 GByte of RAM. That is not "phenomenal" nowadays, is it?
Or do you have other, preferably "real world", values for me?

L2ARC utilization isn't really as simple as a ratio, but that ratio is usually "1:4" or "1:5" if you expect ZFS to be able to use any of that RAM for caching. By default ZFS only lets you use 1/4 of your total ARC size for metadata (including the L2ARC index) and while you could override vfs.zfs.arc_meta_limit to let it use the entire ARC for metadata storage, that means essentially zero reads from RAM.

Each entry in the L2ARC index uses roughly 180 bytes. ZFS uses a dynamic block size, with a max of 128KB - if you're writing exclusively sequentially in big swaths of data and have the clients mount over NFS with rsize/wsize=131072, you might actually see that, and you'll have ~33.5 million 128KB records consuming 180 bytes for a total of about 5.6GB of RAM.

Now, that may actually work, provided that ZFS never writes a block smaller than 128KB.

But as soon as it does, the RAM requirement starts to reach for the sky. An 8KB record takes the same 180 bytes to index as a 128KB one, but there would be 16 times as many of them for the same data volume, which means 16x the RAM to index it.
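
Putting those figures into a quick sketch (using the ~180 bytes per header quoted above, which is approximate and varies between ZFS versions):

Code:
# RAM needed to index an L2ARC of a given size, as a function of record size.
HEADER_BYTES = 180   # approximate per-record L2ARC header size quoted above

def l2arc_index_gib(l2arc_tib: float, record_kib: int) -> float:
    """RAM (GiB) needed to index an L2ARC of l2arc_tib TiB at record_kib KiB records."""
    records = l2arc_tib * 2**40 / (record_kib * 2**10)
    return records * HEADER_BYTES / 2**30

print(f"{l2arc_index_gib(4, 128):.1f} GiB")  # ~5.6 GiB with 128 KiB records
print(f"{l2arc_index_gib(4, 8):.1f} GiB")    # ~90.0 GiB with 8 KiB records (16x)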

ZFS caches things primarily based on MFU and MRU - "most frequently used" and "most recently used." There's a complex data flow as well on ingest, where the data will get into a write buffer, but may or may not make it to the primary RAM cache (ARC) based on MRU/MFU values compared with what's already there. And the stuff that gets evicted may or may not have landed in ARC already because of the L2ARC write limiter (obviously a device intended to be read cache shouldn't have all of its bandwidth consumed by writes) and other data already in L2ARC.

You definitely will have a high-churn system here and in my opinion the "tiered storage" needs to get some stronger separation between the tiers to avoid a digital traffic jam.
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
You can add DIMMs to Areca cards, but the most they support is 8GB, I believe.

You are better off with an NVMe SSD for the ZIL, IMO.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I think you want to read up a little more on ZFS. First, there really isn't a write cache (well, there sort of is). The ZIL handles writes. The ZIL exists in RAM, and data is written to the pool as needed. If the system receives a request for a sync write, then the data in the ZIL must also be written somewhere else; by default it is written to the pool, and then the write is acknowledged.

If you want to improve performance, you can use a SLOG, which is a very fast (SSD or NVMe), power-protected device that handles the sync writes: the writes are acknowledged more quickly and performance improves (versus writing to the pool and waiting for the acknowledgement). The writes will still come from RAM; the SLOG only gets read back if there was a power failure and the ZIL wasn't fully written to the pool. And if you aren't doing sync writes, then this is meaningless.
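
For anyone unsure what a "sync write" means in practice, here is a small, generic illustration in Python (POSIX behavior, not ZFS-specific); the SLOG only comes into play for the second case, where the application waits until the data is on stable storage. The filenames are placeholders:

Code:
# Async vs. sync writes from the application's point of view.
import os

# 1) Async write: the data lands in the server's RAM and is flushed to the
#    pool later; the application does not wait for stable storage.
with open("render_frame_0001.tmp", "wb") as f:       # placeholder filename
    f.write(b"...frame data...")

# 2) Sync write: fsync() forces the data onto stable storage before the call
#    returns. This is the path where a fast SLOG pays off on ZFS.
with open("database.journal", "wb") as f:             # placeholder filename
    f.write(b"...journal entry...")
    f.flush()
    os.fsync(f.fileno())                               # wait for stable storage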

And second, there isn't a tiered storage model like some other large SAN providers offer. Hence the suggestion for an all-SSD pool.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I'm with depasseg. Your hardware choices alone indicate you are throwing everything at it to see what sticks. This is going to cost you a crapload of money to buy the hardware, then a crapload more to figure out what you screwed up so you can fix it. And by crapload, I'm talking 5-digit mistakes.

You need to go back and do a LOT more reading. You're still in over your head.

Example:
I didn't see the equation that gives you 80GB, but you need something like 300 bytes of RAM for each *block* you write. If your blocks are 512 bytes, obviously you'd need almost as much RAM as you have L2ARC. Luckily the blocks will probably be bigger, but you need to know what the block size is going to be so you can do your own math. We generally recommend your L2ARC not exceed 5x your system RAM. Which means that for a 4TB L2ARC you'd want about 768GB of RAM.
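
The ratio rule of thumb from the paragraph above is just arithmetic; a tiny sketch restating those figures:

Code:
# Rule of thumb quoted above: keep the L2ARC no larger than roughly 5x system RAM.
def max_l2arc_gb(ram_gb: float, ratio: int = 5) -> float:
    """Largest L2ARC the 5:1 rule of thumb suggests for a given amount of RAM."""
    return ram_gb * ratio

# The proposed 96 GB of RAM supports roughly 480 GB of L2ARC by this rule,
# so a 4 TB L2ARC is about 8x over the guideline.
print(max_l2arc_gb(96))          # 480.0
print(4000 / max_l2arc_gb(96))   # ~8.3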

There is much more that I could quote you on, but you really need to understand these fundamentals, not just regurgitate them because someone said "do it this way".
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
The article was linked from here:

https://blogs.oracle.com/brendan/entry/test

and it is referenced by Brendan Gregg, who is apparently the developer of the L2ARC for ZFS?
So I think it has quite a lot to do with ZFS...

However, I have no evidence that the 50:1 ratio really holds.

I'd really like to understand why the L2ARC is so memory hungry, as you write, and what is stored in those 180 bytes per cached block.
Is there any detailed explanation somewhere?

I am used to other block caching systems (OK, with fixed block sizes) that were pretty efficient: with no checksums involved, they simply needed the block address on the HDD and the address of the RAM block where the cached copy can be found. Even with a more complex block address and a checksum, I don't see why it should take 180 bytes... Is that file based?


Regarding RAID controller cache sizes:
I had devices like Infortrend external RAID enclosures in mind, where I can easily see values of up to 16 GBytes of buffered cache inside each controller, so a redundant controller pair will have 32 GBytes of cache. They now use a large capacitor to write the DRAM cache to flash in the event of a failure.

Maybe it's enough to forget about a dedicated ZIL SSD and simply rely on a large controller DRAM cache.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The L2ARC index doesn't just store a pointer to the data; it also stores information used for caching decisions, such as MFU/MRU values.

As far as I can find, those Infortrend systems only refer to the memory as "cache", which is most likely read cache.
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
The L2ARC index doesn't just store a pointer to the data; it also stores information used for caching decisions, such as MFU/MRU values.

As far as I can find, those Infortrend systems only refer to the memory as "cache", which is most likely read cache.

Well, yes, the other block caches I am used to do the same: they store the time of last use and maybe count how often a block was used. But that's just a very few bytes. Anyhow, if it's 180 bytes, then it's 180 bytes; if it's 300 bytes, then it's 300 bytes. I am not going to redesign ZFS ;-)


Regarding the write cache in such enclosures:
It depends on your settings. You can of course use a write-through policy, read-only caching, etc. However, most read caches don't work efficiently on external RAID enclosures unless you make sure the data is defragmented and sequential. That's why we regularly defragmented the systems. And I can see how the cache fills when writing, and that I can always write to the RAIDs faster than I can read from them, because writing fills the cache first, while reading has to go to the HDDs first. This is common among many RAID subsystems I have used in the past, like Infortrend, Accusys, EasyRAID, FibreNetix, DataDirect, etc.

You decide in the RAID settings if and how to use the controller cache. Since the advent of protected cache RAM, write caching is standard among most of these vendors. And I know it works, because I use it all day.

Best regards,
Axel
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
What protocol are you using to share this network storage with the end users?

Your system RAM will be used as write cache in ZFS. And if you are using CIFS all this talk about write cache (controller or SLOG) is irrelevant.
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
I'm with depasseg. Your hardware choices alone indicate you are throwing everything at it to see what sticks. This is going to cost you a crapload of money to buy the hardware, then a crapload more to figure out what you screwed up so you can fix it. And by crapload, I'm talking 5-digit mistakes.

You need to go back and do a LOT more reading. You're still in over your head.

Example:
I didn't see the equation that gives you 80GB, but you need something like 300 bytes of RAM for each *block* you write. If your blocks are 512 bytes, obviously you'd need almost as much RAM as you have L2ARC. Luckily the blocks will probably be bigger, but you need to know what the block size is going to be so you can do your own math. We generally recommend your L2ARC not exceed 5x your system RAM. Which means that for a 4TB L2ARC you'd want about 768GB of RAM.

There is much more that I could quote you on, but you really need to understand these fundamentals, not just regurgitate them because someone said "do it this way".

Currently we already throw a crapload of money at the energy consumption of the older system, which is based on 8 external RAID enclosures.
Given today's HDD sizes and the options of SSD caching and/or storage tiering, I can do the same or better with just a single enclosure.
Coming from the Windows server side, I really like many of the things I read about ZFS while researching alternatives. And some things I don't like so much.
Obviously I can only bring in my experience with the storage and SAN systems I have used before, not yet knowing the specifics of ZFS very well. I am happy for any feedback/link/resource in this regard.
If I do not need to buy a lot of SSD cache for the ZIL / write cache, then hey, just fine. And if it needs to be SLC memory etc., then that's valuable information.
I just hope to provide at least twice the amount of read cache of what we really touch per day, so that the actual traffic from/to the HDD storage is minimized and made as efficient as it can be.
I'll try to understand the ZFS system before actually implementing one, because getting it wrong would throw a lot of money away. It's pretty hard to find any benchmarks or performance figures for ZFS in a use case like post production / content creation. We deal with huge amounts of streaming data and single-frame file sequences. When I throw e.g. a 2K/4K cinema encoding job at the render farm (10 computers, 16 cores each), it will read 160 files and write another 160 files simultaneously with high throughput. Bandwidth is always the limiting factor here. Right now I can do 300+ fps in 2K and 90+ fps in 4K. I can kill the performance of about any HDD-based storage system with this. SSD-based storage helps a lot here.

So much for the "why am I doing this".

Best regards
Axel
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
What protocol are you using to share this network storage with the end users?

Your system RAM will be used as write cache in ZFS. And if you are using CIFS all this talk about write cache (controller or SLOG) is irrelevant.

We have mostly Windows 7 and Windows 8.1 clients, and a few Mac OS X clients. I intend to use SMB shares in most cases.
Right now we run a mix: some machines are 1 GBit Ethernet SMB clients, while the more important ones run on an FC SAN using MetaSAN over 4 GBit FC. We plan to move away from the shared SAN to an Ethernet-only approach with 10 GBit clients and possibly a 20 to 40 GBit server connection.

IMHO a perfect system would be one that can utilize the local clients' SSDs as a cache for anything they pull from the network or write to it.
I tried to convince the MetaSAN developers to implement this, but they dropped the whole product altogether.
QLogic has actually implemented such a thing in hardware now, with SSD-based local block storage caching for any data that runs through the FC SAN. And their numbers match what we see when doing this manually on our workstations using the OCZ Revo3x2 drives we have. You can easily make your network 10 to 100 times faster using this approach of decentralized caching, because it frees up all the data paths between the systems. Empty paths mean much higher bandwidth when needed. Unfortunately it works only with block storage and costs a real fortune.
Microsoft is on a good road with its approach of local caching for WAN shares (BranchCache). However, the built-in latencies don't yet make it work for SMB shares on a local network. That might change at some point. People completely underestimate the performance gain from decentralized SSD caching altogether.

Best regards,
Axel
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
I recommend contacting ixsystems and discussing your needs.

I'd lean towards a RAM-heavy 512GB-1TB system with multiple L2ARC SSDs for your read cache. And since you are doing CIFS (SMB), make sure you get a CPU with the fastest frequency possible. Write cache (meaning SLOG) isn't really relevant since you aren't doing sync writes. But if you see write performance issues, you could easily add the smallest Intel DC S3700 (200GB?).

As for decentralized caching, I'm with you, in fact Dell (and others) are adding Tier 0 (local caching) extensions of their SAN to the local workstations.
 

Axel Mertes

Cadet
Joined
May 7, 2015
Messages
9
I've already contacted them and am waiting for a reply.

I also watched their introduction videos and participated in a webinar.

Regarding the L2ARC cache:
I originally considered using the Samsung 845DC Pro for this. Would they work for the ZIL too?
I keep reading that L2ARC SSDs don't really need to be very "safe", so there is no real need for battery backup etc. Would a set of Samsung 850 EVOs then be enough for the L2ARC?
Using 4 to 8 of them should bring between 2 and nearly 4 GByte/s of transfer speed in theory. No idea how that behaves as L2ARC.
Anyone?

Best regards,
Axel
 