OK, so the write cache can be fairly small, I understand. Current hardware RAID controllers often have something like 32-64 GBytes of RAM cache. Maybe we can just rely on that alone?
A RAID card with 32GB of RAM that can be used for stable write logging? I very much doubt that. Most RAID cards hold maybe 2GB of battery-backed or flash-backed RAM.
I don't think that a pure SSD pool would be much better, as we want this kind of automatic "tiering" where the SSDs cache an HDD-based disk pool. If we have to move data around manually, it just complicates things and causes administrative trouble.
Given the size of 64+ TBytes for our main pool (plus a 1:1 mirror), a pure SSD pool is going to be very expensive IMHO.
If you're going to be writing an estimated ~1TB a day, you could write that to a smaller 4TB SSD-only pool, then have a batch job to copy the files over to the massive array of spinning disks during off-peak hours. Once the files are verified to be successfully copied, you free up the space on the SSD pool.
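Something along these lines, run from cron during the off-peak window, would do the copy-verify-free cycle. This is just a rough sketch - the mount points, directory layout and checksum choice are placeholders I made up, not anything your setup actually has:

#!/usr/bin/env python3
# Rough sketch of an off-peak migration job: copy files from the small SSD
# pool to the big HDD pool, verify the copy, then free the SSD space.
# /ssdpool/ingest and /bigpool/archive are made-up paths for illustration.
import hashlib
import shutil
from pathlib import Path

SSD_POOL = Path("/ssdpool/ingest")    # hypothetical fast landing zone
HDD_POOL = Path("/bigpool/archive")   # hypothetical spinning-disk archive

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def migrate() -> None:
    for src in SSD_POOL.rglob("*"):
        if not src.is_file():
            continue
        dst = HDD_POOL / src.relative_to(SSD_POOL)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)            # copy the file plus timestamps
        if sha256(src) == sha256(dst):    # only free SSD space after a verified copy
            src.unlink()
        else:
            dst.unlink()                  # bad copy: keep the original, retry next run

if __name__ == "__main__":
    migrate()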
I read here:
http://mags.acm.org/communications/200807/?pg=49#pg49 that we will need about 1/50th of the size of the SSD cache as indexing RAM.
In turn, 4000 GByte / 50 = 80 GByte of RAM. That is not "phenomenal" nowadays, is it?
Or do you have other, preferably "real world" values for me?
L2ARC sizing isn't really as simple as a single ratio, but the ratio usually quoted is "1:4" or "1:5" if you expect ZFS to still be able to use any of that RAM for caching data. By default ZFS only lets metadata (which includes the L2ARC index) use 1/4 of your total ARC size, and while you could override vfs.zfs.arc_meta_limit to let metadata take the entire ARC, that would mean essentially zero data reads served from RAM.
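To put numbers on that (just restating the figures from your own posts, not measurements of mine): with the article's 1/50 rule and the default metadata quarter, the ARC would have to be enormous before an 80GB index even fits.

# Back-of-envelope only: the 1/50 figure is from the ACM article, the 1/4
# metadata cap is the ZFS default (FreeBSD sysctl vfs.zfs.arc_meta_limit).
l2arc_bytes = 4000 * 10**9            # the 4TB SSD cache from your example
index_bytes = l2arc_bytes / 50        # the article's 1/50 rule -> ~80GB of index
arc_needed  = index_bytes * 4         # index has to fit inside the metadata quarter
print(f"index ~{index_bytes / 1e9:.0f} GB, ARC needed at defaults ~{arc_needed / 1e9:.0f} GB")
# -> index ~80 GB, ARC needed at defaults ~320 GB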
Each entry in the L2ARC index uses roughly 180 bytes. ZFS uses a dynamic block size with a max of 128KB - if you're writing exclusively sequentially in big swaths of data and have the clients mount over NFS with rsize/wsize=131072, you might actually see that, and you'd have ~33.5 million 128KB records, each consuming 180 bytes of index, for a total of about 5.6GB of RAM.
Now, that may actually work, provided that ZFS never writes a block smaller than 128KB.
But as soon as it does, the RAM requirement starts to reach for the sky. An 8KB record takes the same 180 bytes to index as a 128KB one, but there would be 16 times as many of them for the same data volume, which means 16x the RAM to index it.
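Here's that arithmetic as a sketch so you can plug in your own record size - the 180 bytes per entry is the rough figure from above, not an exact constant:

# Approximate ARC metadata needed to index an L2ARC of a given size,
# assuming ~180 bytes per cached record (rough figure, not exact).
ENTRY_BYTES = 180

def l2arc_index_ram_gib(cache_bytes: int, record_bytes: int) -> float:
    """GiB of index RAM to cover the whole cache at one record size."""
    records = cache_bytes / record_bytes
    return records * ENTRY_BYTES / 2**30

cache = 4 * 2**40                      # 4 TiB SSD cache
print(f"128K records: {l2arc_index_ram_gib(cache, 128 * 2**10):.1f} GiB")  # ~5.6 GiB
print(f"  8K records: {l2arc_index_ram_gib(cache, 8 * 2**10):.1f} GiB")    # ~90.0 GiB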
ZFS caches things primarily based on MFU and MRU - "most frequently used" and "most recently used." There's also a complex data flow on ingest: data goes into a write buffer first, and may or may not make it into the primary RAM cache (ARC) depending on how its MRU/MFU standing compares with what's already there. And what gets evicted from ARC may or may not land in L2ARC, because of the L2ARC write limiter (obviously a device intended to be a read cache shouldn't have all of its bandwidth consumed by writes) and the data already sitting in L2ARC.
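To illustrate just the write-limiter part (a toy model, nothing like the real feed thread in ZFS - the actual knob on FreeBSD is vfs.zfs.l2arc_write_max, which caps how much gets fed to the cache device per interval):

# Toy model: blocks evicted from ARC are only admitted to L2ARC until the
# per-interval write budget is spent, so a high-churn workload mostly bypasses it.
WRITE_MAX_PER_INTERVAL = 8 * 2**20     # e.g. 8 MiB per feed pass (illustrative value)

def feed_l2arc(evicted_blocks, l2arc):
    """Admit evicted (block_id, size) pairs until this pass's budget runs out."""
    budget = WRITE_MAX_PER_INTERVAL
    for block_id, size in evicted_blocks:
        if size > budget:
            continue                   # skipped: never lands in L2ARC this pass
        l2arc[block_id] = size
        budget -= size

l2arc = {}
feed_l2arc([("a", 4 * 2**20), ("b", 6 * 2**20), ("c", 2 * 2**20)], l2arc)
print(sorted(l2arc))                   # ['a', 'c'] - 'b' was over the remaining budget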
You will definitely have a high-churn system here, and in my opinion the "tiered storage" needs stronger separation between the tiers to avoid a digital traffic jam.