ZFS memory requirements

Status
Not open for further replies.

taylorjonl

Dabbler
Joined
Dec 14, 2013
Messages
11
I posted in off-topic because I will not be using FreeNAS, but this forum seems to have the most ZFS knowledge for hobbyists. I currently have a box built from these parts:
  • Tyan S7012 Motherboard
  • Intel Xeon E5620 processor, 2.4GHz
  • Kingston ValueRAM 12GB DDR3
  • Areca ARC-1680IX-24 PCIe x8 SAS RAID Card
  • Seagate Barracuda LP 2TB 5900RPM SATA 3Gb/s x20
  • OCZ Vertex 2 120 GB SSD
This is all in a 24-drive case with SAS backplanes. I have had this hardware for a while but haven't done anything with it, partly because of other priorities and partly because I am nervous about using HW RAID: I fear the card will fail and I will have to spend another $1k on a new card, and it also makes upgrades harder.

So I went on eBay and bought three IBM M1015s, which I will reflash with the LSI firmware to make them pure HBAs.

So here is how I am thinking of configuring this box:

X X X X
X X X X

X X X X
X X X X

A + O O
* * * *

X = storage
A = l2arc
+ = zil
O = spare
* = fast

The storage drives will be my 2TB Seagates, set up as two RAIDZ2 vdevs striped into one pool; the fast drives will be some 1TB WD Reds set up as a RAIDZ vdev. The L2ARC will probably be my OCZ Vertex 2, at least at first. For the ZIL I was looking for a cheap, small SLC SSD, around 20GB. I am open to suggestions but would like to use as much of the hardware I currently have as possible.

The motherboard is dual processor (only one socket is populated right now) with triple-channel memory and 18 DIMM slots. It supports up to 144GB; if I keep going with 12GB triple-channel kits I max out at 72GB. The 12GB kits are around $100 each, and the 24GB kits are around $300. Also, each processor only drives 9 of the DIMM slots, so to use all 18 I have to buy a second ~$400 CPU.

So here are the two options:
144GB = $300x6 + $400 = $2200
72GB = $100x5 + $400 = $900
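Written out as a quick sanity check (the prices are the rough figures above, and the $400 in both options is the second CPU needed to drive the other nine DIMM slots):

```
# Cost comparison for the two RAM ceilings described above.
# Assumes I keep my existing 12GB kit in the 72GB option and replace it in the 144GB option.
option_72gb = {
    "ram_gb": 12 + 5 * 12,          # existing kit + five more 12GB kits = 72GB
    "cost_usd": 5 * 100 + 400,      # five kits at ~$100 + ~$400 second CPU = $900
}
option_144gb = {
    "ram_gb": 6 * 24,               # six 24GB kits = 144GB
    "cost_usd": 6 * 300 + 400,      # six kits at ~$300 + ~$400 second CPU = $2200
}
for name, opt in [("72GB", option_72gb), ("144GB", option_144gb)]:
    print(f"{name}: ${opt['cost_usd']} total (~${opt['cost_usd'] / opt['ram_gb']:.1f}/GB)")
```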

So, ZFS loves RAM. I have read the rule of thumb is 1GB of RAM per 1TB of raw storage plus 1GB per 50GB of L2ARC. First, does that sound accurate?

If I do the math, I have 32TB in Seagate drives, 4TB in WD Red drives, and 120GB of L2ARC, so from the rule above I need around 38GB of RAM just for ZFS? If that math is correct I would probably round it up to 40GB. I also want this box to host some VMs (using KVM) in addition to being my NAS.
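Written out as a quick calculation (this just applies the rule of thumb above to my drive counts; nothing here is measured):

```
# ZFS RAM estimate from the "1GB per 1TB raw + 1GB per 50GB L2ARC" rule of thumb.
storage_raw_tb = 16 * 2      # sixteen 2TB Seagates in the two RAIDZ2 vdevs = 32TB
fast_raw_tb    = 4 * 1       # four 1TB WD Reds in the "fast" vdev = 4TB
l2arc_gb       = 120         # OCZ Vertex 2

ram_for_pools_gb = storage_raw_tb + fast_raw_tb   # 1GB RAM per 1TB of raw storage
ram_for_l2arc_gb = l2arc_gb / 50                  # 1GB RAM per 50GB of L2ARC
total_gb = ram_for_pools_gb + ram_for_l2arc_gb
print(f"~{total_gb:.1f} GB of RAM for ZFS alone")  # ~38.4GB, which I would round to 40GB
```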

I plan on using SmartOS, which is based on Illumos, which is based on OpenSolaris. It has a concept similar to FreeBSD jails, so I *think* I can build OpenIndiana in a "zone" since it is also based on Illumos. I would also host some other zones for maybe a web server, ftp, sshd, etc. As much as possible I would use zones over KVM, but I do need (or at least want) to host a Windows VM for my HTPC (I get my cable through a Ceton tuner).

My main question: I know that ZFS will consume as much RAM as it is allowed, and I thought I had read that it will give memory back if the system needs it. So if I have 72GB of RAM with the above configuration and then start reserving RAM for VMs and zones, what happens as I take more and more RAM away from ZFS?

Does ZFS just lose performance down to some floor, or will the universe implode? For example, if I reserve 40GB for VMs and zones, that leaves 32GB for ZFS; if I reserve 48GB, that leaves 24GB for ZFS. Will it still work, just not as well as it would with more RAM?

I think 32GB for VMs and zones will probably be enough for me, but I really want to understand how the system will behave if I need to push it harder.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Yes, ZFS loves RAM. There are dangers with virtualization, and as I understand it one of them is the temptation to under-resource ZFS. There are a couple of very knowledgeable people on this forum who have much more experience with this than me, so maybe they will pop in and say a few words.
If you have 2 RAIDZ2 vdevs with 8 drives each, you will have four parity drives in the pool, so I don't think this will give you 32TB of usable storage - have I misunderstood you here?

Next, I don't think a RAIDZ1 array will perform the way you hope for VM storage, and as I understand it, using a pool in a VM to serve as storage for other VMs on the same host is not a recommended practice. VMs need low latency for acceptable performance. If you have decided to go this route, look at striped mirrors. I tried 4x 1TB drives in a striped mirror configuration as VM storage for a separate ESXi host using iSCSI and MPIO but couldn't get it to work acceptably. It may just have been my configuration, but I believe ZFS needs to be tuned for low latency to do what you need - again, other members here have a lot more experience doing this sort of thing, so they should be able to advise you on the best course.

From what I have read, I don't think you will need the L2ARC provided you give your VM the proper amount of RAM to begin with, and an L2ARC is no substitute for ARC. A ZIL may help with your VM storage because of the well-known sync-write issue (again, others can offer more wisdom here than me).

At any rate, figure out the amount of usable storage you will have and the recommended amount of RAM for that configuration, and stick with it. Failing to do so is universe-ending stuff, like dividing by zero. Well, maybe not really, but it is still a bad idea.

Use your 120GB SSD for VM storage and it will outperform any pool you are likely to configure by a large margin. Also, you didn't say, but I hope you will be using pass-through for those nice shiny M1015s?
 

taylorjonl

Dabbler
Joined
Dec 14, 2013
Messages
11
KMR, the 32TB is raw storage; the usable storage is 10.9TB x2 according to this calculator for two 8-drive RAIDZ2 vdevs:

http://www.servethehome.com/raid-calculator/

Do you calculate the RAM requirements based on the raw storage or based on usable storage?

I will reflash the firmware as described here for IT-mode operation (basically HBA, no RAID).

http://www.servethehome.com/ibm-serveraid-m1015-part-4/

Anyone else think I don't need L2ARC? If you think I need low-latency storage for VMs, how about a RAID10 of 4x SSDs? Or maybe just a RAID1 of 2x SSDs? For now I may just use my existing SSD, or a 250GB one I have lying around, for VMs, then migrate to something better later once I know how things are working and what needs improvement.

If someone has suggestions on a better way to utilize these drives, I am open to them. I have been reading about optimal drive counts in various RAIDZ arrays, and it seems like an 8-drive RAIDZ2 isn't such a good plan:

http://forums.freenas.org/threads/what-number-of-drives-are-allowed-in-a-raidz-config.158/#post-457

Maybe a better choice is either a single 10-disk RAIDZ2 for around 14.6TB, or three 6-disk RAIDZ2 vdevs for around 7.3TB x3. Thoughts?
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Your original post says you have 20x 2TB drives... is that correct? RAIDZ1 performs best with an odd number of drives and requires a minimum of 3. RAIDZ2 performs best with an even number of drives and requires a minimum of 4.

Check out Cyberjock's guide. It should be stickied somewhere on the forums here.

Also, after re-reading your original post I see that you are using an OS based on OpenSolaris. I'm not that familiar with it, but is ZFS on your host OS an option? Just a thought.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Guide is in my sig.

The 1GB per 50GB rule is only valid for certain situations. What you should look at is your VMs' actual data usage. If your VMs will be constantly accessing 200GB of data, then you should shoot for a 200GB+ L2ARC. There is also the rough 1:5 ratio of RAM to L2ARC, so you'd need about 40GB of ARC just for a 200GB L2ARC. That probably means 48+GB of RAM for FreeNAS. If you plan to run multiple VMs on ZFS you will almost certainly need an L2ARC, so I'd definitely budget the cost of one into the initial build.
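As a quick worked example of that ratio (the 200GB working set is hypothetical, purely to show the arithmetic):

```
# ARC needed to index a given L2ARC, using the rough 1:5 RAM-to-L2ARC rule of thumb above.
l2arc_gb = 200                   # hypothetical L2ARC sized to the VMs' hot data
arc_needed_gb = l2arc_gb / 5     # L2ARC header entries live in ARC (RAM)
print(arc_needed_gb)             # -> 40.0, which is why 48+GB of system RAM is suggested
```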

As for a ZIL, I like the 32GB SLC Intel X25-Es from eBay. I bought 5 of them for $130 each and they all had 98 or 99% lifespan remaining despite having years of uptime.

Now, running VMs from RAIDZ1 and RAIDZ2 is difficult. For most people we recommend a bunch of mirrored vdevs in the pool, which gives much higher IOPS and throughput. If you choose RAIDZ1 or RAIDZ2 you are going to be fighting an uphill battle, which may force you to get more RAM so you can have an even bigger L2ARC.

Using an SSD for VMs is absolutely possible. That helps a lot in the arena of IOPS and potential throughput. You may still need lots of RAM, but the L2ARC and ZIL may become unnecessary. I say "may" because mileage will vary depending on your SSDs. Also keep in mind that if you use SSDs you can expect them to wear out faster than you are used to in a desktop environment.

RAIDZ1s aren't that reliable in this day and age. Please read the link in my sig on the topic.
 

taylorjonl

Dabbler
Joined
Dec 14, 2013
Messages
11
KMR, the OpenSolaris-derived systems have ZFS support comparable to FreeBSD's, as far as I know.

I have done some further reading:
For better performance, a mirror is strongly favored over any RAIDZ, particularly for large, uncacheable, random read loads.
When determining how many disks to use in a RAIDZ, the following configurations provide optimal performance. Array sizes beyond 12 disks are not recommended.
  • Start a RAIDZ1 at 3, 5, or 9 disks.
  • Start a RAIDZ2 at 4, 6, or 10 disks.
  • Start a RAIDZ3 at 5, 7, or 11 disks.
The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

Ref: http://doc.freenas.org/index.php/Hardware_Recommendations

Another factor I am trying to consider/understand is how ZFS handles alignment:
sub.mesa wrote:
As i understand, the performance issues with 4K disks isn’t just partition alignment, but also an issue with RAID-Z’s variable stripe size.
RAID-Z basically works by spreading the 128KiB record size across its data disks. That would lead to a formula like:
128KiB / (nr_of_drives – parity_drives) = maximum (default) variable stripe size
Let’s do some examples:
3-disk RAID-Z = 128KiB / 2 = 64KiB = good
4-disk RAID-Z = 128KiB / 3 = ~43KiB = BAD!
5-disk RAID-Z = 128KiB / 4 = 32KiB = good
9-disk RAID-Z = 128KiB / 8 = 16KiB = good
4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = BAD!
6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good

Ref: http://forums.freenas.org/threads/what-number-of-drives-are-allowed-in-a-raidz-config.158/#post-457
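To make it easier to check other layouts against that formula, here is the same rule of thumb written as a small script (it just restates sub.mesa's formula; any layout whose per-disk chunk isn't a whole power-of-two number of KiB gets flagged):

```
# Per-disk chunk of a full 128KiB record for various RAIDZ layouts (sub.mesa's formula).
def per_disk_kib(total_disks, parity, record_kib=128):
    return record_kib / (total_disks - parity)

layouts = [(3, 1), (4, 1), (5, 1), (9, 1), (4, 2), (5, 2), (6, 2), (10, 2)]
for disks, parity in layouts:
    chunk = per_disk_kib(disks, parity)
    power_of_two = chunk.is_integer() and (int(chunk) & (int(chunk) - 1)) == 0
    verdict = "good" if power_of_two else "BAD"
    print(f"{disks}-disk RAIDZ{parity}: {chunk:5.1f} KiB per disk -> {verdict}")
```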

Then I thought about the hardware some more: I will have three IBM M1015s, all PCIe x8, and to get maximum IOPS I want to split the load evenly across them, so multiples of three. Referencing the above, that means sticking to 3, 6, 9, or 12 drive configurations.

My newest thought is three 6-disk RAIDZ2 vdevs of 2TB drives, which is 36TB of raw storage, or 24TB usable. This should be properly aligned and splits evenly across the three controllers.

This leaves six slots for additional drives.
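For comparison, here is roughly what the layouts I have been weighing give me in usable space (just drive arithmetic; it converts vendor TB to TiB and ignores ZFS metadata and slop overhead):

```
# Approximate usable space for the candidate layouts, assuming 2TB (decimal) drives.
TIB_PER_2TB = 2e12 / 2**40                 # ~1.82 TiB per 2TB drive

def usable_tib(vdevs, disks_per_vdev, parity):
    return vdevs * (disks_per_vdev - parity) * TIB_PER_2TB

for name, args in [
    ("2x 8-disk RAIDZ2", (2, 8, 2)),       # the original plan, ~21.8 TiB
    ("1x 10-disk RAIDZ2", (1, 10, 2)),     # ~14.6 TiB
    ("3x 6-disk RAIDZ2", (3, 6, 2)),       # ~21.8 TiB
]:
    print(f"{name}: {usable_tib(*args):.1f} TiB usable")
```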

Cyberjock, my latest thinking is a three-way mirror of 250GB Samsung 840 EVOs in three of the six remaining slots. From what I read, ZFS will split reads across all disks in the mirror, and to be honest the VMs will not do many sequential reads/writes, mostly random reads. This box will primarily host guests that are IO hungry, but the spinning disks will carry the bulk of the sequential IO.

With the remaining slots I think I will make them hot spares. I am very interested in making sure these cheap disks don't cause data loss. So for each RAIDZ2 vdev I would have the ability to lose a disk with no problems, the hot spare would then replace it, and as long as I don't have further failures while the spare is "resilvering" I would still have a good array?

Finally, I think I will *optionally* improvise with the onboard SATA channels to get a ZIL and L2ARC. What I mean by improvise is zip ties and baling wire (not really) to hang 1-2 drives off the onboard controller. I will only do this after running the system and identifying that a ZIL or L2ARC will actually benefit my IO.

Does all this sound logical?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
First, if you read my presentation, all of that stuff about the number of disks in each RAIDZ type and the optimal disk counts is in my noob guide. I've tried to make it easy for people without them having to spend the months I did on research. ;)

Your PCIe logic for multiples of three is unnecessary. M1015s are PCIe 2.0 x8. That means each card is capable of 4GB/sec (and just 2GB/sec on a board with PCIe 1.0). Do you really think your pools will do 4GB/sec? As a hint, I have one of the fastest pools around, and I do a whopping 800MB/sec.

We aren't baking cookies for momma here. We're engineering a powerhouse for your data! <insert Tim Allen grunt> Your bottleneck is almost certainly going to be that piddly ~125MB/sec gigabit LAN port you have. Even at 10Gb LAN you are talking about other bottlenecks because of IOPS.

So think reasonably when you start doing all this number crunching. At the end of the day, who gives a crap if you can do 10,000TB/sec on your pool locally if you're limited to a 1Gb LAN port. ;)
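If you want to rough out the numbers yourself (these are ballpark figures, not benchmarks of any particular setup):

```
# Ballpark bandwidth ceilings for this build.
pcie2_x8_gbs = 8 * 0.5      # ~4 GB/s per M1015 (PCIe 2.0, ~500MB/s per lane)
disk_gbs     = 8 * 0.15     # ~1.2 GB/s: eight 5900RPM drives per card at a generous 150MB/s each
gige_gbs     = 0.125        # ~0.125 GB/s: gigabit Ethernet on the wire
print(f"per HBA: {pcie2_x8_gbs} GB/s, disks behind it: ~{disk_gbs:.1f} GB/s, LAN: {gige_gbs} GB/s")
# The LAN is one to two orders of magnitude below anything the HBAs could move.
```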

Also, those Samsung EVO drives are rated for just 1,000 to 3,000 program/erase cycles. That's not much at all (it's the worst of ANY drive on the market). They are cheap, but they are also VERY disposable. I'd never buy one and I'd definitely never recommend one in a server capacity... ever. Remember any write will end up on all 3 disks, so you will be doing a lot more writes on every disk than you might normally do. Some people have exceeded the drive's rated life within just a few weeks. So be warned, and monitor those drives like a hawk if you choose to use them. All 3 disks will have roughly the same wear, so they will all die at about the same time.
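To put a very rough number on it (the cycle count, write amplification, and daily write load here are illustrative guesses, not specs or measurements of an 840 EVO):

```
# Very rough endurance estimate for a consumer TLC SSD under VM write load.
capacity_gb     = 250
pe_cycles       = 1000       # assumed program/erase cycles for the TLC flash
write_amp       = 3          # assumed write amplification for small random VM writes
host_writes_tb  = capacity_gb * pe_cycles / write_amp / 1000   # ~83 TB of host writes
daily_writes_gb = 100        # hypothetical sustained VM write load
lifetime_days   = host_writes_tb * 1000 / daily_writes_gb
print(f"~{host_writes_tb:.0f} TB of host writes, ~{lifetime_days:.0f} days at {daily_writes_gb} GB/day")
# Heavier write loads or higher amplification shrink this fast, hence the warning above.
```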

And hot spares do not come online automatically on FreeNAS; FreeBSD does not online hot spares automatically at all. If you won't have physical access to the server later, put the hot spares in; otherwise I recommend you just keep them on a shelf in safe storage in case you do need them. Check with your OS to find out whether it supports automatic spare replacement.

If you want to wait on the ZIL and L2ARC, that's an option. When disk latency starts going up, that will be your cue to get an L2ARC and/or ZIL.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Damn, that was some good info, Cyberjock. I was under the impression that maybe ZFS wasn't well suited to VM storage duty over the long term, or am I thinking of iSCSI + ZFS? If that is the case, would NFS + ZFS make a difference, provided you had the proper equipment (L2ARC + ZIL) for VM duty?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
ZFS is well suited to storing your data safely. It doesn't perform so well when you throw tons of sync writes at it, as ESXi + NFS does, without significant hardware behind it. iSCSI + ZFS with VMs can have its own problems with fragmentation (which NFS can have as well) and will often need an L2ARC to handle the high IOPS you will want. iSCSI is spared the penalty of the sync writes, but isn't spared the high system requirements like lots of RAM and an L2ARC. ZFS has no defrag tools, so the pool needs enough free space to manage fragmentation on its own. Our VM expert recommends keeping pools at most 60% full for maximum pool performance, regardless of fragmentation.
 

KMR

Contributor
Joined
Dec 3, 2012
Messages
199
Not trying to steal this thread away, but does the addition of an L2ARC + ZIL help minimize fragmentation or simply reduce the performance hit the fragmentation has on the VMs?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
It only reduces the performance hit to the pool over the long term. It doesn't affect where the data will be stored on the pool at all.
 

taylorjonl

Dabbler
Joined
Dec 14, 2013
Messages
11
cyberjock said:
Your PCIe logic for multiples of three is unnecessary. M1015s are PCIe 2.0 x8. That means each card is capable of 4GB/sec (and just 2GB/sec on a board with PCIe 1.0). Do you really think your pools will do 4GB/sec? As a hint, I have one of the fastest pools around, and I do a whopping 800MB/sec.

I get your point about bandwidth, but I thought it would be best to spread the load, possibly to help with latency. I had read that ZFS performs operations in parallel and that an operation is only as fast as the slowest drive involved.

cyberjock said:
We aren't baking cookies for momma here. We're engineering a powerhouse for your data! <insert Tim Allen grunt> Your bottleneck is almost certainly going to be that piddly ~125MB/sec gigabit LAN port you have. Even at 10Gb LAN you are talking about other bottlenecks because of IOPS.

So think reasonably when you start doing all this number crunching. At the end of the day, who gives a crap if you can do 10,000TB/sec on your pool locally if you're limited to a 1Gb LAN port. ;)

I also understand what you are saying here. My desire is to put as many of the services that need access to this storage as is reasonable on this box, either as a "zone" (basically like a chroot jail) or under KVM. By doing this I hope to reduce what goes over the LAN; e.g. if this box were solely my NAS and my HTPC were an external box, I would effectively double the LAN traffic at the NAS (when the HTPC reads/writes) and triple it at the HTPC (when it reads/writes/streams).

cyberjock said:
Also, those Samsung EVO drives are rated for just 1,000 to 3,000 program/erase cycles. That's not much at all (it's the worst of ANY drive on the market). They are cheap, but they are also VERY disposable. I'd never buy one and I'd definitely never recommend one in a server capacity... ever. Remember any write will end up on all 3 disks, so you will be doing a lot more writes on every disk than you might normally do. Some people have exceeded the drive's rated life within just a few weeks. So be warned, and monitor those drives like a hawk if you choose to use them. All 3 disks will have roughly the same wear, so they will all die at about the same time.

For now I may just use my single OCZ Vertex 2 120GB while I figure all of this out, and maybe buy something else later on. The more I think about it, this drive is the least important piece as long as I take regular backups: if it fails, I can just restore.

This video explains a little about ZFS replication:

https://www.youtube.com/watch?v=TwsqKBrZ1t8


I am still doing my research on this but it looks promising.

cyberjock said:
And hot spares do not come online automatically on FreeNAS; FreeBSD does not online hot spares automatically at all. If you won't have physical access to the server later, put the hot spares in; otherwise I recommend you just keep them on a shelf in safe storage in case you do need them. Check with your OS to find out whether it supports automatic spare replacement.

From what I have read, hot spares should auto-replace on Illumos-based OSes, but I will definitely test this out. Maybe I will use ZFS's ability to use files as vdevs to do a test, or I will try it on the real drives before I start loading my data onto them.
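Here is a minimal sketch of that file-based experiment (it assumes root on a ZFS-capable host; since automatic spare activation depends on the platform's fault-management service, the replacement step below is done by hand):

```
# Build a throwaway pool from sparse files, with a spare, and walk through a replacement.
import os
import subprocess
import tempfile

def run(*cmd):
    print("#", " ".join(cmd))
    subprocess.run(cmd, check=True)

workdir = tempfile.mkdtemp(prefix="zfs-spare-test-")
data_vdevs = [os.path.join(workdir, f"disk{i}") for i in range(6)]
spare_vdev = os.path.join(workdir, "spare0")
for path in data_vdevs + [spare_vdev]:
    with open(path, "wb") as f:
        f.truncate(256 * 1024 * 1024)           # 256MB sparse files (file vdevs must be >= 64MB)

run("zpool", "create", "testpool", "raidz2", *data_vdevs, "spare", spare_vdev)
run("zpool", "offline", "testpool", data_vdevs[0])             # stand-in for a failed disk
run("zpool", "replace", "testpool", data_vdevs[0], spare_vdev) # attach the spare manually
run("zpool", "status", "testpool")                             # spare should show as INUSE
run("zpool", "destroy", "testpool")                            # clean up
```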

cyberjock said:
If you want to wait on the ZIL and L2ARC, that's an option. When disk latency starts going up, that will be your cue to get an L2ARC and/or ZIL.

I think I will wait to start optimizing the storage until I have some experience with it; otherwise I will buy stuff I don't need that will just collect dust, or that makes things perform worse.

After this I feel I need to do some more research before buying or committing to anything else (except RAM; I know I need that, so I am going to pull the trigger on the 12GB kits). I have two of the three M1015s coming next week; the final one will probably arrive after the new year. So what I may do is mess around with a pool of two 7-disk RAIDZ3 vdevs of 2TB drives ("RAIDZ3+0", if that is how you describe it), which gives me 7.3TB x2 with pretty decent redundancy. Then when the final card comes in I may try my three 6-disk RAIDZ2 vdevs ("RAIDZ2+0"), which gives a bit more space and maybe more performance but less redundancy.

My next major obstacle is figuring out how to perform backups... With 14.6TB, what do you back up to? If I go with the RAIDZ3+0 option I will have 10 slots open; maybe a 6-disk RAIDZ2 of 4TB drives? Or should the backup not be on the same box?

Cyberjock, you said you have a fast array; do you have a page that describes it? What would you do with my hardware?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
In a business environment I'd always go to a second box. You could choose to go with a RAIDZ2 for backups.

Your hesitation about buying hardware is normal. People get paid good money to figure all this crap out for others. You'll easily spend several weeks trying to get all of this stuff right, as there's a lot to digest.

As for what I'd do with your hardware, I've already discussed what I'd do in my recommendations. ;)
 