L2ARC & Cache for shared ESXi SAN


HalfJawElite · Cadet · Joined Apr 18, 2017 · Messages: 9

Hello Everyone,

I am currently in the process of implementing a new shared ESXi SAN at home for provisioning and running VMs. The SAN will be built on an HP DL360e Gen8 1U server with 4x 3.5-inch hot-swap bays (two SATA 6.0 Gbps and two SATA 3.0 Gbps ports). The bulk storage will be 1 TB WD RE4s. My question for this setup is whether I should use ZIL (SLOG), L2ARC, or both, and whether these should go on a SATA or PCIe SSD.

I should note that I was previously looking at doing an all-SSD array for this SAN (here), but due to the cost of 1 TB SSDs and the fact that this HP server only has two SATA 6.0 Gbps ports (HP B120i storage controller), I cannot ensure that I'll have enough space on the SAN for all VMs in an all-SSD setup, should I choose to add an additional pool of SSDs for more storage in the future.

Currently the SAN will be utilizing all four of its 1 GbE ports (in LACP, of course), and I hope to upgrade to 10 GbE via one of the PCIe slots in the future. I have also played with the idea of a SATA/SAS 6.0 Gbps DAS for this server via an external SAS/mini-SAS HBA connected to a 1U 4-bay chassis with SSDs for expansion.

The specs of the SAN are as follows:
CPU: two Xeon E5-2403 quad-core
RAM: 32 GB DDR3 ECC UDIMM (8x 4 GB), 12 slots total
PCIe: one 3.0 x16
PCIe: one 2.0 x8 (x4 electrical)
Network: four 1 GbE NICs
PSU: dual redundant 460W HP hot swap

Please let me know your thoughts on this setup, as I'm looking to deploy the hardware soon and want to achieve the best possible performance with as much storage as possible.
 

tvsjr · Guru · Joined Aug 29, 2015 · Messages: 959

ZIL is in-pool. A SLOG, which moves the ZIL to a separate device, is a near requirement for VM loads. You'll want a high-endurance SSD with power-loss protection for this.
L2ARC only makes sense after you have substantial amounts of RAM. Don't spend money here; spend it on adding another 32+ GB of RAM.
You need to run striped mirrors and not exceed 50% utilization. If you're running 1 TB drives, that basically means 1 TB usable. And keep in mind that your array's IOPS (the critical measure for VM workloads) is the sum of the IOPS of the slowest disk in each vdev. So if you're running 7200 RPM SATA spinning rust, you're looking at roughly 150-200 IOPS.
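
As a rough back-of-the-envelope sketch (assuming the usual planning figure of about 75-100 random IOPS per 7200 RPM SATA disk, plus the 50% utilization rule above), a 4-drive striped-mirror pool works out like this:

```python
# Back-of-the-envelope sizing for a 4-drive striped-mirror (RAID10-style) pool.
# Assumption (not from this thread): ~75-100 random IOPS per 7200 RPM SATA disk.

drives = 4                  # 1 TB WD RE4s
drive_tb = 1.0
iops_per_disk = (75, 100)   # typical planning range for 7200 RPM SATA

vdevs = drives // 2                  # two-way mirrors
mirrored_tb = vdevs * drive_tb       # 2 TB after mirroring
usable_tb = mirrored_tb * 0.5        # ~1 TB at the 50% utilization guideline

# Random IOPS scale with the number of vdevs, not the number of disks:
pool_iops = tuple(v * vdevs for v in iops_per_disk)

print(f"vdevs: {vdevs}, usable: ~{usable_tb:.1f} TB, "
      f"random IOPS: ~{pool_iops[0]}-{pool_iops[1]}")
# -> vdevs: 2, usable: ~1.0 TB, random IOPS: ~150-200
```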

I do something very similar (see my sig) and I've got 12 drives in the VM array.

If you search here for words like iSCSI, ESX, ESXi, etc., you'll find lots of threads on this topic.
 

HalfJawElite · Cadet · Joined Apr 18, 2017 · Messages: 9

Thanks for helping to shed light on this topic. Given what you said about IOPS being limited by the slowest disk in each vdev... I thought that was the purpose of having a cache drive, to help bring higher IOPS to the running VMs in the pool? If I'm understanding you correctly, then it makes absolutely no sense to use any hard drives, and I should instead look at repurposing the 960 GB Crucial M500s from the ESXi hosts for an all-SSD array. Currently those drives are being used in the hosts themselves, but I can swap them out if needed to bring the higher IOPS to the SAN instead. I was looking at getting a Kingston KC400 512 GB or DC400 480 GB to use as the cache drive, since they're within my budget at those sizes.
 

tvsjr · Guru · Joined Aug 29, 2015 · Messages: 959

Ostensibly, yes... but remember, this is a block store. ZFS has no clue what's really stored on those blocks, so the caching algorithms are quite handicapped. To give you some idea (there's a big thread on this in one of the other forums, but this will do), my NAS has been up almost 5 days (updated to 11.0-U1). Currently, I have a 21.6 GB ARC (in RAM) and a 39.6 GB L2ARC (Intel S3700 SATA SSD). My ARC hit ratio is 64.5%... my L2ARC hit ratio is 0.0%. I think I've seen it hit 1-2%. I would be far better off maxing out the RAM in the box and getting rid of the L2ARC altogether.
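
For reference, those hit ratios come from the cumulative ZFS kstat counters; a minimal sketch (assuming a FreeNAS/FreeBSD box that exposes the standard kstat.zfs.misc.arcstats sysctls) that computes them:

```python
# Minimal sketch: compute ARC and L2ARC hit ratios on FreeNAS/FreeBSD from the
# cumulative kstat.zfs.misc.arcstats sysctl counters (totals since boot).
import subprocess

def kstat(name: str) -> int:
    out = subprocess.check_output(
        ["sysctl", "-n", f"kstat.zfs.misc.arcstats.{name}"], text=True)
    return int(out.strip())

def hit_ratio(hits: int, misses: int) -> float:
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

print(f"ARC hit ratio:   {hit_ratio(kstat('hits'), kstat('misses')):.1f}%")
print(f"L2ARC hit ratio: {hit_ratio(kstat('l2_hits'), kstat('l2_misses')):.1f}%")
```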

Running hard drives is fine, but you need to be realistic with the number of drives you need. A 4-drive array isn't going to get you much.

When selecting SSDs, be very careful. Obviously, serving a VM store is drive-intensive (reads and writes); L2ARC, even more so. My S3700 is rated for 10 Drive Writes per Day (DWPD) for 5 years... that's 256 GB * 10 * 365 * 5 = 4.672 PB. The Crucial M500 is rated for 40 GB/day for 5 years, or about 72 TB... meaning my S3700 can take roughly 64x more writes over its lifespan. The KC400 is rated for 800 TB, the DC400 for 257 TB. Just loafing along, my array is averaging about 1 MB/sec of writes per drive... so that's about 6 MB/sec of unique data (12 drives in 6 two-way mirror vdevs, with each block written twice). That's roughly 518 GB a day, and that's not really hitting anything too hard. An array of S3700s could handle that and not wear out for a while... you'd kill your M500s in 291 days.
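
Written out as a quick sketch (the ratings and the rough 1 MB/sec-per-drive write rate are the figures quoted above, not fresh measurements):

```python
# Endurance arithmetic using the figures quoted in this post.

GB = 1.0
TB = 1000 * GB
PB = 1000 * TB

s3700_rated = 256 * GB * 10 * 365 * 5    # 10 drive-writes/day for 5 years
m500_rated = 40 * GB * 365 * 5           # 40 GB/day for 5 years

print(f"S3700 rated writes: {s3700_rated / PB:.3f} PB")   # ~4.672 PB
print(f"M500 rated writes:  {m500_rated / TB:.0f} TB")    # ~73 TB (rounded to 72 TB above)
print(f"Ratio: ~{s3700_rated / m500_rated:.0f}x")         # ~64x

# ~1 MB/sec of unique writes per vdev across 6 mirror vdevs:
pool_gb_per_day = 6 * (1 / 1000) * 86400                  # ~518 GB/day
print(f"Pool writes: ~{pool_gb_per_day:.0f} GB/day")
```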

Moral to the story: there's a lot more to the SSD story than simple capacity. And Intel data center grade drives reign supreme for a reason... and are more expensive for a reason as well. If you pay attention on eBay, you can pick up last-generation drives (like the S3700) with minimal power-on time/writes for much cheaper than buying brand new.
 

HalfJawElite · Cadet · Joined Apr 18, 2017 · Messages: 9


I've been running both of my M500s in my ESXi hosts for over 2 years now and haven't had any issues with them. Would the estimate of the M500's life cycle be based on using the drive specifically as a VMFS volume, or on operating the drive under FreeNAS itself? If the former is true, then I should have killed my drives by now within the hosts themselves, correct?
 

tvsjr · Guru · Joined Aug 29, 2015 · Messages: 959

It all depends on your workload. And, endurance tests have proven that some drives can go well beyond their rating. My question would be, is ESXi providing any sort of SMART monitoring for those drives? If there were pre-failure indicators, would you know?
 

HalfJawElite · Cadet · Joined Apr 18, 2017 · Messages: 9


I am aware of some SMART data being accessible from within the ESXi CLI, but nothing that can tell me the total TBW. I may have to boot off one of my Linux USB sticks this week and have a look at the data there.
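
For what it's worth, a minimal sketch of that Linux-side check (assuming smartmontools is installed and the drive reports a Total_LBAs_Written-style attribute; the attribute name, ID, and raw-value units vary by vendor, and /dev/sda is a hypothetical device path, so verify against the raw smartctl -A output first):

```python
# Minimal sketch: estimate total host writes from SMART via smartmontools.
# Assumptions: smartctl is installed, the SSD reports a "Total_LBAs_Written"
# attribute, and the raw value is in 512-byte sectors (this varies by vendor).
import subprocess

def total_tb_written(device: str, sector_bytes: int = 512):
    out = subprocess.check_output(["smartctl", "-A", device], text=True)
    for line in out.splitlines():
        if "Total_LBAs_Written" in line:
            raw_value = int(line.split()[-1])        # RAW_VALUE is the last column
            return raw_value * sector_bytes / 1e12   # bytes -> TB
    return None

tbw = total_tb_written("/dev/sda")   # hypothetical device path
print(f"~{tbw:.1f} TB written" if tbw is not None else "attribute not found")
```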

I also found this old post regarding SSD caching and L2ARC that helps clear up some of the mystifying details on this topic. Thus far it looks like I'll be going back to my previous consideration of an all-SSD array (dual 960 GB or 1 TB+ SSDs) for the VM datastore, with my dual 1 TB WD RE4s for VM backups. I'll also need to look into maxing out the RAM configuration on my HP DL360e Gen8 server (dual Xeon E5-2403s installed).

Thoughts on this method?
 

tvsjr · Guru · Joined Aug 29, 2015 · Messages: 959

If you're only doing 4 drives... 2 SSDs and 2 HDDs... I would personally just directly connect them to the system and call it a day. You're adding substantial overhead and don't seem to be getting much for it.
 

HalfJawElite · Cadet · Joined Apr 18, 2017 · Messages: 9


I already have directly attached SSDs in both of my hosts. The idea behind this build is that I'll be adding additional hosts in the future and can easily deploy the new systems without local storage. This is also being done to satisfy the requirements for VMware vCenter High Availability and vMotion for my ESXi hypervisor cluster.
 

tvsjr · Guru · Joined Aug 29, 2015 · Messages: 959

I guess it seems relatively pointless to build a NAS that will give you the effective capacity of half of one drive (two SSDs mirrored, only 50% used to avoid fragmentation), unless your only goal is simply to play with HA/vMotion.
 