BUILD ESXi SAN - on a budget


clipsco

Dabbler
Joined
Nov 5, 2018
Messages
11
Hi, everyone. I'm new to the forums but have been a lurker for quite some time.

I am putting together a SAN for iSCSI VM storage, and am having some confidence issues regarding R/W speeds before ordering $2k+ of parts.
Currently, I have a server running with 18 1TB SSDs in hardware RAID 50, yielding 12TB. If I were to just convert that into a FreeNAS server, I would yield 4.8TB at the accepted 60% capacity limit, which is horrible. There are quite a few high-IO VMs, so I don't believe Z2/Z3 are an option.

My small FreeNAS server is running VMs right now on 6x SSDs in Z2, and I am getting ~900MB/s sequential read and ~1300MB/s write...however, that is with standard sync settings...

I've gone over the SLOG benchmarks, but that's just benchmarking the SLOG, not the actual performance of the filesystem.

So, my first question is about SLOG and ARC. Let's say I build the following:
  • Intel X5670 (or something of that nature)
  • 128GB RAM (overkill?)
  • 8 4TB 2.5" spinny-disks (mirror)
  • 1 spare
  • Optane 900p SLOG
  • 9210-8i
  • 10Gb Ethernet

Can I actually expect VM write performance to reflect that of the SLOG, despite using spinny-disks?
With all that RAM, the ARC should provide great read speeds, or should I get an L2ARC device?

Secondly, my question is about the 60% rule and iSCSI block storage.
Code:
NAME       SIZE   ALLOC  FREE   EXPANDSZ  FRAG  CAP
san1vol1   7.25T  2.69T  4.56T  -         14%   37%

However, I have a zvol for an iSCSI datastore at 5.8TB. ESXi shows it's full, which is how I understand block storage works, but since my zpool capacity is way down there at 37%, I should have been able to create the zvol at the full size of the zpool, right?

You have all been an especially great resource to my ventures down Storage Lane.

Thanks!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
RAIDZ is almost never a good option for block-based storage.

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

VM write performance depends on the pool, SLOG, and pool occupancy rates. ZFS is a copy-on-write filesystem, and as space fills, new writes are allocated to the small bits of space freed up by previous writes. This means lots of seeking.

A pool with 10% occupancy can seem as though it is a MUCH faster device:

[Graph: single-disk throughput vs. pool occupancy at steady-state fragmentation (Delphix)]

https://extranet.www.sol.net/files/freenas/fragmentation/delphix-small.png

This is a single disk, once it has reached steady state fragmentation, which isn't what you'll see at the beginning of your adventure, but is likely to be where you end up in a year.

As you can see, at 10% pool occupancy, a single drive is managing a bunch of I/O, and it doesn't really matter if it is sequential or random. It is very easy to allocate space for new writes, and that is reflected. As you get out past 50%, that has fallen precipitously.

So the thing you want, desperately, to do is to have LOTS of free space. This is what makes writes fast.

But CoW and fragmentation also affect reads. If that becomes unacceptably slow, the answer is to throw L2ARC at it. The ARC is the only mechanism ZFS has to combat read fragmentation. If you have a bunch of VM's with a reasonable working set, you want to size the ARC+L2ARC to hold that working set. When you do that, reads of the data you normally touch will be approximately SSD-speed, and your pool will be mostly free to handle new write traffic, so having a decent ARC+L2ARC actually improves write speeds indirectly too.
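
(If you want to sanity-check whether your current ARC is keeping up, a rough sketch on FreeBSD/FreeNAS is to watch the stock ARC kstat counters; the sysctl names below are the standard ones, the interpretation is only a rule of thumb.)

Code:
# Current ARC size and ceiling, in bytes
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max

# Cumulative hits vs. misses -- a persistently low hit ratio under a steady
# VM workload suggests the working set doesn't fit and L2ARC may help
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses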

You definitely do NOT need to run out and cram your box with L2ARC right away, if you're cost-sensitive.
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
The throughput issue is unique to spinning drives, though, right? So with SSD-based storage, there'd be no need to hold to ~60% occupancy.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The throughput issue is unique to spinning drives, though, right? So with SSD-based storage, there'd be no need to hold to ~60% occupancy.

Ah, no. It just works differently for SSD. SSD write speeds tank when the free page pool is depleted. This creates a somewhat different set of challenges, but still challenges. Some of them can be addressed or mitigated, at least to some extent, on some SSD's, by ashift 13, etc.
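
(For reference, on FreeNAS/FreeBSD the usual way to force a larger ashift is the min_auto_ashift sysctl at pool-creation time. The sketch below uses made-up pool and disk names; verify the result rather than taking it on faith.)

Code:
# Make any pool created after this use at least 8K (2^13) sectors
sysctl vfs.zfs.min_auto_ashift=13

# Example layout only -- device names are placeholders
zpool create tank mirror da0 da1 mirror da2 da3

# Confirm what you actually got
zdb -C tank | grep ashift
# (on FreeNAS you may need: zdb -U /data/zfs/zpool.cache | grep ashift)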
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
  • Intel X5670 (or something of that nature)
  • 128GB RAM (overkill?)
  • 8 4TB 2.5" spinny-disks (mirror)
  • 1 spare
  • Optane 900p SLOG
  • 9210-8i
  • 10Gb Ethernet

What performance 4TB 2.5" spinner were you planning on using?

The only 4TB 2.5" I know of are SMR drives.
 
Joined
May 10, 2017
Messages
838
The only 4TB 2.5" I know of are SMR drives.

Correct; currently only Seagate makes 2.5" 4TB SATA drives, and they are SMR. That doesn't mean they can't be used, but performance will suffer, though maybe not so much in a mirror.

I'm currently testing some to replace my WD Blue drives to make a lower-power server. Since my HDD pool is mostly WORM data it might be OK, but I still hope write performance is better than the current resilvering speed; it looks like it's going to take over 2 days to finish, and it took about 10 to 12 hours with the WDs.
 

clipsco

Dabbler
Joined
Nov 5, 2018
Messages
11
RAIDZ is almost never a good option for block-based storage.

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

VM write performance depends on the pool, SLOG, and pool occupancy rates. ZFS is a copy-on-write filesystem, and as space fills, new writes are allocated to the small bits of space freed up by previous writes. This means lots of seeking.

So the thing you want, desperately, to do is to have LOTS of free space. This is what makes writes fast.

Yes, this is all information I am familiar with; my configuration will be mirrored. I know the occupancy rate affects performance, but I am more focused on the write performance with 8 spinny disks and a SLOG. See below for a link to performance tests with these drives.

I know I should keep occupancy as low as possible, but cost per TB is an issue with ZFS and I want to find the balance. So, with compression on, 80% of the pool allocated to a zvol, and that block device showing 100% full in ESXi, the number I actually need to watch is the 37% used in that pool, NOT the free space ESXi reports in the iSCSI share, correct? I should keep an eye on fragmentation to gauge performance loss, right?

Does anyone have a similar spinny-disk configuration they can share performance results for?


You definitely do NOT need to run out and cram your box with L2ARC right away, if you're cost-sensitive.

My understanding, then, is that the ARC basically becomes a bottleneck over time.


What performance 4TB 2.5" spinner were you planning on using?

The only 4TB 2.5" I know of are SMR drives.

Seagate ST4000LM024, performance tests here.


Using this configuration gets my price per TB down to $142 (including the SLOG). If someone could confirm that I can expect solid SSD-like IOPS with the mentioned specs then we may finally have found a cost-effective VM storage option, and an argument against those hardware RAID savages. Currently, all-flash hardware RAID10 runs about $360/TB.

I'll have one hell of a spreadsheet for you guys when this is all done.

Code:
RAID Type                 Two-Way Mirror
Total Capacity            14.0 TB
Practical Capacity        8.24 TB
Size of disks             4 TB
Number of disks in group  2
Number of groups          4
Total disks               8
Cost per disk             $112.00
Total cost of disks       $896.00
ZIL cost                  $275.00
Total cost                $1,171.00
Price per TB              $142.11


Also one more question, which is more related to VMware, so feel free to ignore.
I know VMware does not like link aggregations for iSCSI, and network port binding is the way to go...but does anyone know if performance is increased with LA on the SAN and port-binding on ESXi? FreeNAS not allowing multiple interfaces in the same subnet makes Round-Robin/Port Binding very complicated, and during some testing I found better performance by removing vmk's in the second subnet.

Thanks again for all your replies!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @clipsco - see inline quotes below for my thoughts on this.

I know I should keep occupancy as low as possible, but cost per TB is an issue with ZFS and I want to find the balance. So, with compression on, 80% of the pool allocated to a zvol, and that block device showing 100% full in ESXi, the number I actually need to watch is the 37% used in that pool, NOT the free space ESXi reports in the iSCSI share, correct? I should keep an eye on fragmentation to gauge performance loss, right?

The actual pool capacity utilized (CAP%) is what you want to use as one of the metrics, as well as FRAG% giving you a general idea of the number of fragmented metaslabs.
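
(If it helps, both are one command away -- a trivial sketch; these are the standard zpool list properties:)

Code:
zpool list -o name,size,allocated,free,fragmentation,capacity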

If the new pool will hold data similar to what's on san1vol1, that means the 5.8TB of logical data in that VMFS volume compressed down to about 2.7TB after LZ4 had its way with it. That's about 2:1, which is pretty good for LZ4. That means that if you create a 12.8TB sparse ZVOL (80% of 16TB) it should result in 6.4TB of actual used space (vs. logicalused) in your pool. 6.4TB/16TB = 40% pool occupancy.

And honestly, that's about where I'd stop if you wanted to retain that level of performance long-term. It gives you some headroom in case your workload doesn't compress well enough, and offers 50-60% of your pool as a free space sacrifice to appease the Angry God of Pool Fragmentation. ;)
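
If you want to check the compression math on your existing pool and then create the new ZVOL sparse, so it only consumes pool space as data is actually written, it would look roughly like this -- a sketch only; the dataset names, the 12800G size, and the 16K volblocksize are placeholders/assumptions, not tested recommendations:

Code:
# How well the current iSCSI zvol compresses (used vs. logicalused)
zfs get used,logicalused,compressratio,volsize san1vol1/iscsi-zvol   # hypothetical dataset path

# Create the new datastore zvol sparse (-s), sized at roughly 80% of a 16T pool
zfs create -s -V 12800G -o volblocksize=16K -o compression=lz4 tank/esxi-datastore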

Does anyone have a similar spinny-disk configuration they can share performance results for?

Closest to this I would say is a 12x1TB setup, but it's only about 10% full. It's bottlenecked by network at 2Gbps or 200MB/s; the pool itself is capable of about 600MB/s writes.

Seagate ST4000LM024, performance tests here.

Using this configuration gets my price per TB down to $142 (including the SLOG).

Those are SMR. Seagate likes to hide this by saying their drives "feature Multi-Tier Caching (MTC)" - but that caching is just a bandage over the sucking chest wound that is the awful performance of drive-managed SMR.

Regardless of their low cost, I cannot stress this enough: do not use these drives.

If someone could confirm that I can expect solid SSD-like IOPS with the mentioned specs then we may finally have found a cost-effective VM storage option, and an argument against those hardware RAID savages. Currently, all-flash hardware RAID10 runs about $360/TB.

ZFS's ARC and SLOG can greatly help mechanical drives perform better, but it's never going to perform like flash. Flash may come with its own performance gremlins (as @jgreco alluded to with page management) but by and large it beats the pants off of spinning rust.

You'll have to re-calculate your numbers with the removal of SMR drives as mentioned. Is the 2.5" a hard requirement based on the chassis, or can you go with cheaper-per-TB 3.5" (which you could leverage for either cost savings, or adding more space to sacrifice to the Frag-Gods)?

Also one more question, which is more related to VMware, so feel free to ignore.
I know VMware does not like link aggregations for iSCSI, and network port binding is the way to go...but does anyone know if performance is increased with LA on the SAN and port-binding on ESXi? FreeNAS not allowing multiple interfaces in the same subnet makes Round-Robin/Port Binding very complicated, and during some testing I found better performance by removing vmk's in the second subnet.

Thanks again for all your replies!

iSCSI doesn't like link aggregation - that's what MPIO is for. Put each ESXi vmk in a separate, non-overlapping subnet matched up with a single FreeNAS NIC, and set up VMware to use the round-robin pathing (VMW_PSP_RR) as well as reducing the IOPS before path switch to a significantly lower number than the out-of-box value of 1000 - the exact number will depend on your number of hosts/VMs and the amount of storage traffic they generate.
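
For reference, the ESXi-side commands look roughly like this -- a sketch; the naa device identifier is a placeholder and the iops value is just an example to tune from, not a recommendation:

Code:
# Default newly claimed iSCSI devices to round robin
esxcli storage nmp satp set --satp=VMW_SATP_DEFAULT_AA --default-psp=VMW_PSP_RR

# Switch an existing device to round robin and change paths every N I/Os
esxcli storage nmp device set --device=naa.xxxxxxxxxxxxxxxx --psp=VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=8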
 

clipsco

Dabbler
Joined
Nov 5, 2018
Messages
11
@HoneyBadger, thanks for the incredibly helpful reply.
You saved me by getting me to ditch the idea of reducing costs with 2.5" spinny-disks. They're fine for backups right now, but that's going to be the most I ever use them for.

So, my costs essentially triple. Does anyone have a recommendation for 2TB SSDs? I've had 4 Crucial 1TB failures this year, so I'm not too interested in them...unless someone wants to convince me otherwise.

Code:
Item                 Role        Qty  Cost
Samsung 860 EVO      Storage     8    $2,920.00
Samsung 860 EVO      Spare       2    $730.00
Intel 900P           SLOG        1    $279.00
Transcend 32GB SSD   OS          2    $47.00
128GB RAM            FS support  1    $300.00

Initial Cost            $4,276.00
Total Capacity          7.02 TB
Practical Capacity      4.84 TB
Potential Capacity      12.10 TB
Minimum Expansion Cost  $730
Total Filled Cost       $8,760


iSCSI doesn't like link aggregation - that's what MPIO is for. Put each ESXi vmk in a separate, non-overlapping subnet matched up with a single FreeNAS NIC, and set up VMware to use the round-robin pathing (VMW_PSP_RR) as well as reducing the IOPS before path switch to a significantly lower number than the out-of-box value of 1000 - the exact number will depend on your number of hosts/VMs and the amount of storage traffic they generate.

Yea, this is the configuration I'm using now. However, according to this KB it is not recommended by VMware. Esxtop shows 8Gb-11Gb during load testing, so I am unsure if there is a bottleneck or if that's just what to expect, since it's averaging 10Gb on a 10Gb network. EDIT: I will try dropping the IOPS per path switch to 100 and recheck.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Some of VMware's recommendations are aimed at busy sites with tons of VM's, and the tuning decisions you make there are different.

Many FreeNAS users will go and try to benchmark and get "the most" out of their setup, often not realizing that this can be detrimental to overall performance.

I don't have time right now to comment on some other things that should be said.
 

clipsco

Dabbler
Joined
Nov 5, 2018
Messages
11
Some of VMware's recommendations are aimed at busy sites with tons of VM's, and the tuning decisions you make there are different.

I have 5 HANA (in-memory database) instances running, consuming about 1TB of RAM. They sync to persistent storage at regular intervals, so there are huge IOPS bursts; they are capable of consuming all the IOPS available to them. With Storage IO Control enabled they don't tank the other VMs, but "the most I can get" is basically what I need.

It turns out cost isn't as big of a deal as I thought. This is my first time building and managing storage, so the $5k+ for all SSDs in this build was crazy to me...Apparently not so crazy to my boss.

Does anyone have any recommendations on 2TB SSDs?

@jgreco, thanks for the time you've already taken. Your posts over the years, as well as those of a handful of other users, have really kept me from drowning when dropped into the pool (pun intended) that is known as storage.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Glad to know someone reads stuff. :smile:

If you can, do some testing several different ways before arriving at any final configuration tuning options. I have repeatedly found, over the years, that benchmarks end up being deceptive as to how real world systems perform. Most benchmarks were designed to measure sequential and random speeds to standard hard disks, and are unable to adapt to the complex performance behaviours that ZFS will exhibit as circumstances get more complicated. Those who set out to create a testing environment that includes a stressed-out pool are likely to get more down-to-earth results than those who start with an empty pool.

If you need to deal with bursty write performance on SSD, do consider the potential benefits of underprovisioning (overprovisioning, depending on who you talk to) your SSD's to create a larger free page pool. Check if one ashift works better than another (and don't forget about 13).
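
To make the underprovisioning idea concrete -- this is only a sketch, with made-up device names and sizes, and it works best on drives that have been secure-erased so the untouched space really is clean NAND -- you can simply give ZFS a partition smaller than the disk:

Code:
# Hypothetical 2TB SSD: hand ZFS ~1.7TB and leave the rest unwritten
gpart create -s gpt da5
gpart add -t freebsd-zfs -a 1m -s 1700G -l ssd-slot5 da5
# then build the vdev from gpt/ssd-slot5 instead of the raw da5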
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Does anyone have any recommendations on 2TB SSDs?
I know @Chris Moore leans toward Samsung and Intel and recommends anything cheap but better than KingDian or other types of ultra-cheap SSD. No need for enterprise-class SSDs. Basically, by the time they wear out, prices and performance will have changed so much that it would be silly to hang onto such old drives anyway. Keep one or two on hand as cold spares and be happy!

On the cost end of things, get a quote from iXsystems, or better from NetApp. Suddenly $10k will sound CHEAP. If you really want to laugh look at IBM...

Also, considering the database is running in RAM, you may look into a datastore just for the databases, set sync=disabled, and plan snapshots on the same interval. Is the database sync operation blocking, or will it continue to service requests while writing to disk?
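
If you go that route, the ZFS side is simple -- a sketch with made-up dataset names; keep in mind sync=disabled means acknowledged writes since the last transaction group can be lost on a crash or power failure, which is exactly the trade-off being made here:

Code:
# Dedicated sparse zvol just for the HANA persistence volumes
zfs create -s -V 2T tank/hana-datastore
zfs set sync=disabled tank/hana-datastore

# Snapshot on roughly the same cadence as the database savepoints
zfs snapshot tank/hana-datastore@$(date +%Y%m%d-%H%M)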
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
recommends anything cheap
Not true. If we are talking boot drives, I usually point out that I successfully use 40 GB laptop drives, in mirrored pairs, that I bought as "new - old stock", and that those have worked very reliably for me for over two years. The reason I chose to do that is because we had some old servers at work, decommissioned after about ten years of use, that had laptop drives as boot drives, and they never had a problem either. Laptop spinning disks in a stationary environment can be very reliable because they are built to handle the movement a laptop undergoes, and the fact that they are slower than regular mechanical disks is not a significant issue for a FreeNAS boot disk.
These drives do happen to be inexpensive, because the demand is somewhere between low and nonexistent. I picked up the six I bought for $7 each.
The next option would be SSD, and with regard to those, I have picked up many small-capacity Intel drives from eBay in the vicinity of $20 to $40. If I were choosing, I would choose based on anticipated reliability, and that would have me in the Intel or Samsung arena. I do not trust SSDs from many vendors, and I have had bad experiences with several of the 'cheap' brands.
Correct; currently only Seagate makes 2.5" 4TB SATA drives, and they are SMR. That doesn't mean they can't be used, but performance will suffer, though maybe not so much in a mirror.
I have some of those at work and testing shows the average data rate to be around 100MB/s, so they are quite slow in comparison to regular 3.5" drives. Even the 4TB 3.5" SMR drives had poor performance when I put them in a RAIDz2 pool. The drives were averaging around 125 MB/s where non-SMR drives were averaging around 160 MB/s. I would say SMR disks are to be avoided, but I have not tried them in mirror sets.
Intel X5670 (or something of that nature)
If you are talking about an old socket 1366 system with a Xeon X5670 processor, that is too old. The processor will limit maximum overall system performance.
I wouldn't do that unless you are just trying to get by with what you have handy because you don't have a budget to buy something.
Currently, I have a server running with 18 1TB SSDs
If you can afford all those SSDs, you can certainly afford to buy a proper solution for what you need.
8 4TB 2.5" spinny-disks (mirror)
Generally speaking, more disks equates to faster access. With only 8 disks, that is 4 mirrors and about 100 IOPS per mirror with those disks. Estimate 400 IOPS for the pool. That sounds pretty slow to me. I wouldn't do it.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Seagate ST4000LM024, performance tests here.

Using this configuration gets my price per TB down to $142
That doesn't make sense because the 2.5" drives are more expensive than the 3.5" drives. If you want low cost and high performance, why would you go with the smaller drives?
 
Joined
May 10, 2017
Messages
838
The drives were averaging around 125 MB/s where non-SMR drives were averaging around 160 MB/s. I would say SMR disks are to be avoided,

Agree, they should be avoided for most use cases. For this server I'm going to keep them because of the low power, and since this pool is mostly write-once in small chunks that fit in the RAM cache, there's no perceived impact at all. When I do need to do larger writes, sustained write speed went from 300MB/s to 200MB/s, which I feel is still adequate, and for speed I have the SSD pool.

One thing that surprised me was the resilvering performance. I don't know the inner workings of ZFS, but I would expect resilvering to be mostly sequential writes, which SMR drives usually handle well; yet resilvering took about 7 times longer, which makes me believe it acts like random writes and hits the SMR wall.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Agree, they should be avoided for most use cases. For this server I'm going to keep them because of the low power, and since this pool is mostly write-once in small chunks that fit in the RAM cache, there's no perceived impact at all. When I do need to do larger writes, sustained write speed went from 300MB/s to 200MB/s, which I feel is still adequate, and for speed I have the SSD pool.

One thing that surprised me was the resilvering performance. I don't know the inner workings of ZFS, but I would expect resilvering to be mostly sequential writes, which SMR drives usually handle well; yet resilvering took about 7 times longer, which makes me believe it acts like random writes and hits the SMR wall.

RAIDZ resilvering is not like RAID5. It does not linearly scan the disk. ZFS recovery and validation operations are metadata traversals, which means lots of seeks.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
One thing that surprised me was the resilvering performance. I don't know the inner workings of ZFS, but I would expect resilvering to be mostly sequential writes, which SMR drives usually handle well; yet resilvering took about 7 times longer, which makes me believe it acts like random writes and hits the SMR wall.
@jgreco beat me to the explanation, but I will add my experience. It was largely because of the slowdown in resilver speed that I chose to replace the SMR drives I had already put in service with non-SMR drives. I had been able to replace a drive in my pool in two to three hours before the SMR drives; after they were in the pool, it took almost ten hours to do a resilver. I also noticed that the pool was much slower when receiving a ZFS send/receive after the SMR drives were in the pool.
@clipsco In my situation, I didn't realize the disks I was getting were SMR until I had them in the pool and saw the performance numbers because Seagate did not state it in the documentation. I just don't want you to be surprised by the behavior if you choose to use these disks. They will be slow by comparison to 'regular' disks that do not use SMR to gain additional data density.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I don't know why people are still trying to force SMR to be useful for anything other than single-disk media backups (or object storage) - RAIDing them always ends poorly.

Once libzbc gets traction behind it and host-managed/aware is more common, it might be useful; but you're still going to need the equivalent of ashift=28 to align with the 256MB SMR zones on those drives. ;)

Back to the OP's question - @clipsco those 860 EVOs should actually be a fairly decent choice.

The firmware on them appears to support the two necessary features for TRIM to be passed through all LSI HBAs ("Data Set Management TRIM supported (limit 8 blocks)" and "Deterministic read ZEROs after TRIM") so your performance hopefully won't be absolutely trashed over time. (Although I hear the SAS2008 is still picky about proper TRIM support, so yet another reason to go SAS2308 if you didn't already have enough ...)
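
If you want to verify that on a drive in hand before building the pool, the identify data will show it -- a quick sketch, the device name is a placeholder:

Code:
# FreeBSD/FreeNAS: look for the DSM/TRIM and deterministic-read lines
camcontrol identify da3 | grep -i -e trim -e dsm

# or, with smartmontools (newer versions print a "TRIM Command" line)
smartctl -i /dev/da3 | grep -i trim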

You'll likely still want to use over/under-provisioning at some level - I don't know if it supports the Host Protected Area or if you'll just have to manually make smaller partitions - to leave lots of spare unused NAND for the controller to wear-level over.

And finally, I believe that model uses an 8KB program page, so you'll want to use ashift=13 for the best performance.
 