BUILD iSCSI Datastore for 25VMs


sekim

Cadet
Joined
Jun 14, 2016
Messages
6
Hi,

I would like some advice on my FreeNAS build.

My business has a small cluster of 3 ESXi servers running about 25VMs in total. I moved from local storage to iSCSI on a QNAP NAS a few months ago but the latency is too high for my liking (frequently over 50ms with spikes over 100ms). Although the QNAP has an SSD cache it doesn't play nicely with iSCSI - enabling it gives latency spikes of over 1,000ms and an error log full of messages saying "dm-kcopyd track job out of array". I've lost confidence in the QNAP and I want to move to something more stable and better understood.

Our usage is at the low end of the spectrum - IOPS is usually below 200 but does spike up to 1,000 on occasion (usually when I reboot a VM or similar). One of the VMs is a MySQL DB and another is an Exchange server with about 50 mailboxes, but they both have a generous amount of RAM so don't tend to hammer the disk.

We have 10GbE which is overkill for our needs but that's another story - we have it now.

So here is the spec I am thinking of:-

1 x SSG-6028R-E1CR12L Supermicro SuperStorage Server 2U Rackmount (with Super X10DRH-iT)
1 x MCP-220-82609-0N Supermicro Rear hot-swap drive bay for 2x 2.5" drives
2 x Supermicro 32GB SuperDOM SATA-III
1 x Intel Xeon E5-2609 v4 Quad-core 2.50 GHz Processor
4 x 32GB Crucial RAM Module - 128 GB - DDR4 SDRAM
4 x HGST Ultrastar 7K6000 6TB configured as 2 x mirrored vdevs
1 x 400GB Intel P3700 PCIe SSD - Dedicated SLOG

If I understand correctly then 4 disks as 2 mirrored vdevs will give me around 500 read IOPS. Write will be about half that but the SLOG will give a large buffer to keep performance up.
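For reference, here's roughly what that layout would look like from the command line (I'd actually build it through the FreeNAS GUI; device names like da0-da3 and nvd0 below are just placeholders, not the real ones):

Code:
# two 2-way mirror vdevs plus a dedicated SLOG device - sketch only
zpool create tank mirror da0 da1 mirror da2 da3 log nvd0
zpool status tank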

I wasn't going to bother with an L2ARC as I think 128GB RAM should cover things.

Hopefully this is overkill for what I need today but I want something that will cover our future needs and I prefer to run systems under-stressed rather than over! :)

Cheers
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Write will be about half that but the SLOG will give a large buffer to keep performance up.
SLOG isn't really a write buffer, and you won't use more than ~10GBytes (10Gbits/sec * 5 seconds for a TXG). And the SLOG is only used in the case where sync writes are needed (which iSCSI doesn't do by default).

You've got the room, so just add more mirrored drives. In fact, you could get rid of the SLOG and the spinning drives and use SSDs instead.
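As a rough sanity check on that figure: 10Gbits/sec x 5 seconds is 50Gbits, or about 6.25GBytes in flight per TXG window, so ~10GBytes of SLOG is already generous. You can also check whether the zvol is even asking for sync writes (pool/zvol name below is a placeholder):

Code:
# default is "standard": only writes explicitly requested as sync hit the ZIL
zfs get sync tank/iscsi-zvol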
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Welcome to the forums.

Overall - you're on a good path.
A couple of things I notice:

Your selected CPU seems to be a low-clocked, high-core-count model. That's rather the opposite of what has been recommended on the forum for similar builds. Typically, the 4-core models with higher clocks are seen as the better choice. If you intend to use only one CPU, you should consider the E5-1620 v4; otherwise, models like the E5-2637 v4 would probably be a better pick.

4 x HGST Ultrastar 7K6000 6TB configured as 2 x mirrored vdevs
You'd probably want to get additional vdevs, using more drives to accommodate the desired level of service/overkill. Cut the drive size in half, and double the drive count.

Or even better, as suggested by @depasseg - get a bunch of SSDs to solve multiple problems at once.
 

sekim

Cadet
Joined
Jun 14, 2016
Messages
6
SLOG isn't really a write buffer, and you won't use more than ~10GBytes (10Gbits/sec * 5 seconds for a TXG). And the SLOG is only used in the case where sync writes are needed (which iSCSI doesn't do by default).

Could you explain a bit more? Reading http://www.freenas.org/blog/zfs-zil-and-slog-demystified/, it says that with an SSD SLOG device "Your storage pool will have the write performance of an all-flash array with the capacity of a traditional spinning disk array"?

Also, I thought sync=always was highly recommended with VMware & iSCSI?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
First, the X10DRH might not be the best option. It's a big hot dual board. For a compute node, great, but for a NAS device with only 12 drives, you're unlikely to need it. You'd *probably* be better off with the X10SRL and an E5-1650 v4 (6 core, very fast) or quite likely the E5-1620 v4 would be fine. Make sure the memory you're getting is ECC Registered RDIMM (LRDIMM won't work with the E5-16xx) if you go that route. The two mirror vdevs will give you a varying number of IOPS, which could be as low as ~250 but as high as ~10K, depending on choices you make.

Getting IOPS out of a pool is a matter of properly sizing things, and there isn't a straightforward formula. If you give ZFS gobs of space to work with, for example, you might easily get 10x the write IOPS out of a hard drive pool that you'd expect to be able to get, but this would be because you're only using 10% of the space, and that doesn't cleanly translate to read IOPS, for which you need ARC/L2ARC.
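If you want to watch those numbers on a live pool, a couple of read-only commands (pool name below is a placeholder):

Code:
# pool occupancy and fragmentation - the main factors behind write IOPS headroom
zpool list -o name,size,allocated,free,capacity,fragmentation tank
# ARC hits vs misses, i.e. how many reads never touch the disks at all
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses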

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

etc
 

sekim

Cadet
Joined
Jun 14, 2016
Messages
6
I will check out the X10SRL, the reason I had the X10DRH is because it comes as a prebuilt bare-bones system with an LSI 3008 controller in IT mode - basically it was the easy/lazy option! There's no reason I couldn't buy the chassis and motherboard separately though.

The X10DRH doesn't appear to support the E5-1600 series which is why I chose a 26xx series CPU. It's been years since I built a server from scratch so I'm not up to speed on which CPUs fit which sockets and the 2011 is especially difficult to fathom!

I opted for a lower clock speed since, as I am mirroring, I assumed (always dangerous) that I wouldn't need as much grunt compared to using RAIDZ. Is that not the case?

I didn't realise that free space translated to better performance so that's really good to bear in mind.

Thanks for the feedback :smile:
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
Could you explain a bit more? Reading http://www.freenas.org/blog/zfs-zil-and-slog-demystified/, it says that with an SSD SLOG device "Your storage pool will have the write performance of an all-flash array with the capacity of a traditional spinning disk array"?
Because in that context (TrueNAS), the underlying spinning disk storage is configured appropriately (i.e. more than 2 vdevs) to handle the output from the SLOG.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could you explain a bit more? Reading http://www.freenas.org/blog/zfs-zil-and-slog-demystified/, it says that with an SSD SLOG device "Your storage pool will have the write performance of an all-flash array with the capacity of a traditional spinning disk array"?

Because that doesn't really mean exactly what someone unfamiliar with the technology might think. Great marketing, less great technically.

https://forums.freenas.org/index.php?threads/some-insights-into-slog-zil-with-zfs-on-freenas.13633/

For very carefully qualified values of certain things I can make "your storage pool will have the write performance of an all-flash array" bla bla bla be a true statement, but this has less to do with the SLOG device than the overall system design. The SLOG is ALWAYS a performance killer; if you set "sync=disabled", your system will ALWAYS be faster. Period. Continued below.

Also, I thought sync=always was highly recommended with VMware & iSCSI?

Which is why he said "by default."

So yes, you do want sync=always for anything where you might have a VM running, and the filer crashes/reboots, and this could render the VM inconsistent if writes were lost.
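On the CLI that's just a property change on the zvol backing the iSCSI extent (dataset name below is made up):

Code:
zfs set sync=always tank/iscsi-zvol
zfs get sync tank/iscsi-zvol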

So if you want fast, *start* with the pool, not the SLOG. Mirrors = faster. Gobs of free space = faster writes. So for example the VM filer here has 52TB of raw HDD space. Eight three-way mirrors of 2TB drives. Gives a 16TB pool, of which we allocate 7TB for iSCSI.

Now the neat thing is that the underlying pool has maybe as much as 3600 IOPS read capacity (all 24 drives independent @ 150 IOPS), and probably at least 1200 IOPS write (8 vdevs @ 150 IOPS). But keeping free space increases IOPS, though not necessarily by a predictable amount, and having 1TB of L2ARC and 128GB of ARC means that most of the frequent reads on the filer are being fulfilled by the ARC subsystem, not pool reads. This means that most of the pool activity is writes, which is a great optimization.
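As a sketch, a pool like that gets built from the command line roughly like this (made-up device names, and only two of the eight vdevs shown; the remaining vdevs are just more "mirror dX dY dZ" groups on the same command):

Code:
# eight three-way mirrors in the real pool; only the first two shown here
zpool create vmpool \
    mirror da0 da1 da2 \
    mirror da3 da4 da5
# sparse zvol for the iSCSI extent, deliberately leaving lots of free space in the pool
zfs create -s -V 7T vmpool/iscsi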

Now, when you turn on SLOG, you will *lose* some of that performance, but you gain the POSIX write guarantees. This is a good thing.

So the thing to realize about an "all flash array" is that an array without a SLOG has to go through the ZFS in-pool ZIL mechanism to write a sync block, which is generally a lot slower than you might expect. So especially if I get to pick a HDD system with a lot of free space and only gigabit ethernet, I'd wager that I can make a HDD based system feel as fast as an all-flash array. Yay for marketing. It absolutely doesn't mean that in all cases a SLOG will make a HDD pool feel like an SSD pool; anyone who wants to test this firsthand is welcome to check it out on a highly fragmented pool filled to 80% doing block storage tasks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Because in that context (TrueNAS), the underlying spinning disk storage is configured appropriately (i.e. more than 2 vdevs) to handle the output from the SLOG.

((Where's my stick?))

There is no "output from the SLOG"!!!!!! It is not a write cache.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
DOH! Rookie mistake. I'll hand in my zfs card.

Can't handle the output from the ZIL (RAM). <<- better?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I will check out the X10SRL, the reason I had the X10DRH is because it comes as a prebuilt bare-bones system with an LSI 3008 controller in IT mode - basically it was the easy/lazy option! There's no reason I couldn't buy the chassis and motherboard separately though.

Right, but it isn't necessarily a great choice, and then you get into issues with which slots work/don't work when only one CPU is populated, and it's big, hot, and slow compared to a more nimble UP system.

The X10DRH doesn't appear to support the E5-1600 series which is why I chose a 26xx series CPU. It's been years since I built a server from scratch so I'm not up to speed on which CPUs fit which sockets and the 2011 is especially difficult to fathom!

Yeah, and Supermicro's offerings aren't really ZFS oriented. The E5-16xx is actually intended as a "workstation" class CPU but functionally is mostly just a faster, single socket version of the 26xx's, with the one caveat that LRDIMM isn't supported.

I opted for a lower clock speed since, as I am mirroring, I assumed (always dangerous) that I wouldn't need as much grunt compared to using RAIDZ. Is that not the case?

Sure, that's valid, but it also means reduced speed if and when it's needed. Quite possibly a nonissue.

I didn't realise that free space translated to better performance so that's really good to bear in mind.

Thanks for the feedback :)

So one thing to bear in mind is that only having two 6TB vdevs might not be that awesome, but I *strongly* encourage you not to go down to four 4TB vdevs or something like that. If performance turns out to be an issue, add additional 6TB vdevs until performance is acceptable, even if your pool utilization is fairly low. From your described usage, I'm kinda suspecting you'll be "okay-but-not-thrilled" with the two 6TB vdevs.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
DOH! Rookie mistake. I'll hand in my zfs card.

Can't handle the output from the ZIL (RAM). <<- better?

((needs bigger stick))

The ZIL is always a storage structure. It can be part of the pool (where it's best referred to as the ZIL) which means that a write has to be expedited out to the pool bypassing the normal transaction group write process, or by performing a write to a separate dedicated log device, the SLOG.

What you're thinking of is the transaction group. The transaction group is built in memory from the data being written and then flushed periodically to the pool.

The transaction group process works the same way for both sync and async data. Because data is being cached in RAM for eventual write, this is where the opportunity for data loss exists, which is what the ZIL/SLOG mechanism addresses.

The in-pool ZIL sucks especially for things like RAIDZ because you're doing a complex allocation, RAIDZ parity computation, write data to disk, etc., for each sync write request.
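If you want to see the two paths separately on a running system, per-vdev iostat breaks the log device out from the data vdevs (pool name below is a placeholder):

Code:
# the "logs" section is ZIL traffic to the SLOG; everything else is TXG flushes to the pool
zpool iostat -v tank 5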
 

maglin

Patron
Joined
Jun 20, 2015
Messages
299
Jgreco is the one you need to listen to for your exact use case: a lot of VMs using a FreeNAS datastore. I believe he went with a striped array of 3-drive mirrors, but I'm unsure how many mirrors. More RAM might not hurt, but I could be wrong.


 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
((needs bigger stick))

The ZIL is always a storage structure. It can be part of the pool (where it's best referred to as the ZIL) which means that a write has to be expedited out to the pool bypassing the normal transaction group write process, or by performing a write to a separate dedicated log device, the SLOG.

What you're thinking of is the transaction group. The transaction group is built in memory from the data being written and then flushed periodically to the pool.

The transaction group process works the same way for both sync and async data. Because data is being cached in RAM for eventual write, this is where the opportunity for data loss exists, which is what the ZIL/SLOG mechanism addresses.

The in-pool ZIL sucks especially for things like RAIDZ because you're doing a complex allocation, RAIDZ parity computation, write data to disk, etc., for each sync write request.

Yes, I thought the TXG in memory was also considered part of the ZIL, but I see that it's not. That's what I was referring to, though (TXGs coming from RAM going to the pool), and I was thinking that after a while of RAM trying to write to a slow pool, and continuously asking the ZIL "here, hold this, and this, and this", things will eventually have to slow down due to the pool.
 

sekim

Cadet
Joined
Jun 14, 2016
Messages
6
Done some more reading and if I understand correctly the ZIL (and therefore SLOG) does not store any data as such, and actually it is only read from in the event of a crash/power failure. A fast SLOG gives a speed boost purely because it provides persistent storage for the ZIL so ZFS can postpone the log writes to a more convenient time. The data itself has to already have been written to the disk, so if the disk is slow then the writes are slow.

There is not much price difference between 12 x 2TB WD Red Pros (SATA) and the 4 x 6 TB SAS drives I had specified.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, I thought the TXG in memory was also considered part of the ZIL, but I see that it's not. That's what I was referring to, though (TXGs coming from RAM going to the pool), and I was thinking that after a while of RAM trying to write to a slow pool, and continuously asking the ZIL "here, hold this, and this, and this", things will eventually have to slow down due to the pool.

Closer, but still not right. See, what you really need to do is to visualize two parallel processes running alongside each other.

On one hand, you have the transaction group process. For the average case, a write comes in, and it gets placed into the transaction group, in RAM, and then "we're done." This assumes space is available in the transaction group. More on that in a sec.

On the other hand, you have the ZIL commit process. Async writes are ignored. For any sync write that comes in, a write has to be committed to the ZIL. If this is a SLOG device, it is a quick back-and-forth commit of the data to the SLOG device via whatever controller. If it's to the in-pool ZIL, then you may need to do some RAIDZ computation and writes to multiple disks or other complexicated and slow stuff.

Each write request gets sent to both those processes. They're essentially parallel. Both processes must complete before the write request is considered complete.

On the backside, we have a transaction group flusher that sends the data in the in-memory transaction group to disk. There's one transaction group that is being built, and a transaction group that's being flushed to disk (either currently or recently finished). So every five seconds (or when the transaction group reaches the maximum allowed size), ZFS closes out the transaction group that's being built, and starts flushing it to disk, opening a new transaction group for further user write requests.

But...! To make sure that we're not building up transaction groups faster than they can be flushed to disk, only one transaction group can be in each state. So we really need a transaction group to be able to flush out to disk in that five second window, or else everything STOPS while the flush completes.

That stalling process is completely unrelated to the SLOG/ZIL. It has to do with the write rate that the pool can sustain.
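The five-second window and the dirty data cap are visible as sysctl tunables on FreeBSD-based FreeNAS (treat the exact names and defaults as an assumption for your version, and don't go tuning them blindly):

Code:
# seconds between transaction group commits
sysctl vfs.zfs.txg.timeout
# cap on dirty (not-yet-flushed) data held in RAM
sysctl vfs.zfs.dirty_data_max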
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Done some more reading and if I understand correctly the ZIL (and therefore SLOG) does not store any data as such

Incorrect. It stores the data. In the event of a crash or whatever, it's the only place that the data has been stored persistently.

and actually it is only read from in the event of a crash/power failure. A fast SLOG gives a speed boost purely because it provides persistent storage for the ZIL

Yes

so ZFS can postpone the log writes to a more convenient time.

Still no. There are no other "log writes." There is only the pool transaction group write.
 

sekim

Cadet
Joined
Jun 14, 2016
Messages
6
If I have a 1.2TB NVMe SSD, can I partition that up and use 200GB as SLOG and the other 1TB as L2ARC? Or should they be separate devices?
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
If I have a 1.2TB NVMe SSD, can I partition that up and use 200GB as SLOG and the other 1TB as L2ARC? Or should they be separate devices?
No. They need to be separate devices.
 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
If I have a 1.2TB NVMe SSD, can I partition that up and use 200GB as SLOG and the other 1TB as L2ARC? Or should they be separate devices?
While it is technically possible to do that, as @Spearfoot mentioned, it's not advised, and they should be separate devices.
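For reference, the two roles really are attached as two separate devices (device names below are placeholders):

Code:
zpool add tank log nvd0      # dedicated SLOG
zpool add tank cache nvd1    # L2ARC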
 