12-disk SATA setup for VM storage


Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
Hi all -

Forgive me as I am new to FreeNAS (but not storage in general). I admin EMC, NetApp, EqualLogic, etc., but am new to ZFS overall (though NetApp has a striking resemblance...). That said, I have been using a Dell MD1000 with 15 1TB 7200RPM enterprise SATA disks in a RAID50, using three 5-disk RAID5s striped. This was direct-attached to a Dell R710 in my lab, which is great, but I wanted to have shared storage since I have two R710s. Both hosts have dual X5670 CPUs and 144GB of RAM each. I am running ESXi 6.0.

I've purchased a Dell R510 12-bay server and flashed an H200 card to 9211-8i firmware in IT mode. All is well; FreeNAS sees all 12 1TB disks, including the two SSDs (one 250GB, one 256GB) I've added internally. The R510 storage server has 64GB of DDR ECC memory and dual E5620 CPUs (2.4GHz, 4c/8t).

I am looking for guidance here. I am used to H700/LSI 9260-8i, etc., having ~1GB of write-back cache; usually 12 - 15 7200RPM spindles with write-back cache equates to decent performance. This is for a lab, so it doesn't need insane IOPS, but the more the merrier. Can anyone help me through designing the layout? I'd like to have a balance of redundancy, IOPS, and usable space. I am OK losing 2 disks' worth of storage for redundancy if needed - I use vSphere Replication to replicate to another cluster as well as a Synology serving iSCSI targets, so I have redundancy at the VM and hypervisor level. I run maybe 20 - 30 VMs on this, none of which are too chatty, but during Windows Update fests, well... yeah. I have FreeNAS 9.10 installed to a 32GB USB flash key. How do I best configure this for what I am trying to do? I am used to iSCSI, but I have no qualms running NFS, either.

Thanks all! Great forum!
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Striped mirrors are the way to go.

Do a forum search for user:jgreco and search terms like iSCSI, block storage, etc

He has a number of threads on the topic.


Sent from my iPhone using Tapatalk
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The 9260 usually only has 512MB of RAM unless you have one of the OEM versions like Supermicro's, and not all of that gets used for write-back cache.

That's pretty much crap compared to ZFS, which will create transaction groups that are up to 1/8th the size of your system memory (by default). For production use, you'd want to include a SLOG device to provide protection for the data being written, but for a lab environment you might not be concerned about that.

You don't really have a lot of options here for pool design, though. Your practical options are two-way mirrors or three-way mirrors. If you try to pick RAIDZ{1,2,3}, you will find that your pool is way slow. A RAIDZ vdev is generally in the same speed category as the slowest member device, so a 7200RPM spindle might have maybe 200 IOPS to its name, and two six-disk RAIDZ2s might end up feeling like around 400 IOPS.

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

By way of comparison, as a pool of six vdevs of two-way mirrors, your 1TB disks would give you a 6TB pool, of which you can use about 2 or 3TB safely while maintaining fairly good performance levels.
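
For reference, a pool of six two-way mirrors would be created from the shell roughly as sketched below; the pool name "tank" and the da0-da11 device names are placeholders, and on FreeNAS the GUI volume manager is the usual route.

# Sketch only - pool name and device names are placeholders
zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9 \
    mirror da10 da11

# confirm the six mirror vdevs
zpool status tank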
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Striped mirrors are the way to go.

Do a forum search for user:jgreco and search terms like iSCSI, block storage, etc

He has a number of threads on the topic.

Wow, I just gotta stop answering threads... that's like the fourth time today someone's slid in moments before me with a good answer. :smile:
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Wow, I just gotta stop answering threads... that's like the fourth time today someone's slid in moments before me with a good answer. :)
Nah, personally I actually prefer it when others chime in and confirm/reassure or even correct any statement I made. To me it adds weight to the answer/suggestion. Also, for some posters (not saying the OP here), it seems the answers need to be reiterated multiple times, or even in a different fashion, before they sink in. :D
 

snaptec

Guru
Joined
Nov 30, 2015
Messages
502
Agree!
I use only striped mirrors with KVM VMs.
Running really well.
Only massive (backup) data is on 6-wide RAIDZ2, but with a minimum of 3 such vdevs.


Sent from my iPhone using Tapatalk
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
The 9260 usually only has 512MB of RAM unless you have one of the OEM versions like Supermicro's, and not all of that gets used for write-back cache.

That's pretty much crap compared to ZFS, which will create transaction groups that are up to 1/8th the size of your system memory (by default). For production use, you'd want to include a SLOG device to provide protection for the data being written, but for a lab environment you might not be concerned about that.

You don't really have a lot of options here for pool design, though. Your practical options are two-way mirrors or three-way mirrors. If you try to pick RAIDZ{1,2,3}, you will find that your pool is way slow. A RAIDZ vdev is generally in the same speed category as the slowest member device, so a 7200RPM spindle might have maybe 200 IOPS to its name, and two six-disk RAIDZ2s might end up feeling like around 400 IOPS.

https://forums.freenas.org/index.ph...d-why-we-use-mirrors-for-block-storage.44068/

By way of comparison, as a pool of six vdevs of two-way mirrors, your 1TB disks would give you a 6TB pool, of which you can use about 2 or 3TB safely while maintaining fairly good performance levels.


Sorry I was lumping the H700/9260 together.

So forgive me as I am still learning the ZFS way, but I thought that I'd be able to use system RAM as a write-back cache ahead of an SSD tier as yet another write-back cache. Is that not possible? What I mean is, even in RAID50 on my old setup, the IOPs weren't huge, but often the write-back cache was enough that it made for good overall/initial performance. If I flooded the cache faster than it could empty to spindles then yes, IOPs and throughput suffered, but since I was dealing mostly with small to medium IO VMs, it worked perfectly. Can that same principle not apply to FreeNAS/ZFS? Also you mention that six vdevs (RAID10, essentially, right?) would allow for 6TB before formatting but I can use 2 - 3TB safely. Why is that? Why could I not use 5TB, etc.?

I have the current storage pool set up as a single RAIDZ2 with 12 disks and the two SSDs as ZIL and SLOG. I ran the benchmark that I found posted:

[root@freenas] /mnt/StoragePool# dd if=/dev/zero of=/mnt/StoragePool/tmp.dat bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 39.740890 secs (2701856509 bytes/sec)

[root@freenas] /mnt/StoragePool# dd if=/mnt/StoragePool/tmp.dat of=/dev/null bs=2048k count=50k
51200+0 records in
51200+0 records out
107374182400 bytes transferred in 18.002898 secs (5964272115 bytes/sec)


That's a 100GB file written in 39 secs, so that's not bad. Now I just need to find how to measure operations per second. Obviously some sort of caching, either SSD or RAM, is happening here... but I am OK with that! LOL... just realized compression is on. Turned that off, testing now. How can I see IOPS in a similar manner?
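
One way to watch operations per second while a test like this runs is zpool iostat (a rough sketch, assuming the StoragePool name from the dd commands above):

# pool-wide read/write operations per second, refreshed every second
zpool iostat StoragePool 1

# the same numbers broken down per vdev and per disk
zpool iostat -v StoragePool 1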
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sorry I was lumping the H700/9260 together.

I'm ((guessing)) from the model number that the H700 is Dell's 9260. Yeah, Supermicro has a nice expanded-cache one too.

So forgive me as I am still learning the ZFS way, but I thought that I'd be able to use system RAM as a write-back cache ahead of an SSD tier as yet another write-back cache. Is that not possible?

No, that's not really the way ZFS works. ZFS builds a transaction group in system RAM, and you can reasonably consider that to be a write cache, and it could be HUGE - up to 1/8th of your system's memory. But of course system RAM isn't battery-backed, so if the system were to lose power, bad. So, in *parallel*, there is a process called the ZFS Intent Log (ZIL). This is NOT a cache, but rather a log. That can be placed on a high-quality, power-protected SSD (don't bother putting it on a cheap non-power-protected SSD).
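
As a hedged illustration of what that looks like in practice - the pool and partition names below are placeholders, not taken from this thread - moving the ZIL onto a dedicated SLOG device is a one-line zpool operation:

# add a power-loss-protected SSD partition as a separate log (SLOG) device
zpool add tank log gpt/slog0

# or a mirrored SLOG, if two suitable SSDs are available
zpool add tank log mirror gpt/slog0 gpt/slog1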

But that's all the personalized explanation I'm willing to type. I already spend too much time here, and in my previous response, please note that there was a link to one of the major forum resources that explains the ZIL and what a SLOG device is.

What I mean is, even in RAID50 on my old setup, the IOPs weren't huge, but often the write-back cache was enough that it made for good overall/initial performance. If I flooded the cache faster than it could empty to spindles then yes, IOPs and throughput suffered, but since I was dealing mostly with small to medium IO VMs, it worked perfectly.

You can get even more interesting results with ZFS, because we usually talk about the pessimistic performance, but there's a huge amount of potential upside if you do things the ZFS way.

Can that same principle not apply to FreeNAS/ZFS? Also you mention that six vdevs (RAID10, essentially, right?) would allow for 6TB before formatting but I can use 2 - 3TB safely. Why is that? Why could I not use 5TB, etc.?

What if I told you that ZFS could manage 2000 "random" write IOPS on a standard hard disk that is only physically capable of around 200, but that this statement was only true if the disk was 10% full?

ZFS is a copy-on-write filesystem, which has huge performance implications. Please also see the link I provided in my previous message about mirrors, which coincidentally contains a brief summary of this concept.
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
Thanks jgreco - I don't mean to be a nuisance. I am just very, very used to EQL/Netapp/etc. with cache, etc.

I will read the links you've posted. I did read the user guide in respect to ZIL and SLOG but was still left with a few questions.

I am running dd if=/dev/zero of=/mnt/StoragePool/tmp.dat bs=2048k count=90k against my storage pool right now and seeing 6.5k write operations. Is that reflective of what it can do as is? I've turned compression off. Just trying to find out what the real performance is. 6.5k write IO is much better than what I had in RAID50 on my LSI controller.
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
ZFS is a copy-on-write filesystem, which has huge performance implications. Please also see the link I provided in my previous message about mirrors, which coincidentally contains a brief summary of this concept.

Ah yes - now I remember. It has to do with being able to limit write holes, etc. I can see why disk used % would affect that.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks jgreco - I don't mean to be a nuisance. I am just very, very used to EQL/Netapp/etc. with cache, etc.

Not a nuisance, but you might find yourself shoved at resources. We're very good at getting users sorted out, but it's unlikely that the questions you ask today will be new.

I will read the links you've posted. I did read the user guide in respect to ZIL and SLOG but was still left with a few questions.

No doubt. If you can't find the answer, then by all means, ask.

I am running dd if=/dev/zero of=/mnt/StoragePool/tmp.dat bs=2048k count=90k against my storage pool right now and seeing 6.5k write operations. Is that reflective of what it can do as is? I've turned compression off. Just trying to find out what the real performance is. 6.5k write IO is much better than what I had in RAID50 on my LSI controller.

For the sake of testing, that's OK, but afterwards, please turn compression back on. Use the default setting. A modern CPU can actually compress/decompress at rates far greater than the underlying HDD can read/write, so compression almost always increases speed.
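
A minimal sketch of doing that from the shell, assuming the StoragePool name used earlier in the thread:

# lz4 is the FreeNAS 9.10 default and costs almost nothing on a modern CPU
zfs set compression=lz4 StoragePool

# verify the property and see how well existing data compresses
zfs get compression,compressratio StoragePool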

So what you're seeing right now is quite possibly the maximum your pool can possibly sustain. When the pool is empty, there are huge stretches of contiguous space, and ZFS will lay down its transaction groups in those. Since they're sequential space on the HDD, the HDD can punch data out there at very good speeds. Even if you think you are writing "random" data inside a VM, this tends to get converted to sequential writes to the pool.

The downside? Fragmentation. For reads, a rewritten block is never in a contiguous run with the two neighboring un-rewritten blocks that come before and after it in the same file. Also, as time goes on, let's say you have a maximally fragmented, fairly full pool, and a VM is trying to write to sequential blocks. Those are very unlikely to actually be written sequentially, so there's a lot less differentiation between "sequential" write performance and "random" write performance.

If you can get your head around that and it makes a very depressing sort of sense to you, you understand about 2/3rds of the underpinnings of VM storage strategies in ZFS and we can work forward from there. 'Cuz that's the worst of it.
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
Thanks jgreco - that absolutely makes sense. And, because this is a lab, I can live with that. Is there a way to align the written/unwritten blocks after the volume gets to X% full, or something? Thanks for explaining, it is a bit different than I've ever had to consider.

Doing my file write test, I was seeing 6.3 - 6.5k write IOPS with compression off. I re-enabled compression, did the same test, and only saw about 1.3k write IOPS.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Then I think I'll toss you into the deep end a bit. Please consider heading over and reading

http://blog.delphix.com/uday/2013/02/19/zfs-write-performance/

Also if you're curious about the other half of ZFS, RAIDZ, check out

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

which also has some nuggets that are applicable to mirrors too. RAIDZ is almost entirely unsuitable for VM storage if you want performance, unless you have a really massive array.

AFTER you read that, see if it makes sense when I say that the traditional storage admin strategy of adding spindles to increase IOPS doesn't really work the same way with ZFS. With ZFS, you gain write IOPS by increasing both the size of the disks AND the number of spindles, and of the two, moving from a ~50% full disk to a ~10% full disk generates a hell of a turbo boost.

Random reads always suck with ZFS, thanks to fragmentation, so what we like to do there is to have lots of memory (ARC) and cache (L2ARC, i.e. SSD read cache). You ideally want to have all the blocks that are routinely accessed stored in ARC or L2ARC. This can largely mitigate the random read issue.
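
As a sketch of the L2ARC side of that (pool and device names are placeholders):

# attach an SSD partition as a read cache (L2ARC) device
zpool add tank cache gpt/l2arc0

# watch per-device activity, including the cache device, as it warms up
zpool iostat -v tank 1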
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
Awesome thanks. Going to read those now.

For what it's worth, I don't have lots of memory, but I do have 64GB, which is decent (and I won't be running jails or anything; this is a dedicated storage appliance for me). I also have 250GB/256GB SSDs set up as log/cache:

[screenshot of the pool layout showing the log and cache devices]


I will destroy this pool and do another in "RAID10" mirrors and see if the performance is better/worse. Honestly though, if the testing I've done is even a hint of what to expect, then RAIDZ2 with the SSDs for ZIL and SLOG is totally acceptable for my use so far. I won't be putting 8TB of data on this, either, but I might be putting 3TB.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
One of the fun/unfun aspects of ZFS is that there are so few absolutely concrete solid answers. If RAIDZ2 performance is acceptable, then that's a pretty solid answer. Please be sure you do some testing prior to committing, simply because ZFS has a mind-boggling ability to accelerate things where you might not expect them to be getting accelerated.
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
One of the fun/unfun aspects of ZFS is that there are so few absolutely concrete solid answers. If RAIDZ2 performance is acceptable, then that's a pretty solid answer. Please be sure you do some testing prior to committing, simply because ZFS has a mind-boggling ability to accelerate things where you might not expect them to be getting accelerated.

Absolutely - hardware RAID was much more defined. I've read the ZFS RAIDZ stripe-width article and am actually leaning toward a configuration of four 3x1TB RAIDZ vdevs. Space is similar, but striping across more vdevs should be better. Reading that article made me shy away from the single 12x1TB RAIDZ2 config I had originally (even if it performed well). Thanks for all of the info.
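
A rough sketch of that layout from the shell, with placeholder pool and device names:

# four 3-disk RAIDZ1 vdevs striped into one pool - names are placeholders
zpool create tank \
    raidz da0 da1 da2 \
    raidz da3 da4 da5 \
    raidz da6 da7 da8 \
    raidz da9 da10 da11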
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I guess for lab use, RAIDZ1 is probably fine, especially with smallish disks.

Do take some time to ponder your volblocksize (if doing iSCSI), because RAIDZ does not use fixed parity, and when using fixed-size blocks you can get yourself into a bad situation where you're being required to allocate padding blocks. See the RAIDZ article, chart labeled "RAIDZ block layout": the light aqua entries with two data blocks - an 8K block size there results in 100% space overhead for parity and padding, while the purple right above it is 50% space overhead. You can end up totally screwing yourself if you do not understand how these interactions will end up translating into on-disk allocations.
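
For illustration, volblocksize is a per-zvol property fixed at creation time; a hedged example with made-up zvol name and sizes:

# volblocksize cannot be changed after creation; 16K here is only an example -
# choose it after working through the RAIDZ allocation chart for your vdev width
zfs create -V 2T -o volblocksize=16K StoragePool/vm-iscsi

# verify
zfs get volblocksize,volsize StoragePool/vm-iscsi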
 

fta

Contributor
Joined
Apr 6, 2015
Messages
148
Honestly though, if the testing I've done is even a hint of what to expect, then RAIDZ2 with the SSDs for ZIL and SLOG is totally acceptable for my use so far.

Try turning on sync=always and testing again. I'm assuming you'll be setting that for your iSCSI zvols, and with your current testing, you're not really going through your SLOG for writes.
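
A minimal sketch of that test, assuming the StoragePool name from earlier posts:

# force all writes through the ZIL/SLOG so the dd test actually exercises the log device
zfs set sync=always StoragePool

# re-run the dd test, then check the setting and put it back when done
zfs get sync StoragePool
zfs set sync=standard StoragePool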
 

Jon K

Explorer
Joined
Jun 6, 2016
Messages
82
I guess for lab use, RAIDZ1 is probably fine, especially with smallish disks.

Do take some time to ponder your volblocksize (if doing iSCSI), because RAIDZ does not use fixed parity, and when using fixed-size blocks you can get yourself into a bad situation where you're being required to allocate padding blocks. See the RAIDZ article, chart labeled "RAIDZ block layout": the light aqua entries with two data blocks - an 8K block size there results in 100% space overhead for parity and padding, while the purple right above it is 50% space overhead. You can end up totally screwing yourself if you do not understand how these interactions will end up translating into on-disk allocations.

Ugh, makes sense. More to think about, lol. I did see the padding in that chart. So help me understand - does volblocksize apply at the pool level or the zvol level? It would seem the parity is calculated at the pool level, especially considering the applicable configuration seems to follow the vdev layout of the pool. What do most people use for iSCSI? Also, related but not quite on topic: does FreeNAS support multipathing for iSCSI?
 