
The path to success for block storage

kspare

Senior Member
Joined
Feb 19, 2015
Messages
467
Agreed. Not many will have the extra capacity we currently have. We're trying to keep 2 extra servers for this purpose. After we do upgrades, we will slowly move data over to ensure stability. For example, with the LSI firmware on the 12Gb controllers, we're still running v14 instead of the current v16 because it's so stable, plus the latest 11.3-U release.

The new boxes are running TrueNAS with metadata vdevs, carrying half our load... but it's looking REALLY good.
 

Ericloewe

Not-very-passive-but-aggressive
Moderator
Joined
Feb 15, 2014
Messages
17,172
We're all waiting for the point at which SSDs make more sense. Depending on what numbers you use and what endurance you need, we might almost be there. Two 10TB HDDs give you ~1-5TB of VM space (1TB if you want good speed) at a cost of around $400. Four 1TB SSDs give you ~1-1.5TB of VM space at a cost of around $500, if you don't mind consumer endurance levels. So it's close.
Once you factor in power and A/C, SAS expander ports, and the larger physical size, I'd argue that SSDs are the way to go for anything bursty that can handle lowish endurance. Or if the upgrade cycle is short enough for it not to matter.
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,957
Once you factor in power and A/C, SAS expander ports, and the larger physical size, I'd argue that SSDs are the way to go for anything bursty that can handle lowish endurance. Or if the upgrade cycle is short enough for it not to matter.
It kinda depends.

It's entirely possible to engineer "sleepy VMs" that avoid unnecessary I/O, especially writes, simply by doing things like disabling atime and consolidating writes into disciplined batch updates.

I typically run hypervisor storage in RAID1, generally with something like a 9271-8i controller with CacheVault, and this works pretty well. Back around 2014-2015, I could put three or five "consumer grade" SSDs (two or four mirrored, plus a spare) in a system, often for less than the cost of a single nonredundant "enterprise grade" SSD. So this works out to a "know your workload" game.

Anyway, I bought a bunch of Intel 535's for our datacenter use back on Black Friday 2015 for $140-$180 each; they were rated for 40GB/day(?) on a 5-year warranty. I knew our workload was higher than that, maybe 80-120GB/day(?), and with SSD prices in rapid decline, I figured I'd burn them out in a year or three and replace them as they fell, getting larger ones. SSD pricing trends didn't quite work out that way, but the idea is still sound.
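For anyone wanting to sanity-check that kind of endurance math, here's a quick sketch; the 40GB/day rating and the 120GB/day workload figure are taken from the post above, and the arithmetic is just warranty-rated writes converted to total bytes written:

```shell
# SSD endurance back-of-the-envelope, using the figures above.
rated_gb_day=40       # warranty-rated daily writes
warranty_years=5
workload_gb_day=120   # pessimistic estimate of the real workload

# Total rated writes over the warranty period, in TB:
rated_tbw=$(( rated_gb_day * 365 * warranty_years / 1000 ))   # 73 TBW

# Months until the rated writes are used up at the real workload:
life_months=$(( rated_tbw * 1000 * 12 / (workload_gb_day * 365) ))

echo "Rated endurance: ~${rated_tbw} TBW"
echo "Expected life at ${workload_gb_day} GB/day: ~${life_months} months"
```

At 120GB/day this works out to roughly 20 months, which lines up with the "a year or three" expectation.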
 

ehsab

Member
Joined
Aug 2, 2020
Messages
45
Interesting read.
I understand that most users here use a combination of spinning disk and flash, and that most put a higher priority on capacity than performance.
I would really like to read more about systems where performance (ultra-low latency, high IOPS) is the priority and how such a system should be tuned.
Let's say, with PCIe 4.0 x4 NVMe disks, what would be a suitable SLOG? Would a SLOG help with such drives?
How much does a SLOG do for sync writes to a pool?
Let's say your SLOG is on NVDIMM and is crazy fast, and you have
a) a pool of mirrored SAS spinning drives
b) a pool of mirrored SATA SSDs
c) a pool of mirrored NVMe u2 disks

Would it matter to have that fast SLOG device for a and b above? Or does it just have to be slightly faster than the pool it serves as a SLOG for?
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,957
Define: "help with such drives."

Sync writes are always slower than if you disable sync writes. You would be able to blast away at an NVMe pool at insane write speeds without SLOG.

This makes the "How much does a SLOG do for sync writes to a pool" into a curious question. You're not going to be going faster than the SLOG device, that's for sure.
 

ehsab

Member
Joined
Aug 2, 2020
Messages
45
Define: "help with such drives."

Sync writes are always slower than if you disable sync writes. You would be able to blast away at an NVMe pool at insane write speeds without SLOG.

This makes the "How much does a SLOG do for sync writes to a pool" into a curious question. You're not going to be going faster than the SLOG device, that's for sure.
10x Samsung PM1733 in a 2-Way mirror (5 vdevs)
Or
10x Samsung PM983 in a 2-Way mirror (5 vdevs)
Or
10x Samsung 870 Evo on a 2-Way mirror (5 vdevs)
Or
10x spinning disks in a 2-Way mirror (5 vdevs)

Would a SLOG on NVDIMM be optimal for every pool? Or are there situations where a SLOG doesn't help sync writes?
Would it be backwards to choose your data disks based on what SLOG device you're going to use?
 

Dimaslan

Newbie
Joined
May 12, 2016
Messages
3
Sorry if this has been covered before and I missed it. I have enabled S3 storage on my FreeNAS server in my homelab to test Veeam integration, but the most important feature I want to test is bucket immutability.
Is there a way to create an immutable bucket on the FreeNAS MinIO? I see you need to have or install awscli, but the documentation is a bit confusing on where to install it: should I install it on my PC, or via the Shell on the FreeNAS itself?
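For what it's worth, S3 Object Lock has to be enabled when the bucket is created, and awscli can run from anywhere that can reach the MinIO port (your PC is fine). Whether the MinIO build bundled with FreeNAS is new enough to support Object Lock depends on the version, so treat this as a sketch; the endpoint URL, bucket name, and credentials below are placeholders:

```shell
# Point awscli at the MinIO endpoint (all values here are placeholders).
export AWS_ACCESS_KEY_ID=minio-access-key
export AWS_SECRET_ACCESS_KEY=minio-secret-key

# Object Lock can only be enabled at bucket creation time, not retrofitted:
aws --endpoint-url http://freenas.local:9000 s3api create-bucket \
    --bucket veeam-immutable --object-lock-enabled-for-bucket

# Optionally set a default retention rule for new objects:
aws --endpoint-url http://freenas.local:9000 s3api put-object-lock-configuration \
    --bucket veeam-immutable \
    --object-lock-configuration \
    'ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=COMPLIANCE,Days=30}}'
```

These commands need a live MinIO endpoint to do anything, so run them from wherever you installed awscli once the service is up.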
 

ee21

Member
Joined
Nov 2, 2020
Messages
33
@jgreco I've read your guide to block storage, and had a couple questions I was hoping you wouldn't mind helping me out with.

I heard you loud and clear that when it comes to block storage, the way to go is a pool of striped mirrors. Without beating a dead horse too much, it did sound like your guide was mostly covering scenarios in which said pool consisted of magnetic hard drives, however. Do the same guidelines apply to a pool of SSDs?

Specifically, I was interested in creating a pool with 2 striped vdevs, each consisting of 3x 500GB SSDs in RAIDZ1 (similar to what would be considered RAID50), along with an NVMe SSD as a SLOG and 64GB RAM. I am using this pool for iSCSI connections to Hyper-V hosts, running about 12 VMs' OS disks (any large data VHDs to be placed on another pool of striped+mirrored HDDs). That being said, it would be nice to be able to clone a VM or whatnot without suffering any huge I/O penalties, and I am running a 10G link to each host, which I would like to see saturated before hitting the max IOPS of my pool...

50% usable storage on an all-flash pool is a little hard to swallow, and I was really hoping I might be able to leverage striped RAIDZ1 vdevs to bump that up to 66% usable. Or should I just abandon all thoughts of that when using block storage, regardless of it being SSD or HDD?
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
I've been meaning to write some follow-up on that question specifically (block storage on RAIDZ using SSDs)

I'd love to give you a short Yes/No answer but as with so many things ZFS, there really isn't one.

Obviously, SSDs don't have the same random I/O penalty as HDDs when it comes to things like fragmentation or having to "work in sync." You can still be subject to a single drive hitting its garbage-collection or TRIM process and stalling out the vdev, but it's not as cripplingly bad as making a quartet of HDD spindles all move their arms together.

What you do still have to deal with is poor space utilization and write amplification. With small record sizes, you're still having to write the parity information. Assuming your Z1 setup, if you write a record that compresses to 4K or smaller (or starts at 4K to begin with), you'll still be writing one data sector plus one parity sector, using two drives' worth of space. Most VM I/O is 4K or 8K in size; however, with your Z1 being quite narrow, you'll probably be able to eke out better utilization.
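If you want to see where those numbers come from, here's a small sketch of the usual RAIDZ allocation arithmetic (data sectors plus parity per stripe, with the allocation padded to a multiple of parity+1); it assumes ashift=12, i.e. 4K sectors:

```shell
# Sectors consumed by one block on a RAIDZ vdev: each stripe of up to
# (width - parity) data sectors carries 'parity' parity sectors, and the
# whole allocation is padded to a multiple of (parity + 1).
raidz_sectors() {
    local data=$1 width=$2 parity=$3
    local stripes=$(( (data + width - parity - 1) / (width - parity) ))
    local total=$(( data + stripes * parity ))
    local pad=$(( parity + 1 ))
    echo $(( (total + pad - 1) / pad * pad ))
}

# 3-wide RAIDZ1 with 4K sectors (ashift=12):
for kib in 4 8 16 32; do
    sectors=$(raidz_sectors $(( kib / 4 )) 3 1)
    echo "${kib}K block -> ${sectors} sectors ($(( kib * 25 / sectors ))% efficient)"
done
```

A 4K block lands at 50% efficiency (one data + one parity sector), the same as a mirror; only at 16K and up does the 3-wide Z1 approach its nominal 66%.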

With SSDs you can also push the "usable storage" up a bit higher, as you don't have the random-seek penalty on reads. You will still have to deal with garbage collection/TRIM and keeping a certain percentage of your pool free. How far you can push it will likely depend more on your SSD's ability to clean its NAND pages, how much overprovisioning you have, and what exactly your tolerance for "good performance" is.

Mirrored SSD will still perform better, but RAIDZ SSD is starting to be the 320kbps MP3 of storage - it's often "good enough."

Edit, some further thoughts:
1. "NVMe SSD" doesn't necessarily mean "good for SLOG" - check the signature link for SLOG benchmarking. It needs to be high-endurance with low latency. Optane is excellent for this, but truly saturating 10Gbps with regular I/O may require exotic solutions, e.g. NVRAM/NVDIMMs.
2. Hyper-V doesn't seem to understand how to throttle TRIM/UNMAP commands to iSCSI devices. This can cause high latencies during live migrations or deletes. See the thread below for a brief discussion with a user who was in a similar scenario.
 

ee21

Member
Joined
Nov 2, 2020
Messages
33
I've been meaning to write some follow-up on that question specifically (block storage on RAIDZ using SSDs)

I'd love to give you a short Yes/No answer but as with so many things ZFS, there really isn't one.

Obviously, SSDs don't have the same random I/O penalty as HDDs when it comes to things like fragmentation or having to "work in sync." You can still be subject to a single drive hitting its garbage-collection or TRIM process and stalling out the vdev, but it's not as cripplingly bad as making a quartet of HDD spindles all move their arms together.

What you do still have to deal with is poor space utilization and write amplification. With small record sizes, you're still having to write the parity information. Assuming your Z1 setup, if you write a record that compresses to 4K or smaller (or starts at 4K to begin with), you'll still be writing one data sector plus one parity sector, using two drives' worth of space. Most VM I/O is 4K or 8K in size; however, with your Z1 being quite narrow, you'll probably be able to eke out better utilization.

With SSDs you can also push the "usable storage" up a bit higher, as you don't have the random-seek penalty on reads. You will still have to deal with garbage collection/TRIM and keeping a certain percentage of your pool free. How far you can push it will likely depend more on your SSD's ability to clean its NAND pages, how much overprovisioning you have, and what exactly your tolerance for "good performance" is.

Mirrored SSD will still perform better, but RAIDZ SSD is starting to be the 320kbps MP3 of storage - it's often "good enough."

Edit, some further thoughts:
1. "NVMe SSD" doesn't necessarily mean "good for SLOG" - check the signature link for SLOG benchmarking. It needs to be high-endurance with low latency. Optane is excellent for this, but truly saturating 10Gbps with regular I/O may require exotic solutions, e.g. NVRAM/NVDIMMs.
2. Hyper-V doesn't seem to understand how to throttle TRIM/UNMAP commands to iSCSI devices. This can cause high latencies during live migrations or deletes. See the thread below for a brief discussion with a user who was in a similar scenario.
Thank you, that gives me a direction to start - sounds like I need to do some benchmarking to really find out what is going to work best for me..

A couple of takeaway questions. 1: I'm a little green when it comes to RAIDZ levels, and what combinations of drives and RAIDZ levels mean what as far as write penalties go. Would it be better to simply make one RAIDZ1 vdev of 6x 500GB SSDs, rather than 2 striped vdevs of 3x 500GB SSDs in RAIDZ1? Would there be less write overhead, since there would be only one set of parity calculations going on, as opposed to technically two?

2: My other takeaway from your post: if my pool is purely VM OS VHDs stored on iSCSI zvol shares, it sounds like I should pick a smaller record size than the default 128K? Is this true whether I am doing striped mirrors or striped RAIDZ1 vdevs?

And lastly, I picked a 500GB Sabrent Rocket (PCIe 3, unfortunately), which I think is among the best my budget could afford. It has an 800TBW endurance rating at that capacity, if I am not mistaken, which is pretty outstanding for a TLC NVMe. I'd love an Optane if I had an extra $1,000.
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
Thank you, that gives me a direction to start - sounds like I need to do some benchmarking to really find out what is going to work best for me..

A couple of takeaway questions. 1: I'm a little green when it comes to RAIDZ levels, and what combinations of drives and RAIDZ levels mean what as far as write penalties go. Would it be better to simply make one RAIDZ1 vdev of 6x 500GB SSDs, rather than 2 striped vdevs of 3x 500GB SSDs in RAIDZ1? Would there be less write overhead, since there would be only one set of parity calculations going on, as opposed to technically two?

2: My other takeaway from your post: if my pool is purely VM OS VHDs stored on iSCSI zvol shares, it sounds like I should pick a smaller record size than the default 128K? Is this true whether I am doing striped mirrors or striped RAIDZ1 vdevs?

And lastly, I picked a 500GB Sabrent Rocket (PCIe 3, unfortunately), which I think is among the best my budget could afford. It has an 800TBW endurance rating at that capacity, if I am not mistaken, which is pretty outstanding for a TLC NVMe. I'd love an Optane if I had an extra $1,000.
1. You'll have better results with smaller-width vdevs (e.g. the 2x 3-drive layout) in terms of both performance and possibly also space efficiency. You'll also have better redundancy: with the single 6-drive Z1 you can only tolerate one failure, whereas with the 2x 3-drive layout you can lose one drive in each vdev. You'll virtually never be bottlenecked by parity calculation speed on any modern processor.

2. The default zvol block size in Free/TrueNAS is 16K. Mirrors can handle any size, but with RAIDZ, if you reduce the blocksize/recordsize too far, you end up with the poor space utilization described before. 16K is usually fine; I've been getting good results with 32K, as it allows bigger records to compress where possible.

3. Without a non-volatile write cache, I can't see that Sabrent working well (and looking at the SLOG thread, it choked hard at small record sizes). You can use non-enterprise Optane like the M10 or 900p/905p, but you need to mind the endurance. It depends on how hard you write to your array, but in your scenario (home lab, so you can afford downtime?) I'd be tempted to just keep a handful of 16G Optane cards and swap them when they hit the 300TBW mark.
 

kspare

Senior Member
Joined
Feb 19, 2015
Messages
467
1. You'll have better results with smaller-width vdevs (e.g. the 2x 3-drive layout) in terms of both performance and space efficiency. You'll also have better redundancy: with the single 6-drive Z1 you can only tolerate one failure, whereas with the 2x 3-drive layout you can lose one drive in each vdev. You'll virtually never be bottlenecked by parity calculation speed on any modern processor.

2. The default zvol block size in Free/TrueNAS is 16K. Mirrors can handle any size, but with RAIDZ, if you reduce the blocksize/recordsize too far, you end up with the poor space utilization described before. 16K is usually fine; I've been getting good results with 32K, as it allows bigger records to compress where possible.

3. Without a non-volatile write cache, I can't see that Sabrent working well (and looking at the SLOG thread, it choked hard at small record sizes). You can use non-enterprise Optane like the M10 or 900p/905p, but you need to mind the endurance. It depends on how hard you write to your array, but in your scenario (home lab, so you can afford downtime?) I'd be tempted to just keep a handful of 16G Optane cards and swap them when they hit the 300TBW mark.
I thought the default size in 12 was 128k?
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
I thought the default size in 12 was 128k?
... Now you're making me second-guess myself.

Default recordsize is 128K but the default volblocksize should still be 16K. Going to check this out.

Edit: TN12.0-REL still defaults ZVOLs to 16K and datasets to 128K.
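This is easy to verify from the Shell; the pool and dataset names below are examples, not anything from this thread:

```shell
# Check the effective values on an existing pool (names are examples):
zfs get volblocksize tank/zvols/vm-lun0   # zvols: 16K default
zfs get recordsize tank/vmstore           # datasets: 128K default

# volblocksize is fixed at zvol creation time, so choose it up front, e.g. 32K:
zfs create -s -V 500G -o volblocksize=32K tank/zvols/vm-lun0
```

These need to run on a box with an imported ZFS pool, so adjust the names to match your own layout.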
 

kspare

Senior Member
Joined
Feb 19, 2015
Messages
467
... Now you're making me second-guess myself.

Default recordsize is 128K but the default volblocksize should still be 16K. Going to check this out.
Sorry, I'll put my tail between my legs... I'm sure about the zvol! I know when you create a new *pool* it's 128K, and any datasets under that start at 128K...
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
Sorry, I'll put my tail between my legs... I'm sure about the zvol! I know when you create a new *pool* it's 128K, and any datasets under that start at 128K...
"Trust, but verify" - always happy to double-check myself.

The performance impact of RAIDZ can be clubbed away with SSD, but the space-efficiency in smaller records not so much. Hence, we're back to mirrors again.
 

ee21

Member
Joined
Nov 2, 2020
Messages
33
"Trust, but verify" - always happy to double-check myself.

The performance impact of RAIDZ can be clubbed away with SSD, but the space-efficiency in smaller records not so much. Hence, we're back to mirrors again.
I ended up sticking with striped mirrored pairs. I don't need all that much space for VM OSes anyhow. 1200GB effectively usable (without going over 80% of the pool) out of 3000GB total flash storage :eek:

As far as a SLOG goes, I'm limited to an NVMe or SATA SSD, with no PCIe slots left and a budget of around $200 or less. I was sort of banking on the speed of the NVMe being able to commit any ZIL writes to actual NAND, along with two layers of UPS power protection and off-grid solar to prevent any data corruption; granted, none of that stops me from accidentally kicking the power cord while I'm working on my rack... My use of a SLOG was geared more toward performance desires, although maybe that is misguided as well :tongue: I'd love to fire a few questions your way about SLOGs, but don't want to hijack this thread... I'll post on your benchmark thread if that's okay, or if there is some sort of PM functionality here...
 

jgreco

Resident Grinch
Moderator
Joined
May 29, 2011
Messages
13,957
My use of a SLOG was geared more toward performance desires, although maybe that is misguided as well :tongue: I'd love to fire a few questions your way about SLOGs, but don't want to hijack this thread... I'll post on your benchmark thread if that's okay, or if there is some sort of PM functionality here...
SLOG and "performance desires" are not good bedfellows. If you want performance, turn off sync writes. Turning on sync writes always results in slower performance. No SLOG can make your pool faster than async writes.
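For reference, all of this is controlled by the per-dataset/zvol sync property; the zvol name below is a placeholder:

```shell
# Inspect and change sync behavior on a zvol (name is an example):
zfs get sync tank/zvols/vm-lun0
zfs set sync=always tank/zvols/vm-lun0    # every write goes through the ZIL
                                          # (and the SLOG, if one is present)
zfs set sync=standard tank/zvols/vm-lun0  # honor only what the client requests
zfs set sync=disabled tank/zvols/vm-lun0  # fastest, but acknowledges writes
                                          # before they reach stable storage
```

sync=disabled is the "turn off sync writes" case above: fast, at the cost of losing in-flight writes on a crash or power loss.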
 

ee21

Member
Joined
Nov 2, 2020
Messages
33
SLOG and "performance desires" are not good bedfellows. If you want performance, turn off sync writes. Turning on sync writes always results in slower performance. No SLOG can make your pool faster than async writes.
I'm exclusively using iSCSI block shares, which is what brought me to this thread to begin with, haha. As I understand it, sync writes are required, or at least very highly recommended, on such a pool, especially if the iSCSI share is holding a running VM? In which case, I was counting on the SLOG to reduce the IOPS that would normally hit that pool if it were also housing the ZIL... I hope my use of a SLOG here isn't too illogical, but please correct me if I am totally off here.
 

HoneyBadger

Mushroom! Mushroom!
Joined
Feb 6, 2014
Messages
3,331
I ended up sticking with striped mirrored pairs. I don't need all that much space for VM OSes anyhow. 1200GB effectively usable (without going over 80% of the pool) out of 3000GB total flash storage :eek:
Keep compression enabled and you should be able to squeeze a bit more on there. It feels bad certainly, but you wouldn't have bought flash unless you wanted the performance, right?

As far as a SLOG goes, I'm limited to an NVMe or SATA SSD, with no PCIe slots left and a budget of around $200 or less. I was sort of banking on the speed of the NVMe being able to commit any ZIL writes to actual NAND, along with two layers of UPS power protection and off-grid solar to prevent any data corruption; granted, none of that stops me from accidentally kicking the power cord while I'm working on my rack... My use of a SLOG was geared more toward performance desires, although maybe that is misguided as well :tongue: I'd love to fire a few questions your way about SLOGs, but don't want to hijack this thread... I'll post on your benchmark thread if that's okay, or if there is some sort of PM functionality here...
We're on a bit of a tangent but you're correct about sync=always being necessary for iSCSI zvols used for VMFS datastores, which drives you towards "you need a fast SLOG to recover some of the lost performance." Unfortunately the Sabrent doesn't hold up to that workload hence my recommendation of an Optane card. Even if you just have an M.2 2280 NVMe slot I'd say if you can afford the downtime, use M10 16G/32G cards and just swap them once they hit 300TBW.

If you want to take this further, let's go to either the SLOG thread or PMs. Although the most common use of "block storage" tends to be "virtualization" or other workloads that need to be sync=always, which implies SLOG being necessary as well to ensure good performance. And some people think that SLOG can make up for a poor vdev design (it can't, hence the "use mirrors" argument) so it's sort-of-relevant here still.
 

Sirius

Member
Joined
Mar 1, 2018
Messages
41
Possibly dumb question... if my use case is a single iSCSI client that is also a single-user system (i.e. not a VM host), would I be better off using a RAIDZ2 or Z3 pool vs. striped mirrors?

I'm after as many IOPS and as much sequential performance as possible out of 12 or 14 7200rpm mechanical disks. I'm happy to stick with mirrors if that's the best option, but I'm curious whether RAIDZ may work better for my somewhat unusual use case. I'm also considering using up to 8x Optane 900p drives as L2ARC, probably underprovisioned since my working set isn't massive, but of course the latency and performance of Optane are excellent.

The use case is effectively a scratch disk where raw performance is key but data integrity isn't necessarily the be-all and end-all; it's nice to have, but if the pool dies I can easily re-acquire the data from a backup or just re-acquire it another way.

I'm already considering 100GbE in order to...
1. drop latency over the network
2. increase potential max bandwidth

My client is Windows 10 Pro for Workstations, which limits my iSCSI initiators to what something like Chelsio provides or the default Windows one, which may not be capable of handling what I want.

(Apologies if this is a bump of a dead post - this just seemed the right place to ask these questions)
 