SOLVED How to over-provision NVMe disks?

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
What is a good way to resize NVMe drives that don't support namespaces (such as consumer SSDs or Optane)?

Up until TrueNAS 12-U5, only SAS/SATA drives were supported by disk_resize; now NVMe drives are supported as well, but only if they support multiple namespaces (disk_resize is basically wrapping nvmecontrol).

Is there a trick to over-provision NVMe drives that only support a single namespace? The only other approach I can think of is to create smaller partitions with gpart and assume that is enough to signal to the controller that the space is unused, even if it is technically allocated.
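
Something like the following is what I have in mind - a sketch only, assuming the drive shows up as nvd0 and that I leave roughly 20% of a 1 TB drive unpartitioned (device name and sizes are placeholders, and the destroy step wipes the disk):

# Wipe any existing partition table and create a fresh GPT scheme
gpart destroy -F nvd0
gpart create -s gpt nvd0
# Single ~800 GB partition, 1 MiB aligned, leaving the rest of the LBA range unpartitioned
gpart add -t freebsd-zfs -a 1m -s 800g -l vm-ssd0 nvd0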
 
Last edited:

nabsltd

Contributor
Joined
Jul 1, 2022
Messages
133
You need to look more closely at your use case. In other words, why do you think you need more spare area on the NVMe drive? If you think the drive you have isn't up to your endurance needs, then it's unlikely that manual over-provisioning will help as much as just buying a more robust drive (or larger drive with the same endurance per space spec).

Drive manufacturers that spec different endurance based on merely changing the over-provisioning can do this because they know the controller chip is aware of the spare percentage, and know the real endurance of the flash chips. So, a drive that currently has 1 DWPD with some undeclared spare flash quantity might need you to use only 50% of the drive to get to 2 DWPD, while the same drive with only 20% extra spare flash that is known to the controller could get to 3 DWPD.

Then, too, most people vastly overestimate the amount they actually write, so it turns out that a drive rated at just 0.3 DWPD is fine, if the original drive is big enough. I have spinning disk arrays (5x 2TB in hardware RAID-5) where the total combined writes on all the drives over a nearly 8-year lifetime is around 1200TB, which could all have been absorbed by a single 1TB NVMe drive with merely a 1 DWPD spec. But, if that NVMe had been a front for actual ingest only (which doesn't include things like moving data around on the array - that can be much more leisurely), it could have been as low as 0.2 DWPD.
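
As a back-of-the-envelope check on those numbers (using only the figures above):

# ~1200 TB written over ~8 years, hypothetically absorbed by a single 1 TB drive
echo "scale=2; 1200 / (8 * 365) / 1" | bc
# => ~0.41 drive writes per day, comfortably inside a 1 DWPD rating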
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
I am probably like many other hobbyists - trying to play it safe and spare myself the cost and discomfort of being faced with a crisis, yet without a good approximation of the real-world demands of hosting a pool of VMs (say 10-15 machines at most). If a simple proactive configuration can add _years_ of operation, it makes very little sense not to do it as soon as possible. Looking at the calculator, it is rather hard to imagine average host writes of 10 MB/s, 24/7, for 5 years straight in order to hit the warranty limit within that period, but... who knows? A few bursty days/weeks may end up reducing that 10 MB/s of headroom to 6 MB/s and so on.
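
The rough arithmetic behind that 10 MB/s figure, assuming a 1 TB drive and a 5-year window (both just for illustration):

# 10 MB/s sustained, 24/7, for 5 years, expressed in TB written
echo "10 * 60 * 60 * 24 * 365 * 5 / 1000000" | bc
# => ~1576 TB, i.e. roughly 0.86 drive writes per day on a 1 TB drive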

In my case, for the pair of Intel P4510 1 TB drives I have in my primary VM pool, I've read somewhere (I'll try to dig up the link) that the endurance difference between this drive (1 DWPD) and its successor (the P4610 - 3 DWPD) is extra NAND for over-provisioning. The rest is the same - same controller, same NAND type, same firmware.

What I am curious about is my secondary NVMe pool, consisting of a pair of 1 TB Samsung 970 PROs - these are backed by pure MLC cells, so they should reach far above the ~0.7 DWPD rating. Still, I don't know exactly, so I am being cautious. Realistically, if its predecessor (the 840 Pro) is anything to go by, the drives should also easily reach into multi-petabyte TBW territory (which probably means > 1 DWPD).
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The old Intel isdct tool has been discontinued and replaced by the Intel Memory and Storage Tool - unfortunately there's no FreeBSD version.


For your Samsung SSDs you might be able to use the Samsung Magician tool - again, no BSD option here. You'll need to do a quick boot into a Linux or Windows OS.

But if this doesn't work, a secure erase (using the vendor's tool to ensure proper TRIM) followed by manually creating a smaller partition should deliver the same end result.
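
If the vendor tool isn't at hand for the erase, the generic nvme-cli route from a Linux live environment looks roughly like this - device names are examples and both steps destroy data, so double-check the target first:

# Linux live USB: user-data secure erase of the whole drive via nvme-cli
nvme format /dev/nvme0n1 --ses=1
# Back on TrueNAS/FreeBSD: partition only part of the drive, e.g. ~90% of 1 TB
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -a 1m -s 900g nvd0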
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Thank you both - that is what I ended up doing: using Samsung Magician to OP the drives. It looks like it's the best that can be done.

For completeness, disk_resize works fully with the P4510 - this is because the P4510 supports multiple namespaces.
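
For anyone finding this later, the invocation was roughly as below - syntax from memory, so verify against the script's own usage output; device name and size are just examples:

# Shrink the advertised capacity to 800 GB (namespace-capable NVMe or SAS/SATA)
disk_resize nvme0 800G
# Running it again without a size should restore the full native capacity
disk_resize nvme0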
 

Volts

Patron
Joined
May 3, 2021
Messages
210
ZFS is copy-on-write, so writes are already spread across free space - is this actually useful?

On the other hand, will ZFS perform worse if the device size has been artificially constrained? ZFS likes contiguous free blocks.
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
It is very useful if you're trying to extend the life of your SSDs - that is what over-provisioning is about: giving the SSD controller more spare area to balance NAND wear across. What I believe you're referring to here is pool fragmentation, and in that case it's a trade-off between endurance and speed - and with these being relatively fast NVMe SSDs, fragmentation impacts are less of a concern.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
It is very useful if you're trying to extend the life of your SSDs - that is what over-provisioning is about: giving the SSD controller more spare area to balance NAND wear across. What I believe you're referring to here is pool fragmentation, and in that case it's a trade-off between endurance and speed - and with these being relatively fast NVMe SSDs, fragmentation impacts are less of a concern.

Does it work in practice with modern controllers, though? I'd expect the controller to draw on already-occupied space for wear balancing if needed. There is some cost associated with it, but I would not expect it to be significant.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
What I mean is, do the vendor tools accomplish something more than reducing the reported capacity?

One option is to simply not fill the device.

Or create a partition that doesn’t fill the device.

Or my favorite, set a quota on the pool (quick sketch below).

That should accomplish everything overprovisioning does, while also reducing ZFS free space fragmentation and not preventing access to the space if it's ever needed.
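
A minimal sketch of that last option, with a hypothetical pool name and cap:

# Cap how much of the pool ZFS will ever fill; the rest stays as slack
zfs set quota=800G tank
zfs get quota tank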
 

dxun

Explorer
Joined
Jan 24, 2016
Messages
52
Does it work in practice with modern controllers, though? I'd expect the controller to draw on already-occupied space for wear balancing if needed. There is some cost associated with it, but I would not expect it to be significant.

I believe it does - even consumer SSD manufacturers (e.g. Samsung) offer OP capabilities through their software; why would they do so if it were useless?

The question is how you define free space - from what I understand, partitioned space is used space, and that is how controllers track occupied vs. raw (i.e. unused) blocks. Controllers also track how many writes each block has absorbed so far and how many more it can take, and they can dynamically and transparently (to the OS) remap block allocations to maximise total device longevity (a NAND cell has a fixed number of program/erase cycles - on the order of a few thousand for MLC; I'm not sure what the numbers are for other cell types).

Virtually all SSDs have more NAND than they report, and the rest is cordoned off for garbage collection and wear-leveling. Based on the ratio of reported vs. reserved capacity (plus seven herbs and spices), manufacturers calculate DWPD/TBW. This is standard stuff - what we're doing here goes beyond that: we're reserving extra space on top of what the manufacturer has already reserved, to get extra endurance.

How much additional endurance this confers is sometimes easy and sometimes very difficult to answer - only the manufacturer could give a precise figure.

What I mean is, do the vendor tools accomplish something more than reducing the reported capacity?

One option is to simply not fill the device.
Or create a partition that doesn’t fill the device.
Or my favorite, set a quota on the pool.

That should accomplish everything overprovisioning does, while also reducing ZFS free space fragmentation and not preventing access to the space if it's ever needed.

I don't think setting a quota on a pool whose partitions span the whole available SSD space is the same as creating a smaller partition - see the reasoning above. If you claim space for a partition, from a controller's point of view it is "used" - the fact that you have empty space is apparent only to you, the controller knows nothing of the file system or its abstractions; it sees only cells.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Is that a thing controllers do?

I'd expect them to. I don't know. There is plenty of research on the subject in academia, but I'm not sure what specific controllers use which specific implementations. Maybe I will do a test one day, although the metrics might be complicated. What should I even measure?

I believe it does - even consumer SSD manufacturers (e.g. Samsung) offer OP capabilities through their software; why would they do so if it were useless?

Might well be for historical reasons.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
If you claim space for a partition, from a controller's point of view it is "used" - the fact that you have empty space is apparent only to you, the controller knows nothing of the file system or its abstractions; it sees only cells.

This disconnect is what the TRIM/UNMAP commands are intended to solve. The guest OS can (at a given granularity) notify the underlying SSD firmware which areas of space are no longer required, and the SSD can then do the necessary sleight of hand in firmware to shuffle the data around, with the goal of being able to perform an erase cycle on a larger block of space.

I tested (many moons ago) to see if I could tease out a comparison between assigning the full device vs. partition vs. HPA - there didn't seem to be an appreciable difference in performance.


But this was assuming a world where TRIM is infallible and instant - two things we know it isn't. Overprovisioning allows the internal drive firmware to handle some of its own housekeeping in a hopefully more asynchronous manner. Newer drives are better about treating unallocated space (even within a partition) as free, if it's been cleaned up with TRIM, and wear-leveling across it - but this free space is often used in a pseudo-SLC manner to accelerate writes short-term (e.g. the first bit of a TLC cell is flipped, the rest are ignored) - so if you outstrip the "SLC-like" capacity of that free space, or write too fast for the TRIM/garbage collection engines to keep up, you nosedive down to the underlying NAND speed.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
partitioned space is used space

ZFS can use TRIM to explicitly tell the hardware "this logical block is empty, treat it appropriately". Setting the autotrim property is one option, or a daily/weekly zpool trim <poolname> task might be better.
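
For reference, the knobs involved (pool name is a placeholder):

# Issue TRIM continuously as blocks are freed
zpool set autotrim=on tank
# ...or run it on a schedule instead (e.g. a weekly cron/periodic task)
zpool trim tank
# Check per-vdev TRIM status/progress
zpool status -t tank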

I strongly suspect that all empty/TRIMmed blocks are treated the same by the controller. I also believe that when writing, the controller simply looks for a suitable block. I don't think overprovisioned vs. free matters when writing.

I would be very interested to learn more.

My question "is that a thing controllers do?" is about rearranging in-use blocks, which I thought was the question. That would be clever, but also risky, and an interesting trade-off. I'm not aware of any devices that do that. [Edit: They absolutely do this.]
 
Last edited:

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My question "is that a thing controllers do?" is about rearranging in-use blocks, which I thought was the question. That would be clever, but also risky, and an interesting trade-off. I'm not aware of any devices that do that.

SSDs will absolutely do this, in a manner very similar to the "copy-on-write" of ZFS.

NAND is programmed at its "page" size - often 4K, 8K, or maybe even 16K - but is usually erased in a much larger "block" size that's measured in megabytes. A partial NAND erase isn't possible - think of it like shaking an Etch-A-Sketch - there's no way to save part of your drawing.

When an SSD page is overwritten or deleted, the controller marks it as invalid in the Flash Translation Layer (FTL) - that's the table telling the SSD where a given group of LBAs maps to on the physical NAND die. Once a block hits a trigger threshold of invalid pages, internal fragmentation, or other firmware metrics, the SSD flags the entire block for garbage collection. This process copies the remaining valid pages from that block to an empty location in a new block and updates the pointers in the FTL. Once the entire block is considered invalid, the controller zaps the NAND cells, and that block is clean and ready to go.
 

Volts

Patron
Joined
May 3, 2021
Messages
210
Overprovisioning allows the internal drive firmware to handle some of its own housekeeping in a hopefully more asynchronous manner.

This is what I'm questioning. In what way?

I'm confident that overprovisioning doesn't reserve any particular physical blocks. I'm pretty sure all it does is reduce the reported number of logical blocks.

From the device controller's perspective, logical blocks are mapped to physical blocks; when a logical block is written, an eligible physical block gets mapped to it. To a modern controller, "rewriting" isn't a thing - it's the same as "writing" - a new eligible physical block is mapped.

I agree that having enough "free" space is necessary, so that any cleanup can be done in advance, and that when writing, eligible physical blocks can be identified quickly. I'm suggesting that "overprovisioned" is simply a "free quota", not anything special.
 
Last edited:

Volts

Patron
Joined
May 3, 2021
Messages
210
SSDs will absolutely do this, in a manner very similar to the "copy-on-write" of ZFS.

Hahah, we're typing across each other.

I agree. They don't rewrite cells, they copy-on-write pages. So having known-empty cells & pages to write into is critical for performance.

But if 90% of the data on a drive is static, and 10% is written and re-written frequently, that 90% static data isn't ever physically moved.

Do SSDs keep track of writes per page or cell, at that level?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Do SSDs keep track of writes per page or cell, at that level?
Typically it would be "erasures per block", as that's the operation that requires the most voltage and puts the most stress on the NAND.

If the data is 90% static and 10% changing, you'd eventually "wear a hole" in the remaining 10% of the NAND if the drive didn't relocate perfectly valid data on the fly when the erasure count per block becomes heavily imbalanced.
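
If you want to watch that accounting from the OS side, the NVMe SMART/health log exposes the relevant counters - device names below are examples:

# FreeBSD/TrueNAS CORE: dump the SMART/Health Information log page
nvmecontrol logpage -p 2 nvme0
# Or with smartmontools (FreeBSD or Linux)
smartctl -a /dev/nvme0
# Look for "Percentage Used" and "Data Units Written" as rough wear indicators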
 

Volts

Patron
Joined
May 3, 2021
Messages
210
OK, I was wrong. I went down a rabbit hole and learned a bunch about this stuff last night. I wish I hadn't made statements based on my bad assumptions and out-of-date guesses. There's a ton of easily-accessible research about FTLs I should have read first.

Modern SSDs certainly do 'static' wear leveling, tracking lifespan per block and moving data out of low-wear blocks so they can be used by future writes. Thanks for the gentle corrections!

I still wonder if additional overprovisioning is valuable. As long as there are enough free (TRIMmed, unallocated, unused, overprovisioned) blocks, I don't understand what special value overprovisioning could have - it's a quota at a different layer. Blocks are blocks. Right?

(With a CoW filesystem, one type of write amplification is avoided, so the SSD's FTL/GC should have less work to do anyway - meaning "enough free blocks" might be a smaller quantity? Especially on very fast devices like the 970 PRO.)

I also still wonder about the impacts of free-space fragmentation and metaslab/space-map fragmentation. I know ZFS likes empty filesystems. But I can't find any benchmarks for how this affects SSD, only for spinning HDD - is anybody aware of any? It probably doesn't matter if the filesystem isn't "very" full, but in that case, neither does overprovisioning. :smile:
 
Last edited: