Howdy
@naskit
Lots of questions here, so let's get cracking:
On the topic of 'over-provisioning', does TrueNAS have knowledge of and track 'spare cells' on the SSD, or is this only visible to and done by the SSD drive controller? (What I have read in SSD technical data sheets and specifications so far leads me to believe only the SSD drive controller 'sees' or 'knows' about spare cells and only the drive controller can manage them). It is my (perhaps naïve?) belief that the main reason why Enterprise SSDs are advertised with a lower capacity is *precisely because* a chunk of cells are carved out and hidden away for the express purpose of replacing bad cells whenever they are encountered and SMART mechanisms detect that those cells are no longer reliable.
What (if any) commands are available to the user for said 'over-provisioning' configurations?
The SSD controller is the only component that knows the true extent of the overprovisioning and the wear cycles on each particular page/block of NAND, as well as the other stats stored in the FTL (Flash Translation Layer). TrueNAS doesn't have any knowledge of the "spare cells" beyond what the SSD reports back through SMART data or attributes - and some of those are vendor-specific, or use raw/hex encoding that makes them difficult or misleading to read at a glance.
Intel is one of the better ones, giving you an extended attribute page to poll:
Code:
admin@alderlake[~]$ sudo smartctl -x /dev/sdb
...
Device Statistics (GP Log 0x04)
Page  Offset  Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              58  ---  Percentage Used Endurance Indicator
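If you'd rather pull that endurance figure programmatically than eyeball the output, here's a minimal sketch in Python that parses the `smartctl -x` text format shown above (the `percentage_used` helper is just an illustration, not a TrueNAS or smartmontools API):

```python
def percentage_used(smartctl_output: str):
    """Extract the 'Percentage Used Endurance Indicator' value from the
    text produced by `smartctl -x` (Device Statistics, GP Log 0x04).
    Returns the value as an int, or None if the attribute is absent."""
    for line in smartctl_output.splitlines():
        if "Percentage Used Endurance Indicator" in line:
            # Columns: Page  Offset  Size  Value  Flags  Description
            fields = line.split()
            return int(fields[3])  # the Value column
    return None

# Sample taken from the smartctl output above
sample = """\
Device Statistics (GP Log 0x04)
Page  Offset  Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              58  ---  Percentage Used Endurance Indicator
"""
print(percentage_used(sample))  # 58
```

In practice you'd feed it live output, e.g. `subprocess.run(["smartctl", "-x", "/dev/sdb"], capture_output=True, text=True).stdout` - but bear in mind that, as noted above, not every vendor exposes this attribute in the same place or format.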
Many "Enterprise" SSDs do indeed use a higher level of overprovisioning from the factory compared to consumer SSDs. Some of this comes from the fact that computers (and NAND) count in binary powers of two (GiB/TiB) while drives are sold in decimal (GB/TB) - so most devices start with ~7.4% of spare area just from being built in binary and sold in decimal. But manufacturers can add extra overhead; for example, an SSD with 512GiB of raw NAND (549,755,813,888 bytes) might be sold as:
- A 512GB "consumer SSD" with 512,000,000,000 bytes of addressable space, and 7.4% overhead from the binary -> decimal conversion
- A 500GB "read-optimized enterprise SSD" with 500,000,000,000 bytes of space, and ~10% spare area
- A 480GB "mixed-use enterprise SSD" with 14.5% spare area
- A 400GB "write-intensive enterprise SSD" with 37% spare area
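Those spare-area percentages fall straight out of the arithmetic - a quick sketch, using the same 512GiB of raw NAND as the example above:

```python
RAW_NAND = 512 * 2**30  # 512 GiB of raw NAND = 549,755,813,888 bytes

def spare_pct(advertised_bytes: int, raw_bytes: int = RAW_NAND) -> float:
    """Spare area as a percentage of the addressable (advertised) space."""
    return (raw_bytes - advertised_bytes) / advertised_bytes * 100

for label, size in [("512GB consumer",        512_000_000_000),
                    ("500GB read-optimized",  500_000_000_000),
                    ("480GB mixed-use",       480_000_000_000),
                    ("400GB write-intensive", 400_000_000_000)]:
    print(f"{label}: {spare_pct(size):.1f}% spare")
```

Running it reproduces the ~7.4%, ~10%, 14.5%, and 37% figures listed above.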
Spare area is used not only to replace failed or failing cells, but also to allow for extra free pages that can receive writes without first needing an erase, typically leading to better sustained write performance.
So, how can you mimic this? Both SCALE and CORE support SSD overprovisioning through the webUI or shell; follow the instructions below:
SCALE:
https://www.truenas.com/docs/scale/scaletutorials/storage/disks/slogoverprovisionscale/
CORE:
https://www.truenas.com/docs/core/coretutorials/storage/pools/slogoverprovision/
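If you want to sanity-check the size figure you feed into those instructions, the arithmetic for a given spare-area target looks like this (the 20% figure and the `target_size_bytes` helper are just illustrations, not from the docs):

```python
def target_size_bytes(raw_bytes: int, spare_fraction: float) -> int:
    """Addressable size to set so that spare area = spare_fraction of it.
    raw = target * (1 + spare_fraction)  =>  target = raw / (1 + spare_fraction)
    """
    return int(raw_bytes / (1 + spare_fraction))

raw = 549_755_813_888                    # a 512 GiB drive, as above
target = target_size_bytes(raw, 0.20)    # aim for ~20% spare area
print(target, target // 512)             # bytes, and equivalent 512-byte LBAs
```

Note this is on top of whatever spare area the controller already reserves internally - the OS-visible resize only shrinks the addressable space, and the controller folds the freed capacity into its spare pool.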
Now of course, there are often other differences between an "enterprise" and a "consumer" SSD - the raw speed and endurance of the NAND used (eMLC or TLC vs. QLC), bin quality, and power-loss protection for in-flight data (which you've seen referenced here for the SLOG, and with good reason) - but overprovisioning a consumer SSD can greatly increase both its endurance and speed. Don't expect it to turn a random SSD into a viable SLOG device, though.