Storage became extremely slow

tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
Hello ,

I have a system with following spec :
OS Version: FreeNAS-11.3-U3.1, Model: SSG-6048R-E1CR60L, Memory: 512 GiB, Pool: 36 x Seagate ST100000096 disks in 6 x 6-wide RAIDZ2 vdevs, plus 4 x Intel SSDPE2ME800G4 in total (2 x mirrored log, 2 x cache)

The system is used to run XCP-ng VM guests. Everything worked well until recent days, when we suddenly started experiencing real slowness; VM guests run so slowly they get stuck.
The pool shows healthy. I ran a scrub of the pool and it took around 40 hours to complete over 8 TB of data, with no errors found. When I run dd if=/dev/zero of=/mnt/po01/vmrepository/testfile to test internal speed it takes a really long time, and when I finally stop it I get bad results like 200 MB/s. Just to compare, on another identical system I get around 25 GB/s.
I really don't know what could cause performance to degrade like that. Please help.
Thanks
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
This smells like your pool either lost both members of the ZIL mirror, so it now has to use the ZIL within the pool, or your pool is too full. Please provide the output of zpool status -v and zpool list.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
There's no way you get 25 GB/s sustained through a mirrored log... I suggest looking at the SLOG settings and devices.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
There's no way you get 25 GB/s sustained through a mirrored log... I suggest looking at the SLOG settings and devices.
Agreed, I suspect the primary system's SLOGs have both failed, so you're choking on in-pool sync write speed, and the sync status of the second system doesn't match the first so it's showing artificially inflated results.
 

tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
Agreed, I suspect the primary system's SLOGs have both failed, so you're choking on in-pool sync write speed, and the sync status of the second system doesn't match the first so it's showing artificially inflated results.
I removed the log mirror and it improved to 500 MB/s; then I wiped the SLOG drives, re-added them, and performance dropped to 120 MB/s.
What's weird is that IPMI reports the disks as healthy and S.M.A.R.T. hasn't reported any issues. Could it be firmware or a related issue I missed?
Thanks
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Those SLOG drives are toast. What's probably going on is that so many cells have failed that the onboard controller is constantly garbage collecting and defragging. You're better off just replacing them entirely.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
I removed the log mirror and it improved to 500 MB/s; then I wiped the SLOG drives, re-added them, and performance dropped to 120 MB/s.
What's weird is that IPMI reports the disks as healthy and S.M.A.R.T. hasn't reported any issues. Could it be firmware or a related issue I missed?
Thanks
Can you post the full output of smartctl for the drives? Serial numbers can be redacted, it's mostly the health and total writes I'm interested in.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
It's usually possible to get better performance out of SLOG SSDs by formatting them down to 16GB in size. The remaining space is used for garbage collection and helps the drives do more writes.
 

tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
Can you post the full output of smartctl for the drives? Serial numbers can be redacted, it's mostly the health and total writes I'm interested in.
Hi,
Wow, thanks all for the assistance. I've attached my SMART output from the debug file.
Thanks
 

Attachments

  • dump.txt (143.3 KB)

tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
It's usually possible to get better performance out of SLOG SSDs by formatting them down to 16GB in size. The remaining space is used for garbage collection and helps the drives do more writes.
Hi,
I didn't understand "The remaining space is used for garbage collection and helps the drives do more writes."
Thanks
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi,
Wow, thanks all for the assistance. I've attached my SMART output from the debug file.
Thanks

Unfortunately, you aren't actually querying your NVMe SSDs regularly, as evidenced by the output here:

Code:
/dev/nvd0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/nvd0: To monitor NVMe disks use /dev/nvme* device names
Please specify device type with the -d option.

Use smartctl -h to get a usage summary


As stated there, you'll need to run smartctl -a /dev/nvmeX rather than using the nvdX name. If you can run that now against them and post the output, that would be appreciated.
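
For example (a quick sketch, assuming the devices are numbered 0 through 3; adjust to whatever your system actually shows):

Code:
# Query the NVMe controller nodes directly instead of the nvd block devices
for dev in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3; do
    echo "=== $dev ==="
    smartctl -a "$dev"
done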

Hi,
I didn't understand "The remaining space is used for garbage collection and helps the drives do more writes."
Thanks
SSDs can be changed to only expose a limited amount of space as usable to the OS - this lets the remaining unused space help with the drive's wear-leveling and endurance algorithms.
 

tbaror

Contributor
Joined
Mar 20, 2013
Messages
105
Hello,
An update on the storage issue: I activated the storage warranty with the local Supermicro vendor; they came today and replaced 4 NVMe drives and 4 disks with bad blocks.
Now the storage seems to be operational again.
Thanks all for your support, cheers
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It's usually possible to get better performance out of SLOG SSDs by formatting them down to 16GB in size. The remaining space is used for garbage collection and helps the drives do more writes.

You know, years ago, I submitted a ticket to do this as a GUI option and it was rejected.

https://redmine.ixsystems.com/issues/2365

Jordan faffed around with me in mail about this and basically tried to make it my responsibility to show that this was true on a variety of devices, and I basically said I wasn't going to buy gear just so I could do the testing he wanted, seeing as how I am not paid by iX.

If I were to reopen this ticket, would I get a better response now?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello,
An update on the storage issue: I activated the storage warranty with the local Supermicro vendor; they came today and replaced 4 NVMe drives and 4 disks with bad blocks.
Now the storage seems to be operational again.
Thanks all for your support, cheers
This is a good resolution; however, I would suggest making sure your smartctl scripts are set up to look for the correct /dev/nvmeX names so that you can keep monitoring the drives in the future.

Pulling the information from the failed drives would also be useful, to look at the total amount of writes they absorbed. The rated 4.38 PBW sounds like a huge amount, but on a heavily used system it's not at all out of the ordinary. Check your smartctl results 24 hours apart, calculate an estimate of your daily TBW and DWPD, and figure out whether you're going to burn through drives; then replace proactively rather than reactively.
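
As a rough sketch (the counter values here are placeholders, not real data): smartctl reports the NVMe "Data Units Written" counter in units of 512,000 bytes, so two readings taken a day apart give you an estimate.

Code:
# Note the "Data Units Written" value today and again tomorrow
smartctl -a /dev/nvme0 | grep 'Data Units Written'
# Placeholder example values; substitute your own two readings
DAY1=1000000; DAY2=1200000
# TB written per day (one data unit = 512,000 bytes)
echo "scale=3; ($DAY2 - $DAY1) * 512000 / 10^12" | bc
# DWPD = TB written per day divided by drive capacity in TB (0.8 TB here)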

You know, years ago, I submitted a ticket to do this as a GUI option and it was rejected.

https://redmine.ixsystems.com/issues/2365

Jordan faffed around with me in mail about this and basically tried to make it my responsibility to show that this was true on a variety of devices, and I basically said I wasn't going to buy gear just so I could do the testing he wanted, seeing as how I am not paid by iX.

If I were to reopen this ticket, would I get a better response now?
I'd like to think that it should be an expected feature now, as SSDs are fairly well-understood devices and several vendors explicitly mention over-provisioning as a method to improve write consistency and endurance. Algorithms are better now as well; I tested an Intel 320 80GB years ago, never updated after the end but the period of consistent throughput was slightly better with it limited via HPA. It also lets you use smaller metaslabs on the device.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
You know, years ago, I submitted a ticket to do this as a GUI option and it was rejected.

https://redmine.ixsystems.com/issues/2365

Jordan faffed around with me in mail about this and basically tried to make it my responsibility to show that this was true on a variety of devices, and I basically said I wasn't going to buy gear just so I could do the testing he wanted, seeing as how I am not paid by iX.

If I were to reopen this ticket, would I get a better response now?

Overprovisioning of drives is now a feature of 11.3. https://www.ixsystems.com/documentation/freenas/11.3-U3.2/storage.html#overprovisioning

It should have been done earlier...but not many other systems offer it.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
How do you do that with NVMe drives?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
How do you do that with NVMe drives?

I don't know what they're doing in the command.

In theory, if you reset a drive (losing all the mappings), and simply allocate a much smaller SLOG partition up front, and don't use the rest of the drive, you should get the correct behaviour.
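
A minimal sketch of that on FreeBSD, assuming the SLOG devices show up as nvd0/nvd1 and the pool is named po01 (all names are examples, and the low-level reset/secure-erase step varies by drive and tool, so it isn't shown here):

Code:
# Wipe any existing layout, then create only a small partition and leave the
# rest of the flash untouched so the controller always has erased blocks handy
gpart destroy -F nvd0
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16g -l slog0 nvd0
# Repeat for the second device with label slog1, then add the mirrored log
zpool add po01 log mirror gpt/slog0 gpt/slog1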

I had actually written quite a bit on the topic into the ticket and then the stupid Bugzilla had apparently logged me out by the time I posted it, vanishing a fairly detailed post.

So here's the idea.

SSDs are actually huge-block devices. Sorta like SMR :smile: In order to cope with that, they allow writing to smaller "pages", sometimes 8192 bytes (which is where ashift 13 comes from). I still recommend the Ars tutorial on this as a general explainer of how the technology works.

So the thing here is that SSDs have this mechanism which consolidates pages into new blocks. This is predicated on the idea that an SSD has lots of user filesystem data on it. The normal workload for an SSD would have lots of data being retained long-term, while other data gets updated (Windows updates, you saved a spreadsheet, etc). And the average user who buys a 500GB SSD expects to be able to store ~500GB of stuff. So your typical SSD comes with only a little extra flash space. The drive tries to maintain a healthy pool of freshly erased blocks so that when you do a big write, it can do it quickly.

The problem is, SLOG isn't like that, at all. SLOG is short-term data, which will be read a maximum of one time, and then only under duress (replay during import).

If you have a 500GB SLOG on a 1Gbps ethernet NAS, the maximum amount that could be written in a five second transaction group is around 625MB, and since you also have one being committed to the pool, you really can't make effective use of more than maybe 2GB of SLOG. (Let's not argue jails or stuff, trying to get the basic idea across here.) However, if your FreeNAS creates a 500GB partition on that 500GB SLOG, what's going to happen is that the SLOG is written to the entire 500GB. It won't come around to LBA #123456 very often... about once an hour.
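
As a quick back-of-the-envelope check (assuming roughly 125 MB/s of line rate and a few transaction groups in flight):

Code:
# 1 Gbps is roughly 125 MB/s; a transaction group is flushed about every 5 s
echo "125 * 5" | bc        # ~625 MB per transaction group
# With the open txg plus the one being committed, and a little slack,
# a couple of GB is all the SLOG space that can ever hold dirty data
echo "125 * 5 * 3" | bc    # ~1875 MB, i.e. the ~2 GB figure above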

But, this is also stressing the controller a bit, because if you're constantly writing all 500GB, you still have a relatively small pool of erased blocks. Depending on how well the controller figures out what's going on, you might not be actively thrashing about doing tons of garbage collection, but this is still a stupid thing to do.

There are two general fixes.

One is to rely on TRIM. TRIM involves sending craptons of extra commands at the drive, and not all drives support it, or support it correctly. Those extra commands are chewing up precious bandwidth on the SATA/SAS link, increasing latency of actual SLOG writes. On the upside, it lets the host tell the SSD exactly what isn't needed anymore. This does have the advantage of being 100% correct, but only if the drive does something useful with the data, and at a performance penalty of needing to do those other transactions.

The other is to lean on statistics. If we know that our maximum possible SLOG usage is a certain amount, let's use the 2GB as an example, then we can be clever. Resetting the drive to factory resets all the page mappings and the drive will work its way through all the blocks and erase them. This is a required precondition for this trick; it does not matter if it happened at the factory on a new drive, or if you do it manually on an existing drive that has had data previously written to it. The end result is a guaranteed massive pile of erased blocks. Now you create your 2GB partition on that and start writing to it. Well, because you are writing sequentially, an SSD controller will tend to pick contiguous pages on the same block, but even if it doesn't, there are so many available erased blocks, it isn't a problem.

When you get to the end of your 2GB partition and cycle around back to partition sector #1, you still have around 480GB worth of erased blocks out there. The controller is not going to waste time trying to garbage collect and consolidate pages, it is under no pressure. This is extremely good for wear leveling, as you should never get an unnecessary update to a block. When it writes those first few hundred sectors, the underlying old flash block no longer has any references, gets thrown onto the dirty page pile, and gets erased at the drive's convenience.

If the controller is struggling to keep pace, the other thing is that it can soak up a "sprint" of continuous SLOG activity, up to 480GB's worth, even if it cannot be actively erasing old flash blocks.

And the thing is, unlike the TRIM case, this doesn't rely on TRIM working, doesn't involve bogging down the drive with extra TRIM commands, and simply plays to the natural design of SSDs, taking advantage of how they work to get optimized behaviour.

So from my perspective, using TRIM for this is basically an example of saying "I can't think my way through the underlying problems to reach an obvious solution."

Also, as far as I'm concerned, overprovisioning refers to the amount of extra flash a manufacturer includes in an SSD (for example a "500GB" SSD will typically have 512GB but only advertises 500). Underprovisioning refers to artificially increasing that pool by using less than the advertised amount. Unfortunately, precision is a lost cause.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I have been looking for ways to do a secure erase on my Optane SSD to start anew with a smaller partition, but did not find anything useful. I know the reasoning you laid out, but hopefully I am not the only one reading it. Thanks for the work!

As for terminology, coming from an ISP and hosting environment overprovisioning to me means selling more than you actually have, like in bandwidth, CPU cores, memory, ... Customers tend to not all request the maximum amount provisioned to them at the same time. Similarly in the data centre, 2x 10G LACP uplink for 48x 1G top of rack ports is plenty. Any one of those servers hardly ever uses even half of that 1G. That's what I would call overprovisioning. The terminology used here for SSDs always sounds the wrong way round for me. But I will get used to applying whatever is the standard and other folks understand.
 