Suddenly very slow write speeds

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
If defrag doesn't solve the issue, please post the output of jgreco's solnet-array-test.
It's going to be a while before I'm able to attempt to fix this, but I will report back with any results, even if I accidentally break the whole thing.
 

Axemann

Dabbler
Joined
Jun 6, 2015
Messages
21
Yeah, old. Been basically constantly online since 2018.

Couldn't even begin to tell you why the age timer might have reset. Zero clue.
Likely it hit 65,535 and rolled over (that's the ceiling of a 16-bit counter, which is presumably how the drive stores its power-on hours). Not helpful in the slightest for your issue, lol, but you can rule that out as having anything to do with it. :smile:
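If you want to see what the drive itself is reporting, smartmontools will dump the raw attribute; /dev/ada0 below is just a placeholder for your actual device:

Code:
# Show SMART attribute 9 (Power_On_Hours); swap /dev/ada0 for your device.
smartctl -A /dev/ada0 | grep -i power_on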
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What is a good definitive indicator of fragmentation being an issue?

If you are looking for something like a conventional fragmentation percentage, there isn't one. The difficulties here are that ZFS is a CoW filesystem and naturally lends itself to fragmentation, PLUS the "Z" in ZFS -- you have massive quantities of data and space that would need to be analyzed to see how much of it was in contiguous allocated space versus fragmented. You might be able to write something akin to ZFS scrub, which walks all the metadata on the pool and analyzes it for contiguousness (is that a word?) but I think it'd just open up a new thing for people to freak out about when it reported something they had "heard" was bad but didn't really understand.
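(Worth noting: "zpool list" does print a FRAG column, but that is fragmentation of the pool's free space, not of your files, so it is not the percentage people are usually hoping for. It's still a quick thing to glance at:)

Code:
# FRAG here is free-space fragmentation, NOT file fragmentation.
zpool list -o name,capacity,fragmentation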

I'd say that it's more a matter of applying some experience and intuition. If you understand that your workload is going to generate fragmentation, just assume that to be the case and apply mitigations (gobs of free space for writes, gobs of ARC+L2ARC for reads). Your life will be more pleasant. If you are not clear on the workload, check out disk I/O to see if it is unreasonably high compared to the system activity. Seeing high IOPS (busy percentage in gstat for ex.) and relatively low MByte/sec throughput to the disks is the fastest clue that you may be experiencing fragmentation, but you need the third leg there as well -- you need to know that the pool traffic generating that high IOPS plus low throughput is sequential access. Then you have a circumstantial indication of frag being an issue. You might also note that this "test" sucks on a busy system.
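A rough sketch of that spot check on CORE/FreeBSD (pool name "tank" is a made-up example; on SCALE/Linux you'd reach for iostat -x instead of gstat):

Code:
# Per-disk busy percentage and ops/s; high %busy with low kBps = seek-bound.
gstat -p
# Pool-wide operations vs. bandwidth, sampled every 5 seconds.
zpool iostat -v tank 5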

I've basically said for years that ZFS isn't good at this type of workload unless you massively resource it to mitigate the issues. The problem is that this is mostly a hobbyist forum and we don't have enough folks like @Stilez willing to provide the resources for a demanding application and then follow that to its end. People don't like hearing that they need to be able to hold their working set in ARC or L2ARC+ARC for fragmentation mitigation; they want their TrueNAS to be a $500 Synology or QNAP killer without the additional resource investment.
 
Joined
Jun 15, 2022
Messages
674
contiguousness (is that a word?)
Contiguosity? :grin:
You might be able to write something akin to ZFS scrub, which walks all the metadata on the pool and analyzes it for contiguousness but I think it'd just open up a new thing for people to freak out about when it reported something they had "heard" was bad but didn't really understand.
In Wisconsin we call "freaking out" "having a cow." That would mean the tool should be named "C.O.W. cow". :tongue:

And the defraggler would then be the "C.O.W. --cow" (for C-language people).
A thorough defraggle would be "C.O.W. !cow" (cow-bang-cow, or the "cow-banger"), so someone should hire me to work on this C.O.W.-banging* situation.


Seeing high IOPS (busy percentage in gstat for ex.) and relatively low MByte/sec throughput to the disks is the fastest clue that you may be experiencing fragmentation, but you need the third leg there as well.
Fortunately, historical statistics are our friend. (Unfortunately, most people don't understand statistics and/or hate them.) Trends change over time (such as when new software is introduced, or existing software is updated with "an enhanced feature set"), so logging this occasionally should help SysAdmins spot changes.
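A minimal sketch of that kind of logging (pool name, log path, and the hourly interval are all arbitrary picks):

Code:
# crontab entry: append a timestamped zpool iostat snapshot every hour
0 * * * * (/bin/date; /sbin/zpool iostat -v tank) >> /var/log/zpool-trend.log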


I've basically said for years that ZFS isn't good at this type of workload unless you massively resource it to mitigate the issues. The problem is that this is mostly a hobbyist forum and we don't have enough folks like @Stilez willing to provide the resources for a demanding application and then follow that to its end. People don't like hearing that they need to be able to hold their working set in ARC or L2ARC+ARC for fragmentation mitigation; they want their TrueNAS to be a $500 Synology or QNAP killer without the additional resource investment.
Thank you for the detailed reply I was hoping for, as that answers a question that really did need answering.

From a "mostly hobbyist" standpoint, wouldn't staying below 80% (or maybe 75%) disk usage mainly solve that problem? The home user is going to use:
  • Plex : Large sequential writes of static data.
  • Minecraft : I'm going to guess this needs to be treated like a database and have dedicated drives.
  • PUSS : Personal Use Sequential Storage (like Word and Excel files, where a change rewrites the whole file rather than updating a block or two)
It seems few (though some) people are using TrueNAS for storing accounting, tax, or business data, and if they are, they should be dedicating drives to those database-type applications.
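For checking where a pool sits against that 80% rule of thumb, something like this does it (pool name is a placeholder again):

Code:
# CAP is allocated space as a percentage of total pool size.
zpool list -o name,size,alloc,free,cap tank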

There is a professional photographer of late building a system here, which would also seem to require sets of drives dedicated to the tasks of:
  1. Short-term high-speed storage for current projects. (SSD Z2)
  2. Long-term high-capacity storage for completed projects. (HDD Z3)
  3. Long-term high-speed storage for financial data. (SSD Z3)
  4. Personal storage, unrelated to business purposes (HDD Z3) (which is a convenience factor as successful business owners are always working)
(pirate voice) And there be a Storinator-type box and LSI 16i + LSI 8i cards. Arrrrg!**
[Attachment: serious-pirate-emoji.jpg]



---
*C.O.W.-banging : Head-banging or a large number of disk track seeks caused by normal fragmentation of a Copy-On-Write filesystem.
**Spend the money, it's worth it! (Basically: don't waste thousands on data recovery; build a reliable system from the start.)
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
In Wisconsin we call "freaking out" "having a cow." That would mean the tool should be named "C.O.W. cow". :tongue:

And the defraggler would then be the "C.O.W. --cow" (for C-language people).
A thorough defraggle would be "C.O.W. !cow" (cow-bang-cow, or the "cow-banger"), so someone should hire me to work on this C.O.W.-banging* situation.

Kent Brodie would probably have a problem with this; it too closely resembles moocow and his employer might have a problem with your cow banging. This opens up lots of hysterical raisins.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
From a "mostly hobbyist" standpoint, wouldn't staying below 80% (or maybe 75%) disk usage mainly solve that problem?

For the sake of completeness, it is reasonable to note that the fragmentation issue has to do with the number of write cycles that happen. If you have a fresh pool and you exclusively write (WORM-drive style) to the pool, the pool will go as fast as possible until it is very nearly full (95-98%), at which point things will go to hell. This includes disabling superfluous writes such as atime updates. Some metadata is necessarily overwritten, but avoiding overwrites gives you a highly performant, mostly fragmentation-free pool.
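(The atime bit is just a dataset property; "tank/archive" below is a made-up example name:)

Code:
# Stop every read from generating an access-time metadata write.
zfs set atime=off tank/archive
# Verify the setting took:
zfs get atime tank/archive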

Our friends at Delphix have provided the opposite end of the spectrum, a study of pool fill vs steady state performance. When using a pool in a typical "departmental fileserver" type usage with overwrites and updates, your pool will get slower and slower over time as fragmentation takes hold and free space becomes harder to find.

[Attachment: delphix-small.png -- Delphix graph of steady-state pool performance vs. pool fill]


This is an experiment where a pool was filled to 10%, 25%, 50%, 75%, and 95%, and then THAT data was repeatedly overwritten until the pool had reached a "steady state" where performance was not degrading further. You can see that a 25% full pool is expected to be maybe 5x or 6x the speed of a 75% full pool ONCE it reaches that steady state. However, this assumes rewrites/overwrites are going to happen. Your pool might be an archival pool, in which case it might only ever add data, or only do a modest amount of overwriting during its lifetime; in that case your performance will not fall to these levels, because you're just not doing the many overwrite cycles that are the root cause of fragmentation.

Plex : Large sequential writes of static data.

But is it? I do not condone piracy but I recognize that lots of Plex users participate in USENET (ask me how I know) or BitTorrent and probably thrash the hell out of whatever share they use for video spool.
 
Joined
Jun 15, 2022
Messages
674
Thanks for the pool steady-state information; it's quite helpful to know what ZFS can digest and for how long.

[Attachment: MooCow.png]
 