Suddenly very slow write speeds

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
Hi Folks

Haven't been able to find another thread discussing this exact problem.

Came back from vacation and moved some files from my Windows PC to my TrueNAS storage, and the best it would give me was 355 kb/s.

This machine has been running for almost 5 years with transfer speeds I'm OK with. I updated to TrueNAS-13.0 a few months ago with no issues.

Some experimenting and I've found that there seem to be no issues copying or moving files from TrueNAS to my Windows PC. There also seem to be no issues moving files around inside TrueNAS. But if I try to copy a file from one place on TrueNAS to another, it nearly grinds to a halt.

Initially my storage utilization was at 85%, so I moved enough data off the TrueNAS to get it down to 74%. Rebooted it and waited overnight. That did not seem to help anything.

RAM and CPU do not seem to be taxed during the file copy. Disk utilization doesn't seem maxed out either.

I'm trying to figure out the right commands to run a dd test and how to run an iperf test, but I'm a slow learner and really don't want to mess anything up. I call this entire thing my house of cards: if I break it, it will take me a long time to figure out how to fix it.
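For what it's worth, the sort of commands I've seen suggested look roughly like this (I haven't run them yet; the test path under /mnt/tank and the IP address are just placeholders, and I understand /dev/zero results can be skewed by compression):

Code:
# Rough pool write-speed test (writes a ~10 GB throwaway file)
dd if=/dev/zero of=/mnt/tank/ddtest.bin bs=1M count=10000

# Rough read test of the same file, then clean up
dd if=/mnt/tank/ddtest.bin of=/dev/null bs=1M
rm /mnt/tank/ddtest.bin

# Network-only test: server side on the NAS...
iperf3 -s

# ...client side on the Windows PC, pointed at the NAS (placeholder address)
iperf3 -c 192.168.1.100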

That's a long way of asking: does anyone have a theory as to why my server would suddenly have horrendous write speeds? Are there any "do this first" tests I should run that might illuminate the problem?

Thank you all.

root@freenas[~]# zpool status
  pool: freenas-boot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:14 with 0 errors on Tue Mar 21 03:45:14 2023
config:

        NAME            STATE     READ WRITE CKSUM
        freenas-boot    ONLINE       0     0     0
          ada4p2        ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 12:01:37 with 0 errors on Wed Mar 22 12:01:37 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/4f64359d-1206-11e9-a206-ac1f6b60c826  ONLINE       0     0     0
            gptid/50f5c733-1206-11e9-a206-ac1f6b60c826  ONLINE       0     0     0
            gptid/52bd6456-1206-11e9-a206-ac1f6b60c826  ONLINE       0     0     0
            gptid/544a2e08-1206-11e9-a206-ac1f6b60c826  ONLINE       0     0     0

errors: No known data errors
root@freenas[~]#
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399

Your pool is too fragmented.
 

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35

Your pool is too fragmented.
Thanks for responding!

root@freenas[~]# zpool get fragmentation
NAME          PROPERTY       VALUE  SOURCE
freenas-boot  fragmentation  -      -
tank          fragmentation  16%    -

It looks like the primary remedies for an overly fragmented pool are:
- add more drives in a second vdev to expand the pool and increase total storage, thus increasing free space for writing
- move everything out of the pool and back into the pool, thus reducing overall fragmentation
- something else?

Thank you.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
tank fragmentation 16% -

I don't believe 16% fragmentation should grind the pool to virtually a standstill.

What could behave similarly is when one drive is significantly slower than the others.
I.e., it might be on its way out...
What's the SMART status on the drives? smartctl -a /dev/adaX

The solnet-array-test script is something posted here on the forums that helps speed-test drives on a live pool.
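The SMART check, for example, could look something like this (assuming the four pool members show up as ada0 through ada3 on your box; camcontrol devlist will confirm the names):

Code:
# SMART health, attributes, and self-test log for each pool member
for d in ada0 ada1 ada2 ada3; do
    echo "=== /dev/$d ==="
    smartctl -a /dev/$d
done

# Optionally start a long self-test on a suspect drive (runs in the background)
smartctl -t long /dev/ada0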
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
I don't believe 16% fragmentation should grind the pool to virtually a standstill.
Well, OP also said his pool was at 85% full, but he got it to 74% full. His pool's free space is too fragmented.
It looks like the primary remedies for an overly fragmented pool are:
- add more drives in a second vdev to expand the pool and increase total storage, thus increasing free space for writing
- move everything out of the pool and back into the pool, thus reducing overall fragmentation
There's also backing up the pool, destroying it, recreating it, and then restoring the contents from backup.
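Roughly speaking, that route looks something like the snippet below; "backup" is a placeholder for wherever your copy lives, and the -R/-F flags deserve a careful read of the man page before relying on this sketch:

Code:
# Snapshot everything recursively and replicate it off the pool
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F backup/tank

# ...destroy and recreate tank with the new layout, then restore...
zfs send -R backup/tank@migrate | zfs receive -F tank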
 


Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
I don't believe 16% fragmentation should grind the pool to virtually a standstill.

What could behave similarly is when one drive is significantly slower than the others.
I.e., it might be on its way out...
What's the SMART status on the drives? smartctl -a /dev/adaX

The solnet-array-test script is something posted here on the forums that helps speed-test drives on a live pool.
SMART status on the drives:

Thank you for looking at this.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
SMART status on the drives:

Thank you for looking at this.
Nothing screams broken drives from that.
Old, but not necessarily broken.
67k power-on hours!

Also, it's peculiar how the age timer somehow got reset here?
Code:
SMART Self-test log structure revision number 1
Num  Test_Description     Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline        Completed without error       00%             1666  -
# 2  Extended offline     Completed without error       00%             1512  -
# 3  Short offline        Completed without error       00%             1331  -
# 4  Short offline        Completed without error       00%              995  -
# 5  Extended offline     Completed without error       00%              841  -
# 6  Short offline        Completed without error       00%              659  -
# 7  Short offline        Completed without error       00%              250  -
# 8  Extended offline     Completed without error       00%               98  -
# 9  Short offline        Completed without error       00%            65450  -
#10  Short offline        Completed without error       00%            65043  -
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't believe 16% fragmentation should grind the pool to virtually a standstill.

The number doesn't represent fragmentation, it represents the difficulty the system is having finding free space -- it's almost a worthless number. And yes, high levels of fragmentation especially on a pool that has previously been nearly filled may cause huge amounts of I/O that bring the pool to a crawl.
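(Both numbers are easy to pull up side by side, e.g. something along these lines:)

Code:
# Shows fill level and the free-space "fragmentation" metric for the pool
zpool list -o name,size,capacity,fragmentation tank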
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
This might be useful in reducing your fragmentation if you don't want to restore from backup.
arc_summary output could be interesting.
 

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
This might be useful in reducing your fragmentation if you don't want to restore from backup.
arc_summary output could be interesting.
arc_summary output here:

Thanks for looking at this!

I'll read through that other thread when I've got some time.
 

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
Nothing screams broken drives from that.
Old, but not necessarily broken.
67k power-on hours!

Also, it's peculiar how the age timer somehow got reset here?
Yeah, old. Been basically constantly online since 2018.

Couldn't even begin to tell you why the age timer might have reset. Zero clue.
 

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
The number doesn't represent fragmentation, it represents the difficulty the system is having finding free space -- it's almost a worthless number. And yes, high levels of fragmentation especially on a pool that has previously been nearly filled may cause huge amounts of I/O that bring the pool to a crawl.
It's looking like excessive fragmentation is the likely culprit.

I have a SAS expansion card, and 4 new hard drives (identical to the existing ones) and a plan to expand my pool with a second vdev.

When I find the time to do that, it seems like it might be best to just destroy the pool, create a new double-size pool, and then restore from backup. Is there a significantly better way to do this?
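In case the expansion route wins out instead, the rough shape of adding a second raidz1 vdev from the CLI looks like the following (the daX names are placeholders for whatever the new drives enumerate as; I'd probably still use the GUI's pool expansion flow, which handles partitioning and gptid labels for you):

Code:
# Dry run first: -n shows the resulting layout without changing anything
zpool add -n tank raidz1 da4 da5 da6 da7

# If the proposed layout looks right, do it for real
zpool add tank raidz1 da4 da5 da6 da7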

Thanks for looking at this.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Yeah, old. Been basically constantly online since 2018.

Couldn't even begin to tell you why the age timer might have reset. Zero clue.

Could be simple overflow. Looks like the max value is 65535 (the self-test log stores the lifetime hours in a 16-bit field), so 65536 + 1666 ≈ 67,200 hours, which lines up with the ~67k power-on hours.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
arc_summary output here:

Thanks for looking at this!

I'll read through that other thread when I've got some time.
Your ARC values look normal at first glance.
 
Joined
Jun 15, 2022
Messages
674
I have a SAS expansion card, and 4 new hard drives (identical to the existing ones) and a plan to expand my pool with a second vdev.

When I find the time to do that, seems like it might be best to just destroy the pool and create a new double size pool and then restore from backup. Is there a significantly better way to do this?
See @Davvo's post on:
Simple bash script to rebalance pool data between all mirrors when adding vdevs to a pool:
This might be useful in reducing your fragmentation if you don't want to restore from backup.
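The general idea behind that kind of script is the copy-and-replace approach below (only a rough sketch, not the actual script from that thread; the dataset path is a placeholder, and you'd want snapshots/backups squared away first):

Code:
#!/bin/sh
# Rewriting each file forces ZFS to allocate fresh blocks, which the
# allocator spreads across all vdevs, including newly added ones.
# Sketch only: doesn't handle hardlinks, snapshots, or exotic filenames.
DATASET=/mnt/tank/mydata   # placeholder: point at a real dataset path

find "$DATASET" -type f | while read -r f; do
    tmp="$f.rebalance.tmp"
    # Copy preserving timestamps/permissions, then replace the original.
    cp -p "$f" "$tmp" && mv "$tmp" "$f"
done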
 

Shoop83

Dabbler
Joined
May 2, 2018
Messages
35
See @Davvo's post on:
Simple bash script to rebalance pool data between all mirrors when adding vdevs to a pool:
Very cool. Hadn't read that post yet. That looks a lot nicer than taxing my backup system.

Thank you for looking at this.
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
The number doesn't represent fragmentation, it represents the difficulty the system is having finding free space -- it's almost a worthless number. And yes, high levels of fragmentation especially on a pool that has previously been nearly filled may cause huge amounts of I/O that bring the pool to a crawl.
In retrospect I can see how it is far more important to know the 'history of utilization' rather than the current fragmentation number.
I've got some hands-on experience in that department too.

I probably should've paid better attention to the OP; had I read those parts more carefully, other clues would've emerged.

Interesting note on the fragmentation number. I can see what it does, and I can also see how it would be less ambiguous if it were labeled differently.
 
Joined
Jun 15, 2022
Messages
674
The number doesn't represent fragmentation, it represents the difficulty the system is having finding free space -- it's almost a worthless number. And yes, high levels of fragmentation especially on a pool that has previously been nearly filled may cause huge amounts of I/O that bring the pool to a crawl.
What is a good definitive indicator of fragmentation being an issue?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
What is a good definitive indicator of fragmentation being an issue?
I would think slow write speeds, and possibly slower reads too (though those are harder to notice thanks to the ARC) if we're talking about extreme situations.
Anyway, IMHO keeping that number low on spinners can't be a bad thing.

I wouldn't guarantee fragmentation being the issue here, though.
If defrag doesn't solve the issue, please post the output of jgreco's solnet array test.
 