ZFS crashing when writing for too long on SSD

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Hi everyone,


I've got an issue with ZFS. When I transfer files (larger than 3 GB) from my NAS to my desktop, I don't run into any issues or poor performance.

But if I transfer a file from my desktop to my NAS, then after a couple of seconds I see very high iowait, and when I run k3s kubectl get pods -A, the restart count of my zfs-controller pod keeps increasing.

It doesn't only happen when transferring from the client to the NAS, though; it also happens whenever the NAS writes data in general.

I ran SMART tests and scrubs, but everything looks fine.

I also have a netdata screenshot:
schem.png

* One detail: when ZFS crashes, I stop the file transfer.


Here's my configuration:
Code:
Motherboard: ASUS G11CD-K Motherboard
CPU : i5-7400 3.00GHz
RAM : 8 GB (currently using 4 GB)
Network controller: RTL8111

Boot disk: KINGSTON_SA400S37120G
Disk with the issue: CT480BX500SSD1


Have a nice day :smile:
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Addendum: the netdata capture shows CPU usage.
I bought my CT480BX500SSD1 recently.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Hello everyone and thanks for your answers,

What's the drive vendor and type?
Is it only 1 SSD?
@sretalla is correct. I know it's not recommended, but I'm saving up to buy three 4 TB SATA SSDs.

That's probably the issue... driver dumping under heavy load.
That's what I thought at first. I read on the forum that Realtek drivers can be flaky. But I tried writing to my disk from the NAS CLI (dd if=/dev/random of=/mnt/somePool/test bs=1G count=100) with the Ethernet cable unplugged. I waited until it finished, and when I plugged the cable back in and checked my netdata page, I saw the same behavior with high iowait. Do you think it could still be caused by the Realtek driver? I don't see how the driver could cause this with no cable plugged in (I might be wrong, I just want to be sure).


Have a nice day :smile:
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Addendum: I forgot to answer: yes, I only have one SSD for data and applications :-(
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
The Micron SSD might be going into a very slow garbage collection process.
You would see high latency on the drive.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
dd if=/dev/random
You're also testing your CPU with that, maybe you have heat issues.

I would suggest turning off compression on a test dataset and running the test again with if=/dev/zero
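
A minimal sketch of that test, assuming GNU dd (as shipped on TrueNAS SCALE) and a hypothetical test dataset named somePool/ddtest. The pool name comes from the dd command quoted above; the dataset name, the function name write_test, and the sizes are made up for illustration. Compression needs to be off because a stream of zeros compresses to almost nothing, so with compression on the SSD would barely be written to.

```shell
# write_test TARGET_FILE SIZE_MIB
# Writes SIZE_MIB mebibytes of zeros to TARGET_FILE. conv=fdatasync asks the
# kernel to flush the data before dd exits, so the run does not end early
# while data is still buffered in memory.
write_test() {
    wt_target=$1
    wt_size_mib=$2
    dd if=/dev/zero of="$wt_target" bs=1M count="$wt_size_mib" conv=fdatasync status=none
}

# Example (assumed names; run as root on the NAS while watching netdata):
#   zfs create -o compression=off somePool/ddtest
#   write_test /mnt/somePool/ddtest/zero.bin 20480    # 20 GiB of zeros
#   zfs destroy somePool/ddtest
```

Unlike /dev/random, /dev/zero costs almost no CPU to read, so any iowait seen during this run should come from the storage side rather than from generating the data.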
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Hello,

You're also testing your CPU with that, maybe you have heat issues.

I would suggest turning off compression on a test dataset and running the test again with if=/dev/zero
This is my CPU temperature (from the CPU sensors) :
librewolf_6AacttZomH.png

The highlighted part shows the interval where my CPU usage is at 100% due to the io wait.
These results are consistent with the TrueNAS reporting.

Something I noticed this morning: while the disk is writing, it's also reading, and the two graphs look symmetrical. Could that be the cause of the iowait?
librewolf_BA1yP3jLTP.png


Furthermore, I tested on an uncompressed, unencrypted pool, and I still have the issue.


Have a nice day :smile:
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Are you doing sync writes? That may explain it... writes to the ZIL... in-pool as it's only a single disk...

Perhaps set sync=disabled on that dataset and test again.
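
One way to check this across the whole pool is sketched below. It assumes the pool is called somePool (from the dd command earlier in the thread); the helper name sync_not_disabled and the dataset name somePool/temp are hypothetical. Keep in mind that sync=disabled means recently acknowledged writes can be lost on a crash or power cut, so it is better suited to a test than to a permanent setting.

```shell
# sync_not_disabled: reads "name<TAB>value" lines on stdin (the format printed
# by `zfs get -H -o name,value sync`) and prints every dataset whose sync
# setting is anything other than "disabled".
sync_not_disabled() {
    awk -F '\t' '$2 != "disabled" { print $1 " (sync=" $2 ")" }'
}

# Example (assumed pool and dataset names; run as root on the NAS):
#   zfs get -r -H -t filesystem,volume -o name,value sync somePool | sync_not_disabled
#   zfs set sync=disabled somePool/temp
```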
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Are you doing sync writes? That may explain it... writes to the ZIL... in-pool as it's only a single disk...

Perhaps set sync=disabled on that dataset and test again.

Well, it's supposed to be disabled in my temp dataset. But indeed, others were in "Standard" mode. Now they are all disabled.

After another write test, I still have the issue.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Back to RAM then... you have 8GB (you say it's using 4)... does that remain true during the operation?
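
A rough way to answer that from the NAS shell is sketched below, assuming a Linux-based system (TrueNAS SCALE) where /proc/meminfo and /proc/spl/kstat/zfs/arcstats are available. The two helper names are made up for this example.

```shell
# mem_available_mib: reads /proc/meminfo-style text on stdin and prints the
# MemAvailable value converted from kB to MiB.
mem_available_mib() {
    awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }'
}

# arc_size_mib: reads arcstats-style text ("name type data" per line) on stdin
# and prints the current ARC size converted from bytes to MiB.
arc_size_mib() {
    awk '$1 == "size" { printf "%d\n", $3 / 1048576 }'
}

# Example: log both values every 5 seconds while the write test is running.
#   while sleep 5; do
#       echo "$(date +%T) avail=$(mem_available_mib < /proc/meminfo)MiB arc=$(arc_size_mib < /proc/spl/kstat/zfs/arcstats)MiB"
#   done
```

If the available figure collapses or the ARC shrinks sharply just before the stall, that would point toward memory pressure; if both stay flat, the disk itself becomes the more likely suspect.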
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
BTW, you're not running with enough RAM according to the documentation...


It says you need 8GB for basic functions, but you need to add more for Apps, VMs, etc. (which it seems you have, given that you mention k3s).
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
ZFS is well known to not like being starved of memory, so I think we have your answer.
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Back to RAM then... you have 8GB (you say it's using 4)... does that remain true during the operation?
Not entirely, but usage never rises enough to leave less than 1 GB free. Here are 2 screenshots:
librewolf_Oe9KW9zTUM.png

1690286613939.png

The highlighted part shows the interval where my CPU usage is at 100% due to the io wait.

I have got 2 applications currently running:
- Netdata
- Vaultwarden
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Are you using swap space anytime during the process?
8GB of RAM feels tight, especially on SCALE.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I understand that it will make it more difficult to see what's happening, but can you reproduce the problem (zfs crash) with all apps stopped?
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
Hello @Davvo
Are you using swap space anytime during the process?
8GB of RAM feels tight, especially on SCALE.

Not a single change.
1690289610632.png

The highlighted part shows the interval where my CPU usage is at 100% due to the io wait.

I understand that it will make it more difficult to see what's happening, but can you reproduce the problem (zfs crash) with all apps stopped?

Sure, gonna try this.
 

nooblard

Dabbler
Joined
Jul 23, 2023
Messages
13
ZFS crashed again, and the reporting graph stopped updating for a few seconds.
1690290062741.png


1690290143943.png


I just want to note that this is the first time I've seen the CrashLoopBackOff status. I remember reading a post about CrashLoopBackOff, and I'm trying to find it.
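
For anyone following along, here is a small sketch for narrowing down which pods are restarting and why. It only relies on the standard column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the helper name pods_with_restarts is made up, and <namespace> and <pod> are placeholders to fill in from the output.

```shell
# pods_with_restarts: reads `kubectl get pods -A` output on stdin and prints
# "namespace/pod restarts status" for every pod that has restarted at least once.
pods_with_restarts() {
    awk 'NR > 1 && $5 + 0 > 0 { print $1 "/" $2, $5, $4 }'
}

# Example:
#   k3s kubectl get pods -A | pods_with_restarts
#   k3s kubectl -n <namespace> describe pod <pod>        # events and last container state
#   k3s kubectl -n <namespace> logs <pod> --previous     # logs from the crashed container
```

The describe output and the previous container's logs usually say whether the container was killed for running out of memory, failed a health check, or exited on its own.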
 