ZFS crashing when writing for too long on SSD

nooblard · Jul 23, 2023

Hi everyone,

I've got an issue with ZFS. When I transfer files (larger than 3 GB) from my NAS to my desktop, I don't run into any issues or poor performance.

But if I transfer a file from my desktop to my NAS, then after a couple of seconds iowait climbs sharply, and when I run k3s kubectl get pods -A, the restart count of my zfs-controller pod keeps increasing.

It doesn't only happen when transferring from the client to the NAS; it also happens whenever the NAS writes data in general.

I ran SMART tests and scrubs, but everything looks fine.

I also got a netdata screenshot:

* One detail: when ZFS crashes, I stop the file transfer.

Here's my configuration :

Code:

Motherboard: ASUS G11CD-K Motherboard
CPU : i5-7400 3.00GHz
RAM : 8 GB (currently using 4 GB)
Network controller: RTL8111

Boot disk: KINGSTON_SA400S37120G
Disk where I got the issue: CT480BX500SSD1

Have a nice day

nooblard · Jul 23, 2023

Addendum: The netdata capture shows the CPU usage.
My CT480BX500SSD1 was bought recently.

morganL · Jul 23, 2023

nooblard said:
Addendum: The netdata capture shows the CPU usage.
My CT480BX500SSD1 was bought recently.

What's the drive vendor and type?
Is it only 1 SSD?

sretalla · Jul 24, 2023

morganL said:
What's the drive vendor and type?

it's this:

Crucial BX500 480GB 3D NAND SATA 2.5-inch SSD | CT480BX500SSD1 | Crucial.com

www.crucial.com

nooblard said:
Network controller: RTL8111

That's probably the issue... driver dumping under heavy load.

nooblard · Jul 24, 2023

Hello everyone and thanks for your answers,

morganL said:
What's the drive vendor and type?
Is it only 1 SSD?

@sretalla is correct. I know it's not recommended, but I'm saving up to buy three 4 TB SATA SSDs.

sretalla said:
That's probably the issue... driver dumping under heavy load.

That's what I thought at the beginning. I read on the forum that Realtek drivers can be flaky. But I tried writing to my disk from the NAS CLI (dd if=/dev/random of=/mnt/somePool/test bs=1G count=100) with the Ethernet cable unplugged. I waited until it finished, and when I plugged the cable back in and checked my netdata page, I saw the same behavior with high iowait. Do you think it could still be caused by the Realtek driver? I don't see how the driver could cause that with no cable plugged in (I might be wrong, I just want to be sure).

Have a nice day

nooblard · Jul 24, 2023

Addendum: I forgot to answer: yes, I only have 1 SSD for data and applications :-(

morganL · Jul 24, 2023

The Micron SSD might be going into a very slow garbage collection process.....
You would see high latency on drives.

nooblard · Jul 24, 2023

morganL said:
The Micron SSD might be going into a very slow garbage collection process.....
You would see high latency on drives.

Well, that's unfortunate... Is there any way to confirm this?

sretalla · Jul 25, 2023

nooblard said:
dd if=/dev/random

You're also testing your CPU with that, maybe you have heat issues.

I would suggest turning off compression on a test dataset and running the test again with if=/dev/zero
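
A minimal sketch of that test, assuming the pool is called somePool and is mounted under /mnt (both taken from the dd path earlier in the thread) and using a throwaway dataset named iotest (a hypothetical name). The run helper is only there so the commands can be previewed with DRY_RUN=1 before anything is created.

```shell
#!/bin/sh
# Sketch only: POOL and DATASET are assumptions, adjust them to your system.
POOL="${POOL:-somePool}"
DATASET="${DATASET:-iotest}"   # hypothetical throwaway dataset name

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create an uncompressed dataset and write 20 GiB of zeros into it.
write_test() {
    # compression=off so the zeros are actually written to the disk.
    run zfs create -o compression=off "$POOL/$DATASET" || return 1
    # /dev/zero costs almost no CPU, unlike /dev/random.
    run dd if=/dev/zero of="/mnt/$POOL/$DATASET/zero.bin" bs=1M count=20480 status=progress
}

# Remove the test dataset (and the file in it) once you are done.
cleanup_test() {
    run zfs destroy "$POOL/$DATASET"
}
```

Calling write_test with DRY_RUN=1 set prints the two commands without touching the pool; calling it without DRY_RUN runs them, and cleanup_test removes the dataset afterwards.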

nooblard · Jul 25, 2023

Hello,

sretalla said:
You're also testing your CPU with that, maybe you have heat issues.

I would suggest turning off compression on a test dataset and running the test again with if=/dev/zero

This is my CPU temperature (from the CPU sensors):

The highlighted part shows the interval where my CPU usage is at 100% due to the iowait.
These results are consistent with the TrueNAS reporting.

Something I noticed this morning is that while I'm writing, the disk is also reading, and the two graphs look symmetrical. Could that be the cause of the iowait?

Furthermore, I tested on an uncompressed, unencrypted pool, and I still have the issue.

Have a nice day

sretalla · Jul 25, 2023

Are you doing sync writes? That may explain it... writes to the ZIL... in-pool as it's only a single disk...

Perhaps set sync=disabled on that dataset and test again.
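
A small sketch for checking which datasets would still issue sync writes. It filters the tab-separated output of zfs get -H; the pool name somePool comes from earlier in the thread and the dataset name iotest is hypothetical.

```shell
#!/bin/sh
# List datasets whose sync property is not "disabled".
# Reads "name<TAB>value" lines on stdin, the format produced by
# `zfs get -H -o name,value sync ...`.
sync_not_disabled() {
    awk -F '\t' '$2 != "disabled" { print $1 " (sync=" $2 ")" }'
}

# Usage on the NAS (pool and dataset names are assumptions):
#   zfs get -H -r -o name,value sync somePool | sync_not_disabled
#   zfs set sync=disabled somePool/iotest
```

An empty result means every dataset in the pool already has sync=disabled, so sync writes to the in-pool ZIL are unlikely to be the cause.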

nooblard · Jul 25, 2023

sretalla said:
Are you doing sync writes? That may explain it... writes to the ZIL... in-pool as it's only a single disk...

Perhaps set sync=disabled on that dataset and test again.

Well, it's supposed to be disabled in my temp dataset. But indeed, others were in "Standard" mode. Now they are all disabled.

After another write test, I still have the issue.

sretalla · Jul 25, 2023

Back to RAM then... you have 8GB (you say it's using 4)... does that remain true during the operation?
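
One way to answer that from the shell while the write is running, as a rough sketch: it reads MemAvailable from /proc/meminfo and the ARC size from /proc/spl/kstat/zfs/arcstats (the usual OpenZFS location on Linux, assumed here). The function names are made up for this example.

```shell
#!/bin/sh
# Print MemAvailable in MiB from a meminfo-style file (value is in kB).
mem_available_mib() {
    awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Print the current ARC size in MiB from an arcstats-style file
# (lines are "name type value", value in bytes).
arc_size_mib() {
    awk '$1 == "size" { printf "%d\n", $3 / 1048576 }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Print both values every N seconds (default 2) until interrupted.
watch_memory() {
    while :; do
        echo "$(date +%T) available=$(mem_available_mib) MiB arc=$(arc_size_mib) MiB"
        sleep "${1:-2}"
    done
}
```

Running watch_memory in a second SSH session during the transfer shows whether available memory collapses at the moment iowait spikes.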

sretalla · Jul 25, 2023

BTW, you're not running with enough RAM according to the documentation...

TrueNAS Hardware Guide

Provides information on system hardware and minimum requirements (CPU, memory, etc.). Covers storage considerations, media, controllers, device sizing and cooling, SAS expanders.

www.truenas.com

It says you need 8GB for basic functions, but need to add more for Apps, VMs, etc. (which it seems you have, given that you mention k3s).

sretalla · Jul 25, 2023

ZFS is well known to not like being starved of memory, so I think we have your answer.

nooblard · Jul 25, 2023

sretalla said:
Back to RAM then... you have 8GB (you say it's using 4)... does that remain true during the operation?

Not entirely, but usage never rises enough to leave less than 1GB free. Here are 2 screenshots:

The highlighted part shows the interval where my CPU usage is at 100% due to the iowait.

I currently have 2 applications running:
- Netdata
- Vaultwarden

Davvo · Jul 25, 2023

Are you using swap space anytime during the process?
8GB of RAM feels tight, especially on SCALE.
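
A quick sketch for checking this without the graphs, assuming the standard Linux /proc/swaps layout (a header line, then one line per swap device with the used size in KiB in the fourth column). The function name is hypothetical.

```shell
#!/bin/sh
# Sum the "Used" column (KiB) over all swap devices in a /proc/swaps-style file.
swap_used_kib() {
    awk 'NR > 1 { used += $4 } END { print used + 0 }' "${1:-/proc/swaps}"
}

# Usage: sample before and during the write test and compare.
#   swap_used_kib
```

If the number grows during the transfer, the system is swapping; vmstat 1 shows the same thing live in its si and so columns.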

sretalla · Jul 25, 2023

I understand that it will make it more difficult to see what's happening, but can you reproduce the problem (zfs crash) with all apps stopped?

nooblard · Jul 25, 2023

Hello @Davvo

Davvo said:
Are you using swap space anytime during the process?
8GB of RAM feels tight, especially on SCALE.

Not a single change.

The highlighted part shows the interval where my CPU usage is at 100% due to the io wait.

sretalla said:
I understand that it will make it more difficult to see what's happening, but can you reproduce the problem (zfs crash) with all apps stopped?

Sure, gonna try this.

nooblard · Jul 25, 2023

ZFS crashed again, and the reporting graph stopped updating for a few seconds.

I just want to note that this is the first time I've seen the status CrashLoopBackOff. I remember reading a post about CrashLoopBackOff, and I'm trying to find it.
