webui becomes unresponsive and nvme error after 24+ hours

JoelJames · Jan 16, 2022

TrueNAS-SCALE-22.02-RC.2
Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
32GB RAM
boot nvme 256GB Samung 860 EVO
2 x 500GB Samsung 850 pro
Please let me know if you need more information.

Ran the system for about 3 weeks using TRUENAS Core and never had a problem. I decided to try TrueNAS Scale, fresh install, and after about 24 hours webui and ssh becomes unresponsive. I can ping the system and the VMs still work. I have to hard reboot the system to get the webui to start working. I re-installed TrueNAS Scale just in case it was a bad install and still the same. After about 24 hours webui stops responding.

I see in the error log: kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1. But like I said I never had this issue with TrueNAS Core.

I'm leaning towards a bad drive but thought I would see if anyone else has had this issue first.

Here is what is in the error log.

Jan 16 01:14:01 truenas CRON[4026655]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory
Jan 16 01:15:40 truenas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan 16 01:16:01 truenas CRON[4029766]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory
Jan 16 01:16:11 truenas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan 16 01:16:38 truenas kernel: INFO: task txg_sync:580 blocked for more than 120 seconds.
Jan 16 01:16:38 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:16:38 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:16:38 truenas kernel: INFO: task collectd:7721 blocked for more than 120 seconds.
Jan 16 01:16:38 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:16:38 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:16:42 truenas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan 16 01:16:42 truenas kernel: blk_update_request: I/O error, dev nvme0n1, sector 203940584 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jan 16 01:16:42 truenas kernel: blk_update_request: I/O error, dev nvme0n1, sector 202647336 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jan 16 01:17:01 truenas CRON[4031199]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory
Jan 16 01:18:01 truenas CRON[4032654]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory
Jan 16 01:20:01 truenas CRON[4035271]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory
Jan 16 01:20:40 truenas kernel: INFO: task systemd:1 blocked for more than 120 seconds.
Jan 16 01:20:40 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:40 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:40 truenas kernel: INFO: task txg_sync:580 blocked for more than 121 seconds.
Jan 16 01:20:40 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:40 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:40 truenas kernel: INFO: task asyncio_loop:995 blocked for more than 121 seconds.
Jan 16 01:20:40 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:40 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:40 truenas kernel: INFO: task loop_monitor:1650 blocked for more than 121 seconds.
Jan 16 01:20:40 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:40 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:40 truenas kernel: INFO: task collectd:7721 blocked for more than 121 seconds.
Jan 16 01:20:40 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:40 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:41 truenas kernel: INFO: task k3s-server:27045 blocked for more than 121 seconds.
Jan 16 01:20:41 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:41 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:41 truenas kernel: INFO: task Plex Media Serv:19113 blocked for more than 121 seconds.
Jan 16 01:20:41 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:41 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:20:41 truenas kernel: INFO: task zed:4030714 blocked for more than 122 seconds.
Jan 16 01:20:41 truenas kernel: Tainted: P W OE 5.10.81+truenas #1
Jan 16 01:20:41 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 01:22:01 truenas CRON[4037851]: pam_env(cron:session): Unable to open env file: /etc/default/locale: No such file or directory

Thanks,
Joel

jgreco · Jan 16, 2022

JoelJames said:
boot nvme 256GB Samung 860 EVO

Well there's your problem, that's a SATA SSD, not NVMe...

JoelJames · Jan 21, 2022

I replaced the the drive with a new Samsung 980 Pro and so far at almost 30 hours no issues.

Joel

piwko28 · Sep 2, 2022

Hi @JoelJames , still no issues?

I have same problem with unresponsive UI and CLI, both over ssh and directly by connecting screen and keyboard.
It happens after night, but all services (smb, apps) seems to work properly and fast.
First access to web ui shows "Checking HA status" message, then it becomes unavailable at all.

I have two HDD in RAID0 and the pool is degraded. I replaced one of the disk recently, but the second one still reports sectors reading errors. After your post I sespect it as a reason, but I ran TrueNAS Core for last few years and the pool reports to be degraded for last 3 months with no issues. The problems started when I upgraded the system to TrueNAS Scale.

I don't think that it should work like this. This is server and it should work no matter what hardware issues are - the issues should be reported and defective elements should be replaced, and that is how it had worked before with FreeBSD. Sympthoms like unresponsive UI sounds like serious bug.

piwko28 · Sep 2, 2022

But it may be something else in my case...
I found this jira issue: https://ixsystems.atlassian.net/browse/NAS-114442?attachmentViewMode=list
I do backup of apps during the night.

sretalla · Sep 2, 2022

piwko28 said:
I have two HDD in RAID0

piwko28 said:
This is server and it should work no matter what hardware issues are

I think your desired performance and your configuration don't match.

Even worse considering:

piwko28 said:
I have two HDD in RAID0 and the pool is degraded

If your only data pool is RAID0 (I assume you mean striped VDEV) which is degraded, you have issues with your system dataset (which will be on that pool) which would explain the hanging that you're seeing.

piwko28 · Sep 2, 2022

sretalla said:
If your only data pool is RAID0

Ah, sorry, mistake. RAID1 of course, mirrored. Second disk is marked as degraded - SMART limits exceeded.
The weird thing is that it worked fine on Core even when two disks were marked as degraded (one of them was replaced recently).

So key things are:
* Core is more stable on that regard than Scale
* Everything works fine but "middlewared" (web ui and cli) and there is no option to restart it (because interface to restart it hanged)

sretalla · Sep 2, 2022

So maybe I misunderstood, but you're saying you had a Mirrored VDEV with 2 degraded disks in it and did nothing to address that (meaning fixing both), but then changed OS to SCALE and expected that nothing would devolve with your degraded disk to arrive at a worse situation?

I think the variable to watch in that situation isn't OS, it's time.

Depending on what exactly was wrong with your degraded disk and what controller hardware you have connected to it, it may have just been too much for the controller to handle continuous retries and it's causing the middleware issues dealing with the errors in both directions.

piwko28 · Sep 2, 2022

I didn't know what exactly "degraded" means (ie. smart tests are only suggestions). However I will replace second disk soon to have healthy pool.
You may be right, because middleware is on boot pool which is degraded and apps are on different pool and that may be clue why it works without any issues.
So it seems it is quite important to have healthy boot pool on Scale.

sretalla said:
So maybe I misunderstood, but you're saying you had a Mirrored VDEV with 2 degraded disks in it and did nothing to address that (meaning fixing both), but then changed OS to SCALE and expected that nothing would devolve with your degraded disk to arrive at a worse situation?

It sounds like I'm insane with this summary ;) But I assumed that OS have nothing to say when my hardware handled it before (on Core).

I agree that devices should be healthly, but how could I know it when middleware does not work? IMO web ui or cli should work always to provide an interface with hardware info and possibility to replace degraded devices. And Core meets the requirements, Scale does not.

sretalla said:
I think the variable to watch in that situation isn't OS, it's time.

Did you mean that time made matters worse and it's just coincident? Maybe... I can't check it currently.

jgreco · Sep 2, 2022

piwko28 said:
I didn't know what exactly "degraded" means (ie. smart tests are only suggestions).

Yes, SMART tests are only suggestions. They have nothing at all to do with ZFS, they are estimations of drive health provided by the HDD. This is very useful because ZFS does not include any mechanism to do comprehensive drive testing -- ZFS will only test allocated disk space during a scrub, for example. We suggest doing both SMART tests and also ZFS scrubs to get maximum protection against disk failures, as these are entirely separate mechanisms. ZFS reporting an array as degraded means ZFS is seeing problems with the array, and is a prompt for an administrator to pay attention and resolve the problem. In a previously stable array, it is a predictor of a future failure.

piwko28 said:
IMO web ui or cli should work always to provide an interface with hardware info and possibility to replace degraded devices. And Core meets the requirements, Scale does not.

This is false. Both Scale and Core are built on top of Linux and FreeBSD, respectively, and the underlying operating systems are general purpose UNIX or UNIX-like operating systems. The middleware interfaces with the system by issuing shell commands or kernel API calls, and there are numerous potential scenarios which can result in either an indefinitely long wait for completion, or in some cases a complete hang. The middleware is not actually magic and does not have the ability to force itself to continue working in the face of kernel issues such as a hung or stalled device. I assure you that I've witnessed TrueNAS Core and FreeNAS both hang in the past, and there's no technical reason why it couldn't do so today. In your case, NVMe is an evolving pain point which has become more stable over time, but lockups due to device firmware or kernel driver problems are basically expected potholes in the road. You can't avoid them, but you can pick well-paved roads by using devices that have been successfully used by other folks. This set of "safe" NVMe devices may not even be the same between the two operating systems.

piwko28 · Sep 14, 2022

Ok, so I replaced the disks and the pool is healthy. But unfortunately it did not help with the issue.

In fact I upgraded Core to Scale with manual update script. So I decided to download whole configuration and do clean install of Scale. It seems it perform better.

Then I realized that a lot of snapshots (over 7k), mostly comes from ix containers. Previously there was no notification about it; it appeared after reinstall the system. I tried to get rid of them, but I couldn't remove some (clones problem). I must add that there was a snapshot task for the whole apps pool + replication task. It was fine on Core with iocage, but apparently not on Scale with kubernetes. The only way to fix it was to remove all the containers (config files forunately were stored in another place than within containers), unset containers pool and run:

Code:

docker system prune --all --force --volumes

Then I can remove unused snapshots:

Code:

zfs list -t snap | awk '/ix-containers/ { printf "zfs destroy %s\n", $1 }' | sh

And install the apps again. Btw. it's great on Scale to do it in so simply way.

After all I changed snapshots policy to avoid ix-containers, I do only snapshots of configuration. Now the number of ix-containers snapshots is around 200 (is it valid number for 6 containers?)

I have no idea which operation helped, but I believe all of them were important. Today the first night without hanging is over. I need to install plex and nextcloud (2 of 8 apps) to restore my NAS to previous state and I hope it was not the reason of hanging ;) so fingers crossed.

In summary: I have an impression that it is a common problem with removing ix-containers snapshots. And with cloning ix-containers. The system works much better and faster without these all snapshots and datasets, so I think it drove the Scale unstable. Maybe it's good idea to warn the user before establishing such tasks.

Important Announcement for the TrueNAS Community.

webui becomes unresponsive and nvme error after 24+ hours

JoelJames

Cadet

jgreco

Resident Grinch

JoelJames

Cadet

piwko28

Cadet

piwko28

Cadet

sretalla

Powered by Neutrality

piwko28

Cadet

sretalla

Powered by Neutrality

piwko28

Cadet

jgreco

Resident Grinch

piwko28

Cadet

Similar threads

Important Announcement for the TrueNAS Community.

webui becomes unresponsive and nvme error after 24+ hours

Cadet

Resident Grinch

Cadet

Cadet

Cadet

Powered by Neutrality

Cadet

Powered by Neutrality

Cadet

Resident Grinch

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "webui becomes unresponsive and nvme error after 24+ hours"

Similar threads