Stability issues - high I/O and systemd-journal spawning

mobrien118

Dabbler
Joined
Jun 22, 2020
Messages
25
I'm a semi-proficient Linux user and semi-professional SysAdmin and I decided to take TrueNAS SCALE for a spin on a non-critical workload.

I'm feeling kind of sophomoric here as I've spent several hours researching and troubleshooting, but it's going very slowly. I'm wondering if I should submit a bug report, but I'm not sure what information I should provide, whether it is worthwhile, or whether this is something I'm doing to myself.

The purpose of this system is simple and twofold: local SMB and as a remote source/target for RSync (so, server, not client). Basically, I'm receiving remote PC backups from another NAS via RSync and also receiving local PC backups via SMB which are then picked up by the remote RSync client, too.

I was previously running TrueNAS CORE and fell victim to the infamous filesystem corruption when upgrading from FreeNAS to TrueNAS 12.

I was running TrueNAS in a VM (OS on virtual disk) with direct access to 3 storage disks that comprised the ZFS storage pool. They are partitioned like this (looking from HOST system):
Number  Start   End     Size    File system  Name  Flags
 1      65.5kB  2148MB  2147MB
 2      2148MB  2000GB  1998GB  zfs

So, I simply turned off that VM, created a new one with a new virtual disk as the OS disk and the same 3 storage pool disks (RAW access), installed TrueNAS SCALE (TNS) and imported the ZFS pool. Success!

...until a few hours later, when the system became relatively unresponsive. I noticed a consistent load average of 8.0, with htop reporting about 2% CPU usage; all of the rest was I/O wait. I tried to kill the rsync daemon and the SMB daemon, both of which ended up as "D" (uninterruptible sleep) processes, and their parents were "Z" (zombie). They would not respond to kill.
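
On Linux, processes in uninterruptible sleep count toward the load average, which fits a load of 8.0 with almost no CPU use. For anyone wanting to see which processes are stuck, here is a minimal Python sketch that reads /proc/<pid>/stat and lists everything in D or Z state. The function names `parse_stat` and `processes_in_state` are made up for illustration, and it assumes a Linux /proc filesystem and a shell that still responds.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    rest = text[close_idx + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def processes_in_state(states=("D", "Z"), proc_root="/proc"):
    """Return (pid, comm, state, ppid) for every process in one of `states`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                info = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if info[2] in states:
            found.append(info)
    return found
```

Printing the result of `processes_in_state()` gives the stuck PIDs along with their parent PIDs, which is the same information as the state column in htop but easy to save to a file for a bug report.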

In the terminal, I'm seeing systemd-journal going nuts trying to kill existing processes, failing, and spawning new ones (I think). After just being online for 12 hours, these PIDs are in the 100000s or higher - so something is spawning processes. I don't see that many processes in htop, though.

I guess it's important to point out that the system works fine for hours, and even when it becomes completely unresponsive (can't even start a shell session) the remote RSync job doesn't "fail" until I physically kill the system or the RSync processes die (reboot command).

So, I'm feeling like the ZFS is suddenly getting hammered by something, maybe a process that keeps forking or reloading, and causing never-ending queueing of IO that can't be fulfilled.

Symptoms I'm seeing from the terminal are like:
"killing process XXXXX (systemd-journal) with signal sigkill" (multiple instances with different PIDs)
"systemd-journald.service: process still around after SIGKILL. Ignoring." (multiple instances with different PIDs)
"Found left-over process XXXXX (systemd-journal) in control group while starting unit. Ignoring."

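To check whether new journal processes really are being spawned, or whether the same few PIDs are just being reported over and over, the messages can be tallied. This is a rough Python sketch, assuming the console output has been captured as text and that the lines look like the ones quoted above; the regular expression and the function name `pids_by_command` are illustrative assumptions, not anything provided by systemd.

```python
import re
from collections import defaultdict

# Matches the "killing process N (name)" and "Found left-over process N (name)"
# messages quoted above, ignoring case.
PATTERN = re.compile(
    r"(?:killing|found left-over) process (\d+) \(([^)]+)\)",
    re.IGNORECASE,
)


def pids_by_command(lines):
    """Map each command name to the sorted list of distinct PIDs mentioned."""
    seen = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            seen[match.group(2)].add(int(match.group(1)))
    return {comm: sorted(pids) for comm, pids in seen.items()}
```

Passing an open text file (or any iterable of lines) to `pids_by_command` shows how many distinct PIDs appear per command. A steadily growing list would point to repeated respawning, while a short fixed list would point to the same stuck processes being retried.
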
I'm just hoping my issue will help troubleshoot this and keep others from having a similar problem in the future. I'm happy to submit whatever I can to this forum or to a bug report, if that is the right path. I'd appreciate it if one of you FreeNAS gurus could point me in the right direction.

Thanks!
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Just file a report in the Jira and try actually adding system specifications.

Using known-unstable things on SCALE like rsync and SMB isn't the best test case tbh.
 

mobrien118

Good feedback - thanks for that. I'm generally hesitant to open bug reports for systems that I'm not proficient in because I don't want to clog up the tracker if it's just user error.

Unfortunately, I probably won't get a chance at this point...
[Attached screenshot: 1613749485951.png]


I guess doing a hard reboot when the system had already tried to shut down gracefully for 4+ hours was not the right approach. I'm not even sure if I could have saved the pool at that point or not, though.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,599
Please note that ZFS was both designed and thoroughly tested for unexpected shutdowns. The only data lost should be data in flight (that is, open ZFS transaction groups and other pending updates).

Almost all the known causes of ZFS data loss were due to:
- Hardware RAID controllers not doing what ZFS thought they were doing
- Running ZFS in VM without the proper configuration
- Hardware faults
- ZFS bugs

I say almost, as I am sure someone can add to the list.

Sun Microsystems purposefully designed ZFS to survive an unexpected shutdown without data loss. They did this so that no lengthy, time-consuming file system check would be needed on boot. Sun anticipated that disks (and the file systems on them) would get so large that the time needed to run a file system check at boot would be unreasonable.
 

mobrien118

Thanks, Arwen. I should have been clearer: I'm aware of the data reliability benefits of ZFS, and I'm confident (hopeful?) that my data is still there. However, I just built another new VM, which I can at least boot, and told it to import my 3-disk RAIDZ-1 array, and it has been spinning at 100% of one CPU in I/O wait for about 3 hours:
[Attached screenshot: 1613764157591.png]


The data's no good if I can't access it, even if it is "there". I'm hoping that if I let it sit there long enough it will finish whatever it is trying to do. If not, I'll probably start digging into it this weekend.

UPDATE: I'm about 24 hours in and the pool still hasn't loaded. It's been sitting at a CPU load of 6 ever since I told it to import the pool:
[Attached screenshot: 1613845985103.png]


UPDATE 2: Issued a "Reboot" command that eventually succeeded. It has been another 22 hours now of trying to import the pool in the startup job. Is there any way to know what it is actually doing? Any chance it will ever succeed?
[Attached screenshot: 1614004813914.png]
 

mobrien118

I don't want to just leave this hanging out there.

I went ahead and wiped the data disks, rebuilt the pool and I'm currently reloading all of the data from backups. Seems to be working just fine now.

I can only assume that this is related to the fact that I had this pool running on TrueNAS CORE 12 and, while I'm not positive, I think I upgraded the pool to the version that caused all of the data issues people have seen elsewhere. That system wouldn't even boot because the root partition was severely corrupted. I believed that the data pool was still good and hoped that the system crashing had kept it safe, but I have to assume that was not the case.

Anyway, just rebuilding a new pool seems to have it running tip-top and I can say that, for my simple purposes, I love TrueNAS SCALE. I'm much more comfortable in Linux than BSD, and the bare metal hardware I'm using even had a really bad BSD bug where the bootloader doesn't work (found the BSD bug report for it and it is 7 years old, so probably not getting fixed).

I truly think that utilizing a Linux platform opens all kinds of doors for TrueNAS and I'm excited to see how it evolves now that I'm a user.

I like the "ALPHA" of SCALE better than the STABLE CORE (don't get me wrong, I still understand the implications of it being ALPHA).

Anyway, I'm not sure if this will help anyone. I wish I were able to provide more helpful information on the actual root cause of a pool trying to load for days on end without succeeding, but I guess we'll just have to move on to more interesting and solvable problems.

Thanks,

--mobrien118
 