mobrien118
Dabbler · Joined: Jun 22, 2020 · Messages: 25
I'm a semi-proficient Linux user and semi-professional SysAdmin and I decided to take TrueNAS SCALE for a spin on a non-critical workload.
I'm feeling kind of sophomoric here: I've spent several hours researching and troubleshooting, but progress has been very slow. I'm wondering whether I should submit a bug report, but I'm not sure what information to provide, or even whether it's worthwhile, or whether this is something I'm doing to myself.
The purpose of this system is simple and twofold: serving local SMB shares and acting as a remote source/target for rsync (so, server, not client). Basically, I'm receiving remote PC backups from another NAS via rsync, and also receiving local PC backups via SMB which are then picked up by the remote rsync client, too.
I was previously running TrueNAS CORE and fell victim to the infamous filesystem corruption when upgrading from FreeNAS to TrueNAS 12.
I was running TrueNAS in a VM (OS on virtual disk) with direct access to 3 storage disks that comprised the ZFS storage pool. They are partitioned like this (looking from HOST system):
Number  Start   End     Size    File system  Name  Flags
 1      65.5kB  2148MB  2147MB
 2      2148MB  2000GB  1998GB  zfs
So, I simply turned off that VM, created a new one with a new virtual disk as the OS disk and the same 3 storage pool disks (RAW access), installed TNS and mounted the ZFS pool. Success!
...until a few hours later, when the system became relatively unresponsive. I noticed a consistent CPU load of 8.0, with htop reporting about 2% CPU usage - all of the rest was IO. I tried to kill the rsync daemon and the smb daemon, both of which ended up as "D" (uninterruptible sleep) processes, and their parents were "Z" (zombie). They would not respond to kill.
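For anyone trying to reproduce or triage this, here's a rough sketch of how I could catch the stuck processes even when htop won't start: scan /proc directly for anything in the "D" state. Plain Python, nothing TrueNAS-specific assumed.

```python
# Sketch: list processes stuck in uninterruptible sleep (state "D")
# by scanning /proc directly. Stock Linux only; no third-party tools.
import os

def d_state_processes():
    """Return a list of (pid, comm) for processes currently in state 'D'."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
        # spaces or parens, so split on the *last* closing paren.
        comm = raw[raw.index("(") + 1 : raw.rindex(")")]
        state = raw[raw.rindex(")") + 1 :].split()[0]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"{pid}\t{comm}")
```

On a healthy box this prints nothing (or a transient entry or two); when the hang hits, I'd expect rsync and smbd to show up and stay there.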
In the terminal, I'm seeing systemd-journal going nuts trying to kill existing processes, failing, and spawning new ones (I think). After just being online for 12 hours, these PIDs are in the 100000s or higher - so something is spawning processes. I don't see that many processes in htop, though.
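To put a number on the "something is spawning processes" hunch, here's a quick sketch that estimates how fast PIDs are being consumed by sampling the kernel's last-allocated-PID counter twice. It assumes Linux with /proc/sys/kernel/ns_last_pid (present since kernel 3.3); a high rate with few visible processes would match short-lived forks.

```python
# Sketch: estimate PID allocation rate by sampling the kernel's
# last-allocated PID counter. Assumes /proc/sys/kernel/ns_last_pid
# exists (Linux >= 3.3).
import time

def last_pid():
    """Read the most recently allocated PID in this PID namespace."""
    with open("/proc/sys/kernel/ns_last_pid") as f:
        return int(f.read())

def pid_burn_rate(interval=5.0):
    """Approximate number of PIDs allocated per second over `interval`."""
    start = last_pid()
    time.sleep(interval)
    return (last_pid() - start) / interval

if __name__ == "__main__":
    print(f"~{pid_burn_rate():.0f} PIDs/s being allocated")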
I guess it's important to point out that the system works fine for hours, and even when it becomes completely unresponsive (can't even start a shell session) the remote RSync job doesn't "fail" until I physically kill the system or the RSync processes die (reboot command).
So, I'm feeling like the ZFS is suddenly getting hammered by something, maybe a process that keeps forking or reloading, and causing never-ending queueing of IO that can't be fulfilled.
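To confirm that the load really is IO wait rather than CPU work, here's a small sketch that measures the iowait fraction straight from /proc/stat (standard on any Linux; no other assumptions). A load of 8.0 with ~2% CPU should show up here as iowait near 100%.

```python
# Sketch: measure the fraction of aggregate CPU time spent in iowait
# over a sampling interval, by diffing the counters in /proc/stat.
import time

def iowait_fraction(interval=1.0):
    """Fraction of total CPU jiffies spent in iowait over `interval` seconds."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        # field order: user nice system idle iowait irq softirq steal ...
        return fields[4], sum(fields)
    io0, total0 = snapshot()
    time.sleep(interval)
    io1, total1 = snapshot()
    return (io1 - io0) / max(total1 - total0, 1)

if __name__ == "__main__":
    frac = iowait_fraction()
    print(f"iowait: {frac:.1%} of CPU time")
```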
Symptoms I'm seeing from the terminal are like:
"killing process XXXXX (systemd-journal) with signal sigkill" (multiple instances with different PIDs)
"systemd-journald.service: process still around after SIGKILL. Ignoring." (multiple instances with different PIDs)
"Found left-over process XXXXX (systemd-journal) in control group while starting unit. Ignoring."
I'm just hoping that reporting my issue will help troubleshoot this and spare others a similar problem in the future. I'm happy to submit whatever I can to this forum, or file a bug report if that's the right path. If one of you FreeNAS gurus could just point me in the right direction, I'd appreciate it.
Thanks!