Network Dropout / Drive Unmounting

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
OK, I have a strange problem here. One of my TrueNAS boxes (SAN2) is acting up and I'm not sure where to look to find the answers. So, here's the scenario:

Replication between SAN1 and SAN2 never fails. Works every time.

I installed a large disk in SAN2 and formatted it as NTFS so I could make an archival single-disk backup that could be read anywhere (hence NTFS).
Since I can't seem to find a way to mount it from the GUI, and fstab gets reset on reboot, I SSH in and manually mount the drive. Then, using tmux, I start an rsync job to move the data over. This job just trucks right along for a while until it fails.
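For reference, the manual routine above (mount, then rerun rsync whenever it dies) could be sketched in Python roughly like this. It is only a sketch under assumptions: the paths in the usage note are hypothetical, the `-rt --partial` flag choice is mine rather than what was actually run, and `run_with_retries` is an illustrative helper, not anything TrueNAS provides. It refuses to run when the destination is no longer a mount point, so a rerun cannot silently write into the empty directory underneath.

```python
import os
import subprocess
import time


def build_rsync_cmd(src, dest):
    """Build an rsync command that is safe to rerun after an interruption."""
    # -r: recurse into directories.
    # -t: keep modification times, so a rerun skips files already copied.
    # --partial: keep a partially transferred file instead of deleting it.
    return ["rsync", "-rt", "--partial", src, dest]


def run_with_retries(cmd, mount_point, attempts=5, delay=30,
                     runner=subprocess.call, is_mounted=os.path.ismount):
    """Rerun cmd until it exits 0, the mount vanishes, or attempts run out.

    Returns the last exit code, or None if the mount point was missing.
    """
    code = None
    for attempt in range(1, attempts + 1):
        if not is_mounted(mount_point):
            print(f"{mount_point} is not mounted; stopping")
            return None
        code = runner(cmd)
        if code == 0:
            break
        print(f"attempt {attempt} exited with {code}")
        if attempt < attempts:
            time.sleep(delay)
    return code
```

Usage would be something like `run_with_retries(build_rsync_cmd("/mnt/tank/data/", "/mnt/archive/"), "/mnt/archive")`, where both paths are placeholders. This only works around the dropout; it does not explain it.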

The failure:

The remote system (SAN2) drops the connection. No problem, right? You set this all up in a tmux window, so just log back in and do a tmux attach and your job should still be running.

NOPE.

Not only does it kill the SSH session, but it also kills the tmux session, which kills the rsync job. Then it UNMOUNTS the NTFS volume!

When this happens, if I happen to have the GUI open, it goes to the "waiting for the interface to load" screen. Once it's back up and the GUI is available, I can log back in, start tmux, mount the drive and restart the rsync job.

Yeah, sure, it will eventually all get copied, but I'm trying to figure out what would cause SSH, tmux AND a mounted drive to all go away at the same time.

The server is NOT rebooting. It comes back WAY too fast for that. I have had replication tasks (zfs send) that ran for 15+ hours and never dropped, even when the SSH/tmux/rsync/mount problem occurred.

So SOMETHING is resetting.

Attached is the support dump file.
 

Attachments

  • debug-backup-20230426121445.tgz
    2.6 MB · Views: 58

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You don't mention how the NTFS drive is attached to the NAS.

Is it via SATA?
And if so, what type SATA port, system board or HBA?
If HBA, what is the make & model of the chip?

Actually the whole hardware details of your SAN2 server would probably be helpful. We have seen power save on AMD CPUs be a problem, which disabling in BIOS can solve.
 

2twisty

Sorry.

I don't have exact details from that server since I built it a couple years ago -- this setup is in a homelab, FWIW.

All power saving options are turned off in BIOS.
All drives are connected to an LSI 8i HBA (I don't know the exact model).
This drive is connected to the same HBA as all my ZFS drives.

Initially, I created this drive as a single disk ZFS volume and started copying -- the drive never errored and never unmounted itself but I did experience the dropout/tmux killed issue.

I get no errors on any other disks that would suggest an HBA problem, even when the big disk was a single disk pool, so I'm not inclined to believe it's a SATA connection issue.

In my experience, when a mounted drive fails, it doesn't unmount -- it just gives errors when you try to access it.
Also, a SATA error doesn't explain why my tmux session is killed. I see no error messages before the dropout -- I just notice that the transferred bytes on a file stop increasing, and then after about 15 seconds the SSH session drops.

This is a system built on a desktop motherboard with a Ryzen 3200G. My other SAN is also on a desktop board with a Ryzen 1500. That one doesn't exhibit this behavior, although admittedly I have not tried this exact exercise on it, so who knows?

I guess the thing that is really adding to my confusion is how the tmux session gets killed at the same time that the drive unmounts (and how those two things seem to be connected).

What log file can I look in to see when the system unmounts the drive? Is middlewared crashing (which would explain the "waiting for the interface to load" screen) and causing all those other things to crash or revert?
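Until someone names the right log, one way to hunt for the moment of the unmount is to scan whatever logs exist for a few suspicious words around the time of the dropout. A minimal Python sketch, assuming the system log lives at `/var/log/messages` (an assumption, not a confirmed TrueNAS location) and using a keyword list that is just a guess at what might show up:

```python
import re

# Hypothetical keyword list; adjust it to whatever you are hunting for.
KEYWORDS = ("ntfs", "fuse", "umount", "unmount", "killed", "out of memory", "sshd")


def find_events(lines, keywords=KEYWORDS):
    """Return the log lines that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_file(path, keywords=KEYWORDS):
    """Scan one log file, replacing undecodable bytes instead of failing."""
    with open(path, errors="replace") as handle:
        return find_events(handle, keywords)
```

Something like `for hit in scan_file("/var/log/messages"): print(hit)` would then list the candidate lines; comparing their timestamps with the time the SSH session dropped is the interesting part.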
 

2twisty

Just tried restarting middlewared during the rsync. The rsync didn't stop or crash, so that didn't do it...
 

Arwen

The drop of the tmux session is puzzling. I don't know why the NTFS drive would be unmounted, unless the middleware thinks it was transferring files from the NTFS drive (i.e. importing data) and, once the data import was complete, unmounted it.

I'd make sure that the shell's working directory, both before you start tmux and inside the tmux session, is not on the NTFS drive. If the working directory becomes unavailable because it is the mount point, or a directory below the mount point (on the NTFS drive), that might cause the problem. Using "/" as the directory when you start tmux, and again when you start rsync, should be safe. Except you probably did this already :-(.
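If you want to check that mechanically rather than by eye, here is a small Python sketch. The mount point `/mnt/archive` in the usage note is a made-up example, and both function names are just illustrative helpers:

```python
import os


def is_under(path, mount_point):
    """True if path is the mount point itself or anywhere below it."""
    path = os.path.abspath(path)
    mount_point = os.path.abspath(mount_point)
    # commonpath compares whole path components, so /mnt/archive2
    # is not mistaken for something below /mnt/archive.
    return os.path.commonpath([path, mount_point]) == mount_point


def cwd_is_safe(mount_point, cwd=None):
    """True if the working directory is NOT on the given mount point."""
    current = cwd if cwd is not None else os.getcwd()
    return not is_under(current, mount_point)
```

Running `cwd_is_safe("/mnt/archive")` in the shell before starting tmux, and again inside the tmux window before starting rsync, should return True in both places.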

As for the log files, I don't have those locations handy.

Sorry, I have no further suggestions.
 

Arwen

Yep, I figured you knew that.

Well, perhaps someone else can figure it out.
 

2twisty

Interesting.

I have a badblocks run going in an SSH/tmux session, and it has been running solid for 2.5 days now.

So, this is either an rsync issue or a mount issue. If the drive is unmounting for some reason, I could see how it could crash rsync, but not tmux. But it seems that the problem is not tmux, ssh or the network since my current task is solid.
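To tell those two cases apart, it would help to know whether the mount disappears before or after rsync dies. A rough Python sketch of a watcher that timestamps every change in mount state follows; the mount point and the polling interval are assumptions, and `watch` is only an illustrative helper:

```python
import os
import time
from datetime import datetime


def watch(mount_point, checks, interval=5.0,
          is_mounted=os.path.ismount, report=print):
    """Poll mount_point `checks` times and report each change of state.

    Returns a list of (timestamp, mounted) tuples, one per change.
    """
    previous = None
    changes = []
    for i in range(checks):
        mounted = is_mounted(mount_point)
        if previous is not None and mounted != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            changes.append((stamp, mounted))
            state = "mounted" if mounted else "UNMOUNTED"
            report(f"{stamp} {mount_point} {state}")
        previous = mounted
        if i < checks - 1:
            time.sleep(interval)
    return changes
```

Since the tmux session itself gets killed, the watcher would need to run somewhere that survives the dropout, with its output redirected to a file on the pool; otherwise the record goes away with the session.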
 