No Web UI during resilver - nginx upstream websocket timed out

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I'm doing some drive replacements on my zpool tank. The pool layout is two 11-drive raidz3 vdevs. I also have 2 500GB SSDs, which I've partitioned to use for mirrored log and special vdevs.
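For context, a pool with that shape would be put together with commands along these lines; this is purely illustrative with placeholder device names, not the exact commands this pool was built with:

Code:
# Illustrative sketch only: placeholder device names, not the actual build commands.
zpool create tank \
    raidz3 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
    raidz3 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv
zpool add tank log mirror sdw1 sdx1        # mirrored SLOG on SSD partitions
zpool add tank special mirror sdw2 sdx2    # mirrored special vdev on SSD partitions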

During the resilver process, the Web UI does not load at all; all I see is the message "Waiting for Active TrueNAS controller to come up..." I can root around the CLI via SSH without issues, and all other functions, including the NFS and SMB services, appear to be unaffected. I see the following nginx error in the systemd journal (with slightly redacted IP and domain info):

Code:
Mar 24 11:43:26 nas nginx[11574]: 2023/03/24 11:43:26 [error] 11574#11574: *15428 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 2001:db8:dead:beef::1, server: localhost, request: "GET /websocket HTTP/1.1", upstream: "http://127.0.0.1:6000/websocket", host: "nas.int.example.com"
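The upstream on 127.0.0.1:6000 is presumably middlewared, so while the UI is unreachable these are the kinds of checks I can run over SSH (assuming the usual SCALE unit names, and that midclt is available):

Code:
# Assumes the standard SCALE service names; adjust if yours differ.
systemctl status middlewared         # is the middleware service itself running?
journalctl -u middlewared -n 100     # recent middleware log entries
time midclt call core.ping           # does a direct middleware API call answer, and how slowly?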


I'm upgrading drives 2 at a time using the following process, in case it's of any relevance:
  • zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  • Install 2 new drives
  • Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  • Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
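In rough CLI terms, each pair looks something like this; device names are placeholders, and the replace steps themselves were actually issued from the Web UI:

Code:
# Placeholder device names; the replacements were actually started from the Web UI.
zpool offline tank sdc        # free 2 slots by offlining 2 raidz3-0 members
zpool offline tank sdd
# ...physically swap in the 2 new drives, then replace 2 members of raidz3-1:
zpool replace tank sdm sdy
zpool replace tank sdn sdz    # the second resilver gets deferred by default...
zpool resilver tank           # ...so force both drives to resilver together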
The resilver process is now on the 4th pair of drive replacements, and the loss of Web UI functionality has been consistent across each resilver.

Has anyone else seen this issue? What logs or other diagnostic information can I look for in the CLI? Thanks.
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
To be more clear, my resilver process looks like this for multiple pairs of drives:
  1. zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  2. Install 2 new drives
  3. Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  4. Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
  5. After the resilver completes, remove the 2 freed-up old drives from the server.
  6. Go to step 2 and repeat for another pair of drives.
  7. After all drives are replaced, re-install the 2 drives removed from vdev raidz3-0 and zpool online them for a final resilver.
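For step 7, the final part is just this (placeholder names again); onlining the drives should kick off a catch-up resilver on its own:

Code:
# Placeholder device names; run once the original raidz3-0 drives are back in the chassis.
zpool online tank sdc
zpool online tank sdd
zpool status -v tank    # confirm the catch-up resilver has started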
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
To be more clear, my resilver process looks like this for multiple pairs of drives:
  1. zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  2. Install 2 new drives
  3. Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  4. Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
  5. After the resilver completes, remove the 2 freed-up old drives from the server.
  6. Go to step 2 and repeat for another pair of drives.
  7. After all drives are replaced, re-install the 2 drives removed from vdev raidz3-0 and zpool online them for a final resilver.

Can you try resilvering 1 drive at a time...? That is how we would normally test.
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
Can you try resilvering 1 drive at a time...? That is how we would normally test.
Thank you for the prompt response!

Alas, I'm in the middle of replacing the last pair of drives. I suppose when I return the original drives from vdev raidz3-0, I could zpool online them one at a time. It'll be a slightly lighter resilver vs the replacements I've been doing.

Any hypothesis as to why the nginx websocket times out with more than 1 drive resilvering simultaneously? I'm new to TrueNAS, but I've used ZFS on various Linux systems since 0.6.5 and have done mass resilvers of multiple drives at once with no impact on the apps or services running on those systems. Just curious what other logs or diagnostic information I can collect while the current resilver is taking place.
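In the meantime, this is the sort of thing I can capture while the resilver runs (assuming iostat from sysstat is installed, and midclt as before):

Code:
# Snapshot of pool and system behavior during the resilver.
zpool status -v tank          # resilver progress and scan rate
zpool iostat -v tank 5        # per-vdev I/O while resilvering
iostat -x 5                   # per-disk utilization and latency (sysstat)
time midclt call core.ping    # how long a direct middleware call takes right now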
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I just noticed that my systemd journal also logs this collectd traceback every 10 seconds:

Code:
Mar 24 21:24:31 nas collectd[2988368]: Traceback (most recent call last):
                                         File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read
                                           with Client() as c:
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__
                                           self._ws.connect()
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect
                                           rv = super(WSClient, self).connect()
                                         File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 222, in connect
                                           bytes = self.sock.recv(128)
                                       socket.timeout: timed out


I'm not sure whether this is related. I'm running TrueNAS SCALE in a VM, so I expect it can't read the CPU temperature. Still, when not resilvering, the Web UI is fully functional aside from the missing CPU temperature.
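To check whether collectd is tripping over the same middleware socket that nginx proxies to, something like this should show it (assuming collectd runs as its own systemd unit here):

Code:
# Assumes collectd runs as a systemd unit named "collectd" on SCALE.
ss -tlnp | grep ':6000'                                          # what is listening on 127.0.0.1:6000
journalctl -u collectd --since "10 min ago" | grep -c Traceback  # how often the cputemp plugin is failing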
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
Just now, that error changed slightly; it still repeats every 10 seconds:

Code:
Mar 24 21:33:51 nas collectd[2988368]: Traceback (most recent call last):
                                         File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read
                                           with Client() as c:
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__
                                           self._ws.connect()
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect
                                           rv = super(WSClient, self).connect()
                                         File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 215, in connect
                                           self.sock.connect(self.bind_addr)
                                       BlockingIOError: [Errno 11] Resource temporarily unavailable
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Just trying to find the simplest error replication process... at the moment the issue only appears when replacing two drives at once.

When you are resilvering, how busy is the CPU?
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I can agree with that from an error replication perspective. The resilver for the replacements has now completed. I checked htop on and off during the resilvers this past week, and CPU hovered around 10-15% per core with a load average of 5 at most. This is a VM with 8 cores; more info is in my signature below.

The Web UI returned to normal after the resilver finished. There are also no longer any collectd errors in the systemd journal (i.e. socket timing out or BlockingIOError).

I've now re-installed the temporarily removed drives from raidz3-0 and onlined them from the UI. This time I _won't_ run zpool resilver tank to force both drives to resilver at the same time. Similar load average and CPU use, but no issues with the Web UI.

Thanks for the help @morganL. I guess my problem is resolved for now since I've completed my drive replacements. I understand resilvering 1 drive at a time is the recommended practice. Given my risk tolerance and use of raidz3, doing 2 at a time cut my replacement time in half, for which I have no regrets.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I can agree with that from an error replication perspective. The resilver for the replacements has now completed. I checked htop on and off during the resilvers this past week, and CPU hovered around 10-15% per core with a load average of 5 at most. This is a VM with 8 cores; more info is in my signature below.

The Web UI returned to normal after the resilver finished. There are also no longer any collectd errors in the systemd journal (i.e. socket timing out or BlockingIOError).

I've now re-installed the temporarily removed drives from raidz3-0 and onlined them from the UI. This time I _won't_ run zpool resilver tank to force both drives to resilver at the same time. Similar load average and CPU use, but no issues with the Web UI.

Thanks for the help @morganL. I guess my problem is resolved for now since I've completed my drive replacements. I understand resilvering 1 drive at a time is the recommended practice. Given my risk tolerance and use of raidz3, doing 2 at a time cut my replacement time in half, for which I have no regrets.

No problem at my end; I was just wondering why no one had seen or reported anything similar.
 