No Web UI during resilver - nginx upstream websocket timed out

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I'm doing some drive replacements on my zpool tank. The pool layout is two 11-drive raidz3 vdevs. I also have 2 500GB SSDs, which I've partitioned to use for mirrored log and special vdevs.
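For context, a pool with that shape would be put together with commands along these lines; this is purely illustrative with placeholder device names, not the exact commands this pool was built with:

Code:
# Illustrative sketch only: placeholder device names, not the actual build commands.
zpool create tank \
    raidz3 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
    raidz3 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv
zpool add tank log mirror sdw1 sdx1        # mirrored SLOG on SSD partitions
zpool add tank special mirror sdw2 sdx2    # mirrored special vdev on SSD partitions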

During the resilver process, the Web UI does not load at all; all I see is the message "Waiting for Active TrueNAS controller to come up..." I can root around the CLI via SSH without issues, and all other functions, including the NFS and SMB services, appear to be unaffected. I see the following nginx error in the systemd journal (with slightly redacted IP and domain info):

Code:
Mar 24 11:43:26 nas nginx[11574]: 2023/03/24 11:43:26 [error] 11574#11574: *15428 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 2001:db8:dead:beef::1, server: localhost, request: "GET /websocket HTTP/1.1", upstream: "http://127.0.0.1:6000/websocket", host: "nas.int.example.com"
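The upstream on 127.0.0.1:6000 is presumably middlewared, so while the UI is unreachable these are the kinds of checks I can run over SSH (assuming the usual SCALE unit names, and that midclt is available):

Code:
# Assumes the standard SCALE service names; adjust if yours differ.
systemctl status middlewared         # is the middleware service itself running?
journalctl -u middlewared -n 100     # recent middleware log entries
time midclt call core.ping           # does a direct middleware API call answer, and how slowly?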


I'm upgrading drives 2 at a time using the following process, in case it's of any relevance:
  • zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  • Install 2 new drives
  • Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  • Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
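In rough CLI terms, each pair looks something like this; device names are placeholders, and the replace steps themselves were actually issued from the Web UI:

Code:
# Placeholder device names; the replacements were actually started from the Web UI.
zpool offline tank sdc        # free 2 slots by offlining 2 raidz3-0 members
zpool offline tank sdd
# ...physically swap in the 2 new drives, then replace 2 members of raidz3-1:
zpool replace tank sdm sdy
zpool replace tank sdn sdz    # the second resilver gets deferred by default...
zpool resilver tank           # ...so force both drives to resilver together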
The resilver process is now on the 4th pair of drive replacements, and the loss of Web UI functionality has been consistent across each resilver.

Has anyone else seen this issue? What logs or other diagnostic information can I look for in the CLI? Thanks.
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
To be more clear, my resilver process looks like this for multiple pairs of drives:
  1. zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  2. Install 2 new drives
  3. Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  4. Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
  5. After the resilver completes, remove the 2 freed-up old drives from the server.
  6. Go to step 2 and repeat for another pair of drives.
  7. After all drives are replaced, re-install the 2 drives removed from vdev raidz3-0 and zpool online them for a final resilver.
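For step 7, the final part is just this (placeholder names again); onlining the drives should kick off a catch-up resilver on its own:

Code:
# Placeholder device names; run once the original raidz3-0 drives are back in the chassis.
zpool online tank sdc
zpool online tank sdd
zpool status -v tank    # confirm the catch-up resilver has started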
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
To be more clear, my resilver process looks like this for multiple pairs of drives:
  1. zpool offline 2 drives from vdev raidz3-0, then physically remove them from the server and set them aside
  2. Install 2 new drives
  3. Initiate 2 drive replacements for vdev raidz3-1 in the Web UI, one after the other
  4. Run zpool resilver tank over SSH to force the pool to resilver both drives at the same time rather than deferring the second resilver
  5. After the resilver completes, remove the 2 freed-up old drives from the server.
  6. Go to step 2 and repeat for another pair of drives.
  7. After all drives are replaced, re-install the 2 drives removed from vdev raidz3-0 and zpool online them for a final resilver.

Can you try resilvering 1 drive at a time...? That is how we would normally test.
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
Can you try resilvering 1 drive at a time...? That is how we would normally test.
Thank you for the prompt response!

Alas, I'm in the middle of replacing the last pair of drives. I suppose when I return the original drives from vdev raidz3-0, I could zpool online them one at a time. It'll be a slightly lighter resilver vs the replacements I've been doing.

Any hypothesis as to why the nginx websocket times out with more than 1 drive resilvering simultaneously? I'm new to TrueNAS, but I've used ZFS on various Linux systems since 0.6.5 and have done mass resilvers of multiple drives at once with no impact on the apps or services running on those systems. Just curious what other logs or diagnostic information I can collect while the current resilver is taking place.
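In the meantime, this is the sort of thing I can capture while the resilver runs (assuming iostat from sysstat is installed, and midclt as before):

Code:
# Snapshot of pool and system behavior during the resilver.
zpool status -v tank          # resilver progress and scan rate
zpool iostat -v tank 5        # per-vdev I/O while resilvering
iostat -x 5                   # per-disk utilization and latency (sysstat)
time midclt call core.ping    # how long a direct middleware call takes right now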
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I just noticed that my systemd journal also logs this collectd traceback every 10 seconds:

Code:
Mar 24 21:24:31 nas collectd[2988368]: Traceback (most recent call last):
                                         File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read
                                           with Client() as c:
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__
                                           self._ws.connect()
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect
                                           rv = super(WSClient, self).connect()
                                         File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 222, in connect
                                           bytes = self.sock.recv(128)
                                       socket.timeout: timed out


I'm not sure whether this is related. I'm running TrueNAS SCALE in a VM, so I expect it can't read the CPU temperature. Still, when not resilvering, the Web UI is fully functional aside from the missing CPU temperature.
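To check whether collectd is tripping over the same middleware socket that nginx proxies to, something like this should show it (assuming collectd runs as its own systemd unit here):

Code:
# Assumes collectd runs as a systemd unit named "collectd" on SCALE.
ss -tlnp | grep ':6000'                                          # what is listening on 127.0.0.1:6000
journalctl -u collectd --since "10 min ago" | grep -c Traceback  # how often the cputemp plugin is failing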
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
Just now, that error changed slightly; it still repeats every 10 seconds:

Code:
Mar 24 21:33:51 nas collectd[2988368]: Traceback (most recent call last):
                                         File "/usr/local/lib/collectd_pyplugins/cputemp.py", line 21, in read
                                           with Client() as c:
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 326, in __init__
                                           self._ws.connect()
                                         File "/usr/lib/python3/dist-packages/middlewared/client/client.py", line 129, in connect
                                           rv = super(WSClient, self).connect()
                                         File "/usr/lib/python3/dist-packages/ws4py/client/__init__.py", line 215, in connect
                                           self.sock.connect(self.bind_addr)
                                       BlockingIOError: [Errno 11] Resource temporarily unavailable
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Just trying to find the simplest error replication process... at the moment the issue only appears when replacing two drives at once.

When you are resilvering, how busy is the CPU?
 

faisalm

Cadet
Joined
Feb 17, 2023
Messages
8
I can agree with that from an error replication perspective. The resilver for the replacements has now completed. I checked htop on and off during the resilvers this past week, and CPU hovered around 10-15% per core with a load average of 5 at most. This is a VM with 8 cores; more info is in my signature below.

The Web UI returned to normal after the resilver finished. There are also no longer any collectd errors in the systemd journal (i.e. socket timing out or BlockingIOError).

I've now re-installed the temporarily removed drives from raidz3-0 and onlined them from the UI. This time I _won't_ run zpool resilver tank to force both drives to resilver at the same time. Similar load average and CPU use, but no issues with the Web UI.

Thanks for the help @morganL. I guess my problem is resolved for now since I've completed my drive replacements. I understand resilvering 1 drive at a time is the recommended practice. Given my risk tolerance and use of raidz3, doing 2 at a time cut my replacement time in half, for which I have no regrets.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I can agree with that from an error replication perspective. The resilver for the replacements has now completed. I checked htop on and off during the resilvers this past week, and CPU hovered around 10-15% per core with a load average of 5 at most. This is a VM with 8 cores; more info is in my signature below.

The Web UI returned to normal after the resilver finished. There are also no longer any collectd errors in the systemd journal (i.e. socket timing out or BlockingIOError).

I've now re-installed the temporarily removed drives from raidz3-0 and onlined them from the UI. This time I _won't_ run zpool resilver tank to force both drives to resilver at the same time. Similar load average and CPU use, but no issues with the Web UI.

Thanks for the help @morganL. I guess my problem is resolved for now since I've completed my drive replacements. I understand resilvering 1 drive at a time is the recommended practice. Given my risk tolerance and use of raidz3, doing 2 at a time cut my replacement time in half, for which I have no regrets.

No problem at my end; I was just wondering why no one had seen or reported anything similar.
 