Uncorrectable I/O Failure

russellkwr

Cadet
Joined
Mar 5, 2024
Messages
1
I am getting a strange error related to "disktemp.py" that results in "pool 'boot-pool' has encountered an uncorrectable I/O failure and has been suspended". At that point the error causes the server to lock up. When I do a forced restart of the server, everything boots up normally, as if there were no issues.

All hardware is brand new. I don't suspect this is a hardware issue. Boot pool is a single Intel Optane SSD.

I have seen threads on this being SNMP related. I do not run SNMP and the service is disabled.

CPU usage is minimal - almost always close to 0%.

I have one storage pool with SMB access. I have one Ubuntu VM running Pi-hole.

Any help is appreciated.

Thanks!


Mar 5 06:18:11 maxwell 1 2024-03-05T06:18:07.755119-08:00 maxwell.local collectd 2085 - - Traceback (most recent call last):
  File "/usr/local/lib/collectd_pyplugins/disktemp.py", line 62, in read
    with Client() as c:
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 286, in __init__
    self._ws.connect()
  File "/usr/local/lib/python3.9/site-packages/middlewared/client/client.py", line 124, in connect
    rv = super(WSClient, self).connect()
  File "/usr/local/lib/python3.9/site-packages/ws4py/client/__init__.py", line 223, in connect
    bytes = self.sock.recv(128)
socket.timeout: timed out
Mar 5 06:19:11 maxwell ahcich4: Timeout on slot 20 port 0
Mar 5 06:19:11 maxwell ahcich4: is 00000000 cs 00300000 ss 00000000 rs 00300000 tfd c0 serr 00000000 cmd 0004d417
Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 0040 00 00 00 00 00 00
Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): CAM status: Command timeout
Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): Retrying command, 0 more tries remain
Mar 5 06:19:11 maxwell xhci0: Resetting controller
Mar 5 06:19:11 maxwell ahcich5: Timeout on slot 9 port 0
Mar 5 06:19:11 maxwell ahcich5: is 00000000 cs 00000600 ss 00000000 rs 00000600 tfd c0 serr 00000000 cmd 0004c917
Mar 5 06:19:11 maxwell (ada1:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 0040 00 00 00 00 00 00
Mar 5 06:19:11 maxwell (ada1:ahcich5:0:0:0): CAM status: Command timeout
Mar 5 06:19:11 maxwell (ada1:ahcich5:0:0:0): Retrying command, 0 more tries remain
Mar 5 06:20:18 maxwell uhub0: at usbus0, port 1, addr 1 (disconnected)
Mar 5 06:20:18 maxwell ugen0.2: <vendor 0x05e3 USB2.0 Hub> at usbus0 (disconnected)
Mar 5 06:20:18 maxwell uhub1: at uhub0, port 9, addr 1 (disconnected)
Mar 5 06:20:18 maxwell uhub1: detached
Mar 5 06:20:18 maxwell uhub0: detached
Mar 5 06:20:18 maxwell uhub0 on usbus0
Mar 5 06:20:18 maxwell uhub0: <Intel XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Mar 5 06:21:13 maxwell ahcich6: Timeout on slot 1 port 0
Mar 5 06:21:13 maxwell ahcich6: is 00000000 cs 00000000 ss 00000006 rs 00000006 tfd 40
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
you haven't listed your hardware (required), so there is nothing else to work with.

Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): CAM status: Command timeout
Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): Retrying command, 0 more tries remain
Mar 5 06:19:11 maxwell xhci0: Resetting controller

your disk is timing out and the controller appears to be resetting. that usually means...your disk is timing out! you have a hardware issue with this disk.

my guess? it's overheating.
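
if it helps narrow things down, here is a rough sketch (not anything that ships with TrueNAS) that tallies the timeout lines in a saved copy of the log per disk and per AHCI channel. the regexes are assumptions based only on the line formats quoted in this thread, and the function name `count_timeouts` is made up for illustration. the log posted above shows timeouts on ada0, ada1 and ahcich6, so a count like this can show whether one disk or several ports are affected.

```python
import re
from collections import Counter

# Assumed line format, taken from the log in this thread:
#   (ada0:ahcich4:0:0:0): CAM status: Command timeout
CAM_TIMEOUT = re.compile(
    r"\((?P<dev>\w+):(?P<chan>\w+):[\d:]+\): CAM status: Command timeout"
)
# Assumed line format, taken from the log in this thread:
#   ahcich4: Timeout on slot 20 port 0
CHAN_TIMEOUT = re.compile(r"\b(?P<chan>ahcich\d+): Timeout on slot \d+ port \d+")


def count_timeouts(lines):
    """Return (per-disk, per-channel) counters of timeout messages."""
    devices = Counter()
    channels = Counter()
    for line in lines:
        m = CAM_TIMEOUT.search(line)
        if m:
            devices[m.group("dev")] += 1
        m = CHAN_TIMEOUT.search(line)
        if m:
            channels[m.group("chan")] += 1
    return devices, channels


# A few lines copied from the log posted in this thread.
SAMPLE = """\
Mar 5 06:19:11 maxwell ahcich4: Timeout on slot 20 port 0
Mar 5 06:19:11 maxwell (ada0:ahcich4:0:0:0): CAM status: Command timeout
Mar 5 06:19:11 maxwell ahcich5: Timeout on slot 9 port 0
Mar 5 06:19:11 maxwell (ada1:ahcich5:0:0:0): CAM status: Command timeout
Mar 5 06:21:13 maxwell ahcich6: Timeout on slot 1 port 0
""".splitlines()

devices, channels = count_timeouts(SAMPLE)
print(dict(devices))
print(sorted(channels))
```

to run it against a real log, pass the open file object instead of `SAMPLE`, since the function only needs an iterable of lines.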
 