Boot Pool failures might have been NIC - how to check for sure

donkmeister (Cadet, joined Jun 22, 2023, 7 messages)
I'm hoping for guidance on how to check that I have actually fixed a problem, rather than just having been lucky enough that it hasn't bitten me since the last reboot.

The web interface for my TrueNAS-SCALE-22.12.3.2 box stopped responding today whilst I was out and about... When I got home I hooked up a monitor and the console reported that the issue was the boot-pool, which the system then deactivated. I can't recall the precise wording I'm afraid.

I rebooted and kept an eye on the console. Sure enough, after 10-15 minutes of faultless operation, I started to see timeouts for the boot drive, as follows (nvme1 is my boot drive):

nvme nvme1: I/O 105 QID 4 timeout, aborting
nvme nvme1: I/O 106 QID 4 timeout, aborting
nvme nvme1: I/O 47 QID 1 timeout, aborting
nvme nvme1: I/O 114 QID 1 timeout, aborting
nvme nvme1: I/O 115 QID 1 timeout, aborting
nvme nvme1: I/O 105 QID 4 timeout, reset controller
nvme nvme1: I/O 10 QID 0 timeout, reset controller

Then the console stopped responding so I did a hard power-down.

The Chelsio NIC is a used part. I bought three from a used equipment seller and one was DOA so I wondered if the one installed in the TN box was starting to fail. I powered up, set TN to use the motherboard's onboard ethernet, shut down and removed the NIC. I've now had the TN box running for a little while without the used NIC and it hasn't repeated the failure.
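
One way to watch for a recurrence, rather than relying on catching it on the console, would be to scan the kernel log for the same timeout lines. Below is a rough sketch in Python, assuming the message format matches the lines quoted above. The function name `count_nvme_timeouts` and the regular expression are my own illustration, not anything TrueNAS provides.

```python
import re
from collections import Counter

# Matches lines such as "nvme nvme1: I/O 105 QID 4 timeout, aborting".
# The pattern is an assumption based only on the lines quoted in this post;
# other kernel versions may word these messages differently.
TIMEOUT_RE = re.compile(
    r"nvme (?P<dev>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, "
    r"(?P<action>aborting|reset controller)"
)


def count_nvme_timeouts(log_text):
    """Return a Counter keyed by (device, action) for each timeout line found."""
    counts = Counter()
    for line in log_text.splitlines():
        # search() rather than match() so a leading timestamp is tolerated
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("action"))] += 1
    return counts
```

Feeding it the text of the kernel log (for example, the saved output of `dmesg`) after a few days of uptime with the NIC removed, and again with the NIC refitted, would at least show whether the timeouts track the card.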

Any guidance appreciated, thank you.