NVMe passthrough works for a few days, then dies with a flurry of "__common_interrupt: 6.34 No irq handler for vector"?

tackyone

Dabbler
Joined
Jun 8, 2020
Messages
19
Hi,

I'm running TrueNAS SCALE 22.12.0 on a SuperMicro X10DRU board (latest BIOS, E5-2690v3 CPU). This works fine in daily use. I have a couple of VMs on there. I enabled 'PCI Passthrough' for one of the VMs, passing through (as seen from lspci):

03:00.0 Non-Volatile memory controller: Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD

This works fine - the virtual machine (Linux) can see the NVMe at boot and can access it. Everything works great for a few days - then on TrueNAS SCALE I see a flurry of:

[412820.847944] __common_interrupt: 6.34 No irq handler for vector
[412827.830043] __common_interrupt: 6.34 No irq handler for vector

Dumped to the console / logged in session.

In the virtual machine I get a reciprocal:

[406246.844310] nvme nvme0: I/O 703 QID 3 timeout, completion polled
[406246.844332] nvme nvme0: I/O 704 QID 3 timeout, completion polled
[406260.922879] nvme nvme0: I/O 705 QID 3 timeout, completion polled

At this point the virtual machine loses all access to the NVMe. If I 'force stop' it, I can see the NVMe being returned to TrueNAS:

[414221.201968] nvme nvme0: pci function 0000:03:00.0
[414221.237193] nvme nvme0: 24/0/0 default/read/poll queues
[414221.253739] nvme0n1: p1

And when I restart the VM, I can see it (presumably) being taken away again:

[414319.946571] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x300
[414319.953971] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x900

And the VM starts, and can access the nvme again. Until - the cycle repeats.
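For anyone wanting to line up the two logs, here is a rough sketch (not something from the original setup) that pulls the interesting fields out of the host and guest messages quoted above. As far as I understand the kernel's message format, the "6.34" in the host line is CPU number followed by interrupt vector, so the script counts stray interrupts per (cpu, vector) and guest timeouts per (device, queue ID). The function name `summarize` is purely illustrative.

```python
import re
from collections import Counter

# Host side, e.g.:
#   [412820.847944] __common_interrupt: 6.34 No irq handler for vector
# The leading "[" is optional because pasted logs sometimes lose it.
HOST_RE = re.compile(
    r"\[?\s*(?P<ts>\d+\.\d+)\]\s+__common_interrupt:\s+"
    r"(?P<cpu>\d+)\.(?P<vector>\d+)\s+No irq handler for vector"
)

# Guest side, e.g.:
#   [406246.844310] nvme nvme0: I/O 703 QID 3 timeout, completion polled
GUEST_RE = re.compile(
    r"\[?\s*(?P<ts>\d+\.\d+)\]\s+nvme\s+(?P<dev>nvme\d+):\s+"
    r"I/O\s+(?P<tag>\d+)\s+QID\s+(?P<qid>\d+)\s+timeout"
)


def summarize(log_text):
    """Count stray interrupts per (cpu, vector) and timeouts per (dev, qid).

    Uses finditer over the whole text, so it also copes with several
    messages run together on one line.
    """
    stray = Counter()
    timeouts = Counter()
    for m in HOST_RE.finditer(log_text):
        stray[(int(m.group("cpu")), int(m.group("vector")))] += 1
    for m in GUEST_RE.finditer(log_text):
        timeouts[(m.group("dev"), int(m.group("qid")))] += 1
    return stray, timeouts
```

Feeding it the saved `dmesg` output from the host and the guest (as one string each, or concatenated) shows whether the stray interrupts always land on the same CPU and vector, and whether the timeouts are confined to one NVMe queue.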

Any ideas?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I suspect this isn't particular to TrueNAS, more to KVM.

A quick internet search produced a reference to an article that talks about the device firmware of the NVME device needing to be updated... no idea if that's applicable here or not, but it might help you to widen the search.
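If you want to check which firmware the drive is currently running before hunting for an update, something like the following might do. It is only a sketch, assuming the usual Linux sysfs layout where each NVMe controller exposes `model` and `firmware_rev` files under `/sys/class/nvme`; the function name `nvme_firmware` is made up for illustration. Note that while the device is bound to vfio-pci it won't show up on the host, so run it inside the guest, or on the host with the VM stopped.

```python
from pathlib import Path


def nvme_firmware(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: (model, firmware_rev)} for each controller found.

    sysfs_root is a parameter so the function can be pointed at a test
    directory; the default is an assumption about the usual sysfs layout.
    """
    info = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No NVMe controllers bound to the nvme driver (or not Linux).
        return info
    for ctrl in sorted(root.iterdir()):
        try:
            model = (ctrl / "model").read_text().strip()
            fw = (ctrl / "firmware_rev").read_text().strip()
        except OSError:
            # Skip entries that lack the expected attribute files.
            continue
        info[ctrl.name] = (model, fw)
    return info


if __name__ == "__main__":
    for name, (model, fw) in nvme_firmware().items():
        print(f"{name}: {model} (firmware {fw})")
```

The reported revision can then be compared against whatever the vendor lists as current for that model.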

 

tackyone

Dabbler
Joined
Jun 8, 2020
Messages
19
Hmm, interesting. I have another nvme to try.

I have a horrible feeling it's not going to be related, or that simple. Going through those threads, they seem to be discussing an issue with resets, which can be 'iffy' on passed-through devices. I have an LSI RAID controller that works fine under passthrough but cannot be reset, as the card simply doesn't support it. Even that runs OK until you get, say, a SAS bus error, which goads the driver into trying to reset the card, which unleashes meltdown. Which is why I don't use it any more.

From what I can see the WD uses a custom 'WD' controller. Whether that's based on a more common one I don't know, though it IDs as "Sandisk", which isn't Silicon Motion (SM). So much of this stuff is hidden behind the scenes.

You would have thought for something that essentially "just attaches to the PCIe bus" - they'd get it right, but oh well.

Anyway - thanks for the pointer in a different direction. I had looked up nvme passthrough stuff in general, but forgot the KVM aspect.

Cheers
 