I experienced a situation where a failing controller chip on a single NVMe drive in a pool of two striped mirror vdevs (4 drives) brought the whole system down. TrueNAS Core v13.0-U5.3 is running on bare metal on a Dell R730, and the NVMe drives are installed on a quad NVMe PCIe adapter (bus ID 128). It appears that the controller was gradually failing, and this was reported in the server log:
A fatal error was detected on a component at bus 128 device 2 function 0
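In case it helps anyone match that message to a device, here is a minimal Python sketch that pulls the bus/device/function numbers out of the log line and prints them in two common notations. It assumes the numbers in the Dell log are decimal and that the PCI domain is 0; `parse_pci_location` is just a hypothetical helper name, and I have not confirmed which device actually sits at that address on this system.

```python
import re

def parse_pci_location(message):
    """Extract bus/device/function from a log line like the one above.

    Assumption: the numbers in the message are decimal.
    """
    m = re.search(r"bus (\d+) device (\d+) function (\d+)", message)
    if m is None:
        return None
    bus, device, function = (int(g) for g in m.groups())
    return {
        # Hexadecimal bus:device.function notation
        "bdf_hex": f"{bus:02x}:{device:02x}.{function:x}",
        # Decimal selector style; domain 0 is an assumption
        "pciconf_selector": f"pci0:{bus}:{device}:{function}",
    }

msg = "A fatal error was detected on a component at bus 128 device 2 function 0"
print(parse_pci_location(msg))
```

The decimal selector form should be comparable against the output of `pciconf -l` on TrueNAS Core, so the address can be checked against the adapter and its drives.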
The console messages did record that one of the NVMe drives became read-only, but later testing showed that the controller froze whenever any significant load was applied (the drive is a WD Black SN750). The main problem, however, is that instead of the zpool continuing to run in a degraded state, the system does a hard reset. My assumption is that the reset is the result of a watchdog timer somewhere and is not related to the OS?
Has anyone experienced this before, and is this perhaps specific to Dell?
Cheers