Crash. Now bad performance

orddie

Contributor
Joined
Jun 11, 2016
Messages
104
Hello Everyone!

Server specs are in my signature. 3x ESXi servers. I recently added three 1TB SSDs to extend the pool; they have not been put in place yet.

The pool was configured with a log device (SLOG, an Intel Optane drive) and a cache (L2ARC, 2x 500GB NVMe drives configured as a mirror).

For the last two days, the system has "crashed". Crashed in that:
- Each ESXi server shows the NFSv3 datastore as offline (unable to browse; VMs that were running have crashed or locked up; VMs that were not running show as invalid or missing)
- The TrueNAS UI is working and I can log in, but the dashboard has no data, and the Pools page shows no pools or gives a spinning wheel when trying to load
- SSH works and I can log in. I can also open a shell from the UI and issue commands
- Attempts to reboot fail. The system acts like it accepts the reboot command but does not react.

The first time this happened, I hard-powered off the server.
26(ish) hours later, it happened a second time. I poked around a little and saw references in dmesg to nvme2 failing outstanding I/O and resetting the controller due to a timeout and possible hot unplug.
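A rough sketch of how I checked the drive's health after seeing those messages (the device name `nvme2` is an assumption based on the dmesg output; substitute your own):

```shell
# Pull the NVMe SMART/Health log page (log page 2 per the NVMe spec)
# to look for critical warnings, media errors, or thermal issues:
nvmecontrol logpage -p 2 nvme2

# smartmontools ships with TrueNAS and reads the same health data:
smartctl -a /dev/nvme2

# Re-check the kernel log for controller resets since boot:
dmesg | grep -i nvme
```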

So today I removed the cache devices from the pool in hopes of keeping the system running longer than 26 hours. The SLOG is still attached.
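For anyone following along, this is roughly what removing the L2ARC looks like; the pool name `tank` and device names are placeholders, take the real ones from `zpool status`:

```shell
# Identify the cache vdev and its member devices:
zpool status tank

# L2ARC (cache) devices can be removed from a live pool without
# any data loss -- L2ARC holds only copies of pool data:
zpool remove tank nvd2p1 nvd3p1

# Verify the cache vdev is gone:
zpool status tank
```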

Each time TrueNAS was rebooted, the ESXi hosts were not. VMs now take 40-45 seconds to power on. The progress bar goes right to 65%, then 78%, then powered on.

Do you think my issue is on the TrueNAS side or the ESXi side?

I ran a CrystalDiskMark benchmark and still get 1156 MB/s read and 1023 MB/s write. ESXi shows datastore latency (at times) higher than 50 ms! I also saw the following two lines in dmesg today:
pid 2203 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
WARNING: 172.16.40.21 (iqn.1991-05.com.microsoft:<VM name>): no ping reply (NOP-Out) after 5 seconds; dropping connection
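To narrow down which side the latency is on, a hedged sketch of what I'd run on an ESXi host over SSH (these are standard ESXi tools, but exact column names vary by version):

```shell
# Confirm the NFS datastore is still mounted and accessible
# from the host's point of view:
esxcli storage nfs list

# esxtop (interactive) exposes storage latency counters; in its disk
# views, DAVG/cmd is device latency, KAVG/cmd is kernel latency, and
# GAVG/cmd is the total the guest sees. High DAVG points at storage,
# high KAVG at the host itself:
esxtop
```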
 

orddie
The system has been stable since my last post.

As far as I can tell, the following two posts appear to match my issue.
 