FreeNAS VMs randomly go down with "blocked for more than 120 seconds" errors.

d3vnu77

Cadet
Joined
May 4, 2020
Messages
5
I have 5 VMs running on a 32-core AMD Threadripper 2990WX with 128 GB of RAM.

Seemingly at random, these VMs go down: I can't access them via SSH, and when I use the VNC console built into FreeNAS, I see the errors in the attached screenshot.

The only way to fix the problem (temporarily) seems to be rebooting the server, until it happens again.

Not sure where to look on this matter.
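For anyone who wants the text rather than a screenshot, the same messages can be pulled from a guest's kernel log with something like this (assuming the guests are Linux; exact commands may differ per distro):

Code:
# Inside an affected VM: show recent hung-task warnings from the kernel ring buffer
dmesg -T | grep -i "blocked for more than"

# On a systemd-based guest, the persistent journal carries the same messages
journalctl -k | grep -i "blocked for more than"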
 

Attachments

  • blocked_vm.PNG (175.3 KB)
Joined
Dec 29, 2014
Messages
1,135
What kind of drives and HBA are you using? Also, what is your network setup?
 

d3vnu77

Cadet
Joined
May 4, 2020
Messages
5
OK, it seems I'm also getting this error on the actual physical server. Is a drive bad? The front end seems OK, as the drives show "Green/Healthy". Not sure if this is related.

No HBA, just using the SATA ports on the motherboard to connect 6 2TB Seagate Barracuda Compute hard drives in a RAID-Z2 pool.

The VMs all run on 3 Samsung 970 EVO NVMe M.2 1TB SSDs in a RAID-Z1 pool.

The server has two network connections.

1. The first connection is a LAN connection with an internal IP of 192.168.1.210.
2. The second NIC has an outside static IP. The VMs are web servers, and each has its own static external IP.
 

Attachments

  • IMG_5191.JPG (200.5 KB)
  • IMG_5190.JPG (369.9 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The second NIC has an outside static IP.
Outside as in "public IPv4 address"? Don't do that. FreeNAS is not a hardened appliance and really should not be directly accessible from the Internet.

Regarding your errors, it definitely looks like your ada0 device is failing, but you mentioned the VMs all run from the RAID-Z1 of NVMe devices? It could be that ZFS as a whole is getting hung up on a bad drive/vdev and blocking on it.
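If you want to confirm which device is stalling before pulling anything, something along these lines from the FreeNAS shell should show it (the ada0 device node here is just an example; match it to your layout):

Code:
# Per-device read/write/checksum error counters and any scrub/resilver activity
zpool status -v

# Full SMART report for the suspect disk (reallocated/pending sectors are the usual red flags)
smartctl -a /dev/ada0

# Live per-disk busy % and latency; a dying or reshingling drive tends to sit pegged near 100%
gstat -p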

6 2TB Seagate Barracuda Compute harddrives in a raid Z2 pool.
Post the model numbers of these drives, please. A regular 6-drive RAID-Z2 will be slow but shouldn't choke on a little bit of I/O - but if those are secretly SMR, then you might be experiencing the wonderful world of reshingling.
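If the shell is easier than pulling them from the GUI, a quick loop over the SATA disks will print the models and serials (this assumes they enumerate as ada0 through ada5):

Code:
# Print model and serial for each disk in the Z2 pool
for d in ada0 ada1 ada2 ada3 ada4 ada5; do
  echo "== $d =="
  smartctl -i /dev/$d | grep -E "Device Model|Serial Number"
done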
 

d3vnu77

Cadet
Joined
May 4, 2020
Messages
5
Yes, all VMs run on the NVMe drives. Attached is an image of the drives in question. Pretty much the only real write activity on those drives at the moment is the gradual syncing of the Bitcoin and Litecoin blockchains.

To replace the drive, do I just pull it and insert another, or is there another process?
 

Attachments

  • IMG_5192.JPG (191.6 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Yes, all VMs run on the NVMe drives. Attached is an image of the drives in question. Pretty much the only real write activity on those drives at the moment is the gradual syncing of the Bitcoin and Litecoin blockchains.

To replace the drive, do I just pull it and insert another, or is there another process?

Those models (ST2000DM008) are indeed shingled (SMR) drives, but unless you are putting constant random I/O to them (BTC and LTC chain updates shouldn't be that heavy), it shouldn't be enough to overwhelm them to the point of timing out. A bad drive, though, could be causing problems.
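On the replacement question: don't just pull the disk cold. The supported path is through the GUI (offline the disk from the pool's status page, swap it, then use Replace), but for orientation the CLI sequence is roughly the following - the pool name "tank" and the device names are placeholders, and note that the GUI also takes care of partitioning and gptid labels, which this skips:

Code:
# Take the failing disk offline so ZFS stops issuing I/O to it
zpool offline tank ada0

# Physically swap the disk, then resilver onto the new one
zpool replace tank ada0 /dev/ada0

# Watch resilver progress until the pool is healthy again
zpool status -v tank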

Here is the documentation page specifically on replacing a failed disk:

 