Issues with qcow2 images and TrueNAS Core 12.0-BETA?

Nodefield

Cadet
Joined
Aug 6, 2020
Messages
4
I'm running three nearly identical virtualization setups each with:

- two to four virtualization hosts (Proxmox VE 6.2)
- two FreeNAS servers as VM image storage (one dedicated to an SSD pool, one to an HDD pool with SSD caches, 128 GB RAM each)
- 10G networking dedicated to VM storage traffic only (MTU 9000)
- NFS (4.1)
- Simple config: no special hacks or unusual configurations (either in TrueNAS/FreeNAS or in the VM hypervisor)
- Mostly HP ProLiant hardware, with some SuperMicro servers; all JBOD disks
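
Roughly, each Proxmox host points at the FreeNAS boxes through a plain NFS storage definition. The sketch below shows the idea; the storage name, server address and export path are placeholders rather than my real values, and on older PVE builds the vers=4.1 option may need to go directly into /etc/pve/storage.cfg instead:

```
# Add an NFS-backed image store to Proxmox (placeholder names/addresses):
pvesm add nfs truenas-hdd \
    --server 192.0.2.10 \
    --export /mnt/hddpool/vmstore \
    --content images \
    --options vers=4.1
```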

Recently I tested these virtualization clusters with TrueNAS Core 12.0-BETA. Initially everything looked fantastic: great performance, no immediate issues, etc.

After seemingly running smoothly and reliably for an average of 1-2 days, some virtual machines in each separate cluster (different locations, different hardware) started having random I/O issues and errors with their disks (qcow2 images mounted via NFSv4.1 from the TrueNAS Core servers).

I found out that these qcow2 images had become internally corrupted. Some were beyond repair with `qemu-img` and some were repairable. After it became clear that the issue was popping up in all three separate installations, I quickly reverted all storage servers back to FreeNAS 11.3-U4.1. I then rolled back the affected qcow2 images to the last ZFS snapshot that still contained an uncorrupted file. After that, the qcow2 corruption issues disappeared.
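
For anyone hitting the same thing, the check/repair/restore steps looked roughly like the sketch below; the image, pool, dataset and snapshot names are placeholders for illustration:

```
# Report corruption without modifying the image:
qemu-img check vm-100-disk-0.qcow2
# Attempt an in-place repair (work on a copy first):
qemu-img check -r all vm-100-disk-0.qcow2
# For images beyond repair, pull the file back from an earlier ZFS snapshot:
cp /mnt/tank/vmstore/.zfs/snapshot/auto-20200801/vm-100-disk-0.qcow2 \
   /mnt/tank/vmstore/vm-100-disk-0.qcow2
```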

I must point out that during all this the TrueNAS 12 servers themselves seemed to run smoothly, without any apparent storage (or other) errors being reported.

As I've rolled back to 11.3, I'm not immediately able to debug or retest. I'm interested in hearing whether anyone else has experienced similar issues. Any thoughts from iXsystems?
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
I don't think we've seen anything similar to this elsewhere yet. By any chance, did you grab a debug file while running on 12.0 before you rolled back? The logs there may have given us some clues as to why you were seeing this behavior.
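
If you can boot one of the boxes back into 12.0, a debug can be saved from the web UI under System > Advanced > Save Debug, or from a shell roughly like this (assuming the freenas-debug utility on your build accepts -A for a full dump; run it without arguments to see the option menu if not):

```
# Collect a full debug bundle on the running system:
freenas-debug -A
```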
 

Nodefield

Cadet
Joined
Aug 6, 2020
Messages
4
Unfortunately I did not grab debug files, but the 12.0 boot environment is still installed on all servers and I can temporarily reboot some of them back into it. Do they retain relevant debug information across such boot environment switches?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
System logs are written to the system dataset, which is not tied to boot environments. So unless you have them configured to live on a RAM disk (in which case they are lost), the logs should cover all your switches back and forth, though reading across them may require some extra care. Without them, we just have nothing to start from.
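
If you want to verify that yourself, something like the sketch below will show whether the system dataset (and syslog with it) lives on a pool rather than a RAM disk; "tank" is a placeholder pool name:

```
# The system dataset is normally <pool>/.system; confirm it exists and is mounted:
zfs list -r -o name,used,mountpoint tank/.system
# Then look for storage- or NFS-related noise around the time of the corruption:
grep -i -E 'error|nfsd|timeout' /var/log/messages | tail -n 100
```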
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Are those sync writes or not?
Besides testing by going back, what additional testing have you done to exclude Proxmox as the one misbehaving?
(In theory it could also be Proxmox misbehaving, with TrueNAS 12 just being more prone to it.)
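
To be concrete, something along these lines on both ends would answer that; the dataset name and VMID are placeholders, and zilstat is an assumption about what ships on your build:

```
# TrueNAS/FreeNAS side: is sync honoured on the dataset behind the share,
# and is there actual ZIL (sync write) traffic while guests are busy?
zfs get sync,recordsize tank/vmstore
zilstat 5
# Proxmox side: NFS mount options actually negotiated, and the qcow2 cache
# mode of an affected guest (100 is a placeholder VMID):
nfsstat -m
grep -E 'cache=|scsi0|virtio0' /etc/pve/qemu-server/100.conf
```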
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
ornias said:
Are those sync writes or not?
Besides testing by going back, what additional testing have you done to exclude Proxmox as the one misbehaving?
(In theory it could also be Proxmox misbehaving, with TrueNAS 12 just being more prone to it.)

I have the same issue with oVirt. It's probably TrueNAS's fault. NFS should not be affected by sync=standard as far as I know. iSCSI is much more prone to being affected by sync=standard/always than NFS.
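
If someone wants to rule sync handling in or out, one quick (if slow) test is to force synchronous semantics on the backing dataset and see whether the corruption still shows up; "tank/vmstore" is a placeholder dataset name:

```
# Force every write to be synchronous for the duration of the test:
zfs set sync=always tank/vmstore
# ...run the VM workload, then restore the default:
zfs set sync=standard tank/vmstore
```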

 