Issues with qcow2 images and TrueNAS Core 12.0-BETA?

Nodefield

Cadet
Joined
Aug 6, 2020
Messages
4
I'm running three nearly identical virtualization setups each with:

- two to four virtualization hosts (Proxmox VE 6.2)
- two FreeNAS servers as VM image storage (one dedicated to an SSD pool, one to an HDD pool with SSD caches, 128 GB RAM each)
- 10G networking dedicated to VM storage traffic only (MTU 9000)
- NFS (4.1)
- Simple config: no special hacks or unusual configurations (either in TrueNAS/FreeNAS or in the VM hypervisor)
- Mostly HP ProLiant hardware, with some SuperMicro servers; all JBOD disks
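
Roughly, each Proxmox host points at the FreeNAS boxes through a plain NFS storage definition. The sketch below shows the idea; the storage name, server address and export path are placeholders rather than my real values, and on older PVE builds the vers=4.1 option may need to go directly into /etc/pve/storage.cfg instead:

```
# Add an NFS-backed image store to Proxmox (placeholder names/addresses):
pvesm add nfs truenas-hdd \
    --server 192.0.2.10 \
    --export /mnt/hddpool/vmstore \
    --content images \
    --options vers=4.1
```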

Recently I tested these virtualization clusters with TrueNAS Core 12.0-BETA. Initially everything looked fantastic: great performance, no immediate issues, etc.

After seemingly running smoothly and reliably for an average of 1-2 days, some virtual machines in each separate cluster (different locations, different hardware) started having random I/O issues and errors with their disks (qcow2 images mounted via NFSv4.1 from the TrueNAS Core servers).

I found out that these qcow2 images had become internally corrupted. Some were beyond repair with `qemu-img` and some were repairable. After it became clear that the issue was popping up in all three separate installations, I quickly reverted all storage servers back to FreeNAS 11.3-U4.1. I then rolled back the affected qcow2 images to the last ZFS snapshot that still contained an uncorrupted file. After that, the qcow2 corruption issues disappeared.
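
For anyone hitting the same thing, the check/repair/restore steps looked roughly like the sketch below; the image, pool, dataset and snapshot names are placeholders for illustration:

```
# Report corruption without modifying the image:
qemu-img check vm-100-disk-0.qcow2
# Attempt an in-place repair (work on a copy first):
qemu-img check -r all vm-100-disk-0.qcow2
# For images beyond repair, pull the file back from an earlier ZFS snapshot:
cp /mnt/tank/vmstore/.zfs/snapshot/auto-20200801/vm-100-disk-0.qcow2 \
   /mnt/tank/vmstore/vm-100-disk-0.qcow2
```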

I must point out that during all this the TrueNAS 12 servers themselves seemed to run smoothly, without any apparent storage (or other) errors being reported.

As I've rolled back to 11.3, I'm not immediately able to debug or retest. I'm interested in hearing whether anyone else has experienced similar issues. Any thoughts from iXsystems?
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,448
I don't think we've seen anything similar to this elsewhere yet. By any chance, did you grab a debug file while running on 12.0 before you rolled back? The logs there may have given us some clues as to why you were seeing this behavior.
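
If you can boot one of the boxes back into 12.0, a debug can be saved from the web UI under System > Advanced > Save Debug, or from a shell roughly like this (assuming the freenas-debug utility on your build accepts -A for a full dump; run it without arguments to see the option menu if not):

```
# Collect a full debug bundle on the running system:
freenas-debug -A
```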
 

Nodefield

Cadet
Joined
Aug 6, 2020
Messages
4
Unfortunately I did not grab debug files, but the 12.0 boot environment is still installed on all servers and I can temporarily reboot some of them back into it. Do they retain relevant debug information across such boot environment switches?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
System logs are written to the system dataset, which is not tied to boot environments. So unless you have them configured to live on a RAM disk (in which case they are lost), the logs should cover all your switches back and forth, though reading across them may require some extra care. Without them, we just have nothing to start from.
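
If you want to verify that yourself, something like the sketch below will show whether the system dataset (and syslog with it) lives on a pool rather than a RAM disk; "tank" is a placeholder pool name:

```
# The system dataset is normally <pool>/.system; confirm it exists and is mounted:
zfs list -r -o name,used,mountpoint tank/.system
# Then look for storage- or NFS-related noise around the time of the corruption:
grep -i -E 'error|nfsd|timeout' /var/log/messages | tail -n 100
```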
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Are those sync writes or not?
Besides testing by going back, what additional testing have you done to exclude Proxmox as the one misbehaving?
(In theory it could also be Proxmox misbehaving, with TrueNAS 12 just being more prone to it.)
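
To be concrete, something along these lines on both ends would answer that; the dataset name and VMID are placeholders, and zilstat is an assumption about what ships on your build:

```
# TrueNAS/FreeNAS side: is sync honoured on the dataset behind the share,
# and is there actual ZIL (sync write) traffic while guests are busy?
zfs get sync,recordsize tank/vmstore
zilstat 5
# Proxmox side: NFS mount options actually negotiated, and the qcow2 cache
# mode of an affected guest (100 is a placeholder VMID):
nfsstat -m
grep -E 'cache=|scsi0|virtio0' /etc/pve/qemu-server/100.conf
```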
 

viniciusferrao

Contributor
Joined
Mar 30, 2013
Messages
192
ornias said:
Are those sync writes or not?
Besides testing by going back, what additional testing have you done to exclude Proxmox as the one misbehaving?
(In theory it could also be Proxmox misbehaving, with TrueNAS 12 just being more prone to it.)

I have the same issue with oVirt. It's probably TrueNAS's fault. NFS should not be affected by sync=standard as far as I know. iSCSI is much more prone to being affected by sync=standard/always than NFS.
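
If someone wants to rule sync handling in or out, one quick (if slow) test is to force synchronous semantics on the backing dataset and see whether the corruption still shows up; "tank/vmstore" is a placeholder dataset name:

```
# Force every write to be synchronous for the duration of the test:
zfs set sync=always tank/vmstore
# ...run the VM workload, then restore the default:
zfs set sync=standard tank/vmstore
```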

 