TrueNAS page fault crash/reboot on zvol snapshot rollback

Drucipher

Cadet
Joined
Dec 4, 2018
Messages
2
TN Version: TrueNAS-13.0-U2

CPU: 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Motherboard: Dell PowerEdge R720 16-Bay 2.5" 2U
Video: Whatever comes on that board
Boot: SAMSUNG SSD SM841N 256GB connected to motherboard.
RAID: 16x Samsung SSD 870 3.64 TiB
RAM: 12x 64GB ECC, 768 GB total. Not sure of the brand; it was selected to max out the RAM at the online shop we bought from.
PERC H310 controller reflashed for IT mode (passthrough)

Setup: 2 RAIDZ2 vdevs, mirrored (hard to recall what I did, but I think they're mirrored; the UI doesn't say, as far as I can see), for roughly 40 TB.


What happens:
It seems fine for at least a week. But over time something happens that causes it to "page fault" and reboot when I click the rollback button on a snapshot of a zvol.

We run 3 VM hosts with 8 or so VMs on each, talking iSCSI back to the shared zvols on TrueNAS. The TrueNAS box also streams to 2 backup NAS units.
Everything is usually fine. But after a month, or some unknown amount of time, if you go into TrueNAS and attempt a rollback on a zvol disk, what you get instead is a reboot of the entire NAS -_- and everything goes down for 10 minutes.

Immediately following the reboot, the rollback looks like it worked. Typically I need to go back further, because either it didn't actually work or I didn't go back far enough; it's sometimes hard to tell. But rollbacks work fine after that for at least a week or so.

Anyway, we just bought this setup as an upgrade from basically the same setup, but with smaller Dell SSDs and an unflashed H310 that was already the correct version and didn't need to be flashed. The old setup had the same issue: rollbacks leave a "page fault" error in the crash logs and a reboot occurs.
We set up the new server as a backup and then made it the live server. So it was a fresh install, but the config was restored.

Has anyone heard of this or experienced it? Fixes? Thoughts? Should I post some of these crash logs somehow? Let me know. I'm fearful of rollbacks now that this has happened 3 times on 2 different sets of hardware. (Not super different; both were Dell R720s.)
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Setup: 2 RAIDZ2 vdevs, mirrored (hard to recall what I did, but I think they're mirrored; the UI doesn't say, as far as I can see), for roughly 40 TB.
You can't mirror RAIDZ2 VDEVs in a pool; if both are in the same pool, they will be striped.
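
As a rough sanity check on the striping point, the capacity arithmetic can be sketched in Python. The split of 2 vdevs x 8 disks is an assumption (the original post only gives 16 disks and 2 RAIDZ2 vdevs), and the function name is made up for illustration.

```python
def striped_raidz_usable_tib(vdevs: int, disks_per_vdev: int,
                             parity: int, disk_tib: float) -> float:
    """Rough usable space of identical RAIDZ vdevs striped in one pool.

    Ignores padding, metadata, and reserved space, so real figures are lower.
    """
    if disks_per_vdev <= parity:
        raise ValueError("each vdev needs more disks than parity devices")
    data_disks = disks_per_vdev - parity
    return vdevs * data_disks * disk_tib


# Assumption: the 16 disks are split into 2 vdevs of 8; RAIDZ2 means parity = 2.
estimate = striped_raidz_usable_tib(vdevs=2, disks_per_vdev=8,
                                    parity=2, disk_tib=3.64)
print(f"{estimate:.2f} TiB before overhead")
```

Under those assumptions the estimate comes out at about 43.7 TiB, which is in line with the "roughly 40 TB" reported and fits two RAIDZ2 vdevs striped together.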

We run 3 VM hosts with 8 or so VMs on each, talking iSCSI back to the shared zvols on TrueNAS. The TrueNAS box also streams to 2 backup NAS units.
Block storage and RAIDZ VDEVs aren't a good mix (although that may be mitigated a little with all SSD pools)...

Also, backups (typically file shares with large sequential transfers) aren't a good mix on the same pool as block storage (backups would usually be best on RAIDZ VDEVs).

TN Version: TrueNAS-13.0-U2
I haven't gone through the details of the updates or compared every fixed issue against yours, but I feel like U4 is something you could be running to pick up any fixes that are available, if this is a bug of some sort.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Are you rolling back a zVol that is still in use?

While you can roll back ZFS datasets that are in use, zVols may be different. You should probably shut down any client use of the zVol before rolling back (which you may be doing, but don't specify).

Next, zVol snapshots tend to be less reliable unless you take them when the zVol is unmounted from the client. Remember, ZFS does not know about the structure of the file system inside the zVol, only which data blocks are in use.

What is the purpose of rolling back a zVol for you?
 

Drucipher

Cadet
Joined
Dec 4, 2018
Messages
2
Each zVol represents a disk belonging to a VM. So let's say we have a domain controller server; it will have 2 zVols, one for boot and one for data or other things.
If ransomware comes in and encrypts some files on a share on the data disk and somehow wipes the shadow copies, I can simply roll back the shared disk. Or I can restore it to a new zVol, attach that, and restore specific things... but usually I'll just roll back the whole disk.
Or a SQL Server VM has a boot and a data zVol, and someone has the great idea to upgrade Windows or run updates, and there's an issue: roll back the boot disk.
I do turn off the VM while doing a rollback on its zVol share. However, it's iSCSI, so technically the Linux host still has the target and LUNs attached.
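
The two recovery paths described above (rolling back the whole disk versus restoring to a new zVol) map onto `zfs rollback` and `zfs clone`. Here is a minimal Python sketch that only builds the command lines and does not run them. The dataset and snapshot names are hypothetical, and the helper names are made up for illustration.

```python
from typing import List


def _check_snapshot(snapshot: str) -> None:
    # ZFS snapshot names have the form dataset@snapname.
    dataset, sep, snap = snapshot.partition("@")
    if not (dataset and sep and snap):
        raise ValueError(f"not a snapshot name: {snapshot!r}")


def rollback_cmd(snapshot: str, destroy_newer: bool = False) -> List[str]:
    """argv for an in-place rollback of the zvol to `snapshot`."""
    _check_snapshot(snapshot)
    cmd = ["zfs", "rollback"]
    if destroy_newer:
        # -r destroys every snapshot newer than the target; it is needed
        # when the target is not the most recent snapshot.
        cmd.append("-r")
    return cmd + [snapshot]


def clone_cmd(snapshot: str, new_zvol: str) -> List[str]:
    """argv for a writable clone, leaving the original zvol untouched."""
    _check_snapshot(snapshot)
    return ["zfs", "clone", snapshot, new_zvol]


# Hypothetical names, for illustration only.
snap = "tank/vms/dc01-data@auto-2022-10-01"
print(" ".join(rollback_cmd(snap, destroy_newer=True)))
print(" ".join(clone_cmd(snap, "tank/vms/dc01-data-restore")))
```

The clone path leaves the live zVol and its newer snapshots in place, so it may be the gentler option when only a few files need to come back. Whether it avoids the page fault is not something this thread establishes.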

I'm using libvirt / virt-manager for VM management on the KVM hosts, which run Fedora 34. (On the newer version of Fedora we have another issue where Windows VMs fail to install on the zVols or fail to boot after install, which is weird, so we've been waiting a year or so to upgrade.)

I see what you mean. We turn on the VM that uses the soon-to-be-rolled-back zVol, later turn it off, and then attempt the rollback. Maybe TrueNAS somehow cares that the zVol is still 'connected', or the host (virt-manager/libvirt/KVM) still has some background read or write in flight while the VM is off, and when the rollback starts... poof: page fault, TrueNAS reboot.
We're OK with a little bit of loss on the VMs due to snapshots being taken while they're "on". We run DBCC checks and also have plenty of "sane" backups, as we don't solely trust the TrueNAS. We also have a backup server where database dumps and other important things are archived. So sane snapshots are hoped for but not completely relied on.

Our biggest issue is this reboot thing, which has spanned versions 12 and 13 and two similar hardware setups (not identical, but both were Dell R720s using H310s, though two different H310s; one was an 'i'). I do see many people on the forum complaining of random reboots, but ours isn't random: after a month or so, it's basically expected if you attempt a rollback on a zVol to some arbitrary snapshot.

As for performance, everything is fine. We have 22 machines currently running. Some are remote-user Windows workstations, and most are servers: DNS, SQL Server, Postgres, domain controllers, Windows storage share boxes. I only use the TrueNAS NFS storage for sharing with the KVM hosts, to back up VM config and control who's running on first and who's on second, and as a backup for the NAS config, so that it's automatically backed up to the backup pools on the backup TrueNAS boxes. Those just use the replication process every hour, once a day, and weekly, with varying retention.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
technically the Linux host still has the target and LUNs attached
That may be problematic... you might want to restart that too after rolling back if that can't be managed properly.
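
One way to picture "managing it properly" is to detach the Linux initiator before the rollback and reattach it afterwards. The Python sketch below only builds that ordered plan and does not execute anything. The IQN, portal address, snapshot name, and the host labels `kvm-host` and `truenas` are all hypothetical, and nothing in this thread confirms that the sequence avoids the page fault.

```python
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (where to run it, argv)


def rollback_plan(target_iqn: str, portal: str, snapshot: str) -> List[Step]:
    """Ordered steps: detach the initiator, roll back, then reattach."""
    node = ["iscsiadm", "-m", "node", "-T", target_iqn, "-p", portal]
    return [
        ("kvm-host", node + ["--logout"]),           # drop the iSCSI session
        ("truenas", ["zfs", "rollback", snapshot]),  # latest snapshot only
        ("kvm-host", node + ["--login"]),            # reattach the LUN
    ]


# Hypothetical values, for illustration only.
plan = rollback_plan(
    target_iqn="iqn.2005-10.org.freenas.ctl:dc01-data",
    portal="192.0.2.10:3260",
    snapshot="tank/vms/dc01-data@auto-2022-10-01",
)
for where, argv in plan:
    print(f"[{where}] {' '.join(argv)}")
```

The VM using the disk would need to be shut down before the first step, and rolling back to anything older than the most recent snapshot additionally needs `zfs rollback -r`, which destroys the newer snapshots.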

ESXi (the only really recommended hypervisor for Virtual TrueNAS) handles that properly.
 