viniciusferrao
Contributor · Joined Mar 30, 2013 · Messages: 192
Not to pile on, but I wanted to share my experience, since I believe I'm also hitting this silent data corruption issue. My home lab is an ESXi/FreeNAS all-in-one server that has been rock solid for 5+ years... that is, until December 2020. I can't believe I didn't consider that FreeNAS could be the problem, but going back through my Windows Home Server 2011 logs, the timeline now matches up.
Within the ESXi host, one of my VMs is a Windows Home Server 2011 instance that runs client PC backups for my home. In early/mid December, some client PCs started failing the backup process. The system would retry the backup the next night, and most of the time it would go through, but as December went on these errors became more and more frequent, to the point that I could not complete a backup at all on the system with the largest amount of data. I've spent the last two weeks ripping my hair out testing everything I can: scrubbing volumes in FreeNAS, running chkdsk on all the clients and the Windows server, memory tests. I even destroyed my client computer backup database and started over, assuming it was some bug in WHS2011. Rebuilding the backup database from scratch resulted in the same problem, with multiple client PCs unable to finish a single backup. I convinced myself it was something in my software/OS configuration, perhaps WHS2011, which is well past deprecated at this point and unsupported. But a fresh Windows Server 2016 Essentials install ran into the exact same problems during client PC backups, and even a newly built desktop with a fresh OS install hit backup errors.
Auditing the logs in both Windows Home Server 2011 and Windows Server Essentials 2016, the common denominator was file checksum verification failures in the client backup database files that Windows Server uses for these tasks (WHS and WSE split the backup database into multiple 4 GB files). Each time the failure was on a different backup database file, not a common one. I've scrubbed the volume over and over and never seen a single error. I've moved the VMDK for the backup folder from a RAIDZ1 pool to a different RAIDZ2 pool and had the same problem. I tried NFSv3 vs. NFSv4.1, changed record sizes, and recopied the VMDK over and over. I forced sync from standard to always, with no improvement. Basically I'm losing my sanity trying to figure this out. I spent days thinking I potentially had an ESXi issue, patching, downgrading, upgrading, etc., but all of it failed. I was just about ready to replace the motherboard/CPU and go further down the hardware rabbit hole when I came across this thread...
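For anyone retracing these steps, the ZFS-side checks and tweaks described above look roughly like the following on a FreeNAS shell. This is a sketch; `tank` and `tank/vmware` are placeholder pool/dataset names, not the poster's actual layout:

```shell
# Scrub the pool and look for checksum errors
zpool scrub tank
zpool status -v tank

# Inspect the current record size and sync policy on the dataset
# backing the NFS export
zfs get recordsize,sync tank/vmware

# Force synchronous writes (the "Sync: Always" setting in the GUI)
zfs set sync=always tank/vmware
```

Note that `sync=always` only affects write durability on power loss; as the post shows, it does not help with this corruption issue.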
Needless to say, I'm exporting my pools right now and will reinstall a fresh FreeNAS 11.3 and reimport them, which luckily I did not upgrade. (I don't see 11.x in my system boot options any more for some reason; I must have purged them after my upgrade. Oddly, I can still see some of my 9.x boots from 5 years ago.) I'll report back if my issues disappear once I get things set up again. I'm not sure there's much else I can contribute to help solve this issue, but I'm happy to answer any follow-ups about my setup/configuration specifics. -Dan
System Specifics:
Intel i7-4790 CPU
32GB DDR3 RAM (non-ECC; yes, I know that's not ideal, but it has never been a problem in the past)
2x SAS2308 PCIe cards for storage, passed through to the FreeNAS VM
2x 480GB SanDisk Ultra II mirror SSD volume
4x 8TB WD in RAIDZ1 HDD volume
8x 4TB Seagate in RAIDZ2 HDD volume
ESXi 7.0 Update 1c
FreeNAS 12.0U1
VM where I've seen the issue: WHS2011 SP1, also on WSE2016
NFS mounts for all 3 volumes into ESXi to be able to use the storage in my VMs
No issues observed with my media libraries stored directly within ESXi, though I'm not sure I would have noticed corruption there yet in any case...
Yes, you are definitely hitting the same issue. Check the Jira ticket for more info.
In your case, if you didn't remove the 11.3-U5 boot environment, you can just select it and reboot. You'll be safe after that.
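Besides the GUI's boot screen, the old boot environment can also be listed and activated from the FreeNAS shell with `beadm`. A sketch, assuming the environment kept its default `11.3-U5` name:

```shell
# List available boot environments and note which is active
beadm list

# Activate the pre-12.0 environment, then reboot into it
beadm activate 11.3-U5
shutdown -r now
```

Rolling back the boot environment only works if the pools were not upgraded to the 12.0 feature flags, which is why the poster's unupgraded pools matter here.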
On the issue itself: right now Ryan (from iXsystems) has built a custom openzfs.ko module for 12.0-U1 without async CoW, and I'm running it on two of the three pools that I have. The issue is probably async CoW, so if you already upgraded your pools, the custom module would be an option. It's available on the Jira ticket.
iXsystems is doing an awesome job trying to fix this issue. We discussed it in a meeting two hours ago and checked my systems, adding the custom modules to them. They are running right now; if they don't crash, we may have found the issue.
More testing will be done in the coming days.