FreeNAS (now TrueNAS) is no longer stable

viniciusferrao · Jan 13, 2021

djjaeger82 said:
Not to pile on, but wanted to list my experience. I believe I'm also facing this silent data corruption issue. I run for my home lab an ESXI / Freenas all in one server that has been rock solid for 5 years+....that is until December of 2020. I can't believe I didn't notice / consider Freenas could have been the possible problem, but it appears going back and checking my windows home server 2011 logs the timeline matches up now.

Within the ESXi host, one of my VMs is Windows Home Server 2011 that runs client PC backups in my home. In early/mid December I started to get some client PCs that would fail the backup process. The system would retry the back up the next night and most of the time it would go thru. But as December went on these errors became more and more frequent to the point that I could not even complete a backup on the system with the largest amount of data to back up. I've been spending the last 2 weeks ripping my hair out testing everything I can: scrubbing volumes in freenas, chkdsking everything on the clients and the windows server, memory tests. I even destroyed my client computer backup database and tried starting over assuming it was some bug / problem with WHS2011. Even rebuilding the backup database from scratch resulted in the same problem with multiple different client PCs not able to finish a single backup. I convinced myself it was something with my setup and software/OS configuration, perhaps WHS2011 that is well deprecated at this point and not supported. I even tried new Windows Server 2016 Essentials installs and ran into the exact same problems during client pc backups. Even built a new desktop with a fresh OS install and it also has backup errors.

Upon auditing all of the logs in both Windows Home Server 2011 and Windows Server essentials 2016, the common denominator were file checksum verification failures in the client backup database files that Windows Server uses for these tasks (WHS and WSE split the backup database into multiple 4GB files). Each time the failure seemed to be on a different backup database file, not a common one. I've scrubbed the volume over and over, and never a single error. I've moved the VMDK for the backup folder from a RAIDZ1 pool to a different RAIDZ2 pool and still had the same problem. Tried messing with NFS3 vs NFS4.1, changed record sizes and recopied the VMDK over and over. Forced Sync to Always from Standard, and no improvement. Basically I'm losing my sanity trying to figure this out. Spent days and days thinking I had an ESXi issue potentially, patching, downgrading, upgrading, etc but all failed. I was just about ready to replace the motherboard/CPU and further go down the hardware rabbit hole when I came across this thread...

Needless to say I'm exporting my pool right now and will be reinstalling a fresh FREENAS 11.3 install and reimporting these pools which luckily I did not upgrade. (I don't see 11.xx in my system boot options for some reason any more, I must have purged them after my upgrade, oddly I can see some of my 9.x boots from 5 years ago) I'll report back if my issues disappear immediately once I get things setup again. I'm not sure theres much else I can contribute to help solve this issue but am happy to answer any follow-ups about my setup / configuration specifics. -Dan

System Specifics:

Intel i7-4790 CPU
32GB DDR3 RAM (non-ecc, ya ya i know not ideal but never has been a problem in the past)
2x SAS2308 PCI-e cards for storage passed thru to the FreeNAS VM
2x 480GB Sandisk Ultra II Mirror SSD Volume
4x 8TB WD in RAIDZ1 HDD volume
8x 4TB Seagate in RAIDZ2 HDD volume

ESXi 7.0 Update 1c
FreeNAS 12.0U1
VM where I've seen the issue: WHS2011 SP1, also on WSE2016
NFS mounts for all 3 volumes into ESXi to be able to use the storage in my VMs
No issues observed with my media libraries directly stored within ESXi, but not sure I'd have picked that up yet in any manner...

Yes you are definitely with the same issue. Check the Jira ticket for more info.

In your case if you didn't removed the 11.3-U5 boot environment you can just select it and reboot. You'll be safe after this.

On the issue, right now Ryan (from iXsystems) has made a custom openzfs.ko module for 12.0-U1 without Async CoW, and I'm running it on two of three pools that I've. The issue is probably Async CoW, so if you upgraded the pool that would be an option. It's available on the Jira ticket.

iXsystems is doing an awesome job trying to fix this issue. We discussed the issue on a meeting 2hrs ago and we checked my systems, adding the custom modules to them. They are running right now, if they don't crash we may have found the issue.

More testing will be done in the next days.

inman.turbo · Jan 13, 2021

Is sync=always being set on these isci pools? Anyway I recently was experiencing similar issues, not silent corruption but completely inexplicable block failures and i/o errors, bad sectors on brand new disks, etc. Have you tried bringing in a Digital Static Field Meter? My case was at an edge location that has some spare hardware we often use for testing, and it is kind of an odd location, we found some irregular surface voltage and polarity. However I'm not entirely convinced it wasn't just 12 trashing my drives. More testing is needed.

viniciusferrao · Jan 13, 2021

inman.turbo said:
Is sync=always being set on these isci pools? Anyway I recently was experiencing similar issues, not silent corruption but completely inexplicable block failures and i/o errors, bad sectors on brand new disks, etc. Have you tried bringing in a Digital Static Field Meter? My case was at an edge location that has some spare hardware we often use for testing, and it is kind of an odd location, we found some irregular surface voltage and polarity. However I'm not entirely convinced it wasn't just 12 trashing my drives. More testing is needed.

Yes and no. Check the Jira issue for more details. Corruption has been confirmed by iXsystems.

sync=always ease the things, but corruption is still in place.

You should not need sync=always if you have proper UPS and a shutdown scheme in case of power failure. But yes, with sync=standard the corruption is extremely fast in comparison with sync=always.

djjaeger82 · Jan 13, 2021

viniciusferrao said:
Yes you are definitely with the same issue. Check the Jira ticket for more info.

In your case if you didn't removed the 11.3-U5 boot environment you can just select it and reboot. You'll be safe after this.

On the issue, right now Ryan (from iXsystems) has made a custom openzfs.ko module for 12.0-U1 without Async CoW, and I'm running it on two of three pools that I've. The issue is probably Async CoW, so if you upgraded the pool that would be an option. It's available on the Jira ticket.

iXsystems is doing an awesome job trying to fix this issue. We discussed the issue on a meeting 2hrs ago and we checked my systems, adding the custom modules to them. They are running right now, if they don't crash we may have found the issue.

More testing will be done in the next days.

Thank you for confirming that my issue at least "smells" like the same problem. What concerns me most is that I was lucky that WHS2011 was alarming with these checksum errors on the client computer backups to that DB, but how do I know that any other data written in the last month isn't compromised? So far I haven't seen evidence of other failures in my media library which is not thru any VMs or ESXi, but I worry about other operating system files in the VMs...

I did go back and confirm the exact dates of the errors in the WHS2011 VMs event log and it lines up almost exactly with my initial upgrade to 12.0 and continued after my upgrade to 12.0U1. The errors in the client computer backup databases began about 28hrs after the upgrade to 12.0

11/30 evening upgrade of FreeNAS 11.3 to TrueNAS 12.0
12/2 early AM first error in WHS2011 VM
12/5, 12/6, 12/7 continued errors in VM
Took the VM offline for troubleshooting 1 week

12/16 installed TrueNAS12.0U1 upgrade
12/16, 12/17 continued errors in WHS VM

Spent 2 weeks ripping apart everything I could think of in the system including trying multiple ESXi downgrades/upgrades, as well as various pool/share/datastore options (ISCSI vs NFS, NFS3 vs NFS4, record size, sync always/standard/disabled). Tons of hardware testing using memtest86+, pool scrubs, relocating the WHS2011 VMDK to different pools but errors have continued ever since. Even destroying and starting anew for the client backup database and can't get a single machine to backup more than 1x without the same checksum errors.

Also, it looks like I must have purged my 11.3x boot install when I did the 12.0U1 update (probably was out of space). So looks like I have to do a fresh install and resetup my config or load one of my old config backups I have laying around.

viniciusferrao · Jan 13, 2021

djjaeger82 said:
Thank you for confirming that my issue at least "smells" like the same problem. What concerns me most is that I was lucky that WHS2011 was alarming with these checksum errors on the client computer backups to that DB, but how do I know that any other data written in the last month isn't compromised? So far I haven't seen evidence of other failures in my media library which is not thru any VMs or ESXi, but I worry about other operating system files in the VMs...

I did go back and confirm the exact dates of the errors in the WHS2011 VMs event log and it lines up almost exactly with my initial upgrade to 12.0 and continued after my upgrade to 12.0U1. The errors in the client computer backup databases began about 28hrs after the upgrade to 12.0

11/30 evening upgrade of FreeNAS 11.3 to TrueNAS 12.0
12/2 early AM first error in WHS2011 VM
12/5, 12/6, 12/7 continued errors in VM
Took the VM offline for troubleshooting 1 week

12/16 installed TrueNAS12.0U1 upgrade
12/16, 12/17 continued errors in WHS VM

Spent 2 weeks ripping apart everything I could think of in the system including trying multiple ESXi downgrades/upgrades, as well as various pool/share/datastore options (ISCSI vs NFS, NFS3 vs NFS4, record size, sync always/standard/disabled). Tons of hardware testing using memtest86+, pool scrubs, relocating the WHS2011 VMDK to different pools but errors have continued ever since. Even destroying and starting anew for the client backup database and can't get a single machine to backup more than 1x without the same checksum errors.

Also, it looks like I must have purged my 11.3x boot install when I did the 12.0U1 update (probably was out of space). So looks like I have to do a fresh install and resetup my config or load one of my old config backups I have laying around.

Or you can just use the patched openzfs.ko module that Ryan Moeller gently compiled and posted in the Jira ticket. You can check for corruption with the Windows checksums that you've mentioned. It will be a great favor if you'll be willing to do this. The fix is really quick.

Here's the link direct to the patch: https://jira.ixsystems.com/browse/NAS-108627?focusedCommentId=125026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-125026

Again, more info on the Jira issue.

djjaeger82 · Jan 13, 2021

viniciusferrao said:
Or you can just use the patched openzfs.ko module that Ryan Moeller gently compiled and posted in the Jira ticket. You can check for corruption with the Windows checksums that you've mentioned. It will be a great favor if you'll be willing to do this. The fix is really quick.

Here's the link direct to the patch: https://jira.ixsystems.com/browse/NAS-108627?focusedCommentId=125026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-125026

Again, more info on the Jira issue.

Thanks, just reviewed JIRA. I've downloaded the non-debug version of the patch posted 4 hrs ago and installed it. I'll report back. I'm going to nuke my client backup database again and start from scratch and see if I can successfully rebuild it the first time thru with fresh client backups.

Mannekino · Jan 14, 2021

djjaeger82 said:
I'll report back if my issues disappear immediately once I get things setup again. I'm not sure theres much else I can contribute to help solve this issue but am happy to answer any follow-ups about my setup / configuration specifics.

Will be interesting to read if the downgrade to FreeNAS 11.3 will fix your problems. I've read the post of Kris and his comment that this limited to very few and specific use cases. I use my TrueNAS system for 8 Jails or so and storage of my personal media. I don't really use SMB that much nor do I use TrueNAS for storing any raw VM disks. Hopefully there's nothing the matter with my setup.

djjaeger82 · Jan 14, 2021

djjaeger82 said:
Thanks, just reviewed JIRA. I've downloaded the non-debug version of the patch posted 4 hrs ago and installed it. I'll report back. I'm going to nuke my client backup database again and start from scratch and see if I can successfully rebuild it the first time thru with fresh client backups.

Reporting back using the patched openzfs.ko module on 12.0U1. Its been about 7 hours and so far so good, Client computer backups are still in progress and so far over 700GB has been backed up without any errors or checksum errors. In the last 2 weeks I've never been able to get above 400 or 500GB and about 4-5hrs before triggering such an error. When all the clients finish backing up I'll report back again, but this fix is looking promising to me so far.

viniciusferrao · Jan 14, 2021

Mannekino said:
Will be interesting to read if the downgrade to FreeNAS 11.3 will fix your problems. I've read the post of Kris and his comment that this limited to very few and specific use cases. I use my TrueNAS system for 8 Jails or so and storage of my personal media. I don't really use SMB that much nor do I use TrueNAS for storing any raw VM disks. Hopefully there's nothing the matter with my setup.

It will. It's already confirmed on the Jira ticket.

viniciusferrao · Jan 14, 2021

djjaeger82 said:
Reporting back using the patched openzfs.ko module on 12.0U1. Its been about 7 hours and so far so good, Client computer backups are still in progress and so far over 700GB has been backed up without any errors or checksum errors. In the last 2 weeks I've never been able to get above 400 or 500GB and about 4-5hrs before triggering such an error. When all the clients finish backing up I'll report back again, but this fix is looking promising to me so far.

Yes, the patch is good. But this came with a cost. Ryan have completely disabled the ACoW/Async DMU feature. So it can be narrowed down to this feature specifically.

At this moment there's no way to control it over a sysctl, so the only way to disable is removing and recompiling the module. This is what Ryan just done.

So the problem now will be finding what's wrong with the feature. Even it only appears to be most damaging in VM Workload I will not risk using this feature at all until it's not properly fixed. It just not worth it.

@djjaeger82 if you could add your history to the Jira issue, it would be great.

djjaeger82 · Jan 14, 2021

viniciusferrao said:
Yes, the patch is good. But this came with a cost. Ryan have completely disabled the ACoW/Async DMU feature. So it can be narrowed down to this feature specifically.

I'm not as familiar with the ACoW/Async DMU feature, can you explain it briefly? Was it only enabled if you upgraded pools? Just trying to understand what I'm giving up by implementing this patch. I've not seen any severe performance hit so far, but its still early in my testing. And this feature was not part of 11.3 either correct?

EDIT:

Nevermind, found the feature list here:

ZFS async DMU and CoW: Within the original ZFS is a Data Management Unit (DMU) and an algorithm for Copy-on-Write (CoW). These algorithms were implemented in a synchronous manner, which required a transaction to wait until another transaction was completed. iXsystems contributed to the conversion of these algorithms to an asynchronous approach, which reduces the amount of wait time and increases parallelism in OpenZFS 2.0. An added benefit is that fewer disk I/Os are needed for sequential writes. This increases drive efficiency and reduces latency in heavy workloads.
ZFS Record Size Increases: One benefit of async CoW is that larger ZFS record sizes will perform better with fewer Read-Modify-Write activities. Instead of operating with 128KB record size, a 256KB or 512KB record size may be beneficial for some workloads. This will increase the bandwidth of many RAIDZ1/2/3 VDEVs.

viniciusferrao · Jan 14, 2021

So yes, @djjaeger82 even if you don't upgrade your pool version, just running 12.0-RELEASE / 12.0-U1 will enable Async CoW. 11.3-U5 didn't have this feature.

djjaeger82 · Jan 14, 2021

viniciusferrao said:
So yes, @djjaeger82 even if you don't upgrade your pool version, just running 12.0-RELEASE / 12.0-U1 will enable Async CoW. 11.3-U5 didn't have this feature.

Thanks @viniciusferrao, I didn't realize that feature was enabled without the pool upgrade. Now it all makes sense.

FYI just hit the 1TB of data backed up mark, and no corruption. Crazy how night and day the difference is with this one simple fix. It looks like TrueNAS team will need to urgently patch / recall / notify the entire install base of TRUENAS 12.x for this problem. I'm just a little Joe Schmo running client backups for PCs in my house, I cannot even imagine folks using this in business production systems with critical data on VMs...Yikes!!!

viniciusferrao · Jan 16, 2021

So far so good, the latest patch in the Jira ticket has been pushed to TrueNAS 12.0-U1.1 and we're stable once again.

I can confirm that corruption is no longer happening in the observed in 3 out of my 8 TrueNAS systems.

Thank you for everyone that helped figuring out this issue, and for the iX folks that nailed it down and disabled the offending feature. I'm marking this thread as solved.

indy · Jan 22, 2021

@viniciusferrao
I have seen your efforts to tackle and educate about this issue in several places, thank you for taking the time!

milo1 · Feb 7, 2021

Good day all,

i have a similar setup as the topic starter (ovirt - centos vm's - nfs - truenas12). Never had any problems with this setup for 5 years. After the upgrade to truenas 12 after some period of time one third of the vm's had crashed. Some of them wile running, some of them after reboot. Few of them had to restore the snapshots (xfs_repair didn't help).

Now i upgraded to the latest version 12U1.1, i hope it will resolve the issue.

Best regard,

Mikhail

viniciusferrao · Feb 7, 2021

milo1 said:
Good day all,

i have a similar setup as the topic starter (ovirt - centos vm's - nfs - truenas12). Never had any problems with this setup for 5 years. After the upgrade to truenas 12 after some period of time one third of the vm's had crashed. Some of them wile running, some of them after reboot. Few of them had to restore the snapshots (xfs_repair didn't help).

Now i upgraded to the latest version 12U1.1, i hope it will resolve the issue.

Best regard,

Mikhail

Just keep in mind that what as already been corrupted is gone. Mainly in VM's things went really bad when there are errors and metadata corruptions, so the VMs go down.

As today I still stumble with corrupted data. The VM's are good, they don't crash, but the data is destroyed.

If you have backups keep them for a long time. I'm keeping the last backup before the 12.0 upgrade because I still need to get data sometimes.

milo1 · Feb 7, 2021

viniciusferrao said:
Just keep in mind that what as already been corrupted is gone. Mainly in VM's things went really bad when there are errors and metadata corruptions, so the VMs go down.

As today I still stumble with corrupted data. The VM's are good, they don't crash, but the data is destroyed.

If you have backups keep them for a long time. I'm keeping the last backup before the 12.0 upgrade because I still need to get data sometimes.

Thank you for the information.
I am fortunate that i don't have any production DB's or a file storage in affected VM's, so data corruption is not an issue.

Best regards,

Mikhail

Anaerin · Feb 21, 2021

It seems this patch has introduced some serious issues. My system, which was/is stable under 12.0-U1 is now kernel panic-ing under U2. Ticket is here, with debug log attached: https://jira.ixsystems.com/browse/NAS-109500

Ryan Hunter · Feb 22, 2021

I have been having the exact same problems as mentioned in the above. I'm just running a simple home server here. I gave up on VM's because the images kept getting corrupted. Things seemed to get better (at least I thought) with the upgrade to U1.1, now I'm getting the same disk I/O errors as listed in the ticket posted above. The only solution is a hard reboot, as a graceful shutdown wasn't even possible with a keyboard and monitor connected to the machine itself. I've used freenas since 2008 and this is the least stable I have every seen it. My pool has been built and upgraded upon for 10+ years and this is first time I've had corruption issues. This just seems like serious bad show for home user, but I can't imagine how professional datacenters could deal with this without pulling their hair out.

Important Announcement for the TrueNAS Community.

FreeNAS (now TrueNAS) is no longer stable

Contributor

Contributor

Contributor

Dabbler

Contributor

Dabbler

Patron

Dabbler

Contributor

Contributor

Dabbler

Contributor

Dabbler

Contributor

Patron

Cadet

Contributor

Cadet

Cadet

Cadet

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "FreeNAS (now TrueNAS) is no longer stable"

Similar threads