ZFS-8000-8A, ZFS array corruption

darkschmu
Cadet · Joined Oct 10, 2017 · Messages: 4
Good morning~

I have built and maintained a FreeNAS box at home for a few years now, dating back to 10.0 I think it was. Never had an issue until I recently upgraded to TrueNAS 13.0. Now, this may or may not be the crux of the issue, so let me give a few more details.

My server is built from parts of older PCs we grew out of, currently using 12 GB of RAM and a 4-disk (WD Red 3 TB) ZFS2 array. Previously it was a ZFS1 but I had to rebuild after it suddenly reported corruption. The drives had no SMART errors. Upon reboot, the array would no longer mount. I had to rescue-mount it as RO to export the data I was still able to reach, then rebuild. Here are some of my notes in case they help someone in the same situation:

root@underdesk[~]# zpool status -v
  pool: freenas-boot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:07:59 with 0 errors on Tue Jun 28 03:53:00 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 09:40:27 with 0 errors on Sun Jun 12 09:40:27 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/1cb77a81-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/1e6b6db5-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/1f351d58-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/255219dc-d469-11ec-900c-60a44c57a05a  DEGRADED     0     0 1.18K  too many errors

errors: Permanent errors have been detected in the following files:

        storage/iocage/jails/plexmediaserver/root:<0x0>
        /mnt/storage/iocage/jails/plexmediaserver/root/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db
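(Note for anyone reading these notes later: as far as I understand it, the permanent-errors list doesn't clear on its own even after the affected files are deleted or restored from backup - it needs a `zpool clear` and/or a fresh scrub before `zpool status` stops listing them. Something like:

zpool clear storage
zpool scrub storage
zpool status -v storage
)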

That last file is commonly (?) corrupted
potential fix:

$ "/usr/lib/plexmediaserver/Plex SQLite" "/var/lib/plex/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db" ".recover" | "/usr/lib/plexmediaserver/Plex SQLite" fix.db $ "/usr/lib/plexmediaserver/Plex SQLite" fix.db "PRAGMA integrity_check"

And while I was looking for that sqlite command (`find /Plex\ Media\ Server -iname "*sqlite*"`), the box started an endless, unprompted reboot loop, with a kernel panic when loading the storage pool.

I found a thread that referred to 'many threads' where the outcome was to nuke the pool and restart from scratch, and another thread mentioning how ZFS relies heavily on RAM and if that goes bad it may bring down the array with it - I'll run a memcheck and find out.

The ZFS-8000-8A article linked above says the only remedy is to restore the affected file, or the entire pool, from backup.

Mounting the storage pool RO for recovery:
rebooted into the boot menu (option 3? boot options or something)
> set vfs.zfs.recover=1
> set vfs.zfs.debug=1
> boot -s (enter for shell)
> zdb -e -bcsvL storage
> zfs set readonly=off freenas-boot
> zfs mount -a
This got me a read-only storage pool; now I needed to get the data off if possible.
Tried an IP address: > ifconfig re0 192.168.1.54
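For actually copying data off while in single-user mode (sshd isn't running there), one bare-bones option is tar over netcat - the address 192.168.1.10 and the path below are just placeholders for the other machine and whatever dataset you're rescuing, and some netcat builds want `-l -p 9999` instead of `-l 9999`:

on the receiving machine:  nc -l 9999 > storage-rescue.tar
on the TrueNAS box:        tar cf - /mnt/storage/important | nc 192.168.1.10 9999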

Since the drives themselves had no errors, I thought reusing them was fine but the same error came up again:
root@truenas[~]# zpool status -xv
  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/c07fb6d0-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0     0
            gptid/c08a096c-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0     0
            gptid/c091abba-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0     0
            gptid/c0492aeb-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        storage/iocage/jails/plex_server/root:<0x27795>
root@truenas[~]#

We're now on track to replace the drives and rebuild yet again, but I wanted to know: what exactly is that `root:<0x27795>` file? It was the first file to report corruption in both iterations.

I would be very interested in finding the root cause of this, or hearing from anyone else who has encountered it - be it drives, ZFS, lack of resources, or bad cables.

Thanks for reading,

Ben
 

Ericloewe
Server Wrangler · Moderator · Joined Feb 15, 2014 · Messages: 20,194
ZFS2 array. Previously it was a ZFS1
It's RAIDZ1/2/3

I wanted to know what exactly is that `root:<0x27795>`
Well, it's ZFS metadata. The details are only visible using zdb, I believe.
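If you want to poke at it, something along these lines should dump that object - the number is the hex value from zpool status converted to decimal (0x27795 = 161685). Going from memory here, so check the zdb man page before trusting it:

zdb -dddd storage/iocage/jails/plex_server/root 161685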

That last file is commonly (?) corrupted
potential fix:

$ "/usr/lib/plexmediaserver/Plex SQLite" "/var/lib/plex/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db" ".recover" | "/usr/lib/plexmediaserver/Plex SQLite" fix.db $ "/usr/lib/plexmediaserver/Plex SQLite" fix.db "PRAGMA integrity_check"

And while I was looking for that sqlite command (`find /Plex\ Media\ Server -iname "*sqlite*"`), the box started an endless, unprompted reboot loop, with a kernel panic when loading the storage pool.
I think you're mixing up two different things:
  1. Plex ending up with a corrupted database is a plain old bug in Plex, the database, or both. Nothing to do with ZFS.
  2. Corruption identified by ZFS could be caused by a few things, but Plex is not one of them (outside of crazy misconfigured systems).

So, what might be happening? A few questions to narrow this down:
  1. You reused the same disks for a different pool on the same system, correct?
  2. Is the pool with new disks working correctly?
  3. How are these disks attached?
  4. More broadly, what are the detailed system specs?
 

darkschmu
Cadet · Joined Oct 10, 2017 · Messages: 4
Thanks for the quick reply.

I agree there were two issues originally: Plex DB corruption and ZFS array corruption. I wasn't trying to blame one on the other, but was describing the path I took.

Answers:
1- Yes, same disks reused on same system
2- New pool is showing the same corruption a day or two after creation
3- SATA drives attached to onboard controller
4- i7-3770, 12 GB (non-ECC) RAM, ASUS mobo, basic desktop PC. Boots off a 250 GB drive. Running TrueNAS 13.0.
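While I wait on replacement drives I'll also run long SMART self-tests on each disk plus a memtest pass, just to rule out the obvious suspects - the device name below is only an example, `camcontrol devlist` shows the real ones:

smartctl -t long /dev/ada1
(once the test finishes)
smartctl -a /dev/ada1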

Thanks,

Ben
 

Davvo
MVP · Joined Jul 12, 2022 · Messages: 3,222
another thread mentioning how ZFS relies heavily on RAM and if that goes bad it may bring down the array with it - I'll run a memcheck and find out.
FYI:
 