ZFS-8000-8A, ZFS array corruption

darkschmu · Jul 31, 2022

Good morning~

I have built and maintained a FreeNAS box at home for a few years now, dating back to 10.0 I think it was. Never had an issue until I recently upgraded to TrueNAS13.0. Now, this may or may not be the crux of the issue, so let me give a few more details.

My server is built from parts of older PCs we grew out of, currently using 12G ram and a 4-disk (WD Red 3TBs) ZFS2 array. Previously it was a ZFS1 but I had to rebuild after it suddenly reported corruption. The drives had no SMART errors. Upon reboot, the array would no longer mount. I had to rescue-mount is as RO to export the data I was still able to reach and rebuild. Here's some of my notes in case they may help someone in the same situation:


root@underdesk[~]# zpool status -v
  pool: freenas-boot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:07:59 with 0 errors on Tue Jun 28 03:53:00 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 09:40:27 with 0 errors on Sun Jun 12 09:40:27 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0 0
          raidz1-0                                      DEGRADED     0     0 0
            gptid/1cb77a81-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/1e6b6db5-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/1f351d58-da22-11e9-a666-002618a144fb  DEGRADED     0     0 1.18K  too many errors
            gptid/255219dc-d469-11ec-900c-60a44c57a05a  DEGRADED     0     0 1.18K  too many errors

errors: Permanent errors have been detected in the following files:

        storage/iocage/jails/plexmediaserver/root:<0x0>
        /mnt/storage/iocage/jails/plexmediaserver/root/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db

That last file is commonly (?) corrupted

Reddit - The heart of the internet

www.reddit.com

potential fix:


$ "/usr/lib/plexmediaserver/Plex SQLite" "/var/lib/plex/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db" ".recover" | "/usr/lib/plexmediaserver/Plex SQLite" fix.db

$ "/usr/lib/plexmediaserver/Plex SQLite" fix.db "PRAGMA integrity_check"

And while I was looking for that sqlite command (`find /Plex\ Media\ Server -iname "*sqlite*"`), it started an endless and unprompted reboot loop, kernel panic when loading storage pool.

I found a thread that referred to 'many threads' where the outcome was to nuke the pool and restart from scratch, another thread mentioning how ZFS relies heavily on RAM and if that goes bad it may bring down the array with it - I'll run a memcheck and find out.

The article listed above,

Message ID: ZFS-8000-8A — OpenZFS documentation

openzfs.github.io

mentions the only outcome is to restore the pool from backup.

Mounting the storage pool RO for recovery:


rebooted
option 3? boot options or something
> set vfs.zfs.recover=1
> set vfs.zfs.debug=1
> boot -s
(enter for shell)
> zdb -e -bcsvL storage
> zfs set readonly=off freenas-boot
> zfs mount -a
this got me a readonly storage pool
now need to get the data off if possible

tried an ip address
> ifconfig re0 192.168.1.54

Since the drives themselves had no errors, I thought reusing them was fine but the same error came up again:


root@truenas[~]# zpool status -xv
  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0 0
          raidz2-0                                      ONLINE       0     0 0
            gptid/c07fb6d0-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0 0
            gptid/c08a096c-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0 0
            gptid/c091abba-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0 0
            gptid/c0492aeb-fb04-11ec-b551-60a44c57a05a  ONLINE       0     0 0

errors: Permanent errors have been detected in the following files:

        storage/iocage/jails/plex_server/root:<0x27795>
root@truenas[~]#

We're now on track to replace the drives and rebuild yet again, but I wanted to know what exactly is that `root:<0x27795>` file, it was the first file to report corruption in both iterations.

I would be very interested in finding the root cause of this or if someone else encountered this, be it drives, ZFS, lack of resources, bad cables.

Thanks for reading,

Ben

Ericloewe · Jul 31, 2022

darkschmu said:
ZFS2 array. Previously it was a ZFS1

It's RAIDZ1/2/3

darkschmu said:
I wanted to know what exactly is that `root:<0x27795>`

Well, it's ZFS metadata. The details are only visible using zdb, I believe.

darkschmu said:
That last file is commonly (?) corrupted

Reddit - The heart of the internet

www.reddit.com

potential fix:

$ "/usr/lib/plexmediaserver/Plex SQLite" "/var/lib/plex/Plex Media Server/Plug-in Support/Databases/com.plexapp.plugins.library.db" ".recover" | "/usr/lib/plexmediaserver/Plex SQLite" fix.db $ "/usr/lib/plexmediaserver/Plex SQLite" fix.db "PRAGMA integrity_check"

And while I was looking for that sqlite command (`find /Plex\ Media\ Server -iname "*sqlite*"`), it started an endless and unprompted reboot loop, kernel panic when loading storage pool.

I think you're mixing up two different things:

Plex ending up with a corrupted database is plain old bug in Plex, the database, or both. Nothing to do with ZFS.
Corruption identified by ZFS could be caused by a few things, but Plex is not one of them (outside of crazy misconfigured systems).

So, what might be happening? A few questions to narrow this down:

You reused the same disks for a different pool on the same system, correct?
Is the pool with new disks working correctly?
How are these disks attached?
More broadly, what are the detailed system specs?

darkschmu · Jul 31, 2022

Thanks for the quick reply.

I agree there were two issues originally: Plex DB corruption and ZFS array corruption. I wasnt trying to blame one on the other, but was describing the path I took.

Answers:
1- Yes, same disks reused on same system
2- New pool is showing the same corruption a day or two after creation
3- SATA drives attached to onboard controller
3- i7-3770, 12G (non-ECC) RAM, ASUS mobo, basic desktop PC. Boots off a 250g drive. Running Truenas 13.0.

Thanks,

Ben

Davvo · Jul 31, 2022

darkschmu said:
another thread mentioning how ZFS relies heavily on RAM and if that goes bad it may bring down the array with it - I'll run a memcheck and find out.

FYI:

ECC vs non-ECC RAM and ZFS

I've seen many people unfortunately lose their zpools over this topic, so I'm going to try to provide as much detail as possible. If you don't want to read to the end then just go with ECC RAM. For those of you that want to understand just how destructive non-ECC RAM can be, then I'd encourage...

www.truenas.com

Important Announcement for the TrueNAS Community.

ZFS-8000-8A, ZFS array corruption

darkschmu

Cadet

Reddit - The heart of the internet

Message ID: ZFS-8000-8A — OpenZFS documentation

Ericloewe

Server Wrangler

Reddit - The heart of the internet

darkschmu

Cadet

Davvo

MVP

ECC vs non-ECC RAM and ZFS

Similar threads

Important Announcement for the TrueNAS Community.

ZFS-8000-8A, ZFS array corruption

Cadet

Server Wrangler

Cadet

MVP

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "ZFS-8000-8A, ZFS array corruption"

Similar threads