TrueNAS 12.0 - Something is clearing errors from zpool status (possibly the nag to upgrade to OpenZFS 2.0)

andersenep

Cadet
Joined
Dec 7, 2020
Messages
4
I have a drive that encountered 15 read errors during a scheduled scrub. Upon sourcing a replacement drive (several weeks later), I checked zpool status to identify which disk it was, and the errors had somehow been cleared.
Code:
  pool: storage
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 896K in 06:16:00 with 0 errors on Sun Nov 22 06:16:01 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
    storage                                         ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        da6                                         ONLINE       0     0     0
        da3                                         ONLINE       0     0     0
        da9                                         ONLINE       0     0     0
      raidz1-1                                      ONLINE       0     0     0
        da5                                         ONLINE       0     0     0
        gptid/cbe9e8e5-0d8f-11eb-a1ae-00e081e51614  ONLINE       0     0     0 /* the first & last time i let webui replace a drive */
        da8                                         ONLINE       0     0     0
      raidz1-2                                      ONLINE       0     0     0
        da4                                         ONLINE       0     0     0
        da1                                         ONLINE       0     0     0
        da11                                        ONLINE       0     0     0
      raidz1-3                                      ONLINE       0     0     0
        da2                                         ONLINE       0     0     0
        da0                                         ONLINE       0     0     0
        da10                                        ONLINE       0     0     0
    cache
      da12                                          ONLINE       0     0     0

errors: No known data errors
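
Side note for anyone hunting the same thing: even after the zpool counters have been wiped, the drive's own SMART log usually still records read errors of this kind. Assuming smartmontools is available (the device name below is just an example), something like this can confirm which disk it was:

Code:
# the drive's internal error log is independent of zpool's counters
smartctl -a /dev/da6
# ATA drives also keep a dedicated error log
smartctl -l error /dev/da6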

Not sure what happened, so I ran another scrub and got 1 CKSUM error. It was da11.

Code:
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 64K in 06:20:56 with 0 errors on Mon Dec  7 02:41:21 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
------
        da11                                        ONLINE       0     0     1
------

Reboot and check zpool status:

Code:
  pool: storage
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 64K in 06:20:56 with 0 errors on Mon Dec  7 02:41:21 2020
config:

    NAME                                            STATE     READ WRITE CKSUM
------
        da11                                        ONLINE       0     0     0
------


It seems like the check/nag to update to OpenZFS 2.0 is essentially running zpool clear on my pools. As a user, I am far more concerned with my scrub results being properly handled than I am with OpenZFS 2.0 awareness.
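
For anyone reading along, "running zpool clear" is normally an explicit, operator-initiated reset of exactly those counters, either pool-wide or per device:

Code:
# reset error counters for the whole pool
zpool clear storage
# or only for the suspect device
zpool clear storage da11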
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
A reboot has always cleared the zpool stats as far as I remember.

If the stats disappear without a reboot, we should continue the discussion.
 

andersenep

Cadet
Joined
Dec 7, 2020
Messages
4
I cannot say for certain that a reboot is supposed to clear zpool status, but if so, that is rather absurd behavior. This information is retained in zpool status across reboots:

Code:
scan: scrub repaired 896K in 06:16:00 with 0 errors on Sun Nov 22 06:16:01 2020

Code:
scan: scrub repaired 64K in 06:20:56 with 0 errors on Mon Dec  7 02:41:21 2020


But ZFS (or TrueNAS) is going to clear which drive (or drives) the errors occurred on? And not only that, it is overwriting the status and action sections of zpool status with information about an upgrade that most people would consider trivial compared to a scrub finding errors on disks.

The only thing that should clear those errors is zpool clear or replacing the device. Otherwise zpool status should not simply state:

Code:
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.


and should at least mention that the errors will also be cleared by a simple reboot, since someone somewhere decided that was a good idea.
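
If it helps anyone, one place the scrub operations themselves do seem to persist on disk is the pool history, whereas the per-device counters appear to live only in memory. A quick way to look (assuming TrueNAS 12's OpenZFS behaves like upstream here):

Code:
# administrative commands survive reboots; -i adds internally logged
# events, which should include scrub/scan records
zpool history -i storage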
 

andersenep

Cadet
Joined
Dec 7, 2020
Messages
4
Everything I have attempted in order to replicate CKSUM errors on a newly created pool indicates that you are correct: per-disk zpool status errors do not survive reboots or export/import (under TrueNAS 12.0 at least). This is simply amazing to me. What is the point of zpool scrub/status reporting errors across boots if it doesn't tell you which device(s) exhibited the errors?
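
For anyone who wants to repeat the test, here is a minimal sketch of the kind of throwaway pool I mean (all names, sizes, and offsets are just examples; the corruption offset may need adjusting if the scrub comes back clean):

Code:
# file-backed test mirror
truncate -s 1g /tmp/v0 /tmp/v1
zpool create testpool mirror /tmp/v0 /tmp/v1
dd if=/dev/random of=/testpool/junk bs=1m count=300
# corrupt one side of the mirror, well past the front vdev labels
dd if=/dev/random of=/tmp/v1 bs=1m count=10 seek=100 conv=notrunc
zpool scrub testpool
zpool status testpool        # once the scrub finishes, CKSUM on /tmp/v1 should be non-zero
# export/import (or reboot) and the per-device counters are gone
zpool export testpool
zpool import -d /tmp testpool
zpool status testpool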
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
I think you're possibly placing too much stock in the permanence of those counters.

If the problem is persistent, a scrub will always re-increment the counters after a clear or reboot.
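
In other words, something along these lines (substituting your own pool and device) will put the counters straight back if the device is genuinely faulty:

Code:
zpool clear storage da11     # or just reboot
zpool scrub storage
zpool status -v storage      # counters re-increment if the fault is still there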
 

andersenep

Cadet
Joined
Dec 7, 2020
Messages
4
Perhaps, but I think you're placing a bit too little stock in the counters being silently cleared for no good reason. One of the primary purposes of creating ZFS was to detect/prevent bit rot. What good is checksumming and detecting errors in a pool if you then don't care which device (or devices!) the errors occurred on?

Sure, I can just run another scrub and see if ZFS detects more errors. That's exactly what I did. It shouldn't have been necessary. There is no good reason why
Code:
scan: scrub repaired 0B in 00:10:36 with 0 errors on Tue Dec  8 22:25:38 2020
should persist across reboots (and exports) in zpool status while the actual error counters are cleared.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,702
Well, if you're not happy about it, you can raise an issue with the OpenZFS team and see how it goes from there. I don't think it's a TrueNAS peculiarity, but in any case you won't see a fix for it in FreeNAS; anything that happens is only likely to land in TrueNAS with OpenZFS 2.0.

You could perhaps start by opening an issue here:
 