How I solved System Dataset Pool: ONLINE (Unhealthy)

Bibi40k · Sep 20, 2022

Maybe this helps someone, maybe there is a more efficient solution.

TrueNAS-SCALE-22.02.3
Pool Status: Unhealthy
Disks with Errors: 5
Total Disks: 6 (data)

Suddenly I got the pool error summarized above.

All disks are healthy according to both SHORT and LONG S.M.A.R.T. tests.
I have scrubbed the pool twice with the same result:

Code:

root@nas1-truenas[~]# zpool status -v Vol1-Z2
  pool: Vol1-Z2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Sep 20 06:31:16 2022
    7.60T scanned at 4.27G/s, 6.86T issued at 0B/s, 8.17T total
    0B repaired, 83.94% done, no estimated completion time
config:

    NAME                                      STATE     READ WRITE CKSUM
    Vol1-Z2                                   ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        b68e0c13-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        b97fc375-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bb5f70ba-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bc51eff8-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bd3fb2a4-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        c00de708-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Vol1-Z2/Family:<0x0>

The only way I could find more details about the corrupted files was this:

Code:

root@nas1-truenas[~]# du -sh /mnt/Vol1-Z2/Family/
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4616.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4618.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4623.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4624.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4625.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4622.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4619.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4617.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4621.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4615.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4620.JPG': Invalid exchange
504G    /mnt/Vol1-Z2/Family/
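
Note that `du` only looks at metadata and never reads file contents. A minimal sketch that reads every file in full, so that files with intact metadata but unreadable data would show up as well, could look like the following. The function name `list_unreadable_files` is made up for illustration; it is not a TrueNAS or ZFS tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a TrueNAS or ZFS tool): print every regular file
# under a directory that cannot be opened and read to the end.
list_unreadable_files() {
    local dir="$1"
    # -print0 with read -d '' keeps file names containing spaces intact.
    find "$dir" -type f -print0 |
        while IFS= read -r -d '' f; do
            # Reading the whole file makes the filesystem fetch every block;
            # a block that cannot be returned surfaces as a read error.
            if ! cat -- "$f" > /dev/null 2>&1; then
                printf '%s\n' "$f"
            fi
        done
}
```

Usage would be `list_unreadable_files /mnt/Vol1-Z2/Family`. Expect it to take a while on a dataset of this size, since every byte is read.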

After all these steps, I created a new Family dataset and moved everything over from the original Family dataset except the corrupted files. Those I copied from an old snapshot, which I had restored to a clone dataset.
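
The recovery described here can be sketched as a dry run that only prints commands and executes nothing. The helper name `print_restore_plan` is hypothetical, and the snapshot, clone, and new dataset names are arguments because the thread does not state the real ones.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the commands, runs nothing.
# Arguments: source dataset, snapshot name, clone dataset, new dataset,
# and one file path relative to the dataset root.
print_restore_plan() {
    local src="$1" snap="$2" clone="$3" new="$4" file="$5"
    # 1. Find a snapshot taken before the corruption appeared.
    printf 'zfs list -t snapshot -r %s\n' "$src"
    # 2. Expose that snapshot as a clone dataset.
    printf 'zfs clone %s@%s %s\n' "$src" "$snap" "$clone"
    # 3. Create the replacement dataset.
    printf 'zfs create %s\n' "$new"
    # 4. Copy one damaged file from the clone. Assumes datasets are mounted
    #    under /mnt and that the target directory already exists.
    printf "cp -a '/mnt/%s/%s' '/mnt/%s/%s'\n" "$clone" "$file" "$new" "$file"
}
```

For example, `print_restore_plan Vol1-Z2/Family SNAPSHOT Vol1-Z2/Family-restore Vol1-Z2/Family-new Bruxelles/IMG_4616.JPG`, where `SNAPSHOT`, `Family-restore`, and `Family-new` are placeholders to replace with real names. Reviewing the printed commands before running them by hand keeps the original dataset untouched until the copy is verified.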

I also see some hardware errors from time to time:

Code:

[63959.864290] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[63959.864296] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[63959.864307] {13}[Hardware Error]: event severity: corrected
[63959.864309] {13}[Hardware Error]:  Error 0, type: corrected
[63959.864311] {13}[Hardware Error]:  fru_text: CorrectedErr
[63959.864312] {13}[Hardware Error]:   section_type: memory error
[63959.864314] {13}[Hardware Error]:   node: 60840 device: 12343
[63959.864325] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[64021.303808] {14}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[64021.303814] {14}[Hardware Error]: It has been corrected by h/w and requires no further action
[64021.303817] {14}[Hardware Error]: event severity: corrected
[64021.303819] {14}[Hardware Error]:  Error 0, type: corrected
[64021.303820] {14}[Hardware Error]:  fru_text: CorrectedErr
[64021.303822] {14}[Hardware Error]:   section_type: memory error
[64021.303824] {14}[Hardware Error]:   node: 60840 device: 12343
[64021.303827] {14}[Hardware Error]:   error_type: 2, single-bit ECC
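
To see how often this happens and whether it is always the same location, the kernel log can be tallied per node/device pair. This is a small sketch: `tally_memory_errors` is an illustrative name, and it assumes the log lines look like the ones in the block above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and count how many
# error records mention each "node: N device: M" pair.
tally_memory_errors() {
    grep -oE 'node: [0-9]+ device: [0-9]+' |
        awk '{ count[$0]++ } END { for (k in count) print count[k], k }' |
        sort -rn   # most frequent location first
}
```

Usage would be `dmesg | tally_memory_errors`. Fed the two records quoted above, it prints `2 node: 60840 device: 12343`, which shows both corrections came from the same location.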

sretalla · Sep 20, 2022

Bibi40k said:
Some hardware errors time to time:

It's great you were able to rescue your data.

I don't think those errors should be ignored, though (despite the message saying no further action is required). They may be what is behind the corruption of your filesystem, which really shouldn't happen in normal ZFS operation.

Correcting in-memory errors is sort of the reason you have ECC memory, but seeing errors corrected "from time to time" suggests to me that you should replace the memory, CPU, or motherboard at some point soon.

If you want to keep your data and not have more corruption (maybe at some point not recoverable), you should handle this now.

Bibi40k · Sep 20, 2022

Thank you.
Do you know how I can identify which hardware "[Hardware Error]: node: 60840 device: 12343" refers to?

sretalla · Sep 20, 2022

I'm not sure if it will have the needed information, but you can try with:

dmidecode --type 16

and

dmidecode --type memory

And see if there is some information there that might help you get to the right module.
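
The `dmidecode` output is long, so a filter that keeps only the slot label and size of each module can make it easier to compare against vendor diagnostics. This is a sketch under assumptions: `list_dimm_slots` is a made-up name, it relies on the usual `Memory Device` / `Size:` / `Locator:` layout of type 17 records, and it does not decode the node/device numbers from the kernel log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dmidecode --type 17` text on stdin and print
# one "slot<TAB>size" line per memory device record.
list_dimm_slots() {
    awk -F': ' '
        /^Handle/                        { in_dev = 0 }            # new record starts
        /^Memory Device$/                { in_dev = 1; size = ""; next }
        in_dev && /^[[:space:]]+Size:/    { size = $2 }
        in_dev && /^[[:space:]]+Locator:/ { print $2 "\t" size }   # skips "Bank Locator:"
    '
}
```

Usage would be `dmidecode --type 17 | list_dimm_slots`, run as root. Empty slots typically show up with a size such as "No Module Installed".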

Bibi40k · Sep 20, 2022

The Dell BIOS has all kinds of diagnostics, including memory, and ... bingo (sad bingo): I found errors on DIMM1.

I already bought a new one and all memory tests passed.

Right now I have cleared the errors and am running a new scrub, hoping the pool will become healthy.
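
For reference, clearing and re-scrubbing usually comes down to three commands, shown here as comments because they need a real pool. The small check below them is a hypothetical helper (`pool_reports_no_errors` is not a ZFS command) that tests `zpool status` output for the line ZFS prints when it knows of no damaged files.

```shell
#!/usr/bin/env bash
# Typical sequence (pool name taken from this thread):
#   zpool clear Vol1-Z2        # reset the device error counters
#   zpool scrub Vol1-Z2        # re-read and verify every block
#   zpool status -v Vol1-Z2    # check again once the scrub has finished
#
# Hypothetical helper: succeed only if the `zpool status` text on stdin
# contains the "no known data errors" line.
pool_reports_no_errors() {
    grep -q '^errors: No known data errors'
}
```

Once the scrub has finished, `zpool status -v Vol1-Z2 | pool_reports_no_errors && echo clean` gives a quick yes/no answer.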

Thank you for staying around and guiding me when necessary.

Bibi40k · Sep 20, 2022

All ok :)
