How I solved System Dataset Pool: ONLINE (Unhealthy)

Bibi40k

Contributor
Joined
Jan 26, 2018
Messages
136
Maybe this helps someone, maybe there is a more efficient solution.

TrueNAS-SCALE-22.02.3
Pool Status: Unhealthy
Disks with Errors: 5
Total Disks: 6 (data)


Suddenly I got this pool error:
Screenshot 2022-09-20 at 10.46.47.png


All disks are healthy according to both SHORT and LONG S.M.A.R.T. tests.
I have scrubbed the pool twice with the same result:
Code:
root@nas1-truenas[~]# zpool status -v Vol1-Z2
  pool: Vol1-Z2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Sep 20 06:31:16 2022
    7.60T scanned at 4.27G/s, 6.86T issued at 0B/s, 8.17T total
    0B repaired, 83.94% done, no estimated completion time
config:

    NAME                                      STATE     READ WRITE CKSUM
    Vol1-Z2                                   ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        b68e0c13-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        b97fc375-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bb5f70ba-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bc51eff8-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        bd3fb2a4-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0
        c00de708-0f2a-11e8-812e-b8ca3abd3f7a  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Vol1-Z2/Family:<0x0>


The only way I could find more details about the corrupted files was this:
Code:
root@nas1-truenas[~]# du -sh /mnt/Vol1-Z2/Family/
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4616.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4618.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4623.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4624.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4625.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4622.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4619.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4617.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4621.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4615.JPG': Invalid exchange
du: cannot access '/mnt/Vol1-Z2/Family/Bruxelles/IMG_4620.JPG': Invalid exchange
504G    /mnt/Vol1-Z2/Family/
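
To turn that error output into a reusable list of affected files, something like the following could work. This is only a minimal sketch: the helper name `list_unreadable` is made up for illustration, and it assumes `du` prints its errors in exactly the `du: cannot access '...': ...` form shown above.

```shell
#!/bin/sh
# list_unreadable (hypothetical helper): reads du's error output on stdin
# and prints only the quoted paths of the files du could not access.
list_unreadable() {
    sed -n "s/^du: cannot access '\(.*\)': .*/\1/p"
}

# Possible usage on the NAS (stderr goes into the pipe, the size summary is dropped):
#   du -sh /mnt/Vol1-Z2/Family/ 2>&1 >/dev/null | list_unreadable | sort
```

Lines that are not error messages, such as the final size summary, are ignored.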


After all these steps, I created a new Family dataset and moved everything over from the original Family dataset except the corrupted files. Those I copied from an old snapshot, which I had restored to a clone dataset.
Screenshot 2022-09-20 at 11.00.54.png
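
For anyone who prefers the shell, the recovery described above could look roughly like this. It is a sketch under stated assumptions, not the exact steps taken in this thread: the snapshot name, the `Family-new` and `Family-clone` dataset names and the list file are all hypothetical, and it copies with `rsync` instead of moving. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Prints (does not run) one possible command sequence for the recovery.
POOL="Vol1-Z2"
SRC="Family"
SNAP="auto-2022-09-01"          # hypothetical snapshot name
NEW="Family-new"                # hypothetical name for the new dataset
CLONE="Family-clone"            # hypothetical name for the clone
BAD_LIST="/tmp/corrupted.txt"   # hypothetical file: one path per line,
                                # relative to the dataset root

recovery_plan() {
    # 1. Find a snapshot taken before the corruption appeared
    echo "zfs list -t snapshot -r $POOL/$SRC"
    # 2. Expose that snapshot as a clone dataset
    echo "zfs clone $POOL/$SRC@$SNAP $POOL/$CLONE"
    # 3. Create the replacement dataset
    echo "zfs create $POOL/$NEW"
    # 4. Copy everything except the corrupted files
    echo "rsync -a --exclude-from=$BAD_LIST /mnt/$POOL/$SRC/ /mnt/$POOL/$NEW/"
    # 5. Copy the corrupted files' older versions from the clone
    echo "rsync -a --files-from=$BAD_LIST /mnt/$POOL/$CLONE/ /mnt/$POOL/$NEW/"
}

recovery_plan
```

The mount paths assume the usual TrueNAS layout of /mnt/pool/dataset. The same steps can also be done from the web UI.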


I also see some hardware errors from time to time:
Code:
[63959.864290] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[63959.864296] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[63959.864307] {13}[Hardware Error]: event severity: corrected
[63959.864309] {13}[Hardware Error]:  Error 0, type: corrected
[63959.864311] {13}[Hardware Error]:  fru_text: CorrectedErr
[63959.864312] {13}[Hardware Error]:   section_type: memory error
[63959.864314] {13}[Hardware Error]:   node: 60840 device: 12343
[63959.864325] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[64021.303808] {14}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[64021.303814] {14}[Hardware Error]: It has been corrected by h/w and requires no further action
[64021.303817] {14}[Hardware Error]: event severity: corrected
[64021.303819] {14}[Hardware Error]:  Error 0, type: corrected
[64021.303820] {14}[Hardware Error]:  fru_text: CorrectedErr
[64021.303822] {14}[Hardware Error]:   section_type: memory error
[64021.303824] {14}[Hardware Error]:   node: 60840 device: 12343
[64021.303827] {14}[Hardware Error]:   error_type: 2, single-bit ECC
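
To see how often these events occur, the kernel log can be counted. A minimal sketch, assuming the messages keep the format shown above; the function name `count_corrected_events` is made up for illustration.

```shell
#!/bin/sh
# count_corrected_events (hypothetical helper): reads kernel log text on
# stdin and prints how many events were reported with severity "corrected".
count_corrected_events() {
    grep -c 'event severity: corrected'
}

# Possible usage on the NAS:
#   dmesg | count_corrected_events
```

A count that keeps growing between checks would indicate the errors are recurring and not a one-off.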
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Bibi40k said: "Some hardware errors time to time:"
It's great you were able to rescue your data.

I don't think those errors should be ignored, though (despite the message saying no further action is required). They may be what is behind the corruption of your filesystem, which really shouldn't happen in normal ZFS operation.

ECC memory stepping in to correct in-memory errors is sort of the reason you have ECC, but seeing errors corrected "from time to time" suggests to me that you should replace the memory, CPU or motherboard at some point soon.

If you want to keep your data and not have more corruption (maybe at some point not recoverable), you should handle this now.
 

Bibi40k

Thank you.
Do you know how I can identify which hardware "[Hardware Error]: node: 60840 device: 12343" refers to?
 

sretalla

I'm not sure if it will have the needed information, but you can try with:

dmidecode --type 16

and

dmidecode --type memory

And see if there is some information there that might help you get to the right module.
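
The full dmidecode output is long, so a small filter can help. This is only a sketch: it assumes the usual `Size:`, `Locator:` and `Serial Number:` fields of DMI type 17 (Memory Device) records appear in that order, and the helper name `list_dimms` is made up. It does not translate the node/device numbers from the kernel message; it only lists the slots so they can be matched against what the BIOS or vendor diagnostics report.

```shell
#!/bin/sh
# list_dimms (hypothetical helper): reads `dmidecode --type 17` text on
# stdin and prints one "locator | size | serial" line per memory device.
list_dimms() {
    awk -F': ' '
        /^[[:space:]]*Size:/          { size = $2 }
        /^[[:space:]]*Locator:/       { loc = $2 }   # does not match "Bank Locator:"
        /^[[:space:]]*Serial Number:/ { print loc " | " size " | " $2 }
    '
}

# Possible usage on the NAS (as root):
#   dmidecode --type 17 | list_dimms
```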
 

Bibi40k

In the Dell BIOS I have all kinds of diagnostics, including memory and ... bingo (sad bingo), I found errors on DIMM1.
IMG_8697.jpg


I already bought a new one and all memory tests passed.
IMG_8701.jpg


Right now I have cleared the errors and am running a new scrub, hoping the pool will become healthy.

Thank you for sticking around and guiding me when necessary.
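
For reference, the clear-and-scrub step can be sketched as below. The `zpool` commands are shown as comments because they must be run on the NAS itself. The helper `pool_is_clean` is a made-up name and simply looks for the "errors: No known data errors" line that `zpool status` prints for a pool without recorded data errors.

```shell
#!/bin/sh
# On the NAS (as root), with the pool name from this thread:
#   zpool clear Vol1-Z2       # reset the recorded error counters
#   zpool scrub Vol1-Z2       # start a new scrub
#   zpool status -v Vol1-Z2   # check progress and the final result

# pool_is_clean (hypothetical helper): reads `zpool status` text on stdin
# and succeeds only if it reports no known data errors.
pool_is_clean() {
    grep -q '^errors: No known data errors'
}

# Possible usage once the scrub has finished:
#   zpool status -v Vol1-Z2 | pool_is_clean && echo "pool looks clean"
```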
 