Pool data errors

Brad303

Hey all, I need some help; I'm flummoxed.

Code:
  pool: HugeZ
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Dec 29 10:36:05 2022
        11.3T scanned at 808M/s, 9.04T issued at 645M/s, 13.8T total
        0B repaired, 65.57% done, 02:08:37 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        HugeZ                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/ae8e8008-628e-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/dbaec192-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/dfb293cf-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/e3e9a0ef-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/e827e60e-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/ec0ffe69-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/f021300f-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/f45bbb07-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/f8f0a162-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/00519ca2-5c98-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors

errors: 1491 data errors, use '-v' for a list


My TrueNAS-12.0-U8.1 raidz2 pool is degraded with the same number of CKSUM errors on each drive. I typically get a few more data errors each time I run a scrub. I've tried zpool clear -F, which takes it out of the DEGRADED state but doesn't clear the errors. Subsequent scrubs toss it back to DEGRADED.

smartctl -a shows all of these 3TB drives OK with zero grown defects, although two drives have 3-4 uncorrected errors in read+verify and verify, respectively.

Based on that, and the fact that all of the drives show the same number of CKSUM errors, I'm assuming it's not the drives themselves.
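
The "same CKSUM count on every drive" observation can be checked mechanically. Below is a rough sketch, assuming the `zpool status` text has been saved to a string. The helper names `cksum_counts` and `all_equal` are made up for illustration and are not part of any ZFS tooling. Rows whose counters are abbreviated (for example `1.2K`) are skipped rather than guessed at.

```python
import re

# Matches leaf-device rows such as:
#   gptid/ae8e...  DEGRADED     0     0   128  too many errors
# Only plain integer counters are accepted; abbreviated values are skipped.
ROW = re.compile(r"^\s+(gptid/\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s|$)")

def cksum_counts(status_text):
    """Return {device: CKSUM count} for every gptid row in the status text."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts

def all_equal(counts):
    """True when every device reports the same nonzero CKSUM count."""
    values = set(counts.values())
    return len(values) == 1 and values != {0}

# Two rows copied from the output above, as sample input.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        HugeZ                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/ae8e8008-628e-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
            gptid/dbaec192-5c97-11ec-b0fe-842b2b781408  DEGRADED     0     0   128  too many errors
"""

counts = cksum_counts(SAMPLE)
print(counts)
print("identical CKSUM counts:", all_equal(counts))
```

Identical counts across every member of a vdev are only a hint that the fault may sit somewhere shared (RAM, HBA, expander, cabling, power) rather than in the individual disks. The script does not prove that.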

My system has been spontaneously rebooting as of late, which I'm currently attributing to bad RAM. My terminal server is on the fritz, so I can't capture the console when it decides to bounce itself. I suppose it's possible that corrupted data was written during a reboot, but I didn't think ZFS would do that, and I would expect the data to be recovered by a scrub.

It also looks like I may have had a bad power supply in the original tray, since it would shut down before all the drives were fully spun up. New PS seems to have fixed that problem, but I also moved the drives to a different tray to isolate any hardware problems.

Hardware in question:

Host: Dell R710
1 CPU, 36 GB RAM
IT-mode H200 (LSI 9211) for internal drives
IT-mode SAS 9200-16e (HP-badged) for external trays
Trays (2x):
16-drive 4U Chenbro chassis
LSI-based Chenbro SAS expanders
all-SAS backplanes
 