Checksum Errors how to find affected files?

flammen · Jan 19, 2023

Hi all,

I get hundreds of checksum errors on my main pool after a scrub, with nearly the same number on each disk. SMART runs weekly and reports that the disks are fine. I moved a lot of files to the pool recently.
Theory: Files were already corrupted when I moved them to the pool.
My Question: How do I find the affected files?
I access the pool via SMB from a Windows machine.

Thanks for your help!

NugentS · Jan 19, 2023

flammen said:
Hi all,

I get hundreds of checksum errors on my main pool after a scrub, with nearly the same number on each disk. SMART runs weekly and reports that the disks are fine. I moved a lot of files to the pool recently.
Theory: Files were already corrupted when I moved them to the pool.
My Question: How do I find the affected files?
I access the pool via SMB from a Windows machine.

Thanks for your help!

I think you should try attaching them to some hardware as per forum rules

winnielinnie · Jan 19, 2023

A useful mnemonic. If you get a bunch of checksum errors across all drives, yet no issues with SMART tests:

Check cables. Check connections. Check HBA (if applicable).

Patrick M. Hausen · Jan 19, 2023

flammen said:
Theory: Files were already corrupted when I moved them to the pool.

Although the files might be corrupted in some sense (documents, music, video ... with errors in them) there is no such thing as a corrupted file from ZFS' point of view when you write them. If your Word document contains errors it will be written with errors. Files are just one long bag of bytes. ZFS does not care if your Word document is valid or not or if your video skips frames.

The corruption you are seeing is detected by ZFS on the level of block checksums and probably happened after writing.

To find out which files are affected by those checksum errors use zpool status -v <poolname>
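
If the list is long, it can help to pull the affected paths out of that output with a small script. The following is only a minimal sketch: it assumes the "errors: Permanent errors ..." layout that zpool status -v prints, and the function name, the sample text and the example file names are hypothetical.

```python
from typing import List


def affected_files(status_text: str) -> List[str]:
    """Return the entries listed under 'errors: Permanent errors ...'."""
    paths = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            # Paths only follow the "Permanent errors" variant of this line;
            # "errors: No known data errors" is followed by nothing.
            in_errors = "Permanent errors" in stripped
            continue
        if in_errors and stripped:
            paths.append(stripped)
    return paths


# Hypothetical sample in the layout printed by `zpool status -v`.
SAMPLE = """\
  pool: vault
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        /mnt/vault/example.avi
        vault/vault@20230102_0300_1day_1year:/example.CR2
"""

if __name__ == "__main__":
    for path in affected_files(SAMPLE):
        print(path)
```

On a real system the text could come from subprocess.run(["zpool", "status", "-v", "<poolname>"], capture_output=True, text=True).stdout instead of the sample string.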

Etorix · Jan 19, 2023

If there were uncorrectable errors in the files, ZFS would report the affected files. For now it seems that all errors are being corrected, and the issue is probably with the cable(s) or HBA.
The output of zpool status (within CODE tags) could help narrow the issue.

flammen · Jan 19, 2023

Thank you for the explanation, it makes sense. This is the output:

Code:

admin@truenas[~]# sudo zpool status vault -v
  pool: vault
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.49M in 05:20:21 with 12 errors on Wed Jan 18 08:20:26 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        vault                                     ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            15ef0a62-3b6a-4d4e-ad5a-71bf5e78a9a9  ONLINE       0     0   671
            12fafd0d-5788-49a9-8ce8-5558cdefd773  ONLINE       0     0   666
            de84f345-53e6-482c-9240-277729051b4a  ONLINE       0     0   666
            8ae62f89-6ecb-4514-9c57-a16f55cbfee4  ONLINE       0     0   669
            62fdecf2-44fb-4260-a41f-58ede8d67de3  ONLINE       0     0   667
            d6f86e95-41a7-4f8a-9240-18b1c6ab5b19  ONLINE       0     0   666
        special
          mirror-3                                ONLINE       0     0     0
            1c340f26-9646-4771-8e15-69bc117056ea  ONLINE       0     0     0
            926bf5a9-a3d5-4351-8cb0-d20740ec3e09  ONLINE       0     0     0
            4763af02-9d8b-4faa-842b-7d8d7ad42cd4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/vault/[...].avi
        /mnt/vault/[...].CR2
        /mnt/vault/[...].CR2
        /mnt/vault/[...].mkv
        /mnt/vault/[...].mbox
        /mnt/vault/[...].iso
        /mnt/vault/[...].mp4
        /mnt/vault/[...].MOV
        /mnt/vault/[...].mbox
        /mnt/vault/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].mkv
        vault/vault@20230102_0300_1day_1year:/[...].mp4
        vault/vault@20230102_0300_1day_1year:/[...].mbox
        vault/vault@20230102_0300_1day_1year:/[...].iso
        vault/vault@20230102_0300_1day_1year:/[...].mp4
        vault/vault@20230102_0300_1day_1year:/[...].mbox
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230103_0300_1day_1year:/[...].MOV
admin@truenas[~]#

I replaced the file names with [...] for some privacy. The files from the snapshots (naming scheme e.g. "20230102_0300_1day_1year") are the same as listed above. I looked at the files and they seem to be fine.
I do use an HBA (LSI SAS2308-8I / 9217-8I). I already tried reseating the cables, but it did not change anything. I also wonder why the errors are evenly spread on all drives. Does this mean that all connections are faulty?
What would be the next step to find the cause?

Thank you for all your help already!

NugentS · Jan 19, 2023

Please post your hardware as per forum rules

winnielinnie · Jan 19, 2023

flammen said:
I do use an HBA (LSI SAS2308-8I / 9217-8I). I already tried reseating the cables, but it did not change anything.

Not just cables. Check / reseat the HBA. Check the connections on all ends, both ways.

As well as posting all the hardware of the system, it also helps to know if you flashed the HBA to "IT mode".

flammen said:
but it did not change anything.

You might not see a difference until the next time the entire pool is scrubbed. Otherwise, those "errors" will remain in the pool's status.
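
For reference, the re-check after reseating could be scripted roughly like this. It is a minimal sketch, not an official procedure: resetting the counters with zpool clear before the new scrub is my own assumption (so that fresh errors stand out), and the helper names are hypothetical. It only prints the commands unless dry_run is set to False.

```python
import subprocess
from typing import List


def rescrub_commands(pool: str) -> List[List[str]]:
    """Build the zpool commands for a clean re-check of `pool`."""
    return [
        ["zpool", "clear", pool],         # reset the per-device error counters
        ["zpool", "scrub", pool],         # start a new scrub of the whole pool
        ["zpool", "status", "-v", pool],  # show scrub progress and the error list
    ]


def run(pool: str, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in rescrub_commands(pool):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run("vault")  # pool name taken from the output posted in this thread
```

The scrub runs in the background, so the status command has to be repeated until the scan line shows that the scrub has finished.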

flammen · Jan 19, 2023

Thanks for your help, it is highly appreciated!

These are the specs:

Code:

Motherboard: ASUS WS X299 PRO/SE
CPU: Intel Core i9-7900X
RAM: 64 GB
Drives:     Vault:       Data: 6x 8TB WD white label (shucked, mostly WD Red Pro) SATA HDDs (RAIDZ2)
                         Metadata: 3x 1TB Samsung 980 NVMe SSD (mirror)
            Apps:       2x 500GB Samsung 970 Evo NVMe SSD (mirror)
            App-Data:   2x 1TB Samsung 970 Evo NVMe SSD (mirror)
            Boot-pool:  2x 240GB Intel Pro 2500 SATA SSD
HBA: LSI SAS2308-8I (9217-8I) flashed to IT mode
Network cards: Mellanox ConnectX-3 10GbE SFP+ CX311A
Case: 4U Rack with 3 Noctua NF-A14 iPPC 3000 @ 100%

I will try reseating all the connections both ends and the card itself again and scrub the pool to see whether it helped.
I bought the HBA second hand and it has worked for 8 months now. Could it be faulty or at the end of its life?

NugentS · Jan 19, 2023

What's the airflow over the HBA like? Those do get hot and need airflow.

flammen · Jan 20, 2023

Thank you for your help.

NugentS said:
Those do get hot

That might be a cause. The airflow in the case is good: I have 3x 140mm iPPC Noctua fans at full speed in the front of the case and can definitely feel a lot of airflow behind the case. However, I do notice that the NVMe drive in the slot directly in front of the HBA is considerably warmer than the others. I do not have enough space for a fan directly on the HBA heatsink, and I cannot move the cards around. It is quite cramped, so maybe the airflow is not reaching the heatsink well.
I have attached a photo of the PCIe slot area; the HBA is the second card from the PSU.
Any suggestions or ideas for improved cooling?
Does the HBA report a temperature? If so, how can I get to it?

Redcoat · Jan 20, 2023

flammen said:
Any suggestions or ideas for improved cooling?

Do you "need" the video card in there (consider going headless) to create more space?

NugentS · Jan 20, 2023

I am afraid I have no idea if an LSI reports a temp.

What I did (it's a bit janky, but it either works or doesn't do any harm) was to put a spare 120mm fan on top of the cards, cable-tying it to the exit holes at the back of the case and powering it through a spare Molex connector. I keep meaning to replace it with a 140mm high-pressure fan I have, as all the kit is in the garage and can make as much noise as it likes.

As far as I know, there is no way of measuring an HBA's temperature other than directly. Some people put a 40mm fan on them with glue or appropriate screws, but unless it's very thin it won't fit for you.

Davvo · Jan 20, 2023

flammen said:
Does the HBA report a temperature? If so, how can I get to it?

You bet it does, touch the heatsink with your finger! Physics always reports!

Whattteva · Jan 20, 2023

Davvo said:
You bet it does, touch the heatsink with your finger! Physics always reports!

Curse thermodynamics!

somethingweird · Jan 20, 2023

Newbie Question - Would/Could non-ECC memory affect these files and cause the chksum error?

Whattteva · Jan 20, 2023

somethingweird said:
Newbie Question - Would/Could non-ECC memory affect these files and cause the chksum error?

Such consistent errors with similar numbers most likely have nothing to do with ECC and just indicate that some component is outright faulty (i.e. faulty RAM or a faulty HBA). I would probably rule out cables/drives because it's very unlikely that all 6 of them went bad at virtually the exact same time. Such uniformity requires divine intervention.

winnielinnie · Jan 20, 2023

Whattteva said:
I would probably rule out cables/drives because it's very unlikely that all 6 of them all went bad at virtually the exact same time. The amount of coincidences there requires divine intervention.

That's why I think it's a single point: the HBA

Maybe it needs to be re-seated? Maybe it's overheating?

After confirming that it is secure and the temperature/airflow has been corrected, then a follow-up scrub should reveal if the pool still contains errors.
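
The "same count on every drive" observation can also be checked mechanically. Below is a minimal sketch under a few assumptions: counts are plain integers as in the output posted above (zpool status may abbreviate large values), the 5% tolerance is an arbitrary choice of mine, and the function names and the shortened device names are hypothetical.

```python
import re
from typing import Dict

# Matches device rows such as "    disk-a  ONLINE  0  0  671".
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def cksum_counts(status_text: str) -> Dict[str, int]:
    """Map each device (and vdev) name to its CKSUM counter."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = int(match.group(5))
    return counts


def looks_uniform(counts: Dict[str, int], tolerance: float = 0.05) -> bool:
    """True if all non-zero counts lie within `tolerance` of their mean."""
    nonzero = [c for c in counts.values() if c > 0]
    if len(nonzero) < 2:
        return False
    mean = sum(nonzero) / len(nonzero)
    return all(abs(c - mean) <= tolerance * mean for c in nonzero)


# Counts from this thread, with shortened hypothetical device names.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            disk-a  ONLINE       0     0   671
            disk-b  ONLINE       0     0   666
            disk-c  ONLINE       0     0   666
            disk-d  ONLINE       0     0   669
            disk-e  ONLINE       0     0   667
            disk-f  ONLINE       0     0   666
"""

if __name__ == "__main__":
    print(looks_uniform(cksum_counts(SAMPLE)))
```

Near-identical counts on all members of a vdev suggest something shared between them (HBA, RAM, power) rather than six independent drive or cable failures, which is the reasoning given above.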

Davvo · Jan 20, 2023

When you get CKSUM errors the reason is usually (steps to be checked from top to bottom, left to right):

Drive (overheating, dying);
SATA or SAS cables (improper connection, bad cables);
HBA issues (overheating, wrong firmware, fake card, dying);
PSU issues (improper connection, bad cables, overheating, dying);
RAM (dust or CPU torque causing issues with the connection, dying, lack of ECC).

That's the ~~troubleshooting~~ blaming process the forum members usually follow. Do note that everything is in that order for a reason: as you can see, lack of ECC is usually the last thing you assume (usually when nothing else seems to be wrong and you can't exactly make sense of the situation, that's when you go fishing with memtest).
Usually the cables (power and data connections) are reseated at the same time; same for CPU and RAM.
