Checksum errors: how to find affected files?

flammen

Dabbler
Joined
Oct 16, 2022
Messages
20
Hi all,

I get hundreds of checksum errors on my main pool after a scrub, with nearly the same number on each disk. SMART tests run weekly and report that the disks are fine. I moved a lot of files to the pool recently.
Theory: The files were already corrupted when I moved them to the pool.
My question: How do I find the affected files?
I access the pool via SMB from a Windows machine.

Thanks for your help!
 

Attachments

  • Screenshot 2023-01-19 161758.png (50 KB)
  • Screenshot 2023-01-19 161945.png (16.3 KB)
  • Screenshot 2023-01-19 162004.png (31.7 KB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I think you should try attaching them to some hardware as per forum rules
 
Joined
Oct 22, 2019
Messages
3,641
A useful mnemonic. If you get a bunch of checksum errors across all drives, yet no issues with SMART tests:

Check cables. Check connections. Check HBA (if applicable).
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Theory: Files were already corrupted when I moved them to the pool.
Although the files might be corrupted in some sense (documents, music, video ... with errors in them), there is no such thing as a corrupted file from ZFS' point of view when you write it. If your Word document contains errors, it will be written with those errors. To ZFS, a file is just one long bag of bytes; it does not care whether your Word document is valid or whether your video skips frames.

The corruption you are seeing is detected by ZFS at the level of block checksums and probably happened after writing.

To find out which files are affected by those checksum errors use zpool status -v <poolname>
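If the list is long, the paths can be pulled out of that output with a small script. This is only a minimal sketch in Python, assuming the "errors: Permanent errors have been detected in the following files:" header that OpenZFS prints, followed by one indented entry per line. The function names and the sample paths are made up for illustration.

```python
MARKER = "errors: Permanent errors have been detected in the following files:"


def affected_files(status_text):
    """Return the entries listed under the 'errors:' section of `zpool status -v`."""
    entries = []
    in_section = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(MARKER):
            in_section = True
            continue
        if in_section:
            # A new "pool:" header means the next pool's report has started.
            if stripped.startswith("pool:"):
                in_section = False
            elif stripped:
                entries.append(stripped)
    return entries


def split_entries(entries):
    """Sort entries into live paths, snapshot entries, and anything else."""
    live, snapshots, other = [], [], []
    for entry in entries:
        if entry.startswith("/"):
            live.append(entry)
        elif "@" in entry.split(":", 1)[0]:
            # Snapshot entries look like 'pool/dataset@snapshot:/path'.
            snapshots.append(entry)
        else:
            # e.g. objects that ZFS cannot map back to a file name.
            other.append(entry)
    return live, snapshots, other


# Hypothetical sample text, not taken from a real pool.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        /mnt/tank/photos/a.CR2
        tank/data@daily-1:/photos/a.CR2
"""

live, snapshots, other = split_entries(affected_files(SAMPLE))
print("live:", live)
print("snapshots:", snapshots)
print("other:", other)
```

To use it on a real system you would feed it the saved text of zpool status -v <poolname> instead of SAMPLE.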
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
If there were uncorrectable errors in the files, ZFS would report the affected files. For now it seems that all errors are being corrected, and the issue is probably with the cable(s) or HBA.
The output of zpool status (within CODE tags) could help narrow the issue.
 

flammen

Dabbler
Joined
Oct 16, 2022
Messages
20
Thank you for the explanation, it makes sense. This is the output:

Code:
admin@truenas[~]# sudo zpool status vault -v
  pool: vault
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 1.49M in 05:20:21 with 12 errors on Wed Jan 18 08:20:26 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        vault                                     ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            15ef0a62-3b6a-4d4e-ad5a-71bf5e78a9a9  ONLINE       0     0   671
            12fafd0d-5788-49a9-8ce8-5558cdefd773  ONLINE       0     0   666
            de84f345-53e6-482c-9240-277729051b4a  ONLINE       0     0   666
            8ae62f89-6ecb-4514-9c57-a16f55cbfee4  ONLINE       0     0   669
            62fdecf2-44fb-4260-a41f-58ede8d67de3  ONLINE       0     0   667
            d6f86e95-41a7-4f8a-9240-18b1c6ab5b19  ONLINE       0     0   666
        special
          mirror-3                                ONLINE       0     0     0
            1c340f26-9646-4771-8e15-69bc117056ea  ONLINE       0     0     0
            926bf5a9-a3d5-4351-8cb0-d20740ec3e09  ONLINE       0     0     0
            4763af02-9d8b-4faa-842b-7d8d7ad42cd4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/vault/[...].avi
        /mnt/vault/[...].CR2
        /mnt/vault/[...].CR2
        /mnt/vault/[...].mkv
        /mnt/vault/[...].mbox
        /mnt/vault/[...].iso
        /mnt/vault/[...].mp4
        /mnt/vault/[...].MOV
        /mnt/vault/[...].mbox
        /mnt/vault/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230102_0300_1day_1year:/[...].mkv
        vault/vault@20230102_0300_1day_1year:/[...].mp4
        vault/vault@20230102_0300_1day_1year:/[...].mbox
        vault/vault@20230102_0300_1day_1year:/[...].iso
        vault/vault@20230102_0300_1day_1year:/[...].mp4
        vault/vault@20230102_0300_1day_1year:/[...].mbox
        vault/vault@20230102_0300_1day_1year:/[...].CR2
        vault/vault@20230103_0300_1day_1year:/[...].MOV
admin@truenas[~]# 

I replaced the file names with [...] for some privacy. The files from the snapshots (naming scheme e.g. "20230102_0300_1day_1year") are the same as listed above. I looked at the files and they seem to be fine.
I do use an HBA (LSI SAS2308-8I / 9217-8I). I already tried reseating the cables, but it did not change anything. I also wonder why the errors are evenly spread on all drives. Does this mean that all connections are faulty?
What would be the next step to find the cause?

Thank you for all your help already!
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Please post your hardware as per forum rules
 
Joined
Oct 22, 2019
Messages
3,641
I do use an HBA (LSI SAS2308-8I / 9217-8I). I already tried reseating the cables, but it did not change anything.
Not just cables. Check / reseat the HBA. Check the connections on all ends, both ways.


As well as posting all the hardware of the system, it also helps to know if you flashed the HBA to "IT mode".


but it did not change anything.
You might not see a difference until the next time the entire pool is scrubbed. Otherwise, those "errors" will remain in the pool's status.
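To compare one scrub with the next after a hardware change, it can help to note down the numbers from the "scan:" line each time. Here is a rough sketch in Python, assuming the line has the form "scrub repaired <amount> in <duration> with <n> errors on <date>" as in the output posted above. The exact wording may differ between OpenZFS versions, and parse_scan_line is just a hypothetical helper name.

```python
import re

# Pattern for a finished scrub; an in-progress scrub will not match.
SCAN_RE = re.compile(
    r"scan:\s+scrub repaired (?P<repaired>\S+) in (?P<duration>.+?) "
    r"with (?P<errors>\d+) errors on (?P<finished>.+)"
)


def parse_scan_line(line):
    """Return the scrub summary as a dict, or None if the line does not match."""
    match = SCAN_RE.search(line)
    if match is None:
        return None
    return {
        "repaired": match.group("repaired"),
        "duration": match.group("duration"),
        "errors": int(match.group("errors")),
        "finished": match.group("finished").strip(),
    }


line = "  scan: scrub repaired 1.49M in 05:20:21 with 12 errors on Wed Jan 18 08:20:26 2023"
result = parse_scan_line(line)
print(result)
```

If the fix worked, a later scrub should show a smaller repaired amount and fewer errors than the previous one.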
 

flammen

Dabbler
Joined
Oct 16, 2022
Messages
20
Thanks for your help, it is highly appreciated!

These are the specs:
Code:
Motherboard: ASUS WS X299 PRO/SE
CPU: Intel Core i9-7900X
RAM: 64 GB
Drives:     Vault:       Data: 6x 8TB WD White Label (shucked, mostly WD Red Pro) SATA HDDs (RAIDZ2)
                         Metadata: 3x 1TB Samsung 980 NVMe SSD (mirror)
            Apps:       2x 500GB Samsung 970 Evo NVMe SSD (mirror)
            App-Data:   2x 1TB Samsung 970 Evo NVMe SSD (mirror)
            Boot-pool:  2x 240GB Intel Pro 2500 SATA SSD
HBA: LSI SAS2308-8I (9217-8I) flashed to IT mode
Network cards: Mellanox ConnectX-3 10GbE SFP+ CX311A
Case: 4U Rack with 3 Noctua NF-A14 iPPC 3000 @ 100%


I will try reseating all the connections at both ends, as well as the card itself, and then scrub the pool to see whether it helped.
I bought the HBA second-hand and it has worked for 8 months now. Could it be faulty or at the end of its life?
 

Attachments

  • debug-truenas-20230119230104.tgz (14.4 MB)

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
What's the airflow over the HBA like? Those do get hot and need airflow.
 

flammen

Dabbler
Joined
Oct 16, 2022
Messages
20
Thank you for your help.
Those do get hot
That might be a cause. The airflow in the case is good: I have 3x 140mm iPPC Noctua fans at full speed in the front of the case and can definitely feel a lot of airflow behind the case. However, I do notice that the NVMe drive in the slot directly in front of the HBA is considerably warmer than the others. I do not have enough space for a fan directly on the HBA heatsink, and I cannot move the cards around. It is quite cramped, so maybe the airflow is not reaching the heatsink well.
I have attached a photo of the PCIe slot area; the HBA is the second card from the PSU.
Any suggestions or ideas for improved cooling?
Does the HBA report a temperature? If so, how can I get to it?
 

Attachments

  • 20230115_011821.jpg (653.6 KB)

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Any suggestions or ideas for improved cooling?
Do you "need" the video card in there (consider going headless) to create more space?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I am afraid I have no idea if an LSI reports a temp.

What I did is a bit janky, but it either works or at least doesn't do any harm: I put a spare 120mm fan on top of the cards, cable-tying it to the exit holes at the back of the case and powering it through a spare Molex. I keep meaning to replace it with a 140mm high-pressure fan I have; all the kit is in the garage, so it can make as much noise as it likes.

As far as I know, there is no way to measure an HBA's temperature other than directly. Some people put a 40mm fan on them with glue or appropriate screws - but unless it's very thin it won't fit for you.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Does the HBA report a temperature? If so, how can I get to it?
You bet it does, touch the heatsink with your finger! Physics always reports!
 

somethingweird

Contributor
Joined
Jan 27, 2022
Messages
183
Newbie Question - Would/Could non-ECC memory affect these files and cause the chksum error?
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Newbie Question - Would/Could non-ECC memory affect these files and cause the chksum error?
Such consistent errors with similar counts most likely have nothing to do with ECC and just indicate that some component is outright faulty (i.e. faulty RAM or a faulty HBA). I would probably rule out cables/drives because it's very unlikely that all 6 of them went bad at virtually the exact same time. Such uniformity requires divine intervention.
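To put a number on how similar the counts are, the CKSUM column can be compared across devices. This is a rough sketch in Python using the raidz2 rows from the status output posted earlier in the thread. The 5% tolerance and the names parse_cksum_counts and looks_uniform are arbitrary choices for illustration, not anything ZFS defines.

```python
def parse_cksum_counts(rows):
    """Map device name -> CKSUM count from 'NAME STATE READ WRITE CKSUM' rows."""
    counts = {}
    for row in rows:
        parts = row.split()
        # Only plain numeric counters are handled here; abbreviated values
        # (if zpool prints any, such as "1.2K") are skipped by this sketch.
        if len(parts) == 5 and all(p.isdigit() for p in parts[2:]):
            counts[parts[0]] = int(parts[4])
    return counts


def looks_uniform(counts, tolerance=0.05):
    """True if all non-zero counts lie within `tolerance` of each other."""
    values = [v for v in counts.values() if v > 0]
    if len(values) < 2:
        return False
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean <= tolerance


ROWS = """\
          raidz2-0                                ONLINE       0     0     0
            15ef0a62-3b6a-4d4e-ad5a-71bf5e78a9a9  ONLINE       0     0   671
            12fafd0d-5788-49a9-8ce8-5558cdefd773  ONLINE       0     0   666
            de84f345-53e6-482c-9240-277729051b4a  ONLINE       0     0   666
            8ae62f89-6ecb-4514-9c57-a16f55cbfee4  ONLINE       0     0   669
            62fdecf2-44fb-4260-a41f-58ede8d67de3  ONLINE       0     0   667
            d6f86e95-41a7-4f8a-9240-18b1c6ab5b19  ONLINE       0     0   666
""".splitlines()

counts = parse_cksum_counts(ROWS)
print(counts)
print("uniform:", looks_uniform(counts))
```

Counts this close together across every disk point towards something the disks share rather than six separate faults.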
 
Joined
Oct 22, 2019
Messages
3,641
I would probably rule out cables/drives because it's very unlikely that all 6 of them all went bad at virtually the exact same time. The amount of coincidences there requires divine intervention.
That's why I think it's a single point: the HBA

Maybe it needs to be re-seated? Maybe it's overheating?

After confirming that it is secure and the temperature/airflow has been corrected, then a follow-up scrub should reveal if the pool still contains errors.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
When you get CKSUM errors, the cause is usually one of the following (check them in order, top to bottom and left to right):
  1. Drive (overheating, dying);
  2. SATA or SAS cables (improper connection, bad cables);
  3. HBA issues (overheating, wrong firmware, fake card, dying);
  4. PSU issues (improper connection, bad cables, overheating, dying);
  5. RAM (dust or CPU torque causing issues with the connection, dying, lack of ECC).
That's the troubleshooting (blaming) process forum members usually follow. Note that everything is in that order for a reason: as you can see, lack of ECC is usually the last thing you assume, typically when nothing else seems to be wrong and you can't make sense of the situation. That's when you go fishing with memtest.
Usually the cables (power and data connections) are reseated at the same time; the same goes for the CPU and RAM.
 