Checksum errors

KMR · Mar 9, 2013

Hey folks,

I seem to be having some checksum issues (already!).

Build:
Gigabyte MB (Socket 775)
Intel E8400 C2D @ 3.0 GHz
12GB RAM

I am currently running 3x 3TB Seagate drives in RAIDZ1. I was told that RAIDZ2 was better for data integrity and that I should purchase 3 more drives and rebuild the pool as a 6x 3TB RAIDZ2 array. I bought the extra disks and started backing up data before rebuilding the array. When copying certain files (noted below), the whole server would freeze. I could not use the web UI or ping the box and had to hard reset the machine. After a hard reset it boots up fine and the pool is reported as healthy, but there is a yellow alert light with the following message:

WARNING: The volume volume (ZFS) status is UNKNOWN: One or more devices has experienced an error resulting in data corruption. Applications may be affected. Restore the file in question if possible. Otherwise restore the entire pool from backup.

The results of zpool status -v are below (after running a scrub):

Code:

[root@freenas] /var/log# zpool status -v
  pool: volume
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 256K in 2h46m with 5 errors on Sat Mar  9 09:40:56 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     5
          raidz1-0                                      ONLINE       0     0    10
            gptid/19721fcb-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     1
            gptid/19f3b7b9-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     1
            gptid/1a772d25-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /mnt/volume/ {media file}
        /mnt/volume/ {media file}
        /mnt/volume/ {media file}
        /mnt/volume/ {media file}

I am going to try copying the files again to see if I get another hang. After that, what should I do? These are brand new disks. Is it possible all three are bad? Fortunately most of the data is backed up, and anything that isn't can easily be ripped again.

Thanks!

- - - Updated - - -

Unfortunately I got the same result. I am going to delete the offending files and perform another scrub.
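For anyone who wants to track the error counters between scrubs, here is a minimal sketch in Python that pulls the READ/WRITE/CKSUM columns out of `zpool status` text. It is not part of FreeNAS; the function names and the regular expression are assumptions made for illustration, and the sample text is copied from the output above. It only handles plain integer counters, so abbreviated values such as `1.2K` would be skipped.

```python
import re

# Device table copied from the `zpool status -v` output quoted in this thread.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     5
          raidz1-0                                      ONLINE       0     0    10
            gptid/19721fcb-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     1
            gptid/19f3b7b9-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     1
            gptid/1a772d25-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     2
"""

# Hypothetical pattern: name, upper-case state, then three integer counters.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_error_counters(status_text):
    """Return one dict per device row found in zpool status text."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, state, read, write, cksum = match.groups()
            rows.append({
                "name": name,
                "state": state,
                "read": int(read),
                "write": int(write),
                "cksum": int(cksum),
            })
    return rows


def devices_with_checksum_errors(rows):
    """Return the names of rows whose CKSUM counter is non-zero."""
    return [row["name"] for row in rows if row["cksum"] > 0]


if __name__ == "__main__":
    for name in devices_with_checksum_errors(parse_error_counters(SAMPLE)):
        print(name)
```

In practice the text would come from running `zpool status -v` on the box and saving or piping its output. Comparing the counters before and after a scrub shows whether new errors keep appearing.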

cyberjock · Mar 9, 2013

It's possible that all 3 are bad, but I think that's very unlikely.

Random thoughts off the top of my head:

I'm thinking it's more likely that the freeze left your zpool in an inconsistent state, which may have caused the problems with the zpool. In other words, the system freeze may have caused the zpool issue, rather than the zpool issue causing the system to freeze.

You didn't say if this has happened before, what version of FreeNAS you are using, or anything else that may be "weird" with your system either now or in the past. I haven't used a Core 2 Duo system in years, but from personal experience, when I had 2 of them neither one worked right with more than 8GB of RAM. Both of my systems would have random bizarre issues.

Bad RAM can manifest itself in random ways, with random problems at random intervals. 2-3 passes with Memtest86 should show whether RAM is the problem. Note that this test could take more than a couple of hours to run. When I do a RAM test I normally start it and go to bed.

I'd check to see if there are any BIOS updates and check your BIOS settings.

But it's really hard to judge from the small amount of information provided. You may have some hardware issue that you will have to find via troubleshooting.

KMR · Mar 9, 2013

Thank you for your reply.

I am running FreeNAS-8.3.1-BETA3-x64 (r13264).

This was an old computer that I rebuilt for FreeNAS duty, and now that you mention it I do seem to remember some system stability issues with it in its previous life. When the current scrub is done (I have deleted the offending files) I will run Memtest86 to rule out RAM as the culprit. Could a system crash corrupt the files beyond repair and then cause more system crashes each time those files are accessed?

My original plan was to build a new box with a Supermicro MB, ECC RAM, and an Ivy Bridge Pentium CPU in the future, but I may have to push that schedule up if I am running into hardware errors already.

KMR · Mar 9, 2013

Now when I run zpool status -v I get:

Code:

[root@freenas] ~# zpool status -v
  pool: volume
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Mar  9 10:07:47 2013
        1.35T scanned out of 4.27T at 508M/s, 1h40m to go
        0 repaired, 31.67% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/19721fcb-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0
            gptid/19f3b7b9-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0
            gptid/1a772d25-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        volume:<0x15a1a>
        volume:<0x1335c>
        volume:<0x13374>
        volume:<0x1468d>

This is after deleting the offending files. Why am I getting addresses now? Note: I am still in the middle of another scrub.

KMR · Mar 9, 2013

Status after deleting files and running another scrub:

Code:

[root@freenas] ~# zpool status -v
  pool: volume
 state: ONLINE
  scan: scrub repaired 0 in 2h45m with 0 errors on Sat Mar  9 12:53:38 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        volume                                          ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/19721fcb-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0
            gptid/19f3b7b9-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0
            gptid/1a772d25-7d65-11e2-9ba5-50e549b3a1da  ONLINE       0     0     0

errors: No known data errors

Memtest86 time, I guess.

paleoN · Mar 9, 2013

KMR said:
This was an old computer that I rebuilt for FreeNAS duty, and now that you mention it I do seem to remember some system stability issues with it in its previous life. When the current scrub is done (I have deleted the offending files) I will run Memtest86 to rule out RAM as the culprit. Could a system crash corrupt the files beyond repair and then cause more system crashes each time those files are accessed?

More likely a continuing hardware issue. A troublesome PSU is another possibility as well.

KMR · Mar 9, 2013

The PSU is only a few months old, tops. I have backed up the entire pool minus the four files and am running memtest86 on the machine now. So far, no errors. Could this be a motherboard issue?

gpsguy · Mar 9, 2013

Yep

KMR said:
Could this be a motherboard issue?

If you are still experiencing the freeze/hang issue, I'd hang a monitor off the server and see if a message appears on the console when it hangs. You might have experienced a kernel panic. Take a picture of it and include it with your message.

KMR · Mar 10, 2013

I let memtest86 run for 4 passes overnight and there were no errors. The first time it happened I lost four files and now that I have deleted those files and run a scrub I don't know how to get it to happen again for testing purposes.

paleoN · Mar 11, 2013

KMR said:
The PSU is only a few months old tops.

So it's a new, unproven component and could be defective. I concur.

In addition to suspecting the motherboard you also need to consider the CPU. Then you need to consider thermal issues.

KMR said:
The first time it happened I lost four files and now that I have deleted those files and run a scrub I don't know how to get it to happen again for testing purposes.

Best bet is to write new data then read that data back.
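A minimal sketch of that write-then-read-back test in Python, assuming nothing beyond the standard library. The function names, the test file name `readback_test.bin`, and the sizes are made up for illustration. It writes random data, records a SHA-256 of what was written, reads the file back, and compares the two hashes.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # write and read in 1 MiB blocks


def write_random_file(path, size_mib):
    """Write size_mib MiB of random data; return the SHA-256 of the data written."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            block = os.urandom(CHUNK)
            digest.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    return digest.hexdigest()


def hash_file(path):
    """Read the file back in blocks and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_round_trip(directory, size_mib=4):
    """Return True if the data read back matches the data written.

    The test file is left in place so a later scrub also covers it.
    """
    path = os.path.join(directory, "readback_test.bin")  # hypothetical file name
    expected = write_random_file(path, size_mib)
    return hash_file(path) == expected


if __name__ == "__main__":
    # Demo against a temporary directory; point it at a directory on the
    # pool (for example somewhere under /mnt/volume) for a real test.
    with tempfile.TemporaryDirectory() as d:
        print("match" if verify_round_trip(d) else "MISMATCH")
```

Two caveats. A read straight after a write may be served from RAM cache rather than from the disks, so a test file larger than installed RAM (12GB here), or a scrub afterwards, exercises the hardware more thoroughly. Also, when ZFS cannot repair a block it normally fails the read with an I/O error instead of returning bad data, so a failure may show up as an exception rather than a hash mismatch.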
