Lots of checksum errors on all drives

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
I've started to get lots of checksum errors on all six of my drives. I ran a scrub, cleared the errors, then ran another scrub right after the first one, and I get about the same number of errors.

Code:
        pike                                                ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            gptid/7fff5436-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     4  (repairing)
            gptid/7ffe0a1b-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     9  (repairing)
            gptid/801797e1-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0    12  (repairing)
            gptid/804154be-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0    11  (repairing)
            gptid/8027c34e-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     5  (repairing)
            gptid/7fef22da-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     9  (repairing)


The drives are getting a bit old, so my first thought was that they were starting to fail, but all of them at once? I checked the SMART data and to my eyes it looks fine, but maybe someone here can see something I don't?

SMART data: https://pastebin.com/8yVTNSPV

I was thinking maybe the HBA card is going bad? It feels weird that there would be this many errors on two consecutive scrubs across all drives; the NAS isn't very active, so not much is written to it on any given day.
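
For reference, this is roughly the scrub/clear/scrub sequence I described above (commands from memory):

Code:
# First scrub: ZFS repairs what it can and counts checksum errors per disk.
zpool scrub pike
zpool status -v pike

# Reset the error counters, then scrub again so the new counts start from zero.
zpool clear pike
zpool scrub pike
zpool status -v pike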

System:

Version: TrueNAS-13.0-U5.3
Motherboard: MSI Z170a gaming M9 ACK
CPU: Intel 7700K
RAM: Corsair 64 GB
Hard drives:
  • 6x WDC WD80EFAX-68L in raidz2
  • 2x Kingston SUV500120G boot array
Network: Intel I350-T2
Disk controller:
Code:
nas01# sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2308_2
  BIOS version                            : 7.39.02.00
  Firmware version                        : 20.00.07.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 1023
  Concurrent commands supported           : 10240
  Slot                                    : 16
  Segment                                 : 0
  Bus                                     : 1
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
And now it thinks two of the drives have too many errors:
Code:
nas01# zpool status pike
  pool: pike
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Sep 10 10:34:14 2023
        8.39T scanned at 724M/s, 7.34T issued at 633M/s, 21.5T total
        2.03M repaired, 34.12% done, 06:31:11 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        pike                                                DEGRADED     0     0     0
          raidz2-0                                          DEGRADED     0     0     0
            gptid/7fff5436-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     8  (repairing)
            gptid/7ffe0a1b-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0    12  (repairing)
            gptid/801797e1-599d-11ea-a069-b496912a4888.eli  DEGRADED     0     0    17  too many errors
            gptid/804154be-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0    13  (repairing)
            gptid/8027c34e-599d-11ea-a069-b496912a4888.eli  DEGRADED     0     0     6  too many errors
            gptid/7fef22da-599d-11ea-a069-b496912a4888.eli  ONLINE       0     0     9  (repairing)


Think I will shut down and run memtest.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
OK, memtest found 10 errors just 1 hour into the run, so I think I've found the culprit.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
After you fix your memory, you will want to fix your LSI controller. It is in IR mode, Integrated RAID. This is not what is desired. You want IT mode, Initiator Target.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
Any tips on how to salvage the array? Should I put the drives in a known good system and run a scrub? Can I tell ZFS to start using the drives marked as degraded again?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I will gladly admit it if I am wrong, but the output you listed has this line. Note the "IR":
LSI Corporation SAS2 IR Configuration Utility.
I am guessing (and it is truly a guess) that this means it's not in IT mode. But perhaps someone else has a complete answer.

You need to fix the memory first.

Once the memory is fixed, you can clear the errors and run another scrub. Any file listed as bad could genuinely be bad: a block to be written had its checksum computed, then the bad memory corrupted it, and after the corruption it was written to disk. Thus, ZFS wrote bad data because of bad memory.

Erase those files and restore them from backup. Get the list of bad files from:
zpool status -v

The really annoying part is that you may have corrupt files you can't detect: the corruption may have occurred before the checksum was computed. But this problem would happen with any file system.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
I see. I think it's fine, but I will investigate further!

Is there a log where I can go back and see when the checksum errors were found? If so, maybe I could restore from a backup taken before that point, just to be safe.

The question is, though: my backup is another TrueNAS box which I replicate the data to. If there were errors in the data on the source, I assume those errors would be replicated to the backup as well? There have been 2 checksum errors on the backup NAS; could those be errors that were replicated from the primary NAS?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I don't know about any logs for the checksum errors. Perhaps someone else does.

As far as I know, the error blocks themselves would not be replicated to your backup TrueNAS server. However, the corrupt file's good blocks would likely be replicated, so files with bad blocks are, for practical purposes, bad files on the backup NAS too.

Of course, if you have backups going back to before the source NAS's bad blocks appeared, then that backup would still be good.

In theory, any file on the backup NAS can be compared to the one on the source NAS; if they differ (even if neither file shows checksum errors), that could mean you have silent corruption on your source NAS. (This assumes the file in question is static, i.e. has not changed between the source NAS and the backup NAS.)


This complicated recovery is one reason why I wish both Intel & AMD would make lower-end server CPUs with ECC memory. Yes, you can get cheaper AMD Epyc server boards, but they are limited in I/O, and not that cheap. Plus, the AMD Epyc 3000 series uses older cores, not 3rd generation or later. An AMD server CPU with ECC memory support, 4-channel memory support, and a more limited number of cores (reducing cost) would be nice.

To be fair, Intel is doing a better job recently. In the past, Intel CPUs with ECC memory on the low end were much more limited.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I will gladly admit it if I am wrong, but the output you listed has this line. Note the "IR":

I am guessing (and it is truly a guess) that this means it's not in IT mode. But perhaps someone else has a complete answer.
Isn't that just because he is using sas2ircu? Also, it reports "RAID Support: No".
@Agent92 can you try sas2flash instead?

Regarding the checksum logs, you could try zpool events pike | grep checksum. I do not remember where the actual file is (if there is such a thing).
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@Davvo - Excellent point. It's been a while since I used an LSI-based SAS controller at home.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
In theory, any file on the backup NAS can be compared to the one on the source NAS; if they differ (even if neither file shows checksum errors), that could mean you have silent corruption on your source NAS. (This assumes the file in question is static, i.e. has not changed between the source NAS and the backup NAS.)

If I am not wrong, such a thing can be done with RSync and a parameter.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
If I am not wrong, such a thing can be done with RSync and a parameter.
Yes, if source and backup are supposed to be the same, RSync can perform a reverse copy (from the backup server to the source server). I think the option is "--checksum", which compares RSync-generated checksums of the backup and source files; if they differ, it makes them identical.

But, I think I would run a verbose "--dry-run" first to get a list of files.
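
Something along these lines, run on the source NAS; the host name and dataset paths below are just placeholders:

Code:
# Dry run: compare by checksum, list what would change, copy nothing yet.
rsync --archive --checksum --dry-run --itemize-changes --verbose \
    root@backup-nas:/mnt/backup/pike/ /mnt/pike/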


I hope this did not turn out to be an example of the mythical "scrub of death" :-(.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
Never heard of the scrub of death actually; I found this: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/

From that, it doesn't sound like it's a real issue, but from what I saw it definitely looked like I was running a scrub of death, or a near-death one at least.

I did actually find memtest errors on the backup NAS as well. I don't know what is going on; this hardware has been running fine for a long time. But I verified that at least one stick is good, so now I'm running a scrub on the backup NAS and it's looking good. I will do the same with the primary tomorrow when memtest finishes a second run.

The RSync tip is really good; I will look into that when everything is back up.

Yes, sas2flash does report IT mode:

Code:
nas01# sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

        Controller Number              : 0
        Controller                     : SAS2308_2(D1)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 500605b-0-0915-b100
        NVDATA Version (Default)       : 14.01.00.06
        NVDATA Version (Persistent)    : 14.01.00.06
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9207-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : 07.27.01.01
        FCODE Version                  : N/A
        Board Name                     : SAS9207-8i
        Board Assembly                 : H3-25412-00H
        Board Tracer Number            : SV42830873

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
Found this:

ZFS maintains a persistent log of all data errors associated with a pool. This log is rotated whenever a complete scrub of the system finishes.

So I guess that log is gone now that the scrub has finished. At least it finished without errors this time, but I will still try to compare the files using RSync.
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
Tell me if I'm being stupid now, but I restored a snapshot taken well before the issues started to a new dataset, and then ran RSync to compare the files in that restored dataset against the current one. Is that a bad idea, since what I restored is just references to the same underlying data? Or is it still a valid test?
 
Joined
Oct 22, 2019
Messages
3,641
Unless the file was modified after the snapshot was taken, it's the same exact file, byte for byte.

A checksum mismatch on a file created two years ago applies to all "versions" of this file across several snapshots, if the file was never modified. In reality, there are no "different existences" of a file between snapshots (if the file was never modified.)

If the record that comprises the file's data rots or becomes corrupt, then snapshots cannot offer up a "good" version of this record. Repair / recovery is only possible with redundancy across a mirror or RAIDZ vdev. (That is: where another good copy exists, or can be rebuilt using parity.)



A visual example, for a single-drive pool with no redundancy:

January 2022 I have the following file, and its data is:
ABCDEF01

( ... I take hundreds of snapshots for the next year or so ... )

Today, September 2023, the file's data somehow "rotted" or became corrupt for whatever reason. Electrical issues, a solar flare, HDD problems, etc.
Its data is now:
ABCDEF02

Every single snapshot points to this exact record. Every snapshot offers up this file as:
ABCDEF02

Without redundancy or backups, there is nothing that my earlier snapshots can do for me. There's no "history" where I can extract a good copy of the file from some "alternate universe".
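
You can see this for yourself by checksumming the live file against the copy exposed through an old snapshot (the dataset path and snapshot name below are made up):

Code:
# Both paths reference the same on-disk records if the file was never modified,
# so the checksums will match whether the data is good or corrupt.
sha256 /mnt/pike/data/somefile.bin
sha256 /mnt/pike/data/.zfs/snapshot/auto-2022-01-15/somefile.bin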
 
Last edited:

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
So the only valid test would be to run an RSync compare between the primary and the backup NAS, I guess?
 
Joined
Oct 22, 2019
Messages
3,641
So the only valid test would be to run an RSync compare between the primary and the backup NAS, I guess?
Theoretically, but reading this thread from the beginning, the memtest failures on both machines give me chills.

Even with redundancy in your vdevs, bad memory (RAM) bypasses that protection when writing new data. @Arwen explained it earlier.

Since you had bad RAM on both ends, there's no telling if your backup NAS has good copies of your data.

You can go ahead and run a dry-run with rsync, using the "-c" flag, and save the results to a text file to inspect a sampling of these files yourself.
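
For example, something like this run from the primary NAS (the host name and paths are placeholders):

Code:
# -a archive, -v verbose, -n dry-run, -c compare by checksum instead of size/mtime
rsync -avnc /mnt/pike/data/ root@backup-nas:/mnt/backup/pike/data/ > rsync-compare.txt 2>&1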

* Non-ECC RAM going bad is most nefarious when writing new data. ZFS cannot really protect you against this. (No filesystem can.)
 

Agent92

Explorer
Joined
Feb 11, 2019
Messages
56
Well, it's the best I can do then; comparing the data now. The backup NAS should have run with bad RAM for only a few hours, compared to the primary, which ran with it for about a week I think. So at least the chance of corruption should be lower.
 