Hardware:
Supermicro X9SRH-7TF
Intel Xeon 2.1GHz E5-2620 v2 Processor
256GB RAM
45 x 8TB Ultrastar He8 3.5" 7200RPM SAS 6Gb/s
3 x 11 disk RAIDZ-3
2 x 10GbE Intel X540 onboard NICs
FreeNAS 9.3, most current stable release (mirrored boot drives with a newly installed OS)
Everything had been running stable. We went to add another external drive enclosure via a mini-SAS connection, which would make two external enclosures in total. In doing so, we noticed a power issue: all of the drives in that enclosure came back as degraded due to excessive checksum errors. We replaced the board in the enclosure, which resolved most of the issue, and then ran a scrub on the zpool.
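For reference, the scrub and status check were just the standard commands, roughly like this (the pool is literally named "pool", as the dataset paths below show; substitute your own pool name):

    zpool scrub pool        # start a full scrub of the pool
    zpool status -v pool    # check scrub progress and list any permanent errors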
When we came back, there were still errors that were uncorrectable.
The image above illustrates what the pool looked like before a scrub. After the scrub, all errors were gone EXCEPT the last one: pool/.system@auto-20160720.2300-2d:<0x0>
No matter what we do, that one does not go away. And even after we run zpool clear, the checksum errors clear on the pool but then accumulate on an individual vdev, NOT on specific drives.
Here is a shot to illustrate that:
That checksum error will continue to increase until it eventually degrades the pool.
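To be explicit about the clear-and-watch cycle, it is just the stock commands (again assuming the pool name from the paths above):

    zpool clear pool        # reset the error and checksum counters on the pool and every vdev
    zpool status -v pool    # counters start back at zero, then CKSUM errors build up on one RAIDZ-3 vdev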
The "Errors: 706 goes down significantly once a scrub completes, but then goes back up. Not exactly sure why but I speculate that it is due to errors in the metadata somewhere.
After doing some research, I found this in Oracle's documentation: "If non-zero errors are reported for a top-level virtual device, portions of your data might have become inaccessible."
So we know there is corrupt metadata somewhere on the pool. I assume it has something to do with the pool/.system snapshot task. The thing is, we know about it, and we want to tell the system that we just don't care about it. We have even cleared out all of the snapshots on the system, deleted the .system folder and rebooted, reinstalled the OS, and run multiple scrubs. We know the disks are solid, as we have run countless SMART tests, and the likelihood of 6-8 SAS drives being bad right out of the box is ridiculous.
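For the record, the snapshot cleanup amounted to the standard zfs commands, along these lines (the snapshot name is the one from the error above; this is a sketch of what was run, not a recommendation):

    zfs list -t snapshot -r pool/.system              # list every snapshot under the .system dataset
    zfs destroy pool/.system@auto-20160720.2300-2d    # destroy the snapshot named in the error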
The question I have is how to get this error out of the system. A scrub won't fix it, and I can't pin down where the file is in order to move or delete it. Does anyone know any tricks? We have been troubleshooting this for the past two weeks to no avail, and any help is greatly appreciated. We do have onsite backups on LTO, but there is A LOT of data to back up, and the existing backups do not contain recent changes or new renders, so we are trying to avoid going that route unless absolutely necessary.
As is, the system is actually performing great, the same as before all of this happened. It is just a nuisance to have the same error being reported and to have to constantly clear the zpool status so it doesn't go degraded. So again, if there is a way to completely clear this error out, since it is not critical to our needs, and get back to the status quo, that would be some magic I would greatly appreciate.
Thank you all.