Top Level Vdev Checksum Errors not fixed with Scrub


dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
Hardware:
Supermicro X9SRH-7TF
Intel Xeon 2.1GHz E5-2620 v2 Processor
256GB RAM
45 x 8TB Ultrastar He8 3.5" 7200RPM SAS 6Gb/s
3 x 11-disk RAIDZ3 vdevs
2 x 10GbE Intel X540 onboard NICs
FreeNAS 9.3, most current stable release (mirrored boot drives with a newly installed OS)

Everything had been running stable. We went to add another external drive enclosure via a mini-SAS connection, which would make two external drive enclosures in total. In doing so, we noticed a power issue: all of the drives in that external enclosure came back as degraded due to excessive checksum errors. We replaced the board in the enclosure and most of the issue went away, then ran a scrub on the zpool.

When we came back there were still errors that were uncorrectable.
[Screenshot: zpool status output before the scrub]

The image above illustrates what the pool looked like before a scrub. After the scrub, all errors were gone EXCEPT the last one: pool/.system@auto-20160720.2300-2d:<0x0>

No matter what we do, that one does not go away. And even after we run zpool clear, the checksum errors clear on the pool and then accumulate again on an individual top-level vdev, NOT on specific drives.

Here is a shot to illustrate that:

[Screenshot: zpool status showing checksum errors accumulating on a top-level vdev, not on individual drives]


That checksum count will continue to increase until it eventually degrades the pool.
The "errors: 706" count goes down significantly once a scrub completes, but then climbs back up. I'm not exactly sure why, but I speculate that it is due to errors in the metadata somewhere.

After doing some research, we found this in Oracle's documentation: "If non-zero errors are reported for a top-level virtual device, portions of your data might have become inaccessible."

So we know that there is corrupt metadata somewhere on the pool. I assume it has something to do with the pool/.system snapshot task. The thing is, we know about it, and we want to tell the system that we just don't care about it. We have even cleared out all of the snapshots on the system, deleted the .system folder, rebooted, reinstalled the OS, and run multiple scrubs. We know the disks are solid, as we have run countless SMART tests, and the likelihood of 6-8 SAS drives being bad right out of the box is ridiculous.

The question I have is how to get this error out of the system. A scrub won't fix it, and I can't pin down where the file is in order to move or delete it. Does anyone know any tricks? We have been troubleshooting this for the past two weeks to no avail, and any help is greatly appreciated. We do have onsite backups on LTO, but there is A LOT of data to back up, and the existing backups do not contain recent changes or new renders, so we are trying to avoid that unless absolutely necessary.

As it is, the system is actually performing great, the same as before all of this happened. It is just a nuisance to have the same error constantly reported and to have to keep clearing the zpool status so it doesn't go degraded. So again, if there is a way to completely clear this error out, since it is not critical to our needs, and get back to the status quo, that would be some magic I would greatly appreciate.

Thank you all.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Have you tried to destroy that specific snapshot? Maybe you have, and it can't destroy it because it can't read it. In that case, your only options are to keep doing what you are doing, or to re-create the pool.

zfs destroy pool/.system@auto-20160720.2300-2d

Another possibility is to recursively destroy the pool/.system dataset (which might cause other problems), but I don't think you should try that except right before you start rebuilding the pool.
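
If you do go that route right before a rebuild, it would be something along the lines of:

zfs destroy -r pool/.system   # recursively destroys the dataset, its children, and its snapshots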
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
Yes, we destroyed not only that snapshot but all snapshots, to no avail. We also got rid of the .system folder with no success. The metadata of the snapshot I listed above is corrupt. I think that is why this is so funky: we are in essence trying to find and delete something which can no longer be accessed. So yeah, the only things I can really think of are to destroy the pool and rebuild from tape, or to magically tell ZFS not to check that specific snapshot, which we no longer care about.
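
For the record, clearing the snapshots out was roughly the following (the one-liner is from memory, so treat it as a sketch):

zfs list -H -t snapshot -o name -r pool                           # list every snapshot on the pool
zfs list -H -t snapshot -o name -r pool | xargs -n1 zfs destroy   # then destroy them one by one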
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
Just wanted to poke around to see if anyone has ANY idea of what to do here.

Is there some sort of file that stores known system errors? Again, in our case, we are seeing "phantom" errors: they are being reported, but the errors themselves cannot be deleted. Is there a way to clear this cache or file out? We have done the zpool clear, but the top-level errors just come back. Again, any out-of-the-box ideas would be greatly appreciated. We are at a point where creative thinking is becoming a very thin resource.
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
The controllers for the onboard SAS, the internal PCI SAS expander, and the external mini-SAS expander are LSI SAS2308s. They are running 20.00.04.00, 20.00.04.00, and 20.00.02.00 respectively. Also, not sure if it makes a difference, but they are flashed to IR, IT, and IR mode respectively. The external HBA is an LSI/Avago SAS 9205-8e.
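
In case it helps, this is roughly how we pulled those versions from the FreeNAS shell:

sas2flash -listall                # lists each SAS2308 controller with its firmware version
dmesg | grep -i "mps.*firmware"   # the mps driver also reports the firmware it found at boot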
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
They are running 20.00.04.00, 20.00.04.00, and 20.00.02.00 respectively.
Don't quote me, but I vaguely recall something about 20.00.04.00 correcting some false parity errors. However, it did not affect all cards...

I don't think having some in IR vs IT makes a difference, but I run all my stuff in IT.
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
We updated the 9205-8e LSI/Avago card from the 20.00.02.00 firmware to the most current 20.00.08.00 firmware. This did not resolve the issue. I really want to make sure we have tried and explored all options, so I am open to concrete AND hypothetical techniques to mitigate this issue. We REALLY want to avoid migrating the data off and back onto the server, as there is about 144TB to work through. Again, any suggestions are greatly appreciated.
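
For anyone following along, the update itself was done roughly like this with LSI's sas2flash utility (the controller index and firmware file name below are placeholders for our actual ones):

sas2flash -listall                     # confirm which controller number is the 9205-8e
sas2flash -o -c 1 -f 9205-8e_p20.bin   # flash the P20.00.08.00 image to that controller
sas2flash -listall                     # verify the card now reports 20.00.08.00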

Thank you!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
They're permanent errors, you can't magically fix them. You can delete the relevant file/snapshot (which is probably how the others disappeared) to "fix" it - assuming it's not metadata.

This did not resolve the issue.
It'll probably keep it from happening again, though, if it's not a hardware defect of some sort. Early P20 releases had some nasty bugs, and even the latest release still has some annoying compatibility issues.
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
pool/.system@auto-20160720.2300-2d:<0x0> is the only thing that is sticking around. Except it doesn't exist. Nothing related to it exists other than the system telling us that it does, which creates the checksum error. I believe I have read that the "<0x0>" portion following the ":" means that it is metadata. It seems like the system is telling us there is an error with something that does not exist, yet the record of it is still causing issues.
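
If anyone wants to poke at it, my understanding is that the number after the ":" is the object number inside that dataset, and zdb can in theory dump it. I expect it to fail here since the snapshot is gone, and on FreeNAS zdb may also need to be pointed at the cache file, but as a sketch:

zdb -U /data/zfs/zpool.cache -dddd pool/.system@auto-20160720.2300-2d 0   # try to dump object 0x0 of that snapshot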
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
One last question. So we are pretty confident there is NO WAY to fix this issue. We are left with needing to start another pool and move data to it.

The question is, can we just create a new pool and copy the data over or use rsync, or is that going to potentially carry the metadata corruption over to the new pool?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Excellent question.

My best guess is that in this case, with corruption in a .system dataset (which isn't moving), in a particular snapshot (which isn't moving), the corruption would not transfer even if you did a zfs send/zfs receive of the datasets/snapshots that you actually do want to move.

And if you use cp or rsync, it is virtually impossible for the corruption to transfer, unless your data is in fact corrupted (which it doesn't appear to be). Even then, that corruption would not affect the new pool; it would just be user data that isn't very useful to you.
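
Just to sketch the two options (pool and dataset names below are placeholders, and none of this is tested against your setup):

# Option 1: ZFS send/receive of a dataset you actually want to move
zfs snapshot pool/projects@migrate
zfs send pool/projects@migrate | zfs recv newpool/projects

# Option 2: plain file copy; only the user data comes across, no ZFS metadata
rsync -aH /mnt/pool/projects/ /mnt/newpool/projects/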
 

dkusek

Explorer
Joined
Mar 16, 2016
Messages
78
Another thing we are trying is moving the .system dataset to the boot pool, which is mirrored. When we did this, the .system dataset stayed on the main pool, or at least remnants of it did. Therefore, we started using the zfs destroy command to remove the instances of pool/.system datasets we saw in zfs list. We were able to remove everything but "pool/.system"; when we tried to destroy that, we got "cannot iterate filesystem: I/O error". We tried combinations of zfs destroy with the -r, -f, and -R flags, as well as a simple "rm" command (sketched below). The "rm" command said the file or directory did not exist.
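
The attempts on that last dataset were variations of the following (all of the zfs ones failed with the same I/O error; the rm target path is a guess at the mountpoint):

zfs destroy -r pool/.system      # recursive: the dataset plus its children and snapshots
zfs destroy -R pool/.system      # recursive, also removing any dependent clones
zfs destroy -r -f pool/.system   # adds a forced unmount
rm -rf /var/db/system            # plain removal; reported that no such file or directory exists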

We are currently running a scrub and will report back on what happens. We ultimately want the "pool/.system@auto-20160720.2300-2d:<0x0>" error to clear, as that is what has been causing the issue all along.
 