permanent errors in ZFS pool


cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Maybe you won't try to use RAIDZ1 in the future :P
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
I still don't understand what makes you think that a lack of redundancy led to the corruption. According to the following output, the error counter on each and every block device is zero:

NAME                                            STATE   READ WRITE CKSUM
store1                                          ONLINE     0     0     0
  raidz1-0                                      ONLINE     0     0     0
    gptid/7e3d623f-9190-11e2-b8f7-0019990d0a17  ONLINE     0     0     0
    gptid/7eca7240-9190-11e2-b8f7-0019990d0a17  ONLINE     0     0     0
    gptid/77cd1366-9556-11e2-858b-0019990d0a17  ONLINE     0     0     0

So, if ZFS did not encounter any read, write, or checksum errors, how could an additional block device have helped me in the first place?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Those errors are erased the second you do zpool clear.

You have to understand something about how zpool status tells you what is going on.

The fact that it reports no errors on those 3 disks right now provides no indication of the past. The fact that you have unrecoverable errors tells you that at some point you had corruption that was so severe that your redundancy couldn't fix it. More than likely, that means you had no redundancy.
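A minimal sketch of what I mean (pool name hypothetical):

zpool status -v tank   # shows per-disk READ/WRITE/CKSUM counters plus any permanent-error list
zpool clear tank       # zeroes every counter on the spot
zpool status -v tank   # counters read 0 again, but the permanent-error list stays
                       # until the damaged data is gone and a scrub has run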

One example is doing a failed-disk replacement with RAIDZ1. Any time you replace a disk in a RAIDZ1, you have no protection from corruption until the resilver completes. Corruption can still be detected, but that's all; there is no fixing it. (This is the worst-case scenario for RAIDZ1/RAID5 and is why it "died" almost 5 years ago.) RAID6/RAIDZ2 won't even be acceptable after about 2019, because hard drive reliability isn't improving as fast as disk size. This is a physical reality, and there is no solution except more and more redundancy.
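Sketched out (pool and device names hypothetical), the dangerous window looks like this:

zpool replace tank gptid/OLD-DISK gptid/NEW-DISK   # resilver starts; until it completes,
                                                   # RAIDZ1 has no parity left to spare, so a
                                                   # latent error on a surviving disk can be
                                                   # detected but not repaired
zpool status tank                                  # watch the resilver progress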

Another example: you accidentally leave one drive unplugged, boot the system up, and then realize you forgot a disk. So you shut down the server and add the disk back in. Any changes made in the meantime would not be on that third disk without a scrub. And if the data written while the disk was out is itself corrupted, you'll detect the errors but, again, be unable to fix them.
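Again as a sketch (names hypothetical), the catch-up step would be:

zpool online tank gptid/FORGOTTEN-DISK   # reattach the stale disk
zpool scrub tank                         # bring the stale disk back in line; data written
                                         # while it was out had no parity to fall back on,
                                         # so corruption there is detected but not fixed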

It could also mean you have bad RAM (assuming you are using non-ECC RAM) and data was written to the drives corrupted; now it's being read back, and the system knows the written data is trashed but has no fix, because the corrupted redundancy was written to the drives too.

If, at some point in the past, you had had enough redundancy to perform the repair, the repair would have occurred (you'd have seen a non-zero value with zpool status and had to clear it with zpool clear or a system reboot) and that would have been that. No permanent errors. Because you have permanent errors, you had no redundancy at some point in the past.

Hopefully this clears up the confusion. :)
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
It looks like <0x26> refers to the destroyed dataset.
It likely does. Do you have any clones/snapshots that are referencing the now destroyed dataset?
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
Those errors are erased the second you do zpool clear.

I posted my zpool history yesterday in this thread. As you can see in that history, there had been no "zpool clear" prior to the first "zpool status" output, which I posted in the initial post. FreeNAS runs a daily "zpool status" query and mails it to me. On August 18 I got the first email with the error warning, but at the same time the error counters on the disks were ZERO.

One day earlier the counter was ZERO as well and there was _no_ inconsistency in the pool.

Between those two days (08/17 and 08/18) there had been no scrubs, no reboots, no "zpool clear", no exchange of hard drives. Nothing.

It could also mean you have bad RAM (assuming you are using non-ECC RAM) and data was written to the drives corrupted; now it's being read back, and the system knows the written data is trashed but has no fix, because the corrupted redundancy was written to the drives too.

Wouldn't this scenario cause a hit on the checksum error counter? As far as I understand, ZFS stores a checksum for each block. So if corrupted data had been written to the disks, ZFS would detect the corruption by verifying the stored checksum against the calculated checksum. If they differ, I would expect the checksum error counter to go up. But it has been ZERO since the creation of the pool until this very moment (or until the "zpool clear" I did yesterday, for that matter).
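(For what it's worth: as far as I can tell, zdb can walk every allocated block and verify its checksum offline, so treat this as a sketch rather than gospel. "zdb -cc store1" traverses the pool and verifies the checksums of all blocks; it is read-only but slow, and best run on an idle pool.)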

But let's say the corruption was caused by bad RAM. How would an additional disk have helped me there? The corrupted data from the bad RAM would just have been written to one more disk. But how would that help?
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
It likely does. Do you have any clones/snapshots that are referencing the now destroyed dataset?

I did a "zfs destroy -r store1/hosting-backups" so all snapshots of that dataset have been destroyed as well.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
I did a "zfs destroy -r store1/hosting-backups" so all snapshots of that dataset have been destroyed as well.
Did you scrub the pool afterwards? I'm not sure from what you have posted.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I posted my zpool history yesterday in this thread. As you can see in that history, there had been no "zpool clear" prior to the first "zpool status" output, which I posted in the initial post. FreeNAS runs a daily "zpool status" query and mails it to me. On August 18 I got the first email with the error warning, but at the same time the error counters on the disks were ZERO.

One day earlier the counter was ZERO as well and there was _no_ inconsistency in the pool.

Between those two days (08/17 and 08/18) there had been no scrubs, no reboots, no "zpool clear", no exchange of hard drives. Nothing.

But that only tells you that it found the error on August 18th, not that the corruption occurred on that exact day. The corruption could have been from before then. It's like a road with bad potholes: you don't know if the road 5 miles ahead is bad until you get there.

Wouldn't this scenario cause a hit on the checksum error counter? As far as I understand, ZFS stores a checksum for each block. So if corrupted data had been written to the disks, ZFS would detect the corruption by verifying the stored checksum against the calculated checksum. If they differ, I would expect the checksum error counter to go up. But it has been ZERO since the creation of the pool until this very moment (or until the "zpool clear" I did yesterday, for that matter).

That's what it's supposed to do, ideally. But if you don't do regular scrubs to ensure everything is healthy, have bad RAM, or have no redundancy for whatever reason you can think of, then things can go very wrong (as you are seeing).

But let's say the corruption was caused by bad RAM. How would an additional disk have helped me there? The corrupted data from the bad RAM would just have been written to one more disk. But how would that help?

The short answer is it wouldn't have. You could have had RAIDZ100 and it wouldn't have mattered. Bad RAM is the Achilles' heel of ZFS. And now you understand that when the FreeNAS manual and the forum stickies say to use ECC RAM, it's dead serious.

You should get Memtest86+ from www.memtest.org and let it run for 3 passes. It will take some hours; normally I tell people to start it and go to bed. Any errors before you've completed 3 passes mean you have problems.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
But that only tells you that it found the error on August 18th, not that the corruption occurred on that exact day. The corruption could have been from before then. It's like a road with bad potholes: you don't know if the road 5 miles ahead is bad until you get there.


Exactly. Because of a lack of scrubs, it's impossible to tell when the corruption occurred. All we know is that it was detected between the 17th and 18th. The corruption could have happened any time between March 25th and Aug 18th. If you take a healthy zpool and clobber part of a drive with dd, barring any major corruption, the zpool will mount just fine and 'zpool status' will show 0 errors. The errors won't be detected until the affected data is read or written. That's why automated scrubs are important.
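To make that concrete, a destructive sketch (do NOT run this on a pool you care about; pool and device names hypothetical):

dd if=/dev/zero of=/dev/gptid/SOME-MEMBER bs=1m count=10 seek=5000   # clobber part of one member disk
zpool status -v tank   # still 0 errors everywhere; nothing has re-read the damaged blocks yet
zpool scrub tank       # the scrub re-reads everything; now the CKSUM counter climbs and,
                       # with enough redundancy, the damage is repaired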

I don't understand why there weren't any read/write/cksum errors when the problem was eventually detected, though. I would have thought there would have been.

As I've said before: automated scrubs are good, RAIDZ2 is good, drives under 40C are good, and ECC is good.
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
I don't mean to be stubborn, but I want to understand what happened here. So, let's say the pool got corrupted somewhere between March 25 and August 18. Why doesn't a scrub find any bad blocks or trigger any hits on the read/write/checksum error counters? According to "zpool status" there is an inconsistency. So isn't a "scrub" supposed to verify the consistency and find the bad blocks on the affected drive? What does a scrub actually do? I assumed it would verify each block's computed checksum against the stored checksum. So if there were actually an inconsistency in checksums, a scrub should be able to identify the bad blocks along with the disk where those bad blocks reside, right? This way I would know which disk is affected and I could replace it.

But I actually ran a scrub twice since August 18 and it always came back like this:

"scan: scrub repaired 0 in 17h35m with 0 errors on Tue Aug 20 05:45:46 2013"

Even if the inconsistency was caused by bad RAM, a scrub should return something other than "0 errors", right?
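For what it's worth, both lines sit side by side in one and the same status report, which is exactly what confuses me. Roughly (paraphrased from my output):

scan: scrub repaired 0 in 17h35m with 0 errors on Tue Aug 20 05:45:46 2013
errors: Permanent errors have been detected in the following files:
        <0x26>:<0x0>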
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I assume the corruption was found by regular pool activity, since there were no scrubs before the problem showed up.

Again, I am unsure why, when the problems were detected, they didn't increment any of the error counters. Maybe someone else has some ideas on that. I've never seen that before.
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
Does "zpool scrub" only check the used blocks? Is it possible that the inconsitency resides in currently unused blocks and therefore is not found during the scrub? Is there any way to check the unused blocks for parity errors?
 

jbear

Dabbler
Joined
Jul 14, 2013
Messages
29
Alright, one step further: I just found the following blog post, which describes my problem with the permanent errors in deleted files:

http://unixetc.co.uk/2012/01/22/zfs-corruption-persists-in-unlinked-files/

So, apparently, after destroying the affected dataset I had to run a scrub in order to make the "permanent errors" disappear.
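In commands, for anyone who finds this thread later (these are the actual names from my pool):

zfs destroy -r store1/hosting-backups   # destroy the dataset (and its snapshots) holding the damaged files
zpool scrub store1                      # the scrub lets ZFS drop the now-unlinked permanent errors
zpool status -v store1                  # the permanent-error list is finally gone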

Now my pool shows up clean. I will check the RAM and I will check if there's space for an additional disk to upgrade the pool to a RAIDZ2. I'm not sure there's anything I can do about the temperature. But as I said: It's just a backup NAS which does not host any production data, so I guess I can live with that.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Ahh. I thought you had indicated you had run a scrub after deleting the dataset.

A scrub only verifies actual data on the pool. This is also why automated SMART long tests are recommended. Scrubs verify that the data is readable and actually correct; long tests verify that the rest of the disk surface is readable. A long test won't guarantee the empty disk area can store and later retrieve data correctly, but at least it does a simple read scan of the disk(s).

Cyberjock has a good schedule for scrubs / long tests: alternate weekly between the two. On the 1st and 15th, do one; on the 7th and 21st, do the other.
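Sketched as cron-style entries (FreeNAS exposes the same thing in its GUI; pool and device names hypothetical):

# 1st and 15th of the month, 03:00: scrub the pool
0 3 1,15 * * zpool scrub store1
# 7th and 21st of the month, 03:00: SMART long self-test, one line per disk
0 3 7,21 * * smartctl -t long /dev/ada0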

Can you jam any more fans in the case? Upgrade the existing fans to more powerful ones? There has to be something you can do to lower the drive temperatures. I don't have my NAS boxes in an air-conditioned room, so they have to put up with ~35C ambient. I had to upgrade the fans, and also add some fans in non-standard locations, to keep drive temperatures decent. I will say I still sometimes see 42-43C, but that's as hot as mine get. I wouldn't want them to get any hotter.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't mean to be stubborn, but I want to understand what happened here. So, let's say the pool got corrupted somewhere between March 25 and August 18. Why doesn't a scrub find any bad blocks or trigger any hits on the read/write/checksum error counters?

Remember: counters increment only if it can repair those issues, and only since the last clear or reboot. If the error is unrecoverable, then no error counters are incremented; the unrecoverable errors start being listed instead (like what you see).


According to "zpool status" there is an inconsitency. So isn't a "scrub" supposed to verify the consistency and find the bad blocks on the affected drive?

Scrubs do verify consistency, but they also report inconsistency. Bad data is repaired, if possible. If not, you get those crappy unrecoverable errors like yours.

What does a scrub actually do? I assumed it would verify each block's computed checksum against the stored checksum. So if there were actually an inconsistency in checksums, a scrub should be able to identify the bad blocks along with the disk where those bad blocks reside, right? This way I would know which disk is affected and I could replace it.

Normally, yes. But when you don't have enough redundancy to fix an error it instead tells you what is broken so you can restore it from backup if necessary. Normally, if you are lucky, it lists a file name to recover and not metadata. :p

But I actually ran a scrub twice since August 18 and it always came back like this:

"scan: scrub repaired 0 in 17h35m with 0 errors on Tue Aug 20 05:45:46 2013"

Even if the inconsistency was caused by bad RAM, a scrub should return something other than "0 errors", right?

Nope. It will return 0 errors in two cases:

1. Errors it has repaired in the past (since there is no longer an error).
2. Errors that it has found, or is aware of, that it cannot repair (just like what you are getting).
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I assume the corruption was found by regular pool activity, since there were no scrubs before the problem showed up.

Again, I am unsure why, when the problems were detected, they didn't increment any of the error counters. Maybe someone else has some ideas on that. I've never seen that before.

Normally, errors without error counters mean redundancy was lost without the admin knowing, scrubs were virtually never performed while hardware was failing, or the RAM is bad.

Bad RAM is the real PITA, because it's literally like asking an Alzheimer's patient what they forgot: they have no clue, and neither do you.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
So, apparently after destroying the affected dataset I had to run a scrub in order to have the "permanent errors" disappear.
Yes, which is why I was asking; the timeline was unclear to me.
 

N00b

Explorer
Joined
May 31, 2013
Messages
83
@jbear: I can talk about drive temps from personal experience. I figure from the Flipkart reference that we are in the same neck of the woods and live with 35C-40C ambient temps. Take the cabinet and fan suggestions from titan_rw seriously. Moving to a larger cabinet and placing it somewhere with a little more air circulation will help you bring the temps down (I did both: moved to a Cooler Master 343 and from under the table to the top of the book cabinet :)). My drives now run within 1-2C of ambient. I am using the 3TB Barracudas.
 

sirkkalap

Dabbler
Joined
Jan 2, 2014
Messages
12
Remember: counters increment only if it can repair those issues, and only since the last clear or reboot. If the error is unrecoverable, then no error counters are incremented; the unrecoverable errors start being listed instead (like what you see).

Scrubs do verify consistency, but they also report inconsistency. Bad data is repaired, if possible. If not, you get those crappy unrecoverable errors like yours.

Normally, yes. But when you don't have enough redundancy to fix an error it instead tells you what is broken so you can restore it from backup if necessary. Normally, if you are lucky, it lists a file name to recover and not metadata. :p

Nope. It will return 0 errors in two cases:

1. Errors it has repaired in the past (since there is no longer an error).
2. Errors that it has found, or is aware of, that it cannot repair (just like what you are getting).


Great points to know. I am also seeing permanent errors on one file (luckily just one), and I read through this thread to understand why the error counters show 0. My intuition says the error counters should also count permanent errors, but I guess it is impossible to tell where the error is when it is permanent; that is to say, there is no redundant copy that is not corrupted, so every copy is broken. And yes, that sounds like a RAM problem on my rig too.
 