SwisherSweet
Contributor
- Joined
- May 13, 2017
- Messages
- 139
I had a drive fail, and it went OFFLINE in my backup1 pool. I replaced the drive using the procedure found in this forum, where I power down, remove old drive, power up, choose REPLACE, and the drive resilvered and the volume shows HEALTHY.
While I was waiting for my new drive to arrive in the mail, I started to see messages about "data corruption":
Code:
Device: /dev/ada4, ATA error count increased from 0 to 41
Device: /dev/ada4, 19 Currently unreadable (pending) sectors
The volume backup1 (ZFS) state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
Device: /dev/ada4, Self-Test Log error count increased from 0 to 1
Device: /dev/ada4, 19 Offline uncorrectable sectors
Device: /dev/ada4, unable to open device
After replacing the drive, I still get this error:
Code:
CRITICAL: Feb. 11, 2018, 7:33 p.m. - The volume backup1 (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
I am running a SCRUB on backup1 now that the resilvering process is complete.
My FreeNAS server has been running great for about a year, and until now I hadn't had any data issues or messages about data loss.
But I am wondering:
- Why am I still getting an error message after replacing the failed drive?
- Is there really data corruption? If so, why?
- Since the volume it is complaining about is `backup1`, which holds only snapshots of my main data, shouldn't I be able to restore the affected data somehow?
- Shouldn't the system be able to replace a failed drive without data corruption? I don't understand what could have gone wrong. The data corruption messages appeared both just before and after I replaced the failed drive, and I run weekly scrubs of my data.
A little more setup information:
- Mac Pro, 2 x 6 Core 3.46GHz Xeons, 64GB ECC RAM
- FreeNAS-9.10.2-U6 (561f0d7a1)
- 7 x 3TB Toshiba drives in "primary" data pool in raidz2
- 5 x 2TB Seagate drives in "backup1" backup pool (externally attached) in raidz1
- 5 x 2TB Seagate drives in "backup2" backup pool (externally attached) in raidz1
I found this on github, but not sure it's relevant to my situation:
https://github.com/zfsonlinux/zfs/issues/3256
It appears that when the drive failed, data was corrupted. However, it is my understanding that the nature of ZFS is to prevent exactly this. Doesn't ZFS checksum and verify each write? Plus, with RAIDZ isn't that write effectively duplicated? I just don't see how a failed disk could cause data corruption with ZFS and FreeNAS.
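One way to see how a single failing disk can still leave unrecoverable data on a RAIDZ1 pool is to reduce parity to its simplest form. The sketch below is a toy single-parity (XOR) model, not the actual RAIDZ implementation: a stripe with one parity block can be rebuilt after exactly one erasure, but a failed drive plus a latent bad sector on a surviving drive is two erasures in the same stripe, which single parity cannot fix. That is why latent errors discovered while a pool is degraded (or resilvering) show up as permanent data errors.

```python
# Toy single-parity model of a RAIDZ1-style stripe (simplified to plain XOR).
from functools import reduce

def xor_parity(blocks):
    """Compute the parity block for a stripe of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(stripe, missing):
    """Recover a single missing block by XOR-ing the survivors."""
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one erasure per stripe")
    surviving = [b for i, b in enumerate(stripe) if i not in missing]
    return xor_parity(surviving)

# Four data blocks plus one parity block: a 5-wide RAIDZ1 stripe, simplified.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_parity(data)]

# One failed drive: the lost block is fully recoverable from the other four.
assert rebuild(stripe, {2}) == b"CCCC"

# Failed drive PLUS a bad sector on a survivor: two erasures, no recovery.
try:
    rebuild(stripe, {2, 4})
except ValueError as err:
    print("unrecoverable:", err)
```

In this model, weekly scrubs matter because they surface latent bad sectors while the pool still has its full redundancy and can repair them; once a whole drive is already gone, any previously undetected bad sector on the remaining drives becomes that second, unrecoverable erasure.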