SOLVED Resilver every reboot

ajschot · May 26, 2017

I now get new errors on another snapshot... Could it be a problem that these snapshots were made with Corral?
The snapshots from 20170425 were OK before, but after the last resilver I get errors again (same files), now in the snapshot from 20170425.

Looks like a compatibility problem with the Corral snapshots. I think I have to delete them all and try to resilver again. The first error in the file is gone too; I only have 13 new errors in other snapshots. Could Corral use a different checksum algorithm?

cyberjock · May 26, 2017

I've seen resilvers autostart on a pool mount. Historically this has been due to a zpool with corruption. If enough corruption (or even apparent corruption due to something like a bad cable) is found on a disk, then ZFS decides that resilvering is necessary and immediately begins the resilver. I would try this:

1. Delete the offending files that are corrupted. If it's metadata, you may end up having to destroy and recreate the zpool, then restore data from backup.
2. Replace any bad disks, cables, etc. that you may have.
3. Let a resilver finish.
4. Do a scrub.
5. Once that scrub is done, do another one.
6. Reboot FreeNAS and see if a resilver starts. If it does, replace whatever disk is being labeled as "resilvering" and start with #1 again.

The cause isn't going to be known since we're past the cause. Now all you can do is try to recover. The easier answer would be to destroy the zpool and restore from backup. Since your zpool has corruption, I'd recommend a new zpool rather than try to save the zpool you have.
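
The checklist above comes down to reading the `scan:` line of `zpool status` after each resilver or scrub and acting on the error count. Below is a minimal Python sketch of that decision. It assumes the line format shown later in this thread ("resilvered 937G in 4h15m with 13 errors on ..."); the scrub wording in the regex and the helper names `last_scan` and `next_step` are assumptions for illustration, not part of FreeNAS.

```python
import re

# Assumed format of a finished scan, modelled on the line posted in this thread:
# "scan: resilvered 937G in 4h15m with 13 errors on Fri May 26 15:37:05 2017"
SCAN_RE = re.compile(
    r"scan:\s+(?P<kind>resilvered|scrub repaired)\s+\S+\s+in\s+\S+"
    r"\s+with\s+(?P<errors>\d+)\s+errors"
)


def last_scan(status_text):
    """Return (kind, error_count) for the last finished scan, or None."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None  # scan still running, never run, or an unknown format
    kind = "resilver" if match.group("kind") == "resilvered" else "scrub"
    return kind, int(match.group("errors"))


def next_step(status_text):
    """Map the last scan result onto the checklist (hypothetical helper)."""
    result = last_scan(status_text)
    if result is None:
        return "wait for the running scan to finish"
    kind, errors = result
    if errors:
        return "delete or restore the listed files, then scan again"
    if kind == "resilver":
        return "run a scrub"
    return "scrub was clean: repeat once, then reboot and watch for a resilver"
```

Feeding it the text of `zpool status` after each pass gives the next item on the list; anything it cannot parse is treated as "not finished yet" rather than guessed at.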

ajschot · May 26, 2017

cyberjock said:
I've seen resilvers autostart on a pool mount. Historically this has been due to a zpool with corruption. If enough corruption (or even apparent corruption due to something like a bad cable) is found on a disk, then ZFS decides that resilvering is necessary and immediately begins the resilver. I would try this:

1. Delete the offending files that are corrupted. If it's metadata, you may end up having to destroy and recreate the zpool, then restore data from backup.
2. Replace any bad disks, cables, etc. that you may have.
3. Let a resilver finish.
4. Do a scrub.
5. Once that scrub is done, do another one.
6. Reboot FreeNAS and see if a resilver starts. If it does, replace whatever disk is being labeled as "resilvering" and start with #1 again.

The cause isn't going to be known since we're past the cause. Now all you can do is try to recover. The easier answer would be to destroy the zpool and restore from backup. Since your zpool has corruption, I'd recommend a new zpool rather than try to save the zpool you have.

This is what I did! And yes, it starts resilvering the new disk again, and when that completed I had 13 new errors: data corruption in the next snapshot. I can predict what happens when I delete those: it will find 13 checksum errors in the next day's snapshots right after the resilver completes, and so on.
Would it be possible that the Corral snapshots are incompatible, and FreeNAS 11 thinks they are corrupt? After the latest resilver, zpool status showed errors only in the snapshots. There are also no SMART errors on any of the disks.
The corrupted snapshots were all made with FreeNAS Corral.

Also, these are not read or write errors but checksum errors, if I understand my zpool status correctly.

rs225 · May 26, 2017

You say the first error file, which had no snapshot, no longer shows an error? But you didn't delete the file?

That would mean valid data was read correctly by the drives, and then scrambled or mis-calculated yesterday. If that is what happened, your hardware (or the hardware combined with virtualization) is not working right.

ajschot · May 26, 2017

rs225 said:
You say the first error file, which had no snapshot, no longer shows an error? But you didn't delete the file?

That would mean valid data was read correctly by the drives, and then scrambled or mis-calculated yesterday. If that is what happened, your hardware (or the hardware combined with virtualization) is not working right.

That is correct, so it could be a problem in combination with ESXi. I will try to make a bootable USB stick to test...
But then I come to my problem: I need ESXi because of all the things I want to use; when Docker support returns, that problem is mostly gone. But more important: this is a problem of ESXi in combination with FreeNAS 11; with Corral I never had these strange data problems.
And why are other people not having the same problem? I am not the only one using ESXi, though with 9.10.2.
I am on FN11, but since I don't use VMs in FN11 I can switch to 9.10 or go back to where I came from, Corral. I still have the VM, so I can switch back.

ajschot · May 27, 2017

Here is today's pool status. The only thing deleted was the snapshot from 20170424, so no other files (the actual files are good; only the snapshots seem to be affected)...

Code:

  pool: Data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 937G in 4h15m with 13 errors on Fri May 26 15:37:05 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            DEGRADED     0     0    34
          raidz2-0                                      DEGRADED     0     0    68
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/a6c19d32-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/a7937aca-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0
            gptid/aaff7d0f-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/ad0ecf4a-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/ac78268c-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/acf30023-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Suske en Wiske & De Texas Rakkers (2009).mkv
        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 7x07.avi
        Data/Videos@Data-auto-20170425.2030:/Films/Once Upon a Time in the West (1968).mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Mary Poppins (1964) 1080p.mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Pinocchio (1940).mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Planes (2013).mkv
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Romeo + Juliet/Romeo + Juliet (HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Music/Stotijn/33613/05_Rota-Divertimetno Concertanto-Allegro.flac
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Horrible Bosses/Horrible Bosses (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Ice Age_ het mysterie van de eieren/Ice Age_ het mysterie van de eieren (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/La crème de la crème/La crème de la crème (1080p HD).m4v

And the same files are listed in the snapshots, while the real files are not affected (so it seems).
But why are all disks degraded... this is not going well...
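
A status dump like the one above is easier to compare from day to day once it is reduced to a few numbers: which disks are flagged, what the CKSUM column says per row, and how many damaged files each snapshot holds. The Python sketch below does that for an abbreviated copy of the output. The function name `summarize` is hypothetical, and the parsing assumes the column layout shown in this thread. In this output the per-disk CKSUM counters are all 0 while the counts sit on the raidz2-0 and pool rows, which is at least consistent with the later suggestion that the drives themselves are not the source.

```python
from collections import Counter

# Abbreviated copy of the status output from this thread (two disks, two files).
SAMPLE = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            DEGRADED     0     0    34
          raidz2-0                                      DEGRADED     0     0    68
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
"""


def summarize(status_text):
    """Return (degraded_disks, cksum_by_name, damaged_files_per_snapshot)."""
    degraded = []
    cksum = {}
    per_snapshot = Counter()
    in_errors = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("errors:"):
            in_errors = True
            continue
        if in_errors:
            # Snapshot entries look like "dataset@snapshot:/path/to/file".
            if "@" in line and ":/" in line:
                per_snapshot[line.split(":/", 1)[0]] += 1
            continue
        fields = line.split()
        # Device rows: NAME STATE READ WRITE CKSUM [optional note]
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            name, state = fields[0], fields[1]
            cksum[name] = int(fields[4])
            if state == "DEGRADED" and name.startswith("gptid/"):
                degraded.append(name)
    return degraded, cksum, per_snapshot
```

Only snapshot entries are counted in the file list; a damaged live file is reported with a plain path and would need its own rule.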

ajschot · May 27, 2017

Back in Corral: no issues found yet, and no resilver started either. It all seems to be fine.

I will try a scrub in Corral to see if that changes anything...

Does anybody know if the zfs version is the same in Corral as in FN11? I think it is; if it is different, FN11 would be newer, right?

Ericloewe · May 27, 2017

ajschot said:
Does anybody know if the zfs version is the same in Corral as in FN11?

It is.

ajschot · May 27, 2017

I have to dig a little deeper into this. I need to do a hard disk test with something else... before real data gets corrupted.

rs225 · May 27, 2017

Have you checked to see if there is a firmware update for your LSI card? Or can it be placed in a different slot?

ajschot · May 27, 2017

rs225 said:
Have you checked to see if there is a firmware update for your LSI card? Or can it be placed in a different slot?

I did not try a different slot, but both cards have the same, latest (p20) firmware.

I found out that Corral is also trying to resilver the disk with errors, but it is only shown in the shell, not in the CLI, not in the tasks, and not in the GUI.
I think I will try to make a local backup. I have a spare 6TB pool, which would be enough for the most important data. I have to find an easy way to copy it: it is in the same machine but in another vdev, and it uses the other LSI interface.
I deleted the snapshots, but the original files are now giving errors too.
I am scared that if I delete these files, other files will become affected too.
In the meantime I changed some hardware too, so it could also have to do with the internal memory back then, or with the SATA card back then.

But how is it possible to get corrupt data in a RAIDZ2 with 8 drives when only 22% of the space is used? There were only SMART seek time errors on one disk, and that disk was replaced after 3 days; I took it offline when I saw the message.

I don't understand how data gets corrupted. Before the build in January I tested the memory; the only thing left is maybe a problem with the PCI slot or with one of the LSI cards. I have a CrashPlan backup, but the files are probably also corrupt in the backup...

So steps to check:
1. Do a memory test again?
2. Put the card that is connected to the drives of the corrupted pool in another PCI slot. (Is there a more direct way to test the PCI slot?)

I now have other SATA cables, but if it happened earlier then I will never get the data back.
I have run new SMART tests and again there are no errors on any of the drives...

Is the smartest way to back up to a local machine, skip the files that are giving problems, then wipe the zpool and put everything back? When I look at the affected files, I think it is from a while ago.
The strange thing is that I can play those movies; I skipped through them and have not found problems. I do have them in an online backup, but downloading all the data takes a lot of time, so maybe making a local backup and putting it back later is a better idea.

EDIT:
I found a strange mount point...
In /mnt/ are my 3 pools, but also a folder called mnt, and inside that are my 3 pools again. Is this normal? That /mnt/mnt/Data is empty, though.

Schermafbeelding 2017-05-27 om 23.32.08.png

ajschot · May 27, 2017

Update:
I found out these files are missing from my CrashPlan backup, so it seems those files were already corrupt... probably from when I copied them to FreeNAS. Could that be right?
It could be that I copied them to FreeNAS when I had my AMD system, with one broken memory DIMM.
I know I need to scrub now and then, but I don't know if I ever did that before replacing the bad drive.

ajschot · May 27, 2017

rs225 said:
You say the first error file, which had no snapshot, no longer shows an error? But you didn't delete the file?

That would mean valid data was read correctly by the drives, and then scrambled or mis-calculated yesterday. If that is what happened, your hardware (or the hardware combined with virtualization) is not working right.

This was right, but the errors are now back, and all the original files named in the snapshots are now reported as not OK, so maybe the files were already corrupted when I copied them...?

What is the quickest and safest way of copying files from one zpool to another? (In the same machine; is the shell the best way?)

rs225 · May 27, 2017

I would only recommend that you back up your critical data to a desktop computer. Moving data from one pool to another on the same machine that has problems is not a good idea.

Before you do anything else, I recommend you make sure you still have at least one recent snapshot on every dataset where you deleted the snapshots. It is not safe to have no snapshots.

If you do want to move the data, rsync is probably the best tool in this situation.

If the AMD machine was running FreeNAS, that could be a source of these problems. If not, then it did not cause these errors.

Raidz2 has nothing to do with this problem, because the drives have nothing to do with this problem. The computer itself is the problem.
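
For the "copy what is still readable and note what is not" idea raised earlier in the thread, here is a minimal Python sketch. It is not rsync, which remains the tool recommended above; it only illustrates the logic. The helper name `copy_tree_skipping_errors` is hypothetical, and the sketch assumes that a file with unreadable blocks surfaces as an `OSError` during the copy.

```python
import os
import shutil


def copy_tree_skipping_errors(src_root, dst_root):
    """Copy src_root into dst_root; return relative paths that failed to read."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst)  # copies file data and timestamps
            except OSError:
                # Assumption: an unreadable block shows up as an I/O error here.
                failed.append(os.path.normpath(os.path.join(rel_dir, name)))
                if os.path.exists(dst):
                    os.remove(dst)  # do not keep a partial copy
    return failed
```

The returned list is the set of files to restore from another source afterwards. A copy made this way on the same suspect machine inherits the same risk described above, so it does not replace a backup to a separate computer.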

danb35 · May 27, 2017

rs225 said:
It is not safe to have no snapshots.

¿Que?

ajschot · May 28, 2017

rs225 said:
I would only recommend that you back up your critical data to a desktop computer. Moving data from one pool to another on the same machine that has problems is not a good idea.

Before you do anything else, I recommend you make sure you still have at least one recent snapshot on every dataset where you deleted the snapshots. It is not safe to have no snapshots.

If you do want to move the data, rsync is probably the best tool in this situation.

If the AMD machine was running FreeNAS, that could be a source of these problems. If not, then it did not cause these errors.

Raidz2 has nothing to do with this problem, because the drives have nothing to do with this problem. The computer itself is the problem.

The machine back then was running these disks and I never had problems. Like I wrote, I could play those movies without any problem, but I could not copy them. However, the alerts are annoying, so I deleted all snapshots and the affected files, and did one scrub (the second is now running); the first did not find any problem.
I still don't understand why I could play those files without problems but could not copy them... but well, they were not important files.
I will reboot after the second scrub, but for now it looks like this problem is gone, and I need to schedule scrubs ;-)

rs225 · May 28, 2017

danb35 said:
¿Que?

Murphy's Law: "If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." Or ransom-ware.

ajschot said:
for now it looks like this problem is gone, and I need to schedule scrubs ;-)

Yes, scrubs should be on an automatic schedule with notification to you if something goes wrong. If similar errors return, they will be related again to whatever caused this incident.

A copy will fail if there is a read error, but video players will attempt to skip over damage or read errors. Or, you just never hit the damaged area, or the damage doesn't exist: it was an intermittent read problem in your system. (Perhaps only under heavy load, like a scrub or copy, but not a playback.)
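
The difference between a copy and a playback can be shown with a toy model in Python. Everything here is a simulation under stated assumptions: a file is a list of blocks, `None` marks a block that cannot be read, and `DamagedBlock`, `copy_file` and `play_file` are hypothetical names, not how any real copy tool or player is written.

```python
class DamagedBlock(Exception):
    """Stands in for the I/O error raised when a block cannot be read."""


def read_block(blocks, index):
    block = blocks[index]
    if block is None:  # None marks a block that fails to read
        raise DamagedBlock(index)
    return block


def copy_file(blocks):
    """A copy needs every block, so the first read error aborts it."""
    return [read_block(blocks, i) for i in range(len(blocks))]


def play_file(blocks):
    """A player keeps going and drops whatever it cannot read."""
    played, skipped = [], []
    for i in range(len(blocks)):
        try:
            played.append(read_block(blocks, i))
        except DamagedBlock:
            skipped.append(i)
    return played, skipped
```

With one bad block in the middle, `play_file` still returns the rest of the file and reports the skipped index, while `copy_file` raises on the same input. That matches the observation in this thread that the movies played but could not be copied.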

danb35 · May 28, 2017

rs225 said:
"If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." Or ransom-ware.

So, "It is not safe to have no snapshots"? No, I don't think so. Sure, they add some protection, and they're often a good idea, but I think you're overstating both the risk and the benefit.

ajschot · May 29, 2017

Last update:
No resilver and no checksum errors anymore, so the problem looks solved. I will keep an eye on things. The affected data came from my old ReadyNAS, and back then I had other hardware in my FreeNAS machine (including memory that was not good), but it seems good now.

ajschot · May 29, 2017

danb35 said:
So, "It is not safe to have no snapshots"? No, I don't think so. Sure, they add some protection, and they're often a good idea, but I think you're overstating both the risk and the benefit.

I think you are right. Snapshots are just a little protection but never as good as a real backup. Only, my corrupt files were already there before the backup, so the backup was corrupt too.
