SOLVED Resilver every reboot

Status
Not open for further replies.

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
I'm now getting new errors on another snapshot. Could the problem be that these snapshots were made with Corral?
The snapshots from 20170425 were OK, but after the last resilver I get errors again (same files), this time in the snapshot from 20170425.

It looks like a compatibility problem with Corral snapshots. I think I have to delete them all and try to resilver again. The first file error is gone too; I only have 13 new errors in other snapshots. Could Corral use a different checksum algorithm?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I've seen resilvers autostart on a pool mount. Historically this has been due to a zpool with corruption. If enough corruption (or even apparent corruption due to something like a bad cable) is found on a disk, then ZFS decides that resilvering is necessary and immediately begins the resilver. I would try this:

1. Delete the offending files that are corrupted. If it's metadata, you may end up having to destroy and recreate the zpool, then restore data from backup.
2. Replace any bad disks, cables, etc. that you may have.
3. Let a resilver finish.
4. Do a scrub.
5. Once that scrub is done, do another one.
6. Reboot FreeNAS and see if a resilver starts. If it does, replace whatever disk is being labeled as "resilvering" and start with #1 again.

The cause isn't going to be known since we're past the cause. Now all you can do is try to recover. The easier answer would be to destroy the zpool and restore from backup. Since your zpool has corruption, I'd recommend a new zpool rather than try to save the zpool you have.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
This is what I did! And yes, it starts resilvering the new disk again, and when that completed I had 13 new errors: data corruption in the next snapshot. I can predict what happens when I delete those: right after the resilver completes it will find 13 checksum errors in the next day's snapshots, and so on.
Could the Corral snapshots be incompatible, so that FreeNAS 11 thinks they are corrupt? In zpool status I only had errors in the snapshots.
That was after the latest resilver. There are also no SMART errors on any of the disks.
The corrupted snapshots were all made with FreeNAS Corral.

Also, these are not read or write errors but checksum errors, if I understand my zpool status correctly.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You say the first error file, which had no snapshot, no longer shows an error? But you didn't delete the file?

That would mean valid data was read correctly by the drives, and then scrambled or mis-calculated yesterday. If that is what happened, your hardware (or the hardware combined with virtualization) is not working right.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
That is correct, so it could be a problem in combination with ESXi. I will try to make a bootable USB stick to test...
But that brings me to my problem: I need ESXi for all the things I want to use. Once Docker support returns, that need will mostly be gone. More importantly, this is a problem of ESXi in combination with FreeNAS 11; with Corral I never had these strange data problems.
And why are other people not seeing the same thing? I am not the only one using ESXi, although the others are on 9.10.2.
I am on FN11, but since I don't use VMs in FN11 I can switch to 9.10 or go back to where I came from, Corral. I still have the VM, so I can switch back.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
Here is today's pool status. The only thing deleted was the snapshot from 20170424, no other files (the actual files are good; only the snapshots seem to be affected).
Code:
  pool: Data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 937G in 4h15m with 13 errors on Fri May 26 15:37:05 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Data                                            DEGRADED     0     0    34
          raidz2-0                                      DEGRADED     0     0    68
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/a6c19d32-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/a7937aca-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE       0     0     0
            gptid/aaff7d0f-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/ad0ecf4a-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/ac78268c-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors
            gptid/acf30023-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Suske en Wiske & De Texas Rakkers (2009).mkv
        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 7x07.avi
        Data/Videos@Data-auto-20170425.2030:/Films/Once Upon a Time in the West (1968).mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Mary Poppins (1964) 1080p.mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Pinocchio (1940).mkv
        Data/Videos@Data-auto-20170425.2030:/KinderFilms/Planes (2013).mkv
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Romeo + Juliet/Romeo + Juliet (HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Music/Stotijn/33613/05_Rota-Divertimetno Concertanto-Allegro.flac
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Horrible Bosses/Horrible Bosses (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/Ice Age_ het mysterie van de eieren/Ice Age_ het mysterie van de eieren (1080p HD).m4v
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/La crème de la crème/La crème de la crème (1080p HD).m4v

The same files are listed in the snapshots, while the live files do not seem to be affected.
But why are all disks degraded? This is not going well...
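
For anyone who wants to check this kind of output with a script, here is a minimal Python sketch (not part of the original post) that pulls the per-device error counters and the damaged snapshot paths out of `zpool status -v` text. The function name `parse_zpool_status`, the regular expressions, and the trimmed `SAMPLE` string are assumptions made for illustration; the sample keeps only a few lines of the status output in this post. It assumes plain integer counters and only groups entries that live in a snapshot.

```python
import re

# Trimmed excerpt of the status output in this post (illustrative only).
SAMPLE = """\
  pool: Data
 state: DEGRADED
  scan: resilvered 937G in 4h15m with 13 errors on Fri May 26 15:37:05 2017
config:

        NAME        STATE     READ WRITE CKSUM
        Data        DEGRADED     0     0    34
          raidz2-0  DEGRADED     0     0    68
            gptid/a6a6f8ac-ba6f-11e6-a6a7-6805ca0cfed6  DEGRADED  0  0  0  too many errors
            gptid/41a9a23d-3f33-11e7-a87b-000c29bfa44f  ONLINE    0  0  0

errors: Permanent errors have been detected in the following files:

        Data/Videos@Data-auto-20170425.2030:/Series/MacGyver/Macgyver 2x02.avi
        Data/Music@Data-auto-20170425.2030:/iTunes/iTunes Media/Movies/42/42 (1080p HD).m4v
"""

# A config row: indented name, a known state, then READ / WRITE / CKSUM counters.
DEVICE_RE = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)
# A damaged-file row inside a snapshot: dataset@snapshot:/path
ERROR_RE = re.compile(r"^\s+([^@\s]+)@([^:\s]+):(/.*)$")


def parse_zpool_status(text):
    """Return (devices, damaged): counters per device, paths per snapshot."""
    devices, damaged = {}, {}
    in_errors = False
    for line in text.splitlines():
        if line.startswith("errors:"):
            in_errors = True
            continue
        if in_errors:
            match = ERROR_RE.match(line)
            if match:
                dataset, snapshot, path = match.groups()
                damaged.setdefault(f"{dataset}@{snapshot}", []).append(path)
            continue
        match = DEVICE_RE.match(line)
        if match:
            name, state, read, write, cksum = match.groups()
            devices[name] = {
                "state": state,
                "read": int(read),
                "write": int(write),
                "cksum": int(cksum),
            }
    return devices, damaged


if __name__ == "__main__":
    devices, damaged = parse_zpool_status(SAMPLE)
    for name, info in devices.items():
        print(name, info["state"], "cksum =", info["cksum"])
    for snapshot, paths in damaged.items():
        print(snapshot, len(paths), "damaged file(s)")
```

Fed the full output from this post, the nonzero CKSUM counts should show up only on the `Data` and `raidz2-0` rows, while every member disk reports 0.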
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
Back in Corral: no issues found yet, and no resilver started either. It all seems to be fine.

I will try a scrub in Corral to see if that changes anything...

Does anybody know if the ZFS version is the same in Corral as in FN11? I think it is; if it is different, FN11 would be newer, right?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
I have to dig a little deeper into this. I need to run a hard disk test with something else before real data gets corrupted.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Have you checked to see if there is a firmware update for your LSI card? Or can it be placed in a different slot?
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
I did not try a different slot, but both cards have the same, latest (P20) firmware.

I found out that Corral is also trying to resilver the disk with errors, but this is only shown in the shell: not in the CLI, not in tasks, and not in the GUI.
I think I will try to make a local backup. I have a spare 6TB pool, which would be enough for the most important data. I have to find an easy way to copy it; the spare pool is in the same machine but on another vdev, and it uses the other LSI interface.
I deleted the snapshots, but the original files are now giving errors too.
I am scared that if I delete those files, other files will become affected too.
In the meantime I changed some hardware too, so it could also have to do with the internal memory or the SATA card I had back then.

But how is it possible to get corrupt data in a RAIDZ2 with 8 drives when only 22% of the space is used and there were only SMART seek-time errors? That disk was replaced after 3 days, and I took it offline as soon as I saw the message.

I don't understand how the data got corrupted. Before the build in January I tested the memory; the only thing left is maybe a problem with the PCI slot or with one of the LSI cards. I have a CrashPlan backup, but the files are probably corrupt in the backup as well...

So, steps to check:
1. Do a memory test again?
2. Put the card that is connected to the drives of the corrupted pool in another PCI slot. (Is there a more direct way to test the PCI slot?)

I now have other SATA cables, but if the corruption happened earlier then I will never get the data back.
I have run new SMART tests and again there are no errors on any of the drives...

Is the smartest way to back up to a local machine, skip the files that are giving problems, then wipe the zpool and put everything back? Looking at the affected files, I think the damage is from a while ago.
The strange thing is that I can play those movies; I skipped through them and did not find any problems. I do have them in an online backup, but downloading all the data takes a lot of time, so maybe making a local backup and putting it back later is a better idea.

EDIT:
I found a strange mount point...
In /mnt/ are my 3 pools, but also a folder called mnt, and inside that are my 3 pools again. Is this normal? /mnt/mnt/Data is empty, though.
Schermafbeelding 2017-05-27 om 23.32.08.png
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
Update:
I found out these files are missing from my CrashPlan backup, so it seems those files were already corrupt... probably from when I copied them to FreeNAS. Could that be right?
It could be that I copied them to FreeNAS when I had my AMD system, with one broken memory DIMM.
I know I need to scrub now and then, but I don't know if I ever did that before replacing the bad drive.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
rs225 said:
You say the first error file, which had no snapshot, no longer shows an error? But you didn't delete the file?

That would mean valid data was read correctly by the drives, and then scrambled or mis-calculated yesterday. If that is what happened, your hardware (or the hardware combined with virtualization) is not working right.
This was right, but the errors are now back, and all the original files named in the snapshots are now reported as damaged too. So maybe the files were already corrupted when I copied them?

What is the quickest and safest way to copy files from one zpool to another? (In the same machine; is the shell the best way?)
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
I would only recommend that you back up your critical data to a desktop computer. Moving data from one pool to another on the same machine that has problems is not a good idea.

Before you do anything else, I recommend you make sure you still have at least one recent snapshot on every dataset where you deleted the snapshots. It is not safe to have no snapshots.

If you do want to move the data, rsync is probably the best tool in this situation.

If the AMD machine was running FreeNAS, that could be a source of these problems. If not, then it did not cause these errors.

Raidz2 has nothing to do with this problem, because the drives have nothing to do with this problem. The computer itself is the problem.
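
To illustrate the "copy what is readable, skip what is not" idea raised earlier in the thread, here is a rough Python sketch. It is not rsync and not something posted in this thread; the function names `sha256_of` and `copy_tree_skipping_errors` are made up for the example. It copies each file, re-reads source and destination to compare SHA-256 digests, and records any file that raises an I/O error instead of aborting. Note the caveat above still applies: on a machine with suspect hardware, the re-read may be served from cache and pass through the same faulty path, so a matching digest is only a weak check.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_tree_skipping_errors(src_root, dst_root):
    """Copy every file under src_root; return (copied, failed) relative paths."""
    copied, failed = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            rel = os.path.normpath(os.path.join(rel_dir, name))
            try:
                shutil.copy2(src, dst)
                if sha256_of(src) != sha256_of(dst):
                    raise OSError("checksum mismatch after copy")
                copied.append(rel)
            except OSError as exc:
                # Unreadable or mismatching file: drop the partial copy, move on.
                if os.path.exists(dst):
                    os.remove(dst)
                failed.append((rel, str(exc)))
    return copied, failed


if __name__ == "__main__":
    # Self-contained demo on throwaway directories (hypothetical file names).
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        os.makedirs(os.path.join(src, "Series"))
        with open(os.path.join(src, "Series", "episode.bin"), "wb") as handle:
            handle.write(b"x" * 4096)
        copied, failed = copy_tree_skipping_errors(src, dst)
        print("copied:", copied)
        print("failed:", failed)
```

The `failed` list is the useful part: it names the files that would have to come from another backup.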
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
The machine was running these disks back then and I never had problems. As I wrote, I could play those movies without any problem, but I could not copy them. However, the alerts were annoying, so I deleted all snapshots and the affected files and ran one scrub (the second is running now); the first did not find any problems.
I still don't understand why I could play those files without problems but could not copy them... but well, they were not important files.
I will reboot after the second scrub, but for now it looks like this problem is gone, and I need to schedule scrubs ;-)
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Murphy's Law: "If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." Or ransom-ware.

ajschot said:
for now it looks like this problem is gone, and I need to schedule scrubs ;-)
Yes, scrubs should be on an automatic schedule with notification to you if something goes wrong. If similar errors return, they will be related again to whatever caused this incident.
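
As a rough illustration of the notification side, here is a small Python sketch that reads the `scan:` line of `zpool status` output and flags the pool when the last completed scan is older than a threshold. The names `last_scan` and `scan_is_stale` and the 35-day default are assumptions for the example, not FreeNAS settings taken from this thread; the sample line is the one from the status output posted earlier. Date parsing assumes English day and month names.

```python
import re
from datetime import datetime, timedelta

# Scan line copied from the status output posted earlier in this thread.
SCAN_LINE = "  scan: resilvered 937G in 4h15m with 13 errors on Fri May 26 15:37:05 2017"

# Matches a finished scrub or resilver and captures the completion timestamp.
SCAN_RE = re.compile(
    r"^[ \t]*scan:\s+(scrub|resilver)\w*\b.*\bon\s+"
    r"(\w{3} \w{3}\s+\d+ [\d:]+ \d{4})\s*$",
    re.MULTILINE,
)


def last_scan(status_text):
    """Return (kind, finished_at) for the last completed scan, or None."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None
    kind, stamp = match.groups()
    stamp = " ".join(stamp.split())  # collapse padding before single-digit days
    return kind, datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def scan_is_stale(status_text, now, max_age_days=35):
    """True when no completed scan is found or the last one is too old."""
    result = last_scan(status_text)
    if result is None:
        return True
    _kind, finished = result
    return now - finished > timedelta(days=max_age_days)


if __name__ == "__main__":
    print(last_scan(SCAN_LINE))
    print("stale:", scan_is_stale(SCAN_LINE, now=datetime.now()))
```

A resilver is not a scrub, so a real check would probably want to treat `kind == "resilver"` differently; the sketch just reports which one it saw.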

A copy will fail if there is a read error, but video players will attempt to skip over damage or read errors. Or you just never hit the damaged area, or the damage doesn't exist: it was an intermittent read problem in your system. (Perhaps only under heavy load, like a scrub or copy, but not a playback.)
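
The difference between a copy and a player can be sketched in a few lines of Python. This is only an illustration, assuming the operating system reports a damaged region as an I/O error on read; the function name `find_unreadable_blocks` is made up. Instead of stopping at the first error like a plain copy, it keeps going block by block and records the offsets that could not be read, which is roughly what a tolerant player does when it skips ahead.

```python
import os
import tempfile


def find_unreadable_blocks(path, block_size=1 << 20):
    """Read a file block by block; return (bytes_read, offsets_that_failed)."""
    size = os.path.getsize(path)
    bytes_read = 0
    bad_offsets = []
    with open(path, "rb") as handle:
        for offset in range(0, size, block_size):
            try:
                handle.seek(offset)
                bytes_read += len(handle.read(block_size))
            except OSError:
                # A plain copy would abort here; note the offset and carry on.
                bad_offsets.append(offset)
    return bytes_read, bad_offsets


if __name__ == "__main__":
    # Demo on a healthy temporary file, so no bad offsets are expected.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 2500)
    print(find_unreadable_blocks(tmp.name, block_size=1000))
    os.remove(tmp.name)
```

If the list of bad offsets changes between runs on the same file, that would point toward an intermittent problem rather than damage stored on disk.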
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
rs225 said:
"If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." Or ransom-ware.
So, "It is not safe to have no snapshots"? No, I don't think so. Sure, they add some protection, and they're often a good idea, but I think you're overstating both the risk and the benefit.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
Last update:
No resilver and no checksum errors anymore, so the problem looks solved. I will keep an eye on things. The data came from my old ReadyNAS, and back then I had other hardware in my FreeNAS machine (including memory that was not good), but it seems fine now.
 

ajschot

Patron
Joined
Nov 7, 2016
Messages
341
danb35 said:
So, "It is not safe to have no snapshots"? No, I don't think so. Sure, they add some protection, and they're often a good idea, but I think you're overstating both the risk and the benefit.
I think you are right. Snapshots are only a little protection and never as good as a real backup. But my corrupt files were already there before the backup was made, so the backup was corrupt as well.
 