Contributing Pre and Post Scrub hooks for avoiding time machine issues

Status
Not open for further replies.

Todd Nine

Dabbler
Joined
Nov 16, 2013
Messages
37
Hey guys,
I'm a software engineer, and I'd like to contribute a solution to a common problem most Mac users seem to be experiencing with any ZFS data set and Time Machine (not just FreeNAS).


I've run into the issue that several users have been reporting with Apple's Time Machine. It seems that every time the scrub process is executed by cron, it causes a corruption of the backup image of my time machine disk if it's mounted by the client and currently in use. I'm assuming this is due to some sort of internal checksum in the sparse image that fails after the scrub corrects the file that is the disk device (sparse image) for time machine.


In my configuration, each client has its own ZFS data set. I use this to enforce per-client quotas and permissions. Each ZFS data set has its own AFP network share. I would like to do the following every time a ZFS scrub executes.



Pre hook

1) List all AFP shares for every Time Machine related ZFS data set. I'm not sure if there's a way to get this metadata from the share itself. Each share has Time Machine compatibility enabled, so I'm sure it's just an awk/grep to get all the share points.

2) Stop all shares (force if necessary) returned from step 1

Job execution

3) Perform the scrub

Post hook

4) Re-enable all the shares from 1)
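
The steps above can be sketched roughly as follows. This is a minimal sketch, not how FreeNAS schedules scrubs: `scrub_with_hooks`, `run_command`, and the stop/start commands passed in are hypothetical names chosen for illustration, and step 1 (listing the individual shares) is not covered. The only commands taken from this thread are `zpool scrub` and `zpool status`. The sketch assumes `zpool scrub` returns straight away while the scrub runs in the background, and that `zpool status` reports "scrub in progress" until it finishes.

```python
import subprocess
import time


def run_command(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def scrub_with_hooks(pool, stop_cmd, start_cmd,
                     runner=run_command, poll_seconds=60, sleep=time.sleep):
    """Stop sharing, scrub `pool`, wait for the scrub to end, resume sharing.

    stop_cmd and start_cmd are whatever commands stop and start the AFP
    shares on your system (assumption: supplied by the caller).
    """
    runner(stop_cmd)                          # pre hook (step 2)
    try:
        runner(["zpool", "scrub", pool])      # step 3: starts a background scrub
        # Poll until the status output no longer reports a running scrub.
        while "scrub in progress" in runner(["zpool", "status", pool]):
            sleep(poll_seconds)
    finally:
        runner(start_cmd)                     # post hook (step 4), even on failure
```

A call might look like `scrub_with_hooks("tank", ["service", "netatalk", "stop"], ["service", "netatalk", "start"])`. The pool name and the service name there are assumptions, and stopping the whole AFP service is coarser than stopping only the Time Machine shares.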


I use Linux and OS X quite heavily, and I'm very comfortable with the command line. However, I'm not nearly as comfortable in the FreeBSD OS. If someone could give me some pointers to documentation on where I can integrate my pre and post hooks, as well as the commands to list/start/stop the shares, I would really appreciate it.

Also, anything I create I would be more than happy to contribute back. It seems to be a common issue for most Mac users, and if I can contribute back to the community, that would be great!

Thanks,
Todd
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
WOW. Umm.. I don't know what to say. To be honest, I'm really not buying that a scrub is corrupting data. That's not something that goes unnoticed by an enterprise-class file system like ZFS. Not that I'm disputing you (I don't even have a Mac), but I find that really hard to swallow. I'm thinking something more sinister than a scrub is causing this.

I know that AFP got an overhaul and is now the latest version. So you might want to check out the 9.2.0-RC and see if the issue persists.
 

Todd Nine

Dabbler
Joined
Nov 16, 2013
Messages
37
To be clear, I'm not saying that FreeNAS is corrupting the file that represents the volume. Rather, I'm saying that Apple's Time Machine "perceives" the file as corrupted after a scrub has corrected any errors on it. It's not just FreeNAS; this seems to be a problem with any ZFS network provider.

http://goo.gl/kDA6To

If you force an fsck on the sparse image, it's actually fine. However, something in the disk image volume info doesn't match what's in the plist that Time Machine creates in the sparse image, so it thinks the disk is corrupt.


I was going to create a new zfs share and manually force a backup and a Scrub concurrently. However, without some way to simulate scrub rectifying a corruption issue, I'm not sure I can reproduce it accurately during my testing.

Any suggestions?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You could force a scrub from the CLI with zpool scrub poolname. If it fixes anything it will say so because it will say something like "resilvered XXXK" during and after the scrub with a zpool status.
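
As a rough illustration of checking that output from a script, here is a minimal sketch that pulls the repaired amount and error count out of the scan line of `zpool status`. The helper name `scrub_summary` and the sample line are hypothetical and hand-written for illustration, not captured from a real pool. The sketch assumes a finished scrub is summarised as "scrub repaired <amount> in <duration> with <n> errors", and the exact wording may differ between ZFS versions.

```python
import re

# Assumed format of the summary line printed after a scrub completes.
SCAN_RE = re.compile(r"scrub repaired (\S+) in .+? with (\d+) errors")


def scrub_summary(status_text):
    """Return (repaired, errors) for a finished scrub, or None if not found."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Hand-written sample in the assumed format (not real output).
sample = "  scan: scrub repaired 0 in 2h30m with 0 errors on Sun Dec  1 02:30:00 2013"
print(scrub_summary(sample))
```

If the repaired amount is zero, the scrub did not rewrite anything, which is useful to know when deciding whether a scrub could have touched the backup image at all.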

Can you explain what you think is going on? I read http://www.garth.org/archives/2011,...ine-sparsebundle-nas-based-backup-errors.html and it's all Greek to me. I'm not a Mac user, but I've been interested in this problem for a while. One of our mods uses Time Machine for his Mac with FreeNAS as the backup location over CIFS and AFP and claims to never have had this problem.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I've actually been pondering if it might be something like latency, because TM is weird in many ways.

What we should do is make Jordan look into it, haha.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
jgreco said:
I've actually been pondering if it might be something like latency, because TM is weird in many ways.

What we should do is make Jordan look into it, haha.

+1

You created the sh*tstain.. you can fix it. :)
 

jkh

Guest
Todd Nine said:
I've run into the issue that several users have been reporting with Apple's Time Machine. It seems that every time the scrub process is executed by cron, it causes a corruption of the backup image of my time machine disk if it's mounted by the client and currently in use. I'm assuming this is due to some sort of internal checksum in the sparse image that fails after the scrub corrects the file that is the disk device (sparse image) for time machine.

I'm sorry, but that doesn't actually make any sense. :-/

ZFS scrubs aren't going to change pool data out from under the clients of the pool - that would kind of defeat the purpose of a prophylactic check and make background scrubbing also cause everything from database corruption to premature hair loss in system administrators. I just don't see it as being the problem because this forum would be full of a lot more consternation if it was.

The way Time Machine does backups involves sparse disk images shared as mountable filesystems over AFP, a sparse disk image being itself composed of individual "band" files that each comprise a region of said disk image. Without going into a lot of geeky details about AFP, I'll just say that it's a pretty complicated dance behind the scenes and requires a number of AFP server features that need to interact properly with locking, caches, flushing semantics and quite a few other things (follow that link if you care) that if not implemented 100% correctly, will corrupt a backup faster than a politician in his first week in Washington.

Also, if you re-read that link you shared earlier, you'll note that folks are having problems with Synology, QNAP and quite a few other NAS types which, guess what, don't even use ZFS! They use everything from EXT3fs to proprietary RAID hardware. What they DO all have in common is netatalk. That's the only game in town for AFP interoperability if your name isn't "Apple", and I'm sure Time Machine backups have been giving the nice Netatalk folks a real workout in terms of implementing every last dark corner of the AFP protocol spec.

FreeNAS 9.2.0 has just updated to Netatalk 3.1.0, the latest and greatest. Before that, netatalk in FreeNAS was fairly old and out of date (and given how infrequently folks update their NAS software, probably old and out of date on a lot of the other NAS solutions your link cited). Why not try the 9.2.0-RC image and see if that makes your problem go away, first?
 

jkh

Guest
cyberjock said:
+1

You created the sh*tstain.. you can fix it. :)


Dorks! I had nothing to do with Time Machine and disavow any and all knowledge of its implementation or the... people... who created it! In fact, I've been disavowing all knowledge for quite a long time. Time Machine, Chernobyl, the asian Tsunami - all things I am distinctly not responsible for!
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
jkh said:
Time Machine, Chernobyl, the asian Tsunami - all things I am distinctly not responsible for!

You sure about the Chernobyl? I remember you mentioning your secret Ukrainian assets :P.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
jkh said:
Dorks! I had nothing to do with Time Machine and disavow any and all knowledge of its implementation or the... people... who created it! In fact, I've been disavowing all knowledge for quite a long time. Time Machine, Chernobyl, the asian Tsunami - all things I am distinctly not responsible for!

You worked at Apple. That's enough to pin the blame on you for ANY Apple failures past, present, and future for eternity. So man up! LOL

Chernobyl was your fault though. If you guys hadn't sent that one Apple II to Chernobyl... ;)

And the Tsunami was your fault too. They did the calculations on the predicted strongest quake on a Mac. ;)

And "Asian Tsunami"? Sure smells like racism against Asian Tsunamis. If it had been a White Tsunami...
 

slowfranklin

Cadet
Joined
Jul 14, 2012
Messages
9
jkh said:
Also, if you re-read that link you shared earlier, you'll note that folks are having problems with Synology, QNAP and quite a few other NAS types which, guess what, don't even use ZFS! They use everything from EXT3fs to proprietary RAID hardware. What they DO all have in common is netatalk.

fwiw, Time Capsule owners report backup image corruption too, so afaict the issue is more likely related to something on the client side. I've spent quite a lot of time stress testing the AFP features that are needed for Time Machine and I couldn't find anything wrong on the Netatalk server's side.

jkh said:
That's the only game in town for AFP interoperability if your name isn't "Apple", and I'm sure Time Machine backups have been giving the nice Netatalk folks a real workout in terms of implementing every last dark corner of the AFP protocol spec.

nah, that wasn't very complex. Spotlight was hard, really hard, but not the AFP features for Time Machine. :)
 

jpaetzel

Guest
slowfranklin said:
fwiw, Time Capsule owners report backup image corruption too, so afaict the issue is more likely related to something on the client side. I've spent quite a lot of time stress testing the AFP features that are needed for Time Machine and I couldn't find anything wrong on the Netatalk server's side.

nah, that wasn't very complex. Spotlight was hard, really hard, but not the AFP features for Time Machine. :)

Can I contact you out of band? I'd like to talk to you about the FreeNAS netatalk implementation.
 

Todd Nine

Dabbler
Joined
Nov 16, 2013
Messages
37
I wanted to report back my findings after running 9.2.0-RC for a while.

The Good News:

Once I upgraded to 9.2.0-RC this issue went away completely. I backed up every hour for a month and a half on 4 different Mac clients without this incident occurring.



The Bad News:

When I upgraded from 9.2.0-RC to 9.2.1-RC, every backup failed immediately on the next attempt. Each client reported that the image was corrupt and that it needed to be re-created. I haven't had any failures since this one; however, I've only had the new backups running for a couple of days.


I noticed in the change log that AFP had a few changes from 9.2.0 to 9.2.1. I have no idea if the issue was caused by some sort of migration bug, or if the problem was potentially reintroduced in the changes from 9.2.0 to 9.2.1.

Thanks guys!
 

mute

Dabbler
Joined
Dec 8, 2013
Messages
19
Sadly, I'm not convinced that it's fixed. I had my TM backup corrupt after my weekly scrub (which didn't actually repair anything) running 9.2.1.2 this week.
 

jkh

Guest
It's best not to jump to conclusions with Time Machine. I've seen TM backups "auto corrupt" over time even on Time Capsules. The Disk Image format used by TM is somewhat fragile over AFP and I believe that it's more than possible for TM to essentially corrupt itself, detect this, and start over without any "help" from the file store.
 

mute

Dabbler
Joined
Dec 8, 2013
Messages
19
jkh said:
It's best not to jump to conclusions with Time Machine. I've seen TM backups "auto corrupt" over time even on Time Capsules. The Disk Image format used by TM is somewhat fragile over AFP and I believe that it's more than possible for TM to essentially corrupt itself, detect this, and start over without any "help" from the file store.


I'm with you -- I've heard stories of this happening to people via AFP on Linux, on actual Time Capsules, etc. It may be coincidental, but I've had this happen to me 3 times now. I don't have any hard evidence, but it seems to coincide with my weekly scrub, so I made the leap.

Whatever the reason is, I'd love to know what I can do to mitigate the chances that it will happen again, as I intended on using my FreeNAS box as a TM backup for my home computers, and would love to do the same at the office on our servers as well.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
mute said:
I'm with you -- I've heard stories of this happening to people via AFP on Linux, on actual Time Capsules, etc. It may be coincidental, but I've had this happen to me 3 times now. I don't have any hard evidence, but it seems to coincide with my weekly scrub, so I made the leap.

Whatever the reason is, I'd love to know what I can do to mitigate the chances that it will happen again, as I intended on using my FreeNAS box as a TM backup for my home computers, and would love to do the same at the office on our servers as well.

Whoa.. there's an AFP on Linux!? I've never heard of that.. but what a poor bastard child that must be!

I'm not a Mac user since I'm not going to spend money on one, but my first guess is there's a permissions problem hiding for those that have this problem. Permissions problems are responsible for such a large number of users that see an error and run off in the wrong direction that, despite not having any evidence to support this theory, I'd almost rank it plausible!
 

mute

Dabbler
Joined
Dec 8, 2013
Messages
19
Yeah, well, AFP on Linux is via Netatalk, which is the same thing that FreeNAS uses. While I was trying to avoid having to restart my TM backup the first time this happened, I did some research and found others that had the same problem of seemingly random corruption going as far back as 2011: http://www.garth.org/archives/2011,...ine-sparsebundle-nas-based-backup-errors.html

Even if the fix on that blog worked, if I wanted to use a FreeNAS box as a Time Machine target for even a few people, I wouldn't want to have to hold everyone's hands when their TM backups corrupt randomly. Not to mention that having to restart a TM backup means losing any history you might have on that backup, which is pretty much what makes TM cool.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I totally agree. And when someone buys me a Mac, even an older Mac, I'll try to reproduce the problem. I've already got a series of experiments I'd want to do just to prove what the problem is, or at least what the problem isn't. ;)
 