Bitrotted movies on my TNC (unclear where/when it happened) & how I fixed the files

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
It's perfectly reasonable for people to verify I didn't overlook something: used ECC, no failed drives, scrubbed data before testing, etc.

But I also kind of expect to be attacked ... because if this was caused by ZFS (and I'm not even saying it was), it would make people who depend on it anxious.

Irrespective of where it happened, the corruption may have occurred before the files were transferred back to my TNC server (maybe I had a bad cable, or the drive they came from had a problem).

I began discovering these a while ago and only recently realized just how many files it'd happened to.
Videos in my library (often the larger files) would just randomly stop and couldn't play past that point.

Many were ≥5GB, UHD, etc. Again, I can't say where the errors came from & have no way of knowing for sure.


MINIMIZING TIME WHILE (HYPOTHETICALLY) FIXING THESE FILES:

Rather than (HYPOTHETICALLY) torrenting the entire file ... one could (HYPOTHETICALLY)
- start a torrent of the exact same file
- once the data file appears on disk, pause the torrent
- replace the downloaded file with the corrupted file
- Transmission at least (I suspect most clients) will scan the existing data and re-download only the pieces that fail verification.


In most cases, this (HYPOTHETICALLY) cleaned up the corruption, yielding a good file in seconds to a few minutes.
Some did (hypothetically) need ≥40% of the file re-downloaded to complete.
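The steps above work because a BitTorrent client verifies each fixed-size piece against the SHA-1 hashes stored in the .torrent, so wrong bytes are detected per piece, not per file. Here's a minimal sketch of that verification step (the 256 KiB piece size and the simulated 1 MiB file are hypothetical; real piece sizes come from the .torrent):

```python
import hashlib

PIECE_SIZE = 256 * 1024  # hypothetical; real torrents record the piece length

def piece_hashes(data: bytes) -> list[bytes]:
    """SHA-1 of each fixed-size piece, as a BitTorrent client computes them."""
    return [hashlib.sha1(data[i:i + PIECE_SIZE]).digest()
            for i in range(0, len(data), PIECE_SIZE)]

def pieces_to_redownload(local: bytes, good_hashes: list[bytes]) -> list[int]:
    """Indices of pieces whose hash doesn't match the torrent's piece list."""
    return [i for i, h in enumerate(piece_hashes(local)) if h != good_hashes[i]]

# Simulate: a "good" 1 MiB file (4 pieces), then flip one byte inside piece 2.
good = bytes(range(256)) * 4096
hashes = piece_hashes(good)
corrupt = bytearray(good)
corrupt[2 * PIECE_SIZE + 100] ^= 0xFF
print(pieces_to_redownload(bytes(corrupt), hashes))  # → [2]
```

A single flipped byte invalidates only its own piece, which is why the repair usually finishes in minutes: the client re-fetches just the failed pieces, not the whole file.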

I then realized I can't know for certain that the reported file size is correct, because BitTorrent clients "reserve" (preallocate) the file's full capacity on the drive before downloading.
For this reason, I used the du ("disk usage") command, which reports allocated file space in 512-byte units.
There are still other possibilities, but these values were also equal for the test file I chose for my example.
Because the file is corrupt, had MD5 been inadequate to show the discrepancy, I'd have then used SHA-1 or SHA-256.
Pictures are attached showing that both the reported sizes and the du sizes were the same, and that only the MD5s differed.
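The same-size-but-different-checksum check can be reproduced like this. A minimal sketch with throwaway temp files standing in for the original and repaired copies; `md5sum` is the GNU coreutils name (on macOS the equivalent is `md5`), and du's block unit can vary with platform and the BLOCKSIZE environment variable:

```shell
# Two equal-size files that differ in content (hypothetical stand-ins
# for the corrupted file and the repaired copy).
f1=$(mktemp); f2=$(mktemp)
printf 'good data here' > "$f1"
printf 'good dAta here' > "$f2"

du "$f1" "$f2"      # allocated size in blocks: identical
ls -l "$f1" "$f2"   # reported byte size: also identical
md5sum "$f1" "$f2"  # checksums differ, exposing the changed bytes
```

Identical du and `ls -l` output rules out a short (incomplete) file, so a differing checksum points at altered bytes rather than missing ones.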


But ... the files previously worked ... so it's pretty unlikely they were incomplete; more likely they just had corrupted data.
I was definitely skeptical of my idea, because there's no reason to expect the BT protocol to be self-annealing.
Yes, filling in data that's missing is what it does ... but I was surprised it replaced WRONG data. (In hindsight it makes sense: every piece is checksummed, so a hash check flags wrong data just as readily as missing data.)
Yes, it was exactly what I'd hoped it'd do ... but that didn't make it likely. It always seemed a tenuous solution.
 

Attachments

  • Example File.png
    Example File.png
    1.2 MB · Views: 119
  • Screenshot 2023-07-10 at 2.58.26 PM.png
    Screenshot 2023-07-10 at 2.58.26 PM.png
    245.3 KB · Views: 104
Joined
Oct 22, 2019
Messages
3,641
So what do you think actually happened?

Do your pools pass a scrub? If so, what bitrot are you referring to? Corrupted videos and photos are not evidence of bitrot if they were saved in such a corrupted state. Once any ZFS record is written, it is hashed, and this hash is stored in multiple locations across the pool. Even a single bit changing would fail a scrub or read. (If you have any redundancy, it can re-create the record using a known good copy or by using parity data.)
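The checksum-on-read and self-healing behavior described above can be illustrated with a toy model (this is purely illustrative Python, not how ZFS is actually implemented; the class and record contents are made up):

```python
import hashlib

class MirroredStore:
    """Toy model of checksum-on-read with self-healing from a mirror copy.
    Illustrative only; not ZFS's real on-disk design."""

    def __init__(self, record: bytes):
        # The checksum lives apart from the data, like ZFS block pointers.
        self.checksum = hashlib.sha256(record).digest()
        self.copies = [bytearray(record), bytearray(record)]  # two-way mirror

    def read(self) -> bytes:
        for copy in self.copies:
            if hashlib.sha256(copy).digest() == self.checksum:
                # Heal any sibling copy that fails its checksum.
                for j, other in enumerate(self.copies):
                    if hashlib.sha256(other).digest() != self.checksum:
                        self.copies[j] = bytearray(copy)
                return bytes(copy)
        raise IOError("unrecoverable: no copy matches the stored checksum")

store = MirroredStore(b"movie record")
store.copies[0][0] ^= 0xFF       # simulate bit rot on one mirror side
print(store.read())               # → b'movie record' (served from the good copy)
print(bytes(store.copies[0]))     # → b'movie record' (bad copy repaired)
```

With no second copy (a single-disk pool), the same read would raise instead of healing, which matches the "ZFS tells you which files to restore" behavior mentioned later in the thread.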
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
I agree that we need some additional information here.
Do you run a periodic scrub? What is the full output of
Code:
zpool status -v

I'd be interested to know when it last scrubbed.

I understand your "hypothetical" example. But many things could have changed your files. Do you use a media manager, like TinyMediaManager? Some of those media managers will trigger a metadata update of the file itself. I would hypothetically purposefully modify my files so that the metadata embedded in them doesn't have "clues" as to where they came from, and standardize (sterilize?) their names so they follow a standard convention. I even hypothetically re-encode the files to better support Direct Streams...
So your "example" does not prove anything in this particular instance, since external factors may have validly changed the files.

Now your previous statement
I began discovering these a while ago and only recently realized just how many files it'd happened to.
Videos in my library (often the larger files) would just randomly stop and couldn't play past that point.
Is cause for concern, but without knowing your full system specifications and other pertinent information such as the output of the above command, it's difficult to say. My best guess here? You have a crummy fakeraid or hardware raid card driving your pool. But I am guessing, since you haven't shared.
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
So what do you think actually happened?

Do your pools pass a scrub? If so, what bitrot are you referring to?
Yup, they do. That was my first guess ... but it said "no files repaired."
Unless something else makes me think it's endogenous, I'm assuming maybe a bad cable...? Maybe a failing SSD..?
Corrupted videos and photos are not evidence of bitrot if they were saved in such a corrupted state. Once any ZFS record is written, it is hashed, and this hash is stored in multiple locations across the pool. Even a single bit changing would fail a scrub or read. (If you have any redundancy, it can re-create the record using a known good copy or by using parity data..)

Obviously. I was being sincere when I said "I cannot know the cause."
I don't in the least blame ZFS or I'd have stopped using it.
I'd absolutely need corroboration to believe it ... Maybe some people would become "superstitious" about it? But I like epistemology.

If you have any redundancy, it can re-create the record using a known good copy or by using parity data.

Exactly. I have another pool, and if both have the same erroneous data it'll actually be a good thing (and I'll confirm it). But I'm in the middle of moving, and those drives / that array are packed up. I think I'm going to buy another 18-bay T630 and set up a mirrored array for the spinning drives ... (unless of course I can get my stupid NVMe array to perform something close to the drives that comprise it).

The real reason i thought this was worth reporting was the trick I used to 'reconstitute' the drives using BT.
Unless that's common knowledge or something and everyone else already knew about it ...?
 

TrumanHW

Contributor
Joined
Apr 17, 2018
Messages
197
I agree that we need some additional information here.
Do you run a periodic scrub? What is the full output of
Code:
zpool status -v

I'd be interested to know when it last scrubbed.

I understand your "hypothetical" example. But many things could have changed your files. Do you use a media manager, like TinyMediaManager? Some of those media managers will trigger a metadata update of the file itself. I would hypothetically purposefully modify my files so that the metadata embedded in them doesn't have "clues" as to where they came from, and standardize (sterilize?) their names so they follow a standard convention. I even hypothetically re-encode the files to better support Direct Streams...
So your "example" does not prove anything in this particular instance, since external factors may have validly changed the files.

Now your previous statement

Is cause for concern, but without knowing your full system specifications and other pertinent information such as the output of the above command, it's difficult to say. My best guess here? You have a crummy fakeraid or hardware raid card driving your pool. But I am guessing, since you haven't shared.


Once I get set back up after I move ... even finding a single instance of these files on my other array with the same exact errors (matching MD5s) would seem like a good sign in my book ... I don't use any RAID hardware, ever. I mean, that's the opposite of ZFS ... I only use HBAs.

And yes, good question: it scrubbed weekly, in fact (which seemed excessive), and I reduced it to monthly since I don't even have that machine on all the time. It literally is a media server ... but like I said, I have another which is rather similar ... I just don't have it up and running, as I'm literally loading my van up tomorrow and driving cross-country on Wednesday.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Do you have redundancy on your media pool?

On every read, ZFS will compare the checksum and, if it doesn't match, report it. Plus, if there is redundancy, it will both supply the corrected data to the requesting program and fix the failed media block (if good copies or parity are available).

Now I have experienced dozens of media files that failed over the years due to storage device bit rot. In my case, the miniature media server does not have enough storage for redundancy. But, that is okay, I have multiple backups. So I simply restored the failing files. (Thank you ZFS for TELLING me which files to restore!)

I even had a weird failure on that miniature media server. Most files with bit rot were video files, MUCH larger than music, pictures or eBooks, so statistically that was expected. But, I got a correctable error! What the *ell was going on? Took me weeks to figure out that it must have been metadata, which by default is redundant, even on striped or single disk pools. Good for ZFS!


As for the cause, I think some of these unusual cases might be non-ECC memory. With the tens of thousands of petabytes ZFS is now storing, those with non-ECC memory might be having bit flips saved, with proper checksums :-(, to data disks. These would not be "bit rot" because the storage did not "rot"; the memory failed. A ZFS scrub would not detect this.

There can even be an intermediary involved, like a desktop where the DVD or Blu-ray was extracted. You may have viewed the movie on or through the desktop first, all good. Then proceeded to copy it to your media server for long-term availability. Thus it passed through the desktop's (probably) non-ECC memory again, as well as the network, so one of those could have introduced the problem. ZFS gets bad data, checksums the bad data, writes the bad data to storage.

Of course, if you tell me you have ECC memory both on the NAS & the extractor, then this is not a possible cause for you.
 