winnielinnie
Disclaimer: My important data is not at risk. Everything explained in this post deals with "I don't care" temporary files. I used external USB drives, and couldn't care less if they were vaporized. But playing with USB drives lets you freely bump into "issues" and try to figure them out. This is purely to better understand ZFS data integrity and recovery.
Long story short, here's what happened.
TestPool
- mirror vdev
  - USB Drive A
  - USB Drive B
Saving files, deleting files, taking snapshots. The usual.
Then one day errors start spamming the zpool. Checksum, read, and write, all regarding USB Drive B. Yikes!
A short SMART test quickly fails on USB Drive B. It's dying, it's failing, time to say goodbye. I go ahead and offline it, then outright detach it. The pool now consists of a single drive (a stripe): USB Drive A. For good measure, both short and long SMART tests pass on the remaining USB Drive A.
(No big deal. Data's not important. I can always just attach another same-sized drive if I ever want it to become a mirror again.)
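For anyone following along, the steps above roughly correspond to the command lines below. This is a minimal sketch that only builds and prints the commands without running anything. The helper names and the device paths `/dev/da1`, `/dev/da2` and `/dev/da3` are hypothetical placeholders, not the actual devices from this pool. It also assumes the same device name can be passed to both `smartctl` and `zpool`, which may not hold if the pool refers to its members by a different label.

```python
import shlex


def mirror_teardown_commands(pool, healthy_dev, failing_dev):
    """Build (but do not run) the commands for removing a failing mirror member."""
    return [
        ["smartctl", "-t", "short", failing_dev],  # short self-test on the suspect drive
        ["zpool", "offline", pool, failing_dev],   # stop using the failing drive
        ["zpool", "detach", pool, failing_dev],    # remove it from the mirror entirely
        ["smartctl", "-t", "short", healthy_dev],  # short self-test on the survivor
        ["smartctl", "-t", "long", healthy_dev],   # long self-test on the survivor
    ]


def remirror_command(pool, healthy_dev, new_dev):
    """Build the command that turns the single-drive vdev back into a mirror."""
    return ["zpool", "attach", pool, healthy_dev, new_dev]


# Hypothetical device names, for illustration only.
teardown = mirror_teardown_commands("TestPool", "/dev/da1", "/dev/da2")
remirror = remirror_command("TestPool", "/dev/da1", "/dev/da3")

for argv in teardown + [remirror]:
    print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to run on any machine. USB enclosures sometimes need extra `smartctl` options to pass SMART commands through, which this sketch leaves out.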
But here's where things get interesting...
A scrub on the pool (which now consists only of USB Drive A as a stripe, remember) returns several hundred checksum errors, but only for files within one particular snapshot. (This is even though the same files exist across multiple snapshots and on the live filesystem itself.)
So I destroy this snapshot and run a scrub again. The scrub completes with no errors.
So now my paranoia kicks in, which brings me to the question in this thread's title:
How is it possible for checksum errors to exist on files only in a certain snapshot, while these same files return no checksum errors in other snapshots or even on the live filesystem? And how does destroying this snapshot resolve the situation? Creating or destroying a snapshot does not alter file data: it only creates or removes pointers.
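To make the "pointers" idea concrete, here is a toy Python model of copy-on-write blocks and snapshots. It is a deliberately simplified sketch, not how ZFS is implemented, and every name in it (`ToyPool`, `Block`, the `corrupt` flag) is hypothetical. It illustrates one scenario in which a scrub could flag a file in a single snapshot only: the damaged block is referenced by that snapshot alone, for example because the file was rewritten after the snapshot was taken. Nothing in this post confirms that this is what actually happened here.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    data: bytes
    corrupt: bool = False  # stands in for "stored checksum no longer matches"


@dataclass
class ToyPool:
    blocks: dict = field(default_factory=dict)     # block id -> Block
    live: dict = field(default_factory=dict)       # path -> block id
    snapshots: dict = field(default_factory=dict)  # snapshot name -> {path: block id}
    next_id: int = 0

    def write(self, path, data):
        # Copy-on-write: new data always lands in a fresh block.
        self.blocks[self.next_id] = Block(data)
        self.live[path] = self.next_id
        self.next_id += 1
        self._free_unreferenced()

    def snapshot(self, name):
        # A snapshot copies the pointers, never the data.
        self.snapshots[name] = dict(self.live)

    def destroy(self, name):
        del self.snapshots[name]
        self._free_unreferenced()

    def _free_unreferenced(self):
        # A block is freed once neither the live tree nor any snapshot points at it.
        referenced = set(self.live.values())
        for tree in self.snapshots.values():
            referenced.update(tree.values())
        for block_id in list(self.blocks):
            if block_id not in referenced:
                del self.blocks[block_id]

    def scrub(self):
        # Report every (owner, path) whose block is flagged as corrupt.
        errors = []
        trees = {"live": self.live, **self.snapshots}
        for owner, tree in trees.items():
            for path, block_id in tree.items():
                if self.blocks[block_id].corrupt:
                    errors.append((owner, path))
        return sorted(errors)


pool = ToyPool()
pool.write("file.txt", b"v1")
pool.snapshot("snap1")
pool.write("file.txt", b"v2")  # rewritten after snap1, so snap1 keeps the old block
pool.snapshot("snap2")

# Damage the block that only snap1 still references.
pool.blocks[pool.snapshots["snap1"]["file.txt"]].corrupt = True

before = pool.scrub()
pool.destroy("snap1")
after = pool.scrub()
print(before)
print(after)
```

In this model the first scrub reports `file.txt` under `snap1` only, and the second scrub is clean because destroying `snap1` frees the one damaged block. If the damaged block were instead shared by the live filesystem and every snapshot, the same path would be flagged everywhere, so the model only explains the observation when the bad block is unique to one snapshot.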