What possible causes can file corruption have?

Status
Not open for further replies.

checksum

Cadet
Joined
Dec 1, 2016
Messages
5
Newb here,

I'm running an abominable test rig in preparation for building multiple proper FreeNAS servers.
It's an Apple iMac (21.5 inch), and it's a few years old. My 4 connected drives, plus the internal one, are:
2 x 3TB connected in a Firewire 800 daisychain
2 x 2TB connected with USB 2.0
500GB internal spinning drive

Ridiculous, I know. This is about breaking things to learn why they break.


I created a pool with the 3TB drives mirrored in a vdev and the 2TB drives mirrored in a second vdev. I used the 500GB drive as cache.
I filled the pool to 78% (3.4/4.4TB) and let it sit for a day. This data is of no importance. Then I ran a scrub.
The scrub found one corrupt file it couldn't recover.
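
For readers following along, here is a toy sketch of what a scrub does with one block on a two-way mirror. It is not how ZFS is actually implemented (ZFS keeps checksums in parent block pointers, and SHA-256 here just stands in for whatever checksum the pool uses); the function name `scrub_block` and the sample data are made up for illustration.

```python
import hashlib


def checksum(data):
    """Stand-in block checksum (assumption: SHA-256, purely for illustration)."""
    return hashlib.sha256(data).hexdigest()


def scrub_block(copies, expected):
    """Check every mirror copy of one block against the stored checksum.

    Returns (status, copies_after_scrub). A bad copy can be rewritten from a
    good one, but if no copy matches there is nothing to repair from.
    """
    good = [c for c in copies if checksum(c) == expected]
    if not good:
        return "unrecoverable", list(copies)
    if len(good) == len(copies):
        return "ok", list(copies)
    return "repaired", [good[0]] * len(copies)


original = b"file contents"
expected = checksum(original)
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # one flipped bit

print(scrub_block([original, original], expected)[0])  # ok
print(scrub_block([original, flipped], expected)[0])   # repaired
print(scrub_block([flipped, flipped], expected)[0])    # unrecoverable
```

The last case is the one reported here: an unrecoverable file means every copy of at least one of its blocks failed the checksum, not just one side of the mirror.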

After the initial copy I ran a checksum comparison (a second rsync pass with the "--checksum" flag), and all was cool in the pool. After ZFS gave me the error, I ran rsync with --checksum on the broken file, and rsync did recopy it (meaning the checksum no longer matched my source).
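
The checksum comparison described above can be sketched roughly like this in Python. This is not what rsync does internally, only the same idea: hash every file on both sides and report the ones that differ. The function names and the example paths are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(source_root, dest_root):
    """Return relative paths that are missing or differ under dest_root."""
    source_root, dest_root = Path(source_root), Path(dest_root)
    mismatches = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = dest_root / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            mismatches.append(rel.as_posix())
    return mismatches


if __name__ == "__main__":
    # Hypothetical paths; replace with the real source and pool mount point.
    for rel in find_mismatches("/path/to/source", "/mnt/pool/data"):
        print("differs or missing:", rel)
```

A clean pass like this only says the data read back correctly at that moment. It cannot rule out corruption that shows up on a later read, which is what the scrub caught.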

So now my questions:
- I assume my files made it into the pool whole, because the checksum pass after copy didn't complain. Am I wrong?
- The drives are old and well past worn in, so failure doesn't seem unlikely. But failure of both copies of that one file?
- Is the non-ECC RAM to blame? (Will I not have to worry about this mode of failure in a proper ECC RAM build?)

I am running a new scrub after having replaced the broken file. I want to do a memtest after that finishes.

I know the machine is terrible, but I did not expect failure in this place. I hope you wise people of the forum can make me smarter.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
2 x 3TB connected in a Firewire 800 daisychain
2 x 2TB connected with USB 2.0
This is the likely cause. You shouldn't use USB or Firewire (how is that even working?) for pools. It's very well known on this forum that it causes corruption.
I used the 500GB drive as cache.
Spinning rust or SSD? SLOG or L2ARC?
I assume my files made it into the pool whole, because the checksum pass after copy didn't complain. Am I wrong?
Yes, they made it. However, because you're using USB, it's anybody's guess.
The drives are old and well past worn in, so failure doesn't seem unlikely. But failure of both copies of that one file?
As long as the drives are in good health (no bad sectors, etc) this shouldn't be a factor.
Is the non ECC RAM to blame?
ECC is recommended for all FreeNAS builds. A bit flip followed by a scrub can cause corruption.

Long story short, you did everything wrong on your test bed.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
The USB or FireWire is what's causing your problems. If you want to test stuff, use a VM. Even that is better than what you have now.

 

depasseg

FreeNAS Replicant
Joined
Sep 16, 2014
Messages
2,874
How much RAM do you have?
 

checksum

Cadet
Joined
Dec 1, 2016
Messages
5
You shouldn't use USB or Firewire
Good to know! Why not? (not questioning your knowledge and experience here, I'm just curious). Recommended reading is also very welcome!

Spinning rust or SSD?
Spinning rust, and L2ARC. Again, probably a silly choice. I just wanted to see what would happen.


So the I/O was the weak link here!
Not that I plan on this to happen, but does this mean I should under no circumstance use USB (even 3) storage on a FreeNAS?


m0nkey_, SweetAndLow, I appreciate the help!

depasseg, I'll get back to you on that tomorrow. I believe it has loads (32GB maybe, at least 16).
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Why not? (not questioning your knowledge and experience here, I'm just curious).
We've had many a person here running a ZFS pool on USB. While it will work for a while, eventually it will cease to work. Every time you use a drive on USB (or FireWire), a ZFS deity murders a kitten.
Spinning rust, and L2ARC. Again, probably a silly choice. I just wanted to see what would happen.
An L2ARC is a read cache. Having another spinny disk won't speed things up; it will probably make things slower.
but does this mean I should under no circumstance use USB (even 3) storage on a FreeNAS?
Correct. Using USB leads to severe depression, anxiety and data loss.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Good to know! Why not? (not questioning your knowledge and experience here, I'm just curious). Recommended reading is also very welcome!
If you read any of the stickies, this knowledge should be available.

 

checksum

Cadet
Joined
Dec 1, 2016
Messages
5
I got another error today after doing the second scrub. I think it confirms that the USB/FW were the cause (not that any of the other factors were any good).
Screen Shot 2016-12-02 at 09.29.25.png

A heartfelt thank you to all who have replied. I will do more reading before I set up a less ridiculous test.
Measuring by lessons learned, this test was a success :)
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
I think it confirms that the USB/FW were the cause
Any chance you can pin it down to one or the other? Which vdev is on USB and which is on FireWire? My hunch is that FireWire is more reliable than USB, but that's probably just my inner Apple fanboy. Also, it could still be the disks. What make and model are they? Any chance you can run badblocks destructive on those two disks, just to eliminate that possibility?
Measuring by lessons learned, this test was a success
Agreed, this would be a good place to direct others who are considering using <whichever connection is flaky> in production.
 

checksum

Cadet
Joined
Dec 1, 2016
Messages
5
Which vdev is on USB and which is on FireWire
Hmmm, that could be fun to find out. I will run badblocks Monday.

Another thing that is weird with this machine is that it has 20GB RAM... so probably a 4GB and a 16GB stick I'm guessing. That's probably suboptimal haha

Memtest is happy so far: 3 hours 30 minutes in, second pass, no errors.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Mismatched RAM is suboptimal, but not likely to have any measurable impact on most workloads.

You should always test your RAM, but the point of ECC is that it detects, and can usually correct, bit flips during normal operation. These occur more often than you'd think, and have nothing to do with whether the RAM is good or not. That's why having ECC RAM is so important.
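
To make "detects, and can usually correct" concrete, here is a toy Hamming(7,4) code in Python. It is only an illustration of the principle: real ECC memory does this in hardware, typically with a wider single-error-correcting, double-error-detecting code over 64-bit words, and the function names below are made up.

```python
def encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit or 0)."""
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    fixed = list(c)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


data = [1, 0, 1, 1]
stored = encode(data)
stored[4] ^= 1                      # simulate a bit flip at position 5
fixed, position = correct(stored)
print(position)                     # 5
print(decode(fixed) == data)        # True
```

Without the extra parity bits, the flipped bit would simply be handed to ZFS as if it were valid data, and memtest passing beforehand would not change that.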
 