What possible causes can file corruption have?

checksum · Dec 1, 2016

Newb here,

I'm running an abominable test rig in preparation for building multiple proper FreeNAS servers.
It's an Apple iMac (21.5 inch), and it's a few years old. My 4 connected drives are:
2 x 3TB connected in a Firewire 800 daisychain
2 x 2TB connected with USB 2.0
500GB internal spinning drive

Ridiculous, I know. This is about breaking things to learn why they break.

I created a pool with the 3TB drives mirrored in a vdev and the 2TB drives mirrored in a second vdev. I used the 500GB drive as cache.
I filled the pool to 78% (3.4/4.4TB) and let it sit for a day. This data is of no importance. Then I ran a scrub.
The scrub found one corrupt file it couldn't recover.

After I copied the data in the first place I ran a checksum compare (I ran rsync a second time with the "--checksum" flag). And all was cool in the pool. After ZFS gave me the error I ran rsync with checksum on the broken file, and rsync did recopy it (meaning the checksum was different than on my source).
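
For readers who want to repeat this kind of post-copy check without rsync, here is a minimal sketch in Python. It is not what rsync does internally: SHA-256 is used purely for illustration, and the names `file_digest` and `find_mismatches` are hypothetical helpers invented for this example. Note that a clean result only says the bytes read back at that moment matched the source; it says nothing about what a later scrub will find.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_root, dst_root):
    """List relative paths whose copy is missing or differs from the source."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(src_path, src_root)
            dst_path = os.path.join(dst_root, rel_path)
            # A missing destination file counts as a mismatch too.
            if not os.path.isfile(dst_path):
                mismatches.append(rel_path)
            elif file_digest(src_path) != file_digest(dst_path):
                mismatches.append(rel_path)
    return sorted(mismatches)
```

Calling `find_mismatches("/path/to/source", "/path/to/pool/dataset")` (both paths are placeholders) returns the files that would need to be recopied, which is roughly the information the second rsync pass gave me.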

So now my questions:
- I assume my files made it into the pool whole, because the checksum pass after copy didn't complain. Am I wrong?
- The drives are old and well past worn in, so failure doesn't seem unlikely. But failure of both copies of that one file?
- Is the non ECC RAM to blame? (Will I not have to worry about this mode of failure in a proper ECC RAM build?)

I am running a new scrub after having replaced the broken file. I want to do a memtest after that finishes.

I know the machine is terrible, but I did not expect failure in this place. I hope you wise people of the forum can make me smarter.

m0nkey_ · Dec 1, 2016

checksum said:
2 x 3TB connected in a Firewire 800 daisychain
2 x 2TB connected with USB 2.0

This is the likely cause. You shouldn't use USB or Firewire (how is that even working?) for pools. It's very well known on this forum that it causes corruption.

checksum said:
I used the 500GB drive as cache.

Spinning rust or SSD? SLOG or L2ARC?

checksum said:
I assume my files made it into the pool whole, because the checksum pass after copy didn't complain. Am I wrong?

Yes, they made it. However, because you're using USB, it's anybody's guess.

checksum said:
The drives are old and well past worn in, so failure doesn't seem unlikely. But failure of both copies of that one file?

As long as the drives are in good health (no bad sectors, etc) this shouldn't be a factor.

checksum said:
Is the non ECC RAM to blame?

ECC is recommended for all FreeNAS builds. A bit flip followed by a scrub can cause corruption.

Long story short, you did everything wrong on your test bed.

SweetAndLow · Dec 1, 2016

The USB or FireWire is what's causing your problems. If you want to test stuff, use a VM. Even that is better than what you have now.


depasseg · Dec 1, 2016

How much RAM do you have?

checksum · Dec 1, 2016

m0nkey_ said:
You shouldn't use USB or Firewire

Good to know! Why not? (not questioning your knowledge and experience here, I'm just curious). Recommended reading is also very welcome!

m0nkey_ said:
Spinning rust or SSD?

Spinning rust, and L2ARC. Again, probably a silly choice. I just wanted to see what would happen.

So the I/O was the weak link here!
Not that I plan on this to happen, but does this mean I should under no circumstance use USB (even 3) storage on a FreeNAS?

m0nkey_, SweetAndLow, I appreciate the help!

depasseg, I'll get back to you on that tomorrow. I believe it has loads (32GB maybe, at least 16).

m0nkey_ · Dec 1, 2016

checksum said:
Why not? (not questioning your knowledge and experience here, I'm just curious).

We've had many a person here running a ZFS pool on USB. While it will work for a while, eventually it will cease to work. Every time you use a drive on USB (or FireWire), a ZFS deity murders a kitten.

checksum said:
Spinning rust, and L2ARC. Again, probably a silly choice. I just wanted to see what would happen.

An L2ARC is a read-only cache. Having another spinny disk won't speed things up. It will probably make things slower.

checksum said:
but does this mean I should under no circumstance use USB (even 3) storage on a FreeNAS?

Correct. Using USB leads to severe depression, anxiety and data loss.

SweetAndLow · Dec 1, 2016

checksum said:
Good to know! Why not? (not questioning your knowledge and experience here, I'm just curious). Recommended reading is also very welcome!

Spinning rust, and L2ARC. Again, probably a silly choice. I just wanted to see what would happen.

So the I/O was the weak link here!
Not that I plan on this to happen, but does this mean I should under no circumstance use USB (even 3) storage on a FreeNAS?

m0nkey_, SweetAndLow, I appreciate the help!

depasseg, I'll get back to you on that tomorrow. I believe it has loads (32GB maybe, at least 16).

If you read any of the stickies, this knowledge should be available.


Robert Trevellyan · Dec 1, 2016

checksum said:
Is the non ECC RAM to blame?

Could be.

checksum · Dec 2, 2016

I got another error today after doing the second scrub. I think it confirms that the USB/FW were the cause (not that any of the other factors were any good).

A heartfelt thank you to all who have replied. I will do more reading before I set up a less ridiculous test.
Measuring by lessons learned, this test was a success :)

Robert Trevellyan · Dec 2, 2016

checksum said:
I think it confirms that the USB/FW were the cause

Any chance you can pin it down to one or the other? Which vdev is on USB and which is on FireWire? My hunch is that FireWire is more reliable than USB, but that's probably just my inner Apple fanboy. Also, it could still be the disks. What make and model are they? Any chance you can run badblocks destructive on those two disks, just to eliminate that possibility?

checksum said:
Measuring by lessons learned, this test was a success

Agreed, this would be a good place to direct others who are considering using <whichever connection is flaky> in production.

checksum · Dec 2, 2016

Robert Trevellyan said:
Which vdev is on USB and which is on FireWire

Hmmm, that could be fun to find out. I will run badblocks Monday.

Another thing that is weird with this machine is that it has 20GB RAM... so probably a 4GB and a 16GB stick I'm guessing. That's probably suboptimal haha

Memtest is happy so far: 3 hours 30 minutes in, second pass, no errors.

Robert Trevellyan · Dec 2, 2016

Mismatched RAM is suboptimal, but not likely to have any measurable impact on most workloads.

You should always test your RAM, but the point of ECC is that it detects, and can usually correct, bit flips during normal operation. These occur more often than you'd think, and have nothing to do with whether the RAM is good or not. That's why having ECC RAM is so important.
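
To make the bit-flip point concrete, here is a toy sketch in Python. It is not ZFS's implementation: SHA-256 stands in for whatever checksum the filesystem uses, and `flip_bit` and `checksum` are hypothetical helpers written for this example. It shows two cases. A flip that happens after the checksum was recorded is caught on the next read or scrub. A flip that happens in RAM before the checksum is computed goes unnoticed, because the checksum is calculated over data that is already bad.

```python
import hashlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def checksum(data):
    """Stand-in checksum for this sketch (an assumption, not ZFS's default)."""
    return hashlib.sha256(data).hexdigest()


original = b"This data is of no importance." * 1000
stored_checksum = checksum(original)

# Case 1: the flip happens after the checksum was recorded
# (for example on the wire or on disk). Verification catches it.
damaged_later = flip_bit(original, 12345)
detected = checksum(damaged_later) != stored_checksum

# Case 2: the flip happens in RAM before the checksum is computed.
# The recorded checksum describes the damaged data, so it still verifies.
damaged_early = flip_bit(original, 12345)
checksum_of_damaged = checksum(damaged_early)
undetected = checksum(damaged_early) == checksum_of_damaged

print("flip after checksum detected:", detected)
print("flip before checksum goes unnoticed:", undetected)
```

The second case is the one ECC is meant to guard against: the storage layer faithfully stores and verifies bad data, so no scrub will ever flag it.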
