I'm really not challenging you on this.
Challenge away. I tend to be pragmatic about this sort of thing. With an EE/CS background, I understand why people expect to see a certain behaviour, and my best guess is that the issues aren't entirely understood. I reconcile the ground truth I observe with the theories of others, and here I come to the conclusion that theories fail in the face of facts. I am in no way married to my conclusions, if you can make a compelling challenge as to their correctness.
If cosmic radiation were flipping bits randomly (after all, Intel claims something like 300 random errors in RAM per year for a machine running 24x7x365), I'd expect ECC to correct approximately 300 random errors per year.
Yes, that's about what you'd expect, if we were to believe that.
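Just to put numbers on it, here's a quick back-of-the-envelope taking that ~300 soft errors per machine-year figure at face value (the number is the claim quoted above, not something I've measured):

    # Back-of-the-envelope: what a claimed ~300 soft errors per machine-year
    # would look like in the ECC logs if the claim actually held.
    CLAIMED_ERRORS_PER_YEAR = 300            # the figure quoted above

    per_day = CLAIMED_ERRORS_PER_YEAR / 365.0
    per_quarter = CLAIMED_ERRORS_PER_YEAR / 4.0
    over_20_years = CLAIMED_ERRORS_PER_YEAR * 20

    print(f"~{per_day:.2f} corrected errors per day")          # ~0.82
    print(f"~{per_quarter:.0f} corrected errors per quarter")  # 75
    print(f"{over_20_years} over a 20-year service life")      # 6000

So the claim works out to roughly one corrected error a day, about 75 a quarter, and thousands over the life of a long-serving box. Keep those numbers in mind against what follows.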
So here's my practical observation.
1) I'm fully aware of the "Google ECC" study and other claims that ECC errors happen at the rate of (essentially some random number) per day.
2) On a computer that was nearing the ripe old age of 20 (years), we were starting to see parity errors pop up every quarter or thereabouts. Parity errors caused panics, so the events did not pass unnoticed. Aging parts. These panics were not regularly happening during the first half (two thirds?) of the server's life, though there may have been one or two over the years.
3) Our other gear typically doesn't show ANY significant level of ECC errors. Unless there's a bad chip, in which case you can get what I shared above.
My first conclusion is that the Google ECC study is fatally flawed, and my best guess is that it's because Google is buying the cheapest possible $#!+ that they possibly can and cramming it in racks as close as they can, running it as hot as they can, to keep their capex and opex as low as possible. I am *shocked* that they're seeing signs of unhappy hardware.
My second conclusion is that there's something wrong with the cosmic radiation theories, or at least with how they've been applied to DRAM for error estimation purposes. We see this all day long in many industries. The FDA has just ordered a reduction in the recommended dosage of sleep medications because they made what's essentially a mistake the first time around. Stuff happens.
So here are some hints.
A) Buy name-brand RAM. If you're Google and you're buying thousand-lots from the lowest bidder, and the lowest bidder knows you're building ECC-protected servers for bottom dollar, you may well be getting something of lower quality than the memory being sold in the channel to HP, Dell, or other OEMs and retail vendors. Remember that even marginal silicon often ends up somewhere. Hopefully not in your machine. A known vendor willing to provide a lifetime guarantee may charge 10% more than the scuzzballs selling RAM over at never-effin-heard-of-us.com, but you know what? It's worth the extra. The known vendor is getting the stuff they know they're less likely to have to RMA.
B) Build a quality system that can watch for problems. Server-quality mainboard, from a vendor who does this stuff for the enterprise market. Because you know what? That no-name board won't have ECC error logging support. Your ASUS, Gigabyte, etc. boards may - but then again they may not. But you can be pretty sure that the HP, IBM, Supermicro, etc. stuff is intended to do that (there's a rough monitoring sketch after this list). Don't skimp on other parts like the power supply either.
C) Keep your system well cooled and unstressed.
D) Be proactive about bench testing (run memtest86 for a week!) and ongoing monitoring, and replace faulty modules as soon as they show up.
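As a concrete example of the monitoring in B) and D): on a Linux box where the EDAC driver for your memory controller is loaded, the kernel exposes corrected and uncorrected error counts under /sys/devices/system/edac/mc/. Here's a rough sketch of a poller; the sysfs layout shown is the common one, but verify what your particular kernel and chipset actually expose:

    #!/usr/bin/env python3
    # Rough sketch: poll the Linux EDAC counters so a flaky module shows up
    # in monitoring instead of in a panic. Assumes the EDAC driver for your
    # memory controller is loaded; the sysfs paths below are the usual
    # layout, but check what your kernel actually provides.
    import glob
    import os

    def read_count(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        ce = read_count(os.path.join(mc, "ce_count"))   # corrected errors
        ue = read_count(os.path.join(mc, "ue_count"))   # uncorrected errors
        print(f"{os.path.basename(mc)}: corrected={ce} uncorrected={ue}")

Wire something like that into whatever monitoring you already run; a steadily climbing corrected-error count on one controller points at a module to replace long before you're staring at a panic.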
Strikes me as a good idea to go survey a few other systems to see what's what around here. Hm.