SOLVED ...error resulting in data corruption. Applications may be affected.

Stux · Aug 28, 2017

You verified that the error follows the memory when you move it from one slot to another? And that it doesn't stay in the slot?

~~Also, hang on to your faulty memory (label it!), you now have a way to test that ECC is working~~ sorry... if you had ECC memory. Faulty non-ecc memory is junk.

When you get your new memory, run memtest for a day or so. Then boot up your FreeNAS.

FIRST scrub the pool.

After that, then make a decision.

EDIT: btw, while waiting on the memory, nothing says you can't scrub your pool with only 24GB of memory installed.

dnilgreb · Aug 28, 2017

Are you saying I should scrub the pool before deleting corrupted files? Hoping that the files will be repaired?

Stux said:
After that, then make a decision.

The decision I am making after scrubbing is whether to delete the files or not? If it scrubs without errors, it´s safe to keep them?

And yes, I moved it to another slot and still had the error. I tested all the DIMMs in their respective slot, so slots seem OK.

Stux · Aug 28, 2017

dnilgreb said:
Are you saying I should scrub the pool before deleting corrupted files? Hoping that the files will be repaired?

Yes.

The decision I am making after scrubbing is whether to delete the files or not? If it scrubs without errors, it´s safe to keep them?

Yes. Since those files are Time Machine backups, I'd suggest using the Time Machine client to reset its backup. It should erase the files as part of that.

And yes, I moved it to another slot and still had the error. I tested all the DIMMs in their respective slot, so slots seem OK.

Good news. Guess you have an appreciation for ECC memory now :)

dnilgreb · Aug 29, 2017

Great!
Yes. Way back when I got into FreeNAS, I used old hardware I already had.
If I were to build a new machine for FreeNAS I would go with ECC. But that is purely based on recommendation. How would this type of problem have played out if I had ECC? What would have been different once a DIMM goes bad?

Stux · Aug 29, 2017

dnilgreb said:
Great!
Yes. Way back when I got into FreeNAS, I used old hardware I already had.
If I were to build a new machine for FreeNAS I would go with ECC. But that is purely based on recommendation. How would this type of problem have played out if I had ECC? What would have been different once a DIMM goes bad?

There would have been no problem.

You would've found out you had a memory error because you'd have found a system event log entry about it being corrected, then you would've tested/replaced the DIMM.

There would be no corruption or crashing.

If the memory went heinously bad and had a double bit error (a bit like a double disk error), then depending on OS configuration, your system would've frozen as soon as the faulty data was accessed with a memory exception.

Again, no corruption. And you find out *instantly* when your memory goes bad.

Modern ECC systems will actually scrub all system memory every 24 hours too, so they can detect memory failures before they even affect anything.
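To make the "single-bit errors get corrected, double-bit errors get detected" behaviour concrete, here is a toy sketch of a SECDED (single error correct, double error detect) code. It is an assumption-laden illustration, not how any particular memory controller works: it protects only 4 data bits with an extended Hamming code, whereas ECC DIMMs typically protect 64 data bits with 8 check bits. The function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; positions 1, 2, 4 hold parity.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 is an overall parity bit making the total number of 1s even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    code = list(code)  # work on a copy so the caller's word is untouched
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1
    if overall_odd:
        # Odd overall parity: exactly one bit flipped, so fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but non-zero syndrome: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                 # one bad bit: silently repaired, but reported
print(decode(word))          # ('corrected', 11)
word[2] ^= 1                 # a second bad bit in the same word
print(decode(word))          # ('uncorrectable', None)
```

The "corrected" status corresponds to the logged event described above, and "uncorrectable" to the case where the system halts rather than hand bad data to the application.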

dnilgreb · Aug 30, 2017

Thanks for the explanation. ECC definitely seems the way to go, but for now I´m stuck with what I´ve got.

Here´s my progress:
Scrubbed my pool with the 24GB of memory I had left. None of my three files were repaired (2x Time Machine bands and 1x video). I deleted the video file, and went into Time Machine to remove all backups back to Aug 19, which was the date shown on the TM files by ls -la. The files still existed, so I ran a new TM backup, and then a new scrub. One of the two is now repaired, one to go. Will look into whether there is a way to clear the error without deleting all backups.

Anyways, got my new memory stick yesterday, went home and plugged it in, ran memtest (with all 32GB).
Looking at it this morning, after running 5 hrs and 30 mins of memtest, I have 2 errors... F*ck.
Back to troubleshooting DIMMS and slots.

Currently running single DIMM, but here´s the question:
Assuming I have another bad DIMM (or slot for that matter), can I expect the error to occur in the same place/test every time? Or is there a chance it could occur somewhere else, or not even at all next time?

Memtest says

Code:

RAM may be vulnerable to high frequency row hammer bit flips
RAM may be vulnerable to high frequency row hammer bit flips
Test: 13 Addr: 2FA0466EC Expected: F1D79C7E Actual: B1D79C7E CPU: 0
Test: 13 Addr: 71F086878 Expected: D7FD2C8D Actual: D6FD2C8D CPU: 0

Stux · Aug 30, 2017

The problem is the bands are what compose the TM disk image, that TM mounts to then make multiple dependent backups.

Deleting a band would be bad.

dnilgreb said:
Currently running single DIMM, but here´s the question:
Assuming I have another bad DIMM (or slot for that matter), can I expect the error to occur in the same place/test every time? Or is there a chance it could occur somewhere else, or not even at all next time?

Memtest says

Code:
RAM may be vulnerable to high frequency row hammer bit flips
RAM may be vulnerable to high frequency row hammer bit flips
Test: 13 Addr: 2FA0466EC Expected: F1D79C7E Actual: B1D79C7E CPU: 0
Test: 13 Addr: 71F086878 Expected: D7FD2C8D Actual: D6FD2C8D CPU: 0

Depends on the error. Sometimes yes. Some Linux kernels even let you feed the kernel a list of bad memory blocks to avoid.

Best to replace the RAM.
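For what it's worth, a rough sketch of the "list of bad memory blocks" idea on Linux (so not directly applicable to FreeNAS, which is FreeBSD-based): the kernel's `memmap=nn[KMG]$ss[KMG]` boot parameter marks a region as reserved. The helper below is hypothetical; it assumes 4 KiB pages and assumes the addresses memtest reports are the physical addresses the kernel would see, and it only works if the fault stays at the same addresses.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def memmap_args(bad_addresses, page_size=PAGE_SIZE):
    """Build memmap= arguments reserving the page around each bad address."""
    # Round each address down to the start of its page and drop duplicates.
    pages = sorted({addr & ~(page_size - 1) for addr in bad_addresses})
    return [f"memmap={page_size // 1024}K${page:#x}" for page in pages]


# The two failing addresses from the memtest output in this thread.
print(" ".join(memmap_args([0x2FA0466EC, 0x71F086878])))
# memmap=4K$0x2fa046000 memmap=4K$0x71f086000
```

How the arguments get passed to the kernel depends on the bootloader, and the `$` may need escaping in its config file. This is a stopgap at best.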

I'm not 100% certain, it depends what Test 13 is... is that the row hammer test? Row hammer is not *supposed* to affect realistic workloads.

If Test 13 is not the row hammer test, then yes, you have faulty bits in your ram.
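As a quick sanity check on the two memtest lines quoted above, XORing the expected and actual values shows exactly which bits differ. This is just arithmetic on the reported numbers; the helper name `flipped_bits` is made up for the example.

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) that differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# (expected, actual) pairs from the memtest output in this thread.
errors = [(0xF1D79C7E, 0xB1D79C7E), (0xD7FD2C8D, 0xD6FD2C8D)]
for expected, actual in errors:
    print(f"{expected:08X} -> {actual:08X}: flipped bits {flipped_bits(expected, actual)}")
# F1D79C7E -> B1D79C7E: flipped bits [30]
# D7FD2C8D -> D6FD2C8D: flipped bits [24]
```

Both failures are a single bit flipping from 1 to 0, which is the kind of error ECC memory would have corrected and logged.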

dnilgreb · Aug 30, 2017

Stux said:
Deleting a band would be bad.

Yes, I´ve gathered as much, and am looking for a way to correctly get rid of the corrupt data. If possible. Otherwise I´ll just reset TM.

Stux said:
If Test 13 is not the row hammer test, then yes, you have faulty bits in your ram.

memtest86.com tells me that test 13 is the Hammer test. I think I´m going to run memtest on all sticks individually, and when I find the one reporting the error, try that one in different slots. Then I´ll get a warranty replacement if it´s bad. If the error doesn´t occur again, I could safely assume everything is all right?

Stux · Aug 30, 2017

dnilgreb said:
Yes, I´ve gathered as much, and am looking for a way to correctly get rid of the corrupt data. If possible. Otherwise I´ll just reset TM.

TM doesn't replace files which are unmodified in any previous backup. So the error will persist until you remove all backups which reference the dodgy file, and there's almost no way to tell which file that is (you could do an exhaustive file compare, but this will take an eternity).

I'd reset the backup. If you want you can rename the image or snapshot it, and keep it for a while, but you'll have the errors until the bands are gone.

At the end of the day, this sort of corruption is what ZFS is designed to prevent... or at least detect.

memtest86.com tells me that test 13 is the Hammer test. I think I´m going to run memtest on all sticks individually, and when I find the one reporting the error, try that one in different slots. Then I´ll get a warranty replacement if it´s bad. If the error doesn´t occur again, I could safely assume everything is all right?

Nope. Maybe you could assume it's okay if you ran the memtest for a week, but all you can say is that a memory error didn't occur while you were testing. Maybe reseating the memory cleaned some dust off the contacts? Who knows. ECC knows ;)

As an aside, I suspect most spontaneous crashes on modern OSes are due to either faulty memory or faulty thermals. ~~Criminal~~ a shame that Intel has played games with ECC for so long that we haven't migrated to ECC memory 100% for all modern systems.

MrToddsFriends · Aug 30, 2017

dnilgreb said:
If the error doesn´t occur again, I could safely assume everything is all right?

Many (15?) years ago I had a memory problem resulting in unwanted behavior (program / OS crashes) that was not revealed with memtest86. So while it holds true in my experience that there is a problem if memtest86 finds one (which can be related to memory modules, CPU caches, memory sockets on the motherboard, ...) the negated statement is not true: There might still be a memory related problem even if memtest86 finds none.

dnilgreb · Aug 30, 2017

MrToddsFriends said:
There might still be a memory related problem even if memtest86 finds none.

OK. What other tests should be run, any suggestions?

Hopefully the errors will occur while testing single sticks somewhere, otherwise I guess more testing (outside of memtest?) would be the next step? Maybe it is the next step in either case?

Stux · Aug 30, 2017

dnilgreb said:
Hopefully the errors will occur while testing single sticks somewhere

Not necessarily true either. Sometimes you need to be using certain combinations of sticks for the errors to occur. I suggest checking out the memtest FAQ and forum. They go into a fair bit of detail about this sort of thing.

Hell, maybe increasing the voltage will cause the errors to go away... maybe underclocking will... so many unknowns, and unknowables.

Stux · Aug 30, 2017

Can you provide your system specifications? Mobo? CPU? and RAM brand/model etc

dnilgreb · Aug 31, 2017

Stux said:
Can you provide your system specifications? Mobo? CPU? and RAM brand/model etc

I can, but not from memory. Will take a look tonight when I´m home.

Stux said:
I suggest checking out the memtest faq and forum.

Will do. Thanks!

dnilgreb · Sep 4, 2017

OK. So I finally got my hands on some healthy memory, removed all the corrupted files and scrubbed the volume. All is good and back to normal.
Thanks very much everybody for all the help (and education)!
