SOLVED ...error resulting in data corruption. Applications may be affected.

Status
Not open for further replies.

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
You verified that the error follows the memory when you move it from one slot to another? And that it doesn't stay in the slot?

Also, hang on to your faulty memory (label it!). You now have a way to test that ECC is working... sorry, that is, if you had ECC memory. Faulty non-ECC memory is junk.

When you get your new memory, run memtest for a day or so. Then boot up your FreeNAS.

FIRST scrub the pool.

After that, then make a decision.

EDIT: btw, while waiting on the memory, nothing says you can't scrub your pool with only 24GB of memory installed.
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
Are you saying I should scrub the pool before deleting corrupted files? Hoping that the files will be repaired?

After that, then make a decision.
Is the decision I am making after scrubbing whether to delete the files or not? If it scrubs without errors, is it safe to keep them?

And yes, I moved it to another slot and still had the error. I tested all the DIMMs in their respective slots, so the slots seem OK.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Are you saying I should scrub the pool before deleting corrupted files? Hoping that the files will be repaired?

Yes.

Is the decision I am making after scrubbing whether to delete the files or not? If it scrubs without errors, is it safe to keep them?

Yes. Since those files are Time Machine backups, I'd suggest using the Time Machine client to reset its backup. It should erase the files as part of that.

And yes, I moved it to another slot and still had the error. I tested all the DIMMs in their respective slots, so the slots seem OK.

Good news. Guess you have an appreciation for ECC memory now :)
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
Great!
Yes. Way back when I got into FreeNAS, I used old hardware I already had.
If I were to build a new machine for FreeNAS I would go with ECC, but that is purely based on recommendation. How would this type of problem have played out if I had ECC? What would have been different once a DIMM went bad?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Great!
Yes. Way back when I got into FreeNAS, I used old hardware I already had.
If I were to build a new machine for FreeNAS I would go with ECC, but that is purely based on recommendation. How would this type of problem have played out if I had ECC? What would have been different once a DIMM went bad?

There would have been no problem.

You would've found out you had a memory error because you'd have found a system event log entry about it being corrected. Then you would've tested/replaced the DIMM.

There would be no corruption or crashing.

If the memory went heinously bad and had a double-bit error (a bit like a double disk error), then, depending on OS configuration, your system would've frozen with a memory exception as soon as the faulty data was accessed.

Again, no corruption. And you find out *instantly* when your memory goes bad.

Modern ECC systems will actually scrub all system memory every 24 hours too, so they can detect memory failures before they even affect anything.
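For anyone wondering how "single-bit errors get corrected, double-bit errors get detected" can work at all, here is a toy sketch in Python. It is only an illustration under stated assumptions: real ECC DIMMs typically protect 64 data bits with 8 check bits in hardware, whereas this uses a tiny extended Hamming(8,4) code on 4 data bits, and the names `encode` and `decode` are made up for the example.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (toy SECDED, not real DIMM ECC)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4; overall parity at 0.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code

def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    fixed = list(code)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat it as one flipped bit; the syndrome is its position
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped, cannot be repaired.
        return "uncorrectable", None
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble

word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1           # simulate a single bad bit
two_flips = list(one_flip)
two_flips[2] ^= 1          # simulate a second bad bit in the same word

print(decode(word))        # ('ok', 11)
print(decode(one_flip))    # ('corrected', 11)
print(decode(two_flips))   # ('uncorrectable', None)
```

The "uncorrectable" case is the one where the system stops with a memory exception instead of handing bad data to ZFS.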
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
Thanks for the explanation. ECC definitely seems the way to go, but for now I'm stuck with what I've got.

Here's my progress:
Scrubbed my pool with the 24GB of memory I had left. None of my three files were repaired (2x Time Machine bands and 1x video). I deleted the video file, and went into Time Machine to remove all backups back to Aug 19, which was the date shown on the TM files by ls -la. The files still existed, so I ran a new TM backup, and then a new scrub. One of the two was repaired, one to go. Will look into whether there is a way to clear the error without deleting all backups.

Anyways, got my new memory stick yesterday, went home and plugged it in, ran memtest (with all 32GB).
Looking at it this morning, after running 5 hrs and 30 mins of memtest, I have 2 errors... F*ck.
Back to troubleshooting DIMMs and slots.

Currently running single DIMM, but here's the question:
Assuming I have another bad DIMM (or slot for that matter), can I expect the error to occur in the same place/test every time? Or is there a chance it could occur somewhere else, or not even at all next time?

Memtest says
Code:
RAM may be vulnerable to high frequency row hammer bit flips
RAM may be vulnerable to high frequency row hammer bit flips
Test: 13 Addr: 2FA0466EC Expected: F1D79C7E Actual: B1D79C7E CPU: 0
Test: 13 Addr: 71F086878 Expected: D7FD2C8D Actual: D6FD2C8D CPU: 0
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
The problem is that the bands are what make up the TM disk image, which TM mounts and then uses to make multiple dependent backups.

Deleting a band would be bad.

Currently running single DIMM, but here's the question:
Assuming I have another bad DIMM (or slot for that matter), can I expect the error to occur in the same place/test every time? Or is there a chance it could occur somewhere else, or not even at all next time?

Memtest says
Code:
RAM may be vulnerable to high frequency row hammer bit flips
RAM may be vulnerable to high frequency row hammer bit flips
Test: 13 Addr: 2FA0466EC Expected: F1D79C7E Actual: B1D79C7E CPU: 0
Test: 13 Addr: 71F086878 Expected: D7FD2C8D Actual: D6FD2C8D CPU: 0

Depends on the error. Sometimes yes. Some Linux kernels even let you feed the kernel a list of bad memory blocks to avoid.

Best to replace the RAM.

I'm not 100% certain, it depends what Test 13 is... is that the row hammer test? Row hammer is not *supposed* to affect realistic workloads.

If Test 13 is not the row hammer test, then yes, you have faulty bits in your RAM.
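As a quick sanity check on the two memtest lines quoted above, XORing the Expected and Actual values shows exactly which bits differ. A minimal Python sketch follows; the values are copied from the output in this thread, and `flipped_bits` is just a helper name made up for the example.

```python
def flipped_bits(expected: int, actual: int) -> list:
    """Return the positions (0 = least significant) of bits that differ in a 32-bit word."""
    diff = expected ^ actual
    return [bit for bit in range(32) if diff >> bit & 1]

# (address, expected, actual) taken from the memtest output in this thread
errors = [
    (0x2FA0466EC, 0xF1D79C7E, 0xB1D79C7E),
    (0x71F086878, 0xD7FD2C8D, 0xD6FD2C8D),
]

for addr, expected, actual in errors:
    print(f"{addr:#x}: bits flipped = {flipped_bits(expected, actual)}")
# 0x2fa0466ec: bits flipped = [30]
# 0x71f086878: bits flipped = [24]
```

Both errors are a single flipped bit in one word, which is the kind of error ECC memory corrects on the fly.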
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
Deleting a band would be bad.
Yes, I've gathered as much, and am looking for a way to correctly get rid of the corrupt data, if possible. Otherwise I'll just reset TM.

If Test 13 is not the row hammer test, then yes, you have faulty bits in your ram.
memtest86.com tells me that test 13 is the Hammer test. I think I'm going to run memtest on all sticks individually, and when I find the one reporting the error, try that one in different slots. Then I'll get a warranty replacement if it's bad. If the error doesn't occur again, could I safely assume everything is all right?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Yes, I've gathered as much, and am looking for a way to correctly get rid of the corrupt data, if possible. Otherwise I'll just reset TM.

TM doesn't replace files which are unmodified in any previous backup. So the corruption will stay until you remove all backups which reference the dodgy file, and there's almost no way to tell which file that is (you could do an exhaustive file compare, but this will take an eternity).

I'd reset the backup. If you want you can rename the image or snapshot it, and keep it for a while, but you'll have the errors until the bands are gone.

At the end of the day, this sort of corruption is what ZFS is designed to prevent... or at least detect.

memtest86.com tells me that test 13 is the Hammer test. I think I'm going to run memtest on all sticks individually, and when I find the one reporting the error, try that one in different slots. Then I'll get a warranty replacement if it's bad. If the error doesn't occur again, could I safely assume everything is all right?

Nope. Maybe you could assume it's okay if you ran the memtest for a week, but all you can say is that a memory error didn't occur while you were testing. Maybe reseating the memory cleaned some dust off the contacts? Who knows. ECC knows ;)
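To put a rough number on why one clean run proves little, here is a back-of-the-envelope sketch in Python. It rests on a big assumption: that errors show up randomly at a constant average rate (a Poisson process). Real faults can depend on temperature, access pattern, or which test is running, so treat this as an illustration only. The rate is taken from the 2 errors in 5.5 hours reported earlier in this thread, and `p_clean_run` is a made-up helper name.

```python
import math

def p_clean_run(errors_per_hour: float, hours: float) -> float:
    """Probability of seeing zero errors in a run, assuming Poisson-distributed errors."""
    return math.exp(-errors_per_hour * hours)

# Assumed rate: 2 errors observed in 5.5 hours of memtest
rate = 2 / 5.5

for hours in (5.5, 24, 24 * 7):
    print(f"{hours:>5} h: P(no errors) = {p_clean_run(rate, hours):.3g}")
```

Under that assumption, a fault at this rate would still pass a 5.5 hour run roughly 13.5% of the time (e^-2), which is why longer runs are more convincing, and why a clean run alone still doesn't prove the memory is good.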

As an aside, I suspect most spontaneous crashes on modern OSes are due to either faulty memory or faulty thermals. It's a criminal shame that Intel has played games with ECC for so long that we haven't migrated to ECC memory 100% for all modern systems.
 

MrToddsFriends

Documentation Browser
Joined
Jan 12, 2015
Messages
1,338
If the error doesn't occur again, could I safely assume everything is all right?

Many (15?) years ago I had a memory problem resulting in unwanted behavior (program / OS crashes) that was not revealed with memtest86. So while it holds true in my experience that there is a problem if memtest86 finds one (which can be related to memory modules, CPU caches, memory sockets on the motherboard, ...) the negated statement is not true: There might still be a memory related problem even if memtest86 finds none.
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
There might still be a memory related problem even if memtest86 finds none.
OK. What other tests should be run, any suggestions?

Hopefully the errors will occur while testing single sticks somewhere, otherwise I guess more testing (outside of memtest?) would be the next step? Maybe it is the next step in either case?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Hopefully the errors will occur while testing single sticks somewhere

Not necessarily true either. Sometimes you need to be using certain combinations of sticks for the errors to occur. I suggest checking out the memtest FAQ and forum. They go into a fair bit of detail about this sort of thing.

Hell, maybe increasing the voltage will cause the errors to go away... maybe underclocking will... so many unknowns, and unknowables.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Can you provide your system specifications? Mobo? CPU? and RAM brand/model etc
 

dnilgreb

Contributor
Joined
Mar 29, 2016
Messages
168
OK. So I finally got my hands on some healthy memory, removed all the corrupted files and scrubbed the volume. All is good and back to normal.
Thanks very much everybody for all the help (and education)!
 