One or more devices has experienced an error resulting in data corruption.

Status
Not open for further replies.

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The HDDs are fine, so it could be the cables, the PSU, or the RAM (listed from most to least likely).

You can test the RAM with Memtest (a bootable program).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I don't see anything wrong with any of your drives, other than that they haven't had any SMART tests before, indicating you didn't properly burn-in/test them before installing them. Memory is the next thing I'd test, using memtest86+. Probably the best way to get that is as part of the Ultimate Boot CD.
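If you want to script that testing, here's a rough sketch (Python driving smartmontools; the FreeBSD-style device names are examples only, adjust them for your hardware) that kicks off long self-tests on each drive and later reads back the logs:

# Rough sketch: start long SMART self-tests on each drive, then check
# the results later. Assumes smartmontools is installed; the device
# names below are examples only.
import subprocess

drives = ["/dev/ada0", "/dev/ada1", "/dev/ada2"]  # adjust for your system

for dev in drives:
    # Kick off the extended self-test; it runs inside the drive's firmware
    # and smartctl returns immediately.
    subprocess.run(["smartctl", "-t", "long", dev], check=True)

# Hours later (long tests take a while on big drives), read the logs:
for dev in drives:
    subprocess.run(["smartctl", "-l", "selftest", dev], check=True)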
 

Jonas LV

Dabbler
Joined
Feb 12, 2016
Messages
13
... test, using memtest86+. Probably the best way to get that is as part of the Ultimate Boot CD.
I am downloading memtest86+ in UBCD. Will post results as soon as they're available.

In the meantime, maybe you could tell me what the best way forward would be in my case:

I have bought new RAM (still non-ECC, as ECC is not supported by my CPU); I could run the system on those.

One: should I buy a Xeon with ECC RAM? Is it worth it, bearing in mind that I want a system for backups that also acts as a simple home server: backups, SQL, Apache, PHP, media streaming to a number of devices, maybe a medium to stream RPi cams, etc.?

Two: I have considered starting to use MS Server (the newest version), so I could manage the NAS/server in a more familiar, layman-friendly environment. I just got a twin family 'extension' and simply have no time to struggle with tech issues.

Now I still think that having at least software RAID would give me a decent backup environment. Speed is not an issue; the main thing is to have the data secure and accessible from multiple devices.

I did like the idea of FreeNAS, but even while barely using it I ran into these issues and lost data. Maybe I would be better off using something I am more familiar with: NTFS with simple software mirroring of the HDDs and an easy-to-use UI? I am guessing a low-power rig is out of the window too...
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
One: should I buy a Xeon with ECC RAM?
You could, but to get it to use ECC you'd also need a new motherboard. It would probably be less expensive to buy a pre-built, low-end server like the Dell PowerEdge T20 (those start under US$200 last I looked).

I'll note that the recommendation for ECC is not at all unique to FreeNAS. Every operating system, and every filesystem, is vulnerable to data corruption caused by bad RAM. If anything, ZFS is less vulnerable than most, but the vulnerability is still there. We tend to be much more vocal in recommending ECC than most other communities because we have what is otherwise a pretty much bulletproof filesystem, and it's just a waste to pair that with RAM of unproven integrity.
Now I still think that having at least software RAID would give me a decent backup environment.
You've had software RAID with FreeNAS. You appear to have suffered a hardware failure that caused widespread data corruption on your pool. NTFS would not improve on this outcome; it wouldn't even have any way of telling you which files had problems.
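To see the difference in miniature, here's a toy sketch (not how ZFS is actually implemented, just the per-block checksum idea) of why a checksumming filesystem can name the damaged file while a plain one hands back flipped bits silently:

import hashlib

# Toy model of per-block checksumming, the property that lets ZFS
# report which files are corrupt. Not real ZFS code.
store = {}

def write(name, data):
    store[name] = (data, hashlib.sha256(data).digest())

def read(name):
    data, checksum = store[name]
    if hashlib.sha256(data).digest() != checksum:
        raise IOError(f"checksum mismatch: {name} is corrupt")
    return data

write("photo.jpg", b"\x01\x02\x03\x04")
# Simulate a bit flip from bad RAM or cabling after the write:
data, checksum = store["photo.jpg"]
store["photo.jpg"] = (b"\x01\x02\x03\x05", checksum)

try:
    read("photo.jpg")
except IOError as e:
    print(e)  # the checksumming filesystem names the damaged file
# A plain NTFS-style read would have returned the flipped bytes silently.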
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Any other system would have suffered the same data loss you're seeing now, except it wouldn't have told you and you would never have known. ZFS at least tries to fix it, or tells you. It's strange that none of your disks are bad and yet ZFS couldn't fix the data. I suspect your memory is bad. I'm interested in hearing the memtest results.
 

Jonas LV

Dabbler
Joined
Feb 12, 2016
Messages
13
Thank you, danb35. I am also looking at ReFS with Storage Spaces.

I am guessing I will try to cope with the current box, and some time in the future I will upgrade to a Xeon with ECC RAM and a new motherboard. I will definitely be very careful with power and will make sure a UPS is always in between. I really like assembling my own hardware so I have at least some flexibility to customize.

Memtest update: 53% done. So far, 0 errors.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You, uh, are nowhere near 53% done. You might be 53% done with a single pass, but you need to run memtest for days or even weeks to be reasonably assured of detecting problem memory.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
The problem with testing memory is that, while it can confirm a failure (i.e., you have a hard failure in your RAM somewhere), it can't confirm that there wasn't a transient problem that did something to hose your data. If it finds a problem, you know there's a problem, but if it doesn't, you don't really know it's all clear.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
but if it doesn't, you don't really know it's all clear.

Correct, but confidence increases as the days pass. When you look at the 5 year operational lifetime of a typical server, one week is 1/250th of that, so if it can go 1/250th of its life with no errors, the likelihood of an undetected error popping up later is much reduced.

Note that I suggest at least a week for ECC systems. For non-ECC, you may want to be more aggressive: as I said, weeks, plural.
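The arithmetic behind that 1/250 figure is easy to check:

# Back-of-the-envelope: one week as a fraction of a 5-year service life.
weeks = 5 * 52              # 260 weeks; call it roughly 250
print(f"one week is 1/{weeks}, i.e. {1 / weeks:.2%} of the machine's life")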
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
In your situation, you likely turned off the power while the system was doing something, and that caused the corruption. I'm a bit surprised at how bad it got, but it happens. We have no idea if it was RAM-related as well; that's speculation. Not having ECC RAM causes a lot of second-guessing. But I'll take it one step further: even if your RAM test fails, you still don't know if it's the RAM, motherboard, or power supply. That has nothing to do with ECC vs. non-ECC RAM; it's just the way things are. Computers can be fun or a pain to troubleshoot, depending on your perspective.

So heed the advice these knowledgeable people have provided and test your RAM for a few days at least, running non-stop. If there are no errors, then you can reasonably expect there to be nothing wrong with your system as a whole (motherboard, RAM, PSU). Your hard drives are good as well, but the data is of course corrupt, so you will need to destroy the pool and recreate it, when you are ready of course.

And, as everyone here is advising, try to upgrade to more appropriate server hardware, as it will do more to protect your data.

-Good Luck
 

Jonas LV

Dabbler
Joined
Feb 12, 2016
Messages
13
If I remember correctly, this is the article that led me to ignore the importance of ECC: http://blog.brianmoses.net/2015/01/diy-nas-2015-edition.html

Memory update: after almost a whole day, there were no errors.

I have decided to give MS Server 2012 R2 with ReFS and Storage Spaces a try. I will reassess in about a year if nothing goes wrong.

Thank you all for helping me out with the diagnosis. I really appreciate your help. Especially, thank you, danb35.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If I remember correctly, this is the article that led me to ignore the importance of ECC: http://blog.brianmoses.net/2015/01/diy-nas-2015-edition.html

Yeah, OMG, that's the chump fool who thinks that an 850 Evo is a good SLOG device, and that it's okay to share an SSD for SLOG and L2ARC, and that running L2ARC on a 16GB RAM system is a good idea.

I should start assembling a list of people whose advice is worth less than zero when building a NAS. Linus Sebastian, Brian Moses, hmm...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Yeah, OMG, that's the chump fool who thinks that an 850 Evo is a good SLOG device, and that it's okay to share an SSD for SLOG and L2ARC, and that running L2ARC on a 16GB RAM system is a good idea.

I should start assembling a list of people whose advice is worth less than zero when building a NAS. Linus Sebastian, Brian Moses, hmm...
You can add the people at Technikaffe.de to that list.
 

devnullius

Patron
Joined
Dec 9, 2015
Messages
289
You, uh, are nowhere near 53% done. You might be 53% done with a single pass, but you need to run memtest for days or even weeks to be reasonably assured of detecting problem memory.
For real? I always assumed a few passes would be enough! :| A quick Google search seems to indicate that one night at most should be enough? Have you seen it pass for a few cycles only to crash later on? I'm curious here :)
 

Linkman

Patron
Joined
Feb 19, 2015
Messages
219
Novice data point: when I built my FreeNAS box I ran memtest for a little over 48 hours. Basically, I started it Friday night, ignored the computer for the weekend, and then quit it and checked the results Monday morning.

I could see wanting to go longer if you have RAM that you already suspect might have an issue; I see the initial runs as more for catching bad and soon-to-be-bad memory (infant-mortality-type problems) right at the start.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For real? I always assumed a few passes would be enough! :| A quick Google search seems to indicate that one night at most should be enough?

Cite? Do you have a quote from someone who is competent at building server gear, or is this just some random YouTuber who figured out how to bodge together a desktop and happened to know about memtest86?

Sorry if that sounds harsh, but the average use case for memtest86 is someone identifying bad or not-quite-compatible desktop RAM.

Have you seen it pass for a few cycles only to crash later on? I'm curious here :)

For real. The burn-in time on a machine should exceed a thousand hours, which gives you a fair degree of certainty that you're past various infant mortality issues.

This is not your desktop machine. This is a machine that you need to be rock solid. Especially in the situation being discussed here (a machine that lacks ECC, and therefore any way to detect subtle problems after the fact), the only way to detect marginal memory is to test it repeatedly, hoping to catch that bad behaviour. Especially since the machine has already been shown to corrupt data.

Is a week overkill? Perhaps. You're 98% likely to find a bad DIMM on the first pass. So if we're strictly talking memory testing, maybe a week isn't really buying you that much. But consider again what I said above:

When you look at the 5 year operational lifetime of a typical server, one week is 1/250th of that, so if it can go 1/250th of its life with no errors, the likelihood of an undetected error popping up later is much reduced.

Since the burn-in period is over a month anyway, there's no harm in running a week of memtest86 at the start and again at the end, and running disk tests in between. That month isn't just for the memory. It's to allow disks to fail, fans to show themselves as marginal, and all other parts of the system to be strenuously tested and given the chance to fail.

A platform that you just throw together, run a one-pass memtest on, then load FreeNAS onto and start filling with your data seems to have about a 10% chance of ending badly.

A platform that you carefully put together and test for 1000+ hours, testing the way I've described in the burn-in document, that's a much safer place to house your data.
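To put rough numbers on that 98% figure, here's a quick sketch; the per-pass rates are illustrative assumptions, and marginal RAM is exactly the case where a single pass rarely catches the problem:

# Chance a problem slips past n memtest passes. A hard-bad DIMM is
# caught ~98% of the time per pass; marginal RAM that only misbehaves
# occasionally might trip a pass far less often (the 5% figure is a
# guess, purely for illustration).
for label, p_detect in (("hard failure", 0.98), ("marginal RAM", 0.05)):
    for n in (1, 10, 100):  # ~100 passes is on the order of a week
        p_miss = (1 - p_detect) ** n
        print(f"{label:12s} after {n:3d} passes: miss chance {p_miss:.3g}")

The hard failure shows up almost immediately; it's the marginal RAM that needs the week.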

None of you have to listen to me. Bully for you if you don't! Hah.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Novice data point: when I built my FreeNAS box I ran memtest for a little over 48 hours. Basically, I started it Friday night, ignored the computer for the weekend, and then quit it and checked the results Monday morning.

I could see wanting to go longer if you have RAM that you already suspect might have an issue; I see the initial runs as more for catching bad and soon-to-be-bad memory (infant-mortality-type problems) right at the start.

Oh, and, here, read what Supermicro does when qualifying new DIMMs.

https://www.supermicro.com/products/nfo/files/LRDIMM/Supermicro-LRDIMM.pdf

"As with other semiconductors, memory failure rates over time pass through three distinct phases: 1) early “infant mortality” phase; 2) random failure phase; and 3) wear-out failure phase. Margin testing helps identify potential failures in the early phase by operating the device up to three days at 45°C and varying the memory supply voltage by ±10%. The stress testing lasts five to seven days, depending on capacity, and includes the following steps:"

Note that I don't actually encourage you to bake your RAM for 3 days at 45°C. But I do encourage you to test the heck out of your memory.
 

Linkman

Patron
Joined
Feb 19, 2015
Messages
219
Thanks for the link.
 

devnullius

Patron
Joined
Dec 9, 2015
Messages
289
Learned something new :) Now back to looking at those SMART logs, because at first they read to me like broken disks XD

Thanks for the feedback on this. I understand where you are coming from: test, test, test to be 100% sure. That's fair :) In my view, that 'extreme' testing is all done in the factory; after that, I consider(ed) your 98% to be 100% :) That's why I felt a few cycles, a night at most, should be enough to detect faulty RAM. Your way really looks beyond that, for the tiniest cracks that might affect otherwise good, working RAM (fans, voltages, RAM cells on the brink of cracking, ...). Which is a good thing (before I get another rant on my 4ss ;p).

That said, it feels to me like in Jonas's case it's not a RAM problem in the sense that the modules themselves are 'bad'. It could even be a subtle crack in the mobo that has made the SATA ports unstable. And a power-off never helps either :( My advice to Jonas is to give up on this specific machine as a server. I have an i7 that is haunted too (replacing the RAM solved a lot of those problems, but that RAM checks out just fine in my other machine). The reason you should give up is that SO MANY almost-impossible-to-detect errors could be at the heart of it, all hardware-related, and not always testable with software :/ Better to use it as a gaming machine or something less important.

I recently picked up a Supermicro server with a 6-core CPU (and a second CPU ready to install). ECC RAM is actually cheaper than non-ECC RAM when you buy it used (many servers dumping their memory onto a small market). The server was 150 euro, the RAM less than 100 (22 GB). I think for 400 euro you can build a kick-ass system if you are willing to work with smaller disks to lower total costs. It will be worth it :)

Keep us informed?

Devvie
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
Note that I don't actually encourage you to bake your RAM for 3 days at 45°C. But I do encourage you to test the heck out of your memory.

At $OldJob, we used to qualify embedded computers for operation from -40°C to +85°C. The upper limit was only that high because the units were sealed, so the internal temperature got up to 85 or so.

Even with extended-temperature memory, we still had fairly high failure rates in memtest. With regular-temperature memory, very little would actually run for 24 hours in memtest without errors.

On the cold side it was less critical; even regular DIMMs had a fairly high pass rate.
 