SOLVED All disks degraded but SMART tests OK

Status
Not open for further replies.

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Wait, stop right there! Have you been virtualizing FreeNAS all this time?
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
No, just moved over as I thought it might give me more information than FreeNAS on its own. I moved over today.
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
By doing that, you're just going to make things worse by adding an extra layer of complexity and removing FreeNAS's ability to manage the drives.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Yes, but the extra layer of complexity might expose more symptoms that point to the underlying problem. Plus, if it goes to Lenovo they will be more likely to respond to those symptoms rather than a 'ZFS is teh checksomme!'
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
Right, it's been virtualised for 24 hours. I left it sitting without any heavy reads or writes, and there are 0 errors at the moment. I'll get a few things downloading and we'll see what happens with it.

Just so you know, I'm using SATA passthrough so the drives are purely managed by FreeNAS and not Proxmox.
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
Right. I'm going to call a close on this problem. We're 3 days in with 0 checksum errors and over 1TB of data transferred.

I think the issue was leftover errors from the bad RAM stick, and I wasn't diligent enough in cleaning them up the first time around.

Here's a condensed version of events for anyone facing similar issues:
  1. Disks became degraded very quickly, but it was all of the disks, and the errors were mirrored across disks in accordance with my RAID10 setup.
  2. I ran a memtest and found the offending ECC RAM stick.
  3. I removed the stick and reran the memtest without error (ECC was shown as active as well).
  4. I still saw checksum errors. I'm now putting these down to residual errors rather than new errors.
  5. I performed a scrub, which found more errors
  6. I removed ALL affected snapshots (which were a major cause of errors)
  7. I deleted ALL files that were affected
  8. Re-scrubbed
  9. Cleared errors
  10. All working well
Side note: I am now running FreeNAS virtualised using hardware passthrough and not virtualised disks. All working well so far.
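
For anyone who wants steps 5 to 9 above spelled out as commands, here is a rough Python sketch that only builds the command list and does not run anything. The pool name, snapshot names and file paths are placeholders (assumptions, not taken from this thread), and `recovery_plan` and `render` are hypothetical helper names. It relies only on the standard `zpool scrub`, `zpool status -v`, `zfs destroy` and `zpool clear` commands.

```python
import shlex


def recovery_plan(pool, snapshots, damaged_files):
    """Build the cleanup steps as argv lists; nothing is executed here.

    pool, snapshots and damaged_files are placeholders supplied by the
    caller, e.g. read by hand from the output of `zpool status -v`.
    """
    for snap in snapshots:
        # Refuse anything that is not a snapshot name, so a typo cannot
        # turn `zfs destroy` loose on a whole dataset.
        if "@" not in snap:
            raise ValueError(f"not a snapshot name: {snap!r}")

    plan = [
        ["zpool", "scrub", pool],          # step 5: scrub to surface damaged data
        ["zpool", "status", "-v", pool],   # list the files/snapshots with errors
    ]
    # Step 6: remove every affected snapshot.
    plan += [["zfs", "destroy", snap] for snap in snapshots]
    # Step 7: delete every affected file.
    plan += [["rm", "--", path] for path in damaged_files]
    plan += [
        ["zpool", "scrub", pool],          # step 8: scrub again
        ["zpool", "clear", pool],          # step 9: reset the error counters
        ["zpool", "status", "-v", pool],   # confirm the counters stay at 0
    ]
    return plan


def render(plan):
    """Turn the argv lists into shell-quoted lines for review."""
    return [shlex.join(argv) for argv in plan]


if __name__ == "__main__":
    example = recovery_plan(
        "tank",                              # hypothetical pool name
        ["tank/data@auto-1"],                # hypothetical snapshot
        ["/mnt/tank/data/bad file.bin"],     # hypothetical damaged file
    )
    print("\n".join(render(example)))
```

Note that `zpool scrub` returns immediately and the scrub runs in the background, so wait until `zpool status` reports that it has finished before moving on to the next step. Printing the plan rather than executing it is deliberate: the destroy and delete steps are irreversible, so it is worth reading the list first.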
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I, for one, am extremely troubled by the idea of a "bad ECC RAM" stick that did not either halt the system or pepper the IPMI logs with warnings. I find this resolution completely unsatisfactory.
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
@DrKK I'm sure you're right, but as an obvious newb to ECC and FreeNAS I'm not in a good position to dig any deeper. Unless there are any tests you'd like to see run, it's beyond my current skill set to give you an informed answer.
 

DJ9

Contributor
Joined
Sep 20, 2013
Messages
183
Sometimes things happen in software development, and I've seen it happen again, and again, and again. (oops!)

Just saying.
 

DJ9

Contributor
Joined
Sep 20, 2013
Messages
183
Sometimes things happen in software development, and I've seen it happen again, and again, and again. (oops!)

Just saying.

Awesome delay on the forum for posting. Please delete this one. ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
DJ9 said:
Sometimes things happen in software development, and I've seen it happen again, and again, and again. (oops!)

Just saying.

Awesome delay on the forum for posting. Please delete this one. ;)
I think it makes for a rather funny coincidence, given the content.

DrKK said:
I, for one, am extremely troubled by the idea of a "bad ECC RAM" stick that did not either halt the system or pepper the IPMI logs with warnings. I find this resolution completely unsatisfactory.
It's the second similar-sounding case we've seen lately and I'm growing worried. And we're talking about a big OEM system with a Xeon, so it has no excuse not to work.
 