SOLVED All disks degraded but SMART tests OK

Rhys_O · Feb 21, 2016

This is all on a TS140
Xeon 1226 v3
24GB ECC RAM
4x 3TB WD RED running in Raid 10

I had disk errors, so I wiped all the disks and started again because I thought it was something I'd done.

This is the output from zpool status

Code:

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h21m with 12 errors on Sun Feb 21 10:07:12 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0    91
      mirror-0                                      DEGRADED     0     0   100
        gptid/9e0c76e5-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0   100  too many errors
        gptid/9eba2dbf-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0   100  too many errors
      mirror-1                                      DEGRADED     0     0    82
        gptid/9f714294-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0    82  too many errors
        gptid/a032be8e-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0    82  too many errors

errors: 20 data errors, use '-v' for a list
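With four devices and two mirrors all showing counts, it can help to reduce the status output to just the rows with a non-zero CKSUM column. The following is a minimal sketch, not part of the original post: `cksum_report` is a hypothetical helper name, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above.

```shell
# Print "name cksum" for every pool, vdev, or device row whose CKSUM
# column is non-zero. Reads `zpool status` text on stdin.
# Assumes the column layout: NAME STATE READ WRITE CKSUM [message].
cksum_report() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ && NF >= 5 && $5 != "0" {
        print $1, $5
    }'
}

# Typical use on the live system (pool name "tank" as in the post);
# -v also lists the files affected by the data errors:
#   zpool status -v tank | cksum_report
```

Identical counts on both sides of each mirror, as seen here, point away from a single disk or cable and toward something the disks share, such as memory, the controller, or power.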

Here's the smartctl output for one of the disks (they all output exactly the same results)

Code:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       842         -
# 2  Short offline       Completed without error       00%       787         -
# 3  Short offline       Completed without error       00%       787         -
# 4  Short offline       Completed without error       00%       775         -
# 5  Short offline       Completed without error       00%       763         -
# 6  Short offline       Completed without error       00%       751         -
# 7  Short offline       Completed without error       00%       739         -
# 8  Short offline       Completed without error       00%       727         -
# 9  Short offline       Completed without error       00%       715         -
#10  Short offline       Completed without error       00%       704         -
#11  Short offline       Completed without error       00%       703         -
#12  Short offline       Completed without error       00%       691         -
#13  Short offline       Completed without error       00%       679         -
#14  Short offline       Completed without error       00%       667         -
#15  Short offline       Completed without error       00%       655         -
#16  Short offline       Completed without error       00%       643         -
#17  Short offline       Completed without error       00%       631         -
#18  Short offline       Completed without error       00%       622         -
#19  Short offline       Completed without error       00%       621         -
#20  Short offline       Completed without error       00%       620         -
#21  Short offline       Completed without error       00%       619         -

I've read a few other threads that reference bad SATA connections, but as the errors are on all disks I'm not sure it's that.

Any help would be wonderful.

Mlovelace · Feb 21, 2016

You should run a memtest to make sure you didn't have a dimm go bad. Make sure all the connections are okay to the disks, but it looks like it could be a bad dimm.

BigDave · Feb 21, 2016

Mlovelace said:
You should run a memtest to make sure you didn't have a dimm go bad. Make sure all the connections are okay to the disks, but it looks like it could be a bad dimm.

+1 ^^^^^^^^^^^^

See this stickied post from senior member @jgreco **BigDave bows low** :D
Building, Burn-In, and Testing your FreeNAS system

DrKK · Feb 21, 2016

The "SMART test" results you've posted to us are useless.

Let us see the ****ENTIRE**** output of

Code:

smartctl -qnoserial -x /dev/{drivedevice}

(note -x, not -a; the -qnoserial should keep your serial number from being posted publicly) for each of the drives. Best to either pastebin it or put it in "code" tags.
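To collect that report for all four drives in one go, something like the sketch below should work. It is only an illustration: `collect_smart` is a made-up helper, the `ada0`..`ada3` device names in the usage comment are assumptions (check your own device list first), and the `SMARTCTL` variable exists only so the loop can be dry-run without real disks.

```shell
# Save the full `smartctl -x` report for each drive to its own file.
# $1 is the output directory; the remaining arguments are device names
# without the /dev/ prefix. SMARTCTL may be overridden for a dry run.
collect_smart() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # smartctl's exit status is a bitmask and is often non-zero even
        # when the report is fine, so it is deliberately not checked here.
        "${SMARTCTL:-smartctl}" -qnoserial -x "/dev/$dev" > "$outdir/$dev.txt" 2>&1
    done
}

# Example (device names are assumptions, adjust to your system):
#   collect_smart /tmp/smart ada0 ada1 ada2 ada3
```

Each resulting text file can then be pasted into a pastebin or code tags as requested.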

Rhys_O · Feb 21, 2016

Well done guys, you nailed it. memtest showed errors immediately, and after some trial and error removing modules I found the culprit.

It had never even crossed my mind. I've trashed my data 3 times because of that issue!

DrKK · Feb 21, 2016

How did you have bad ECC RAM, and not know it? This RAM should either correct itself, or halt the system. Neither appears to have happened here.

What gives?

Rhys_O · Feb 21, 2016

I'm sorry to say that I have no idea. The RAM is definitely ECC, although I know that doesn't help solve the issue.

I have mass data exchanges happening at the moment, and the system has been up and running without that RAM module for around an hour and a half. This is the zpool status output, which looks perfect to me.

Code:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h21m with 12 errors on Sun Feb 21 10:07:12 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/9e0c76e5-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
        gptid/9eba2dbf-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/9f714294-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
        gptid/a032be8e-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0

Mirfster · Feb 21, 2016

DrKK said:
How did you have bad ECC RAM, and not know it? This RAM should either correct itself, or halt the system. Neither appears to have happened here.

What gives?

+1... inquiring minds want to know :D

DrKK · Feb 21, 2016

Well check me if I'm wrong, but really, what the hell is the purpose of ECC RAM if it will just let you go on your merry old way and corrupt things?

Is there an IPMI log? Does it show ECC errors detected?

m0nkey_ · Feb 21, 2016

I'm dubious about the fact you have ECC RAM. As @DrKK said, ECC should have corrected or shut down the system. Check in the BIOS, do you have anything that may disable the ECC capabilities? Also, can you post the make/model number of the RAM you're using that failed?

Rhys_O · Feb 21, 2016

There's nothing shown in the IPMI log that looks untoward. Are there any tests anyone would like me to run?

The RAM module is a Kingston KTD-PE316ELV/8G

I can't see anything in the BIOS to suggest ECC options. Is there anything I should be looking for?
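One thing that can be checked from the running OS, assuming `dmidecode` is installed, is what the firmware advertises for the memory array. The sketch below is an assumption-laden illustration rather than a definitive test: `ecc_type` is a hypothetical helper, and the reported value only shows what the BIOS claims, not that error correction is actually working.

```shell
# Extract the "Error Correction Type" value from `dmidecode -t memory`
# output on stdin. Prints one line per physical memory array.
ecc_type() {
    sed -n 's/^[[:space:]]*Error Correction Type:[[:space:]]*//p'
}

# Typical use (needs root):
#   dmidecode -t memory | ecc_type
# A value of "None" would suggest the board is not running the modules
# in ECC mode; an ECC value only means the firmware reports ECC.
```

In the same `dmidecode -t memory` output, an ECC module normally shows a Total Width of 72 bits against a Data Width of 64 bits.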

Robert Trevellyan · Feb 21, 2016

Rhys_O said:
Kingston

Rhys_O · Feb 22, 2016

I didn't know what a poor decision that I'd made until now...

DrKK · Feb 22, 2016

Rhys_O said:
I didn't know what a poor decision that I'd made until now...

I don't agree.

Kingston RAM, with just one moment where they did a bait-and-switch on a certain SKU for which they were roundly criticized, has produced fairly reliable RAM. It's unfair to blame them for this, especially since this appears to be a different problem.

Again, what you claim to have is a situation in which ECC RAM was in a failure state and the motherboard failed to halt the system when uncorrectable errors occurred; or you had correctable errors in RAM that the ECC RAM did not correct. I've literally never heard of the latter in my life. Not even once.

This is odd. Perhaps we can't rule out another problem with the system. If it's not "another problem", then you have ECC RAM which isn't doing shit, and is no better than regular RAM. But I'm out of ideas.

It would be nice if someone that understood this stuff better would weigh in with a possible explanation..... @jgreco @cyberjock

Mirfster · Feb 22, 2016

Umm, did you get it off of eBay for a "great" price from a seller in China?

Maybe yours looks something like this: Fake kingston ddr2 china memory

Rhys_O · Feb 22, 2016

I need to do my due diligence on what m0nkey_ suggested and check to be 100% certain that ECC is enabled. From what I've read, I should be able to do that with memtest. The server is in my loft, so it's a real pain in the ass to get to, but I will run another memtest this week and take some pictures of the results for the curious.

https://www.pugetsystems.com/labs/articles/How-to-Check-ECC-RAM-Functionality-462/

@Mirfster fortunately, no, it was from Amazon (Prime)

Rhys_O · Feb 22, 2016

OK still getting CKSUM errors, I'll have to test more thoroughly and try again, I've still got 3 sticks in there so could easily be one of them.

rs225 · Feb 22, 2016

The power supply is 280W. Could that be insufficient?

A bad motherboard?

m0nkey_ · Feb 22, 2016

Rhys_O said:
OK still getting CKSUM errors, I'll have to test more thoroughly and try again, I've still got 3 sticks in there so could easily be one of them.

You're still getting errors, even with the bad RAM removed? Sounds like the trouble is much deeper. I'm going to suggest that maybe the motherboard is bad in some way. Time to call Lenovo/IBM and get the ball rolling on an RMA?

Rhys_O · Feb 26, 2016

Urgh, still going with this.

I ran 4 passes over 3 hours and got 0 errors on memtest. I checked, and memtest listed ECC as enabled, so that's not the issue.

I've migrated over to Proxmox and done a fresh install of FreeNAS on it (importing my pool and FreeNAS backup), and I'm running a scrub at the moment. Even after all that, I'm still seeing checksum errors.
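When repeating scrubs after each hardware change, it helps to reset the counters first and then compare the error count reported on the `scan:` line from run to run. A minimal sketch follows; `scrub_errors` is a hypothetical helper, and it assumes the "scrub repaired ... with N errors on ..." wording seen in the status output earlier in this thread.

```shell
# Print the error count from the "scan:" line of `zpool status` output
# read on stdin. Prints nothing if no completed scrub is reported.
scrub_errors() {
    sed -n 's/^[[:space:]]*scan: scrub repaired .* with \([0-9][0-9]*\) errors on .*/\1/p'
}

# Possible sequence on the live system (pool name "tank" as in the thread):
#   zpool clear tank      # reset the READ/WRITE/CKSUM counters
#   zpool scrub tank      # start a fresh scrub
#   zpool status tank | scrub_errors   # once the scrub has finished
```

A count that stays above zero after the suspect module is out suggests the remaining errors come from data that was already written corrupt, or from a fault that is still present.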

It's driving me insane!
