SOLVED All disks degraded but SMART tests OK

Status
Not open for further replies.

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
This is all on a TS140
Xeon 1226 v3
24GB ECC RAM
4x 3TB WD Red running in RAID 10 (two striped mirrors)

I had disk errors, so I wiped all the disks and started again because I thought it was something I'd done.

This is the output from zpool status
Code:
  pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h21m with 12 errors on Sun Feb 21 10:07:12 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            DEGRADED     0     0    91
      mirror-0                                      DEGRADED     0     0   100
        gptid/9e0c76e5-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0   100  too many errors
        gptid/9eba2dbf-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0   100  too many errors
      mirror-1                                      DEGRADED     0     0    82
        gptid/9f714294-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0    82  too many errors
        gptid/a032be8e-d6e2-11e5-b203-6c0b8408f021  DEGRADED     0     0    82  too many errors

errors: 20 data errors, use '-v' for a list
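
For anyone comparing counts across devices, the CKSUM column above can be pulled out with a small filter. This is only a sketch: `cksum_report` is a hypothetical helper name, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown in this output, with plain integer counts.

```shell
#!/bin/sh
# Print "device cksum_count" for every row of `zpool status` output
# whose CKSUM column (5th field) is a non-zero integer.
cksum_report() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $5 == "CKSUM" { in_table = 1; next }
        # The "errors:" summary line ends the table.
        /^errors:/ { in_table = 0 }
        # Rows with fewer than 5 fields (e.g. blank lines) are skipped.
        in_table && NF >= 5 && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }
    '
}
```

It would be used as `zpool status tank | cksum_report`. Identical counts on both halves of each mirror, as seen here, point away from a single bad disk or cable and toward something the disks share.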


Here's the smartctl output for one of the disks (they all output exactly the same results)

Code:
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       842         -
# 2  Short offline       Completed without error       00%       787         -
# 3  Short offline       Completed without error       00%       787         -
# 4  Short offline       Completed without error       00%       775         -
# 5  Short offline       Completed without error       00%       763         -
# 6  Short offline       Completed without error       00%       751         -
# 7  Short offline       Completed without error       00%       739         -
# 8  Short offline       Completed without error       00%       727         -
# 9  Short offline       Completed without error       00%       715         -
#10  Short offline       Completed without error       00%       704         -
#11  Short offline       Completed without error       00%       703         -
#12  Short offline       Completed without error       00%       691         -
#13  Short offline       Completed without error       00%       679         -
#14  Short offline       Completed without error       00%       667         -
#15  Short offline       Completed without error       00%       655         -
#16  Short offline       Completed without error       00%       643         -
#17  Short offline       Completed without error       00%       631         -
#18  Short offline       Completed without error       00%       622         -
#19  Short offline       Completed without error       00%       621         -
#20  Short offline       Completed without error       00%       620         -
#21  Short offline       Completed without error       00%       619         -


I've read a few other threads that reference bad SATA connections, but as the errors are on all disks I'm not sure it's that.

Any help would be wonderful.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
You should run a memtest to make sure you didn't have a DIMM go bad. Make sure all the connections to the disks are okay, but it looks like it could be a bad DIMM.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
The "SMART test" results you've posted to us are useless.

Let us see the ****ENTIRE**** output of
Code:
smartctl -qnoserial -x /dev/{drivedevice}
(note -x not -a, and the -qnoserial should prevent your serial number from being publicly outed) for each of the drives. Best to either pastebin it, or put it in "code" tags.
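
Running that command once per drive can be wrapped in a loop. A minimal sketch: `collect_smart` is a hypothetical helper name, and the device names in the usage line are assumptions, since the thread never lists them.

```shell
#!/bin/sh
# Dump the full smartctl report for each device given as an argument.
# -q noserial suppresses the serial number; -x prints all SMART and
# non-SMART information about the device (more than -a shows).
collect_smart() {
    for dev in "$@"; do
        echo "===== $dev ====="
        smartctl -q noserial -x "$dev"
        echo
    done
}
```

For example, `collect_smart /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 > smart_all.txt` would produce one file that can be pasted in full.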
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
Well done guys, you nailed it. memtest showed errors immediately, and after some trial and error removing the modules, I found the culprit.

It had never even crossed my mind. I've trashed my data 3 times because of that issue!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
How did you have bad ECC RAM, and not know it? This RAM should either correct itself, or halt the system. Neither appears to have happened here.

What gives?
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
I'm sorry to say that I have no idea. The RAM is definitely ECC, although I know that doesn't help solve the issue.

I have large data transfers running at the moment, and the system has been up without the bad RAM module for around an hour and a half. This is the zpool status output, which looks perfect to me:

Code:
  pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h21m with 12 errors on Sun Feb 21 10:07:12 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        gptid/9e0c76e5-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
        gptid/9eba2dbf-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        gptid/9f714294-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
        gptid/a032be8e-d6e2-11e5-b203-6c0b8408f021  ONLINE       0     0     0
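
A caveat on the output above: its scan line still reports the scrub from Sun Feb 21 10:07:12 2016, so no new scrub has completed since the module came out, and the zeroed counters only cover I/O since they were last reset. A fresh scrub is the more telling check. A minimal sketch, assuming the pool name tank from the output; `rescrub` and `POLL_SECONDS` are hypothetical names introduced here.

```shell
#!/bin/sh
# Reset the pool's error counters, start a scrub, wait for it to finish,
# then print the verbose status (including any files with errors).
rescrub() {
    pool=$1
    zpool clear "$pool" || return 1
    zpool scrub "$pool" || return 1
    # `zpool status` reports "scrub in progress" while the scrub runs.
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep "${POLL_SECONDS:-60}"
    done
    zpool status -v "$pool"
}
```

Calling `rescrub tank` and finding the CKSUM column still at 0 afterwards would be much stronger evidence than idle counters.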
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
DrKK said:
How did you have bad ECC RAM, and not know it? This RAM should either correct itself, or halt the system. Neither appears to have happened here.

What gives?
+1... inquiring minds want to know :D
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Well check me if I'm wrong, but really, what the hell is the purpose of ECC RAM if it will just let you go on your merry old way and corrupt things?

Is there an IPMI log? Does it show ECC errors detected?
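
Besides IPMI, the operating system's own log is worth a look. The sketch below assumes FreeBSD (which FreeNAS is built on) tags machine-check reports, including memory errors the CPU sees, with "MCA:" in the kernel log; `mca_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Filter kernel log text on stdin down to machine-check (MCA) lines.
# Assumption: the kernel tags these lines with "MCA:", as FreeBSD does.
mca_lines() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep 'MCA:' || true
}
```

Typical use would be `dmesg | mca_lines` or `mca_lines < /var/log/messages`. An empty result only means nothing was logged; it does not by itself prove that ECC is active.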
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
I'm dubious that you actually have working ECC RAM. As @DrKK said, ECC should have corrected the errors or shut down the system. Check the BIOS: is there any setting that could disable ECC? Also, can you post the make and model number of the RAM module that failed?
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
There's nothing shown in the IPMI log that looks untoward. Are there any tests anyone would like me to run?

The RAM module is a Kingston KTD-PE316ELV/8G

I can't see anything in the BIOS to suggest ECC options. Is there anything I should be looking for?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I didn't know what a poor decision that I'd made until now...
I don't agree.

Kingston RAM, with just one moment where they did a bait-and-switch on a certain SKU for which they were roundly criticized, has produced fairly reliable RAM. It's unfair to blame them for this, especially since this appears to be a different problem.

Again, what you claim to have is one of two situations: either the ECC RAM was in a failure state and the motherboard failed to halt the system when uncorrectable errors occurred, or you had correctable errors in RAM that the ECC RAM did not correct. I've literally never heard of the latter in my life. Not even once.

This is odd. Perhaps we can't rule out another problem with the system. If it's not "another problem", then you have ECC RAM which isn't doing shit, and is no better than regular RAM. But I'm out of ideas.

It would be nice if someone that understood this stuff better would weigh in with a possible explanation..... @jgreco @cyberjock
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
I need to do due diligence on what m0nkey_ suggested and check to be 100% certain that ECC is enabled. I've read up, and it seems I should be able to check it with memtest. The server is in my loft, so it's a real pain in the ass to get to, but I will do another memtest this week and take some pictures of the results for the curious.

https://www.pugetsystems.com/labs/articles/How-to-Check-ECC-RAM-Functionality-462/

@Mirfster fortunately, no, it was from Amazon (Prime)
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
OK still getting CKSUM errors, I'll have to test more thoroughly and try again, I've still got 3 sticks in there so could easily be one of them.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
The power supply is 280W. Could that be insufficient?

A bad motherboard?
 

m0nkey_

MVP
Joined
Oct 27, 2015
Messages
2,739
Rhys_O said:
OK still getting CKSUM errors, I'll have to test more thoroughly and try again, I've still got 3 sticks in there so could easily be one of them.
You're still getting errors, even with the bad RAM removed? That sounds like the trouble runs much deeper. I'm going to suggest that the motherboard may be bad in some way. Time to call Lenovo/IBM and get the ball rolling on an RMA?
 

Rhys_O

Dabbler
Joined
Jan 11, 2016
Messages
17
Urgh, still going with this.

I ran 4 passes over 3 hours and got 0 errors on memtest. I checked and memtest listed ECC as enabled so that's not the issue.

I've migrated over to Proxmox and did a fresh install of FreeNAS on there (imported my pool and FreeNAS backup) and am running a scrub at the moment. After all that I'm still seeing checksum errors popping up.

It's driving me insane!
 