False Smart Errors?


jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
This is a system that’s been online and running fine for years. For some reason I keep getting alerts about drives in one of our SAS JBODs.

We’ve replaced a few drives, but this is starting to become a daily thing. So either all the drives really are dying at the same time, or something else is going on.

Device: /dev/da51, failed to read SMART values
Device: /dev/da44, failed to read SMART values
Device: /dev/da38, failed to read SMART values

When I get an alert now, I manually check SMART and it says the drive is fine.

root@san3:~ # smartctl -H /dev/da51
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

root@san3:~ # smartctl -H /dev/da44
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

root@san3:~ # smartctl -H /dev/da38
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Does anyone have any idea what's wrong?
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Failing to read SMART values does not necessarily indicate a drive is failing. My guess is that either you have enough latency to cause a timeout, which generates the error, or you have a controller/expander that is on the fritz, creating your problems.
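
One thing worth checking (just a suggestion; adjust to your setup) is whether the kernel is logging mps or CAM errors around the time of the alerts, and whether the controller still sees every drive:

dmesg | grep -i mps
camcontrol devlist

If drives are briefly dropping off the bus or the driver is retrying commands, that will usually show up there before SMART ever complains.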
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would agree with @Nick2253 that it could be a controller or expander. If all the errors are coming from the same enclosure and you have more than one (we don't know), it could be that the SAS expander in that enclosure is getting flaky.

On a side note, I would suggest that you don't just check the -H value: over the years I have had hundreds of drives that needed to be replaced, for perfectly valid reasons, but still came back with that completely useless "OK".
I don't even look at that field any more.
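
For SAS drives the full output is far more telling. Something along these lines (device name is just the example from above) will show the grown defect list, the read/write/verify error counters, and the self-test log, which are the fields actually worth watching:

smartctl -a /dev/da51
smartctl -l selftest /dev/da51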
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
I was recently emailed this error:
san3. kernel log messages:
> mps0: Out of chain frames, consider increasing hw.mps.max_chains.
> mps0: Out of chain frames, consider increasing hw.mps.max_chains.

But googling the problem didn't really turn up much information.

For example:
https://forums.freenas.org/index.php?threads/mps-lsi-hw-mps-max_chains.23067/

I upped the value to 4096 and haven't gotten that email since. But obviously I'm still having some kind of problem.
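
For reference, and as far as I understand it, hw.mps.max_chains is a boot-time loader tunable, so it has to be set in /boot/loader.conf (or as a loader-type Tunable in the FreeNAS GUI) and only takes effect after a reboot:

hw.mps.max_chains="4096"

You can confirm the running value afterwards with:

sysctl hw.mps.max_chains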
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Can you give us a rundown of the hardware you are using? It might help clarify where the problem is.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, here's what I can tell you.

You can set hw.mps.max_chains, which controls the maximum number of DMA chains allocated across all the adapters in the system.

You can use "sysctl" on the command line to read the current and low-water number of chains. For example, one of the hosts here has:

dev.mps.0.chain_free: 2048
dev.mps.0.chain_free_lowwater: 1818

Meaning all 2048 are free right now, and the free count has dipped as low as 1818.
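
(To read those on your own box, something like this should do it; substitute your controller numbers, or just grep for all of them:)

sysctl dev.mps.0.chain_free dev.mps.0.chain_free_lowwater
sysctl -a | grep chain_free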

If you are having problems with this, you will probably want to make sure a reasonable number is set per adapter, looking especially at the lowwater number above.

dev.mps.0.max_chains: 2048

is the per-controller variable. It looks to me like the best strategy would be to sum up the total of dev.mps.N.max_chains and use that, or something close to that, for hw.mps.max_chains but I don't know for sure.
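
(A quick way to see the per-controller values so you can add them up; the one-liner is just a convenience sketch and assumes the usual "name: value" sysctl output format:)

sysctl dev.mps | grep max_chains
sysctl -a | awk -F': ' '/^dev\.mps\.[0-9]+\.max_chains/ {sum += $2} END {print sum}'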

Now, since you have multiple controllers, it is possible that you stressed out the systemwide number when doing something significant on the pool (perhaps a scrub), so you might find that the lowwater numbers aren't *very* low (maybe in the 500-1000 range), in which case I think your change to hw.mps.max_chains was right on the money. Otherwise, you might need to bump up each mps device a little bit and then bump up hw.mps.max_chains too.

You've been running that system long enough (5 years now?) that you should really look at the lowwater numbers when you have a significant uptime and lots of heavy activity, and this will tell you the most about what tuning changes should be made. There probably isn't one single answer. Trust your instinct but feel free to run it by us here.
 