False Smart Errors?


jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
This is a system that’s been online and running fine for years. For some reason I keep getting alerts about drives in one of our SAS JBODs.

We’ve replaced a few drives, but this is starting to become a daily thing. So either all the drives really are dying at the same time, or something else is going on.

Device: /dev/da51, failed to read SMART values
Device: /dev/da44, failed to read SMART values
Device: /dev/da38, failed to read SMART values

When I get an alert now, I manually check SMART and it says the drive is fine.

root@san3:~ # smartctl -H /dev/da51
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

root@san3:~ # smartctl -H /dev/da44
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

root@san3:~ # smartctl -H /dev/da38
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Does anyone have any idea what's wrong?
 

Nick2253

Wizard
Joined
Apr 21, 2014
Messages
1,633
Failing to read SMART values does not necessarily indicate a drive is failing. My guess is that either you have enough latency to cause a timeout, which generates the error, or you have a controller/expander that is on the fritz, creating your problems.
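
One thing worth checking (just a suggestion; adjust to your setup) is whether the kernel is logging mps or CAM errors around the time of the alerts, and whether the controller still sees every drive:

dmesg | grep -i mps
camcontrol devlist

If drives are briefly dropping off the bus or the driver is retrying commands, that will usually show up there before SMART ever complains.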
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I would agree with @Nick2253 that it could be a controller or expander. If all the errors are coming from the same enclosure and you have more than one (we don't know), it could be that the SAS expander in that enclosure is getting flaky.

On a side note, I would suggest that you don't just check the -H value: over the years I have had hundreds of drives that needed to be replaced, for perfectly valid reasons, but still came back with that completely useless "OK".
I don't even look at that field any more.
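
For SAS drives the full output is far more telling. Something along these lines (device name is just the example from above) will show the grown defect list, the read/write/verify error counters, and the self-test log, which are the fields actually worth watching:

smartctl -a /dev/da51
smartctl -l selftest /dev/da51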
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
I was recently emailed this error:
san3. kernel log messages:
> mps0: Out of chain frames, consider increasing hw.mps.max_chains.
> mps0: Out of chain frames, consider increasing hw.mps.max_chains.

But googling the problem didn't really turn up much information.

For example:
https://forums.freenas.org/index.php?threads/mps-lsi-hw-mps-max_chains.23067/

I upped the value to 4096 and haven't gotten that email since. But obviously I'm still having some kind of problem.
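
For reference, and as far as I understand it, hw.mps.max_chains is a boot-time loader tunable, so it has to be set in /boot/loader.conf (or as a loader-type Tunable in the FreeNAS GUI) and only takes effect after a reboot:

hw.mps.max_chains="4096"

You can confirm the running value afterwards with:

sysctl hw.mps.max_chains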
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Can you give us a rundown of the hardware you are using? It might help clarify where the problem is.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, here's what I can tell you.

You can set hw.mps.max_chains, which controls the maximum number of DMA chains allocated across all the adapters in the system.

You can use "sysctl" on the command line to read the current and low-water number of chains. For example, one of the hosts here has:

dev.mps.0.chain_free: 2048
dev.mps.0.chain_free_lowwater: 1818

Meaning all 2048 are free right now, and the free count has dipped as low as 1818.
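
(To read those on your own box, something like this should do it; substitute your controller numbers, or just grep for all of them:)

sysctl dev.mps.0.chain_free dev.mps.0.chain_free_lowwater
sysctl -a | grep chain_free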

If you are having problems with this, you will probably want to make sure a reasonable number is set per adapter, looking especially at the lowwater number above.

dev.mps.0.max_chains: 2048

is the per-controller variable. It looks to me like the best strategy would be to sum up the total of dev.mps.N.max_chains and use that, or something close to that, for hw.mps.max_chains but I don't know for sure.
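
(A quick way to see the per-controller values so you can add them up; the one-liner is just a convenience sketch and assumes the usual "name: value" sysctl output format:)

sysctl dev.mps | grep max_chains
sysctl -a | awk -F': ' '/^dev\.mps\.[0-9]+\.max_chains/ {sum += $2} END {print sum}'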

Now, since you have multiple controllers, it is possible that you stressed out the systemwide number when doing something significant on the pool (perhaps a scrub), so you might find that the lowwater numbers aren't *very* low (maybe in the 500-1000 range), in which case I think your change to hw.mps.max_chains was right on the money. Otherwise, you might need to bump up each mps device a little bit and then bump up hw.mps.max_chains too.

You've been running that system long enough (5 years now?) that you should really look at the lowwater numbers when you have a significant uptime and lots of heavy activity, and this will tell you the most about what tuning changes should be made. There probably isn't one single answer. Trust your instinct but feel free to run it by us here.
 