TrueNAS Scale pool degraded after a power outage. 15 read errors on random drives. R720xd

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Apparently increasing the CAM timeout defaults fixes the errors (which are probably caused by a driver issue).

sysctl kern.cam.da.default_timeout=90
sysctl kern.cam.ada.default_timeout=60

You can set them in the tunables tab of the WebUI.
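
For anyone who wants to try the values from a shell first, here is a minimal sketch. It assumes a FreeBSD-based install (TrueNAS CORE); the `kern.cam.*` OIDs do not exist on a Linux-based system. The function names `cam_timeout_settings` and `apply_cam_timeouts` are made up for this example, and the 90/60 second values are simply the ones suggested above.

```shell
#!/bin/sh
# Sketch: print or apply the two CAM timeout values from the post above.
# Assumption: FreeBSD-based system (e.g. TrueNAS CORE), run as root.

DA_TIMEOUT=90    # seconds, value suggested in the post
ADA_TIMEOUT=60   # seconds, value suggested in the post

# Print the settings as "name=value" lines, the form sysctl accepts.
cam_timeout_settings() {
    printf 'kern.cam.da.default_timeout=%s\n' "$1"
    printf 'kern.cam.ada.default_timeout=%s\n' "$2"
}

# Apply the settings only if the OIDs exist; otherwise change nothing.
apply_cam_timeouts() {
    if sysctl -n kern.cam.da.default_timeout >/dev/null 2>&1; then
        cam_timeout_settings "$1" "$2" | while IFS= read -r setting; do
            sysctl "$setting"
        done
    else
        echo "kern.cam OIDs not found; nothing changed" >&2
        return 1
    fi
}

# Usage (hypothetical):
#   apply_cam_timeouts "$DA_TIMEOUT" "$ADA_TIMEOUT"
```

Values set this way at runtime are lost on reboot, which is why adding them as tunables in the WebUI is the better long-term option.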
 

cole4011

Dabbler
Joined
Sep 29, 2023
Messages
18
Thanks for the tip on this!
 

cole4011

So I've installed one of the LSI 9211-8i PCIe cards in the machine and am still getting the "IOC Fault 0x40007e23" error. I'm 1,000% sure this is not a hardware issue or a disk issue but an issue with FreeBSD (and Linux, because this originally ran TrueNAS Scale) and my system. Could be firmware; going to try upgrading everything possible AGAIN. 4 different HBAs, cables, motherboards, backplanes, power supplies, RAM, etc. If it could be replaced, it was replaced.

I'm not finding any more useful information about this, other than advice to use FreeNAS 11 and be done with this hellish problem. I was bald before this issue appeared, and the one good thing about all this is all my hair has fallen out and I'll never have to shave my head ever again.

If anyone else is experiencing this issue on an R720 (a lot of the affected systems I have read about are R720s), I urge you to beat it with a sledgehammer and build your own system. I have another TrueNAS server running very unconventional hardware (2x M.2 B-key 5-slot PCIe cards) and have not had any issues.

What I'm going to do:
1. Put the H310 mini back in the system.
2. Install FreeNAS 11 on a mirrored boot-pool
3. SCRUB SCRUB SCRUB
4. BACKUP BACKUP BACKUP
5. Increase all my prices for everything ever
6. Profit

Is anything specifically wrong with using FreeNAS 11 and never upgrading? This POS is only a glorified NFS share at the end of the day.
 

cole4011

Ohhhh, and if anyone has any hardware recommendations for a 24-bay 2.5-inch hot-swap chassis with rear hot-swap bays, I'm all ears. I crave the destruction of Dell at this point.
 

Davvo

If you are still getting errors with the latest release, I would try opening a bug report.

"3. SCRUB SCRUB SCRUB" and "4. BACKUP BACKUP BACKUP": I like it!
 

cole4011

Gotcha, I'll do that as well to save some others if possible.

Edit: Where is the proper place to submit bug reports for TrueNAS CORE?
 

cole4011

Some more info on this issue for others in the future:

 

Davvo

In that user's case, replacing the thermal compound on their HBAs solved the issue. Do you feel confident enough to do it? It looks like a different issue, though; it's known that hot HBAs are troublemakers.
 

cole4011

I've already re-pasted every HBA used and added a 40mm fan, but that did not solve my issue.
__________________
This part of the forum is useful though:

I've read somewhere that the .07 isn't the perfect choice (don't ask me where..)
I've the same card running in one of my servers with 20.00.06.00 without problems. Maybe give it a try?


That sounds like good advice. Here's the official word I just got from the bug report:

Updated by Alexander Motin about 3 hours ago
  • Status changed from Unscreened to 3rd party to resolve
I am sorry, but I don't see what we can do about this. This message: "mps0: IOC Fault 0x40007e23, Resetting" means that firmware of the HBA crashed and was restarted. We have no any relation to the firmware development, so the only thing I can recommend is contacting LSI support.

So at least I know what that error is now. And considering I'm having it on two cards in two different machines ... I'm inclined to believe it is indeed a firmware issue. I will try some downgrades and see how it goes.
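
To see how often the firmware is actually resetting before and after a downgrade, something like the following could help. It is only a sketch: it assumes the log lines look like the message quoted above ("mps0: IOC Fault 0x40007e23, Resetting"), possibly with a syslog timestamp in front, and `count_ioc_faults` is a name made up for this example.

```shell
#!/bin/sh
# Sketch: count "IOC Fault" firmware resets per controller and fault code.
# Assumption: lines contain text like "mps0: IOC Fault 0x40007e23, Resetting".

count_ioc_faults() {
    # $1: path to a saved log file (for example a copy of /var/log/messages)
    grep -E -o 'mps[0-9]+: IOC Fault 0x[0-9a-fA-F]+' "$1" |
        awk '{ count[$1 " " $4]++ }               # key: "mps0: 0x40007e23"
             END { for (k in count) print count[k], k }' |
        sort -k2                                   # order by controller name
}

# Usage (hypothetical path):
#   count_ioc_faults /var/log/messages
```

Each output line is a count followed by the controller and fault code, so a firmware change that helps should show the count stop growing for that controller.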
 