AHCI Controller Unavailable during Resilver

RHOPKINS13

Cadet
Joined
May 3, 2023
Messages
3
I have a TrueNAS SCALE 22.12.0 server that we recently replaced a bad hard drive in. During the resilver, the server would run for around an hour and then stop responding. I couldn't log in to the Web GUI or SSH in, but ping still worked. Even with a monitor and keyboard plugged into the server, it wouldn't shut down cleanly, and I had to resort to REISUB to reboot it.
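For anyone who hasn't had to do this before: REISUB is the magic SysRq sequence (Alt+SysRq+R, E, I, S, U, B, a few seconds apart) that tells the kernel to unraw the keyboard, terminate and then kill all processes, sync and remount filesystems read-only, and finally reboot. A rough sketch of the same thing done from a root shell, for a box where a shell still answers but nothing else does:

# kernel.sysrq controls which functions the Alt+SysRq hotkey allows (1 = all);
# writing to /proc/sysrq-trigger as root should work regardless of that setting.
cat /proc/sys/kernel/sysrq

# r = unraw keyboard, e = SIGTERM everything, i = SIGKILL everything,
# s = sync filesystems, u = remount read-only, b = reboot immediately
for key in r e i s u b; do
    echo "$key" > /proc/sysrq-trigger
    sleep 3
done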

When it stopped responding, I would get emails like the following:
This is an automatically generated mail message from mdadm
running on mynas.mydomain.com

A Fail event had been detected on md device /dev/md125.

It could be related to component device /dev/sdf1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md125 : active raid1 sdf1[2](F) sdb1[1] sda1[0](F)
2095040 blocks super 1.2 [3/1] [_U_]

md126 : active raid1 sdh1[2] sde1[1](F) sdd1[0](F)
2095040 blocks super 1.2 [3/1] [__U]

md127 : active raid1 nvme2n1p4[1] nvme1n1p4[0]
16759808 blocks super 1.2 [2/2] [UU]

unused devices: <none>

The component device listed in these emails was different each time. I have a 10Gbps NIC installed, connected with a DAC cable. After several of these hangs, I left the DAC disconnected; the system still crashed, but it stayed up a lot longer before doing so (around 8 hours). I've finally managed to get the resilver done, I've reconnected the DAC, and for the moment everything seems to be running smoothly. But I looked in dmesg and saw logs showing a PCIe Bus Error, after which the AHCI controller becomes unavailable. I've attached the output from dmesg to this post; the errors start at timestamp 30865.475498.
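In case it's useful, this is roughly how I'm pulling those errors out of the logs (run as root for the full lspci output); the 0000:xx:00.x address is just a placeholder for whatever lspci reports on your system:

# PCIe error and AHCI messages from the kernel log
dmesg -T | grep -iE 'AER|DPC|pcieport|ahci'

# Identify the SATA controller and the NIC on the PCIe bus
lspci | grep -iE 'sata|ethernet'

# Link status and AER error registers for one specific device
lspci -vvv -s 0000:xx:00.x | grep -iE 'LnkSta|UESta|CESta'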

Any idea what is going on? All drives are plugged directly into the motherboard, no HBA, and the SATA ports are in AHCI mode. All hard drives pass a generic short test in Seagate SeaTools, and I've run several passes of Memtest86+ without any issues. I'm trying to figure out whether it's a bad controller/motherboard, or whether there's a BIOS setting or something I need to change. When I first built the system I was just using the onboard network port, and I don't think I ran into these crashes until after I added the 10Gbps NIC. Could the NIC be breaking things? Or is it an issue with the PCI Express bus? I'm not sure where to go from here; what should my next troubleshooting step be? The system seems to be stable as long as it's not resilvering.
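(For reference, the SMART equivalent of those SeaTools runs, done from the NAS shell, would be something along these lines; /dev/sda is just an example device.)

# Start an extended (long) self-test in the background on one drive
smartctl -t long /dev/sda

# Come back later for the result, the error log, and the key attributes
smartctl -a /dev/sda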

Motherboard: ASRock X399M Taichi
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
RAM: Kingston 64GB (4x16GB) DDR4-3200 ECC RAM 9965745-026.A00G (Configured Speed 2667 MT/s)
Power Supply: CORSAIR RM850 850 Watt 80 Plus Gold Certified
GPU: PNY GeForce GT 710 2GB
NIC: 10GTek 10Gb PCI-E NIC Network Card Single SFP+ Port with Intel 82599EN Controller (https://www.amazon.com/gp/product/B01LZRSQM9/)
Drives:
  • Seagate Exos x16 16TB Hard Drive x8 (RAID-Z2)
  • Samsung SSD 970 EVO Plus 500GB x2 (Boot Pool Mirror)
  • Samsung SSD 970 EVO Plus 1TB x1 (Cache)
Noctua NH-U9 TR4-SP3 CPU Cooler, Noctua NF-S12A FLX Case Fans x6
 

Attachments

  • dmesg_log.txt
    156 KB

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Older 10Gbps network cards, such as the X520 that uses the stated 82599EN chipset, have a tendency to run toasty hot. Your specs show a good number of fans, but are they arranged in a manner that pushes sufficient airflow and cooling over the PCIe slot area?

When I first built the system I was just using the onboard network port, and I don't think I ran into these crashes until after I added the 10Gbps NIC. Could the NIC be breaking things? Or is it an issue with the PCI Express bus? I'm not sure where to go from here; what should my next troubleshooting step be?

Personally, I would pull the 10Gbps NIC and run a scrub on the pool to generate a large amount of disk I/O and activity.
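From a root shell, that would look roughly like the following; "tank" is just a placeholder for your pool name:

# Kick off a scrub to generate sustained read I/O across the entire pool
zpool scrub tank

# Check progress and any read/write/checksum errors it surfaces
zpool status -v tank

# Watch the kernel log for AER/AHCI complaints while the scrub runs
dmesg -wT | grep --line-buffered -iE 'AER|ahci|pcieport'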
 

RHOPKINS13

Cadet
Joined
May 3, 2023
Messages
3
Personally, I would pull the 10Gbps NIC and run a scrub on the pool to generate a large amount of disk I/O and activity.
I already have a weekly scrub task scheduled, which I don't think has ever caused this crash before.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I already have a weekly scrub task scheduled, which I don't think has ever caused this crash before.

Fans, capacitors, PSUs, oxidation, and a variety of other things are all fickle and don't care how your system behaved last week or last year. Occasional maintenance is often necessary, and you were on a reasonable trajectory in your initial message. The point here is to isolate any problem factors that can be identified: heat from an ethernet chipset, reduced airflow causing something else to run uncomfortably warm, electrolytic capacitors past their service lifetime, etc. @HoneyBadger made a reasonable suggestion which needs to be considered, but if there are no obvious indications, I recommend using electronics contact cleaner on each PCIe card edge connector, on each DIMM and socket, and even pulling the CPU if need be, remembering that a CPU pull requires fresh thermal paste.
 
Joined
Jun 15, 2022
Messages
674
Thermal paste doesn't last forever; it gets cooked too. I just pulled the heatsinks off an LSI Host Bus Adapter, scraped off the old paste with a plastic razor blade, cleaned up the minimal remainder, re-coated lightly, and reassembled. Dropping the temperature 10°F is common, sometimes more, depending on whether it's a CPU, which heatsink it has, and how much airflow there is.
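If you want numbers before and after a re-paste, something along these lines gives a rough picture (assuming lm-sensors and smartmontools are available; /dev/sda is just an example):

# Board/CPU/chipset readings, if lm-sensors is installed and detects your sensors
sensors

# Drive temperature straight from SMART attributes (repeat per drive)
smartctl -A /dev/sda | grep -i temperature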
 

RHOPKINS13

Cadet
Joined
May 3, 2023
Messages
3
The whole build is less than a year old; the 10Gbps NIC was added less than six months ago. @HoneyBadger's recommendation was good, but if a scrub wasn't causing a crash before, it's unlikely to start crashing after pulling the NIC. To be fair, nothing in my original post indicated that I was already doing weekly scrubs, so @HoneyBadger couldn't have known that. So far it's only happened while doing a resilver, but the system does seem to last a lot longer with the DAC detached. So I'm thinking the PCIe bus is being overloaded, but I'm not sure what to do about it. I just don't know whether this indicates a faulty motherboard, or whether there's a BIOS setting or something I'm missing.

Looking online, I've found a few older posts about DPC and AER errors, some of which mention X399 motherboards and AMD Threadrippers. Some people seem to have been lucky enough to fix it with a BIOS update; others are using kernel options like pci=nommconf. I've been using Debian Linux as my daily driver for around six years now, but I've never dealt with these DPC and AER errors before, had no idea what either acronym stood for until recently, and was hoping someone here would have more insight. Fortunately (or unfortunately, for debugging purposes) this issue has only come up during a resilver, so it'll be hard to reproduce until the next time I need to replace a drive. But having to reboot a system that's hung in the middle of a resilver is scary.
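For the record, the workaround those posts describe is just a kernel command-line option. On my Debian desktop it would look roughly like this; I haven't verified where (or whether) TrueNAS SCALE lets you persist extra kernel parameters, so treat this as a sketch of the generic Debian approach:

# See what the kernel is currently reporting
dmesg -T | grep -iE 'AER|DPC|pcieport'

# Generic Debian: append pci=nommconf to the GRUB command line, e.g.
# GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"
sudoedit /etc/default/grub
sudo update-grub
sudo reboot

# Confirm the option took effect after the reboot
cat /proc/cmdline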
 
Joined
Jun 15, 2022
Messages
674
Whoa, wait a minute: 10GTek 10Gb PCI-E NIC
That's a Chinese (possibly chip-clone) manufacturer, and we've seen a lot of zebras on those cards due to sketchy firmware.

AMD Ryzen hasn't been a super-stable performer under TrueNAS either. While Ryzen is good hardware in general, the stress TrueNAS puts on the CPU and mainboard isn't what it was designed for, and it can choke in some configurations. When things are done wayyyyy differently, the way TrueNAS does them (which is why we use TrueNAS), the Hardware Guide becomes really important. No snobbery intended; the goal is for you to be running smoothly.
 