RHOPKINS13
Cadet
- Joined: May 3, 2023
- Messages: 3
I have a TrueNAS SCALE 22.12.0 server in which we recently replaced a bad hard drive. During the resilver, the server would run for around an hour and then stop responding: I couldn't log in to the web GUI or SSH in, but ping still worked. Even with a monitor and keyboard plugged into the server it wouldn't shut down cleanly, and I had to resort to the magic SysRq REISUB sequence to reboot it.
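(Side note, in case anyone else ends up needing REISUB: it only works if the kernel's magic SysRq handler is enabled. A minimal sketch of checking and enabling it, assuming the standard /proc and sysctl.d interfaces; paths are the usual kernel ones, not anything TrueNAS-specific:)

```shell
# Check the current magic SysRq mask (1 = all functions enabled, 0 = disabled)
cat /proc/sys/kernel/sysrq

# Enable all SysRq functions for the current boot (needs root)
echo 1 > /proc/sys/kernel/sysrq

# Persist across reboots on distros that read /etc/sysctl.d
echo "kernel.sysrq = 1" > /etc/sysctl.d/90-sysrq.conf
```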
When it stopped responding, I would get emails like the following:
This is an automatically generated mail message from mdadm
running on mynas.mydomain.com
A Fail event had been detected on md device /dev/md125.
It could be related to component device /dev/sdf1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md125 : active raid1 sdf1[2](F) sdb1[1] sda1[0](F)
2095040 blocks super 1.2 [3/1] [_U_]
md126 : active raid1 sdh1[2] sde1[1](F) sdd1[0](F)
2095040 blocks super 1.2 [3/1] [__U]
md127 : active raid1 nvme2n1p4[1] nvme1n1p4[0]
16759808 blocks super 1.2 [2/2] [UU]
unused devices: <none>
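(To decode the mdstat lines above: `(F)` marks a failed member, and in the status brackets `U` means that slot is up while `_` means it's missing or failed, so `[3/1] [_U_]` is a three-member mirror with only one member still active. A quick way to count failed members, sketched here against one of the lines above inlined as sample data so it runs anywhere:)

```shell
# One array line from /proc/mdstat, inlined as sample data
mdstat='md125 : active raid1 sdf1[2](F) sdb1[1] sda1[0](F)'

# Count members flagged (F), i.e. failed; grep -o emits one line per match
failed=$(printf '%s\n' "$mdstat" | grep -o '(F)' | wc -l)
echo "failed members: $failed"
```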
The component device listed in these emails was different each time. I have a 10 Gbps NIC installed, connected via DAC. After several of these hangs, I left the DAC disconnected; the server still crashed, but it stayed up much longer before doing so (around 8 hours). I've finally managed to finish the resilver and have reconnected the DAC, and for the moment everything seems to be running smoothly. But I looked in dmesg and saw that I'm getting a PCIe Bus Error, after which the AHCI controller becomes unavailable. I've attached the dmesg output to this post; the errors start at timestamp 30865.475498.
Any idea what is going on? All drives are plugged directly into the motherboard; there's no HBA, and the SATA ports are in AHCI mode. All hard drives pass a generic short test in Seagate SeaTools, and I ran several passes of Memtest86+ without any issues. I'm trying to figure out whether it's a bad controller/motherboard, or whether there's a BIOS setting I need to change. When I first built the system I was using only the onboard network port, and I don't think I ran into these crashes until after I added the 10 Gbps NIC. Could the NIC be breaking things, or is it an issue with the PCI Express bus? I'm not sure where to go from here; what should my next troubleshooting step be? The system seems to be stable as long as it's not resilvering.
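(In case it's useful to anyone looking at the attachment, the AER lines can be filtered out of the kernel log and tallied by reporting device. A sketch below; the sample line is hypothetical, since real output will name your own device addresses — on a live system you'd feed it `dmesg` instead of the sample:)

```shell
# Hypothetical AER line standing in for real dmesg output
sample='pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:01:00.0'

# Filter AER / PCIe Bus Error lines and tally them by the reporting device (field 2)
printf '%s\n' "$sample" \
  | grep -E 'AER|PCIe Bus Error' \
  | awk '{print $2}' | sort | uniq -c
```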
Motherboard: ASRock X399M Taichi
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
RAM: Kingston 64GB (4x16GB) DDR4-3200 ECC RAM 9965745-026.A00G (Configured Speed 2667 MT/s)
Power Supply: CORSAIR RM850 850 Watt 80 Plus Gold Certified
GPU: PNY GeForce GT 710 2GB
NIC: 10GTek 10Gb PCI-E NIC Network Card Single SFP+ Port with Intel 82599EN Controller (https://www.amazon.com/gp/product/B01LZRSQM9/)
Drives:
- Seagate Exos x16 16TB Hard Drive x8 (RAID-Z2)
- Samsung SSD 970 EVO Plus 500GB x2 (Boot Pool Mirror)
- Samsung SSD 970 EVO Plus 1TB x1 (Cache)