System Rebooting After Hardware Upgrade

damaddbomma

Dabbler
Joined
Aug 27, 2019
Messages
13
I hope someone here can help. I was attempting to upgrade my system by adding a second SAS HBA so I could add more drives later. When I rebooted, the original HBA did not show up, but the new one did. So I removed the second one, and the first one showed up again and the system booted as before. After it gets to the menu on the system monitor, the server operates normally for a while, then the error messages in the attached pictures appear and the system reboots. The system stays online long enough that I can see the UI, and the pool looks normal, but apparently right before it reboots it sends an email indicating the pool is offline.

My motherboard is a SuperMicro X9SCM-F with 32 GB of memory.

I need some direction....

Thanks!
 

Attachments

  • 20230118_210417.jpg (260.6 KB)
  • 20230118_210429.jpg (316.4 KB)
  • 20230118_210431.jpg (315.5 KB)
  • 20230118_210432.jpg (310.3 KB)

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
If all was good before, I would suggest checking that the existing HBA or other hardware components weren't moved or slightly unseated when you inserted and removed the new HBA.
 

damaddbomma

Thanks for the suggestion, but I tried that. I pulled and reseated them, and made sure that the screw was torqued down.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
I'm thinking the pool never mounted correctly and takes itself offline.

My troubleshooting advice (hope it helps; use ONLY the original HBA):
1) During bootup, does the HBA display ALL the connected hard drives? (I recommend disconnecting the boot drive so you don't continue booting into TrueNAS; that isn't what we are looking at here.)
2) Does the HBA have the ORIGINAL drives connected to it? I assume you added a second HBA to add more drives. Did you mix up the drives that make up the VDEV/pool?
3) If not all the drives are recognized while it's booting (before TrueNAS), fix this problem first.

That is all I've got for now; let's see what you find out.
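
The first two checks come down to comparing the drives the pool is supposed to have against the drives the HBA lists before the OS loads. Here is a minimal sketch of that comparison in Python. It assumes you have the serial numbers of the pool's drives written down somewhere; the function name and the serial numbers are made up for illustration and do not come from any TrueNAS tool.

```python
def compare_drives(expected_serials, detected_serials):
    """Compare the pool's known drives against what the HBA reports.

    Returns (missing, unexpected) as two sorted lists of serial numbers.
    """
    expected = set(expected_serials)
    detected = set(detected_serials)
    missing = sorted(expected - detected)      # pool drives the HBA did not list
    unexpected = sorted(detected - expected)   # drives listed that are not in the pool
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical serial numbers, for illustration only.
    pool_drives = ["SN-A001", "SN-A002", "SN-A003", "SN-A004"]
    seen_at_boot = ["SN-A001", "SN-A002", "SN-A004", "SN-SPARE"]

    missing, unexpected = compare_drives(pool_drives, seen_at_boot)
    if missing:
        print("Fix this first, pool drives not seen by the HBA:", missing)
    if unexpected:
        print("Drives seen that are not part of the pool:", unexpected)
    if not missing and not unexpected:
        print("HBA sees exactly the pool's drives; look elsewhere.")
```

If the missing list is empty, the HBA and cabling are probably not where the problem is, and it makes sense to move on to other hardware.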
 

damaddbomma

1) Yes, the original drives are still showing during the boot phase. The boot drive has always been attached to the SATA port on the motherboard.
2) Yes, I did not move the drives to the new HBA. The only thing I attached to the new HBA was a spare drive that I was going to attach to the pool once the system got back up. Even though I removed that when I removed the new HBA, the problem still exists.
3) N/A

Thanks!
 

damaddbomma

I decided to do some experimentation. I pulled out the old HBA and put the new one in a different slot, and it didn't register. So I moved it down one slot, and it still didn't register. I moved it to the last slot and it finally appeared. All the drives were there, but the system still rebooted. Also, during boot, I noticed some zio_read_errors when drive letters were being assigned.

Given that two of the slots didn't work (even the slot that the old HBA was in), is it possible that my motherboard is going bad?
 

sretalla

Depending on the board, it may be that the slots don't provide the number of lanes the card expects, but that seems an unlikely cause: HBAs are usually x8 cards, but they can run on x4.

That leads to the conclusion that plugging in the additional card may have somehow caused a fracture on the board, which is now leading to intermittent connections for some of the parts.

Very difficult to diagnose if it's just a hairline fracture and partial contact is still made.
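
To put rough numbers on the lane-count point, here is a small Python sketch of usable bandwidth per link width. It assumes the standard PCIe per-lane rates (2.5, 5 and 8 GT/s for generations 1 to 3) and their line encodings (8b/10b for generations 1 and 2, 128b/130b for generation 3). Which generation this HBA and these slots actually run at is not stated in the thread, so treat the output as illustrative only.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency, PCIe generations 1-3.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_mb_s(generation, lanes):
    """Approximate one-direction bandwidth in MB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    # GT/s times efficiency is Gbit/s of data per lane; /8 gives GB/s, *1000 gives MB/s.
    per_lane_mb_s = gt_per_s * efficiency / 8 * 1000
    return per_lane_mb_s * lanes


if __name__ == "__main__":
    for lanes in (8, 4):
        for generation in sorted(PCIE_GENERATIONS):
            mb_s = link_bandwidth_mb_s(generation, lanes)
            print(f"Gen {generation} x{lanes}: about {mb_s:.0f} MB/s")
```

A generation 2 link, for example, works out to about 4000 MB/s at x8 and about 2000 MB/s at x4. A card in a narrower slot would normally just negotiate a narrower link and run with less headroom, so a lane mismatch alone should not stop it from being detected.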
 

damaddbomma

The original card was in Slot 6 in the attached picture. The only slot the cards work in is Slot 4.

I guess my only recourse is to replace the motherboard and see if that works. Luckily, they are readily available on eBay at a low price.

Unless you have some other suggestions.

Thanks!
 

Attachments

  • Screenshot - 1_20_2023 , 4_51_23 AM.jpg (26.7 KB)

joeschmuck

You have a tough problem there. I'd reset the BIOS to DEFAULT using Clear CMOS and try again, just in case it is corrupt.

What I would do next is examine all the PCIe slots with a bright light and look for any damaged or crossed fingers within the slot, or debris. Blow the slots out with compressed air.

And I'm assuming you did absolutely nothing else to the system? Think hard; some people will make some kind of change and it won't even pop into their head that it could be the problem.

Also, I'd run Memtest on the system for at least one full pass to make sure nothing else is obviously wrong. And even a CPU stress test for about 1 hour to make sure that isn't it.

It could be the power supply, but with a trap error I doubt it.

Otherwise, glad you can buy it on eBay if you need to.
 

damaddbomma

No other changes. It started as soon as I put the new board in. Before that, the system had been up and running for 51 days with no issues.

I will check the other things you mentioned and see if that provides any useful information.

Thanks for your help and suggestions.
 