I've been going through a lot of other forum posts about frequent rebooting and how likely it is to be a hardware problem, and the usual troubleshooting method seems to be removing and replacing components to see what helps or doesn't. We'll be running memtest for a good while starting tomorrow night, but until we have those results I figured I'd see if anyone with a similar setup, on the most recent updates, has hit and fixed this issue:
Our system:
* Supermicro 6027R-E1R12L (X9DRD-7LN4F mobo, LSI controller in IT mode)
* ~100GB of ECC RAM
* just flashed the mobo firmware to the latest version, 3.2 (quick version check below this list)
* Chelsio T420-CR
* pulled the jumper for watchdog on the mobo, plus made sure watchdog wasn't enabled in the BIOS
* 12x 2TB WD SAS drives in RAID10
* Intel P3700 SLOG
* dual 960W power supplies connected to a switched PDU, which is plugged into a UPS
* mirrored 8GB USB boot sticks (Kingston DataTraveler SE9s)
... and it'd been running fine for the last 6 months.
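For the firmware line above, this is the quick check I used to confirm the flash actually took (assuming dmidecode is available from the shell, which it was on our box):

```
# report the BIOS version the board is currently running
dmidecode -s bios-version
```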
Basically, I'm using this system as an iSCSI storage target for our ESXi (5.5) hosts. It had been in a test environment for the last 6 months with a few "B-list" servers on it, and that was fine. We're sharing the storage over iSCSI as a file extent in a dataset on the only zvol; the extent has 1TB assigned to it, with a capacity limit of 85%.
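For reference, this is roughly how I've been sanity-checking the pool and the extent's dataset from the console (pool/dataset names here are placeholders, not our real ones):

```
# overall pool health, plus any device or checksum errors
zpool status -v tank

# space usage and compression on the dataset holding the file extent
zfs get used,available,referenced,compressratio tank/iscsi
```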
Anyway, things were going smoothly. We finished uploading the last of our semi-production VMs onto the system late last Thursday night; it's now at 230GB of 10TB total storage. Default compression, no dedup. Friday morning the system started a reboot cycle, rebooting over and over again.
We've checked the UPS logs: no weird voltage problems. I swapped the USB sticks, tried a fresh install and re-applied our backup config, checked for kernel panic logs (there were none), and checked the time and date in the BIOS (they're consistent). It doesn't happen at consistent intervals: sometimes it reboots during the boot process, sometimes the system gets to the console and freezes after 5 minutes, sometimes it runs fine for about an hour. Right now it's back to rebooting over and over, never finishing the boot process.
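Since the board has IPMI, the next thing on my list is the BMC's hardware event log, which should record ECC, thermal, and power events even across reboots. Roughly what I plan to run (BMC address and credentials are placeholders):

```
# hardware event log from the BMC (persists across OS reboots)
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist

# current sensor readings: voltages, temperatures, fan speeds
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sensor
```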
So I've got a replacement PSU on hand just in case, and again, we're running memtest starting tomorrow night. Anything else I should be looking at?