It seems that every week I'm plagued by a new, weird issue (see the previous thread about Intel NICs not actually doing proper gigabit Ethernet).
Today's is a nerve-wracking one (let's just say that my backup policy has suffered from a lack of time to get the new server ready - let this be a warning that backups are essential...):
I've been busy getting the remodeled office into a usable state. Today's tasks included the first Ethernet copper terminations (a Telegärtner Cat 6a patch panel, Corning shielded Cat 6a keystone jacks and Cat 7a cable - mostly because Cat 6a is damned hard to find).
I was testing the first two finished terminations, one for one of the desks and one for the AP in the hallway. The latter isn't working properly (the keystone end was a learning experience, so that's not much of a surprise). The desk run was working normally, and iperf showed 900 Mb/s (reasonable, given that the GbE NIC was the one in the Surface Pro 3 dock, which is probably a Realtek USB model). After giving up on the AP run for the day, I noticed that my server wasn't responding to an SSH session, and then heard a weird noise coming from downstairs, where about half of the office was relocated during the remodeling.
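As a rough sanity check on that iperf figure, here is the theoretical TCP goodput ceiling of gigabit Ethernet. This assumes a standard 1500-byte MTU and TCP timestamps enabled (neither is stated above). Each full frame occupies 1538 bytes of wire time (8 preamble/SFD + 14 header + 1500 payload + 4 FCS + 12 inter-frame gap), of which 1448 bytes are TCP payload once the 20-byte IPv4 header, 20-byte TCP header and 12-byte timestamp option are taken out:

```latex
\[
\text{goodput}_{\max} \approx 1000\,\text{Mb/s} \times \frac{1448}{1538} \approx 941\,\text{Mb/s},
\qquad
\frac{900\,\text{Mb/s}}{941\,\text{Mb/s}} \approx 0.96
\]
```

So 900 Mb/s is roughly 96% of what the link can carry under those assumptions.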
Turns out that everything around my workstation had no power and the UPS was responsible for the noise, with an ominous error message on the screen - F02, which I later found out is a "Battery-side short" error.
After some debugging, I arrive at the following preliminary conclusions:
- My old server's PSU has somehow shorted out.
- Seasonic G-550 decides it's a nice time to ruin a night and shorts out.
- Circuit breaker trips; workstation/improvised server room loses power.
- Servers switch to UPS power.
- UPS detects the short, halts power delivery and makes an unmissable noise, taking the servers offline.
When FreeBSD decides that it can't mount root, I realize that the old server was still on FreeNAS 9.3.1, while the new one is Skylake ('nuff said).
"No problem, I'll boot into 9.10 on the SSD, even though it's a pre-release version." Huh, can't mount root?
So next I install the latest 9.10. That goes well. And it boots up. First step, zpool import. And there's Bender, all six drives.
Code:
zpool import Bender
Ouch, failed to import? Oh, right, it wasn't exported. So, to make sure everything was OK, I figured I'd first try a dry run:
Code:
zpool import -f -F -n Bender
But I forgot to type the -n! I was panicking as the import took something like 30 seconds. Then it returned without error. Phew... Let's scrub this thing. What? Already scrubbing? It resumed the ongoing scrub initiated at midnight? Huh, learn something new every day.
Anyway, so far so good, everything else seems to be working, the scrub is progressing normally, so that's a bullet dodged.
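The cautious version of that import sequence can be sketched as a small POSIX shell function. `safe_import` is a hypothetical helper made up for illustration, assuming a FreeBSD/FreeNAS box with ZFS and a pool that was never exported, and assuming the `-F -n` dry run exits with status 0 when it finds nothing wrong. It lists importable pools, runs the recovery dry run, and only then does a real forced import and shows the pool status (where a resumed scrub would appear). It stops at the first command that exits with a non-zero status.

```shell
# Hypothetical helper: cautiously import a ZFS pool that was not exported.
# Usage: safe_import POOL
safe_import() {
    pool="$1"

    # List pools visible for import; this changes nothing on disk.
    zpool import || return 1

    # Dry run: combined with -F, -n only reports whether recovery would be
    # possible; no recovery is performed and the pool is not imported.
    zpool import -f -F -n "$pool" || return 1

    # Real import; -f overrides the "pool may be in use by another system" check.
    # -F is deliberately left out here.
    zpool import -f "$pool" || return 1

    # Health check; an interrupted scrub should show up here as in progress.
    zpool status "$pool"
}
```

Usage: `safe_import Bender`. Leaving `-F` out of the real import is a deliberate choice: recovery mode may discard the last few transactions to make a damaged pool importable, so it's only worth reaching for if a plain `zpool import -f` fails.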
Which brings us to the juicy part, apart from the "That idiot fell behind on his backups" part. I've had about two hours now to think about this, and the more I think about it, the more I believe that this is either one insane coincidence or a sign of some serious issue in some network hardware.
I fully accept that I may have screwed up some termination. However, everything tells me that an Ethernet network should trivially survive any sort of egregious twisted-pair termination mistake - even a freaking short to mains live, since the figure I found for the isolation rating of Ethernet transformers is 1kV+, well above mains voltage, even at 230V. And for such a flaw to propagate to only one of the servers, with the rest of the network carrying on normally... It's just crazy unlikely. And for that to then somehow wreck the PSU, which is not some Wun Hung Lo Happy China Super Quality Shenzhen back alley model? I can't believe such a scenario with the data I have on hand.
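For a sense of the actual margin, here is the quoted 1 kV isolation figure compared with 230 V mains, both as an RMS value and at its sine-wave peak (a rough comparison only; how isolation ratings are specified and tested is a separate matter):

```latex
\[
V_{\text{peak}} = 230\,\text{V} \times \sqrt{2} \approx 325\,\text{V},
\qquad
\frac{1000\,\text{V}}{230\,\text{V}} \approx 4.3,
\qquad
\frac{1000\,\text{V}}{325\,\text{V}} \approx 3.1
\]
```

That's a factor of three to four rather than ten, but still above mains even at its peak, so the argument that a mains short shouldn't cross the transformer holds.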
One thought I had was the IPMI LAN, which runs off +5V standby. But then, the PSU would've simply shut down, not presented a short circuit to the UPS.
So, the plan now is as follows:
- Examine the PSU while not attached to anything
- Try to power up the server with an X-650 that's currently sitting idle (waiting for the delayed migration of the office desktop to a decent PSU and chassis).
- Examine the server for any evidence of physical damage of electrical origin.
But first, some sleep.
For the sake of reference, specs:
UPS: APC Back-UPS Pro 900
Old Server:
Supermicro X10SLM+-F
Intel Core i3-4330
16GB ECC RAM
Seasonic G-550
Currently missing its pool, Bender (6x WD Red 3TB in RAIDZ2)
New server:
Supermicro X11SSM-F
Intel Core i3-6300
16GB ECC RAM
Seasonic X-650
Currently holding Bender.