The plague of weird issues strikes again...

Status
Not open for further replies.

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It seems that every week I'm plagued by a new, weird issue (see the previous Intel NICs not actually doing proper gigabit Ethernet thread).

Today's is a nerve-wracking one (let's just say that my backup policy has suffered from a lack of time to get the new server ready; let this be a warning that backups are essential...):

I've been busy getting the remodeled office into a usable state. Today's tasks included the first Ethernet copper terminations (a Telegärtner Cat 6a patch panel, Corning shielded Cat 6a keystone jacks and Cat 7a cable - mostly because Cat 6a is damned hard to find).

I was testing the first two finished terminations - one for one of the desks and one for the AP in the hallway - the latter of which wasn't working properly (the keystone end was a learning experience, so that's not too much of a surprise). The other one was working normally, and iperf showed 900 Mb/s (reasonable, given that the GbE NIC was the one in the Surface Pro 3 dock, which is probably a Realtek USB model). After giving up on the first one for the day, I noticed that my server wasn't responding to an SSH session, and then heard a weird noise coming from downstairs, where about half of the office was relocated during the remodeling.
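As a side note, ~900 Mb/s is indeed about as good as it gets. A minimal back-of-the-envelope sketch (assuming a standard 1500-byte MTU, IPv4, and TCP timestamps enabled) gives the theoretical TCP goodput ceiling on gigabit Ethernet:

```python
# Sanity check: maximum TCP goodput on gigabit Ethernet, assuming a
# standard 1500-byte MTU, IPv4, and the TCP timestamp option.
LINE_RATE_MBPS = 1000
MTU = 1500
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble, Ethernet header, FCS, inter-frame gap
IP_TCP_OVERHEAD = 20 + 20 + 12   # IPv4 header, TCP header, TCP timestamp option

payload = MTU - IP_TCP_OVERHEAD                        # 1448 payload bytes per frame
goodput = LINE_RATE_MBPS * payload / (MTU + ETH_OVERHEAD)
print(round(goodput, 1))  # ≈ 941.5 Mb/s theoretical ceiling
```

So a Realtek USB NIC delivering 900 Mb/s is only a few percent off the theoretical maximum.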

Turns out that everything around my workstation had no power and the UPS was responsible for the noise, with an ominous error message on the screen - F02, which I later found out is a "Battery-side short" error.

After some debugging, I arrive at the following preliminary conclusion:
  • My old server's PSU is somehow shorted out.
And the following sequence of events:
  1. Seasonic G-550 thinks it's a nice time to ruin a night, decides to short out.
  2. Circuit breaker is tripped, workstation/improvised server room loses power.
  3. Servers move to UPS power.
  4. UPS detects short, halts power delivery, makes unmissable noise, rendering servers offline.
So, I'm minus a server, and my immediate concern is the safety of my pool, Bender. A quick HDD transplant follows, along with moving the boot devices.
When FreeBSD decides that it can't mount root, I realize that I was still on 9.3.1 on the old server (Skylake. 'nuff said).
"No problem, I'll boot into 9.10 on the SSD, even though it's a pre-release version." Huh, can't mount root?

So next I install the latest 9.10. That goes well. And it boots up. First step, zpool import. And there's Bender, all six drives.

Code:
zpool import Bender

Ouch, failed to import? Oh, right, it wasn't exported. So, to make sure everything's ok, I figured I'd try
Code:
zpool import -f -F -n Bender

But I forgot the -n (which would have made it a dry run)! I was panicking as the import took something like 30 seconds. Then it returned without error. Phew... Let's scrub this thing. What? Already scrubbing? It resumed the ongoing scrub initiated at midnight? Huh, learn something new every day.

Anyway, so far so good, everything else seems to be working, the scrub is progressing normally, so that's a bullet dodged.

Which brings us to the juicy part, from the "That idiot fell behind on his backups" department. I've had about two hours now to think about this, and the more I think about it, the more I believe that this is either one insane coincidence or a sign of some serious issue in some piece of network hardware.

I fully accept that I may have screwed up some termination. However, everything tells me that an Ethernet network should trivially survive any sort of egregious twisted-pair termination mistake - even a freaking short to mains live, since the figure I found for the isolation rating of Ethernet transformers is 1kV+, well above mains voltage, even at 230V. And for such a fault to propagate to only one of the servers, with the rest of the network moving along normally... it's just crazy unlikely. For that to then somehow wreck the PSU - which is not some Wun Hung Lo Happy China Super Quality Shenzhen back-alley model - is even less plausible. I can't believe such a scenario with the data I have on hand.

One thought I had was the IPMI LAN, which runs off +5V standby. But then, the PSU would've simply shut down, not presented a short circuit to the UPS.

So, the plan now is as follows:
  1. Examine the PSU while not attached to anything
  2. Try to power up the server with an X-650 that's currently sitting idle (waiting for the delayed migration of the office desktop to a decent PSU and chassis).
  3. Examine the server for any evidence of physical damage of electrical origin.
And, step 0, back up the most important stuff ASAP.

But first, some sleep.

For the sake of reference, specs:

UPS: APC Back-UPS Pro 900

Old Server:

Supermicro X10SLM+-F
Intel Core i3-4330
16GB ECC RAM
Seasonic G-550
Currently missing 6x WD Red 3TB in RAIDZ2, Bender

New server:

Supermicro X11SSM-F
Intel Core i3-6300
16GB ECC RAM
Seasonic X-650
Currently holding Bender.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Did you connect the shield at both ends of the cable? The data pairs should be isolated at both ends by the transformers in the NIC, but the shield's ground isn't (it's ground, after all...), so the problem may come from that.
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
It is hard to believe that patching network cables would lead to a PSU shorting out. I am not a big believer in the butterfly effect when it comes to hardware. Electronic components just die sometimes.
However, what I DO know for a fact is that they smell or sense the lack of backups in one way or another, as they always tend to fail at the worst possible moment. About time someone wrote a scientific paper about this!
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The other very weird thing is that the PSU isn't a crappy one, so even if it fails there's a fuse inside; a short circuit on the primary side, as seen by the UPS, is very, very unlikely. Are you sure it's not the UPS?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Did you connect the shield at both ends of the cable? The data pairs should be isolated at both ends by the transformers in the NIC, but the shield's ground isn't (it's ground, after all...), so the problem may come from that.
I've thought about it, but the Surface Pro 3 dock isn't earth-referenced and the patch panel is floating. In fact, I'd been wondering why they went with floating power supplies for about as long as I've had the thing.

It also doesn't explain how this failed to affect the Asus RT-AC87U or the Netgear switch sitting between them, as well as every other device.

Also, I just realized that the cable I used to connect the patch panel and keystone jacks to the rest of the network is UTP, with unshielded plastic connectors, so no ground loops there.

The other very weird thing is that the PSU isn't a crappy one, so even if it fails there's a fuse inside; a short circuit on the primary side, as seen by the UPS, is very, very unlikely. Are you sure it's not the UPS?
It's hard to tell, but the UPS appears to be operating normally with other devices.
I also checked the IEC cable, which is fine.

"Very, very unlikely" is right.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Here's an update:
  • With the server assembled, the PSU consistently trips the circuit breaker (16A, standard European outlet circuit; no differential circuit breakers tripped) the moment I flip the PSU's power switch to on.
  • If I disconnect the secondary-side cables, the PSU no longer trips the circuit breaker. +5V standby seems dead, but I'm not going to blindly trust measurements made in the dark, in an awkward position, with an El-Cheapo multimeter (it even has a traditional Chinese QC pass sticker).
  • The "spare" X-650, once connected, seems to provide good power. The motherboard immediately attempts to power on and starts beeping (probably due to the lack of RAM, which I'd removed to ease access to the ATX power connectors). IPMI is working, and I'm now going to try to boot it up, to get an idea of the state of the hardware.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It would be very interesting to do an autopsy on the PSU because I'm curious to know what the fault is :)

My bet is on a short circuit on the secondary side, which makes the PSU draw too much current on the primary side.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It would be very interesting to do an autopsy on the PSU because I'm curious to know what the fault is :)
Oh yes. I wish I had contacts in Seasonic, so that they'd let me open it without voiding the warranty, to inspect it.

Holy crap, this constant feeling that the timing might be anything more than coincidental...

What's also funny is that it was specifically the server I was using for iperf. Maybe it received a TCP "destroy PSU" packet. :p
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Oh yes. I wish I had contacts in Seasonic, so that they'd let me open it without voiding the warranty, to inspect it.

Holy crap, this constant feeling that the timing might be anything more than coincidental...

What's also funny is that it was specifically the server I was using for iperf. Maybe it received a TCP "destroy PSU" packet. :p
Maybe you need to stop drooling over shiny new components.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Either I'm even more tired than I thought and I missed something obvious, or my new server just magically booted into FreeNAS 9.3.1.

More specifically, I was going to grab the latest config off the boot pool. It started by trying to boot from the USB drives, which naturally failed. So I overrode it and had it boot from the SSD, which has 9.10 installed.
So I get to GRUB and it only gives me one option, as expected.

When the server had finished booting, I noticed something weird: it was using the old server's hostname. I thought it might have automagically imported the config from the .system dataset on the pool and shrugged it off... until it wouldn't accept my credentials. The old server's credentials worked.
For extra WTF, USB seems to magically be working - and there's been no BIOS update since my original struggle with this server and FreeBSD 9.3.
For extra fun, here's the banner from the SSH session:

Code:
FreeBSD 10.3-RELEASE (FreeNAS.amd64) #0 f935af8(freebsd10): Mon Apr 18 10:58:36 PDT 2016

        FreeNAS (c) 2009-2015, The FreeNAS Development Team
        All rights reserved.
        FreeNAS is released under the modified BSD license.

        For more information, documentation, help or support, go here:
        http://freenas.org
Welcome to FreeNAS


Somehow, and I have NO FUCKING IDEA HOW, FreeBSD 10.3, from the SSD, loaded the FreeNAS 9.3.1 middleware and GUI from the USB drive!

This is so completely insane that I am going to write a bug report.

Here are a few... interesting shots:

[Screenshot: upload_2016-4-25_1-47-20.png]


Nothing spectacular so far, you say? The red alert is for the boot pool, which is missing the mirror - which means it's running from the USB drive, not the SSD.

UPS is communicating normally over USB, apparently:
Code:
[root@freenas] ~# upsc Back-UPS
battery.charge: 100
battery.charge.low: 10
battery.charge.warning: 50
battery.date: 2001/09/25
battery.mfr.date: 2014/05/13
battery.runtime: 7690
battery.runtime.low: 120
battery.type: PbAc
battery.voltage: 27.1
battery.voltage.nominal: 24.0
device.mfr: American Power Conversion
device.model: Back-UPS RS 900G
device.serial: 3B1420X03070
device.type: ups
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: /dev/ugen1.4
driver.version: 2.7.2
driver.version.data: APC HID 0.95
driver.version.internal: 0.38
input.sensitivity: medium
input.transfer.high: 294
input.transfer.low: 176
input.voltage: 243.0
input.voltage.nominal: 230
ups.beeper.status: enabled
ups.delay.shutdown: 20
ups.firmware: 879.L4 .I
ups.firmware.aux: L4
ups.load: 9
ups.mfr: American Power Conversion
ups.mfr.date: 2014/05/13
ups.model: Back-UPS RS 900G
ups.productid: 0002
ups.realpower.nominal: 540
ups.serial: 3B1420X03070
ups.status: OL
ups.test.result: No test initiated
ups.timer.reboot: 0
ups.timer.shutdown: -1
ups.vendorid: 051d
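A quick cross-check of those readings, using only the numbers reported above (9% load on a 540 W nominal unit, 7690 s of estimated runtime); all figures are the UPS's own estimates, so this is a plausibility check, not a measurement:

```python
# Rough cross-check of the upsc output above. All inputs come straight
# from the UPS's reported variables, so treat the results as estimates.
nominal_watts = 540    # ups.realpower.nominal
load_percent = 9       # ups.load
runtime_s = 7690       # battery.runtime (seconds)

load_watts = nominal_watts * load_percent / 100
runtime_h = runtime_s / 3600
print(round(load_watts, 1))  # ≈ 48.6 W estimated draw
print(round(runtime_h, 1))   # ≈ 2.1 h estimated runtime
```

About 49 W of draw and over two hours of runtime is consistent with a lightly loaded UPS that lost half its load when the old server died.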


[Screenshot: upload_2016-4-25_1-51-47.png]


Note da1 and da2. They are USB 3.0 external drives connected to USB 3.0 ports. On this Skylake system. With FreeNAS 9.3.1. See also the 120GB SSD, not being hidden because it is not the boot pool.

[Screenshot: upload_2016-4-25_1-58-15.png]


This is my favorite. The OS' timestamp is later than the latest boot environment, by two months!


At this point, I feel that this weekend is slowly devolving into a Linus Tech Tips episode, so I am going to shut this monstrosity down, before I lose my now backed-up data. Then I'll boot it up without the old boot device attached.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
For extra oddity, my Asus RT-AC87U was acting up, too. Fortunately, a firmware recovery fixed that.

It's interesting, because the RT-AC87U was the device I was connected to using patch cables and the new runs of cable. But, again, it was connected with UTP cable.

To be fair, it may have been acting up for a few days. I'd have a hard time noticing, because its Ethernet switch was working normally.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The most disconcerting thing to me in all of this is that Eric is the one having random hardware issues. What's next, jgreco posting for advice about a degraded pool? Bidule0hm asking how to perform a level 1 electronics wizard incantation?
The most disconcerting thing to me is the existence of a time correlation without any physical explanation for a link between these events.

Minus the FrankenNAS, of course. That's just dubious code loading stuff willy-nilly from pools named freenas-boot, with no regard for version numbers.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
It's not a bug, it's a feature (© Microsoft): FreeNAS is so compatible and flexible that any part of any version works with any part of any other version... :D
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
At this point, I feel that this weekend is slowly devolving into a Linus Tech Tips episode, so I am going to shut this monstrosity down, before I lose my now backed-up data.
This made me laugh so hard I probably woke up the neighbours (7:40 am, Saturday morning) X)))

Besides that, it was a really harrowing read (down to that line)...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
[Screenshot: upload_2016-4-30_12-18-4.png]


The errors followed the DIMM: four more errors over six total passes (three of them clustered in a single pass).

At this point, I'm fairly confident the CPU memory controller is fine.

I'm still no closer to finding anything that might explain the server's mysterious (and mysteriously hard - Seasonic units don't generally just short out) failure.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Just an update now that I finally had the time to get things going:
  • The office is all wired up, without any further incidents.
  • The malfunctioning cable had accidentally been terminated for T568A, instead of the T568B being used everywhere else. :oops:
  • Seasonic approved the RMA without any issues, the PSU is in transit to their RMA shop in Germany. I'll probably get a crap hydrodynamic bearing model back...
  • My testing back in April must've blown the PSU fuse, since the unit is completely dead now. I'd even prepared a 30mA Residual-Current + 10A circuit breaker combination to more safely test the damned thing, but it won't turn on and +5VSB is dead.
Unless someone comes up with some crazy theory I haven't thought of yet, I'm going to attribute this to a dodgy PSU with freakishly incredible timing.
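For reference on that T568A/T568B mix-up: the two standards differ only in swapping the green and orange pairs, so a cable terminated T568A on one end and T568B on the other is effectively a crossover cable. A minimal sketch (pin colors per TIA-568):

```python
# Illustration of the two TIA-568 termination standards. Terminating one
# end as T568A and the other as T568B swaps the green and orange pairs
# (pins 1/2 and 3/6), producing a crossover cable.
T568A = {1: "white/green",  2: "green",  3: "white/orange", 4: "blue",
         5: "white/blue",   6: "orange", 7: "white/brown",  8: "brown"}
T568B = {1: "white/orange", 2: "orange", 3: "white/green",  4: "blue",
         5: "white/blue",   6: "green",  7: "white/brown",  8: "brown"}

mismatched_pins = sorted(pin for pin in T568A if T568A[pin] != T568B[pin])
print(mismatched_pins)  # the pins carrying the swapped pairs: [1, 2, 3, 6]
```

Gigabit gear with auto MDI-X will often tolerate this, which may be why only some devices misbehaved on that run.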
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Yup, it could have been the power supply. If it happens again, then I'd suspect the motherboard.

Your series of problems was not very funny (been there with the multiple hardware failures), and I'm sure it was aggravating at the time. I hope all those gremlins are gone now and your system gives you no grief in the future. That software thing was very weird.
 