Well, we do live in the land of lightning strikes LOL, so go figure. Sorry, my twisted sense of humor comes out after another long day dealing with technology. I'm ordering the new cables tonight and am also going to order some different power adapters from Monoprice. Those 2 drives on the motherboard side are not mounted in a way that is overly exciting, so cable tolerances are very tight to non-existent.
To address the other question, most of this power design avoids molex connectors. Where possible, the power cables are direct-run modular cables from the PSU. The exceptions that come to mind are a molex to 2x SATA split on the (2) WDs and another set powering the OS mirrored SSDs. I haven't reseated the HBA yet, but I'm saving that for a later moment of desperation :)
One thing that bears mentioning: there may be a pattern to how the errors manifest in the system log, and I'll be taking a closer look at that shortly. I'm also contemplating ordering another PSU, and I'm going to reach out to Seasonic to see if this has been reported before as a PSU fault.
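For the log-pattern check, the plan is basically to pull the timestamps and device names off the controller errors and see whether they always start on the same drive. Something along these lines should do it (the exact error strings depend on the controller and driver, so treat this as a sketch):

grep -i "cam status" /var/log/messages        # find the CAM/controller-level errors
grep -i "da[0-9]" /var/log/messages | less    # eyeball which da devices and timestamps they cluster on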
I really appreciate your offer to let me borrow your motherboard, but let's hold off on that gesture of goodwill a bit while the other items are worked through.
So tonight:
1) Order new power and SFF cables
2) Check all existing connections to the PSU and components, then try the scrub again; it normally takes about an hour for this problem to show up, and I'm pretty sure it starts with one particular drive each time and cascades from there. That's the part needing further investigation. I've spent time mapping serial number to port to gptid label to ensure the drives aren't swapping around after a reboot (a rough sketch of the commands is after this list).
3) Potentially order new PSU
4) Reach out to Seasonic with a support query
5) Turn it off and go relax so I can deal with the next day.
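For anyone curious, the serial-to-port-to-gptid mapping in item 2 was done with a few stock FreeBSD commands, roughly along these lines (da3 is just an example device, yours will differ):

glabel status                             # gptid label -> da/ada device
camcontrol devlist                        # controller/port -> da/ada device
smartctl -i /dev/da3 | grep -i serial     # serial number for a given drive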
Anything overlooked :) ?
Thanks!
-Dan
Well hello fellow forum members :)
I have been a busy camper since we last spoke.
All of the above items have been completed, including ordering a new PSU (an eVGA P2 850 this time), new SFF cables directly from Supermicro, and a new, more streamlined Molex to 2x SATA power cable for the drives on the motherboard side.
Additionally, all fans are now powered from the motherboard, the 3-way switch has been removed from the equation, and the SATA power cable has been removed from the PSU itself.
I also used this as an excuse to buy a nice Fluke meter for doing some basic testing.
I also noticed there was a questionably loose Molex to 2x SATA providing power to the OS SSDs, so of course I fixed that.
While waiting for the various "new" components to arrive, I conducted several more tests.
1) After checking everything from a wiring-integrity point of view (power and data cables), I decided to re-try the scrub (a rough sketch of the commands behind these tests is below, after item 4). It took about an hour, but sure enough, right around that timeframe, the errors started occurring. Well, that was discouraging, but I had already decided I would not be defeated by this.
2) I decided to dismantle the current Z3 volume and build a slightly smaller Z3 volume using only 8 drives (in the 2 hanging racks). Of course, it required re-populating the volume with data (luckily that wasn't an issue, RSYNC to the rescue). At first things were looking better, and the first time through it completed the scrub with no issues, but there was less than a TB on the volume at the time. I decided to add more data to it and run another scrub. Once again, after going beyond about 1.5TB, a scrub revealed the same issues as before.
3) The next objective was the 2 drives I had taken out of the original volume pool. I created a mirror on them, loaded them up with data, and tried to reproduce the error. Surprise! Everything worked as it should.
4) I decided to re-create the volume with the 8 drives, but this time as a Z2 volume. I copied data over and, EUREKA, the scrub completed. Not only did it complete once, it completed multiple times, including while I was actively performing a copy, and with different levels of data on the volume.
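For anyone wanting to follow along, the tests above boil down to commands like the following. I actually built the pools through the FreeNAS GUI, and the pool name and gptid labels here are placeholders, so treat this as a sketch rather than a transcript:

zpool scrub tank                 # test 1: kick off a scrub
zpool status -v tank             # watch READ/WRITE/CKSUM errors and which gptid they land on

# test 2: after destroying the old pool, 8-drive RAIDZ3, then repopulate from the backup copy
zpool create tank raidz3 gptid/aaaa gptid/bbbb gptid/cccc gptid/dddd gptid/eeee gptid/ffff gptid/gggg gptid/hhhh
rsync -av /mnt/backup/ /mnt/tank/

# test 3: 2-drive mirror from the spare pair
zpool create testmirror mirror gptid/iiii gptid/jjjj

# test 4: the same 8 drives, but RAIDZ2 instead of RAIDZ3
zpool create tank raidz2 gptid/aaaa gptid/bbbb gptid/cccc gptid/dddd gptid/eeee gptid/ffff gptid/gggg gptid/hhhh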
So, the next question is WHY? Why would it work with a RAIDZ2 volume and not a RAIDZ3 volume? There is more stress on the drives and on the CPU due to the extra parity calculation, but beyond that, I cannot offer a reason for the behavior I am seeing.
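If the extra-parity theory holds any water, it should show up as load during a scrub, so I plan to keep an eye on per-disk activity and CPU while the next one runs, along the lines of:

gstat -p      # per-disk busy % and latency while the scrub runs
top -SH       # CPU load, including kernel/ZFS threads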
Today, my new PSU and cables arrived, so I will be taking the server down in a moment and replacing the cables. But after seeing this behavior, I am not convinced this is a power problem as originally thought. I will be back in a while to post more results. Time to replace those crappy cables right quick.