Descent into Unhappiness

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Good morning,

This weekend, I took some time to try and revive my NAS after its PSU apparently gave out.

I installed a different PSU (a 450W Corsair) and learned the hard way that modular PSUs are not pin-compatible with each other. So I exchanged the cabling, the system booted, and all seemed to be fine.

But two drives were absent and the pool was degraded, so I started to hunt down the issue. All drives were present from the CLI, but in the GUI they only showed up as a long gptid string, not by port as usual.

Changed the relevant signal cabling, no improvement. Put the units on spare SATA ports on the motherboard instead of the HBA, and they now showed up. So I decided to buy new cabling. Rebooted, and now the whole pool is gone and my only option per the GUI is disconnect / export.

Attempting to reimport the pool results in an I/O error, which to me points to yet more cabling issues. Or perhaps the HBA was damaged?
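For reference, the checks I've been doing from the CLI are roughly along these lines (TrueNAS CORE / FreeBSD; the pool name is a placeholder):

glabel status        # maps the long gptid/... labels the GUI shows back to adaX / daX device names
camcontrol devlist   # lists every drive the HBA and the motherboard SATA ports can actually see
zpool status -v      # state of any imported pool, per vdev and per member
zpool import         # with no arguments, lists exported/unimported pools and whether they look importable

Actually importing the pool by name is what throws the I/O error.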
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I’ve ordered a PCIe-based HBA as a means of bypassing the HBA built into the motherboard.
 
Joined
Jun 15, 2022
Messages
674
After a bad SATA power connector shorted on my server (logically it would have been 12V & 5V bridging, though I see no physical way for that to happen), the fault, instead of being caught by the power supply control board, [in theory] fed back to the mainboard and smoked a [what looks to be] resettable fuse. After pulling the drives and power cables and starting the system via a USB recovery drive, it appeared as if everything still worked, but I replaced the mainboard, repaired 2 cables and 1 HDD, and another HDD is now salvage (the motherboard was toasted like a marshmallow over an open flame at a fraternity outing). I was amazed nothing else got cooked (although I and a female co-worker may have gotten baked).

You'll probably want to assume all parts are suspect and do a short system burn-in. (I suggest a few glasses of bourbon and a brunette too.)
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
That’s what I am afraid of… but I want to test first. For example, the whole pool shows up as gone, yet I seem to be able to query the individual drives via SMART. That doesn’t make a whole lot of sense to me. Is it safe to disconnect a pool only to reimport it afterwards?
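For the record, querying the drives individually looks roughly like this (device names are just examples; on FreeBSD, HBA-attached drives show up as da*, motherboard SATA as ada*):

smartctl -H /dev/da0    # quick SMART pass/fail health check for an HBA-attached drive
smartctl -H /dev/ada0   # same, for a drive on a motherboard SATA port

The fact that these answer at all while the pool itself won't import is what has me puzzled.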
 
Joined
Jun 15, 2022
Messages
674
I don't know about disconnecting the pool in TrueNAS as I'm too new to know the potential outcomes.

I booted off a USB recovery stick and started probing/testing system resources so TrueNAS wouldn't see what happened until the hardware was stable. I was able to replace the cabling with a different style of cable (which should have no logical effect), replace the mainboard with the same model (which should also have no logical effect), repair one HDD, and replace the other (which would be noticed), test everything, disconnect the new drive, and upon restart TrueNAS reported a failed drive, which I then replaced, and all was good.

From my standpoint that was a complete tear-down to empty shell (even the hot-swap PSUs were opened and inspected), cleaning, and piece-by-piece rebuild. Each boot was off USB, and I had a laptop with photos of the previous BIOS settings (system and HBA) so it was "a guts-out project" (kind of like re-building RoboCop). TrueNAS only saw an unexpected system crash and restart (like being in a coma, it had no idea what really took place).

Your case is somewhat similar (a power issue), so maybe first test the hardware and make sure it works reliably; once you know it does, fire up TrueNAS and see what the pool status is.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Here are some SMART results from tonight. I am now re-running the tests on the HDDs, since I was not able to finish that yesterday. I will have more on that tomorrow afternoon.

Given that I was able to interrogate the SMART data on those drives, I am cautiously optimistic that the main issue may be something benign. I have an offline copy of the pool config. Two of the He drives appear to have had an issue early in life (at 49 hrs), which I think had to do with a bad cable, but nothing since.

The three S3610s are for the sVDEV, the eight He10s are bulk storage, a Samsung 860 is the L2ARC, two SuperMicro SSDs are the boot pool, and an Intel 4801X serves as the SLOG.

FWIW, I get zero response when I use the "zpool import -F -f -n pool_name" command. If I omit the -n, the pool is declared dead and in need of being destroyed.
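For completeness, the invocations I've been trying (pool name is a placeholder, as above):

zpool import -F -f -n pool_name   # dry run: reports what a rewind import would do without changing anything
zpool import -F -f pool_name      # the actual rewind attempt, which is what declares the pool dead

A read-only import (zpool import -o readonly=on -f -R /mnt pool_name) is another option I'm keeping in my back pocket before trying anything destructive.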

Assuming all the SMART tests come back OK, should I attempt to disconnect the pool in the GUI and then re-attach it?
 

Attachments

  • 2023-03-06 Disk Status.pdf
    127.5 KB · Views: 82
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Well, never mind, the disconnect / re-import didn't help. I got an I/O error in the GUI, with the following traceback, as well as on the command line.

Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
res = MIDDLEWARE._run(*call_args)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
return self._call(name, serviceobj, methodobj, args, job=job)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
return methodobj(*params)
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 352, in import_pool
self.logger.error(
File "libzfs.pyx", line 392, in libzfs.ZFS.__exit__
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 346, in import_pool
zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
File "libzfs.pyx", line 1151, in libzfs.ZFS.import_pool
File "libzfs.pyx", line 1179, in libzfs.ZFS.__import_pool
libzfs.ZFSException: I/O error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 355, in run
await self.future
File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 391, in __run_body
rv = await self.method(*([self] + args))
File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 975, in nf
return await f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 1464, in import_pool
await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1278, in call
return await self._call(
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1243, in _call
return await self._call_worker(name, *prepared_call.args)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1249, in _call_worker
return await self.run_in_proc(main_worker, name, args, job)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1168, in run_in_proc
return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1151, in run_in_executor
return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('I/O error',)
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I suppose I could try restoring an older config to see if that got borked?

Ah, just noticed that two drives are missing: one a 1.6TB SSD, the other a 10TB He10. Would a degraded pool result in an I/O error abort?
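One thing I can check is what a bare import scan says about the pool (no pool name, so it just lists what it can find):

zpool import   # should list the pool, its overall state (ONLINE / DEGRADED / UNAVAIL), and each member's status

My understanding is that a merely DEGRADED pool should still import, whereas a missing top-level vdev would leave it unimportable, so the listing ought to hint at which case this is.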
 
Joined
Jun 15, 2022
Messages
674
Listed : showing up
Intel S3610 (3) sVDEV : (2)
He10 (8) bulk storage : (7)
Samsung 860 (1) L2ARC : (1)
SuperMicro SSDs (2) boot pool : (2)
Intel 4801X (1) SLOG : (1)

I had to count since the order isn't grouped tightly; you are missing 2 drives, though I'd *think* you'd still be okay (I'm new to TrueNAS and ZFS).

The HGSTs that I have use removable circuit boards; I don't know about yours, but there's a good chance of getting a bad drive working again by replacing the circuit board (though this shouldn't be your issue currently). It's something like 5 screws: pop off the board, stick on another, re-screw, done.

The reports with:
# 1 Extended offline Interrupted (host reset) 90% 33599 -​
were interrupted. #1 is the most recent, but if it completed to 90% it's probably fine as drives generally pull a short test first as part of the long test.

---
I'm not that experienced with ZFS, so I'd run a read-only badblocks on each drive (1 pass) and see if the system stayed connected or reported a bunch of errors. (smartctl runs on the drive's controller, badblocks runs on the CPU)
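A single read-only pass would look something like this (device name is just an example; badblocks comes from the e2fsprogs package, and the block size should match the drive):

badblocks -v -s -b 4096 /dev/da0   # -v reports errors as found, -s shows progress; the default mode is read-only

If the cabling/HBA side is flaky you'd expect this to throw read errors (or drop the device entirely) even though the SMART long test came back clean.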
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I will let this set of long tests finish by tonight for the drives I found yesterday. The two missing drives will be attached to the extra SATA ports on the motherboard and interrogated from there.
 
Joined
Jun 15, 2022
Messages
674
@Constantin, you're a great member who contributes a lot of fun stuff, so let me state this isn't directed at you:

You know how I'm usually saying people should build a Test server on older equipment and a Production server on newer stuff, so they have a hot-backup if needed? Yeah, this is why.​

Also to be clear for new members, smartctl issues commands to the hard drive controller, which is part of the hard drive unit--it's the circuitry you see stuck to the bottom of the hard drive. For simplicity I'm going to refer to everything collectively as "the hard drive."
--all will report the S.M.A.R.T. information the drive knows about (this information is stored on the HDD controller stuck to the drive).
--xall will also print non-standard vendor-specific information.

--test=long tells the drive to run a short test as a sanity check, then read the entire disk and verify the data against its checksum data. If a block doesn't check out, the controller reads the error-correction information stored for that sector and tries to correct the error. If it can, the controller will write the data back to the original block, wait a bit, then read it back to see whether it stored properly. If the previous magnetic pattern had just "gotten weak" from age (bit rot), things should be fine. If not, the controller will write the data elsewhere on the drive, verify it, generate an ECC code and write that, and update a table of bad blocks recording where the bad block is now mapped. If the controller could not correct the error because the block is too corrupted, it reports the block as unrecoverable and the data as lost; with RAID-5 the controller will then reconstruct the data from parity. Since an error occurred, the drive controller will update the S.M.A.R.T. data; TrueNAS will see what happened and report it, and a user can read the full report via smartctl --xall.
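Put concretely (device name is just an example; on FreeBSD, HBA-attached drives usually show up as da*, motherboard SATA as ada*):

smartctl --all /dev/da0        # the standard S.M.A.R.T. report the drive knows about
smartctl --xall /dev/da0       # the same plus non-standard, vendor-specific pages
smartctl --test=long /dev/da0  # tells the drive itself to start the long self-test
smartctl -l selftest /dev/da0  # the self-test log, to see whether the long test finished or was interrupted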

The drive will also record information on its own when things go wrong, like when the user is streaming Sex In The City and a block doesn't match the checksum and it goes through the process mentioned previously.

In summary, the point I'm trying to make is that all this happens on the drive, and if a long test isn't run the drive can only report what it already knows about; meaning that if it blacked out due to an unexpected power issue, its memory might be a bit fuzzy.

---

If there is a problem external to the drive, like a bad connection, flaky HBA (because you zapped it repeatedly with high-voltage), mainboard issue (don't get me started), corrupt cache, etc., the drive probably doesn't know about it. That's where badblocks comes in.

badblocks -v will cause the operating system to try to read the whole drive. If there is a cabling or some other issue, you have a much better clue where the problem is located, because smartctl --test=long confirmed the drive seems to be fine (hopefully that was the result) while badblocks is reporting errors. What it is can be a bit dicey, as a non-ECC RAM problem can look like a controller issue. On that note, if your mainboard HDD controller has a cache and the cache is corrupt (a common problem on some HP servers), a full mainboard reset is needed to clear everything out.

This again is why I buy extra drives and have an old system set up for testing, because it helps quickly isolate which component is causing the issue. (If possible I actually have an exact copy of the Production server(s) running in the Test environment to swap out parts, and a second exact copy to put into the Production environment while a production server is down, but that's not always in the budget, which is why I often fudge my expense report. Well, that and other reasons, but that's a bit off-topic. Anyway, if the second backup server is an exact copy of the problematic server then the second backup server simply replaces the downed production server and stays in production; what used to be the production server is tested, repaired, and becomes the backup server. This is why I have time to thoroughly test the failed server's equipment while still fraternizing with anything on two legs that doesn't identify as the same gender as myself.)

---
Full Disclosure: There were two times in my career the "second exact copy" server failed in short order while in production, so I had to scramble to replace it with the "first exact copy," then quickly fix the "second exact copy," which in both cases was quick and easy as I had a spare set of parts from the failed production server and the problems weren't a 100% overlap. In both cases the "first exact copy" continued to run fine, although the hardware was a bit older and slower one of those times, so when both the systems sitting in the Test Environment burned-in successfully I swapped the original "production server" back into Production. Admittedly, that was a bit nerve-racking as for a short time I had no backup system in those two cases, but it was only briefly, and women love to see a man in full-action getting the job done against what seem like impossible odds. (The guys did too, but my gender-identity precludes me from capitalizing on that admiration in quite the same way.)
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
First of all, I really appreciate the interest and the tips. Attached are the latest long SMART reports for the 7 drives that were part of the array. All of them look good. So my next task is to attach the two missing drives to the motherboard via spare SATA ports (bypassing the HBA) and see if I can interrogate them.
 

Attachments

  • 2023-03-07 He10 Disks.pdf
    82.2 KB · Views: 94

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
OK, both can be interrogated via the motherboard SATA ports (they show up as ada devices). All 15 drives in the original cohort now appear active, i.e. the 8 He10 HDDs, the 3 S3610 sVDEV SSDs, the L2ARC, the SLOG, and the two boot drives. Now seemed like a good time to re-attempt a pool import... which worked!

All drives are active, all appears well, except the pool is allegedly "unhealthy". So that's my next quest while I wait for the long SMART tests to conclude on the two remaining drives. Ditto the Pool Scrub, which should finish sometime tomorrow. If all that comes out smelling like roses, it's likely time for a shutdown and a data cable replacement for the two adjacent drives.
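For reference, what I'm watching while all that runs, roughly (pool name is a placeholder):

zpool status -v pool_name   # scrub progress plus per-device read/write/checksum error counters
zpool clear pool_name       # resets those counters; something to hold off on until the hardware is trusted again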

If that solves the problem, I am lucky. If not, I may be looking at a dead SATA backplane. If it’s not the SATA backplane then I might be in real trouble since it would be the motherboard HBA that went kaploink.
 
Last edited:
Joined
Jun 15, 2022
Messages
674
Do you have a current backup? If not, that's the step right after everything starts working, because you never know when/if it will stop working. Touch nothing, back up everything. :oops:

In this case wait for the S.M.A.R.T. tests to finish to minimize the hardware load.

Otherwise, awesome!

But two drives were absent and the pool was degraded, so I started to hunt down the issue. All drives were present from the CLI, but in the GUI they only showed up as a long gptid string, not by port as usual.

Changed the relevant signal cabling, no improvement. Put the units on spare SATA ports on the motherboard instead of the HBA, and they now showed up. So I decided to buy new cabling. Rebooted, and now the whole pool is gone and my only option per the GUI is disconnect / export.
It sounds like the HBA might have taken a hit. Sometimes putting a card in another system straightens it out, sometimes not. I first try to make the error happen consistently, then once it's known how to force it to happen I work on identifying possible sources and solutions.

There might also be a 1) primary and 2) backup firmware on the HBA, plus 3) a BIOS and 4) settings. Flashing the whole set might clear out power-related bugs.
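As a rough sketch (assuming it's an LSI SAS2-era controller and the sas2flash utility is on hand), just seeing what's currently on the card would be something like:

sas2flash -listall   # lists every LSI SAS2 controller with its firmware and BIOS versions
sas2flash -list      # detailed firmware / NVDATA / BIOS info for the first controller

Comparing that against the versions you expect at least tells you whether the flash contents survived the power event.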

If it is the HBA and the problem is fixable, I'd personally be leery of putting it in Production; I'd instead swap the card into the test system with a yellow tag from a label maker and note it in the Asset Log. (I do that on my home systems too when they're somewhat complex.) It all comes down to how valuable your data is, and I like minimizing potential computer issues. As an aside, I had one Windows system that was so flaky I named it Blonde and used it for failure testing.

blonde.png
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
We will see, and many thanks for the tips and the hand-holding throughout the process. I will know in another 5 hours or so how the pool is feeling re: its scrub. In another 17 hours, the long SMART tests will be over. Fingers crossed the drives / pool are OK; then it's a matter of figuring out any cabling / backplane / motherboard HBA issues.

As an aside, what a different place this pool is in vs. "pool is dead, destroy it and rebuild from scratch" just a few hours ago. You almost have to have some sort of gallows humor re: the unhelpful ZFS error suggestions.
 
Joined
Jun 15, 2022
Messages
674
You bet. I'm pretty sure you had this one without any help, though it is nice to not have to go it alone. Great job on saving the server!!!!

(although until the hardware is proven good, you might not want to trust it)
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
If all goes well, the backup will happen tomorrow. As it is, there is very little delta between the NAS and the backup ATM, because I shut off the NAS for the holiday and nothing has been going on with it since. So far, the scrub is 66% done and none of the drives are offline, unhappy, or whatever, suggesting that the scrub will complete. Sometime tomorrow evening, the last He10 will issue its SMART report, at which point I think we can put the motherboard in general and the drives in the "likely good" bin, and the cabling, the BPSATA3 backplane, and some HBA channels in the "???" bin.

Replacement cabling is already here, so that will get changed first. If that's not it, then I will replace the backplane. If that still doesn't work then I'm afraid the motherboard HBA took the hit.
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Scrub reports 0 errors. Applied latest System update. Saved config and rebooted. Pool now at zero errors and happy. :smile:

Now onto next steps, i.e. figuring out for those two drives if it's a SATA cable, backplane, or HBA issue.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
The latest issues were apparently caused by electrical power, i.e. the lack of "interconnectedness" on the Lian-Li SATA backplanes. Simply connecting to one of the four power connectors on each BP3 is not enough; it has to be one of the upper ones (IDE/SATA power) and one of the lower two. Never had that issue with the Seasonic because it had so many widely-spaced power plugs.

So with the Corsair PSU, the lowest two drives simply didn't get any 12/5VDC power, hence didn't show up per the HBA. That is now fixed.

Happily, moving the SFF connector around on the motherboard showed the drives in different slots per the LSI utility, so the HBA seems to be OK.
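For the record, the sanity checks were roughly the following (assuming the onboard controller is an LSI SAS2 part with the sas2ircu tool available):

sas2ircu 0 DISPLAY   # controller 0: firmware info plus every attached drive and the slot it sits in
camcontrol devlist   # FreeBSD's own view of every attached disk, HBA- and SATA-attached alike

Seeing the same drives move to different slots as the SFF cable moved is what convinced me the HBA channels themselves are fine.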

Everything seems to be back to normal, so now it's time for a big backup.
 
Last edited:

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Big backup complete. The disk on the HBA @ da7 is unhappy, reporting UDMA CRC errors, which is likely a cabling issue. The newly purchased Seasonic PSU is here; the advance-replacement PSU will be here in another week, I guess. I'm kind of curious what differences in idle power consumption I'd measure between the Platinum Corsair 450W (which is what's in now) vs. a Ti 700W from Seasonic, or the eventual replacement 650W Ti.
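For the record, the attribute I'm watching on that drive is roughly:

smartctl -A /dev/da7 | grep -i crc   # attribute 199, UDMA_CRC_Error_Count, which points at the link/cable rather than the platters

If the count stops climbing after the cable swap, the cable was the culprit.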

Given the usual plug load with the Ti (about 106W), the smaller Corsair may approach the same efficiency as the Ti units despite being a lower-tier PSU (due to operating at about 1/4 of PSU capacity vs. 1/6).
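Rough arithmetic behind that guess: 106 W / 450 W ≈ 24% load on the Corsair, vs. 106 W / 650 W ≈ 16% and 106 W / 700 W ≈ 15% on the Ti units, which is where the 1/4 vs. 1/6 comes from.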

Time to take down the NAS and reboot after replacing the cable. Retest. Then burn in a new sVDEV replacement drive in case it's needed.
 