Drive in dmesg, not Web CLI

swedish_six_shooter · Mar 6, 2024

Hey all,

I rather *suspect* the answer to this is to just keep waiting or to pop and re-home the drive, but I wanted to check if anyone had a workaround via the Web CLI itself.

We've just installed a new JBOD. According to the Web CLI, 23 of the 24 drives in it (all Seagate Exos) can be seen.
We rehomed the missing drive and have confirmed that it's been picked up by dmesg. I've even run a short smart test from the command line on it and had it complete successfully, so ZFS is certainly aware of the drive.

The Web CLI hasn't picked up on the drive, and it's going on four hours of not noticing it. Is there any way to force the Web CLI to notice the drive?

artlessknave · Mar 6, 2024

swedish_six_shooter said:
Is there any way to force the Web CLI to notice the drive?

I don't believe so; there is no need, as it's automatic. if it's not seeing it, I would expect there is something wrong with the drive or its connection. just waiting isn't going to help, you will need to isolate the problem.

what do you mean by "re-home"? this is not a standard term.

swedish_six_shooter said:
I've even run a short smart test from the command line on it and had it complete successfully, so ZFS is certainly aware of the drive.

smart tests have nothing to do with zfs, so this does not mean that "zfs is certainly aware of the drive". zfs will not be aware of the drive just because it's in the system. it will ignore any drives/partitions that do not have zfs pool info on them.

have you tried switching the drive around with one of the others? does anything change? if you lose a different drive, then it's your backplane/connection. if the same drive remains unavailable, then you need to test it in another system and see how it behaves.

you have put no hardware details, so it's impossible to say much else. is there enough power? cooling? is the controller able to handle 24 drives?

swedish_six_shooter · Mar 7, 2024

Hey! Thanks very dang much for the quick response.

By "re-home", I mean pop the drive slide from the JBOD and then re-insert it. That's worked for bad connections before. (Didn't work this time, though.) Since yesterday, I've done:
* pop-and-re-insert: no impact, though everything registers with dmesg
* pop-and-replace-with-spare-drive-and-reinsert: spare drive not noticed by Web UI though, again, everything registers with dmesg
* swap-drive-with-a-second-drive-in-the-same-jbod: drive not in Web UI, though the second drive is. Both registered by dmesg

There's a chance that the controller card couldn't support this drive, as you mention, but I'm sort of leery of that: wouldn't it not be getting registered by the TrueNAS OS at all, e.g. in dmesg?

At present, my status remains that I can do

Code:

root@my-nas ~# camcontrol devlist | grep da221

<SEAGATE ST8000NM0075 E004>        at scbus12 target 20 lun 0 (da221,pass239)

and see my device.
I can also use disklist.pl:

Code:

root@my-nas ~# ./disklist.pl -i:dev da221
device  disk                  size  type  serial                 rpm
--------------------------------------------------------------------
da221   SEAGATE ST8000NM0075  8001  HDD   ZA1AGGV20000R829CG7B  7200

So what is being lost between the Web UI and TrueNAS/BSD?
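One way to pin down exactly which disks the OS sees but the Web UI does not is to diff the two lists. What follows is only a minimal sketch, not a TrueNAS tool: `kernel_disks` and `missing_from_ui` are hypothetical helper names, both input files are assumptions, and it assumes each `camcontrol devlist` line ends in a `(daN,passN)` style pair like the output above.

```shell
#!/bin/sh
# Hypothetical helpers; names and file layout are assumptions for illustration.

# Read `camcontrol devlist`-style text on stdin and print each daN name once.
kernel_disks() {
    sed -n 's/.*[(,]\(da[0-9][0-9]*\)[,)].*/\1/p' | sort -u
}

# $1: file holding saved `camcontrol devlist` output
# $2: file holding the disk names shown in the Web UI, one per line
# Prints the names the kernel knows about that are absent from the UI list.
missing_from_ui() {
    k=$(mktemp)
    u=$(mktemp)
    kernel_disks < "$1" > "$k"
    sort -u "$2" > "$u"
    comm -23 "$k" "$u"    # lines only in the kernel list
    rm -f "$k" "$u"
}
```

For example, save `camcontrol devlist` to `devlist.txt`, put the disk names from the Web UI in `ui.txt`, and run `missing_from_ui devlist.txt ui.txt`. Any name it prints is known to the kernel but missing from the UI list, which at least narrows the question to the layer above the driver.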

Patrick M. Hausen · Mar 7, 2024

How is that JBOD connected to your TrueNAS?

swedish_six_shooter · Mar 7, 2024

Daisy chained, JBOD to (three other JBODs in series) to CPU. The three other JBODs and the CPU connection are SFF-8644 cables. The new JBOD is an SFF-8644 cable but it's some slightly different type. (Two cables come from the SFF-8644 connector and go into the other SFF-8644 connector.) Sourced from our tech guy, can't get more specific.

Patrick M. Hausen · Mar 7, 2024

I just wanted to rule out a USB enclosure or something like that.

With proper SAS infrastructure I am at the end of my wits. If you can get a full detailed inventory of the hardware involved, some of the other regulars more familiar with enterprise gear might be able to help.

nabsltd · Mar 8, 2024

swedish_six_shooter said:
The new JBOD is an SFF-8644 cable but it's some slightly different type. (Two cables come from the SFF-8644 connector and go into the other SFF-8644 connector.)

The two-cable version is just a construction variant. I don't know if the manufacturer does it for better noise rejection, or some arrangement of pin/cable, but it's the same electrically.

I've noticed that HP-branded versions use the dual-cable method.

swedish_six_shooter · Mar 11, 2024

Li'l update for the record, since something is obviously broken. Not gonna be quoting logs since I don't expect an immediate fix from this.

a) Over the course of the weekend, TrueNAS spat out reports of ATA errors on a number of the drives in the new JBOD. These have since gone from middling (1) to absurd (90) on a mix of drives, enough to make me wonder if the JBOD is bad, a lot of the drives are bad, or the cable is bad. (Also possible there was a power issue in the office, but seems pretty unlikely.)

b) TrueNAS is now picking up all drives in the JBOD, though they've notably all been reassigned much lower device numbers (da199 -> da3, for instance).

c) If I try to convert the JBOD into a pool, I receive an error from middlewared.py and the traditional long traceback that culminates in

Code:

middlewared.service_exception.CallError: [EFAULT] Failed to wipe disk da3: [Errno 30] Read-only file system: '/dev/da3'

Currently trying to figure out how to wipe /dev/da3, since it isn't mounted but `cat /dev/zero /dev/da3` throws up another complaint about a read-only FS. Also emailing our tech guy for any notes, and planning a reboot of the NAS to see if that sorts it out.

Apologies, this all just feels like a bit of a hot mess from something that has previously been quite mundane.

artlessknave · Mar 11, 2024

swedish_six_shooter said:
`cat /dev/zero /dev/da3`

/dev/zero is a system device that provides an endless stream of zeros. why are you trying to cat that? that's all it will do, give you endless zeros. and cat a drive? this is also very strange.

you can wipe drives with dd or wipefs, but drives are rarely read-only - that feels like something between your OS and the drives is fubar, and you would be at best trying to smash 1 symptom.

this definitely feels like a backplane, cabling, or controller issue - something in common among the drives.

swedish_six_shooter · Mar 11, 2024

artlessknave said:
/dev/zero is a system device that provides an endless stream of zeros. why are you trying to cat that? that's all it will do, give you endless zeros. and cat a drive? this is also very strange.

True, dd would have been a simpler choice! I was just trying to wipe out any drive content as directly as possible, and my mind leapt to filling the drive with zeros first. A silly thing.

artlessknave · Mar 11, 2024

your command, `cat /dev/zero /dev/da3`, would try to read all the contents of /dev/zero and print them to the console, THEN try to read all the contents of /dev/da3, and print them to the console...
it would, however, never reach /dev/da3, as /dev/zero has infinite zeros. you would have had to do something like cat /dev/zero | tee /dev/da3, but i dunno if that would even work.

dd would be the way, and I have used that to nuke mbr. (the whole drive takes forever). by the time you run these 2 the drive will show up as completely fresh. (not securely erased, though)

Code:

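# zero the first 512-byte sector (the MBR) of the target disk; destructive
dd if=/dev/zero bs=512 count=1 of=/dev/sda
# erase the remaining filesystem/RAID signatures (wipefs is a Linux util-linux tool)
wipefs -a /dev/sda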

but, again, I do not think that is the actual problem here, so I am doubtful that would help you any.
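To make the difference between the two commands concrete, here is a small sketch that runs on a throwaway temp file, never a real disk. `nonzero_bytes` is a hypothetical helper, and a short `printf` stands in for /dev/zero in the `cat` step so that the command actually finishes.

```shell
#!/bin/sh
# Demonstration on a scratch file only; nothing here touches a disk.

# Hypothetical helper: count the bytes in file $1 that are not NUL.
nonzero_bytes() {
    tr -d '\000' < "$1" | wc -c | tr -d ' '
}

scratch=$(mktemp)
printf 'old partition table' > "$scratch"

# cat with two operands only READS both and writes to stdout.
# The second operand is an input, so $scratch is left untouched.
printf 'zeros' | cat - "$scratch" > /dev/null
before=$(nonzero_bytes "$scratch")

# dd with of= actually WRITES: one 512-byte block of zeros into the file.
dd if=/dev/zero of="$scratch" bs=512 count=1 2>/dev/null
after=$(nonzero_bytes "$scratch")

echo "non-zero bytes before dd: $before, after dd: $after"
rm -f "$scratch"
```

The count stays at 19 after the `cat` step and drops to 0 after `dd`, which is the whole point: only a command that names the target as its output can overwrite it.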

swedish_six_shooter · Mar 13, 2024

I will continue to add notes to my dumb odyssey.

As of yesterday...

After I started digging through the SNs of the drives reporting ATA errors, I discovered I had things exactly backwards - these were all on a different JBOD, daXX numbers hadn't been getting reassigned, etc. So, it seems like I do have an old JBOD that might be failing b/c of backplane issues, but it isn't the new one.
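Matching serial numbers to device names by eye is where this kind of mix-up creeps in. A small sketch for doing the lookup mechanically, assuming a disklist.pl listing keeps the column layout shown earlier in the thread (two header lines, serial in the second-to-last column, every column filled in); `dev_for_serial` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the device name(s) whose serial equals $1.
# Reads disklist.pl-style text on stdin. The disk column contains a space,
# so the serial is located by counting fields from the END of the line.
dev_for_serial() {
    awk -v sn="$1" 'NR > 2 && $(NF - 1) == sn { print $1 }'
}
```

With a saved listing, `dev_for_serial ZA1AGGV20000R829CG7B < disklist.txt` would print `da221` for the drive shown above. Comparing the result against which enclosure physically holds that serial is then a separate, manual check.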

The new JBOD seems to be _mostly_ sorted out now, but again raises a question about some information being cached somewhere between the Web UI and the underlying TrueNAS OS.

As noted earlier, the new JBOD had noticed the new drive, so it seemed worth trying to convert to a pool.
This failed with the same error I logged above: middlewared.service_exception.CallError: [EFAULT] Failed to wipe disk da3: [Errno 30] Read-only file system: '/dev/da3'. I’ll call this Drive A.

I swapped out Drive A for Drive B. Drive B is actually the wrong size for this JBOD (we have a mix of sizes), and I simply didn’t notice.

Drive B was picked up by dmesg, but not by the Web UI. I figured I’d try reformatting the pool anyway. This worked! The Web UI reported it had a new pool, successfully built, but still logged that the pool had drive A.

I then did a detach-and-destroy of the new pool. Once that completed, I checked the list of disks and found that the Web UI had finally updated, and replaced Drive A with Drive B in its listing. It was now that I noticed the bad size.

I popped Drive B again and replaced it with Drive C, a new drive that’s the right size for the JBOD. Once again, dmesg picked up the new drive but the Web UI didn’t refresh. I then recreated the pool. The Web UI got understandably mad about this when I included Drive B in the data vdev because it was the wrong size, but quieted down when I said it should be a hot spare. Once the pool was made, I did a disconnect-and-destroy and the Web UI refreshed its data about the drive. I then made the pool (yet again), and everything seems copacetic.

So, to recap, here’s where I thought I was yesterday

* I have an old JBOD that appears to have some backplane or other serious hardware issues. (I will pray that these are not and will be asking our tech man to get a new JBOD)
* I have a new JBOD that seems like it’s ok?
* I have a mystery about why the new JBOD wasn’t noticing the one drive.
* I have Drive A, which TrueNAS thinks has a read-only filesystem that it refuses to purge with dd
* I have what seems to me like evidence of a failure of the Web UI to sync up with the underlying OS about drive data in some particular circumstances.

As of today, I’ve had a drive fault overnight in the new JBOD, so maybe something else is going on…

artlessknave · Mar 13, 2024

a bad backplane can cause issues elsewhere, as it starts outputting literal garbage that your controller has to deal with; SAS is usually pretty good at coping when things go wrong but it's definitely not perfect at it. are both of the chassis connected to the same HBA?
also, I don't see your hardware info; please include it, as that's a requirement of the forum (or tell me i'm blind if i missed it) and might make this more clear
