Bluefin to Cobia RC1 drive now fails with "duplicate IDs"

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
Running fine on a test homebrew TrueNAS Scale box on the latest Bluefin. Decided to try the Cobia RC1 path. The only significant issue is that I've "lost" a drive, according to the system, because of "duplicate IDs". This was not an issue before, and there was no warning. Luckily, the system is a) for testing and b) only two drives have this issue. It could have been more interesting if I had a bunch of drives in this state.

Here are some dmesg lines that show the issue:
Code:
[~]# dmesg -T |grep -i nvme0
[Tue Oct  3 15:06:40 2023] nvme nvme0: pci function 0000:03:00.0
[Tue Oct  3 15:06:40 2023] nvme nvme0: failed to register the CMB
[Tue Oct  3 15:06:40 2023] nvme nvme0: 48/0/0 default/read/poll queues
[Tue Oct  3 15:06:40 2023] nvme nvme0: VID:DID 8086:0a54 model:INTEL SSDPD2KS019T7 firmware:QDAA0130
[Tue Oct  3 15:06:40 2023] nvme nvme0: ignoring nsid 1 because of duplicate IDs


One vdev has both of the affected drives. Both drives are at the latest Intel/Solidigm firmware.
The closest thing to useful information I've found so far concerning non-unique EUI numbers in general: https://bbs.archlinux.org/viewtopic.php?id=282151

Any ideas how to "fix" this now?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
As far as I can see, this check is a new addition in the Linux 6.1 kernel in Cobia; it was not present in the 5.15 kernel in Bluefin. I would check the output of `nvme id-ns /dev/nvmeX -n 1` for all your NVMe drives, looking at the "uuid", "nguid" and "eui64" values, to find out whether the error is valid. If the numbers are not unique, that would be a standards violation, and the only question is whether the problem is worse than the fix.
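For example, something like this loop (just a sketch: it assumes nvme-cli is installed, that each drive exposes namespace 1, and it only compares "nguid" — repeat for "eui64"/"uuid" if those are populated) would show any duplicated values across all controllers:
Code:
# Print the nguid of namespace 1 for every NVMe controller,
# then list any value reported by more than one of them.
for d in /dev/nvme[0-9]*; do
    case "$d" in *n[0-9]*) continue;; esac   # skip namespace nodes like nvme0n1
    echo "$d $(nvme id-ns "$d" -n 1 | awk '/^nguid/ {print $3}')"
done | awk '{print $2}' | sort | uniq -d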
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
Yes, the referenced link indicated a number of vendors/drives have "quirks" associated with them. At least two of my Intel/Solidigm drives are in this boat:
Code:
[~]# nvme id-ns /dev/nvme2 -n 1|grep -E "id|i64"
anagrpid: 0
nvmsetid: 0
endgid  : 0
nguid   : 0100000001000000e4d25c3a83874bf0
eui64   : 0000000000000000
[~]# nvme id-ns /dev/nvme0 -n 1|grep -E "id|i64"
anagrpid: 0
nvmsetid: 0
endgid  : 0
nguid   : 0100000001000000e4d25c3a83874bf0
eui64   : 0000000000000000


Here is the lspci info for the two in question:
Code:
[~]# lspci -nn -d ::0108|grep 0a54
03:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]
42:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]


However, it looks like the NVME_QUIRK_IGNORE_DEV_SUBNQN "quirk" was added ~2 years ago and should likely have been included with this Cobia kernel build.

So it seems like there could be an omission in the current (LTS?) 6.1 kernel?

I'll be honest in saying I don't really understand the different quirk options, NVME_QUIRK_BOGUS_NID vs. NVME_QUIRK_IGNORE_DEV_SUBNQN, for example. My rough reading is that the latter ignores the subsystem NQN the device reports, while the former treats the device's namespace identifiers themselves as garbage, but don't take my word on that.
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
What is the TrueNAS Core equivalent of the "nvme id-ns /dev/nvme0 -n 1" command? I want to see if the production system has this issue. I found the "nvmecontrol" command, but it doesn't seem to give the nguid info I need.

Sadly, it may be the NVME_QUIRK_BOGUS_NID quirk line that needs to be added to the pci.c quirk table. However, that line addition does not seem to be in the official tree; it only seems to be mentioned for Proxmox and DD-WRT (or similar) kernel rebuilds. :(

Not sure how to proceed at this point... I'm going to see what it will take to submit a request for the main kernel tree to add "NVME_QUIRK_BOGUS_NID" to pci.c.
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
Upstream kernel change request @ https://bugzilla.kernel.org/show_bug.cgi?id=217981
This is my first time filing a kernel request, so I'm not sure how it will go from here. With any luck, the patch will make it into the 6.1 kernel tree for you guys to build against... or you could put the one-line change into your kernel tree and help me test???

Code:
diff -Nuar drivers/nvme/host/pci.c tmp/drivers/nvme/host/pci.c
@@ -3329,6 +3329,7 @@
 { PCI_VDEVICE(INTEL, 0x0a54),	/* Intel P4500/P4600 */
 	.driver_data = NVME_QUIRK_STRIPE_SIZE |
 			NVME_QUIRK_DEALLOCATE_ZEROES |
+			NVME_QUIRK_BOGUS_NID |
 			NVME_QUIRK_IGNORE_DEV_SUBNQN, },
 { PCI_VDEVICE(INTEL, 0x0a55),	/* Dell Express Flash P4600 */
 	.driver_data = NVME_QUIRK_STRIPE_SIZE |
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
Thanks for reporting the issue upstream. We'll happily merge the patch if upstream decides that's the way to go. It would also be good to complain to Intel/Solidigm, in case they can fix it at the firmware level.
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
BTW `nvmecontrol identify` on Core does report the IDs when called on a specific namespace rather than the controller:
Code:
Globally Unique Identifier: 36455730527086120025384500000003
IEEE EUI64: 0000000000000000
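For example (illustrative invocation; adjust the device name to your system):
Code:
# Identify namespace 1 of the first controller on Core/FreeBSD:
nvmecontrol identify nvme0ns1 | grep -E "Globally Unique|IEEE EUI64"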
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
For those of you who have made it this far, here are a couple of updates:
1. I've not yet had any response from the kernel guys :(
2. I've not yet posted to the Intel/Solidigm forum, as I had a registration issue that I only just got worked out.
3. I have a basic one-liner script that reports whether there are any duplicate Globally Unique Identifiers on NVMe drives, for an older TrueNAS Core system. Some research has indicated this is actually a fairly common manufacturer mistake:

test "$(cd /dev/ ; for i in $(ls nvme*ns1) ; do nvmecontrol ns identify $i 2>&1|grep Global ; done | awk '/^Global/ {print }' |sort -u |wc -l)" = "$(cd /dev/ ;
for i in $(ls nvme*ns1) ; do nvmecontrol ns identify $i 2>&1 |grep Global ; done | awk '/^Global/ {print }' |wc -l)" || echo "Danger Duplicate GIDs found"
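A variant under the same assumptions (TrueNAS Core, nvmecontrol, an ns1 on every drive) that prints the offending IDs rather than just a warning could look like:
Code:
# Flag every repeat of a Globally Unique Identifier after its first occurrence.
cd /dev && for i in nvme*ns1 ; do
    nvmecontrol ns identify "$i" 2>/dev/null | awk -v d="$i" '/^Global/ {print $NF, d}'
done | sort | awk 'seen[$1]++ {print "Duplicate GUID:", $0}'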

However, I've not yet been able to tell with 100% certainty that the BSD kernel is not already "fixing" these ID issues...

TrueNAS peeps: since this seems like a common issue, is there any chance you might add a basic pre-update checker to look for and warn of this impending doom? I got lucky, as I only had two drives in this category. It would be super uncool if this happened to a server full of them without warning...
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
The FreeBSD kernel does not enforce uniqueness; it passes the data through as-is. The TrueNAS WebUI itself complains when it sees non-unique IDs, but allows you to override it. Though if the new Linux kernel blocks the device, TrueNAS can't do much about it afterwards.
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
@mav@ I would agree that post-upgrade is too late. However, my intent was to strongly suggest a pre-Cobia-upgrade warning/check, especially since TrueNAS devs know there is a significant kernel change, relative to both the TrueNAS Core and the TrueNAS SCALE Bluefin kernel versions, that now implements a GUID uniqueness requirement. That may be a simple enough check/warning.
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
@truenas Peeps @mav@: it looks like you may need to update the 6.1.50-series kernel you are using to a more recent version, or apply the patch noted from July to the 6.1.40:
https://www.spinics.net/lists/stable/msg668521.html
This patch will reportedly:
"Relax our check for them so that it doesn't reject the probe on single-ported PCIe devices, but prints a big warning instead."

However, the odd thing is that kernel.org reports the patch should already have been in the 6.1.40 kernel:
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.1.40 (search for commit ac522fc6c3165fd0daa2f8da7e07d5f800586daa)
But it does not seem to be in the current 6.1.50 in use now?
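One way to double-check (just a sketch; assumes a local clone of linux-stable with the release tags fetched):
Code:
# Does the v6.1.50 tag contain the commit in question?
git merge-base --is-ancestor ac522fc6c3165fd0daa2f8da7e07d5f800586daa v6.1.50 \
    && echo "in 6.1.50" || echo "NOT in 6.1.50"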
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
@pjwelsh As far as I can see, the TrueNAS kernel does include the mentioned patch. But the patch does not fix it for dual-ported drives, and I suspect your drive, being an enterprise one, may be dual-ported. There is a comment:
"If the device also claims to be multi-path capable back off here now and refuse the probe the second device as this is a recipe for data corruption."
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
That multi-path part is what the NVMe kernel folks have also indicated; they are now working a patch through to the kernel tree:
"In this case, the drive may be dual ported capable, but is not dual port attached."

@mav@ will I need to follow up when they actually have the patch in place, or will it get picked up automagically?
 

mav@

iXsystems
iXsystems
Joined
Sep 29, 2011
Messages
1,428
@pjwelsh We update Linux kernels from time to time, so when it gets into 6.1 LTS we'll pick it up. If you want it sooner, please create a ticket for us to merge it once it lands at least in Linux master. Thanks for handling it.
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
The 6.1 kernel merge to address this issue has officially happened now.
Do I create a new ticket in this ticket system or is there a separate ticket system?
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
I may have been just a bit premature in saying it is fully done... The last notice was that the change was a "review patch", which could mean more time before it's "official". Sorry.
I don't have access to that Jira link, so I'm not sure what it may show.

[PATCH 6.1 160/196] nvme-pci: add BOGUS_NID for Intel 0a54 device

6.1-stable review patch. If anyone has any objections, please let me know.
...
commit 5c3f4066462a5f6cac04d3dd81c9f551fabbc6c7 upstream.

These ones claim cmic and nmic capable, so need special consideration to ignore
their duplicate identifiers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217981
...
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Code:
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3439,7 +3439,8 @@ static const struct pci_device_id nvme_i
 { PCI_VDEVICE(INTEL, 0x0a54),	/* Intel P4500/P4600 */
 	.driver_data = NVME_QUIRK_STRIPE_SIZE |
 			NVME_QUIRK_DEALLOCATE_ZEROES |
-			NVME_QUIRK_IGNORE_DEV_SUBNQN, },
+			NVME_QUIRK_IGNORE_DEV_SUBNQN |
+			NVME_QUIRK_BOGUS_NID, },
 { PCI_VDEVICE(INTEL, 0x0a55),	/* Dell Express Flash P4600 */
 	.driver_data = NVME_QUIRK_STRIPE_SIZE |
 			NVME_QUIRK_DEALLOCATE_ZEROES, },


...
 

pjwelsh

Dabbler
Joined
Oct 4, 2023
Messages
13
@mav@ are you in a position to know which TrueNAS kernel version this patch will be applied to?
It does not seem to be part of the current production kernel: 6.1.55-production+truenas
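In the meantime, here is the quick check I'm using after each boot (based on the same dmesg signature from the start of this thread):
Code:
# If the BOGUS_NID quirk were active, there should be no rejection line here.
dmesg -T | grep -iE "nvme.*(duplicate ids|ignoring nsid)"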
 