PCIe errors (AER / BadDLLP error) 21.08?

drjustice

Dabbler
Joined
Apr 2, 2021
Messages
23
I think I have the same motherboard as what's in a TrueNAS Mini XL, a Supermicro A2SDi-8C-HLN4F, with 64 GB of RAM, and it was quite solid when I was running TrueNAS Core 12.x.
I installed Scale recently, but now I am seeing the errors shown in the attached image.
I have 8x IronWolf 4 TB drives, 2 SATA SSDs, and 1 NVMe Samsung 950. The rest is just vanilla.
Does anyone know what this is and how I can fix this?
 

Attachments

  • thumbnail_IMG_1936.jpg

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Is that a Mellanox card in there? Maybe try swapping it for something else. The 10 Gig Networking Primer has an extensive discussion of cards known to work well.
 

drjustice

Oh my. Yes, there's a Mellanox card in it, and in fact I had completely forgotten about it because it's not in use right now. How did you identify it?

edit: So I went to the 10 Gig Networking Primer, and what's really strange is that most (if not all) of the information seems to be about TrueNAS Core. The card seemed to work fine with Core, but now I'm on Scale, which runs on a Linux kernel.

The 10 Gig primer has gazillions of pages. What are the main problems with the Mellanox cards? I thought they were pretty standard, well known, and reliable.
 

jgreco

Because "mlx4_core" suggests Mellanox to me, and that appears to be the device involved, according to the console spew.
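For anyone who wants to do the same identification from a saved log rather than by eye, here is a minimal Python sketch. It assumes the usual kernel log layout of "driver PCI-address: message"; the sample lines are illustrative only and are not copied from the screenshot in this thread, and the function name `aer_reporters` is made up for the example.

```python
import re

# Matches kernel log text of the form "<driver> <domain:bus:dev.fn>: <message>".
LOG_LINE = re.compile(
    r"(?P<driver>[A-Za-z0-9_-]+) "
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<msg>.*)"
)


def aer_reporters(log_text):
    """Return {(driver, pci_address): count} for lines mentioning a PCIe bus error."""
    counts = {}
    for line in log_text.splitlines():
        m = LOG_LINE.search(line)
        if m and "PCIe Bus Error" in m.group("msg"):
            key = (m.group("driver"), m.group("addr"))
            counts[key] = counts.get(key, 0) + 1
    return counts


# Illustrative lines only (hypothetical addresses and timestamps).
SAMPLE = """\
[  101.000001] pcieport 0000:00:0e.0: AER: Corrected error received: 0000:01:00.0
[  101.000002] mlx4_core 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[  101.000003] mlx4_core 0000:01:00.0:    [ 7] BadDLLP
"""

print(aer_reporters(SAMPLE))
```

The driver name in front of the PCI address is the clue: whichever driver keeps showing up next to "PCIe Bus Error" is bound to the device that is reporting the errors.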

The information about TrueNAS Core generally carries over to Linux as well. Linux may sometimes have better support for crappy hardware devices, but no amount of magic makes crappy hardware uncrappy. This can be crappy-by-design hardware such as many Realtek or Marvell chipsets made for the Windows PC marketplace, but it can also easily extend to other gear. For example, "broken" gear at data centers is often fed to companies that specialize in recycling server systems, and by "specialize", I mean many of them just get servers in, part them out, maybe blow them off with some air, look up the part number, ESD bag them, and list them on eBay (note the conspicuous lack of the word "testing"). Therefore one of the things I suggest when advising people on resellers for used gear is to look for large quantities; this is an indicator that they've received a rackful (or more) of identical servers, which more strongly correlates with a likelihood that the gear is good but just being updated/refreshed/cloud-ified/etc. By way of comparison, "quantity 1" sales tend to be more dodgy, and you really need to inspect for damage and test heavily, because you may be getting someone else's reject.

The main problem with the Mellanox cards tends to be that the "affordable" ones, like the older ConnectX-2 and ConnectX-3, are not the good ones, and that they tend to run warm IIRC, though that is a common failing across most 10G cards. Additionally, their driver support under FreeBSD is not really as strong as that of the recommended Chelsio or Intel cards, but it isn't the catastrophe that it used to be. The drivers are Mellanox-authored, however, which is a really good general sign.
 

drjustice

OK, thanks for your input. I'm not surprised about the FreeBSD support, but since Scale is Linux-based, I would have expected the Mellanox ConnectX-3 (what I have) to be well supported in general, given that these are, well, 10G cards, which are most often used in servers, which is basically what Linux is mostly used for.
 

jgreco

drjustice said:
... since Scale is Linux-based, I would have expected the Mellanox ConnectX-3 (what I have) to be well supported in general, given that these are, well, 10G cards, which are most often used in servers, which is basically what Linux is mostly used for.

As is FreeBSD, and both the FreeBSD and Linux drivers are authored by Mellanox, so really I'd expect them to be quite similar in quality/support/etc.

However, you appear to be conveniently ignoring the thing that brought you to post this in the first place, which is that there's some evidence that something isn't working quite right. This could easily be a bad card, which is what I was hinting at, or an incompatibility with your choice of mainboard, which certainly isn't unheard-of. Mellanox sells primarily to systems integrators, and in such arrangements, it is typically up to the OEM to verify compatibility and possibly generate their own custom firmware, which may actually be less compatible with random aftermarket mainboards. Vendors such as Intel that sell large numbers of retail cards often have fewer rough edges in their BIOS firmware and better compatibility with a wider range of things, simply because there are a lot more reports back from users of random platforms during the product lifecycle. Companies such as Mellanox are less likely to spend significant development effort to fix such incompatibilities when they come to light after end-of-sale, and we've seen stuff like this time and time again over the years.

So I don't really know what the issue is here, but I hope you can take away the point that you might have misunderstood what you've bought and what its level of compatibility with your platform might be. In my opinion, it works out like this:

1) It could be a defective card -- or a defective mainboard. Sorta feel like "ehh" on this one, and that's a professional's opinion, but it is the one most easily tested, through hardware swaps.

2) It could be overheating, which is totally checkable and totally correctable.

3) It could be a driver/BIOS issue, which is unlikely to be resolved if this is an end-of-sale product.

The great thing about getting a network card at a virtual garage sale like eBay is that you can get a great deal. The less-great thing about it is that you may not have a lot of recourse if it doesn't work out.
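When testing by hardware swaps, it helps to have numbers rather than a scrolling console. A rough Python sketch follows, assuming a Linux kernel that exposes per-device AER statistics in sysfs as `aer_dev_correctable` (not every kernel or device does). The PCI address is whatever your own system reports, the helper names are made up for the example, and the two snapshots below are hypothetical values.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'NAME COUNT' lines as found in a device's aer_dev_correctable file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer_counters(pci_address, kind="correctable"):
    """Read counters for one device; returns {} if the file is missing or unreadable."""
    path = Path("/sys/bus/pci/devices") / pci_address / f"aer_dev_{kind}"
    try:
        return parse_aer_counters(path.read_text())
    except OSError:
        return {}


def counter_growth(before, after):
    """Return only the counters that increased between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Hypothetical snapshots taken before and after a test transfer.
BEFORE = "RxErr 0\nBadTLP 0\nBadDLLP 12\nTOTAL_ERR_COR 12\n"
AFTER = "RxErr 0\nBadTLP 0\nBadDLLP 40\nTOTAL_ERR_COR 40\n"

print(counter_growth(parse_aer_counters(BEFORE), parse_aer_counters(AFTER)))
```

Take one snapshot with `read_aer_counters(...)`, run the same workload, take another, and compare. If the counts stop growing after a card swap, a slot change, or added airflow, that points at which of the three possibilities above you are dealing with.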
 

drjustice

I wasn’t ignoring the obvious: something definitely seems to be going on with the Mellanox. I was just puzzled as to why it would appear to behave incorrectly, and you’ve given me a couple of good reasons, namely that the product is EOL and unlikely to get bugfixes, that it’s likely to be certified for different hardware platforms than what I have, and that it might overheat, which is possible in the cramped area that I’ve got.

All good. I’ll disconnect the card as I didn’t really need it except for incredibly large transfers.
 

drjustice

It appears the Mellanox driver on my system is version 4.0, when Mellanox's own driver is in fact at 4.9, which is the LTS release that supports ConnectX-3 cards.


What's really puzzling to me is that Mellanox doesn't seem to support Debian 11 yet, but TrueNAS Scale (based on Debian 11) appears to be able to use a built-in driver after all; and the vermagic seems to be 5.10.xxx, when the latest 5.x version over at the Mellanox site appears to be 5.4.xxx.

1633452742870.png


Can it be built from source, or does the vermagic indicate that iXsystems did something to it already?

--- edit: it also appears I'm on a much older firmware (v2.33, which is like 2016/2017)

--- edit 2: version 5.10.xx appears to be the Linux kernel version, not the Mellanox driver version.
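
To keep the two version numbers apart, here is a small Python sketch that parses modinfo-style "key: value" output. The sample text is a hypothetical stand-in (the kernel release string in particular is invented), and the helper names are made up for the example. The one thing it relies on is that the first token of `vermagic` is the kernel release the module was built for, while `version` is the driver's own version.

```python
def parse_modinfo(text):
    """Parse 'key: value' lines from modinfo-style output; first occurrence of a key wins."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def kernel_release_from_vermagic(vermagic):
    """Return the first whitespace-separated token of vermagic (the kernel release)."""
    tokens = vermagic.split()
    return tokens[0] if tokens else ""


# Hypothetical sample; real output has more fields and a different release string.
SAMPLE = """\
filename:       /lib/modules/5.10.0-example/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version:        4.0-0
description:    Mellanox ConnectX HCA low-level driver
vermagic:       5.10.0-example SMP mod_unload modversions
"""

info = parse_modinfo(SAMPLE)
print("driver version:", info["version"])
print("built for kernel:", kernel_release_from_vermagic(info["vermagic"]))
```

So a vermagic of 5.10.xxx only says which kernel the module was compiled against. It says nothing about whether the driver itself is 4.0, 4.9, or anything else.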
 