DS4246 disks missing in TrueNAS Core but not in TrueNAS Scale

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
Hi All,

I've been running FreeNAS/TrueNAS for years. So far, I could always find help in other threads. This time it seems to be too specific; I have little experience with HBAs and SAS expanders. I hope the title is not misleading, but it reflects my latest discovery.

First off, I am setting up a new system for testing purposes and I am stuck.

ASRock Rack ROMED6U-2L2T​
AMD EPYC 7232P 8-Core Processor​
2x Micron 64 GB DDR4-3200 reg ECC​
SAS9305-16i 9305-16i (sas3224 - 16.00.12.00 - IT-Mode)​
2x NetApp DS4246 (2xIOM6) on QSFP(SFF-8436) to SFF-8644​
various SATA drives, some on interposers (only got 24 interposers so far)​

The hardware and setup seem to work fine so far. I know I am using a 16i card for an external connection, but with a little adapter card it's working fine. When connecting just one of the DS4246 shelves to the HBA everything seems fine and all 24 drives are recognized; hooking up the second one is when the trouble starts. All the drives are seen by the HBA in the BIOS, as well as by the IOM6 they are hooked up to.

So, I did a fresh install of TrueNAS CORE 12.0-U6.1 and to my surprise a lot of drives were missing. Most of the time it just recognizes one complete shelf, but when I mix drives or do not populate all the bays, the behavior starts to get random. The best I could achieve was 24 out of 48 disks; most of the time TrueNAS only sees 21 drives.

As a first step I tried different cabling, chaining the shelves vs. direct hookup, and so on. Here is what I have tried so far:
  • Hook up the shelves chained and both IOMs of the first
  • Hook up the shelves directly by using only one IOM each
  • Interposers on the drives for SAS connection
  • No interposers at all (only the first IOM recognizes the SATA drives, but that's normal behavior as far as other forum threads tell me)
  • Tried different ports on the controller
  • Tried different cables etc.
There is no error message I can see; the drives are just not showing up. By way of trial and error I then tried another operating system, and to my surprise all drives showed up. I then tried TrueNAS-SCALE-22.02-RC.1-1, since it is Debian based and not FreeBSD, and all the drives showed up. So I ran all my tests again (chaining, interposers, cables, ...) and every time the HBA saw all the drives in the BIOS, Core did not, and Scale did.

So I figured it must be a driver issue or something BSD related. I tried some tunables for mpr spin-up time and whatever else I could find, but nothing changed.
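For anyone trying the same, the mpr(4) tunables I mean go into /boot/loader.conf; the values below are only illustrative, see the man page for what each one does:

```shell
# /boot/loader.conf -- illustrative values, see mpr(4) for details
hw.mpr.spinup_wait_time=10    # seconds to wait for SATA drives to spin up
hw.mpr.0.debug_level=0x3      # per-controller logging mask for unit 0
```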

Since there is no newer driver I could find, and the newest one I did find (v20) is much older than the one in use (v23), I am not sure how to proceed.

I really hope that someone can point me in the right direction. Since this is my first post, please forgive me if I have failed to attach some necessary information.
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
OK, so far I've made little progress. I figured out that the firmware on the controller was up to date, but the BIOS & UEFI BSD versions were not. I updated both, but there was no change in behavior.
Adapter Selected is a Avago SAS: SAS3224(A1)

Controller Number : 0
Controller : SAS3224(A1)
PCI Address : 00:43:00:00
SAS Address : 500062b-2-02b5-e700
NVDATA Version (Default) : 10.00.00.05
NVDATA Version (Persistent) : 10.00.00.05
Firmware Product ID : 0x2228 (IT)
Firmware Version : 16.00.12.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9305-16i
BIOS Version : 08.37.02.00
UEFI BSD Version : 18.00.03.00
FCODE Version : N/A
Board Name : SAS9305-16i
Board Assembly : 03-25703-02003
Board Tracer Number : SP71503600
I also tried a clean FreeBSD 13 and TrueNAS Core 12.1 (master) install. Since both behave the same way as TrueNAS 12 Core, I have no clue, but would guess the problem is the driver or some tunable. Just to be sure, I ordered some more interposers to check whether this does the trick, but that will take a couple of days.

I will try to set the mpr driver to debug, but I am not sure I could correctly interpret the results, since there are no error messages; that just leaves the informational output.
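In case it helps anyone following along, the mpr(4) verbosity can also be raised at runtime without a reboot; the mask below is just an example, the bit meanings are listed in the man page:

```shell
# raise mpr(4) logging on the first controller; mask is an example,
# see mpr(4) for what each bit enables
sysctl dev.mpr.0.debug_level=0x1F
# turn most of it back off again afterwards
sysctl dev.mpr.0.debug_level=0x3
```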

Any advice on what could be the problem, or how to get more information, would be appreciated. I would rather not order and run one SAS controller per disk shelf.
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
Today the new interposers arrived, but they did not help either. It seems to make no difference (for this problem at least) whether I use interposers on some, all, or none of the disks.

In the meantime, I also tried running the system on VMware ESXi and installing TrueNAS Core as a VM with the HBA passed through. But this did not solve anything either.

I also borrowed a SAS expander (Adaptec), but it seems I cannot run the IOMs chained to this, or maybe to any other expander.

As a next, and probably one of the last, steps I can think of, I will borrow another controller and try to run the setup on it. Also, I have not yet gotten around to setting the mpr driver to debug and checking all the informational output. Guess I will have something to do over the weekend.
 

Barney318

Cadet
Joined
Oct 29, 2021
Messages
6
Just a thought:
Have you set the shelf ID on the DS4246?
I am only using one shelf, and I had to set the ID to 2 for it to work on an IBM x3650 M3.
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
Thank you for the suggestions.

I got the shelves with both preset to ID 08. I then set them to ID 01 and ID 02. Just now, to be sure, I tried setting them to multiple other IDs as well: 04 and 12, and then 05 and 25, but not much changed. They are still recognized by the controller and by Linux, but not by TrueNAS or FreeBSD for that matter.

One thing that changed: it seems that setting the ID changed the order in which the shelves/drives are recognized. One of the shelves is not fully stocked with drives. TrueNAS sees the shelf with the higher ID first; if that is the one not fully stocked, I can see the drives from it plus some drives from the other shelf. So it does see both shelves, but just stops after 24 drives.

Knowing this, I went through the log again, and this time I found something that could be related. I did not see these messages earlier, but booting with this order of drives gives me this:
Nov 27 08:51:08 truenas Syncing disks...
Nov 27 08:51:08 truenas coroutine raised StopIteration
Nov 27 08:51:08 truenas Traceback (most recent call last):
Nov 27 08:51:08 truenas File "/usr/local/lib/middlewared_truenas/plugins/enclosure.py", line 151, in _build_slot_for_disks_dict
Nov 27 08:51:08 truenas slots = next(filter(lambda x: x["name"] == "Array Device Slot", enc["elements"]))["elements"]
Nov 27 08:51:08 truenas StopIteration
Nov 27 08:51:08 truenas
Nov 27 08:51:08 truenas The above exception was the direct cause of the following exception:
Nov 27 08:51:08 truenas
Nov 27 08:51:08 truenas Traceback (most recent call last):
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 367, in run
Nov 27 08:51:08 truenas await self.future
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 403, in __run_body
Nov 27 08:51:08 truenas rv = await self.method(*([self] + args))
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 975, in nf
Nov 27 08:51:08 truenas return await f(*args, **kwargs)
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/disk_/sync.py", line 191, in sync_all
Nov 27 08:51:08 truenas await self.middleware.call('enclosure.sync_disks')
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1256, in call
Nov 27 08:51:08 truenas return await self._call(
Nov 27 08:51:08 truenas File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1213, in _call
Nov 27 08:51:08 truenas return await methodobj(*prepared_call.args)
Nov 27 08:51:08 truenas File "/usr/local/lib/middlewared_truenas/plugins/enclosure.py", line 220, in sync_disks
Nov 27 08:51:08 truenas curr_slot_info = await self._build_slot_for_disks_dict()
Nov 27 08:51:08 truenas RuntimeError: coroutine raised StopIteration
I do not know if this is related; this error only shows up in TrueNAS and not in FreeBSD 12/13. But I will check the Python files and hope something there will make sense.
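The traceback pattern can be reproduced in isolation: `next()` on an empty `filter()` raises StopIteration, and PEP 479 converts that into a RuntimeError when it escapes a coroutine, which matches the "coroutine raised StopIteration" in the log. A minimal sketch (the enclosure dict is made up):

```python
import asyncio

# an enclosure that reports no "Array Device Slot" element (made-up data)
enc = {"elements": [{"name": "Power Supply", "elements": []}]}

async def build_slots(enc):
    # same pattern as enclosure.py line 151: next() on an empty filter()
    # raises StopIteration inside a coroutine -> RuntimeError
    return next(filter(lambda x: x["name"] == "Array Device Slot",
                       enc["elements"]))["elements"]

async def build_slots_safe(enc):
    # giving next() a default turns the missing element into an empty list
    match = next((x for x in enc["elements"]
                  if x["name"] == "Array Device Slot"), None)
    return match["elements"] if match else []

try:
    asyncio.run(build_slots(enc))
except RuntimeError as e:
    print("crashed:", e)   # crashed: coroutine raised StopIteration

print(asyncio.run(build_slots_safe(enc)))   # []
```

So if one of the enclosures enumerates without any drive-slot elements, this code path would fall over exactly like the log shows.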
 

Barney318

Cadet
Joined
Oct 29, 2021
Messages
6
I am only using one shelf, and TrueNAS appears to allocate the devices (da0-da18 in my case) all over the shelf.
The only way I found to see which position in the shelf is which da number is to get a list of the drive serial numbers, then in the web interface look at Storage, then Disks, where you can see the serial numbers.
You can also add a description (I put the position in there), but the description is not one of the default columns, so you have to click on the Columns box top right and enable it.
I have no idea how it works with more than one shelf.
I also had problems running two cables from the shelf back to the server, so I am only using one port on the server and one controller card in the DS4246. I may loop it to the other card later, but it appears to work OK for now.
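That serial-to-slot bookkeeping can be scripted. On FreeBSD, `geom disk list` prints each disk's `ident` (serial number); a rough sketch of parsing that output follows (the sample text and serials below are made up, feed in the real command output):

```python
import re

# made-up sample of `geom disk list` output; replace with the real thing,
# e.g. subprocess.check_output(["geom", "disk", "list"], text=True)
sample = """\
Geom name: da0
Providers:
1. Name: da0
   Mediasize: 4000787030016 (3.6T)
   ident: ZC1ABCDE
Geom name: da1
Providers:
1. Name: da1
   Mediasize: 4000787030016 (3.6T)
   ident: ZC1FGHIJ
"""

def serials_by_device(geom_output):
    """Map daN device names to drive serials from `geom disk list` text."""
    mapping = {}
    current = None
    for line in geom_output.splitlines():
        m = re.match(r"Geom name:\s+(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*ident:\s+(\S+)", line)
        if m and current:
            mapping[current] = m.group(1)
    return mapping

print(serials_by_device(sample))
# {'da0': 'ZC1ABCDE', 'da1': 'ZC1FGHIJ'}
```

With that dict in hand, matching serials against the shelf's printed drive labels gives the slot positions.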
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
Today I could finally borrow another controller and test my setup with it. So, I temporarily got a SAS9211-8i:
Adapter Selected is a LSI SAS: SAS2008(B2)

Controller Number : 0
Controller : SAS2008(B2)
PCI Address : 00:42:00:00
SAS Address : 500605b-0-04e9-ebeb
NVDATA Version (Default) : 14.01.00.08
NVDATA Version (Persistent) : 14.01.00.08
Firmware Product ID : 0x2213 (IT)
Firmware Version : 20.00.07.00
NVDATA Vendor : LSI
NVDATA Product ID : SAS9211-8i
BIOS Version : 07.39.02.00
UEFI BSD Version : 07.27.01.01
FCODE Version : N/A
Board Name : SAS9211-8i
Board Assembly : N/A
Board Tracer Number : N/A
and now I am totally confused. I hooked up my setup and only swapped my SAS9305-16i for the SAS9211-8i. What can I say, everything was working right away. o_O All the drives are seen by TrueNAS, all the drives are accessible and working nicely.

So I guess I will either try to trade my SAS9305-16i for a way older, slower, cheaper SAS9211-8i, or I will just have to start flashing every firmware I can find onto my SAS9305-16i and hope one of them works.

Anyone here that had something similar in any way?
 

blanchet

Guru
Joined
Apr 17, 2018
Messages
515
Maybe an incompatibility between the SAS3 HBA and the SAS2 expander?
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
I will look into it, thank you.

So far I have tried firmware 16.00.12.00, 16.00.11.00, and 16.00.01.00 on my 9305-16i, but nothing changed. The controller works on Linux and Windows with all drives, but on FreeBSD, or TrueNAS for that matter, I am stuck with 24 drives.
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
I am pretty much convinced now that the problem is between the FreeBSD mpr driver, multiple SAS3 HBA chips, and the DS4246 IOM6 units.

Today I got two more controllers to test my setup:
  • SAS 9300-8e (SAS3008 chip)
  • SAS 9400-8e (SAS3408 chip)
and I got the same results as with the
  • SAS9305-16i (sas3224 chip)
It works everywhere but in FreeBSD. I tried Windows, Ubuntu and TrueNAS Scale; on all of these I could get all 24+10 drives (one DS4246 is not fully stocked) to work. In FreeBSD/TrueNAS Core I can only see the first 24 drives. The most interesting fact I can think of is that I can also get 10+14 drives to work. So the system can access both DS4246 shelves, but no more than 24 drives total.

Since I have tried multiple firmwares, BIOSes, controllers, and whatnot, I am giving up for now. I put the mpr driver into debug mode, but cannot see anything that would help me. Maybe someone else can make heads or tails out of it; I attached the output.

For now I will put the old SAS9211-8i to use and probably wait for TrueNAS Scale to be ready for production use, then switch over if it seems as stable as TrueNAS Core. Not the ideal solution with the much older and slower SAS9211, but better than keeping the system on hold for another month or two.

If anyone has any idea what else I could try, please reply or send me a PM. I would be more than glad to provide more debug output from multiple setups (daisy-chained, all 4 IOMs, just 2 IOMs, single shelf, both, and so on) if someone thinks it would help to figure out how to solve this.
 

Attachments

  • mpr_output_on_boot.txt
    841.4 KB · Views: 138

Ilyakaz

Cadet
Joined
Dec 28, 2021
Messages
1
Did you fix it or get any more suggestions? I have a similar problem: 4x DS4486 with 192 drives total and a NetApp X2065A-R6 111-00341+F1 quad-port QSFP PCIe card. I had it find 144 drives. Possibly my fault for impatience: I initially only powered 3 of the 4 PSUs, so not all lit up immediately, and then it threw a SCSI error (I think because I set power to minimal and spin-down after 30 minutes and didn't do a - n standby,0 - they were chucking out about 2-3 kW of heat at idle!). At the moment it can see only about 86.

Like you, I fiddled with cabling variants. The last iteration is 2 ports going into the top shelf chained to the 3rd, and 2 ports going into the 2nd chained to the 4th.

Powered up the top shelf and it picked up most drives pretty quickly; the 2nd not so good, and it went downhill as I powered up the 3rd and 4th. Obviously I have to boot with the drives out, or it hangs waiting on CAM.

I have 2x HP StorageWorks 24-drive shelves on another adapter, which are picked up perfectly.

I've renumbered the drive shelves as 0, 1, 2, 3 (I had duplicate numbers 0, 1 when it found 144). I've googled away, and your post is the closest I have come to a clue.

There are also degraded multipath errors on a load of drives. And like you, I have no clue which drive number is which. I tried pulling drives and seeing what errors were thrown.

Tried camcontrol rescan and also inquiry on the supposedly problematic drives.

My next move was going to be to pull all the drives out(!), reboot, then insert them one by one.

I could try installing TrueNAS Scale.

The machine is an HP 380 G7 with 2x 5650 (24 virtual cores) and 108 GB RAM. I've put Nextcloud in a jail. Watching htop, the processor is not sweating when it hangs. I am not really familiar with BSD (I have Ubuntu on other machines), and I was also thinking of just trying a manual configuration on Ubuntu using ZFS if this doesn't work. I like the idea of a dedicated OS with a GUI, but it does crash/hang a lot! I am running Kubernetes/Ubuntu/Python mainly on other machines.

I have an InfiniBand 40Gb card that it seems to load a driver for (and an InfiniBand switch and cards in other machines; need to get that data out of there!), but I just put in a 10Gb adapter as well, as that may be much easier to configure. I couldn't even get link aggregation to work on 2 of the 4 1G adapters.

Any help appreciated (on the NetApp issue of the thread primarily, but also if anyone reading has gotten InfiniBand working, by chance!) by PM or here.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I am pretty much convinced now that the problem is between the FreeBSD mpr driver, multiple SAS3 HBA chips, and the DS4246 IOM6 units.

Today I got two more controllers to test my setup:
  • SAS 9300-8e (SAS3008 chip)
  • SAS 9400-8e (SAS3408 chip)
and I got the same results as with the
  • SAS9305-16i (sas3224 chip)
It works everywhere but in FreeBSD. I tried Windows, Ubuntu and TrueNAS Scale; on all of these I could get all 24+10 drives (one DS4246 is not fully stocked) to work. In FreeBSD/TrueNAS Core I can only see the first 24 drives. The most interesting fact I can think of is that I can also get 10+14 drives to work. So the system can access both DS4246 shelves, but no more than 24 drives total.

Since I have tried multiple firmwares, BIOSes, controllers, and whatnot, I am giving up for now. I put the mpr driver into debug mode, but cannot see anything that would help me. Maybe someone else can make heads or tails out of it; I attached the output.

For now I will put the old SAS9211-8i to use and probably wait for TrueNAS Scale to be ready for production use, then switch over if it seems as stable as TrueNAS Core. Not the ideal solution with the much older and slower SAS9211, but better than keeping the system on hold for another month or two.

If anyone has any idea what else I could try, please reply or send me a PM. I would be more than glad to provide more debug output from multiple setups (daisy-chained, all 4 IOMs, just 2 IOMs, single shelf, both, and so on) if someone thinks it would help to figure out how to solve this.

This should be a workable setup, and you've clearly put in significant effort. This seems ripe for a Jira ticket to iXsystems. It's pretty much outside the realm of what anyone on the forums, even me, is likely to have available for experimental purposes.
 

klu

Dabbler
Joined
Nov 19, 2021
Messages
16
I would really like to open a Jira ticket with iXsystems, but since I cannot find any real error messages that seem helpful, I have not done it yet.

But thanks for the advice, I will open a ticket as soon as I can get around to it.

@Ilyakaz
I have not made much progress since. The only thing I could figure out, as stated in my last post, is that it seems to be an issue revolving around the SAS controller and the mpr driver of TrueNAS Core/FreeBSD.

I could get it to work for now by using one SAS3 controller card per shelf, or by using one SAS2 controller and chaining the shelves. It has been running in both configurations for some weeks now. Right now I am using one "LSI SAS: SAS2008(B2)", because it seems overkill to run one SAS3 controller per shelf.
 
Last edited:

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Error messages are not a requirement. They may ask you to submit a debug tarball, which can be done from the GUI under System->Support.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
Today I could finally borrow another controller and test my setup with it. So, I temporarily got a SAS9211-8i

and now I am totally confused. I hooked up my setup and only swapped my SAS9305-16i for the SAS9211-8i. What can I say, everything was working right away. o_O All the drives are seen by TrueNAS, all the drives are accessible and working nicely.

So I guess I will either try to trade my SAS9305-16i for a way older, slower, cheaper SAS9211-8i, or I will just have to start flashing every firmware I can find onto my SAS9305-16i and hope one of them works.

Anyone here that had something similar in any way?
Just a thought, but do the 16i and 8i parts of the names refer to PCIe lanes? If they do, maybe you need a different topology, as one of the cards (the 16i?) cannot, for some reason, get enough lanes? Maybe switch PCIe port (or riser)?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,600
@homer27081990, no. The 8i and 16i in LSI SAS card names (and other branded cards) refer to the number of SAS ports and the fact that they are internal ("i").

LSI SAS cards tend to be 8 PCIe lanes. But, like all PCIe devices, they should automatically work at 4 (or fewer) electrical lanes in an 8- or 16-lane physical slot. Of course, a lower lane count or an earlier PCIe release (v1 or v2) will affect performance.
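As a rough back-of-envelope on the lane question (the per-lane throughput figures are approximate usable rates after encoding overhead, and spinning disks rarely all stream sequentially at once):

```python
# approximate usable MB/s per PCIe lane after encoding overhead
MBPS_PER_LANE = {"pcie2": 500, "pcie3": 985}

def link_vs_drives(gen, lanes, drives, mb_per_drive=250):
    """Compare host link bandwidth to worst-case sequential HDD demand."""
    link = MBPS_PER_LANE[gen] * lanes
    demand = drives * mb_per_drive
    return link, demand

# 8 lanes of PCIe 3.0 vs 48 HDDs streaming flat out (illustrative numbers)
print(link_vs_drives("pcie3", 8, 48))   # (7880, 12000)
# the same HBA dropped to 4 lanes of PCIe 2.0
print(link_vs_drives("pcie2", 4, 48))   # (2000, 12000)
```

So fewer lanes or an older PCIe generation costs throughput under load, but it would never make drives disappear from enumeration.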

I used to own a 4e4i card, because I had 2 SAS / SATA drive bays internal, but also wanted some external ports. (I had an external eSATA DVD-ROM drive...)
 
Last edited: