All Drives In Disk Shelf go offline after Upgrade 11.2 -> 11.3 -> 12.0

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
I am having an issue with all disks in one shelf going offline.
I had the same issue occur last year when I upgraded from 11.2 to 11.3. At the time I rolled back to 11.2 and the issue stopped.

I have now upgraded again, from 11.2 to 11.3 and then to 12.0-U3.1, and the issue is back.
I am leaning heavily toward hardware failure. However, it seems strange that the issue started directly after the upgrades in both instances.

Summary of problem:
All of the drives in one disk shelf go offline and never recover unless I reboot the server.


Summary of Hardware and configuration:
Dell R710 running ESXi 6.7.x
TrueNAS is a guest on this host
32GB dedicated RAM
4 vCPUs
LSI 9201-16e (IT mode) via PCI pass-through
Disk Shelves:
DS-4486 (With Dell SAS Module)
DS-2246 (With NetApp SAS Module)

The issue occurs on the disks contained in the DS-4486. This shelf has dual PSUs and is connected to a UPS. The timing of these events does not directly correlate with scheduled tasks, and they do not occur at a specific time.

I am looking for guidance on what logs I should be looking at to pinpoint the cause and a potential solution to this issue.
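
For anyone suggesting specifics, these are the sorts of places I assume I should start (just a sketch, using the stock FreeBSD tools available from the FreeNAS shell):

Code:
# Kernel/driver messages around the time the shelf drops (mps0 is the LSI HBA)
grep -E 'mps0|detached|disconnected|Periph destroyed' /var/log/messages

# Recent kernel buffer, in case the event is still in dmesg
dmesg | grep -E 'mps0|detached'

# Pool and device state as ZFS sees it
zpool status -v

# SMART health for a suspect disk (da1 is just an example device name)
smartctl -a /dev/da1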

Thanks


Update: 6/15/21
Drives in the other shelf report as online; however, they are also inaccessible.
The issue recurs every 5 days.
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Bump.

In the interim I will roll back to 11.2 from a snapshot. If required, I may just spin up a new VM, install 12.0 fresh, and migrate the pools over.
 

jgreco · Resident Grinch · Joined May 29, 2011 · Messages: 18,680
That sounds like it could be a dodgy hardware compatibility issue. Typically, it is a good idea to jettison the hypervisor and run on bare metal; this helps clarify whether it is an actual hardware issue or something with the house of cards involved in virtualization. This is pretty easy to do because you just need to boot FreeNAS from a thumb drive and see how it goes for a week or two. There can be various MSI/MSI-X issues that only pop up once in a blue moon, which is why I am so insistent that people thoroughly test their systems in an offline role before deploying them to production...
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Just to clarify: this is potentially an issue between the newer version of the underlying guest OS (FreeBSD), my configuration, and the hardware it rides on?

This is not a prod environment; it's a home server/lab. However, it's not easy to bring down the entire host (and lose access to all the VMs) to perform that bare-metal testing. An R710 is a bit overkill for a NAS on its own (at least in my use case).

I could potentially boot up a new VM, pass the HBA through to it with a fresh install of the latest TrueNAS, and see if the issue recurs.

I am back on 11.2 at present. If the hardware were at fault, this issue would continue to occur regardless of version, yes?

Your guidance is appreciated,
 

jgreco · Resident Grinch · Joined May 29, 2011 · Messages: 18,680
It depends. We don't know what's wrong. In general, it could work and probably ought to work, but Supermicro didn't figure out reliable passthru until around Sandy Bridge, and HP and Dell are their own things.

You need to remember that FreeBSD plus ZFS is a very complicated house of cards, working on Intel-architecture ("PC" heritage) hardware, with lots of ins and outs and caveats and strange bugs that existed on one platform one place at one time, and quite frankly it is sometimes amazing that it works at all.

Then you cut out that foundation and instead place another "platform" underneath it, which is an equally complicated house of cards, and then you expect to be able to pass a hardware device from one to the other, and it needs to work *perfectly*, because ZFS places incredibly crushing I/O loads on the controller.

Quite frankly, it's amazing that this is possible at all, and it's important to remember that you are actually asking quite a bit out of things. In general, the description you gave sounded like the flakiness people experienced on Supermicro Westmere and Nehalem systems, some of which shook out over time with updated BIOS, and some of which didn't. And ....

I'm an idiot. I was confusing this thread with something else. I just looked up what an R710 was, which I should have known, since we get R510's thru here now and then.

You've got something which is probably a Westmere or Nehalem system. First off, Intel's support for VT-d was very "new" back when your server was built, and probably did NOT work (well, at all) at the time. Dell did upgrade the BIOS and got it mostly working, but I do emphasize "mostly". This is basically the same bad experience people have had with Supermicro and HP pre-Sandy Bridge systems.

You might be able to get it to work, you might not. You will find it very finicky about BIOS version, hypervisor, OS, number of CPU cores, etc. These factors will affect its overall reliability; do not expect "pass/fail" type results. Well, obviously "fail" is "fail."
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Yes, they are Westmere E5620s

I appreciate all of the input and knowledge...I will just stick with 11.2 until tech refresh time comes around.

Thanks
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Just an update:
I put together a desktop and migrated the FreeNAS box to its own physical hardware.

Motherboard: Gigabyte UD4H (Z87 Haswell board)
CPU: Intel Core i7 4770k
Memory: Kingston HyperX Fury 1866 DDR3 CL10 2x8GB (2 of these for a total of 32GB)
HBA: LSI 9201-16e (IT MODE)

This is far from ideal. However, I am tired of the drives going offline, so I will take the virtualization out of the equation. If this runs stably for the next 30 days, I will look into getting some server-grade hardware to toss into this desktop and leave FreeNAS on its own little hardware platform.

If the disks go offline again...I guess then we can really troubleshoot some logs.
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Update:
The drives went offline again this morning, 7/6/21, around 7 AM EST.

Debug logging was enabled and I have logs dumping into Splunk.

This is when the drives started to fall off:


Code:
 Jul  6 07:04:31 helydokk GEOM_MIRROR: Device swap1: provider da19p1 disconnected.
Jul  6 07:04:31 helydokk (da1:mps0:0:18:0): Periph destroyed
Jul  6 07:04:31 helydokk da10: <SEAGATE ST3400755SS NS25> s/n 3RJ1Q6JC detached
Jul  6 07:04:31 helydokk da10 at mps0 bus 0 scbus0 target 27 lun 0
Jul  6 07:04:31 helydokk s/n PAGU4JWT             detached
Jul  6 07:04:31 helydokk 0): da12: <HITACHI HUS724040ALE64SM 4321>Periph destroyed
Jul  6 07:04:31 helydokk (da4:mps0:0:21:da12 at mps0 bus 0 scbus0 target 31 lun 0
Jul  6 07:04:31 helydokk da7: <HGST HUS724040ALE64SM 4321> s/n PK1334PBHPTLSP detached
Jul  6 07:04:31 helydokk da7 at mps0 bus 0 scbus0 target 24 lun 0
Jul  6 07:04:31 helydokk da1: <HGST HUS724040ALE64SM 4321> s/n PK2334PBHUNENT detached
Jul  6 07:04:31 helydokk da1 at mps0 bus 0 scbus0 target 18 lun 0
Jul  6 07:04:31 helydokk da19: <Hitachi HUS724040ALE64SA 4321> s/n PK1331PAJ7AZ9S detached
Jul  6 07:04:31 helydokk da19 at mps0 bus 0 scbus0 target 42 lun 0
Jul  6 07:04:31 helydokk da9: <SEAGATE ST3400755SS NS25> s/n 3RJ1QCR0 detached
Jul  6 07:04:31 helydokk da9 at mps0 bus 0 scbus0 target 26 lun 0
Jul  6 07:04:31 helydokk GEOM_MIRROR: Device swap1: provider da20p1 disconnected.
Jul  6 07:04:31 helydokk da20: <Hitachi HUS724040ALE64SM 4321> s/n PK2331PAK10XBT detached
Jul  6 07:04:31 helydokk da20 at mps0 bus 0 scbus0 target 43 lun 0
Jul  6 07:04:31 helydokk da11: <WDC WD30EFRX-68EUZSM 4321> s/n WD-WMC4N0F8P6Z3 detached
Jul  6 07:04:31 helydokk da11 at mps0 bus 0 scbus0 target 29 lun 0
Jul  6 07:04:31 helydokk da2: <HGST HUS724040ALE64SM 4321> s/n PK1334PBHRJKLP detached
Jul  6 07:04:31 helydokk da2 at mps0 bus 0 scbus0 target 19 lun 0
Jul  6 07:04:31 helydokk da5: <HGST HUS724040ALE64SM 4321> s/n PK2334PBHUMNXT detached
Jul  6 07:04:31 helydokk da5 at mps0 bus 0 scbus0 target 22 lun 0
Jul  6 07:04:31 helydokk da3: <HGST HUS724040ALE64SM 4321> s/n PK2334PBHTNLTT detached
Jul  6 07:04:31 helydokk da3 at mps0 bus 0 scbus0 target 20 lun 0
Jul  6 07:04:31 helydokk da4: <HGST HUS724040ALE64SM 4321> s/n PK2334PBHTJ5YT detached
Jul  6 07:04:31 helydokk da4 at mps0 bus 0 scbus0 target 21 lun 0


I have attached the entire log for before and after these events above.

Scheduled tasks:
SMART short tests: every month on days 05, 12, 19, 26 (3 AM)
SMART long tests: every month on days 08, 22 (4 AM)
Scrubs: every month on days 01, 15 (4 AM)

Thanks
 

Attachments

  • FreeNAS-11.2_Debug_Logs_Offline_Disks.txt
    143.6 KB · Views: 240

Spearfoot · He of the long foot · Moderator · Joined May 13, 2015 · Messages: 2,478
How are you cooling this system? It looks like /dev/da16 (a Samsung SSD) is running at 71C -- that's hot!
Code:
Jul  6 07:05:20 helydokk smartd[72814]: Device: /dev/da16 [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71

Makes me wonder how hot your HBA is running. Perhaps it's failing due to overheating?
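
If you want to check them all in one pass, a quick loop like this should do it (just a sketch; SAS drives print "Current Drive Temperature" while SATA drives report it as SMART attribute 190/194):

Code:
# Dump the reported temperature for every disk the system sees
for d in $(sysctl -n kern.disks); do
  echo "== /dev/$d"
  smartctl -a /dev/$d | grep -Ei 'Current Drive Temperature|Airflow_Temperature|Temperature_Celsius'
done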
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Quick answer: no, the HBA temp should not correlate with the drive temps; they are in two different chassis.

All of the 2.5" drives: NetApp DS-2246
All of the 3.5" drives: NetApp DS-4486

These shelves sit in a closed server cabinet (back open) with 3x 140mm deep fans running up top.

Quick Reads:
3.5s:
root@helydokk[~]# smartctl -a /dev/da1 | grep "Drive Temp" <---HGST 4TB Drive
Current Drive Temperature: 47 C

root@helydokk[~]# smartctl -a /dev/da11 | grep "Drive Temp" <---WD Red 3TB Drive
Current Drive Temperature: 36 C

2.5:
190 Airflow_Temperature_Cel 0x0032 068 035 000 Old_age Always - 32 <--- Samsung EVO 850

root@helydokk[~]# smartctl -a /dev/da15 | grep "Temp" <--- Seagate 2.5 SAS 300GB
Temperature Warning: Disabled or Not Supported
Current Drive Temperature: 35 C
Drive Trip Temperature: 50 C

I find it a bit perplexing that the platter drives in the same disk shelf are running at half the temp of the SSDs.

The FreeNAS system itself is housed in a Desktop tower to the side of the server cabinet.

Without the hypervisor, I no longer have an easy way to see those readings. I can go down there and check just how hot the HBA is running (using a handheld digital temp sensor).

I will report back with my findings.

The direct IR reading on the HBA heatsink is 65 C.
I could not find official literature (at a glance) for the LSI 9201-16e; I found articles claiming anywhere from 55 C to 103 C, so the card is either running hot or quite cool. -_-

BTW, the ambient temperature in the house is 81 F (27 C).

I took the door off the front of the rack (readings taken 20+ minutes afterward):

New reads:
root@helydokk[~]# smartctl -a /dev/da1 | grep "Drive Temp" <---HGST 4TB Drive
Current Drive Temperature: 46 C

root@helydokk[~]# smartctl -a /dev/da11 | grep "Drive Temp" <---WD Red 3TB Drive
Current Drive Temperature: 36 C

root@helydokk[~]# smartctl -a /dev/da16 | grep "Temp"
190 Airflow_Temperature_Cel 0x0032 071 035 000 Old_age Always - 29


Basically no change on either of the disk shelves.
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Now that I have re-aligned myself with sanity.

The drives are not running hot.

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0032   071   035   000    Old_age   Always       -       29


The RAW_VALUE is the actual temperature of the drive; it is currently 29 C.

The VALUE and WORST columns are calculated by subtracting the temperature from 100: VALUE 071 corresponds to the current 29 C, and WORST 035 means the worst temperature this drive has recorded is 65 C.
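
A quick one-liner to double-check that math, assuming the standard smartctl -A column layout (VALUE in column 4, WORST in column 5, RAW_VALUE in column 10):

Code:
# Current temp is the raw value; the worst-ever temp is 100 minus the WORST column
smartctl -A /dev/da16 | awk '/Airflow_Temperature_Cel/ {print "current: " $10 " C, worst: " (100-$5) " C"}'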
 

Spearfoot · He of the long foot · Moderator · Joined May 13, 2015 · Messages: 2,478
Makes sense. I should have snapped to the fact that smartd doesn't show the raw values.

So perhaps we can eliminate heat as the immediate cause of your problems.

Still, 65C is pretty hot. And I saw that one of your HGST hard drives was running at 47C; also a little warm, though not out of spec.
 

Spearfoot · He of the long foot · Moderator · Joined May 13, 2015 · Messages: 2,478
What firmware are you running on the LSI 9201-16e?
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Code:
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2116_1(B1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2116_1(B1)   20.00.07.00    14.01.00.07    07.39.02.00     00:01:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
 

Spearfoot · He of the long foot · Moderator · Joined May 13, 2015 · Messages: 2,478
Perfect. Phase 20.00.07.00 is the 'golden' version for that card.

So we've established that you're experiencing this problem:
  • While running FreeNAS/TrueNAS both on-the-metal and virtualized on ESXi -- currently on-the-metal
  • While running any of FreeNAS 11.2, 11.3, and TrueNAS 12.0 -- currently FreeNAS 11.2
  • Temperature doesn't seem to be a factor, though the system does run a little warmer than the optimum
Is it still just the DS-4486 shelf that's dropping off? If so, have you tried swapping cables?
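
One way to confirm which shelf each da device actually lives in is to map the SES enclosures from the shell. A sketch (sesutil has shipped with FreeBSD since 11.x, so it should be present on your build):

Code:
# Map enclosure slots to device names so you can tell which shelf each disk is in
sesutil map

# And list everything the HBA currently sees
camcontrol devlist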
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
It would appear that way. I no longer have historical logs for these occurrences due to the migration.
When I check the FreeNAS UI during one of these events, I see all of the pools that reside in the DS-4486 as unavailable (red X).

I swapped the cable out during previous troubleshooting on my own, replacing it with a new SFF-8088 cable.

I cannot test that cable on the DS-2246, since that shelf uses the NetApp cable, which is not a straight SFF-8088.
DS-2246 cable: QSFP (SFF-8436) to MiniSAS (SFF-8088)
DS-4486 cable: MiniSAS (SFF-8088) male to MiniSAS 26 (SFF-8088) male

When I started with this deployment, I only had the DS-4486.
That shelf was stood up with the Dell Compellent HB module.

I had seen references online stating you cannot use the NetApp module to pass the disks from this shelf to an HBA.
When I obtained the DS-2246 I bought a second Dell Compellent HB module with the intent to install it on that shelf as well, in the hope of keeping everything similar (cables and vendors).
The problem was that with the Dell module I had purchased, the shelf fans spun up to 100% and never settled down.

I started to think that maybe people online are incorrect, or that they mean you cannot use the NetApp module to pass disks over to the HBA if you have NetApp-formatted disks (which I do not have).
I took a few test SSDs and SAS drives, placed them in the shelf, and connected the shelf, with the NetApp module, to the HBA after purchasing a QSFP (SFF-8436) to MiniSAS (SFF-8088) cable.
Bingo.
The drives appeared with no issue in FreeNAS; I formatted them, created some pools, and ran benchmarks without issue.

I then placed some existing 2.5" drives I had already configured in FreeNAS, and they were visible, their pools were green, and the data was accessible.

What does this all come down to?
Is the issue the two different disk shelf modules connected to the same HBA, and/or the Dell module connected to the DS-4486? (BTW, I also swapped that module out with the one I had intended to run in the DS-2246, and that worked fine.)

Should I:
Purchase another QSFP (SFF-8436) to MiniSAS (SFF-8088) cable
Bring everything down
Install the original NetApp module on the DS-4486
Attach the DS-4486 to the HBA with the new cable
Bring everything back up

I don't know much about storage, or at least not much from a real-world enterprise point of view. I would think that, so long as it's passing the disks along as JBOD to FreeNAS, it should not matter if there is a mismatch between modules/cables/drives.
I am not completely ruling it out, though.

Is there any credence to this?

Also, if I swap the module on the shelf, is there any potential for data loss? Should I empty the shelf before the swap, power it up with some test drives, and then slowly add back some existing test pools to ensure compatibility between the Dell module and the NetApp module? (I am thinking it should not matter, but I'd rather be on the safe side.)

I am just trying to eliminate the cause of this issue. I understand the rule of thumb that 90% of the time it's a physical problem (from experience).
 

Spearfoot · He of the long foot · Moderator · Joined May 13, 2015 · Messages: 2,478
Frankly, I'm not sure how I would proceed if I were in your shoes.

The problem does seem to follow the HBA + disk shelves, so that's where to look.

I would definitely pull the drives whenever you make major hardware changes. And you do have a backup, right? :smile:
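
Since ZFS identifies pool members by their on-disk labels rather than by shelf slot or controller port, the low-risk way to do a module swap is to export the pool cleanly first and import it afterwards. A rough sketch ('tank' is a placeholder for your actual pool name; on FreeNAS you would normally do the export/disconnect through the GUI so the middleware stays in sync):

Code:
# Before powering down to swap the shelf module
zpool export tank       # placeholder pool name -- cleanly exports the pool

# ...swap the IOM/cable, power the shelf back up...

zpool import            # scan for importable pools (device numbering may change; that's fine)
zpool import tank       # import by name, or by the numeric pool GUID shown above
zpool status tank       # confirm every vdev shows ONLINE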
 

wooley-x64 · Dabbler · Joined Dec 6, 2018 · Messages: 22
Status Update:
The issue was still presenting itself roughly 15-30 days out from a fresh restart.

The following change was made to isolate the cause:

The small disk shelf has been removed from the equation; now only the 3.5" DS-4486 disk shelf is attached.
I have moved the SSDs from the other disk shelf; they are now connected directly inside the NAS chassis.
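
As a quick sanity check after the re-cabling, I can confirm what the HBA sees and that the pools came back healthy (sketch):

Code:
camcontrol devlist    # every disk the mps0 HBA (and onboard ports) currently expose
zpool status -x       # prints 'all pools are healthy' unless something is degraded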

Will update if anything changes or if this resolves the issue.
 