SOLVED Drives disappearing as the system uptime increases

Halcy0n

Cadet
Joined
Apr 7, 2021
Messages
9
Hello TrueNAS community.
I am having difficulty getting my TrueNAS CORE install up and running. My hope is to use this box for NFS shares.
I am able to install to the boot drive without issue. When I run camcontrol devlist, it shows all of the drives connected to the backplane internal to the server, but only some of those connected through the two DS14MK2 disk shelves I have attached via a QLogic FC HBA. Which drives do and do not appear is completely random.
Further, as soon as the web GUI comes up you can see all 28 drives in the disk shelves, but the longer the system is up, the more drives disappear, until the GUI shows only one drive. The internal drives continue to show in camcontrol; those on the disk shelves do not.

The hardware I am using is all second hand, and I thought I had a bad motherboard, so I switched the filer from the DL180 G7 I had to a DL380 G6: same issue. Swapped the HBA, same issue. Swapped SFP modules, same issue. Swapped the fiber, same issue. Swapped drives, same issue.

When I do a camcontrol rescan all, *some* of the drives come back, but not all, and then they drop off again. I have also checked firmware, and it appears to be correct. I have tried formatting the drives, but as they do not show in camcontrol this is difficult. I did format the internal drives that show in camcontrol but not in the GUI; they still did not reappear in the GUI.
The only thing left I can think of is that some type of power control is parking the drives when they are not in use and causing them to drop off, but I can find no evidence of others having a similar issue on the forums. I am researching this route currently.

I have gone through numerous posts about "drives disappearing" with no resolution to my own problem.

Is there something obvious I am missing? If anyone has any advice or help to offer, it would be greatly appreciated.
Please let me know if there is any command output you would like me to post.
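For reference, these are the commands I have been poking at so far; the dmesg/log checks are just my guess at where a dropped drive would leave a trace.
Code:
camcontrol devlist                       # list everything CAM currently sees
camcontrol rescan all                    # brings *some* drives back, then they drop again
dmesg | tail -n 50                       # look for CAM/da errors around the time a drive vanishes
grep -i 'da[0-9]' /var/log/messages      # same thing, further back in time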

please see "TrueNas-CORE" in my signature for exact HW information.
 

Halcy0n

Cadet
Joined
Apr 7, 2021
Messages
9
Update: I forgot to include this in the opening post; the closest issue I can find to what I am experiencing is at the following post:

However, I am not familiar enough with FreeBSD or TrueNAS to troubleshoot this without some assistance, so I cannot confirm this is what is happening; I only suspect it based on the description of the way his system behaves.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
TrueNAS CORE's support for Fibre Channel hasn't been well tested. Only the Enterprise editions have Fibre Channel support.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Only the Enterprise editions have Fibre Channel support.

And that is as an FC target.

You have two issues that will break you.

The problem with using an FC disk shelf is that it has only a trickle of bandwidth between the host and the shelf. ZFS will pound I/O at full speed to all the disks in the pool during scrubs and resilvers. Each of the 14 drives in a shelf can probably pound out 100MBytes/sec, which is 800Mbits/sec, so 14 drives is roughly 11Gbps aggregate. But your FC is probably 2Gbps. This is going to be unhappy suckage.
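To put rough numbers on it, per shelf:
Code:
 14 drives x 100 MBytes/sec  =  1400 MBytes/sec
 1400 MBytes/sec x 8         = 11200 Mbits/sec   (~11Gbps of disk)
 2Gbps FC link               =  ~200 MBytes/sec usable after encoding overhead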

The other problem is that you have a QLogic controller. While it is an HBA, it isn't an LSI HBA, and the reliability of the QLogic FC HBAs isn't really that great. This article, especially point #4, applies to you:


We usually offer this in response to people with RAID controllers, but it is valid for non-LSI HBAs as well.
 

Halcy0n

Cadet
Joined
Apr 7, 2021
Messages
9
Thank you! I didn't think I would get an answer so quickly, or from two moderators at that! It is greatly appreciated.
I knew there were likely to be some performance issues due to the 2Gb link, but I didn't realize it could cause the drives to just disappear like that. Luckily I have an LSI SAS3801EL-S and an external SAS disk shelf I can use for testing. I was hoping to use the old NetApp shelves to get the full 28TB of raw storage, but the need for a stable environment far outweighs what I need for capacity (assuming all parts function; it's less work too).
I will post my results here in case someone else runs into this issue like I did and has trouble finding the answer.
How would I mark this closed, assuming my testing is successful?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't know that the 2Gbps link will cause drives to disappear. It may cause the system to be very poorly responsive, and in extreme cases that can cause I/O to fail, at which point drives can sometimes go offline or do other funky things. You'd be well advised to stress test with something like solnet-array-test-v2 and see just how bad it is.
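If you just want a crude way to put sustained read load on all the shelf disks at once while you watch the console and dmesg for drives dropping, a plain dd loop works (sh syntax; substitute whatever da devices your shelf enumerates as). It's not a substitute for the real test script, just a quick hammer:
Code:
# hammer the shelf disks with sequential reads in parallel, then watch
# the console / dmesg for devices falling off. Adjust the da list to
# match camcontrol devlist output on your system.
for d in da8 da9 da10 da11; do
    dd if=/dev/${d} of=/dev/null bs=1m &
done
wait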

I do know that the 2Gbps link will cause very sluggish performance under load.

Please note that the LSI 3Gbps SAS controllers have a 32-bit LBA limit that effectively caps them at 2.2TB drives (2^32 sectors x 512 bytes is roughly 2.2TB). If you have a plain-jane SAS disk shelf with a SAS expander, you may want to see if you can pick up an LSI 6Gbps HBA off eBay.
 

Halcy0n

Cadet
Joined
Apr 7, 2021
Messages
9
Thank you for that. In my case 2TB is the maximum drive capacity I have (I didn't realize that earlier). I definitely want to get a 6Gbps SAS card, or even a 12Gbps one, for my final setup, but that's a little ways off as I learn TrueNAS, ZFS and FreeBSD.
It looks like the SAS HBA and drive shelf (also 3Gbps, a repurposed EMC DAE from a VNX 5400) resolved my issue and the drives are now showing up. I do need to do a low-level format, though, to set them to 512 bytes per sector.
I am familiar with the sg_format command, but it doesn't look like CORE has it (I was previously using SCALE). Would the equivalent be the "camcontrol format" command?
Code:
root@truenas[~]#
root@truenas[~]# camcontrol devlist
<HP RAID 0 OK>                     at scbus0 target 0 lun 0 (pass0,da0)
<HP RAID 0 OK>                     at scbus0 target 1 lun 0 (pass1,da1)
<HP RAID 0 OK>                     at scbus0 target 2 lun 0 (pass2,da2)
<HP RAID 0 OK>                     at scbus0 target 3 lun 0 (pass3,da3)
<HP RAID 0 OK>                     at scbus0 target 4 lun 0 (pass4,da4)
<HP RAID 0 OK>                     at scbus0 target 5 lun 0 (pass5,da5)
<HP RAID 0 OK>                     at scbus0 target 6 lun 0 (pass6,da6)
<SEAGATE STE60005 CLAR600 ES0E>    at scbus2 target 1 lun 0 (pass8,da8)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 2 lun 0 (pass9,da9)
<SEAGATE ST2000NMCLAR2000 PS0C>    at scbus2 target 3 lun 0 (pass10,da10)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 4 lun 0 (pass11,da11)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 5 lun 0 (pass12,da12)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 6 lun 0 (pass13,da13)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 7 lun 0 (pass14,da14)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 8 lun 0 (pass15,da15)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 9 lun 0 (pass16,da16)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 10 lun 0 (pass17,da17)
<SEAGATE ST2000NMCLAR2000 PS11>    at scbus2 target 11 lun 0 (pass18,da18)
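From what I can tell from the camcontrol man page, the invocation would be something like the following; I have not run it against these drives yet, and I am not sure whether it can change the sector size to 512 the way sg_format --size does:
Code:
# destructive low-level format of one shelf drive; -y skips the confirmation prompt
camcontrol format da8 -y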
 

Halcy0n

Cadet
Joined
Apr 7, 2021
Messages
9
Well, it seems I was doing something incorrect, as I can now use sg_format. At this point I think it's safe to say the overall issue I was experiencing is resolved.
Thank you so much for the help!
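For anyone who finds this thread later: the sg_format invocation for putting these shelf drives back to 512-byte sectors is along these lines (it wipes the drive and can take hours per disk; the device node will differ on your system, and on FreeBSD it may need to be the pass device rather than the da device):
Code:
# reformat to 512-byte logical blocks; destroys everything on the drive
sg_format --format --size=512 /dev/da8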
 