Pool disappeared and disks no longer show up

zoull

Cadet
Joined
Jan 18, 2024
Messages
4
Motherboard: SuperMicro X9DRi-LN4F+
CPU: 2x Intel Xeon E5-2650 v2 @ 2.60GHz
RAM: ~196GB
Drives:
960GB Toshiba TR150 SSD
2x 256GB Samsung NVMe
12x 6TB HGST HUS726060AL5210 SAS HDD
SAS Controllers:
SAS2004 SAS9211-4i IT mode
SAS2008 SAS9211-8i IT mode
TrueNAS SCALE 23.10.1 hosted in Proxmox with passthrough of the SAS Controllers

Extra information: the 12 SAS drives were configured in a RaidZ3 with one of the 256GB Samsung NVMe as a cache drive

This system started with TrueNAS CORE and went through several updates, ending with 13.0-U6. Not too long ago I updated the system to TrueNAS SCALE, starting with 22 and updating to 23.01.1. I recently updated to 23.10.1 as well, but the system was working up until the other night. I was working on deploying some services to some other VMs in Proxmox and ran into an issue. I took the services down because I had implemented HTTPS redirection, and that broke an internal endpoint that couldn't be reached over HTTPS. I believed that taking the services down would remove the HTTPS redirection, but it did not. I have a VM on the same Proxmox node as the TrueNAS VM that supplies DHCP, DNS and a reverse proxy. So, I thought maybe something was going on with that VM and I tried restarting it. That didn't work, so I tried restarting the entire Proxmox node. That didn't work either, so I went a different route and tried creating a file to use in the meantime in place of the endpoint that I could no longer reach.

Now we get to the part where I noticed an issue with TrueNAS. I got things working with the file, but one of the tasks in the process I was working on is to mount shares from TrueNAS. When the shares tried to get mounted, I was getting errors along the lines of "the network location doesn't exist". I was very confused, because I haven't had issues with my TrueNAS shares in years, so I started investigating and ended up in TrueNAS, where it said that my pool was offline. I restarted TrueNAS, but that didn't help, so I started digging deeper. I restarted the entire server and noticed that it said the SAS configuration had changed and reconfiguration was recommended. I went into the configuration utility and, sure enough, one of the controllers didn't have the boot order specified. I set that and got back into TrueNAS: no difference.

At this point I really started digging into things, like making sure the passthrough was set up properly in Proxmox. Passthrough does seem to be set up properly. I read some stuff about these SAS controllers needing to be in IT mode, and then I realized that they already are and that TrueNAS can see the SAS controllers. I started looking into the disks and noticed that they show up in the BIOS and in the SAS configuration utility, but they do not show up in TrueNAS. I also tried looking into the pool further in the CLI and noticed that while I can see the pool in the Dashboard of the TrueNAS GUI, I cannot see it in the CLI: not in zpool status, not in zpool import, and it even appears that the files in /mnt do not exist.
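For anyone following along, the CLI checks amount to something like this; these are generic Linux/ZFS commands (run from System Settings > Shell), nothing specific to my setup, and the `command -v` guards just skip a tool if it isn't installed:

```shell
# Does the Linux kernel see the SAS disks at all?
if command -v lsblk >/dev/null; then
    lsblk -o NAME,SIZE,MODEL
fi

# What does ZFS itself think?
if command -v zpool >/dev/null; then
    zpool status    # pools currently imported on this system
    zpool import    # pools found on disk that are NOT imported yet
fi
```

If the disks are missing from lsblk as well, the problem is below ZFS (driver, controller, passthrough) rather than the pool itself.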

When I first started troubleshooting, I initially thought that maybe the drives had failed, although it seems crazy that all 12 SAS drives in the pool would have failed at the same time, and before I restarted the whole Proxmox node everything was working fine. I'm still going to check each drive individually, just in case, but considering that I can see them in the BIOS and in the SAS configuration utility, I'm thinking that it's not the drives having failed.
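For the per-drive checks, the plan is roughly this (needs smartmontools; /dev/sda is just a placeholder for each disk in turn):

```shell
# Placeholder device name -- substitute each /dev/sdX disk in turn.
# SAS drives report a simple "SMART Health Status: OK" line when healthy.
if command -v smartctl >/dev/null; then
    smartctl -x /dev/sda
fi
```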

I've searched and searched and found several things that seemed similar enough to my issue, but nothing has gotten me any closer to a solution. I've tried everything I can think of, but I'm at a loss at this point as to why I can see the drives in the BIOS and the SAS configuration utility, but TrueNAS just refuses to see them. Or why TrueNAS won't even see the pool in the CLI.

I have attached photos of anything that I thought could potentially be helpful.
 

Attachments

  • VideoCapture_20240118-185409.jpg (677.7 KB)
  • 20240118_185811.jpg (413 KB)
  • 20240118_185640.jpg (502 KB)
  • 20240118_185612.jpg (304.2 KB)
  • 20240118_185549.jpg (421.8 KB)
  • 20240118_185502.jpg (483.5 KB)
  • 20240118_185020.jpg (300.1 KB)
  • 20240118_185004.jpg (266.1 KB)
  • 20240118_184951.jpg (250.6 KB)
  • 20240118_184826.jpg (397.8 KB)
  • 20240118_184548.jpg (344.4 KB)
  • 20240118_184526.jpg (475.9 KB)
  • 20240118_184420.jpg (465.5 KB)
  • 20240118_183712.jpg (459.3 KB)
  • 20240118_183659.jpg (367 KB)
  • 20240118_183650.jpg (444.8 KB)
  • 20240118_183252.jpg (552.2 KB)
  • 20240119_122318.jpg (451.2 KB)
  • 20240119_122247.jpg (288.5 KB)

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
You mention that from the Unix shell, zpool import does not show the pool. That is very odd. Are you sure you are using the Unix/Linux shell to run the command?

The only advice I can give is to try without any virtualization. Something along these lines:
  • Make a new TrueNAS boot device and boot the server physically
  • Or see if the pool is visible from Proxmox
  • Or make a ZFS aware Linux boot device and boot to that
Without seeing the pool using zpool import, your options are very limited.

My knowledge of Proxmox is limited, as is running TrueNAS as a VM... perhaps someone else can offer up better suggestions.
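If the pool ever does show up in zpool import, a cautious first step is a read-only import, so nothing gets written to the pool while things are still suspect ("tank" is a placeholder pool name):

```shell
# "tank" is a placeholder -- use whatever pool name zpool import lists.
if command -v zpool >/dev/null; then
    zpool import                        # list importable pools first
    zpool import -o readonly=on tank    # import read-only; nothing is written
fi
```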
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Extra information: the 12 SAS drives were configured in a RaidZ3 with one of the 256GB Samsung NVMe as a cache drive


This system started with TrueNAS CORE and went through several updates, ending with 13.0-U6. Not too long ago I updated the system to TrueNAS SCALE, starting with 22 and updating to 23.01.1. I recently updated to 23.10.1 as well, but the system was working up until the other night. I was working on deploying some services to some other VMs in Proxmox and ran into an issue. I took the services down because I had implemented HTTPS redirection, and that broke an internal endpoint that couldn't be reached over HTTPS. I believed that taking the services down would remove the HTTPS redirection, but it did not. I have a VM on the same Proxmox node as the TrueNAS VM that supplies DHCP, DNS and a reverse proxy. So, I thought maybe something was going on with that VM and I tried restarting it. That didn't work, so I tried restarting the entire Proxmox node. That didn't work either, so I went a different route and tried creating a file to use in the meantime in place of the endpoint that I could no longer reach.
I hope your VMs are actually hosted on their own separate storage instead of on the TrueNAS storage, because that would cause a lot of issues with the VMs when TrueNAS stops functioning properly. I only host non-essential testing VMs on the TrueNAS storage. Essential services like DHCP, DNS, etc. are hosted on their own independent storage that doesn't have any dependency on a VM successfully booting.

At this point I really started digging into things, like making sure the passthrough was set up properly in Proxmox. Passthrough does seem to be set up properly. I read some stuff about these SAS controllers needing to be in IT mode, and then I realized that they already are and that TrueNAS can see the SAS controllers. I started looking into the disks and noticed that they show up in the BIOS and in the SAS configuration utility, but they do not show up in TrueNAS. I also tried looking into the pool further in the CLI and noticed that while I can see the pool in the Dashboard of the TrueNAS GUI, I cannot see it in the CLI: not in zpool status, not in zpool import, and it even appears that the files in /mnt do not exist.
What firmware are you running? I'm not entirely sure on this, but I've read that some earlier versions of the firmware may have issues.

When I first started troubleshooting, I initially thought that maybe the drives had failed, although it seems crazy that all 12 SAS drives in the pool would have failed at the same time, and before I restarted the whole Proxmox node everything was working fine. I'm still going to check each drive individually, just in case, but considering that I can see them in the BIOS and in the SAS configuration utility, I'm thinking that it's not the drives having failed.

I've searched and searched and found several things that seemed similar enough to my issue, but nothing has gotten me any closer to a solution. I've tried everything I can think of, but I'm at a loss at this point as to why I can see the drives in the BIOS and the SAS configuration utility, but TrueNAS just refuses to see them. Or why TrueNAS won't even see the pool in the CLI.

I have attached photos of anything that I thought could potentially be helpful.
When you boot the TrueNAS VM, does it load the OPROM of your SAS controller? Do you see it in the boot sequence? If you don't see it, it could be an indication that the passthrough isn't set up properly, if you didn't disable the OPROM.

I second @Arwen's suggestion to go bare metal so you can rule out Proxmox passthrough shenanigans.
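To check the passthrough from the Proxmox side, something along these lines on the host (VMID 100 is a placeholder for the TrueNAS VM):

```shell
# Run on the Proxmox HOST, not inside TrueNAS. VMID 100 is a placeholder.
if command -v qm >/dev/null; then
    qm config 100 | grep hostpci || true    # both HBAs should be listed here
fi
if command -v lspci >/dev/null; then
    # while the VM is running, the HBAs should be bound to vfio-pci on the host
    lspci -nnk | grep -i -A 3 'sas2\|lsi' || true
fi
```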
 

zoull

Cadet
Joined
Jan 18, 2024
Messages
4
You mention that from the Unix shell, zpool import does not show the pool. That is very odd. Are you sure you are using the Unix/Linux shell to run the command?
In TrueNAS, I went to System Settings > Shell and typed zpool import there. The second image shows the output of zpool import.

Yesterday I got a SAS to SATA adapter and plugged all of the drives into one of my Windows machines and they all showed up in Disk Management. Of course Windows asked to format the drives, but otherwise they showed up fine.

I hope your VMs are actually hosted on their own separate storage instead of on the TrueNAS storage, because that would cause a lot of issues with the VMs when TrueNAS stops functioning properly. I only host non-essential testing VMs on the TrueNAS storage. Essential services like DHCP, DNS, etc. are hosted on their own independent storage that doesn't have any dependency on a VM successfully booting.
Proxmox and the VMs are hosted on their own disk, and TrueNAS (and, back when it was CORE, its jails) is hosted on its own disk. In the sixth image, in Proxmox, on the left you can see a shared drive under local and local-lvm; TrueNAS is stored there and the rest of the VMs are stored in local-lvm. So they are indeed hosted completely separately from each other.

What firmware are you running? I'm not entirely sure on this, but I've read that some earlier versions of the firmware may have issues.
SAS2004 SAS9211-4i is on 20.00.07.00-IT
SAS2008 SAS9211-8i is on 20.00.06.00-IT
The 9th picture shows this.

I've read that 20.00.07.00-IT is the recommended firmware, but I think the thing to keep in mind is that this setup worked for years until the other day. Something that I might add here is that the CMOS battery might be dead, because when I unplugged the machine to take the pictures of the inside and turned it back on, the BIOS had reset its settings.
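For reference, the firmware revisions can also be read from a running OS with LSI's sas2flash utility, if it happens to be available (it isn't installed by default on TrueNAS):

```shell
# LSI/Broadcom sas2flash utility -- lists each SAS2 HBA it finds.
if command -v sas2flash >/dev/null; then
    sas2flash -listall    # firmware and BIOS revision for each controller
fi
```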

When you boot the TrueNAS VM, does it load the OPROM of your SAS controller? Do you see it in the boot sequence? If you don't see it, it could be an indication that the passthrough isn't set up properly, if you didn't disable the OPROM.
If I switch back to TrueNAS CORE, one of the controllers shows up in the boot sequence because it gets stuck trying to load the drivers or something. As for TrueNAS SCALE, I'm not entirely sure, because it all flies by too fast. BUT, in the third image I ran lspci in TrueNAS and you can see the controllers from within TrueNAS. So that leads me to believe that the passthrough is working properly.
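Worth noting: lspci listing the HBAs only proves that the PCI devices are exposed to the VM; it doesn't prove the driver actually attached to them. A sketch of the extra checks inside the TrueNAS SCALE VM:

```shell
# Inside the TrueNAS SCALE VM.
if command -v lspci >/dev/null; then
    # "Kernel driver in use:" should show mpt3sas for each HBA
    lspci -k | grep -i -A 3 'sas2\|lsi' || true
fi
if command -v dmesg >/dev/null; then
    dmesg | grep -i mpt3sas | tail -n 20    # driver probe errors show up here
fi
```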

Something that I plan on trying later today is to take a random drive that I have lying around, wipe it, stick it on the backplane and see if it shows up. If it does, I might try taking backups of all of the drives, just in case, and storing them in the cloud. And then wipe the drives and see if I can't just start the pool over. I don't really want to, but I'm starting to think that I might just have to take the L on this one.
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
If I switch back to TrueNAS CORE, one of the controllers shows up in the boot sequence because it gets stuck trying to load the drivers or something. As for TrueNAS SCALE, I'm not entirely sure, because it all flies by too fast. BUT, in the third image I ran lspci in TrueNAS and you can see the controllers from within TrueNAS. So that leads me to believe that the passthrough is working properly.
That's strange. Regardless of what OS you're booting, the OPROM should show up in the boot sequence, because those things load BEFORE your OS loads. The fact that it shows on CORE but not SCALE leads me to believe that the VMs have different passthrough setups (assuming they're different VMs).
 

zoull

Cadet
Joined
Jan 18, 2024
Messages
4
That's strange. Regardless of what OS you're booting, the OPROM should show up in the boot sequence, because those things load BEFORE your OS loads. The fact that it shows on CORE but not SCALE leads me to believe that the VMs have different passthrough setups (assuming they're different VMs).
They are not different VMs, just different boot options because of TrueNAS updates that have happened in that VM. And I didn't say that the OPROM doesn't show up in TrueNAS SCALE, just that I'm not sure, because I haven't caught it with how fast things fly by. But it does show up in the list of devices in TrueNAS SCALE. So, same passthrough, and the passthrough seems to be fine, since the controllers do show up in the list of devices in TrueNAS SCALE and TrueNAS CORE at least tries to load them.

Soon I'm going to try a different, clean drive to see if it shows up, in case something just happened during that restart. Other than that, the only thing I can think of is a BIOS setting, since I did see a message saying that the settings were reset after I unplugged the server to take pictures of it.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It's 2024. Do yourself a favor and use UEFI boot, disable all the CSM stuff, and use only UEFI OpROMs. Those dedicated real-mode setup applications and their stupid timeouts are just cancer, and any relevant functionality is available, integrated into the system firmware setup menu, via the UEFI OpROM.

With that rant out of the way...

Something that I might add here is that the CMOS battery might be dead, because when I unplugged the machine to take the pictures of the inside and turned it back on, the BIOS had reset its settings.
Definitely fix that first. Expect a lot of dodgy behavior until you do, anywhere from misconfigured memory controllers to mystery boot failures.
 

zoull

Cadet
Joined
Jan 18, 2024
Messages
4
It's 2024. Do yourself a favor and use UEFI boot, disable all the CSM stuff, and use only UEFI OpROMs. Those dedicated real-mode setup applications and their stupid timeouts are just cancer, and any relevant functionality is available, integrated into the system firmware setup menu, via the UEFI OpROM.
I believe that everything was using UEFI, but I also saw something saying that the SAS configuration utility and the firmware for the LSI HBAs I'm using will only load in BIOS mode. That seems to be true, because I tried setting UEFI-only in the BIOS and it didn't load the controllers, and they didn't load again until I switched back to legacy mode.

But an update: I found a drive lying around, so I pulled everything out of the server, installed the drive and installed TrueNAS onto it, effectively installing TrueNAS directly on the bare metal. It was able to see all of my drives, it was able to see that they were part of the pool, and when I ran zpool import it showed the pool available to be imported. So now I'm thinking that it is either a passthrough issue or an issue with the VM's BIOS, related to the BIOS stuff that I talked about earlier.
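Given that bare metal works, comparing the VM definition against what PCIe passthrough expects seems like the next step. On the Proxmox host (VMID 100 is a placeholder), something like:

```shell
# Run on the Proxmox host. VMID 100 is a placeholder for the TrueNAS VM.
# Things to compare: bios (seabios vs ovmf), machine type (q35 is commonly
# recommended for PCIe passthrough), and the hostpci entries themselves.
if command -v qm >/dev/null; then
    qm config 100 | grep -E 'bios|machine|hostpci' || true
fi
```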
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
But an update: I found a drive lying around, so I pulled everything out of the server, installed the drive and installed TrueNAS onto it, effectively installing TrueNAS directly on the bare metal. It was able to see all of my drives, it was able to see that they were part of the pool, and when I ran zpool import it showed the pool available to be imported. So now I'm thinking that it is either a passthrough issue or an issue with the VM's BIOS, related to the BIOS stuff that I talked about earlier.
Excellent. I agree that it may be related to a BIOS setting that was lost because of the dead battery.

By the way, I did review each and every one of your screen captures. The second is from zpool status, not zpool import, hence my request. One of the first steps in troubleshooting a lost ZFS pool is the output of zpool import. (This paragraph is mostly for others reading this thread while troubleshooting their own missing ZFS pool...)
 