Pool gone after reboot

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
Hello everyone. I have a Dell R720XD with a PERC H710 Mini flashed to IT mode and an LSI SAS9207-8e, also in IT mode. Both cards are running P20 firmware.
The LSI SAS9207-8e is connected to a NetApp DS4243 JBOD. I made a 36-drive pool and everything works fine until I reboot the R720XD, which is when the fun starts. Almost every single time, after a reboot, TrueNAS is unable to properly mount the drives in the JBOD. I get flooded with alerts that most of the drives are unhealthy and are therefore not mounted. To rectify the issue, I have to restart the server an additional 5-20 times until the pool miraculously mounts properly again, which is a real headache...

So far I have only seen one thread from someone else having the same issue, on Core: https://www.truenas.com/community/t...boot-shutdown-can-get-it-back-but-help.80304/

I have tried going back and forth between the Bluefin RC1 release and the stable release, to no avail. I have also reflashed both cards in case of a bad flash, but the issue persists. I appreciate any help or insight!


Update: I tested the disk shelf on a brand new server build (E-2356G Xeon, Asus P12-RE motherboard, 64GB ECC UDIMM, and SAS9217-4i4e) and the issue persists even on the new hardware. However, I took Core for a spin and the issue is not present there. I rebooted the server multiple times and the pool was still there every time, so this points to a software problem. I might cut my losses here and return the disk shelf in favor of a bigger chassis to directly attach all my drives, since Core unfortunately does not cover all my needs.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
This looks like a timing issue. Instead of rebooting, log in via SSH and run zpool import. If your pool is listed and says it can be imported, try importing it.

Also, please supply more hardware configuration.

Like, does the NetApp DS4243 JBOD use a SAS Expander?
I would guess so since you don't appear to have enough disk ports.
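
The `zpool import` check suggested above could be sketched roughly as follows. This is a minimal sketch, not a supported TrueNAS procedure: the function names and the pool name `tank` are placeholders that do not come from this thread. Run with no arguments, `zpool import` only scans and lists pools. On TrueNAS the web UI import is generally the supported path, so treat the CLI import as a diagnostic step.

```shell
#!/bin/sh
# Sketch only: "tank" below is a placeholder pool name, not taken from the thread.

# Print the names of pools that `zpool import` reports as available.
# With no arguments, `zpool import` only scans and lists; it changes nothing.
list_importable_pools() {
    zpool import 2>/dev/null | awk '$1 == "pool:" { print $2 }'
}

# Import the named pool only if the scan above found it.
import_if_listed() {
    pool="$1"
    if list_importable_pools | grep -qxF "$pool"; then
        zpool import "$pool"
    else
        echo "pool '$pool' is not listed as importable" >&2
        return 1
    fi
}

# Usage (run as root on the NAS):
#   list_importable_pools
#   import_if_listed tank
```

If the pool is listed but some vdevs show as unavailable, that would suggest the disks had not all appeared when the scan ran, which fits the timing theory.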
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
NetApp shelves are tricky. I have the same server, the same cards (H710 Mini, H810 external) and a DS4246. Make sure the cards have the correct firmware and are disabled in BIOS, to save on boot time:

BAA0DC10-EA79-4231-86ED-950AAA9F7037.png


If you need to reboot the server, first shut down the server, then shut down the DS. Yes, you need to shut down everything, in this specific order. Next, power up the DS, wait until the fans slow down, then power up the server. This should solve your issue.

Note: If the DS fans do not slow down after 30 seconds or so, you have a power configuration issue and you need to address it. In my case, I had to swap the position of the two power supplies.

Also, when you say a 36-disk pool, I presume you have a 12-disk raidz2 vdev on the R720XD and two additional raidz2 vdevs on the DS extending the pool, like I do.
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
Yes, it's equipped with SAS expanders. Thank you for the SSH tip.
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
Firmware checks out on both cards, and I have boot disabled on the external card but UEFI boot flashed on the PERC Mini so I can boot from my boot drives in the back of the server. Thanks a lot for the power sequence suggestion; it seems very promising. I will definitely give that a shot when able. As for my pool config, it's 6 vdevs of 6 drives each in raidz2.

04ab3e0cebe7a4a634b7a010fbf89df2272d4a57_2_512x550.jpeg
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I can boot from my boot drives in the back of the server.
You should definitely use these 2 SSDs for a dedicated software pool. As a boot drive I use an SSD connected to internal USB; you don’t need redundancy there, as it is super easy to replace. You do need speed for your ix-applications dataset, and you will see great performance gains once you run everything on dual SSDs.

Check My TrueNAS Scale Build for the details mentioned above; it should help you. There is quite a lot of info there, but it will be useful for you because we have the same setup.

If you don’t have data on your pool, you can safely go to a 12-disk raidz2 vdev format. Technically the max recommended number for raidz2 is 11 disks, but I’ve been doing this for years with zero issues. Only once did I have a second disk fail while another was resilvering. I buy disks in small quantities, from different dates/batches.

Are your disks CMR? Quite important.
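
To make the layout trade-off above concrete, here is a rough worked calculation for the same 36 disks. It is a sketch that counts only data (non-parity) disks and ignores padding, metadata, and free-space overhead; the function name is made up for illustration.

```shell
#!/bin/sh
# Number of data (non-parity) disks in a pool of identical raidz vdevs.
# Ignores padding, metadata, and free-space overhead.
#   $1 = number of vdevs, $2 = disks per vdev, $3 = parity disks per vdev
raidz_data_disks() {
    echo $(( $1 * ($2 - $3) ))
}

raidz_data_disks 6 6 2    # six 6-wide raidz2 vdevs: prints 24
raidz_data_disks 3 12 2   # three 12-wide raidz2 vdevs: prints 30
```

Both layouts tolerate two failed disks per vdev. The 12-wide layout trades six parity disks for extra capacity, at the cost of wider vdevs and fewer vdevs to spread I/O across.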
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
I have 2 mirrored NVMe drives on NVMe-to-PCIe adapters for my apps. As for whether my drives are CMR, that is a good question; I will need to check. All I know is that in the R720 I have a mix of IronWolf Pros, Exos, 1 WD Gold, and other Seagate SAS drives, and my disk shelf is all HGST SAS drives. Thanks for the build info. It is very helpful to bump into someone with a similar setup but more experience. With your 12-disk raidz2 vdevs, are you still able to saturate 10G?
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
So I tried this startup sequence a few times, exactly as you suggested, and unfortunately the issue persists... zpool import does not work either. If I power cycle the DS after TrueNAS has loaded, I also noticed that not all drives are picked up from the DS, so my only 'solution' is to keep power cycling the server until everything works again.
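
One thing that might be worth trying before another power cycle is asking the kernel to rescan the SCSI buses and then counting the visible disks. The sketch below assumes a Linux host (TrueNAS SCALE) and root privileges. The function names are made up for illustration, and a rescan may well not help if the HBA or expander itself is not reporting the drives.

```shell
#!/bin/sh
# Sketch only: assumes a Linux host (TrueNAS SCALE) and root privileges.

# Ask every SCSI host adapter to rescan its bus. Writing "- - -" to the
# sysfs scan file means "all channels, all targets, all LUNs".
# $1 is the sysfs directory; it is a parameter only so the function can be
# pointed at a scratch directory for a dry run.
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for host in "$base"/host*; do
        [ -e "$host/scan" ] || continue
        printf '%s\n' '- - -' > "$host/scan"
    done
}

# Count whole disks in `lsblk -dn -o NAME,TYPE` output read from stdin.
count_disks() {
    awk '$2 == "disk" { n++ } END { print n + 0 }'
}

# Usage:
#   rescan_scsi_hosts
#   lsblk -dn -o NAME,TYPE | count_disks
```

Comparing the count before and after the rescan against the number of disks you expect would show whether the missing drives are a detection problem or something ZFS-specific.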
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I know for a fact that the correct way to reboot an appliance like TrueNAS with a NetApp shelf attached is to perform the steps I mentioned above. Honestly, I have never tried rebooting just the server.

Can you keep only 6 disks (one vdev) in the DS and see if all disks are seen properly? Are you using interposers? In some cases, using them can actually be detrimental. From what I have read, interposers can mess with the diskid and SMART values.
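
A quick way to look for the kind of disk identity problems mentioned above is to list each disk with its serial number and flag anything odd. This is a rough sketch with made-up function names; it assumes `lsblk -dn -o NAME,SERIAL` output and serial numbers without embedded spaces.

```shell
#!/bin/sh
# Both functions expect `lsblk -dn -o NAME,SERIAL` output on stdin.

# Print the names of disks that report no serial number.
disks_without_serial() {
    awk 'NF == 1 { print $1 }'
}

# Print serial numbers that appear on more than one device node.
duplicate_serials() {
    awk 'NF >= 2 { seen[$2]++ } END { for (s in seen) if (seen[s] > 1) print s }'
}

# Usage:
#   lsblk -dn -o NAME,SERIAL | disks_without_serial
#   lsblk -dn -o NAME,SERIAL | duplicate_serials
```

An empty result from both does not rule out interposer trouble, but missing or repeated serials would be a strong hint.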
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
Will give it a shot when I'm back home. Also, just making sure: I am only supposed to have a single mini-SAS to QSFP cable to the shelf, right? It does not support splitting the bandwidth with a second cable?
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
Yes, your cable should be connected to the square-marked port, not the round one. A second cable can be used for redundancy but the speed will not increase; I don’t know about bandwidth splitting. I only use one cable.

You can use the round-marked port to connect another DS, into its square-marked port.

I forgot to ask, do you have another cable by any chance? I had to buy 3 cables before I found a good working one. All 3 had identical specs; the first two were from eBay and neither worked. I ended up buying this cable, and it was the only one that worked. With the eBay cables, the disks did not show up in TrueNAS at all, so I could not create a vdev.

I know it is stressful; I spent literally one month getting everything working properly.
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10

Got it, thanks. Stressful, yes, very, but many thanks for your responses and insight on this issue :smile: I do have 2 of these: https://amz.run/6Chs
 

Daisuke

Contributor
Joined
Jun 23, 2011
Messages
1,041
I was sure the shutdown sequence I mentioned would fix the issue. I don’t know what to say. Can you post the output of dmesg and cat /var/log/messages, so we can check whether the kernel sees the enclosure while the OS reports issues with the disks?

It is very simple: disks are either seen or not seen. It does not make sense to me that multiple reboots would eventually make the disks visible. Can anyone else share their thoughts? Also pinging @HoneyBadger.
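
Since the full logs are large, a filter along these lines may help narrow them down to the HBA and enclosure messages. It is only a sketch: it assumes the HBA is handled by the Linux `mpt3sas` driver (called `mpt2sas` on older kernels), and the function name is made up for illustration.

```shell
#!/bin/sh
# Keep only log lines that mention the SAS HBA driver or an enclosure.
# Reads from stdin so it works on both dmesg output and saved log files.
filter_sas_messages() {
    grep -iE 'mpt[23]sas|enclosure'
}

# Usage:
#   dmesg | filter_sas_messages
#   filter_sas_messages < /var/log/messages
```

Comparing the filtered output from a good boot and a bad boot would show whether the shelf is detected at all in the failing case.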
 

Wafflest1ck

Dabbler
Joined
Jan 12, 2022
Messages
10
Just got back. Here you go.
 

Attachments

  • dmesg.txt (259.4 KB)
  • cat.txt (8.8 MB)