hot swap failed disk impossible with default swap behavior (9.3)

Status
Not open for further replies.

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
Scenario:
Disk fails due to random fluctuations in the earth's magnetic field and the sunspot cycle. Pull disk, smash with hammer, and replace with another similarly spec'd device. FreeNAS will NOT acknowledge the insertion of /dev/da6 (for example).

TL;DR: swap is holding the device /dev/da6 open. As long as swap holds that device, the OS will not allow the device to be re-created. Only when swap is turned off or disabled will a failed disk replace as expected.
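
For anyone who wants to check this on their own box, here is a minimal sketch (not a FreeNAS tool) that looks through `swapinfo` output for swap devices sitting on one particular disk and prints a targeted `swapoff` for each. The sample text, the `da6p1` partition naming, and the function name `swap_devices_on_disk` are assumptions made up for illustration; your layout may differ.

```python
import re

# Illustrative text in the layout that FreeBSD's `swapinfo` prints.
# The device names and sizes are invented for this example.
SAMPLE_SWAPINFO = """\
Device          1K-blocks     Used    Avail Capacity
/dev/da5p1        2097152        0  2097152     0%
/dev/da6p1        2097152     1024  2096128     0%
/dev/da16p1       2097152        0  2097152     0%
Total             6291456     1024  6290432     0%
"""


def swap_devices_on_disk(swapinfo_text, disk):
    """Return the swap device paths that live on `disk` (for example "da6")."""
    # Require a non-digit (or end of string) after the disk name,
    # so that "da1" does not also match "da16".
    pattern = re.compile(r"^/dev/" + re.escape(disk) + r"(?!\d)")
    found = []
    for line in swapinfo_text.splitlines():
        fields = line.split()
        if fields and pattern.match(fields[0]):
            found.append(fields[0])
    return found


if __name__ == "__main__":
    for dev in swap_devices_on_disk(SAMPLE_SWAPINFO, "da6"):
        # swapoff accepts a specific device, so only this disk's swap is released.
        print("swapoff " + dev)
```

Turning off swap only on the disk about to be pulled leaves swap on the other disks in place, as opposed to `swapoff -a`, which disables all of it.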

Who thought it was a good idea to put swap partitions on data disks? Unless swap is disabled, there is no way to truly hot-swap failed disks.

As always, rebooting is for quitters.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
It was with 9.2.1.x, but I have successfully hot-swapped a failed disk following the manual's instructions. There may well be a bug, or it could just be a problem with your installation, but it's assuredly not universally the case that "unless swap is disabled, there is no way to truly hot-swap failed disks."
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
danb35 said:
It was with 9.2.1.x, but I have successfully hot-swapped a failed disk following the manual's instructions. There may well be a bug, or it could just be a problem with your installation, but it's assuredly not universally the case that "unless swap is disabled, there is no way to truly hot-swap failed disks."

When I was testing a vanilla install, the swapinfo command still showed the device as swap, even though swapoff -a was issued after the disk was pulled.

The swapinfo command even showed the active swap device name as non-English characters, meaning something got corrupted, and that particular swap instance could not be disabled or turned off.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, I'm betting it's not a swap issue. I'm betting your hardware and the drivers for your hardware aren't compatible with hotswap. We've had this discussion a few times, and hotswap doesn't work for everyone. In fact, for quite a few unfortunate souls it has caused problems with their zpool because they plugged in a new drive and the power fluctuation tripped off multiple other drives in the pool. Whoops!

Rebooting may be for quitters, but it's also for people that don't have evidence that hotswap is broken for their hardware.

For the record, if your hardware *does* support hotswap, depending on the hardware the new disk might come up as da6, or it might come up as a new device (say, da10). So assuming it must come up as da6 is a fool's errand and shows a lack of experience with hotswap. You'd also be surprised at how many RAID controllers do NOT support hotswap in FreeBSD. Gee... isn't this your thread? You are using a RAID controller. Color me shocked.
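
To see which name a replacement disk actually got, one option is to compare the disk list before and after insertion. The sketch below assumes you have captured the output of `sysctl -n kern.disks` (a space-separated list of disk names on FreeBSD) at both points; the helper names and the sample strings are hypothetical.

```python
def parse_disks(kern_disks_output):
    """Split a `sysctl -n kern.disks` style string into a set of disk names."""
    return set(kern_disks_output.split())


def new_disks(before, after):
    """Disks present after the swap that were not there before."""
    return sorted(parse_disks(after) - parse_disks(before))


def missing_disks(before, after):
    """Disks that were there before the swap and are gone afterwards."""
    return sorted(parse_disks(before) - parse_disks(after))


if __name__ == "__main__":
    # Invented sample values: da6 was pulled, the replacement came up as da10.
    before = "da0 da1 da5 da6 ada0"
    after = "da0 da1 da5 da10 ada0"
    print("new:", new_disks(before, after))
    print("missing:", missing_disks(before, after))
```

If the new list is empty after inserting a disk, the OS never created a device for it, which is a different situation from the disk simply showing up under another name.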

At some point you're going to realize that we've recommended against RAID controllers for probably a dozen reasons. But it's the reasons you probably don't know about that should make you want to run from RAID if you go with FreeNAS.
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
cyberjock said:
Yeah, I'm betting it's not a swap issue.

When I disable swap, hotswap is no longer an issue. How do you think swap is not in the spotlight here?

cyberjock said:
I'm betting your hardware and the drivers for your hardware aren't compatible with hotswap. We've had this discussion a few times, and hotswap doesn't work for everyone. In fact, for quite a few unfortunate souls it has caused problems with their zpool because they plugged in a new drive and the power fluctuation tripped off multiple other drives in the pool. Whoops!

This was one of the first things I tested on the box in question. The OS and HBA support drive removal and reinsertion flawlessly. The system only freaks out when a drive is pulled while a swap partition on it is in use.

cyberjock said:
Rebooting may be for quitters, but it's also for people that don't have evidence that hotswap is broken for their hardware.

Again, first thing tested, through and through.

cyberjock said:
For the record, if your hardware *does* support hotswap, depending on the hardware the new disk might come up as da6, or it might come up as a new device (say, da10). So assuming it must come up as da6 is a fool's errand and shows a lack of experience with hotswap. You'd also be surprised at how many RAID controllers do NOT support hotswap in FreeBSD. Gee... isn't this your thread? You are using a RAID controller. Color me shocked.

First, this thread is not about the 2950, or the PERC 5i or 6i. This thread is about the OS creating swap partitions on disks that will fail, and the swap system not being able to handle the removal of an active partition. Hotswap has been a trivial issue since SATA and SAS took market share. Hot swap of an IDE disk, okay, that may be an issue. Also, with "swapoff -a" issued, the system can handle the insertion and removal of /dev/da6 over and over and over. Create, destroy, repeat. I would not be posting this thread blaming swap if I was having /dev/da99 show up after 90+ cycles of drive insertion.

cyberjock said:
At some point you're going to realize that we've recommended against RAID controllers for probably a dozen reasons. But it's the reasons you probably don't know about that should make you want to run from RAID if you go with FreeNAS.

I did not mention a RAID controller here. The 2950 / PERC discussion was another thread. As it happens, the system I was testing this issue on was a completely different system: a simple HBA that presents 16 disks to the OS as /dev/da[0-15].
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Did you actually follow the manual and offline the drive before touching it?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, be fair--if you're trying to simulate a catastrophic disk failure, the disk won't be polite enough to offline itself before it dies.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
danb35 said:
Well, be fair--if you're trying to simulate a catastrophic disk failure, the disk won't be polite enough to offline itself before it dies.

Still, if it wasn't automatically offlined, it has to be offlined before the replacement is resilvered. Bad choice of words on my part.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I decided to try a bit of a test:
  • Created a new RAIDZ1 pool with three new disks
  • Started copying several hundred GB of random data to the new pool
  • After around 100 GB had been copied, pulled one of the disks (just flipped the lever and pulled it out of the chassis--didn't touch anything in the GUI first)
  • After about 30 seconds, the system noticed the missing disk, sent a warning email, and started flashing the red light in the web GUI
  • At that time, went to storage -> select volume -> volume status, and noticed that the disk I removed was marked as REMOVED. When I clicked on it, I didn't have a button to Offline, just Replace
  • Swapped a different new disk into the bay where the removed disk had been
  • Clicked the Replace button, selected the new disk from the drop-down, OK
  • The new disk was successfully added and began resilvering while the copy continued. No errors thrown on the web GUI or the console.
The hardware in this case is in my .sig. The new disks were attached to an LSI 9211-8, and this was done on 9.3-STABLE.
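
For those who prefer to confirm the REMOVED state from the shell rather than the GUI, the following is a rough sketch that scans `zpool status` style text for pool members in a given state. The sample text, pool name, and device labels are invented for illustration, and `members_in_state` is a hypothetical helper, not part of FreeNAS or ZFS.

```python
# Invented sample in the general layout of `zpool status` output.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

\tNAME            STATE     READ WRITE CKSUM
\ttank            DEGRADED     0     0     0
\t  raidz1-0      DEGRADED     0     0     0
\t    gptid/aaaa  ONLINE       0     0     0
\t    gptid/bbbb  REMOVED      0     0     0
\t    gptid/cccc  ONLINE       0     0     0

errors: No known data errors
"""


def members_in_state(status_text, state):
    """Return names from the config table whose STATE column equals `state`."""
    names = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            # The header row marks the start of the config table.
            in_config = True
            continue
        if in_config and not fields:
            # A blank line ends the config table.
            break
        if in_config and len(fields) >= 2 and fields[1] == state:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    print(members_in_state(SAMPLE_STATUS, "REMOVED"))
```

This only reports what the pool thinks of its members; it says nothing about whether the OS has created a device node for the replacement disk.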
 

kajer

Dabbler
Joined
Dec 22, 2014
Messages
10
I was able to complete the same steps up to the Replace button in the GUI. It was empty.

I'll post the hba model in a day or two... time for xmas down time and all
 