Recover a pool. No pressure.


mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Ok gurus,

I got a little too crazy with my pool, and I'd like to learn a little more about recovery.

Basically I was playing with adding and removing SLOGs. The last iteration I was testing used a RAM drive. I know, a little kooky, but not that uncommon among some of the ZFS wizards in other places.

Obviously my device is not persistent, but I'd never had trouble remounting a pool missing a slog until now. I've tried the more aggressive import commands, but with no change. I'll do some more research, but I'm sure I'm not the only one who would benefit from the recovery procedures, successful or not.

The box was rebooted gracefully. All drives are intact, and there is no corruption or missing data on the mirrored pairs. The slog is the missing device. I did try an import on a clean/reset FreeNAS as well. Haven't tried a live BSD CD or anything. This is a test pool, so no big deal.

I appreciate any wisdom given, hopefully it benefits many. Here is the result of my last attempt which shows the structure and error:

Code:
[root@freenas] ~# zpool import -fFX
   pool: Mir56
     id: 14502085242706162656
  state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-6X
config:

        Mir56                                           UNAVAIL  missing device
          mirror-0                                      ONLINE
            gptid/8680b45e-1e14-11e4-8513-000c29d3a5ee  ONLINE
            gptid/872ff4a0-1e14-11e4-8513-000c29d3a5ee  ONLINE
          mirror-1                                      ONLINE
            gptid/9c9cef83-1e19-11e4-8513-000c29d3a5ee  ONLINE
            gptid/9d719343-1e19-11e4-8513-000c29d3a5ee  ONLINE
          mirror-2                                      ONLINE
            gptid/c14c4884-1e20-11e4-9d5b-001e674d144d  ONLINE
            gptid/c22073df-1e20-11e4-9d5b-001e674d144d  ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.


No rush or pressure. I'll fight it for a while as a recovery exercise. :)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
For starters, this statement has me totally baffled...

Basically I was playing with adding and removing SLOGs. The last iteration I was testing used a RAM drive. I know, a little kooky, but not that uncommon among some of the ZFS wizards in other places.

Are you talking about a ramdisk created out of system RAM, or one of those devices that has a SATA interface but uses RAM for its storage medium? When I first read that statement I was thinking system RAM, and if that's the case I couldn't help but think that if ZFS wizards are saying that's not "uncommon", they are NOT ZFS wizards in the slightest. And that would be 1000% true.

For reference, when I mention memory devices in the remainder of this post I'm referring to devices you create from system RAM from the CLI.

Your SLOG is nothing more than what should be a persistent copy of the data sitting in RAM that still needs to be committed to the pool. I've seen quite a few people on this forum make memory devices from the CLI and then mirror their slog across an SSD and a memory device, calling that "extra fast and reliable" compared to just the SSD. The reality is that this is kind of silly, as it is neither extra fast nor extra reliable. All they are actually doing is taking what is in RAM, making another copy in RAM, and then another copy on the SSD. They might as well set sync=disabled and enjoy "extra fast and dangerous", because a loss of power or kernel panic would wipe out the only copy of the data that exists.
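For reference, the sync property that comparison hinges on is set per dataset. A minimal sketch, assuming a hypothetical dataset named tank/vmstore:

Code:
# check the current setting (standard is the default)
zfs get sync tank/vmstore

# the "extra fast and dangerous" comparison point: sync writes are
# acknowledged immediately and live only in RAM until the next txg commits
zfs set sync=disabled tank/vmstore

# put it back when you're done testing
zfs set sync=standard tank/vmstore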

Anyway, on to recovery....

If an slog is missing and you are looking to mount a pool, the proper way to do that is to add -m to the import command. In your case, something like:

# zpool import -m -R /mnt Mir56

The -f, -F, and -X flags are not going to solve the problem (at least, that's not the proper way to handle the situation).

Optionally you can also use the -n parameter to "simulate" the import without actually importing the pool. Using it with -m will say something like "X seconds of transactions will be lost" if I remember correctly. At that point you can either restore the slog (obviously not possible for the purposes of your experimenting) or you can remove the -n parameter and the pool will import without the slog.
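To put the whole procedure in one place, here's a sketch of the sequence, assuming the pool name Mir56 from the output above; the <guid-of-missing-log> placeholder has to come from your own zpool status output:

Code:
# import the pool even though the log device is gone, mounting under /mnt
zpool import -m -R /mnt Mir56

# the missing log vdev shows up as UNAVAIL/removed; note its numeric GUID
zpool status Mir56

# permanently drop the dead log vdev from the pool configuration
zpool remove Mir56 <guid-of-missing-log>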
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I read it as an md device, which of course wouldn't survive the reboot, so yeah, what you suggest is the generally correct strategy cyberjock.

To the original poster: no ZFS wizard who cared about pool integrity would do this. It is fine for experimentation or quantification but also generally pointless. You get similar but safer effects from setting sync=disabled.
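For completeness, the kind of throwaway experiment being described looks roughly like this on FreeBSD. A sketch only: the pool name testpool and the md unit number are made up, and the md device evaporates on reboot, which is exactly the failure mode this thread is about.

Code:
# create a 1GB memory-backed disk; it shows up as /dev/md8
mdconfig -a -t malloc -s 1024M -u 8

# attach it to a scratch pool as a log (slog) device
zpool add testpool log /dev/md8

# ...run benchmarks...

# tear it down cleanly BEFORE rebooting or detaching the md device
zpool remove testpool /dev/md8
mdconfig -d -u 8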
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
You read it exactly right, jgreco. This was an md device, and experimentation and quantification were my intent. I'd like to get my ESXi box playing as close to write-back mode as I can. I've seen the results of sync=disabled for NFS, but haven't seen the same result through iSCSI. I'd like to see the hardware limits, or try to emulate a poor man's Fusion-io card. COMSTAR can present a RAM disk as an iSCSI device. Nexenta can hit that target directly. We can add persistence and mirrors via additional nodes. I could see no reason why BSD/ZFS couldn't do the same. Even with all the unnecessary virtualization layers you can still beat a fast SSD slog in latency and throughput. Can't even imagine what a pooled RAM-based 10GB device might be able to do for IOPS.

The source was someone on the ZFS lecture circuit doing this for demos. Originally it was considered crazy, but then people looked at it in terms of what the REAL issues are. It was then taken further, with RAM-based extents being both scalable and persistent. I found a bunch of communities with some pretty serious minds looking into the idea.

So yep we are mapping RAM inside of RAM, cyber... but with intent. DRAM is cheap and plentiful in the small quantities needed for slog. A safe, fast way to use it would be awesome, imho.

Thanks for the heads up on the commands. I was hoping someone with more experience would have the right procedure at their fingertips. No luck yet.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Mostly the people who need that sort of thing have been either heading towards the Fusion-io stuff, or Stec ZeusRAM, or - if you're a cheap bastard like me - abusing a RAID controller's BBU write cache for production low-latency SLOG use. Cyberjock is mostly oriented towards people who are actually deploying stuff rather than just playing with it, so please understand he's got the best of intentions in trying to keep data safe.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So yep we are mapping RAM inside of RAM, cyber... but with intent. DRAM is cheap and plentiful in the small quantities needed for slog. A safe, fast way to use it would be awesome, imho.

Yeah, system RAM as it currently stands not only isn't safe but can't be made safe, since it's volatile.

Now there is talk about next-generation RAM that has fast SLC flash on the DIMM itself; on a loss of power it dumps its contents to the SLC. Not sure when/if it is coming to market, when/if it will be supported on FreeBSD, or when/if it will be something that is supported in ZFS.

I do have several Acard ANS-9010 devices. They use DDR2 RAM, provide a SATA interface, and "appear" to be a small hard drive (but a fast one). Each has a battery of its own, and on a loss of power it backs itself up to a CF card you buy and install. They worked very well in benchmarks.

I really have no clue how you can call something a "RAM-based extent that is persistent" when RAM itself is not persistent. Got a link?
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Yep. Those make me drool. But I only need an 8GB one for $500, not a 2TB one for $20,000. I'd gladly pay the price per GB for a Fusion-io or Zeus... just give me a small one. There are other NVRAM-style options coming online as well, but it's hard to say if we'll ever get the momentum necessary to make them mainstream for SMB.

Cyber, I was a little envious of your Acard devices the second I heard you had them. The RAM disks are persistent because they are mirrored across separate systems. So we have a local RAM disk. We have a mirror (or mirrors) for when we want to reboot. In addition, we could dump to non-volatile storage on demand if the system were designed with that in mind. The setup is point and click with a Windows Server using the free COMSTAR iSCSI software, and a Nexenta box. We can beat controller cache speeds and reliability (theoretically). I'll find you some links; I found the theory interesting, as I am a performance guy.

The problem I keep running into is that FreeNAS doesn't even come close to hitting the speeds of even a mediocre hardware-based disk system. It is a no-brainer to throw SSDs at an enterprise controller (BBU and cache) and crush what we see here in terms of speed and IOPS on modest hardware. 500MB/s drives are affordable. A stack of 20 of them is delicious. No dual E5 and 196GB of RAM necessary. This approach is one where we optimize for speed first... then we increase reliability to suit the storage problem at hand.

I do have stable configs. I'm paranoid as fsck with data. In fact I'm nowhere near trusting FreeNAS for clients; TrueNAS, maybe. But I do love to learn and test, and I enjoy a discussion beyond yet another "Can I run this on a Pentium with 1GB RAM? It seems slow."

Can't deal with vanilla noob issues all the time, sometimes ya gotta have a little fun ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Cyber, I was a little envious of your Acard devices the second I heard you had them. The RAM disks are persistent because they are mirrored across separate systems. So we have a local RAM disk. We have a mirror (or mirrors) for when we want to reboot. In addition, we could dump to non-volatile storage on demand if the system were designed with that in mind.

RAMdisks are persistent because they are mirrored across separate systems. What!? You've really really confused me now. It sounds like nonsensical talk to me, so at least one of us is very confused.

Mirroring an slog across a memory device and a physical drive buys you nothing. As I said above, the *whole* reason for going with physical (non-volatile) storage is the fact that on a loss of power it keeps what it has. There is *nothing* gained from mirroring an slog with system RAM. In fact, you could argue there is a performance penalty over simply using your slog drive (likely an SSD) by itself.

And saying that you have a mirror for when you want to reboot... that's all fine and dandy, but the slog is useless when you do a clean reboot. ZFS commits any slog data to the pool when the system shuts down, so there is no "risk" of losing data on a reboot (or shutdown). The real problems are a kernel panic, a sudden loss of power, and so on, and those can't be predicted well enough to commit everything in the memory device to the pool first. So I'm really thinking you are more confused than I am, or leaving out some very crucial information.

Can't deal with vanilla noob issues all the time, sometimes ya gotta have a little fun ;)

I totally agree. For those who are competent and intelligent enough to do the tests, understanding what does or doesn't work when you "deliberately do the wrong thing" can be an excellent learning experience (and fun if you are into that kind of thing... I am).
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
RAMdisks are persistent because they are mirrored across separate systems. What!? You've really really confused me now. It sounds like nonsensical talk to me, so at least one of us is very confused.

Think he's maybe talking about mirroring with another system, i.e. another head. This can be done with HAST, I believe. The problem here is that there's a terminology issue. "Persistent" typically encompasses non-volatile, but in the described case the data is merely redundant/replicated. It is not persistent. It therefore fails for the classical reasons.

Or, for the practical reason. There will come a day when you send off orders to your DC Dumb Hands and they manage to cause a cascading failure that takes down both nodes at the same time. That is the time when you absolutely must be able to guarantee that the data is actually somewhere persistent, not just somewhere that's persistent-in-your-creative-imagination.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sure, I'll buy that HAST adds a little more protection. But again, if your datacenter loses power, both heads will go down. At that point it doesn't matter how many mirrors you have, because you've failed the "non-volatile" requirement. If some kind of malicious network packet could somehow cause a kernel panic, that would be another nasty position to be in. Not that it's likely at all, but these theoretical cases seem to become reality too often, and then you're left holding the "f*ck you" card when your server boots back up. ;)
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
Exactly, jgreco. Didn't realize my language was confusing. The data persists via mirrors across multiple heads. That multi-node crash issue is a problem that gets discussed, but at some point you either trust your data center or you don't. A full-blown crash of the entire DC is gonna be ugly any way you slice it; two ZFS transactions in an slog may be the least of your worries ;). At that level, we can likely throw in the Fusion-io, NetApp, NVRAM, whatever. There is no reason why your heads couldn't live in separate datacenters. I have no idea how you'd get the latency to a level that makes that practical, but bandwidth gets more plentiful every day.

ZFS is supposed to survive removal of the slog, and it does. I'm sure I screwed something up in my implementation, or the read-only nature of the FreeNAS system screwed me. My BSD wizardry is lacking. I realize I'd be better off with a full OS and ZFS for hacking around... but I like it here.

That little ASRock server you mentioned in that iSCSI thread got me a little moist. So I grabbed one of these this morning. It came with a nice little 9650SE-24M8 as well; I will have to add the BBU myself. So I'll have new things to break.

[Attached image: supermicro.jpg]
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You never trust your data center. It *will* fail at some point.

What you're talking about isn't data persistence, it's merely data replication and redundancy. And replicating ZIL transactions over any sort of distance is going to add latency, an issue that plentiful bandwidth won't solve. I suppose you could do two different filers in two different facilities on the same campus, but overall that's getting off into the weeds. I expect some sort of NVDIMM solution is going to be more practical, because it actually provides the persistence quality that replicating the data is trying to implement in a hacky way.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
I'll play with your abuse-the-controller-cache scenario some. I question whether there will ever be enough demand to make an NVDIMM solution viable for SMBs in the near term. In addition, I'd like to make something fly in a lab now, safety be damned ;). Just give me a small PCIe card with slots for 2-4 ECC DIMMs and a dump to SD or CF on power loss. Or build it into the chipset, Intel, and let me use onboard RAM. Unfortunately the prior attempts at that, a la Acard and Gigabyte, failed miserably.

You get spoiled when everything is local and on fast SSDs or hardware pools. Everything SAN-based seems dog slow, even after you throw a decent chunk of change at it. Thankfully 10GbE is getting there on a per-card basis.

Some of the links that got me thinking:
http://www.c0t0d0s0.org/archives/4906-Separated-ZIL-on-ramdisk..html
http://www.ntpro.nl/blog/archives/1663-Hey-Drummonds-forget-SSD-RAM-is-the-future.html
http://forums.freenas.org/index.php?threads/ramdisk-as-zil.9424/
http://forums.freenas.org/index.php?threads/revisit-zil-on-ramdisk.12979/
http://www.nexentastor.org/boards/1/topics/8160
https://blogs.oracle.com/relling/entry/a_short_ramdisk_zfs_anecdote
https://blogs.oracle.com/ds/entry/using_zfs_to_your_advantage

I enjoy this stuff. Including wrecking and fixing it in pursuit of something better. So thanks for the comments... hopefully the community as a whole benefits.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
So I read those links. The whole "memory device on another machine over iSCSI" idea seems only marginally better than doing sync=disabled (which really is no different from doing an slog with /dev/null as your storage device).

It's a fun (and rather bizarre) experiment to run, but something that really shouldn't be considered in any scenario where data is actually important. To be blunt, it sounds like something *someone* who was clueless about the slog did somewhere, and other people copied it because they had lots of free time and "just wanted to see what would happen". To be honest, I'd get more value out of experimenting with sync=disabled than out of putting an slog on another machine's memory device.

This is like saying "I wonder how a vehicle would drive with square wheels!"... and yes, I know the Mythbusters did try that.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
The benefit I see is with the COMSTAR solution. You can effectively force write-back mode for ESXi with just a checkbox. All of a sudden, cripple mode with ZFS and ESXi, and the need for 128GB of RAM while praying you hit the ARC, goes away. :) In a lab. It just seems silly to me to have an extra layer of Windows/abstraction/(possible virtualization) in the way.

NFS with sync=disabled seems to be the fastest, scariest ride. Maybe you've seen wicked iSCSI results and I only see the failures on here? But even tweaked high-RAM E5s with 10GbE seem pretty lackluster. Poor drivers, weak iSCSI initiators... just generally poor performance. I want a success story I can duplicate; where is it?

Guess I got a pool to fix, and some hardware to play with. Have a wonderful evening all.
 

mjws00

Guru
Joined
Jul 25, 2014
Messages
798
So I did manage to restore the pool.

Initially zpool status pretty much just puked and threw "no pools available", and an import attempt showed state: UNAVAIL, one or more devices missing.
*I really should have tried to find a way to get a list of the missing devices... but didn't.
So I figured I'd give recreating the device a shot: 'mdconfig -a -t malloc -s 1024M -u 8'.
Now 'zpool import Mir56' threw: "The devices below are missing, use -m ...".
'zpool import -m Mir56' got me a pool.
'zpool remove Mir56 /dev/md8' cleaned it up.
'zpool export' got me exported for a GUI re-import (I had wiped my FreeNAS config).

Clear sailing. Thanks for the tips folks.
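For anyone who lands on this thread from a search, the same sequence collected in one place (the pool name and md unit are specific to my setup; the mdconfig teardown at the end is optional):

Code:
# recreate a stand-in for the vanished md-backed slog
mdconfig -a -t malloc -s 1024M -u 8

# with -m added, the import now succeeds despite the missing log
zpool import -m Mir56

# drop the log vdev from the pool, then detach the memory disk
zpool remove Mir56 /dev/md8
mdconfig -d -u 8

# export so the GUI can re-import the pool
zpool export Mir56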
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
NFS with sync=disabled seems to be the fastest, scariest ride. Maybe you've seen wicked iSCSI results and I only see the failures on here? But even tweaked high-RAM E5s with 10GbE seem pretty lackluster. Poor drivers, weak iSCSI initiators... just generally poor performance. I want a success story I can duplicate; where is it?

Success stories can't be "easily duplicated", mostly because as time goes on the writes to the pool won't be the same, the size and disk performance won't match, the VMs themselves won't run the same kind of workload, etc. What might start out "looking" like an equivalent won't end up being equivalent. This is why I just shake my head and ignore threads where people start demanding a "working case" to go off of. Not to mention that people typically build a ZFS system with 4 or 5 mirrors and then try to run 10+ VMs. Normally you'd plan on roughly one VM per disk's worth of IOPS, and 5 mirrored vdevs give you about 5 disks' worth, which means about 5 VMs. But you want to run 10+ VMs, so you MUST pick up efficiency, at a large scale, somewhere.

You're almost always going to bottleneck at the IOPS level. That's the biggest problem with VMs. The *only* way to overcome that on a massive scale is ARC, L2ARC and a ZIL. Yes, adding more mirrored vdevs helps, but when each vdev you add is only worth about 100 IOPS, that's not very much. Adding ARC, L2ARC and a ZIL can take a pool doing just 600-800 IOPS and push it to 5k+. If I told you to make a 100-drive pool with 50 mirrors OR go with 96GB of RAM, a 512GB L2ARC and a ZIL, I think I know which one you'd choose.
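For what it's worth, the L2ARC and ZIL halves of that are one-liners; ARC is just more RAM. A sketch with hypothetical pool and device names:

Code:
# add an SSD as L2ARC (read cache)
zpool add tank cache da9

# add a power-loss-protected SSD as a dedicated log (slog) device
zpool add tank log da10

# or a mirrored slog, if you have two suitable devices
zpool add tank log mirror da10 da11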

At the end of the day you have physical limitations to overcome, and by far the most cost-effective way to handle that is ARC, L2ARC and ZIL. PERIOD. That's why we don't even talk about doing 100-disk pools; a 100-disk pool isn't even a close second. The cost of the chassis, controllers, cabling, and electricity would quickly make that a very cost-ineffective solution. Just 100 disks at $50 each (which I think is pretty low) means $5000... just in disks. I guarantee you could build a whole new server with the required RAM, L2ARC and ZIL for less. I know, I've helped people do it!

The real problem is that you don't have options. Everyone seems to think there are secret tricks to get more power out of their zpool. There are no secrets. The answers are plain as day; people just don't want to swallow the pill and do what is supposed to be done. Frankly, if you (not you specifically, mjws00) aren't happy with the pill, feel free to go back to hardware RAID and a "regular" file server with 8GB of RAM.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Updating this thread: I lost a ZIL SSD. The usual stuff wasn't fixing the problem and I couldn't import my pool. These switches did the trick and imported the pool without the ZIL (slog) being available:

zpool import -m -F -f -R /mnt tank1
 