zfs hanging on attempted drive replacement

Status
Not open for further replies.

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I haven't seen this one before...but it's a problem. I'm running 9.2.0-RELEASE-x64.

I've got a pool with 4 mirrored(x2) vdevs. I had a drive start to throw sector errors so I decided to replace it. After taking the drive offline via the gui, I'm now stuck. The following things cause zfs to hang (command doesn't complete, pool not available, but the OS is available - e.g. can SSH into the server):
- "replace" via the GUI hangs zfs
- "zpool replace" via command line hangs zfs
- "zpool attach" via command line hangs zfs
- "zpool detach" via command line hangs zfs

Something with the drive's identification went haywire. Here is the zpool status output:

  pool: pool0
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 6h18m with 0 errors on Thu Jan 2 13:18:09 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool0                                           DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/6b1a2268-cd59-11e2-a450-6805ca0df8de  ONLINE       0     0     0
            gptid/c0f72699-38fd-11e2-b89f-6805ca0dea40  ONLINE       0     0     0
          mirror-1                                      DEGRADED     0     0     0
            12540302857970932507                        OFFLINE      0     0     0  was /dev/gptid/82b5c85e-cd59-11e2-a450-6805ca0df8de
            gptid/a26d10ed-debc-11e2-82df-0015172951de  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/d50f089e-cd59-11e2-a450-6805ca0df8de  ONLINE       0     0     0
            gptid/0213231d-5064-11e2-8d3e-6805ca0dea40  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/071cd876-cd5a-11e2-a450-6805ca0df8de  ONLINE       0     0     0
            gptid/4f9bc54e-bce3-11e2-8a6d-20cf3032cfe3  ONLINE       0     0     0

errors: No known data errors

Note the "12540302857970932507" and the "was /dev/gptid/82b5c85e-cd59-11e2-a450-6805ca0df8de" in mirror-1.

I can't reference either on the command line to get a replacement (or detach) to happen. If I reference "gptid/a26d10ed-debc-11e2-82df-0015172951de" which should allow me to do an attach with another drive (to at least get that mirror back to 2 live drives) it doesn't work (hangs).
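
For anyone following along, here is a minimal sketch of pulling the numeric identifier of an offline member out of `zpool status` text, so it can be handed to `zpool replace` or `zpool detach`. The helper name `offline_guids` is made up for illustration, and the function only reads text on stdin; it never runs a zpool command itself.

```shell
#!/bin/sh
# Print the first column (device name or numeric GUID) of every line whose
# second column is OFFLINE. Input is `zpool status` text on stdin.
offline_guids() {
    awk '$2 == "OFFLINE" { print $1 }'
}

# Typical use on a live system (reads only, changes nothing):
#   zpool status pool0 | offline_guids
```

The value it prints is what would go in the old-device position of `zpool replace pool0 <old-device> <new-device>`, which is the command that hung in this case.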

The pool operates in a degraded state if I don't attempt any changes.

Ideas anyone?
 

jyb3

Dabbler
Joined
Jan 12, 2014
Messages
18
I wonder if it will even be re-reviewed since two days ago the status of 3925 was set to "3rd party to resolve"? Not sure how the review process works.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not likely. If you read that whole ticket Jordan Hubbard says something like "sounds like a hardware issue or at least something we can't fix". So I'd say unless there's evidence the problem is in FreeNAS' court then I wouldn't expect anything to change.

Noteworthy things to mention from that ticket, in my opinion, were:

1. Not using hardware that's even remotely recommended. Not only are consumer motherboard manufacturers like Asus not recommended, he went with AMD, which is well known to cause all sorts of problems that aren't fixable because AMD support for FreeBSD is a total unmitigated disaster.
2. On top of it, he mentions that he had used an L2ARC. Ok, whatever. The limit on that motherboard is a claimed 16GB of RAM. You shouldn't even be considering an L2ARC until you hit 64GB of RAM.
3. You should be using ECC RAM, which he may or may not have been using. More than likely not considering his choice of motherboard and CPU.

Because FreeNAS is making FreeBSD available to the masses on the "easy", FreeNAS has some problems. People know Windows and may even know Linux, but neither of them is "like" FreeBSD. Trying to make any comparison only shows your ignorance. The 3 biggest problems with FreeNAS development are:

1. The people that won't follow the recommendations for appropriate hardware and get upset when stuff doesn't work. Then *someone* has to figure out if this is hardware or software. If the software had a horrible bug there would be tons of people complaining. There aren't. So the *likely* cause is that it's poor hardware choices. Hardware that is good for Windows and Linux is NOT necessarily the same hardware for FreeBSD. At all.
2. People don't have a clue what they are doing with ZFS and do the "Windows Thing"(tm) and throw more hardware at whatever problem they perceive they are having. Guess what, the OP was clearly out of his mind putting an L2ARC on a system with anything < or = 32GB of RAM. PERIOD. As a casual observer that makes me wonder what other mistakes he's made that we aren't even aware of, or that he doesn't even realize he's made in the past that are rearing their ugly head. Because I know from experience you don't make 1 stupid mistake and realize it was a mistake. You'll make lots of them as you throw everything at the wall to see what sticks. Only need to spend 2 weeks on the forum to see the 20+ year Windows veterans do this regularly and don't even know that what they did was incredibly stupid or undoable.
3. People decided that the CLI was the way to go and ignored the GUI, despite the fact that the GUI is the proper way to handle many situations and problems. PERIOD. I do keep harping on people to make pools from the GUI and not from the CLI just to avoid scenarios just like we saw in IRC last night with someone. They thought the CLI was the right way, and he couldn't do disk replacements because FreeNAS' assumptions were broken.

So when I see that ticket written by someone that didn't use appropriate hardware, then solved the problem by doing more inappropriate things with hardware, then potentially made mistakes in software use I personally don't spend much time helping him. Note that I don't work for iX and so I can and do make that call when I read tickets and posts. In short.. do things right and I'll try to help you in all the ways I have available. I've gone to great lengths and spent hours on Teamviewer and Skype for free. On the flipside if you do things wrong I won't feel too bad for you. I won't be spending much time trying to help you, assuming I even acknowledge you have a problem that isn't a PEBKAC.

Now when someone shows up that used the appropriate hardware, followed the commonly reverberated recommendations for using FreeNAS, and didn't do stupid or crazy things with their setup, then I'll want to do a Teamviewer session and try to figure it out for myself. Until then, don't expect me to work too terribly hard on your problem that I didn't create and I don't want to troubleshoot or solve because you didn't want to do the appropriate homework.

As has been said many times on the forums.. "Do it right or do it twice". And you know, there must be a lot of masochists here because so many people do it twice. I really wonder sometimes how people in IT can say such stupid things here and yet I'm looking for a job in IT and I could probably do more than 1/2 of these people's jobs better than them. Granted, I have other circumstances that make getting a job in IT difficult. But holy smokes, the number of dumb people in IT is astounding! I've sometimes wondered if I've never ended up with a job in IT because the universe is trying to save my sanity.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
At this point, because I do not have recommended hardware, I can assume anything that doesn't work is normal in my environment.

Possibly. There's no certainty. Surely there is plenty of hardware out there that will work just fine with FreeNAS without being on our recommended list of hardware. But, if you have a problem you are probably going to be on your own to troubleshoot it.

The way I describe problems is that either you jump on the boat and do what everyone else is doing or you'll end up stuck on the island alone. If you are a professional swimmer(aka FreeBSD Professional) you might be very secure being stuck on the island. If you aren't, you should be getting on that damn boat! :p
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Possibly. There's no certainty. Surely there is plenty of hardware out there that will work just fine with FreeNAS without being on our recommended list of hardware. But, if you have a problem you are probably going to be on your own to troubleshoot it.

The way I describe problems is that either you jump on the boat and do what everyone else is doing or you'll end up stuck on the island alone. If you are a professional swimmer(aka FreeBSD Professional) you might be very secure being stuck on the island. If you aren't, you should be getting on that damn boat! :p

Cyberjock, I don't understand your approach. While I agree with the island analogy, it doesn't seem to apply in this case.

In post #7 above you're referring to ticket 3925 that is related to this thread. Yes, Jordan suggests a hardware problem. Perhaps it was (is). If not, I suspect it is more likely a ZFS, and therefore FreeBSD, issue. But then either way it's likely not going to be addressed by the FreeNAS team. On this we can agree.

Then you continue...

Noteworthy things to mention in my opinion from that ticket (what ticket?) were:

1. Not using hardware that's even remotely recommended. Not only are consumer motherboard manufacturers like Asus not recommended, he went with AMD, which is well known to cause all sorts of problems that aren't fixable because AMD support for FreeBSD is a total unmitigated disaster.
2. On top of it, he mentions that he had used an L2ARC. Ok, whatever. The limit on that motherboard is a claimed 16GB of RAM. You shouldn't even be considering an L2ARC until you hit 64GB of RAM.
3. You should be using ECC RAM, which he may or may not have been using. More than likely not considering his choice of motherboard and CPU.
#1, sure, it's probably HW. (Unlikely.)
#2, who said they were using an L2ARC? If you're referring to the ASUS motherboard, that's me. No L2ARC. So this is just wrong. Statement not helpful.
#3, me again. Yes, I'm using ECC. Again, not helpful. I submit guessing when you can politely ask is not helpful. Just *politely* ask. Most people that post are willing to help. (Unless you assume they are idiots and make dismissive comments, several of which are actually wrong.)

Next...

"People know Windows and may even know Linux, but neither of them is "like" FreeBSD. Trying to make any comparison only shows your ignorance."
True, but who made such a comparison? If no one, why even waste bits on making this statement?

And then...

1. The people that won't follow the recommendations for appropriate hardware and get upset when stuff doesn't work. Then *someone* has to figure out if this is hardware or software. If the software had a horrible bug there would be tons of people complaining. There aren't. So the *likely* cause is that it's poor hardware choices. Hardware that is good for Windows and Linux is NOT necessarily the same hardware for FreeBSD. At all.
2. People don't have a clue what they are doing with ZFS and do the "Windows Thing"(tm) and throw more hardware at whatever problem they perceive they are having. Guess what, the OP was clearly out of his mind putting an L2ARC on a system with anything < or = 32GB of RAM. PERIOD. As a casual observer that makes me wonder what other mistakes he's made that we aren't even aware of, or that he doesn't even realize he's made in the past that are rearing their ugly head. Because I know from experience you don't make 1 stupid mistake and realize it was a mistake. You'll make lots of them as you throw everything at the wall to see what sticks. Only need to spend 2 weeks on the forum to see the 20+ year Windows veterans do this regularly and don't even know that what they did was incredibly stupid or undoable.
3. People decided that the CLI was the way to go and ignored the GUI, despite the fact that the GUI is the proper way to handle many situations and problems. PERIOD. I do keep harping on people to make pools from the GUI and not from the CLI just to avoid scenarios just like we saw in IRC last night with someone. They thought the CLI was the right way, and he couldn't do disk replacements because FreeNAS' assumptions were broken.​
1. I'm sure people do get upset. But again, I don't think that applies to either this thread or the 3925 thread. Seems like another case of painting with a broad brush. Also, what does it have to do with this thread??
2. Where is this mysterious OP that was out of his mind in putting an L2ARC on a system? (Maybe you're referring to 3656. But then that would be something not referenced in this thread, and therefore not applicable.)
3. The "people" are not the ones on this thread as far as I can tell. At least in my case I can assure you the GUI is the primary method of interaction. Only when issues are seen is the CLI used for debug purposes.

And finally...

So when I see that ticket written by someone that didn't use appropriate hardware, then solved the problem by doing more inappropriate things with hardware, then potentially made mistakes in software use I personally don't spend much time helping him. Note that I don't work for iX and so I can and do make that call when I read tickets and posts. In short.. do things right and I'll try to help you in all the ways I have available. I've gone to great lengths and spent hours on Teamviewer and Skype for free. On the flipside if you do things wrong I won't feel too bad for you. I won't be spending much time trying to help you, assuming I even acknowledge you have a problem that isn't a PEBKAC.

Now when someone shows up that used the appropriate hardware, followed the commonly reverberated recommendations for using FreeNAS, and didn't do stupid or crazy things with their setup, then I'll want to do a Teamviewer session and try to figure it out for myself. Until then, don't expect me to work too terribly hard on your problem that I didn't create and I don't want to troubleshoot or solve because you didn't want to do the appropriate homework.

As has been said many times on the forums.. "Do it right or do it twice". And you know, there must be a lot of masochists here because so many people do it twice. I really wonder sometimes how people in IT can say such stupid things here and yet I'm looking for a job in IT and I could probably do more than 1/2 of these people's jobs better than them. Granted, I have other circumstances that make getting a job in IT difficult. But holy smokes, the number of dumb people in IT is astounding! I've sometimes wondered if I've never ended up with a job in IT because the universe is trying to save my sanity.​
You are certainly entitled to help or not help whomever you wish. But you're basically ranting here about something that doesn't apply to this thread. Look, in MY case, I posted something on the forum. No one helped. Ok, no problem. I solved it myself. Someone suggested I post a freenas ticket. Rather than create a new one, I posted my info on an existing ticket that seemed to match my problem. Maybe that info could help, maybe not.

I for one appreciate the work the FreeNAS people do. It's a great product. I try to help when I can. I'm not using recommended hardware, I'm using what I have. So in the end I don't expect help. And it's a free product after all.

What I will suggest is: (1) if you are going to help, try to be nice when doing so. Calling people idiots, even when they are idiots, doesn't help. I'm sure you see/read a lot of things that make you slap your head and conclude, "yet another idiot." (I do so myself.) But don't let it get the best of you... it will continue to be the case and drive you nuts if you let it. (2) Assume less, ask more.

BTW, reading 3656 I would place my bet that the issue discussed in this thread is in the ZFS code involving zvols.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I won't even bother quoting and explaining my side, because it's not worth my time.

But dismissing a hardware issue like you did because it's AMD is complete crap. AMD does make good products, but their products don't seem to work with FreeNAS/FreeBSD. Don't like that VERY obvious truth, don't buy it. We've seen everything from random reboots to unexplainable errors from AMD. Again, if you do ANY searching of the forum there's plenty of them. Don't like it, too darn bad. That's the reality of it.

And yes, a guy named Chris mentioned the L2ARC. I don't try to separate out by name most of the time, and I don't really care. But you have to sit down and figure out the reality from all of the BS. That's customer service. Search for the phrase "L2ARC" in that bug report and you'll see who had one. Not sure why you are even spewing crap like "who said they were using an L2ARC?" when you could search the page and have an answer instantly! He admitted in his post he used an L2ARC!

Most of my comments were generalities for why Jordan may have made the decision that it was a hardware problem. They didn't directly apply to you, that ticket, or this thread.

Anyway. I'm done here. I was going to write more, but it's really not worth my effort at this point. Good luck to you!
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
I won't even bother quoting and explaining my side, because it's not worth my time.

But dismissing a hardware issue like you did because it's AMD is complete crap. AMD does make good products, but their products don't seem to work with FreeNAS/FreeBSD. Don't like that VERY obvious truth, don't buy it. We've seen everything from random reboots to unexplainable errors from AMD. Again, if you do ANY searching of the forum there's plenty of them. Don't like it, too darn bad. That's the reality of it.

And yes, a guy named Chris mentioned the L2ARC. I don't try to separate out by name most of the time, and I don't really care. But you have to sit down and figure out the reality from all of the BS. That's customer service. Search for the phrase "L2ARC" in that bug report and you'll see who had one. Not sure why you are even spewing crap like "who said they were using an L2ARC?" when you could search the page and have an answer instantly! He admitted in his post he used an L2ARC!

Most of my comments were generalities for why Jordan may have made the decision that it was a hardware problem. They didn't directly apply to you, that ticket, or this thread.

Anyway. I'm done here. I was going to write more, but it's really not worth my effort at this point. Good luck to you!

Yikes! I'm not dismissing hardware, I'm just not dismissing zfs code. BTW, FreeNAS (and FreeBSD) support AMD, just not the post-Opteron hardware. So I placed a bet. I could be wrong.

Um, you plainly said the "OP" said something about an l2arc. You also said that OP was "out of his mind." That didn't happen in this thread, nor the bug 3925 which is referenced in this thread. So what thread are you referring to? You don't say. I can't search a thread you don't point me toward. If it's bug 3656 you were the OP. Sheesh. In that thread a guy named Chris referred to an l2arc, but that was a reference to you, not him. If there is some other, as of yet unmentioned thread you're reading, but not informing us of, let me know and I'll read it. I'm giving you actual references to threads, why are you not doing the same?

"Most of my comments were generalities for why Jordan may have made the decision that it was a hardware problem. They didn't directly apply to you, that ticket, or this thread."
Entirely my point! Why would you make general, non-specific statements on my thread that didn't apply to me, the ticket I appended, or this thread? Especially for someone that has stated they are short on time. It doesn't add up. Seriously.

If you don't have relevant information related to the thread you are posting on, please do not post. It's disruptive to people trying to actually follow the thread and help. I point you to forum rule #6 actually...

6. Flames, shouting (all-caps), sarcasm, bullying, profanity, defamatory, hateful, personal attacks, negative or arrogant posts/comments about others are prohibited.

You are not billed for the help provided here, but it costs a lot of time and effort to the volunteers providing it. Be polite and patient.

Thanks.
 

jyb3

Dabbler
Joined
Jan 12, 2014
Messages
18
Toadman, how did you resolve your problem? I am pretty much to the point of deleting and recreating my pool. But I hate giving up on it.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Yes, I destroyed my pool and restored from backup. I had 24 hours to make a call, and didn't have info to try and save the original pool. If I had the benefit of information gained over this past week I probably would have tried to copy out the contents of my zvols, remove them, then replace the failed drive.

I'm curious, do you have a zvol in your pool?

Note: In my restored pool I am no longer using zvols for iscsi. I am using file extents.
 

jyb3

Dabbler
Joined
Jan 12, 2014
Messages
18
I do not have zvol in my pool. My setup is very basic, and currently I have one pool that is shared up via NFS and CIFS. But this pool has been through many upgrades in hardware, upgrading ZFS (v15 to v28) as well as FreeNAS versions. I even just changed architectures from 32-bit to 64-bit when I moved to 9.2. I was under the assumption that my problem is in the history that this pool has been through. I have tried to replicate the hung ZPOOL command in VMware, but my virtual test machine works just fine.

I would imagine if I deleted this pool, and built it again, I could put all the data back on it, and then perform the same upgrade and it would work fine. But I am trying to avoid that. I have located a different brand of 1TB drive to try and REPLACE the offline drive in the event that it is somehow an issue with the other 1TB drive I am trying to swap out.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Yes, I have not been able to replicate the issue in a virtual environment either.

Have you tried to roll back to 9.1.1 on your physical system to see if the issue persists? Otherwise I would just restore from backup and move forward. I doubt this one will get root caused anytime soon.
 

jyb3

Dabbler
Joined
Jan 12, 2014
Messages
18
So I wrote out the FreeNAS-9.1.1-RELEASE-x64 (a752d35) image to another SD card, booted it up, imported the pool, imported my older 8.3.1 config (importing the 9.2 config failed with a bad database error) and simulated my OFFLINE and REPLACE.

Partial success because zpool did not hang, it actually spit back an error:

Code:
[root@freenas1] ~# zpool replace tank0 17462580476412354147 ada0p1
cannot replace 17462580476412354147 with ada0p1: devices have different sector alignment


Which is interesting because they are all 512 byte sectors (even the new drive).
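
One way to double-check what each drive reports on FreeBSD is `diskinfo -v`, which lists both the logical sector size and the stripe size (on 4K-emulating drives the stripe size is typically the physical sector size). A small sketch of filtering those two values, assuming the usual `value # label` layout of that output; `disk_sizes` is a made-up helper name and it only parses text on stdin:

```shell
#!/bin/sh
# Extract the sectorsize and stripesize lines from `diskinfo -v` text.
# Each matching input line looks like: "<value>  # <label>".
disk_sizes() {
    awk '$2 == "#" && ($3 == "sectorsize" || $3 == "stripesize") { print $3 "=" $1 }'
}

# Typical use on a live system:
#   diskinfo -v ada0 | disk_sizes
```

Whether a mismatch there is actually behind the "different sector alignment" message is not established in this thread; it is just something cheap to compare between the old and new drives.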

Trying to implement the fix in bug 3486 (https://bugs.freenas.org/issues/3486) by setting sysctl vfs.zfs.vdev.larger_ashift_minimal from 1 -> 0 did not fix my problem either. I still have the "devices have different sector alignment" error.

I have put the old 500GB drive back in and it is resilvering now. I had to keep larger_ashift_minimal at 0 for the old drive to be reintroduced to the pool.

I will try again tomorrow, just to make sure I cover all my steps. Then perhaps I will move on to the beta version, since the problem does appear to be tied to the particular version of FreeNAS I use.
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
Nice job!

How did you get the partition on ada0? Did you manually create it? Asking because you didn't use a gptid in the zpool replace command.

What ashift does your pool have? 9 or 12?
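
For reference, ashift is the base-2 logarithm of the sector size ZFS uses for a vdev, so 9 means 512 bytes and 12 means 4096 bytes. A minimal sketch of that arithmetic, plus a filter for the `ashift: N` lines that `zdb -C` prints; both helper names are made up for illustration, and nothing here touches the pool:

```shell
#!/bin/sh
# Convert an ashift value to a sector size in bytes (2^ashift).
ashift_to_bytes() {
    echo $((1 << $1))
}

# Print every ashift value found in zdb-style config text on stdin.
ashift_values() {
    awk '$1 == "ashift:" { print $2 }'
}

# Typical use on a live system:
#   zdb -C tank0 | ashift_values
```

On FreeNAS, `zdb` may need `-U` pointing at the pool cache file before it can see the pool; that detail is an assumption about the setup, not something confirmed in this thread.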
 

jyb3

Dabbler
Joined
Jan 12, 2014
Messages
18
Yes, I manually created it. ashift on my pool is 9. I am still 3 hours away from resilver being done, but I will try the all-GUI method next in 9.1.1, leaving vfs.zfs.vdev.larger_ashift_minimal=0 and destroying any partitioning I did on the new drive for my next test.

Either way it goes, I assume this disproves the "3rd party resolution" for bug 3925.
 