Boot stuck at "Beginning ZFS volume imports"

Status
Not open for further replies.

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
This morning I ran a SMART long test on an HDD with suspected issues.
The SMART test returned a read error at some sector, but that was it.

A few hours later the FreeNAS server rebooted by itself, and now I'm stuck at the screen shown in the screenshot below.
It has been there for almost an hour.
The system is an ASRock Rack E3C226D2I with 16GB ECC RAM, booting from USB.
The CPU is a Xeon E3-1246 V3.
There are two volumes: one is a RAIDZ2 on four WD Red HDDs (encrypted) and the other is a mirror of two SSDs (not encrypted).
I'm running 9.10.1-U4.

I cannot access it via shell (it is not pinging yet). The image below was taken from the IPMI KVM functionality on the motherboard, which is working fine. I don't have physical access to the box; it is a few hundred miles away from me.
Any advice?
I don't know if I should wait longer or do something else.


upload_2017-5-13_16-11-47.png
 

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
Update: After almost two hours, the boot completed and everything seemed fine.
I decided to reboot to see if it would happen again, but the box rebooted just fine, as quickly as usual.
I do have a syslog collector receiving logs from the box, but I don't see anything unusual, just the regular SNMP GET that I do on port 161 from a monitoring server and an automatic snapshot earlier today, as expected.
I'm still puzzled about this: why did FreeNAS take so long importing the volume?
What would be the proper strategy to troubleshoot this if it happens again?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
The smart test returned a read error at some sector, but that was it.
That isn't good at all. You should follow this up with the output of the SMART test results after completing an Extended SMART test. My signature below links to how to troubleshoot your hard drive. Remember to use CODE tags for the output so it retains the proper format which makes it so much easier to read.

After you have done this you should also run a scrub if you have no errors.
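For anyone following along, the sequence above can be sketched as a small Python helper that only builds the commands in order. This is a rough sketch, not a tested procedure: the device name /dev/ada0 and the pool name tank are placeholders I made up, and the helper name followup_commands is hypothetical. The smartctl and zpool options used are the standard ones (-t long to start the extended self-test, -a for the full report, zpool scrub and zpool status -v).

```python
def followup_commands(device, pool):
    """Return the SMART and scrub commands in the order suggested above."""
    return [
        ["smartctl", "-t", "long", device],  # start the extended (long) self-test
        ["smartctl", "-a", device],          # full report; read it after the test finishes
        ["zpool", "scrub", pool],            # only worthwhile if SMART shows no errors
        ["zpool", "status", "-v", pool],     # scrub progress and any errors found
    ]


if __name__ == "__main__":
    # Placeholder names: substitute your own disk device and pool.
    for argv in followup_commands("/dev/ada0", "tank"):
        print(" ".join(argv))
```

The helper deliberately does not run anything. The long test takes hours, so the report command only makes sense once the test has completed, and the scrub should wait until you have looked at that report.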
 

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
That isn't good at all. You should follow this up with the output of the SMART test results after completing an Extended SMART test. My signature below links to how to troubleshoot your hard drive. Remember to use CODE tags for the output so it retains the proper format which makes it so much easier to read.

After you have done this you should also run a scrub if you have no errors.
Agreed. I was going to dig deeper into this, but the sudden reboot caught me off guard, and then getting the box to boot was more pressing than finding out what's going on with the HDD.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
I could only speculate as to why your box rebooted in the first place, but it's just piss poor speculation. You can reasonably expect that you have a failing hard drive. These could be related, but who knows. After you fix your drive issue, if your computer reboots again out of the blue, run the basic tests (Memtest86 and a CPU stress test). You could have a power supply issue, or a RAM, CPU, or motherboard problem. You get the picture. The two tests will stress the system, and if you get a failure, then something is rotten.
 

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
I could only speculate as to why your box rebooted in the first place but it's just piss poor speculation. You can reasonably expect that you have a failing hard drive. These could be related but who knows. After you fix your drive issue, if your computer reboots again out of the blue, run the basic tests (Memtest86 and a CPU stress test). You could have a power supply issue, RAM, CPU, Motherboard. You get the picture. The two test will stress the system and if you have a failure, something is rotten then.
I know I have a failing HDD.
I know I'm supposed to stress test each box to make sure components are not faulty. (For the record, I have more than 10 of these, identical to one another, plus a number of other boxes with different hardware. All of them were tested individually before being put into production, and also periodically since, and this is the only box of that model series with this issue in over a year.)

While the box was stuck, I came here and asked a specific question: "Should I wait or should I do something else?"
After the box booted, I came back, provided an update, and asked two more questions (obviously the previous question was no longer valid, since the box had already booted).

I will enumerate my questions to see if I can get some precise answers:
Question 1 - Why did FreeNAS take so long importing the volume? (Obviously, since I did not provide logs or more visual clues, I'm not expecting anybody to give me an exact answer, but I was assuming some expert could point me to where to look for logs, debug output, etc., or maybe tell me that "we have no way to analyze post-mortem what happened on FreeNAS during boot, therefore you will never know why a certain process took long to load".) In that last case, just as an observation, even MS Windows has an Event Viewer that tells you the delay between services, the timeouts, and many times the root cause.
Question 2 - What would be the proper strategy to troubleshoot this if it happens again? (The process seemed to be stuck on boot, and maybe there is a way to see what is happening behind the scenes, to have an idea of what the box is doing. Maybe ZFS is rearranging corrupted data, maybe it is hitting more read errors and trying to deal with them, who knows.)

The answers provided were "you have a failed HDD" and "once you have replaced it, do some stress tests to make sure nothing is failing".
The first answer was obvious, since I can read my own SMART output. The second one, also obvious, is good advice, thanks, but not what I need now.

Again, I know I have a failing HDD, but the answers to my questions 1 and 2 are important in my case and in similar future cases.
Those answers might help the overall troubleshooting process if a box gets stuck somewhere in the future.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Question 1- Why FreeNAs took so long importing the volume? (Obviously, since I did not provide logs or more visual clues, Im not expecting anybody to give me an exact answer, but I was assuming some expert could point me to where to look for logs, debug, etc, or maybe tell me that "we have no way to analyze "post-mortem" what happened on FreeNAS during boot, therefore you will never know why certain process took long to load). In that last case, just as an observation, even MS Windows has an Event Viewer that tells you the delay between services, the timeouts, and many times the root cause.
I suspect it was due to the failing hard drive. Maybe the system was waiting for the drive to respond properly, or it was doing a lot of reads in troubled sectors. If you type "dmesg" you can look at the log data to see if there were issues. I'm not certain they would have popped up. Also look in "/var/log/" for other log files.
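If the logs are long, a small filter can help narrow down what to post. The sketch below is only an illustration: the keyword list is my own guess at what "questionable" might look like, not a list of actual FreeBSD or FreeNAS messages, and the function names are made up for the example.

```python
import re

# Hypothetical keyword list; adjust it to whatever your own logs actually contain.
SUSPECT_PATTERN = re.compile(r"error|timeout|retr(y|ies)|fail", re.IGNORECASE)


def suspicious_lines(lines, pattern=SUSPECT_PATTERN):
    """Return (line_number, text) pairs for every line that matches the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path, pattern=SUSPECT_PATTERN):
    """Scan one log file, tolerating odd bytes that sometimes end up in logs."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle, pattern)
```

You could point scan_file at a saved copy of the "dmesg" output or at files under "/var/log/". On FreeBSD the boot-time messages are normally also kept in /var/run/dmesg.boot, which is worth scanning too. A match only means the line is worth reading, not that it explains the slow import.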

Question 2- What would be the proper strategy to troubleshoot this if it happens again? (the process seemed to be stuck on boot, and maybe there is a way to see what is happening behind the scenes, to have an idea of what the box is doing,, maybe ZFS is rearranging corrupted data, maybe is having more read errors and trying to deal with it, who knows)
Again, I suspect this all has to do with the hard drive failure. The boot process was not stuck, it was just slow, very slow. If it were stuck it would never have finally booted. I already gave you good advice: if this happens again after you have fixed the drive failure, then you need to test your hardware again. Just because it passed all the testing at a previous time doesn't mean something couldn't have failed since then.

The first answer was obvious, I can read my own SMART output, and the second one, also obvious, is good advise, thanks, but not what I need now.
The advice is sound and I feel this is what you need to do. If you find some messages in the "dmesg" output that look questionable, please post them for analysis.

And your reply seems to be a bit angry, it certainly wasn't my intention to offend you.
 

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
I suspect it was due to the failing hard drive and maybe the system was waiting for the drive to respond properly or it was doing a lot of reads in troubled sectotrs. If you type "dmesg" you can look at the log data to see if there were issues. I'm not certain they would have popped up. Also look in "/var/log/" for other log files.


Again, I suspect this all has to do with the hard drive failure. The boot process was not stuck, it was just slow, very slow. If it were stuck it would never have finally booted. I already provided you good advice, if this happens again after you have fixed the drive failure then you need to test your hardware again. Just becasue it passed all the testing at a previous time doesn't mean something couldn't have failed since then.


The advice is sound and I feel this is what you need to do. If you find some messages in the "dmesg" output that look questionable, please post them for analysis.

And your reply seems to be a bit angry, it certainly wasn't my intention to offend you.
Now we are getting somewhere.
:smile:
Yes, I get a bit angry! Nothing personal, but I'm tired of asking "how much is 1+1?" and getting answers like "if you add two numbers maybe you get another number". Geez! The answer is "2", or "I don't know exactly, but go to this URL and buy this calculator, which is the one I use and is good".

Back to the FreeNAS box: yes, the HDD is bad for sure and will be replaced, and your advice on where to look for clues about the slow boot is good.
It certainly was not stuck; "slow" seems a better description. But when I opened the thread I didn't know whether it was one or the other, because there is no visibility into what is happening behind the scenes: no progress indicator, nothing.
Is there any way to access other "terminal" debug sessions during boot? Something like Ubuntu Server, where pressing F11 (if I'm not mistaken) takes you to another screen where you see other stuff happening. VMware ESXi has something similar too.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Is there any way to access other "terminal" debug sessions while on boot? sort of Ubuntu Server, that if you press F11 (if Im not mistaken) takes you to another screen where you see other stuff happening), VMware ESXI has also something similar.
Not that I'm aware of. The "dmesg" output contains the same data that is on the console screen during the boot process and thereafter.

I understand the frustration, but please understand that I assume nothing other than that the person I'm talking to doesn't have a clue what is going on. Don't take offence at it; it actually helps me ask the questions needed to make sure we are communicating properly. I generally ask people to explain a problem to me as if I were someone who had no clue what was going on. Making assumptions really hurts the troubleshooting process. At the same time, I treat people on the other end as if they know nothing (which is more true today with FreeNAS than it was 4 years ago), which is why you will often see me provide answers with numbered steps. I just have no idea how knowledgeable they are. Even if they tell me they know everything, odds are they don't if they are asking for help. I certainly don't know everything; heck, I know only the basics. There are a lot of advanced features that I don't use, so I can't answer questions on those topics.

Sorry, long winded. Hope you figure it out but please replace the hard drive after looking at those logs, I really think that is the initial problem at hand.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Also check the IPMI log for any indications of the cause of the reboot.

Also, long waits on pool imports are characteristic of an async delete that could not finish - these have to be finished before the pool can be imported and can take a while. Could be a bunch of large snapshots that were deleted in a fragmented pool.
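One way to keep an eye on this once the pool is imported is the pool's "freeing" property, which reports how much space is still waiting to be reclaimed after a destroy. Here is a minimal sketch, assuming your zpool version accepts the -H (no headers, tab-separated) and -p (exact numbers) options to "zpool get". The pool name and both function names are placeholders for the example.

```python
import subprocess


def parse_freeing(output):
    """Parse `zpool get -Hp freeing <pool>` output into {pool_name: bytes_left}."""
    pending = {}
    for line in output.splitlines():
        fields = line.split("\t")  # scripted mode: name, property, value, source
        if len(fields) >= 3 and fields[1] == "freeing" and fields[2].isdigit():
            pending[fields[0]] = int(fields[2])
    return pending


def read_freeing(pool):
    """Ask zpool how many bytes are still being freed (assumes -H and -p are supported)."""
    result = subprocess.run(
        ["zpool", "get", "-Hp", "freeing", pool],
        capture_output=True, text=True, check=True,
    )
    return parse_freeing(result.stdout)
```

A non-zero value that shrinks over time would suggest a background destroy is still being worked through. This only helps after the import has finished, so it will not show progress while the box is sitting at the import message.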
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
If there was a large delete happening that could prolong the import since it will wait until that delete is done. It could also be a disk is really slow and dying.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If there was a large delete happening that could prolong the import since it will wait until that delete is done. It could also be a disk is really slow and dying.
A combination of both actually sounds credible. A single marginal drive turned my 3-hour scrubs into 25-hour scrubs.
 

Dotty

Contributor
Joined
Dec 10, 2016
Messages
125
Also check the IPMI log for any indications of the cause of the reboot.

Also, long waits on pool imports are characteristic of an async delete that could not finish - these have to be finished before the pool can be imported and can take a while. Could be a bunch of large snapshots that were deleted in a fragmented pool.
I do have automatic snapshots, with 2-week retention in most cases, so there are some daily deletes. How do you track those async deletes? I would like to get to the bottom of this.
(I will check the IPMI logs, but on this board they are less than ideal. I don't know what ASRock Rack was thinking; these boards lack many of the basic functions a server board should have. For what I use them for they are fine, but I wish they were better.)
 

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
I too am experiencing this. I had received some warning emails from my NAS saying one of the drives in my mirror had SMART errors, but then when I went to check the status of the drives they looked completely fine. So I wasn't sure if there was really a problem. I've now had this running like this for, as of this moment, about 13hrs. I sure hope this thing recovers on its own before a whole lot longer. :\

Screen Shot 2018-01-16 at 12.43.58 PM.jpg
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are you on 11.1? If so, rollback.
 

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
Are you on 11.1? If so, rollback.

Oh wow, is 11.1 not so awesome? I can just do a fresh 11.0 install, as I'd rather get this OS volume mirror working anyway. Someone in a Facebook thread commented that I may need to change my BIOS settings (AHCI) to have a RAID install succeed. I believe the system was on 11.0-U4 when its USB Flash drive started to fail. I believe I did try to upgrade it to 11.1, but when I've seen boot volume failures on USB, seems that upgrades cease to work. So, there's a fine chance it was on 11.0-U4.

On 11.1, I actually tried installing to two brand new SATA SSDs in a mirror twice, and both times it died with this:
Screen Shot 2018-01-16 at 2.30.40 PM.jpg


It would just retry and then eventually comment that it's out of retries. I eventually just installed to a single SSD and it went perfectly.

So, circling back to my response question - is 11.1 not so awesome?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Someone in a Facebook thread commented that I may need to change my BIOS settings (AHCI) to have a RAID install succeed.
Sounds like they have no idea what they're talking about. You never want FakeRAID mode. It's not supported, much less for booting.

Choose both devices in the installer and that's it.

is 11.1 not so awesome?
Not awesome.
 

James Snell

Explorer
Joined
Jul 25, 2013
Messages
50
Sounds like they have no idea what they're talking about. You never want FakeRAID mode. It's not supported, much less for booting.

Choose both devices in the installer and that's it.

I've selected both devices in the past when installing to multiple USB drives and indeed that did work. I've since developed a hatred of running FreeNAS on any number of USB devices. Last night I made two mirrored OS install attempts on two brand new SATA-attached SSDs. Both failed and led to the screenshot I attached in my previous post, seemingly with AHCI related complaints. Installing to one of the SSDs worked perfectly fine.


---------------------------------------------
Dunning-Kruger Effect
Sounds like they have no idea what they're talking about.
Is it me, or are there quite a lot of FreeNAS community members that love to authoritatively dispense what they think are best practices, but are in fact more like "I did this once briefly, and since it mostly worked, it's safe for prod" type situations?


---------------------------------------------
11.1 vs not 11.1
Regarding the not so awesome status of 11.1, I'm hopeful then that doing a fresh install of 11.0-Ux (Where x >= 4) will get me back up and running. And hopefully it'll let me use dual SSDs along the way, that'd be excellent. I'll report back with how it goes, it'll be quite a few hours. Thanks for the help.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Is it me, or are there quite a lot of FreeNAS community members that love to authoritatively dispense what they think are best-practices, but are in fact more like "I did this once briefly. Since it mostly worked, so it's safe for prod" type situations?
The forums are pretty good when it comes to disseminating best practices. I can't speak for any Facebook groups, but the internet is full of disinformation.

I've selected both devices in the past when installing to multiple USB drives and indeed that did work. I've since developed a hatred of running FreeNAS on any number of USB devices. Last night I made two mirrored OS install attempts on two brand new SATA-attached SSDs. Both failed and led to the screenshot I attached in my previous post, seemingly with AHCI related complaints. Installing to one of the SSDs worked perfectly fine.
Most likely, the other SSD has some sort of fault or the interface had some problem. The error would not be caused by the fact that a ZFS mirror was involved.
 