n00b in need of help, FreeNAS won't boot!

Status
Not open for further replies.

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Hello All,

Yesterday, I restarted my FreeNAS box (running 8.0.4 RELEASE) via SSH (shutdown -r now) as I've done many times before. However, this time, the box did not come back online. I went downstairs to the box, fired up the monitor and noticed the following...

First, the boot stalled at:

"GEOM: da0s1: geometry does not match label (16h,63s !=255h,63s)."

photo.jpg

Roughly 10 minutes later, the boot continued and then stopped at:

"Mounting local file systems."

and here it stops, nothing more...

What I've tried so far, without success.

- Thinking that something may have flipped in my BIOS, I restored Optimized Defaults. No success.
- I loaded 8.0.4 onto a spare USB stick and rebooted. This time the boot still stalled at the "GEOM: da0s1..." message but then went on to mount. Once booted, I logged into the GUI and restored my config from the previous system. The system rebooted and again stalled at "Mounting local file systems."

I'm by no means an expert; I know enough to get it done, somewhat. Any assistance on troubleshooting this issue would be greatly appreciated.

Specs:
12TB RAIDZ2 (ZFS) RAID6
8GB RAM
FreeNAS 8.0.4 Release

Thanks in advance,
sully
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The following would be helpful:

Is your da0 a hard drive or the flash drive? (a view of the screenshot higher up would be helpful) The GEOM warning may be harmless in any case.

What happens if you try a verbose boot? You are likely to get better details, or a more precise "sticking-point".
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Hi jgreco

Unfortunately, I'm not sure, the boot lines quickly fly-by so I wasn't able to see what was what. I will try a verbose boot tonight.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Right. Once they've flown by, hit Scroll Lock (that's the key on the FreeBSD console, not Print Screen), then use Page Up to scroll back and look, if it's off the screen.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Thank you Sir, I'll give it a whirl tonight. Thanks again for your help so far.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Ok, I've taken some additional photos...

jgreco - Is your da0 a hard drive or the flash drive?

It appears that da0 is something else, as all the drives are listed (at least I think) as ada0-5, 6 drives total. Maybe it's the flash drive? However, I don't think da0s1 is the issue; the real problem looks like the following error.

photo copy.JPG


I wasn't able to see this error before, as I wasn't booting with verbose logging. This error continues for 10 minutes or so.

photo copy 2.JPG


I'm assuming something is wrong with ada5; however, I'm not sure how to physically identify this drive. Secondly, why would FreeNAS refuse to boot if one of the drives failed? Or am I missing something, which is totally possible?

And when I went to reboot, I got this error...

photo copy 3.JPG


Something wrong with core 3 of the CPU??

Thanks in advance,
sully
 

andoy31

Explorer
Joined
Apr 29, 2012
Messages
65
Normally fatal trap 12 occurs when there's some problem with the memory; try looking into that. Check whether you've modified the standard memory timings in the BIOS or whether the memory is faulty.

For ada5, try checking all physical connections first. To check, I think you can try camcontrol devlist (but this only works after you've booted into FreeNAS).

da0s1 is normal, as this is the partition on your USB stick.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Don't bother with the memory. I haven't actually seen this before, but intuition strongly suggests that "cambio" is probably something like the CAM block I/O subsystem. It's extremely unlikely that your system is having disk problems AND then also experienced memory failures while in the kernel CAM subsystem. It's more likely that a bug in the CAM layers is being exposed by disk hardware failures, so chasing memory faults is probably going to be a red herring.

Find and fix the disk problem first.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Thank you andoy31

I checked all physical connections and they appear to be fine. I loaded a fresh FreeNAS image (w/o configurations) onto another USB stick and restarted. This time, after about 10 minutes, I got to the FreeNAS screen, where I was able to access the Shell.

I ran the "camcontrol devlist" and everything appeared to be correct...

photo copy 3.JPG


Not convinced, I rescanned each bus...

photo copy.JPG


2 out of 6 came back without an error; the other 4 came back with errors (even though they report success, which is unintuitive). Putting the pieces together, I have 2 SATA controllers, a set of 2 ports and a set of 4. I've circled the set of 4 in RED.

photo copy 2.JPG


So I'm wondering if that controller is shot? Again, this issue appeared out of nowhere after a restart, nothing was changed in the BIOS.
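
For anyone repeating the listing and per-bus rescan described above, a minimal sketch follows. It assumes `camcontrol devlist` reports each device with an `scbusN` bus name, and the helper names `bus_numbers` and `rescan_all_buses` are made up for this illustration; `camcontrol rescan all` is the one-line equivalent.

```shell
#!/bin/sh
# Extract the unique CAM bus numbers (the N in "scbusN") from
# `camcontrol devlist` output supplied on stdin.
bus_numbers() {
    grep -oE 'scbus[0-9]+' | sed 's/scbus//' | sort -n -u
}

# Rescan every bus that currently has a device attached.
# Meant to be called on the FreeNAS box itself, as root.
rescan_all_buses() {
    for bus in $(camcontrol devlist | bus_numbers); do
        echo "rescanning scbus${bus}"
        camcontrol rescan "${bus}"
    done
}
```

Rescanning one bus at a time makes it easier to see which controller the failing ports belong to.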

Thanks again for the help. If either of you is in the Boston area, I owe you beers.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could be something with the controller. However, I suggest you might also want to double-check your power, just to make sure you've not incorrectly identified the problem. Disconnect all the drive power leads. Take the two leads from drives that "work" and hook them to two drives that don't. Check in FreeNAS. I'm guessing this is unlikely, but better to be certain than to chase a phantom problem.

You can then, of course, do a similar test with the SATA data connections.

If that all points to the board as a problem, go in and drill down into all possibly related BIOS settings. Also, is that a video card? If so, is it needed? If it is, try pulling and reseating it. If it's not needed, and the board has onboard video, suggest using it.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Thank you jgreco. I will try your suggestions tonight. Unfortunately, the mobo (Gigabyte GA-MA790FX-DS5) does not have onboard video, hence the card. Last night I downgraded the BIOS, thinking the problem might have something to do with the latest firmware, which GIGABYTE has listed as beta (though it's been beta for 2 yrs), even though the system had been running fine on it for months. The downgrade didn't change anything; the same errors persisted.

I did order new SATA cables, in hopes of having a consistent set of cables instead of the mix & match I have now.

Thanks again for your help and I'll report back.
 

andoy31

Explorer
Joined
Apr 29, 2012
Messages
65
Just checking: are you using any 4-pin Molex to SATA power adapters? It might be good to look into those as well.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
He's not likely to have four of those failing simultaneously. However, if he's got some of those on a "christmas tree" of Molex power Y's, yeah, definitely check for faults. I'm also less-than-optimistic that replacing SATA cables will help - one of the big plusses of SATA is the independence of each SATA bus. Those of us who spent years swearing at dozen-drive arrays on SCSI are still haunted by the nightmares, hahaha. But all of that should be isolated through some swapping experiments, which I hope I already suggested.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
As I mentioned previously, I've replaced all of my SATA cables. I didn't think they would solve the problem, but they gave me a bit more confidence in the integrity of each line. That said, I went ahead and followed jgreco's suggestion and powered on each drive one by one. All of them booted without error except one. I narrowed down the specific drive using camcontrol identify and located it by its serial number.
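
For anyone following along, matching a device name to a physical disk by serial number might look like the sketch below. It assumes `camcontrol devlist` prints device names in a parenthesised pair such as `(ada0,pass0)` and that `camcontrol identify` prints a line beginning with `serial number`; the helper names `ada_devices`, `serial_of` and `print_serials` are made up for this illustration.

```shell
#!/bin/sh
# List the adaN device names found in `camcontrol devlist` output (stdin).
# Only names inside the trailing "(adaN,passN)" pair are matched.
ada_devices() {
    grep -oE '[(,]ada[0-9]+' | tr -d '(,' | sort -u
}

# Pull the serial number out of `camcontrol identify <dev>` output (stdin).
serial_of() {
    awk '/^serial number/ { print $3 }'
}

# Print "device serial" for every ada disk so each one can be matched to
# the label printed on the physical drive. Run on the NAS as root.
print_serials() {
    for dev in $(camcontrol devlist | ada_devices); do
        printf '%s %s\n' "$dev" "$(camcontrol identify "$dev" | serial_of)"
    done
}
```

Writing the serials down next to the bay or cable each drive sits on saves this detective work the next time a disk drops out.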

That said, I unplugged the power from the problematic drive, and started the server with the original FreeNAS USB. The system mounted without a problem, except for the fact that the volume was degraded due to the missing drive. However, no ada/ahci boot errors.

So, I'm assuming the drive is the problem. Is it normal for a FreeNAS system not to mount if one drive goes bad?

I ran a SMART test on the problematic drive before removing it from the system (smartctl -t short /dev/ada0), and the log showed a read failure.

IMG_0419.JPG


There were a few pages' worth of these errors. I couldn't find an error legend to explain the code, though...

IMG_0416.JPG
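
A rough sketch of running a self-test and checking its log from the shell is below. The `smartctl` options used (`-t`, `-l selftest`, `-a`) are standard smartmontools flags; the helper names `start_selftest` and `has_read_failure` are made up for this illustration, and the check assumes a failed test is logged with the words "read failure", as it was here.

```shell
#!/bin/sh
# Start a self-test on a disk. $1 is "short" or "long", $2 is the device.
# The test runs in the background on the drive itself.
start_selftest() {
    smartctl -t "$1" "$2"
}

# Succeeds (exit 0) if the self-test log on stdin records a read failure.
has_read_failure() {
    grep -qi 'read failure'
}

# Usage on the NAS, once the test has had time to finish:
#   start_selftest short /dev/ada0
#   smartctl -l selftest /dev/ada0 | has_read_failure && echo "read failure logged"
#   smartctl -a /dev/ada0    # full attribute and error-log dump
```

The `long` variant reads the whole surface and takes hours on a large drive, but it is the more thorough of the two tests.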


I removed the drive from the system, plugged it into my laptop via a SATA/USB adapter, and ran the WD Data LifeGuard Diagnostics application. The Quick Test was a PASS.

IMG_0420.JPG


So, my questions are as follows...

1. What is the standard procedure for RMA'ing a drive to Western Digital? The drives were purchased new less than 2 months ago. Do I need to provide some sort of failure code?
2. I'm running RAIDZ2 (the RAID6 equivalent), so I'm OK right now. However, when I get the new drive, how do I add it back to the pool and rebuild? Is there a how-to somewhere for this process?
3. Any other suggestions for getting a conclusive fail on this drive?
4. I was thinking I might try zeroing out the drive and adding it back to the pool. Thoughts?

Many thanks in advance!!

sully
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
I bought 6 Seagate 3TB drives and did nothing other than put them into a system with no other drives, boot Ultimate Boot CD from USB, and run SeaTools on them (long test). One failed. I copied the log to the flash drive, moved it to my desktop, and printed it out along with the log. I took a photo of the screen that showed the error code. Basically, I wanted to provide them with everything I had showing the drive was bad. Then I RMA'd. In my case I decided to RMA to NewEgg for a refund (within 30-day period) rather than the manufacturer because everyone in the industry seems to think it's totally OK to exchange a relatively new drive with a refurb that has gone through God knows what. Even NewEgg wouldn't guarantee me a new drive if I RMA'd for replacement, so I selected RMA for refund. If I'd been outside my 30-day window, I might have had to accept a refurb, which you may have to do. Just make sure you run a long test on it right away when you get it.

Quick Test is helpful, but long test is where it's at. No way I'd try to "make do" with a bad drive that's under warranty.

There are several threads here on the forum about how to replace a failed drive, and it's covered in the documentation.
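
As a rough illustration of what a replacement involves at the ZFS level, here is a dry-run sketch. The pool and device names are placeholders supplied by the caller, and `run` and `replace_disk` are made-up helper names. FreeNAS keeps its own record of volumes, so the GUI procedure in the documentation is the safer route; this only shows the underlying `zpool` commands.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set, so the
# sequence can be reviewed before touching a real pool.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = pool name, $2 = failed device, $3 = replacement device
# (all three are placeholders chosen by the caller).
replace_disk() {
    pool=$1; old_dev=$2; new_dev=$3
    run zpool status "$pool"                         # confirm which member is faulted
    run zpool replace "$pool" "$old_dev" "$new_dev"  # starts the resilver
    run zpool status "$pool"                         # shows resilver progress
}

# Later steps, each run only after the previous one has finished:
#   zpool scrub <pool>    # re-verify every block after the resilver
#   zpool clear <pool>    # reset the pool's error counters
```

With the default `DRY_RUN=1`, calling `replace_disk` only prints the three commands, which makes it easy to double-check the device names first.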
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Thanks for the heads up, Stephens. I ordered them through Amazon directly. I just checked and was able to process a new replacement via RMA. It's just a bit odd that the drive fails the SMART test via smartctl in FreeNAS but passed the Quick Test in WD Data LifeGuard. I have the drive running an Extended/Long test now.

Thanks for your help.
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
The drive has been replaced and resilvered, the old drive detached, and the zpool scrubbed and cleared. However, it appears that the GUI is not reflecting an updated description of the new drive.

Screen Shot 2012-09-26 at 12.07.50 PM.jpg

Any suggestions on how to get the GUI to reflect what is seen in zpool status? I've restarted the server 3 times now: once after the resilvering, once after the scrub, and once after the zpool clear.

Screen Shot 2012-09-26 at 12.08.47 PM.jpg

When I went to schedule a SMART test, the drive was also listed as "replacing", matching what the rest of the GUI shows.

Screen Shot 2012-09-26 at 12.10.42 PM.jpg

Thoughts?

Thanks in advance.

-Sully
 

sully

Explorer
Joined
Aug 23, 2012
Messages
60
Disregard. I figured it out.

I exported the pool and then auto-imported it. I think this may have been a database error, as the pool was intact but the labeling was off.

Thanks again for all your help.
 