FreeNAS 18 months, working like a champ, suddenly all drives have disappeared? :(

Status
Not open for further replies.

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
Hi Guys,

I look after a big FreeNAS box at work. It's a SuperMicro X10DRI-T motherboard with a Xeon CPU and 32GB RAM, running 18 x 1TB drives in a SuperMicro chassis.
It's been running like a champ for over 18 months, but this morning it suddenly locked up and I had to power cycle it to get back into the SSH/web interfaces.
Now the box is back up and running but only the internal USB drive which holds the OS is present.
I've powered it down, ejected/reinserted all the drives and disconnected/reconnected the big cables from the backplane, but FreeNAS/BSD still sees no drives.
Help. I'm out of my depth (the box was built by my predecessor) and people are starting to notice... :(

Cheers,
L.
 

darkwarrior

Patron
Joined
Mar 29, 2015
Messages
336
Hi there,

Let's try the basic things : ;)
- are you seeing the drives from within the "BIOS" of your HBAs ?
- are you seeing FreeNAS trying to import the pool ?
- any other messages getting displayed ?
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
Perhaps I should add that by gone, I mean that in the web interface under Storage/Volumes/ I get -

Name Used Available Compression Compression Ratio Status Readonly
Tank 0 (Errror) Error getting available space - - Unknown

And under /View Disks

Name Serial Disk Size Description Transfer Mode HDD Standby Advanced Power Management Acoustic Level Enable S.M.A.R.T. S.M.A.R.T. extra options
No entry has been found

From SSH -

[root@freenas] ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Sun Nov 13 03:50:46 2016
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/5dd02fbc-d21b-11e4-97d7-002590f9d2c8  ONLINE       0     0     0

errors: No known data errors
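
For anyone wanting to script this check, here is a minimal sketch (not something posted in the thread) that reads `zpool status` text and reports which expected pools are absent. The function names `imported_pools` and `missing_pools` are made up for illustration, the sample text is copied from the output above, and the expected pool name "Tank" is taken from the web interface listing.

```python
import re

# Sample text copied from the `zpool status` output in the post above.
SAMPLE = """\
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Sun Nov 13 03:50:46 2016
config:

NAME STATE READ WRITE CKSUM
freenas-boot ONLINE 0 0 0
gptid/5dd02fbc-d21b-11e4-97d7-002590f9d2c8 ONLINE 0 0 0

errors: No known data errors
"""


def imported_pools(status_text):
    """Return {pool_name: state} for each pool listed in `zpool status` text."""
    pools = {}
    current = None
    for line in status_text.splitlines():
        # Only the "pool:" and "state:" header lines matter here.
        match = re.match(r"\s*(pool|state):\s*(\S+)", line)
        if not match:
            continue
        key, value = match.groups()
        if key == "pool":
            current = value
            pools[current] = None
        elif current is not None:
            pools[current] = value
    return pools


def missing_pools(status_text, expected):
    """Return the expected pool names that do not appear in the status text."""
    found = imported_pools(status_text)
    return [name for name in expected if name not in found]


if __name__ == "__main__":
    print(imported_pools(SAMPLE))
    print(missing_pools(SAMPLE, ["freenas-boot", "Tank"]))
```

With the output above, only the boot pool is listed and the data pool is reported as missing, which matches what the web interface shows.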
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
Hi HoneyBadger!
I was just tinkering with the LSI controller BIOS. It shows up during boot and I can CTRL-C into it, but the only place I find anything about drives is "SAS Topology", which then says "No devices to display". Is that correct? Does this mean the LSI card thinks there are no drives attached?
If so, what do you think has failed?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
SAS topology is the right place to look; but if it's showing no devices at all, it's the drive backplane or the cables leading to it. Since you reseated the cables already, it is likely that the backplane has failed or become unpowered somehow.

Try reseating the cables again, maybe try attaching only the first one from the HBA to the backplane, just in case the two components suddenly decided to have a fit about multipath mode. Check for indicator LEDs on the backplane to see if it has any fault indicators or loose power cables; if there's nothing obvious (or all lights are dead and you know it's powered) then you'll likely need to order a matching replacement.

Make note of the model number, since there are a lot of different backplanes (expander vs direct-attach, SAS2 vs SAS1), and order a replacement. Once the replacement is installed you shouldn't need to reconfigure anything either on the LSI card or within FreeNAS.

Edit: If this unit is 18 months old, it may still be warrantied. Check for that first before you spend money; but having spare parts for production units is a good idea as well. Or investigate TrueNAS for a commercially supported ZFS solution (incl. active/active heads)
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
Oh, that's not good...
The drive lights on the bays light up, so there's some power there, but I'll check the backplane in the morning.
I've got on to the supplier, hopefully a two year warranty is in place, but I suspect it was only twelve months.
In the meantime, I've ordered cables, as it's a cheap first move while I sort out more expensive parts.
What do you suggest to keep on the shelf? A spare LSI HBA card, a backplane, and cables? Or better yet all three? :/
Do these things go wrong often?
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
I've just had another thought...

I'd guess that it's more likely to be the backplane as the controller appears to be working.
However, I forgot to mention that my case also has another 12 drive bays in the back, so there's presumably another backplane for those rear 12 drives, some of which are populated, BUT they have disappeared too.

So does this point the finger back at the controller?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Leave said:
does this point the finger back at the controller?
Could be, could be the PSU, could be something else. You just need to explain to your boss that the only way forward is step by step, changing one thing at a time, eliminating possibilities one by one.
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
Hi again,
After looking at the hardware in more detail, it became apparent that the rear backplane was chained through the front backplane to the LSI card. After an hour of removing bits and trying all the permutations of 'planes/cables, I came to the conclusion that it's the front backplane that's died, as the rear one works as it should when attached directly to the card.
New backplane ordered (£500 - ouch!), delivery next working day.

More news tomorrow. :)
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
And finally... The new front backplane arrived, and a lot of screws and cables later, the old one was replaced and the new one works like a charm!
FreeNAS carried on from where it was before the hardware failure.

So in summary, if all your hard drives have disappeared, but you can still see the HBA, it's probably your backplane that's died. If you have another backplane in the server, plug that directly into your HBA to confirm.
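
As a rough sketch of that elimination logic (my own encoding of the thread's reasoning, not a definitive procedure), the three observations can be turned into a shortlist of suspects. The function name `likely_suspects` and its parameters are hypothetical, and the "backplane" result assumes cable swaps have already been tried, as they were here.

```python
def likely_suspects(hba_visible, drives_in_sas_topology, other_backplane_ok_direct=None):
    """Shortlist suspect parts from three observations.

    hba_visible: the HBA shows its BIOS banner at boot.
    drives_in_sas_topology: the HBA BIOS lists attached devices.
    other_backplane_ok_direct: result of plugging a second backplane
        straight into the HBA; None means "not tested yet".
    """
    if not hba_visible:
        return ["HBA"]
    if drives_in_sas_topology:
        # The HBA can see drives, so the cabling path is not the problem.
        return []
    if other_backplane_ok_direct is None:
        # Nothing eliminated yet: keep every candidate raised in the thread.
        return ["backplane", "cables", "HBA", "PSU"]
    if other_backplane_ok_direct:
        # HBA proven good; assumes cables were already swapped.
        return ["backplane"]
    return ["cables", "HBA", "PSU"]


if __name__ == "__main__":
    # The situation in this thread once the rear backplane was tested directly.
    print(likely_suspects(True, False, True))
```

Each test removes candidates from the list, which is the "one thing at a time" approach suggested earlier in the thread.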

Happy days. [It has brought up another issue with the firmware in the HBA, but that's another thread.]
Cheers for the comments and support!
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Glad you got this sorted out in the end, despite the cost.

Document the process, do a nice root-cause-analysis ending in "hardware failure" and present it to your boss as a case for having spare parts available.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
A case for having an identical backup system I guess.

In the past I've used a simple SFF-8087 to SATA breakout cable, and the nearest HD I could grab, to verify whether an HBA is alive.

In my case, the entire batch of 4 SFF-8087 cables was faulty!
 

Leave

Cadet
Joined
Nov 28, 2016
Messages
9
HoneyBadger said:
Glad you got this sorted out in the end, despite the cost.

Document the process, do a nice root-cause-analysis ending in "hardware failure" and present it to your boss as a case for having spare parts available.

Along those lines, I don't suppose anyone on here knows anyone who might be able to "repair" the old backplane? Long shot, I know, but I thought I'd ask.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Stux said:
A case for having an identical backup system I guess.
Agreed. If an important system has parts you can't obtain locally in an emergency, spares on hand are critically important. It probably does not need to be the whole system, but figure anything you can't get at a local computer store. And those spares should be tested. Ideally, tested on a repeating basis, to avoid discovering something has failed in storage when you really need it right now.
 