Recently upgraded to 9.2.1.8 and now restarting a bunch

Status
Not open for further replies.

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Well, it is a fair concern, but SATA drives are designed with support for hot-plugging (same as hot-swapping).
The mating connectors are designed so that GND connects first when plugging in the device and breaks last when unplugging it.
Also, the Power Management Unit, which is part of the drive, is designed to verify that all the voltage rails are present before turning the drive on. There is also inrush current limiting through a soft-start process.
The old 4-pin MOLEX connector doesn't support hot-swapping because the construction of the pins cannot guarantee that the GND pins will connect first and disconnect last.
On servers, there may be an extra layer of protection between the drive and the system to prevent collapse of the power supply, but I would think it is intended to guard against a hard failure of the drive that could short the voltage rails and take the entire system down.

I am not even sure a PSU would shut down unless it sensed a surge condition (which would have to be a short) or unless your PSU is undersized or overloaded.
In my opinion, PSUs are not designed to shut down if the inrush current is within the envelope the PSU is designed for.
 

zambanini

Patron
Joined
Sep 11, 2013
Messages
479
I've had enough SATA drives die after "hot swapping".

Apollo, please do not post fairy tales.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, it is a fair concern, but SATA drives are designed with support for hot-plugging (same as hot-swapping).
The mating connectors are designed so that GND connects first when plugging in the device and breaks last when unplugging it.
Also, the Power Management Unit, which is part of the drive, is designed to verify that all the voltage rails are present before turning the drive on. There is also inrush current limiting through a soft-start process.
The old 4-pin MOLEX connector doesn't support hot-swapping because the construction of the pins cannot guarantee that the GND pins will connect first and disconnect last.
On servers, there may be an extra layer of protection between the drive and the system to prevent collapse of the power supply, but I would think it is intended to guard against a hard failure of the drive that could short the voltage rails and take the entire system down.

I am not even sure a PSU would shut down unless it sensed a surge condition (which would have to be a short) or unless your PSU is undersized or overloaded.
In my opinion, PSUs are not designed to shut down if the inrush current is within the envelope the PSU is designed for.

That's the problem. SATA drives all accept that. But SATA power cables generally don't - the few millimeters on the drive side aren't enough, you need the host end to have the same indents. Backplanes also have filter circuitry to help isolate the host side.

I'm not saying it'll never work, I'm saying it's very likely it won't work.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Ok here is what I have done over the past 24 hours. I re-installed FreeNAS to a new USB and this time did not re-import my pool or restore my settings from database backup. It has been running solid for the past 24 hours. So that leads me to believe it is something with my storage/pool/controller (although I am still seeing SMART alerts so I know the controller is still working as are the disks, even though the pool is not referenced.) Should I just try to re-import my pool and hope it was a config issue of some sort?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Ok here is what I have done over the past 24 hours. I re-installed FreeNAS to a new USB and this time did not re-import my pool or restore my settings from database backup. It has been running solid for the past 24 hours. So that leads me to believe it is something with my storage/pool/controller (although I am still seeing SMART alerts so I know the controller is still working as are the disks, even though the pool is not referenced.) Should I just try to re-import my pool and hope it was a config issue of some sort?

When drives are present but neither imported nor added to a pool, they will not cause the issue.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Sigh. I will lock this thread if you guys won't behave.

Yes, SATA has support for what we all call hotswap (and hotplugging). The question is whether the remainder of your hardware does. If you aren't using a backplane, you almost certainly do NOT have the appropriate hardware, and therefore do NOT truly support hotswap and hot plugging. Keep in mind that when stuff is said like this (from the wiki page above):

The Serial ATA Spec includes logic for SATA device hotplugging. Devices and motherboards that meet the interoperability specification are capable of hot plugging.

The expectation is that YOU know what "devices" and "motherboards" support the spec. This also means that the drivers for the SATA controller you are connecting to are supported. ;)

So this is much more complicated than people think, which is why I go with the old trusty answer of 'just shut the box down and swap disks'. Why do I go with that versus doing it hot? Because 99.9% of people that want to say it is supported have never actually tested it with their hardware, so they can't even vouch for if it will work. If they get it wrong the box will go down (and there's a chance the pool will go down for good too). Is the risk worth the benefit? Nope.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I won't go any further on the subject, but I do hot-swap my drives most of the time.
The idea behind hot-swapping was to simulate a communication fault and see whether FreeNAS would detect the drive reconnecting, since I thought that might be what is causing the system to reboot.
The other issue I think could be the culprit is a timeout between FreeNAS and the HBA controller. I have seen this first hand when using my RR4320 RAID card.
Lastly, I would suggest using only the motherboard SATA ports, assuming they are regular non-RAID SATA ports, but then you may run out of available ports with an 8-drive array.
So if the OP could experiment with a spare drive we might get somewhere.

By the way this is Xmas time and I do believe in Fairy tales. Who else would say Santa doesn't exist?
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Can we get back on-track? Apollo, I appreciate your help, but please don't side-track my discussion into a SATA discussion. I really need to get this resolved. I have the pool re-imported at this point and am pretty sure the driver and firmware are correct on the controller.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
I should add that I just got a new matching disk by FedEx today, which I had planned to use to replace da3, the disk that seems to be getting constant SMART alerts in the console.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Definitely back to junk uptime once my pool is added back. So it HAS to be related to my controller and/or disks. I need some steps to narrow this down. The controller again is an IBM M1015 flashed to SAS2008 IT mode (16.00.01 fw), and the disks are all Hitachi 2TB disks arranged in a RAIDZ2 setup (8 disks total.) I do have one new disk which I just received today.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Can we get back on-track? Apollo, I appreciate your help, but please don't side-track my discussion into a SATA discussion. I really need to get this resolved. I have the pool re-imported at this point and am pretty sure the driver and firmware are correct on the controller.
I did stay on track and was pointing out a way to see if the problem is related to a timeout on your IBM M1015. It is the 3rd line of my previous post.
Also, as you are not providing your system specs in your signature, I have to look further back into the thread and extract the relevant information, which in this case is the number of available SATA ports on your motherboard (see previous post, line 4).
If you had enough SATA ports on your motherboard, then I would connect your old array there, taking the IBM M1015 out of the loop. That is right on track with isolating the root cause of the failure.

And I added the comment about the spare.

So to make this official and clear beyond any doubt, do as I would.

1: Don't connect your old array.
2: Instead, just connect your new drive and create a pool on it.
3: Add a dataset or a few, and try to replicate some of the snapshot tasks, shares, and maybe jails that are currently set up on your old array; relocate the system dataset to the new pool.
4: When you get a close enough system, let it sit and see how it behaves.

At this point, given the general feeling of this thread toward hot-swapping, I would leave that test aside, although in my book it would provide a better understanding of how the system behaves in the event of a SATA link failure, which is what you might experience with a failing drive. My point was to see if the card would trigger a reconnect.

As I said, I suspect the IBM M1015 to be the culprit in this case, until we know more.

Another option would be to power the system OFF, disconnect only the drive that is at fault, and start the system again. Will the system still reboot then?

My concern, in the event the root cause is the IBM M1015, is that I don't know what will happen to your pool if you were to replace the bad drive with the new one and start resilvering.

Would someone comment on this?

Beyond that I have no more recommendation at this point.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Definitely back to junk uptime once my pool is added back. So it HAS to be related to my controller and/or disks. I need some steps to narrow this down. The controller again is an IBM M1015 flashed to SAS2008 IT mode (16.00.01 fw), and the disks are all Hitachi 2TB disks arranged in a RAIDZ2 setup (8 disks total.) I do have one new disk which I just received today.
Isn't that exactly what I said below?
When drives are present but neither imported nor added to a pool, they will not cause the issue.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
As a side note, I would like to explore any changes that might have occurred in the few days before you first experienced your reboots.
I take it this one is going to be extremely low probability, but it is not unheard of.
Have you connected more devices to the AC line, such as Xmas lights and all the junk that sucks up juice?
What I am trying to see is whether your AC line would droop, causing the system to reboot.
If you don't know what droop means you can google it, but be careful, don't trust anything that is on Wikipedia. (Just kidding, I love Wikipedia.)
Could it be related to your home heating or cooling unit cycling on and off? Droop shows up as what you perceive as a flickering of the lights: the voltage does not drop low enough to turn them off, but if it does, you are pulling too much current somewhere.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Another thing you may want to check is the following:

- Do you see any damage to the motherboard or the IBM M1015 board?
- What you have to look for are popped electrolytic capacitors (the bright silver top of a cap is designed as a venting safety device; it will break along the path of least resistance in the K-shaped indentation, and you should look for brown cotton-like material protruding through the opening, if any), burned-out transistors, or traces of carbon.
- Usually electrolytic capacitors age more rapidly during turn-ON and turn-OFF events, as that is when the energy going through them is strongest.

http://en.wikipedia.org/wiki/Capacitor_plague

And this is not a fairy tale. I have personally seen some.

Someone might say this is not possible, otherwise your board would never work, but I dare challenge this statement.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Apollo-

Sorry, I didn't mean to tick you off. I do appreciate your input and probably could have done a better job scouring through it. I really want to get this resolved. My motherboard (which is now in my signature) does provide enough ports to hook up all 6 of my disks, I would just need to get some SATA cables and figure out the BIOS settings to do so. I have also unhooked the da3 drive and powered the machine back on and it still reboots. I looked for swollen caps on both the HBA and motherboard and all look normal. At this point, I will try hooking up my single HDD to the system and creating a new zpool on it and doing some testing using the M1015 and let you know. If it continues to reboot, are you saying it would be possible to change the connections from the 1015 to my mobo and not lose my pool? I guess I just figured that if I changed the "address" of how the disks are connected that the system wouldn't know how to put the pool back together.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I would wait until you have evaluated the new drive before doing anything else.

If the M1015 is still causing reboots, take it out.
Make sure the SATA controller on the motherboard has the proper firmware, if any, as it doesn't show the type of RAID processor.

Repeat the test with the single drive on the motherboard SATA and see what happens.

Back to your question: you can reconnect the drives to any SATA ports. As far as I can tell, each drive has a particular gptid, and the RAIDZx vdev holds the list of the gptids of all the drives that are part of the array. So FreeNAS knows what connects with what.
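
To illustrate the point about gptids, here is a minimal Python sketch of the idea, not of how ZFS is actually implemented. It assumes a simplified model in which the pool only remembers a set of stable disk identifiers; the function name `assemble_pool`, the port names, and the gptid strings are all hypothetical.

```python
def assemble_pool(pool_members, attached):
    """Match pool members by stable ID, ignoring which port each disk is on.

    pool_members: set of disk IDs recorded in the pool (hypothetical values).
    attached: dict mapping a port/device name to the disk ID found there.
    Returns (found, missing): found maps disk ID -> port, missing is a set.
    """
    found = {disk_id: port
             for port, disk_id in attached.items()
             if disk_id in pool_members}
    missing = set(pool_members) - set(found)
    return found, missing


# Hypothetical pool of three disks.
members = {"gptid/aaa", "gptid/bbb", "gptid/ccc"}

# The same disks seen first behind the HBA, then on onboard SATA in another order.
on_hba = {"da0": "gptid/aaa", "da1": "gptid/bbb", "da2": "gptid/ccc"}
on_onboard = {"ada0": "gptid/ccc", "ada1": "gptid/aaa", "ada2": "gptid/bbb"}

found_hba, missing_hba = assemble_pool(members, on_hba)
found_onboard, missing_onboard = assemble_pool(members, on_onboard)
```

In both cases every member is found and nothing is missing, which is why moving the disks from the M1015 to the motherboard ports should not by itself prevent the pool from being put back together.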
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Ok I have disconnected my 6 disks, pulled the HBA card, plugged in the new disk on the onboard SATA0 port. I created a new ZFS stripe using only this one disk and created a dataset. I changed the system dataset to point to this new disk. I am now installing a jail onto it and will see what happens. If this does work, I have also ordered 6 SATA cables to be here Saturday so I can get my original pool back up on the onboard ports, but I still have the da3 disk in the "replacing" state, so I will need to find a way to wipe this new disk out and put it back to new status so I can replace the failing da3 with this one. If this works, should I even bother replacing my M1015 card or just run them off the onboard SATA ports?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You know, personally, if your pool is small enough that you *CAN* run them off of onboard SATA ports, why wouldn't you? Definitely, I'd do that. That's the ideal situation.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Can you run your single drive with only the M1015 for now? If the M1015 is the faulty part, then the system should reboot, and this will indicate that the 6-disk array is not the root cause of the reboots, even though the disks have some errors.
If the system reboots, then take the M1015 out and try the same single disk on the motherboard SATA. Then let's hope it will not reboot.
If the M1015 is shown to cause the reboots, and the system doesn't reboot on the motherboard SATA, then I would disconnect the single disk and install all the other disks on the motherboard SATA, even the old da3. Run a resilver or scrub, and FreeNAS will only repair blocks with wrong parity, leaving most of the disk unaffected. I think this is the fastest way to resilver for now, as it shouldn't stress the array too much. The reason I think this is the better solution is that even without da3 you had a reboot, which means da3 was not directly the source of the crash; maybe another drive is, but I doubt it.
Best to keep the single drive to replace the really defective drive, if any.
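
The isolation procedure above can be read as a small decision tree. The Python sketch below is only one way to write it down; the helper name `next_step` is hypothetical, and the two branches marked as assumptions are not spelled out in the post.

```python
from typing import Optional


def next_step(reboots_with_hba: bool,
              reboots_with_onboard: Optional[bool] = None) -> str:
    """Suggest the next step from the single-disk test results.

    reboots_with_hba: did the system reboot with one disk behind the M1015?
    reboots_with_onboard: same test on motherboard SATA (None if not yet run).
    """
    if not reboots_with_hba:
        # Assumption: a stable single-disk run does not clear the old disks.
        return "HBA not shown faulty; look again at the original disks"
    if reboots_with_onboard is None:
        # From the post: pull the M1015 and repeat on motherboard SATA.
        return "remove HBA and repeat the single-disk test on onboard SATA"
    if reboots_with_onboard:
        # Assumption: reboots without the HBA point elsewhere in the system.
        return "HBA not the only cause; check the rest of the system"
    # From the post: HBA implicated, move the array to onboard SATA and scrub.
    return "HBA is the likely culprit; move all disks to onboard SATA and scrub"
```

For example, `next_step(True, False)` corresponds to the case where the M1015 reproduces the reboots and the onboard ports do not, which is the case where moving the whole array to the motherboard SATA is suggested.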
 