Recently upgraded to 9.2.1.8 and now restarting a bunch

Apollo · Dec 30, 2014

Well, it is a fair concern, but SATA drives are designed with support for hot-plugging (same as hot-swapping).
Mating connectors are designed to allow GND to connect first when plugging the device and GND to break last when unplugging it.
Also, the power management unit in the drive is designed to check that all the voltage rails are present before turning the drive on. There is also inrush current limiting through a soft-start process.
The old 4-pin MOLEX connector doesn't support hot-swapping because the construction of the pins cannot guarantee that the GND pins will connect first and disconnect last.
On servers, there may be an extra layer of protection between the drive and the system to prevent collapse of the power supply, but I would think it is intended to guard against a hard failure of the drive that could short the voltage rails and take the entire system down.

I am not even sure a PSU would shut down, unless it senses a surge condition (which would have to be a short) or your PSU is undersized or overloaded.
In my opinion, PSUs are not designed to shut down if the inrush current is within the envelope the PSU is designed for.

zambanini · Dec 31, 2014

I've had enough SATA drives die after "hot swapping".

Apollo, please do not post fairy tales.

Ericloewe · Dec 31, 2014

Apollo said:
Well, it is a fair concern, but SATA drives are designed with support for hot-plugging (same as hot-swapping).
Mating connectors are designed to allow GND to connect first when plugging the device and GND to break last when unplugging it.
Also, the power management unit in the drive is designed to check that all the voltage rails are present before turning the drive on. There is also inrush current limiting through a soft-start process.
The old 4-pin MOLEX connector doesn't support hot-swapping because the construction of the pins cannot guarantee that the GND pins will connect first and disconnect last.
On servers, there may be an extra layer of protection between the drive and the system to prevent collapse of the power supply, but I would think it is intended to guard against a hard failure of the drive that could short the voltage rails and take the entire system down.

I am not even sure a PSU would shut down, unless it senses a surge condition (which would have to be a short) or your PSU is undersized or overloaded.
In my opinion, PSUs are not designed to shut down if the inrush current is within the envelope the PSU is designed for.

That's the problem. SATA drives all accept that. But SATA power cables generally don't - the few millimeters on the drive side aren't enough, you need the host end to have the same indents. Backplanes also have filter circuitry to help isolate the host side.

I'm not saying it'll never work, I'm saying it's very likely it won't work.

electricd7 · Dec 31, 2014

Ok here is what I have done over the past 24 hours. I re-installed FreeNAS to a new USB and this time did not re-import my pool or restore my settings from database backup. It has been running solid for the past 24 hours. So that leads me to believe it is something with my storage/pool/controller (although I am still seeing SMART alerts so I know the controller is still working as are the disks, even though the pool is not referenced.) Should I just try to re-import my pool and hope it was a config issue of some sort?

Apollo · Dec 31, 2014

zambanini said:
I've had enough SATA drives die after "hot swapping".

Apollo, please do not post fairy tales.

Unfortunately, the SATA specification is not available for free, but you can get some information here:

http://en.wikipedia.org/wiki/Serial_ATA
http://wdc.custhelp.com/app/answers/detail/a_id/941/~/hot-swap-or-hot-plug-wd-sata-drives

Apollo · Dec 31, 2014

electricd7 said:
Ok here is what I have done over the past 24 hours. I re-installed FreeNAS to a new USB and this time did not re-import my pool or restore my settings from database backup. It has been running solid for the past 24 hours. So that leads me to believe it is something with my storage/pool/controller (although I am still seeing SMART alerts so I know the controller is still working as are the disks, even though the pool is not referenced.) Should I just try to re-import my pool and hope it was a config issue of some sort?

When drives are present but neither imported nor added as a pool, they will not cause the issue.

cyberjock · Dec 31, 2014

Sigh. I will lock this thread if you guys won't behave.

Yes, SATA has support for what we all call hotswap (and hotplugging). The question is whether the remainder of your hardware does. If you aren't using a backplane, you almost certainly do NOT have the appropriate hardware, and therefore do NOT truly support hotswap and hot plugging. Keep in mind that when stuff is said like this (from the wiki page above):

The Serial ATA Spec includes logic for SATA device hotplugging. Devices and motherboards that meet the interoperability specification are capable of hot plugging.

The expectation is that YOU know what "devices" and "motherboards" support the spec. This also assumes that the drivers for the SATA controller you are connecting to support it. ;)

So this is much more complicated than people think, which is why I go with the old trusty answer of 'just shut the box down and swap disks'. Why do I go with that versus doing it hot? Because 99.9% of people that want to say it is supported have never actually tested it with their hardware, so they can't even vouch for if it will work. If they get it wrong the box will go down (and there's a chance the pool will go down for good too). Is the risk worth the benefit? Nope.

Apollo · Dec 31, 2014

I won't go any further on the subject, but I do hot-swap my drives most of the time.
The hot-swapping idea was to experiment with a communication fault and see if FreeNAS would detect the drive reconnecting, since I thought that might be what is causing the system to reboot.
The other issue I think could be the culprit is a timeout between FreeNAS and the HBA controller. I have seen this first hand when using my RR4320 RAID card.
Last, I would suggest using only the motherboard SATA ports, assuming they are regular non-RAID SATA ports, but then you may not have enough ports available for an 8-drive array.
So if the OP could experiment with a spare drive we might get somewhere.

By the way this is Xmas time and I do believe in Fairy tales. Who else would say Santa doesn't exist?

electricd7 · Dec 31, 2014

Can we get back on-track? Apollo, I appreciate your help, but please don't side-track my discussion into a SATA discussion. I really need to get this resolved. I have the pool re-imported at this point and am pretty sure the driver and firmware are correct on the controller.

electricd7 · Dec 31, 2014

I should add that I just got a new matching disk via FedEx today that I had planned on using to replace da3, which is the one that seems to be getting constant SMART alerts in the console.

electricd7 · Dec 31, 2014

Definitely back to junk uptime once my pool is added back. So it HAS to be related to my controller and/or disks. I need some steps to narrow this down. The controller again is an IBM M1015 flashed to SAS2008 IT mode (16.00.01 fw), and the disks are all Hitachi 2TB disks arranged in a RAIDZ2 setup (8 disks total.) I do have one new disk which I just received today.

Apollo · Dec 31, 2014

electricd7 said:
Can we get back on-track? Apollo, I appreciate your help, but please don't side-track my discussion into a SATA discussion. I really need to get this resolved. I have the pool re-imported at this point and am pretty sure the driver and firmware are correct on the controller.

I did stay on track and was pointing you to a way to see if the problem is related to a timeout on your IBM M1015. It is the 3rd line of my previous post.
Also, as you are not providing your system specs in your signature, I have to look back through the thread to extract the relevant information, which by the way is the number of available SATA ports on your motherboard (see line 4 of my previous post).
If you had enough SATA ports on your motherboard, then I would connect your old array there, taking the IBM M1015 out of the loop, which is right on track with isolating the root cause of the failure.

And I added the comment about the spare.

So to make this official and clear beyond any doubt, do as I would:

1: Don't connect your old array.
2: Instead, just connect your new drive and create a new volume (pool) on it.
3: Add a dataset or a few, and try to replicate what is currently set up on your old array: some snapshot tasks, shares, maybe jails too. Relocate the system dataset to point to the new volume.
4: When you get a close enough system, let it sit and see how it behaves.

At this point, based on the general feeling of this thread toward hot-swapping, I would leave that test aside, although in my book it would provide a better understanding of how the system behaves in the event of a SATA link failure, which is what you might experience with a failing drive. My point was to see if the card would trigger a reconnect.

As I said, I suspect the IBM M1015 to be the culprit on this occasion, until we know more.

Another solution would be to power the system off, disconnect only the drive that is at fault, and start the system again. Will the system reboot then?

My concern, in the event the root cause is the IBM M1015, is that I don't know what will happen to your pool if you were to replace the bad drive with the new one and start resilvering.

Would someone comment on this?

Beyond that I have no more recommendation at this point.

Apollo · Dec 31, 2014

electricd7 said:
Definitely back to junk uptime once my pool is added back. So it HAS to be related to my controller and/or disks. I need some steps to narrow this down. The controller again is an IBM M1015 flashed to SAS2008 IT mode (16.00.01 fw), and the disks are all Hitachi 2TB disks arranged in a RAIDZ2 setup (8 disks total.) I do have one new disk which I just received today.

Isn't it exactly what I said below?

Apollo said:
When drives are present but neither imported nor added as a pool, they will not cause the issue.

Apollo · Dec 31, 2014

As a side note, I would like to explore any changes that might have happened in the few days around when you first experienced your reboots.
I take it this one is going to be extremely low probability, but it is not unheard of.
Have you connected more devices to the AC line, such as Xmas lights and all the junk that sucks up juice?
What I am trying to see is whether your AC line could droop, causing the system to reboot.
If you don't know what droop means you can google it, but be careful, don't trust anything that is on Wikipedia (just kidding, I love Wikipedia).
Could it be related to your home heating or cooling unit turning on and off? Droop will appear as what you perceive as a flickering of the lights, not low enough to turn them off; but if it does happen, it means you are pulling too much current somewhere.

Apollo · Dec 31, 2014

Another thing you may want to check:

- Do you see any damage to the motherboard or the IBM M1015 board?
- What you have to look for are popped electrolytic capacitors, burned-out transistors, or traces of carbon. The bright silver top of a cap is designed as a safety vent: it will break along the K-shaped indentation, the path of least resistance to the pressure, and you should look for brown, cotton-like material protruding through the opening, if any.
- Electrolytic capacitors usually age more rapidly during turn-on and turn-off events, as the energy going through them is at its strongest.

http://en.wikipedia.org/wiki/Capacitor_plague

And this is not a fairy tale. I have personally seen some.

Someone might say this is not possible, otherwise your board would never work, but I dare to challenge that statement.

electricd7 · Jan 1, 2015

Apollo-

Sorry, I didn't mean to tick you off. I do appreciate your input and probably could have done a better job scouring through it. I really want to get this resolved. My motherboard (which is now in my signature) does provide enough ports to hook up all 6 of my disks; I would just need to get some SATA cables and figure out the BIOS to do so. I have also unhooked the da3 drive and powered the machine back on, and it still reboots. I looked for swollen caps on both the HBA and the motherboard and all look normal. At this point, I will try hooking up my single HDD to the system, creating a new zpool on it, and doing some testing using the M1015, and I will let you know. If it continues to reboot, are you saying it would be possible to change the connections from the 1015 to my mobo and not lose my pool? I guess I just figured that if I changed the "address" of how the disks are connected, the system wouldn't know how to put the pool back together.

Apollo · Jan 1, 2015

I would wait until you have evaluated the new drive before doing anything else.

If M1015 is still causing reboots, take it out.
Make sure the SATA controller on motherboard has proper firmware if any, as it doesn't show the type of RAID processor.

Repeat the test with the single drive on the motherboard SATA and see what happens.

Back to your question, you can reconnect the drives to any SATA ports. As far as I can tell, each drive has a particular gptid, and the RAIDZx array contains the list of gptids of all the drives that are part of the array. So FreeNAS knows what connects with what.
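
To illustrate why the port order should not matter, here is a minimal Python sketch. It is not FreeNAS or ZFS code, and the gptid strings and device names are made up; it only shows the idea of a pool that records its members by ID and finds them again after the disks move to different ports.

```python
def locate_members(pool_members, attached):
    """Map each recorded member ID to the device node it currently sits on.

    pool_members: list of IDs recorded in the pool configuration
    attached:     dict of device node -> ID read from that disk
    Members that are not attached map to None.
    """
    by_id = {disk_id: node for node, disk_id in attached.items()}
    return {disk_id: by_id.get(disk_id) for disk_id in pool_members}


# Hypothetical IDs and device names, for illustration only.
pool = ["gptid/aaa", "gptid/bbb", "gptid/ccc"]

on_hba = {"da0": "gptid/aaa", "da1": "gptid/bbb", "da2": "gptid/ccc"}
on_mobo = {"ada2": "gptid/aaa", "ada0": "gptid/bbb", "ada1": "gptid/ccc"}

before = locate_members(pool, on_hba)
after = locate_members(pool, on_mobo)

# The device nodes differ, but every member is still found by its ID.
all_found = all(node is not None for node in after.values())
```

The lookup is keyed on the ID stored with the disk, not on the port, which is why moving cables from the M1015 to the motherboard should not confuse the import.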

electricd7 · Jan 1, 2015

Ok I have disconnected my 6 disks, pulled the HBA card, plugged in the new disk on the onboard SATA0 port. I created a new ZFS stripe using only this one disk and created a dataset. I changed the system dataset to point to this new disk. I am now installing a jail onto it and will see what happens. If this does work, I have also ordered 6 SATA cables to be here Saturday so I can get my original pool back up on the onboard ports, but I still have the da3 disk in the "replacing" state, so I will need to find a way to wipe this new disk out and put it back to new status so I can replace the failing da3 with this one. If this works, should I even bother replacing my M1015 card or just run them off the onboard SATA ports?

DrKK · Jan 1, 2015

You know, personally, if your pool is small enough that you *CAN* run them off of onboard SATA ports, why wouldn't you? Definitely, I'd do that. That's the ideal situation.

Apollo · Jan 1, 2015

Can you run your single drive only with the M1015 for now? If the M1015 is the faulty part, then it should reboot the system, and this will indicate that the 6-disk array is not the root cause of the reboots, even though the disks have some errors.
If the system reboots, then take the M1015 out and try the same single disk on the motherboard SATA. Then let's hope it will not reboot.
If the M1015 is shown to cause the reboots, and the motherboard SATA doesn't show signs of reboots, then I would disconnect the single disk and install all the other disks onto the motherboard SATA, even the old da3. Run a resilver or scrub and FreeNAS will only repair blocks with wrong parity, leaving most of the disk unaffected. I think this is the fastest way to resilver for now, as it shouldn't stress the array too much. The reason I think this is the better solution is that even without da3 you had a reboot, which means da3 was not directly the source of the crash; maybe another drive is, but I doubt it.
Best to keep the single drive to replace the really defective drive, if any.
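
The test sequence above boils down to a small decision table. Here is a rough Python sketch of it; the function name and the result strings are hypothetical, and the branches that point at the PSU, motherboard, RAM or software are an assumption added for completeness rather than something established in this thread.

```python
def suspect(reboots_with_hba, reboots_onboard=None):
    """Suggest where to look next from the single-drive test results.

    reboots_with_hba: True if the box rebooted with the test drive on the M1015
    reboots_onboard:  True/False for the same drive on a motherboard port,
                      or None if that test has not been run yet
    """
    if not reboots_with_hba:
        # Assumption: a stable single-drive run points back at the old disks.
        return "M1015 looks fine with one drive; look at the original disks"
    if reboots_onboard is None:
        return "pull the M1015 and repeat the test on a motherboard SATA port"
    if reboots_onboard:
        # Assumption: reboots on both paths rule out the HBA as the sole cause.
        return "not the M1015 alone; look at PSU, motherboard, RAM or software"
    return "M1015 is the prime suspect; move the array to the motherboard ports"


# Example: reboots on the M1015, stable on the onboard port.
verdict = suspect(reboots_with_hba=True, reboots_onboard=False)
```

Each test only changes one thing at a time (the controller), which is what makes the result attributable.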
