Recently upgraded to 9.2.1.8 and now restarting a bunch

Status
Not open for further replies.

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Found it and performed the HBA firmware upgrade from 14 to 16. For those looking, the file was found at ftp://ftp.supermicro.com/Driver/SAS/LSI/2008/IR_IT/Firmware/IT/PH16.0.1-IT.zip. I extracted those files to a USB stick, booted into UEFI, and just ran 'SMC2008T.nsh'. This performed the entire upgrade. I did need to know the last 9 digits of my card's serial number. No more warning in the web interface. Here's hoping this cleans up my random reboots! Thanks for all the help. I will repost my status after I see if it's stable.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Well it was a good thought....still restarting at random times. I am at a loss. I am using ECC memory, stress test of CPU is good, firmware is correct on HBA, cleaned out all fans and filters. Just not sure where to look next. PSU?
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
..nm Thought I was on to something...same problem.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
OK so I just was sitting next to the machine and witnessed a reboot. It turned off, then I heard the fans spin up for a second, then off, then on, then off, etc until eventually they spun up and it posted. Definitely a hardware problem, but how do I know what to check based on that?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd start with the PSU. Bad power can do nasty things. From there, it's harder to evaluate single elements...
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
I just booted into the BIOS to see what the fans were running at (all were normal). I also looked at my SMBIOS event log and found lots of single-bit ECC memory errors on DIMM0 channel A. The last one was logged on 12/15/14, though, so I am not sure that is the culprit. If I understand correctly, single-bit memory errors are corrected by ECC, which is the primary purpose of ECC memory. I guess I will start with the PSU and then move on to the DIMM.
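For readers wondering how a single-bit error can be corrected rather than just detected, here is a minimal sketch using a toy Hamming(7,4) code. This is an illustration of the principle only: real ECC DIMMs use a wider code over 64 data bits, and the function names below are made up for this example.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (list index 0 is position 1)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word):
    """XOR the 1-based positions of all set bits; 0 means no error detected."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def correct(word):
    """Return (corrected_word, error_position); error_position is 0 if clean."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1  # the syndrome is the position of the flipped bit
    return fixed, s


word = encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1  # simulate a single-bit memory error at position 5
print(correct(damaged))  # ([0, 1, 1, 0, 0, 1, 1], 5)
```

A corrected error still gets logged, which is why the event log can fill with entries while the system keeps running.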
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Bad power could also cause RAM errors.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Well, I ordered a new Thermaltake power supply which will be here Wednesday, but I ran the machine on PassMark's tool, which maxes out CPU and memory (from a bootable CD-ROM), and it has now been running almost 24 hours without a hiccup. The only time I see the reboots appears to be when running FreeNAS. PassMark does not test the HDDs, as it doesn't have a driver for the M1015 HBA. I don't know what to think or where to look next. Looking for any advice. I have no problem replacing hardware, but my last test leads me to believe it's not hardware (or at least, if it is, it must be disk or HBA?). What do you think? How can I test the HDDs in the system without running FreeNAS? It was solid at version 9.1.x, started acting up after I went to 9.2.x, and is still acting up on 9.3. I unfortunately updated my ZFS pool (I know!), so going back to 9.1 is not an option. I would think it would still be stable at 9.2.x or 9.3 if it were stable for years on 9.1.x?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Have you run SMART on all your drives?
It seems to me a possible cause is a drive disconnecting and being dropped by the system.
Can you check whether removing a drive and then attaching it back is detected by the system? Can you do that on a spare disk that has a volume?
I had a similar issue with an RR4320 RAID card and a failing drive. I removed the RAID card, and while the drive would still fail, it would not crash the system.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
I'm game for trying. How do I force a SMART test and view results? Also, how would I know which disk could be pulled? I have a RAIDZ2 setup using 6 disks.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
I found how to get SMART to run from the command line. I am getting an error: "Device /dev/da3 [SAT], 1 offline uncorrectable sectors". I didn't have SMART turned on under services. Now what do I do? Should I disconnect da3 and see if the reboots stop (obviously replacing it soon as well, but with RAIDZ2 I should still be protected but degraded, right?)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Post the output of smartctl -a /dev/adawhatever for all drives in pastebin and link to it here, please.
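For anyone following along: a self-test is typically started with something like `smartctl -t short /dev/da3`, and `smartctl -a /dev/da3` then prints the full report. As a rough sketch of what to look for in that report, the snippet below scans the attribute table for a few counters that usually indicate bad sectors. The sample text is made up for illustration and is not the poster's output; the column layout is assumed to be the usual ATA attribute table, and attribute names can vary by drive vendor.

```python
# Attributes where a nonzero raw value usually points at bad sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def flag_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Hypothetical sample, shaped like the attribute table in a smartctl report.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

print(flag_attributes(SAMPLE))
# {'Current_Pending_Sector': 1, 'Offline_Uncorrectable': 1}
```

A drive that shows up here is worth replacing, but a flagged counter on its own does not explain a reboot.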
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, all drives but da4 have logged errors. I recommend you look into replacing them (starting with da0, which seems to have a bad sector).

That said, these drives aren't failing hard enough to justify reboots, so there's something else going on.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
The console showing "Device /dev/da3 [SAT], 1 offline uncorrectable sectors" is not concerning? The disk doesn't show as offline or anything in the GUI which is weird to me.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
That's a bad sector, too. da0 and da3 are the most urgent for replacement, but it's hard to say exactly what's going on.
 

electricd7

Explorer
Joined
Jul 16, 2012
Messages
81
Any reason to believe that 9.2 or 9.3 would be less stable on my build? I believe I have used only parts which are compatible and suggested in my build. Just weird that I didn't see issues like this until upgrading to 9.2.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No, but have you tried a brand new install of 9.3 on a new USB drive? You never know if the OS is corrupted (well, you do starting with 9.3).
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Now what do I do? Should I disconnect da3 and see if the reboots stop (obviously replacing it soon as well, but with RAIDZ2 I should still be protected but degraded, right?)

Well, I wouldn't go that route unless you already have a backup of your current volume or you don't care about losing your data.
Running in degraded mode is fine for a short while, but there is a risk: if you lose another disk you will not have any redundancy, and if you lose one more your pool becomes unavailable until reboot or permanently.
The risk is that all your drives are showing some sort of failure; the one I am concerned about is the one showing a high Raw_Read_Error_Rate count.
Surprisingly, Reallocated_Sector_Ct is zero on all drives but one.
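
The redundancy arithmetic above can be written out as a tiny sketch. It assumes a single RAIDZ vdev where the parity level is the number of disks that can fail before data is lost (2 for RAIDZ2); the function name is made up for this example, and it ignores everything else that affects pool health.

```python
def raidz_state(parity_disks, failed_disks):
    """Return (status, further_failures_tolerated) for a single RAIDZ vdev."""
    if parity_disks < 0 or failed_disks < 0:
        raise ValueError("disk counts must be non-negative")
    remaining = parity_disks - failed_disks
    if failed_disks == 0:
        return "healthy", remaining
    if remaining >= 0:
        return "degraded", remaining
    return "unavailable", 0


# RAIDZ2 (parity level 2), pulling disks one at a time:
for failed in range(4):
    print(failed, raidz_state(2, failed))
# 0 ('healthy', 2)
# 1 ('degraded', 1)
# 2 ('degraded', 0)
# 3 ('unavailable', 0)
```

Pulling da3 on purpose would put the pool on the second row, with one more failure to spare, while the other drives are already logging errors.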

Do you have a spare drive you can experiment with, one where data loss is of no concern?
I am curious whether your HBA card is not letting your drive reconnect. I also suspect a timeout between FreeNAS and the HBA card could cause the reboots.
What I would like you to do is power your system off and unplug all the HDDs that are part of the RAID-Z2 volume.
Connect your spare drive, turn the system back on, and create a volume as Stripe with that spare drive.
Create a fault by disconnecting the SATA connection from the drive or unplugging power to the drive, whichever is easier. Be careful not to shake the drive around, or you could damage it while it spins.
If you are running ssh or the shell, or simply looking at the display, you will see some error messages.
Plug the drive back in (this is what is called hot plugging); what I want you to see is whether the drive is recognized by the system or not.
Do this a few times, making sure you can also reconnect the volume within FreeNAS, and report back your findings.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
DO NOT PULL A DRIVE'S POWER CABLE. MORE IMPORTANTLY, DO NOT REINSERT IT WITH THE SERVER RUNNING.

SATA cables are not generally designed for hotplugging. Only SATA backplanes are. Hotplugging SATA power can lead to the system crashing because the PSU sensed a surge condition.
 