reboots every 3-4 days

Status
Not open for further replies.

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
I am running 9.3 and my server reboots regularly every 3-4 days, and I have no indication as to why.

It's a Dell PowerEdge R900 with 5x2TB SATA drives set up in RAIDZ2. It has 16 cores and 32 GB of RAM.

I need some advice on figuring out why. I have looked in /data/crash and there is nothing. Logs don't seem to indicate any issues leading up to the spontaneous reboot.

I've been looking for tunables to set to make it more of a debug environment, but alas I have found no good advice.

My plan is to test the memory thoroughly, but beyond that I am stumped.

Any help would be appreciated. Thanks!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You really didn't say much by saying your server is an R900. There's an infinite number of possible hardware configurations... BUT...

http://www.dell.com/us/dfb/p/poweredge-r900/pd

Notice that FreeBSD isn't on the list. I'd bet that your hardware isn't entirely compatible with FreeBSD. Generally, spontaneous reboots are due to hardware failure or hardware incompatibility. So right now everything points to your hardware as the cause.

RAM test is pointless as the box uses ECC RAM. Unless you're getting RAM errors from your server that's not your problem. In fact, doing memory tests with ECC RAM is basically pointless. :P

I'd look at getting more appropriate hardware though (and by appropriate I mean hardware that is more compatible with FreeBSD). One of the biggest problems with servers from Dell, HP, and the like is that they rarely officially support FreeBSD, so their compatibility is often very questionable.
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Thank you, and fair enough.

I had been running FreeBSD 9.x on here while burning her in, and I had no indication of hardware issues. It was also not loaded with disks at the time, so the use scenario is vastly different.

If it was a hardware problem, can you please provide me with some pointers to information on the sorts of logs I need to be looking at or ways to debug this? As I said, I've looked at the various knobs for attempting to catch this information, but I'm at a standstill about what to do next.

Update: I am using a Dell 6/IR SAS RAID Controller Card JW063 because the other controller was forcing me to use the hardware RAID configuration.

Thank you!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't have any specific recommendations because there's SO many places to go looking. /var/log/messages is the best place to look. But hardware problems often simply reboot the box, so no log entry would be necessary. You'll likely have to boot in a kernel verbose mode and look for things out of place.
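For anyone wanting to automate the "look in /var/log/messages" step, here is a rough Python sketch that pulls out the last few lines logged before each boot. It assumes each boot writes the usual kernel copyright banner containing "The FreeBSD Project" to the log; that marker, the function name and the default of 5 context lines are assumptions of this sketch, not anything FreeNAS provides. If nothing useful shows up right before a banner, that fits the point above that hardware faults often reboot the box without logging anything.

```python
from collections import deque

# Assumption: every boot logs the kernel copyright banner containing this text.
BOOT_MARKER = "The FreeBSD Project"


def context_before_boots(lines, marker=BOOT_MARKER, context=5):
    """For each line containing `marker`, return the `context` lines logged before it."""
    recent = deque(maxlen=context)  # rolling window of the most recent log lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            # A boot starts here: keep whatever was logged just before it.
            results.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return results
```

To use it, open the log (for example `open("/var/log/messages", errors="replace")`), pass the file object to `context_before_boots`, and print each returned group.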

I wouldn't recommend that card though. I'm not sure if the driver for that SAS controller is particularly stable or not, but that is one thing that is 'out of place' compared to everyone else from my observation.
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Thank you. I had replaced the Perc RAID card with a non-RAID card because I thought it would interfere with ZFS, but I found out I could just set up each disk as a RAID0 - so I put back in the original controller, set up each disk as a hardware RAID0 and am now playing a waiting game. Thanks for the pointers.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
No, you DO NOT WANT TO DO THAT!
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Just in case I miscommunicated, each of the 5 drives is set up as a RAID0, it is not striped across all 5 drives. This was the lowest option I had via the controller's configuration tool if I wanted the disks to be seen. Is there a preferred RAID level to use in this case? Does it even matter if each virtual disk corresponds to a single physical disk?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Exactly, that's exactly what you must avoid, if you value your data. Hell, it might even be worse than a single RAID volume with ZFS on top, if the RAID controller doesn't suck (big if there).

The only safe option is presenting the drives without any sort of trickery. This means SAS HBAs or SATA AHCI controllers.
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Okay, thank you. I'll keep working it out. This is a NAS that will eventually replace a UFS-based FreeNAS 9.2 setup I have, so I have time to change this up.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yep.. read the warning I put in the hardware recommendations thread...

What not to do...

So I've told you what stuff you should get. Here's stuff you should NOT get:

Note: this list is compiled by people that have used hardware and lost all of their data as a result. If you can't learn from someone else's screwups you probably shouldn't use FreeNAS at all.
  • Highpoint controllers (they'll work until one day when you suddenly have no data)
  • Adaptec controllers (they'll work until one day when you suddenly have no data)
  • Dell PERC RAID controllers like the 5i (if you have to do RAID0 of individual disks you have failed to follow the "do not use hardware RAID" rule... that is NOT a JBOD. End of discussion.)
  • Try to use less than 8GB of RAM (If you plan to use Plex you should have at least 12GB of RAM)
  • Hardware that has a FSB (front-side bus). The FSB will be a performance killer for ZFS
  • Hardware that is older than the Intel Sandy Bridge (older stuff burns LOTS of watts)
  • Anything that doesn't use DDR3 RAM (DDR2 is just too expensive to try to upgrade later)
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Well, I moved over to using an M1015 (need to flash it, but it's working). But I still get the reboot behavior and am going to start investigating other means, such as IRQ conflicts like what I saw in some other threads on the PE2950 having a similar issue.

I can't find any artifacts related to the reboots other than knowing that it happens about every 3 days - regardless of use. I've pounded the machine for 3 days straight and left it idle for 3 days straight, so I think it's going to be some sort of scheduled activity that is happening or accumulating issues until the 3rd day.
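One rough way to test the "every 3 days, regardless of use" hunch is to measure the gaps between boots. The Python sketch below assumes the boot times come from syslog-style lines that start with a timestamp like `Mar  1 02:00:00`. Those timestamps carry no year, so the year is passed in by hand; the helper names are made up for this example.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line (year supplied by caller)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def hours_between_boots(boot_times):
    """Return the gap in hours between each pair of consecutive boot times."""
    ordered = sorted(boot_times)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(ordered, ordered[1:])]
```

Feed it the timestamps of the boot banner lines from the log. Gaps that are nearly identical might point toward something scheduled or a counter running out, while scattered gaps would look more like a hardware fault; that is a heuristic, not a diagnosis.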

Here's the output of `dmesg`, not sure if that is going to be helpful - https://gist.github.com/estrabd/b772f9f4f56f4a23f9e2

Any other places I should look for diagnostic info?

Thank you,
Brett
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Umm... I see a bunch of bizarre errors.. in particular....

link_elf_obj: symbol _sx_assert undefined

linker_load_file: Unsupported file type

link_elf_obj: symbol _sx_assert undefined

linker_load_file: Unsupported file type

link_elf_obj: symbol _mtx_assert undefined

linker_load_file: Unsupported file type

KLD dtraceall.ko: depends on fasttrap - not available or version mismatch

linker_load_file: Unsupported file type

link_elf_obj: symbol _mtx_assert undefined

linker_load_file: Unsupported file type

link_elf_obj: symbol _mtx_assert undefined

linker_load_file: Unsupported file type

That tends to scream that the boot device is fubarred or you've done some serious customizing of the OS to the point that the OS can't even boot properly. In either case I'd get a new boot device, install FreeNAS and import your config file.

If you do a scrub of the boot device or verify the OS files I'd be surprised if you didn't have a crapload of errors.
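To compare a dmesg dump before and after rebuilding the boot device, a small Python sketch like the following can tally the linker errors. The patterns are written only against the exact messages quoted above; the function name is made up for this example, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this thread.
UNDEFINED_SYMBOL = re.compile(r"link_elf_obj: symbol (\S+) undefined")
KLD_DEPENDENCY = re.compile(
    r"KLD (\S+): depends on (\S+) - not available or version mismatch")
LOAD_FAILURE = "linker_load_file: Unsupported file type"


def summarize_linker_errors(dmesg_text):
    """Return (undefined-symbol counts, failed KLD dependencies, load failure count)."""
    symbols = Counter(UNDEFINED_SYMBOL.findall(dmesg_text))
    dependencies = KLD_DEPENDENCY.findall(dmesg_text)
    load_failures = dmesg_text.count(LOAD_FAILURE)
    return symbols, dependencies, load_failures
```

Run it on the saved output of `dmesg`; a clean install should ideally give an empty counter, an empty list and zero.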

Edit: and you are using mfi0 devices. /smite you
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
I haven't customized it, but it's been through multiple updates and several rounds of controller card bingo - my next move was actually to reinitialize the USB nubbins.

Brett
 

estrabd

Dabbler
Joined
Nov 21, 2013
Messages
38
Just for completeness, I wanted to cautiously report success. It seems that doing the following may have stabilized my machine:
  1. updated controller to the M1015
  2. tweaked IRQ settings in R900's BIOS to minimize conflicts
  3. flashed M1015 with latest, discovered the latest firmware was not supported in FreeBSD
  4. updated the ko that FreeBSD uses (rather than reflash the M1015 to downgrade)
  5. recreated the boot USBs
It's been running for 5 days now, all the while "stressing" it by rsyncing some rather large data sets with a historically more stable FreeNAS machine (32 bit Dell PE 1750, FreeNAS 9.2 managing a Dell PowerVault 220s). So far so good. I figure a few more days of good results and I'll be satisfied that the instability has been licked.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And some iXsystems dev would have a heart attack after reading that. It's almost entertaining that you thought that updating the driver for a custom OS was more stable than going to an older revision of the firmware. Ok, it's not "almost entertaining". It's hilarious!

Well, good luck to you. Be sure that when (not if) you have problems with your zpool someday you mention that you did this. It will save many of us more experienced people a lot of time and effort trying to help you debug a problem that is self-inflicted.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I don't drink..

and yes it's offered... by a random noob posting to a ticket. Doesn't make it even remotely recommended advice.

Good luck to you though.
 

simpleton

Cadet
Joined
Mar 8, 2016
Messages
3
I wish I had run across this a few weeks ago.

I too have an R900 that suffers from random reboots. I've played the "card bingo" as you called it. I even went as far as to replace the 64 GB of RAM I have to see if that was the issue.

The difference in our setups is I am using an IBM FC shelf with 16 drives instead of the on-board raid controller.

Can you expand on exactly what you did as far as IRQ settings in the BIOS?


Thanks
 

toadman

Guru
Joined
Jun 4, 2013
Messages
619
In fact, doing memory tests with ECC RAM is basically pointless. :p

Generally yes. The only case(s) where you might discover something would be a bad module and/or board with some weird multiple stuck-at conditions, I suppose (i.e. an error not detectable by the ECC code). But in that condition the system probably wouldn't get very far in the first place.
 