Rebooting on large file transfers

Status
Not open for further replies.

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
Hi everyone. I was running NexentaStor as an NFS datastore for 3 ESXi hosts, but it kept crashing at random with a machine exception error (during high network load), so I moved everything across to a temporary FreeNAS server while I installed FreeNAS on the original server.

Everything went well and the new server is built, but now the temporary FreeNAS server reboots without warning every time I try to move things back (ssh'd into the new server, mounted the NFS share and using the cp command) or when there's high network load (last night's backup).

Both servers have the following hardware:
TYAN S2892 motherboard with 8GB ECC, dual dual-core Opteron
SATA DOM
Supermicro AOC-SAT2 PCI-X controller card (8x 500GB SATA drives)
Generic JMB363 PCI-E controller card (60GB SSD)

Both are set up with mirrored striped arrays.
I'm not sure if this is hardware related, but the servers ran fine for over three years with Server 2003.

I'm a bit stuffed at this stage as I can't back up to tape due to the reboots. I may end up moving everything back to the ESXi local datastores, but this could take a week because ESXi is so slow at transferring.

Any help would be appreciated.
Perhaps ZFS is not yet ready for production environments..
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
In order to stabilise the system and get it backed up, I'm ditching ZFS-based systems for now and going to Openfiler.
Hopefully I can move everything off the FreeNAS server to Openfiler..
 
Joined
May 27, 2011
Messages
566
ZFS is stable, your system is not. Before you blame your issues on ZFS, check your hardware.

I push and pull 2 Gb/s through my dual-NIC LAGG at the same time with no stability issues (after I fixed my heat issue, i.e. a hardware issue).
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
It's obviously to do with hardware incompatibilities.
I've tried 4 of these identical servers, all having the same issue.
I've started moving everything back to the local ESXi datastores.
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
Hmm, CPU two is running quite warm in both servers at idle (both in a 22°C room).

dev.cpu.0.temperature: 23.0C
dev.cpu.1.temperature: 22.0C
dev.cpu.2.temperature: 37.0C
dev.cpu.3.temperature: 37.0C

CPU two sits towards the back of the motherboard/case, so it obviously cops a lot of heat from everything else.
I will go in once everyone leaves tonight and pull CPU two out. If that doesn't help, I'll start the week-long process of moving everything back to ESXi at 4MB/s.
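
For anyone wanting to compare readings like these automatically, here is a minimal Python sketch. It assumes the text is in the same `dev.cpu.N.temperature` format pasted above. The function names and the 10-degree margin are illustrative assumptions, not anything FreeNAS provides.

```python
import re

# Sample readings copied from the post above.
SAMPLE = """\
dev.cpu.0.temperature: 23.0C
dev.cpu.1.temperature: 22.0C
dev.cpu.2.temperature: 37.0C
dev.cpu.3.temperature: 37.0C
"""

# Matches lines such as "dev.cpu.2.temperature: 37.0C".
TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C$")


def parse_temps(text):
    """Return {core_index: temperature_in_C} for every matching line."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def hot_cores(temps, margin_c=10.0):
    """Return cores at least margin_c warmer than the coolest core.

    margin_c is an arbitrary illustrative threshold, not a vendor limit.
    """
    coolest = min(temps.values())
    return sorted(core for core, t in temps.items() if t - coolest >= margin_c)


temps = parse_temps(SAMPLE)
print(hot_cores(temps))  # cores 2 and 3 stand out against cores 0 and 1
```

This only shows which cores are warm relative to the others; it says nothing about whether 37C is actually a problem for these CPUs.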
 
Joined
May 27, 2011
Messages
566
My heat issue was my controller card, not the CPU. I have a 4U rackmount with very good airflow everywhere except where the expansion bays are. Try placing a fan so it blows across the chips on the card. It could be that the card is overheating and dropping your system, and that it was never stressed this much when you were running Server 2003. The operating temperature range for that card is 0 to 55 C, which is pretty low in my book. If you have a pocket of low airflow, you'll hit that real fast, as the card is passively cooled and has no heat sinks. Looking at the motherboard in question, the 2 PCI-X slots are at the very bottom. The controller card is half height, so if you're in a 4U case you could easily mount an extra fan.

Truth be told, my fan's held in place with zip ties. I haven't had time to do a good job on it yet, but it's functional.
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
OK, there does seem to be a bit of a heat pocket on that side of the case. I've switched the SATA card and the network card over, so hopefully the NIC won't get as hot; it's an HP NC7170.

I may invest in a PCI slot cooler.
 

globus999

Contributor
Joined
Jun 9, 2011
Messages
105
Maybe, or you may have been bitten by the infamous "kmem_map too small" error. This error used to show up in FreeBSD around 2007 (I think version 7) and is due to a lack of ZFS kernel tuning. The kernel panics and reboots within 15 seconds. It leaves no trace (if you are not in front of the console when it happens, you would not know the origin). These are the *exact* conditions under which the error shows up: large transfers, although it is more noticeable if you are transferring a large number of small files. Since FreeNAS 8 does not do any automatic ZFS tuning, you have to do the tuning yourself (good luck here). There are some parameters that may help you; google +zfs +"kmem_map too small" to get a list of forum posts and their resolutions. The interesting bit is that even in 2007 it did not happen on all machines (some were working OK), but it did happen on quite a lot of them. If this is happening with yours, they may be among the affected boxes, or these boxes may simply not be compatible with ZFS unless tuned.
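
As a rough illustration of what that tuning usually looks like, the sketch below prints candidate `/boot/loader.conf` lines for a box with 8GB of RAM, as in the original post. `vm.kmem_size`, `vm.kmem_size_max` and `vfs.zfs.arc_max` are the FreeBSD loader tunables most often mentioned in those forum posts. The helper name and the 75% and 50% fractions are placeholder assumptions for illustration only, not recommended values.

```python
def loader_conf_lines(ram_mib, kmem_fraction=0.75, arc_fraction=0.5):
    """Build loader.conf lines sized as fractions of installed RAM.

    The fractions are illustrative placeholders; suitable values depend
    on the FreeBSD version and workload and need testing on the real box.
    """
    kmem_mib = int(ram_mib * kmem_fraction)
    arc_mib = int(ram_mib * arc_fraction)
    return [
        f'vm.kmem_size="{kmem_mib}M"',
        f'vm.kmem_size_max="{kmem_mib}M"',
        f'vfs.zfs.arc_max="{arc_mib}M"',
    ]


# 8GB of ECC RAM, as listed for the servers in the first post.
for line in loader_conf_lines(8 * 1024):
    print(line)
```

Changes to loader.conf only take effect after a reboot, so each adjustment means another restart and another test transfer.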
 
Joined
May 27, 2011
Messages
566
Can you place the system on a bench, open it up and bring in one of those large box fans? That would rule heat in or out.
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
OK, it's been running all night and hasn't crashed since I moved the card.
Now I just need to work out why only six disks are listed under zpool status.
 
Joined
May 27, 2011
Messages
566
Glad to hear it's better, but I wouldn't stop on the heat issue yet. I would get one of those expansion bay coolers. You may be fine now, but a higher load could cause it to overheat again.
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
Will do. Overnight is the highest the load gets because of the backups, and the backups were the main thing that always killed it.
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
OK, I changed settings in loader.conf to get around the kmem_map errors, to no avail, and have gone back to NexentaStor, which has now been stable for 4 days even with large transfers. I believe the NexentaStor crashes were caused by the volume-check bug.

FreeNAS continued to reboot at random and I couldn't see anything in any logs. It would reboot after 1 second of a large transfer, or after 1 minute.
Once rebooted, I could run the large transfer for 8 hours without error. It seems that when the system has been idle for a long period it can't handle the sudden increase in resource usage and panics.

Another reason for going back to Nexenta was that FreeNAS had issues with transferring many small files. FreeNAS transferred at 50MB/min, while NexentaStor transfers at up to 1GB/min to the same virtual server, same VMDK etc.
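
To put those two rates side by side, here is a small Python sketch of the arithmetic. It assumes 1GB = 1000MB, and the 100GB data set size is a made-up figure for illustration only, not the size of the real datastore.

```python
def minutes_to_transfer(size_mb, rate_mb_per_min):
    """Minutes needed to move size_mb at a steady rate_mb_per_min."""
    return size_mb / rate_mb_per_min


FREENAS_MB_PER_MIN = 50.0      # small-file rate quoted above
NEXENTA_MB_PER_MIN = 1000.0    # "up to 1GB/min", taking 1GB as 1000MB
DATASET_MB = 100_000           # hypothetical 100GB of small files

speedup = NEXENTA_MB_PER_MIN / FREENAS_MB_PER_MIN
freenas_hours = minutes_to_transfer(DATASET_MB, FREENAS_MB_PER_MIN) / 60
nexenta_hours = minutes_to_transfer(DATASET_MB, NEXENTA_MB_PER_MIN) / 60

print(f"speedup: {speedup:.0f}x")
print(f"FreeNAS: {freenas_hours:.1f} h, NexentaStor: {nexenta_hours:.1f} h")
```

Since the NexentaStor figure is an "up to" rate, the real gap on a mixed workload could be smaller than this.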
 
Joined
May 27, 2011
Messages
566
Glad you found something that worked for you. I was under the impression it was working once you got more cooling on your server.

Were you using the x86 or the x64 version of FreeNAS?
 

jimay3677

Dabbler
Joined
Jun 20, 2011
Messages
12
I thought the cooling had helped too, but obviously not.
NexentaStor is still up, but I'm paranoid about the system hanging.

I'm not sure what it was, but something in the server doesn't like FreeNAS, which is a shame.

I was running the x64 version.
 
Joined
May 27, 2011
Messages
566
Maybe it helped, but not enough. Relocating the card may give you better airflow, but not good enough airflow.
 