jgreco
Resident Grinch
I'm curious what kind of insane conditions "high heat load" means. Broadcom's spec sheet shows a maximum operating environment temperature of 55°C (at which, even with perfect heat transfer, the card's chip would be at best 55°C), and in this application the RAID capabilities of the cards aren't even being used, so I'm not sure how we get to a spot where these cards are overheating.
This really is an honest question, because the proffered wisdom conflicts with my existing understanding of processor and chip temperature limits.
No, it really is a good honest question, and on re-reading this thread it might be hard to piece the bits together, so let me throw some stuff out there, and then I expect you to ask further questions if you have any that I can answer.
I'm not going to limit myself here to any particular controller. In addition to the HBA controllers used in FreeNAS for HDD's, I've worked with a variety of LSI 1068, 2008, 2108, 2208, 2308, 3108, etc., ROC chipsets over the years; it isn't worth my effort to recall the exact specifics behind every statement I make, and it would also be misleading to assume the points don't have general relevance across the product line.
The LSI cards are based on "ROC" ("RAID-on-Chip") chipsets. You have a driver on your host OS that communicates with the card, which has a little CPU with one or two cores on it, which then in turn talks to the drives over SAS. In the case of an IT-mode HBA, it basically just "passes data" back and forth to the drives. Most of the HBA's are sold in "IR" mode, which is similar to IT mode except that it also includes basic RAID functionality such as RAID1, and "IR" mode is a little slower than "IT" mode due to the extra code that's executed by the card. Every time you talk to the HBA, though, you're making the HBA's CPU work, which in turn dissipates a little more power as heat.
Now you may recall the 1990's, an era we began with CPU's that didn't even carry a heatsink (386DX/40) and ended with CPU's that carried massive heatsinks along with fans (Pentium III): as you increase speed and complexity, more watts burn. LSI's ROC cards are basically little CPU's, and in order to hit their performance targets, they are relatively fast CPU's. They've been pressured by the speed increase of PCIe 2 -> PCIe 3, the SAS speed increase from 3Gbps to 6Gbps, and the introduction of faster HDD's and also SSD's, all of which present massive challenges to their little ROC CPU's. So they began increasing clock speeds and also went to two cores. At the same time, they were benefiting from the normal die shrink advances, so power consumption didn't jump disastrously, but in general, the current cards burn around 10-15 watts. You're trying to get rid of that from a fairly weedy heatsink.
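To put rough numbers on that (my own back-of-the-envelope with assumed thermal resistances, not anything out of an LSI datasheet): steady-state chip temperature is just ambient plus power times the chip-to-air thermal resistance.

```python
# Back-of-envelope chip temperature estimate: T_chip = T_ambient + P * theta_sa.
# The thermal resistance figures below are illustrative guesses, not LSI specs.

def chip_temp(ambient_c: float, watts: float, theta_sa: float) -> float:
    """Steady-state chip temperature given ambient temp (degC), power (W),
    and total chip-to-air thermal resistance (degC per watt)."""
    return ambient_c + watts * theta_sa

# A small passive heatsink with weak airflow might be ~4-6 degC/W (assumed).
for theta in (4.0, 5.0, 6.0):
    print(f"theta={theta} degC/W -> chip at {chip_temp(35.0, 12.0, theta):.0f} degC")
# With 12 W and 5 degC/W, a 35 degC chassis puts the chip near 95 degC --
# right around where these parts reportedly start misbehaving.
```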
I suspect that @Redcoat might have a better technical explanation, but what it comes down to is that the efficiency of a heatsink is tied to the temperature of the surrounding air: with a lower environmental temperature there's a larger delta between the chip and the air, so a small amount of cold air can carry away as much heat as a large amount of warm air. It becomes increasingly difficult to get the transfer as the air approaches the temperature of the thing being cooled.
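Another way to see it is the air-side energy balance: the heat an airstream can carry off is mass flow times how much you let the air warm up. A minimal sketch using standard sea-level air constants (the 12 W load is an assumed figure):

```python
# Airflow needed to carry off a given heat load: Q = m_dot * cp * dT.
# Sea-level air: density ~1.2 kg/m^3, cp ~1005 J/(kg*K); 1 CFM = 4.72e-4 m^3/s.

def cfm_required(watts: float, delta_t_c: float) -> float:
    """CFM of airflow needed to remove `watts` while letting the air warm by delta_t_c."""
    rho, cp, m3s_per_cfm = 1.2, 1005.0, 4.72e-4
    return watts / (rho * cp * delta_t_c * m3s_per_cfm)

for dt in (20.0, 10.0, 5.0):
    print(f"air allowed to warm {dt:>4.0f} degC -> {cfm_required(12.0, dt):5.1f} CFM")
# Cold inlet air can be allowed to warm 20 degC and still stay "cool"; air that
# arrives nearly as hot as the chip can only warm a few degrees, so you need
# several times the flow to move the same 12 W.
```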
In addition, your typical LSI RAID controller (not HBA) carries components such as the battery or supercapacitor that have a tighter environmental operating range, so especially on some of the RAID controllers, the max environmental is pretty low, like 45°C (113°F). In a server, the inlet air often passes through a set of hot-running 10K RPM HDD's first, so these really need to be run in a data center with some fairly good air conditioning.
The RAID chip itself often runs much hotter, as reported by its onboard sensor, and you'll see lots of people freaking out because theirs is reporting 70°C-85°C, but this isn't really outside the realm of what seems to be reported in a lot of servers. A 9270-8i in one of our 2U's here reports at 61°C, which seems fine. The chips are fairly hardy and seem to be able to run at high temperatures. However, many of the LSI's are known to become problematic once they're reporting 90+°C, despite the spec sheet for some of them claiming a 115°C max temp. At those temperatures, they appear to start corrupting data: RAID controllers report problems with the member drives (probably related to that corruption), and HBA's start spewing incoherent data, reading and writing sectors with corrupt bits.
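If you'd rather measure than guess, MegaRAID-family cards expose the ROC sensor through Broadcom's storcli. A hedged sketch, assuming storcli64 is on the PATH and the controller enumerates as /c0 (some IT-mode HBA firmware doesn't report a temperature at all, and storcli's output format varies by version):

```python
# Poll the RAID-on-Chip temperature sensor via Broadcom's storcli utility.
# Assumes storcli64 is on $PATH and the controller enumerates as /c0;
# not all firmware (notably some IT-mode HBAs) exposes this sensor.
import re
import subprocess

def roc_temp_c() -> int | None:
    """Return the ROC temperature in degC, or None if it can't be read."""
    try:
        out = subprocess.run(
            ["storcli64", "/c0", "show", "temperature"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    # Look for the first number following the word "temperature" in the output.
    match = re.search(r"temperature[^\d]*(\d+)", out, re.IGNORECASE)
    return int(match.group(1)) if match else None

temp = roc_temp_c()
if temp is not None:
    print(f"ROC at {temp} degC" + (" -- investigate your airflow" if temp >= 90 else ""))
```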
So the trick here is to keep good airflow over them. This is easier in a server, where you can arrange plenum spaces for the air and create shrouds to guide it if needed, so that multiple fans are always cooling the system. It's trickier in a PC, where air is not being actively forced through the fins of the heatsink. @wblock suggested putting a 40mm fan on top of it, and that will work very well for the lifetime of the 40mm fan.
But there are a bunch of problems with that.
First, the 40mm fan you're likely to be using is something like a 10mm-thick fan, because those are common and easily screwed to the heatsink. However, these are typically designed just for air movement, not for high static pressure such as forcing air through the fins of a heatsink (significant resistance), so you are setting up a situation that may shorten the fan's life.
Second, running fans in unusually warm environments requires fans that are designed for that. When warmed, material properties can change somewhat and metal expands, and given the precision construction of a fan, constantly running it up against a heatsink that is 150°F/65°C can significantly shorten its lifespan. The average consumer-grade fan sold to PC enthusiasts is very different in quality from industrial-grade stuff like the high-quality Sanyo Denkis.
Third, fans are rated for a given number of operational hours. These numbers aren't always publicized, but just for giggles I tracked down some numbers for the high-quality Sanyo Denki ("San Ace") Long Life 40L fans. It's a top-of-the-line, 26mm-thick unit, rated for 100,000 hours - about 11 years - when run at 60°C with free airflow (no backpressure). This is really some of the best stuff out there that you can get, and even their definition of that lifetime is only that 90% of the fans should reach that age under those conditions. Your average cheap-grade fan is more like a 30,000-hour affair, and that's when it isn't stressed; failure rates under stress are significantly higher.
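For a sense of what heat does to those ratings, here's a sketch using the common rule of thumb (my assumption, not Sanyo Denki's published math) that fan life roughly halves for every 10°C run above the rating temperature:

```python
# Rated fan life in hours -> years, with a hedged rule-of-thumb derating:
# assume life roughly halves for every 10 degC above the rating temperature.
HOURS_PER_YEAR = 8760

def life_years(rated_hours: float, rated_temp_c: float, actual_temp_c: float) -> float:
    derate = 2 ** ((actual_temp_c - rated_temp_c) / 10.0)  # assumed halving interval
    return rated_hours / derate / HOURS_PER_YEAR

# San Ace Long Life 40: 100,000 h at 60 degC -> ~11.4 years at spec.
print(f"{life_years(100_000, 60, 60):.1f} years at rated conditions")
# A 30,000 h consumer fan rated for 40 degC, pressed against a 65 degC heatsink:
print(f"{life_years(30_000, 40, 65):.1f} years in the hot spot")
```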
Fourth, if you need a fan mounted on the heatsink, that implies your HBA is already in a location in the chassis with poor airflow, and that the fan's own operating temperature is already higher than the system's average. So when the fan sitting on top of the heatsink dies, it acts as a huge insulator (because airflow through the fins is now blocked) in an area that already had heat troubles, and it will mainly serve to help cook the chip.
Fifth, 40mm fans seem to be more of an engineering challenge, because they need to spin faster than a larger (60mm, 80mm) fan to move the same volume of air. I *suspect* a 60mm fan would be a better choice; see the sketch below.
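The suspicion comes from basic fan-law reasoning: for geometrically similar fans, flow scales roughly with swept area times RPM, so a small fan has to spin disproportionately fast. An idealized comparison, not manufacturer data:

```python
# For geometrically similar fans, flow ~ swept_area * rpm, so matching a 60mm
# fan's CFM with a 40mm fan takes roughly (60/40)^2 = 2.25x the RPM.
# (Idealized fan-law comparison; real blade and hub geometry will vary.)
import math

def rpm_ratio(small_mm: float, large_mm: float) -> float:
    """RPM multiplier the small fan needs to match the large fan's flow."""
    area = lambda d: math.pi * (d / 2) ** 2
    return area(large_mm) / area(small_mm)

print(f"40mm fan needs ~{rpm_ratio(40, 60):.2f}x the RPM of a 60mm fan")
print(f"40mm fan needs ~{rpm_ratio(40, 80):.2f}x the RPM of an 80mm fan")
# Noise and bearing wear climb with RPM, which is part of why small fast fans die young.
```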
I would say that a warm-ish chip temperature without a fan is preferable to a cooler temperature that depends on a fan, when that fan's eventual failure could leave the chip hot enough to write corrupt data into your pool.
The normal solution to dealing with physical unreliability is redundancy - the obvious example being RAID itself, which everyone here is hopefully familiar with. This gentleman placed a fan over the top of the card to blow air down onto it, positioned so that, should the fan fail, it doesn't interfere with the system's normal ability to blow air across the card. That's pretty good. I would consider using two 60mm fans, as the cost of fans is pretty low and the value of pool data is usually pretty high.
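The same arithmetic that justifies RAID justifies the second fan. With assumed, illustrative failure probabilities and independent failures:

```python
# Probability that cooling is lost entirely, for one fan vs. two independent fans.
# The per-interval failure probability is an assumed illustrative figure.

def p_all_fail(p_one: float, n_fans: int) -> float:
    """Chance that every one of n independent fans fails in the same interval."""
    return p_one ** n_fans

p = 0.10  # assume a 10% chance a given fan dies within the interval of interest
print(f"one fan:  {p_all_fail(p, 1):.1%} chance of losing airflow")
print(f"two fans: {p_all_fail(p, 2):.1%} chance of losing airflow")
# Two cheap fans drop the exposure from 10% to 1% -- cheap insurance for a pool.
```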
But this doesn't actually deal with the root problem, which is that the heatsink is too small: you're trying to dissipate a lot of heat from insufficient surface area. Now I'm kinda dyin' to know what the actual distance between the clips is on those Dell boards. I have a small pile of brand-new, fantastic Alpha Novatech U60C's here that were meant to dissipate 29W TDP Pentium III 1GHz CPU's in a 1U platform; I have a bad feeling they're just a little too big (2.375" square).