Cooling for a Dell PERC H310

jgreco · Feb 28, 2018

Nick2253 said:
I'm curious what kind of insane conditions "high heat load" means. Broadcom's spec sheet shows a maximum operating environment temperature of 55*C (during which, at best with perfect heat transfer, the card's chip is 55*C), and in this application the RAID capabilities of the cards aren't even being used, so I'm not sure how we get to a spot where these cards are overheating.

This really is an honest question, because the proffered wisdom conflicts with my existing understanding of processor and chip temperature limits.

No, it really is a good honest question, and on re-reading this thread it might be hard to try to piece together the bits, so let me throw some stuff out there and then I expect you to ask further questions if you have any that I can answer.

I'm not going to limit myself here to any particular controller. In addition to the HBA controllers used in FreeNAS for HDD's, I've worked with a variety of LSI 1068, 2008, 2108, 2208, 2308, 3108, etc., ROC chipsets over the years, and it is simultaneously not worth my effort to recall the exact specifics for every statement I make while also possibly misleading to assume that there isn't general relevance across the product line.

The LSI cards are based on "ROC" chipsets which are "RAID-on-a-Chip". You have a driver on your host OS that communicates with the card, which has a little CPU with one or two cores on it, which then in turn talks to the drives over SAS. In the case of an IT-mode HBA, it basically just "passes data" back and forth to the drives. Most of the HBA's are sold in "IR" mode, which is similar to IT mode, except that it also includes basic RAID functionality such as RAID1, and "IR" mode is a little slower than "IT" mode due to the extra code that's executed by the card. Every time you talk to the HBA, though, you're causing the HBA's CPU to work, which in turn burns a little more heat.

Now you may recall during the 1990's, an era where we began the decade with CPU's that didn't even carry a heatsink (386DX/40) and ended with CPU's that had massive ones along with fans (Pentium III), as you increase the speed and complexity, more watts burn. LSI's RoC cards are basically little CPU's, and in order to hit their performance targets, they are relatively fast CPU's. They've been pressured by the speed increase of PCIe 2 -> PCIe 3, the SAS speed increase from 3Gbps to 6Gbps, and the introduction of faster HDD's and also SSD's, all of which present massive challenges to their little RoC CPU's. So they began increasing clock speeds and also went to two cores. At the same time, they were benefiting from the normal die shrink advances, so power consumption didn't jump disastrously, but in general, the current cards burn around ~10-15 watts. You're trying to get rid of that from a fairly weedy heatsink.

I suspect that @Redcoat might have a better technical explanation, but what it comes down to is that the efficiency of a heatsink is tied up with the environmental temperature, and so if you have a lower environmental temperature, there's a higher delta between the two temperatures, and a smaller amount of cold air can have a similar efficiency to a large amount of warm air. It becomes increasingly difficult to get the transfer as air gets closer to the temperature of the thing being cooled.

In addition, in your typical LSI RAID controller (not HBA), there are components such as the battery or supercapacitor that have a tighter environmental operating range, so especially on some of the RAID controllers, the max environmental is pretty low, like 45'C (113'F). In a server, often the air inlets through a set of hot-running 10K RPM HDD's, so these really need to be run in a data center with some fairly good air conditioning.

The RAID chip itself often runs much hotter, as reported by its onboard sensor, and you'll see lots of people freaking out because theirs is reporting 70'C-85'C, but this isn't really outside the realm of what seems to be reported in a lot of servers. A 9270-8i in one of our 2U's here reports at 61'C, which seems fine. The chips are fairly hardy, and seem to be able to run at high temperatures. However, many of the LSI's are known to become problematic once they're reporting as 90+'C, despite the spec sheet for some of them reporting a 115'C max temp. At those temperatures, they appear to start corrupting data. RAID controllers report problems with the member drives (probably related to that corruption), HBA's start spewing incoherent bits, appearing to read and write sectors with some corrupt bits.

So the trick here is to keep good airflow over them. This is easier in a server, where you can arrange plenum spaces for the air and create shrouds to guide air if needed, so that multiple fans are always cooling the server. This is trickier in a PC, where the air is not being actively forced through the fins of the heatsink. @wblock suggested putting a 40MM fan on top of it, and that will work very well for the lifetime of the 40MM fan.

But there are a bunch of problems with that.

First, the 40MM fan you're likely to be using is something like a 10MM-thich fan, because those are common, and easily screwed to the heatsink. However, these are typically designed just for air movement, not designed for high static pressure, such as forcing air through the fins of a heat sink (significant resistance), so you are setting up a situation that may shorten the fan's life.

Second, running fans in unusually warm environments requires fans that are designed for that. When warmed, material properties can change somewhat, and metal expands, and the precision construction of a fan means that if you are constantly running it up against a heatsink that is 150'F/65'C, you can significantly shorten its lifespan. The average consumer-grade fan typically sold to PC enthusiasts is very different in quality from the industrial grade stuff like the high quality Sanyo Denkis.

Third, fans are rated for a given number of operational hours. These numbers aren't always publicized. But just for giggles I tracked down some numbers for high quality Sanyo Denki ("San Ace") Long Life 40L fans. It's a top of the line, 26MM thick, and they are rated for 100,000 hours - 11 years - when run at 60'C with free air flow (no backpressure). This is really some of the best stuff out there that you can get. Their definition of that lifetime is still only that 90% of the fans should get to that age under those conditions. Your average cheap grade fan is more like a 30,000 hour affair, and that's not when stressed, and failure rates are significantly higher.

Fourth, if you need a fan mounted on the heatsink, then this implies your HBA is already in a location in the chassis with poor airflow. So what happens when your fan sitting on top of the heatsink dies, is that it acts as a huge insulator (because airflow is blocked) in an area that already had heat troubles. This implies that the operating temperature of the fan is already higher than the system's average temperature. And when the fan fails, it will mainly act to help cook the chip.

Fifth, 40MM fans seem to be more of an engineering challenge because they need to spin faster than a larger (60MM, 80MM) fan for the same volume of air. I *suspect* a 60MM fan would be a better choice. See below.

I would say that a warm-ish temperature without a fan is preferable to a cooler temperature with a fan that could eventually result in a very hot temperature for the chip that causes corrupt data to be written into your pool.

The normal solution to dealing with physical unreliability is redundancy - the obvious example is that everyone here is hopefully familiar with what RAID is. This gentleman placed a fan over the top of the card to cause air to blow down the card, but without interfering with the system's normal ability to blow across in the normal direction, should the fan fail. That's pretty good. I would consider using two 60MM fans, as the cost of fans is pretty low, and the value of pool data is usually pretty high.

But this doesn't actually deal with the root problem here. The root is that the heatsink is too small, so that you are trying to dissipate a lot of heat from insufficient surface area. Now I'm kinda dyin' to know what the actual distance between the clips is on those Dell boards. I have a small pile of brand new fantastic Alpha Novatech U60C's here that were meant to dissipate 29W TDP Pentium III 1GHz CPU's in a 1U platform, I have a bad feeling they're just a little too big (2.375"sq).

Evertb1 · Feb 28, 2018

jgreco said:
But this doesn't actually deal with the root problem here. The root is that the heatsink is too small, so that you are trying to dissipate a lot of heat from insufficient surface area. Now I'm kinda dyin' to know what the actual distance between the clips is on those Dell boards. I have a small pile of brand new fantastic Alpha Novatech U60C's here that were meant to dissipate 29W TDP Pentium III 1GHz CPU's in a 1U platform, I have a bad feeling they're just a little too big (2.375"sq).

I ordered the card and just received it. Yes, the heatsink is a marginal thing. And if I look at the way the heatsink is mounted, with the spring-steel bracket and diagonally placed clips, I think it won't be easy to replace it with a better one. I am thinking about cutting an old heatsink (used for one of those hot headed AMD CPU's) to size. But the spring-steel mounting bracket will be a problem. As for the distance between the clips: see the picture I included.

Ericloewe · Feb 28, 2018

There's a fairly good solution to the 40 mm fan problem: Monitor it.

Any serious contender is going to be available in 3-pin versions and probably 4-pin (though PWM control might be more hindrance than help here). This is withing reach of mere mortals who don't want to order San Aces from Mouser and then crimp the connector to the fan wires, in the form of the Noctua NF-A4x20-FLX. I'm confident that Sanyo Denki still has the superior product (they certainly have the superior datasheet), but Noctua is pretty good and this model is probably the most appropriate in the 40 mm range. Static pressure is decent, comparable to the 120 mm fans they recommend for radiator usage (in other words, well below "suck air through a full 4U's front panel" but well above "blow air onto LSI's crappy little extruded aluminum heatsink").

When inevitable failure happens, fan monitoring via IPMI would ramp up all other fans to maximum and send out alerts, if configured to do so.

As an aside, I just noticed that the SFF-8087 connectors block much direct airflow to the heatsink. That's nasty, and a point in favor the the other style of card, with the connectors next to the back of the server.

jgreco · Mar 1, 2018

Ericloewe said:
There's a fairly good solution to the 40 mm fan problem: Monitor it.

I'm not sure I can agree with that. The problem that I object to is that the failure mode of the HBA in overheat is data corruption, so basically you're just shortening the window in which pool damage can happen. This seems to me like the people who don't wear seat belts because the air bags should save them.

When inevitable failure happens, fan monitoring via IPMI would ramp up all other fans to maximum and send out alerts, if configured to do so.

Make sure to set the alert message to "LSI controller may be shooting shrapnel into your pool" for maximum correctness.

As an aside, I just noticed that the SFF-8087 connectors block much direct airflow to the heatsink. That's nasty, and a point in favor the the other style of card, with the connectors next to the back of the server.

No, there's sufficient space between the top of the connectors and the adjacent card. In a server, this is non-problematic, and even in a PC case, you'd probably need to be blowing air straight in for it to make a huge difference.

jgreco · Mar 1, 2018

Evertb1 said:
I ordered the card and just received it. Yes, the heatsink is a marginal thing. And if I look at the way the heatsink is mounted, with the spring-steel bracket and diagonally placed clips, I think it won't be easy to replace it with a better one. I am thinking about cutting an old heatsink (used for one of those hot headed AMD CPU's) to size. But the spring-steel mounting bracket will be a problem. As for the distance between the clips: see the picture I included. View attachment 23157

Ah, there, that's useful. So, taking that and measuring just the heatsink, that looks like it is probably a standard 45MM square heatsink, and it's oriented for horizontal forced airflow. That should be highly replaceable - just pick something that allows for the spring to be re-used. But it isn't clear what the best option would be. Alpha's N45-25B is designed for passive cooling and would probably be the sort of thing you'd want to look towards in a PC chassis. While a PC isn't actually totally convection cooling, you have a situation with less-defined airflow. The original purpose of fans in PC's was simply to avoid the box becoming a convection oven, but as thermal requirements have increased, it's more of a mixed bag. Anyways my napkin calcs and general instinct suggest that the N45 falls short of what is needed by a fair bit. I guess the full crazy solution would be the CS45-50B, which seems like it should be fine, but it would be blocking the next few PCIe slots, plus you have the sheer weight, which is a factor because of the retention mechanism, which will be holding the thing on its side in a desktop, and upside down in a midtower. That might actually be fine but they are things to consider. There is probably an option somewhere in the middle that would work, but I'm not invested enough in this to do the work to figure out which one. I've basically listed all the factors I think I'd be thinking about. This is only a ~10-15 watt application. It can't be that hard.

Anyways, this seems like we've burned through most of the ins and outs. I happened to notice that it appears Alpha will sell to end-users directly at a fairly reasonable cost. It looks like US based folks can probably get any of their replacement products for an application like this for less than $20 shipped.

https://www.alphanovatech.com/en/c_cs45e.html

https://www.alphanovatech.com/en/c_n45e.html

None of this is an endorsement or a recommendation. I've sourced heatsinks from Alpha before and they have a good selection of products, plus significant (but pricey) customization capabilities. But either do all the calculations and know what you're buying ... um ... here this looks adequate:

https://www.electronics-cooling.com/1995/06/how-to-select-a-heat-sink/

Evertb1 · Mar 2, 2018

@jgreco you went above and beyond with al the information you provided. Thanks for that. Who would have thought that one innocent question would lead to all this :).
About that heatsink: it's not exactly 45 mm square but close enough. And there is some room to play with. Re-using that spring is a bit iffy for 2 reasons. The groove in the heatsink is dented a bit to keep the spring in it's place. So I won't be able to free it up without doing some damage to the heatsink. And the bottom of the heathsink is very thin. So, almost any other heatsink with a thicker bottom will put some more stress on the spring and thus on those clips. Besides of that I have a hard time locating an alternative heathsink. Those Alpha's are looking good but I can't find them (yet).

In the meantime I have re-aranged my case a little and removed one of the HDU cages. The card is now in the middle of fresh breeze coming directly from an 140 mm Noctua fan pulling fresh air from outside the front of the case. With nothing between the fan and the card. That should do for now.

jgreco · Mar 2, 2018

Well it took me like two hours to write that reply. I sometimes sit here while waiting for my boring work to grind away to completion. Some of my posts sound strange because I am writing sentences one every ten or twenty minutes, and I don't always remember exactly where I was going. For me, this is often just a fun distraction or way to pass the time waiting for stuff to finish. It's great when my rambling is actually helpful!

Redcoat · Mar 2, 2018

jgreco said:
Well it took me like two hours to write that reply. I sometimes sit here while waiting for my boring work to grind away to completion. Some of my posts sound strange because I am writing sentences one every ten or twenty minutes, and I don't always remember exactly where I was going. For me, this is often just a fun distraction or way to pass the time waiting for stuff to finish. It's great when my rambling is actually helpful!

Well, I went down with the flu just about when you prompted me to make another contribution (I suspect that @Redcoat might have a better technical explanation) - when I got out of bed I was able to enjoy everything you put together, especially the "electronics-cooling" article, a simple version of which I had been dreaming about writing.

Keep rambling, please.

Evertb1 · Mar 3, 2018

@Redcoat get well soon. To me having the flu is always a great excuse to take some of my favorite whisky. It's for medical purposes honey really;)

In the mean time I think I have no worries anymore about that hotheaded H310. After opening up my case to get a healthy airflow directly to it I saw idle temps of 46-50°C (ROC) and 35-37°C (IOC). Not freezing but not to bad. And this was with the two intake fans and 1 outtake fans on a medium setting (7 volts). They are connected to a controller in the case that gives them 5, 7 or 12 volts. If I put them on 12 volts the temps are going down another 4-7 degrees.

But as I still had an 80 mm Noctua fan lying around doing nothing useful, I decided to jury rig it in a PCI slot. I left some room between the H310 and the fan. If the Noctua fails there will be still some airflow from the intake fan. If the intake fan fails the Noctua will keep the temps reasonable.

With the Noctua running (with a low noise adapter) I am looking at temps of 33 - 35°C (ROC) and 30°C (IOC). Not bad at all. With the Noctua disconnected I again see temps of 46-50 (ROC) and 35-37 (IOC). If I disconnect my upper intake fan the Noctua keeps the ROC at around 40 and the IOC around 33. So yes, a fan close to the H310 is more efficient in cooling the H310 then the intake fan is. Now I need to see how it is holding up on a full workload. But it looks promising.

Redcoat · Mar 3, 2018

Evertb1 said:
@Redcoat get well soon. To me having the flu is always a great excuse to take some of my favorite whisky. It's for medical purposes honey really;)

Thank you for the good wishes. To be honest, I don't usually look for an excuse - whisky, either Irish or Bourbon, is my tipple, and here's my "talisman" for my Cleveland Clinic visits these days:

I had a glass in hand when I opened your post.

Back to "business" - looks like you've got the H310 issue well in hand.

wblock · Mar 4, 2018

Evertb1 said:
With the Noctua running (with a low noise adapter) I am looking at temps of 33 - 35°C (ROC) and 30°C (IOC). Not bad at all. With the Noctua disconnected I again see temps of 46-50 (ROC) and 35-37 (IOC).

Thank you for this. Actual numbers are much more useful than speculation.

Black Ninja · Mar 11, 2018

This was a fascinating discussion. I think the raid cards are the hardest thing to cool down even in a sever chassis , and I am glad that people are addressing the issue cause failing raid card is very bad.

I wish there was no need to add active fan on the card (for all the reasons jgreco mention above), but if you have to , make sure you connect your raid card fan to the motherboard headers , where rpm can be monitored. And as @Ericloewe point out (at least with supermicro) when the fan stops , the MB will make all the fans kick to full speed mode. This way you won't burn your card, in fact I think it will cool it better with the cost of being super loud for a while. But you'll know you have a fan to replace and as soon as you do it will be quiet again.

P.S. I notice in 1U chassis raid cards seems to be cooler that 2U, despite that 2U have bigger fans with bigger fins.

Grinchy · Mar 21, 2018

Thank you for all the Information in this Thread!

But thinking about this, wouldn't it be better for a small Home User to use (Consumer) Marvell Cards instead of LSI? I'm sure the LSI are much faster than normal consumer stuff, but they also use much more energy and generate much more heat than consumer controllers?

The h310 needs about ~10W, (better) Marvel Cards need about ~2W-3W. Sure, you got less Ports, but for a Home User that's not a big Problem in the most cases. Also there shouldn't be a big Problem with heat, cause this cards are made for use in normal PC Cases.

Sorry if this is a stupid question, but after reading this, I'm really concerned about using LSI and get Data Corruption after a Fan Fails and the card runs hot ...

Also, even an Z2 Raid wouldn't be able to detect this kind of Corruption, cause it's caused by the Controller itself. For ZFS the Corrupted Data would be the "correct" Data.

Ericloewe · Mar 21, 2018

Grinchy said:
Also, even an Z2 Raid wouldn't be able to detect this kind of Corruption, cause it's caused by the Controller itself. For ZFS the Corrupted Data would be the "correct" Data.

Not true at all. That's a problem lesser filesystems like btrfs might have, but ZFS will detect any corruption in data unless there's a hash collision. That is very unlikely with the default hash and nearly impossible with skein, sha256 or sha512. Of course, that doesn't mean your data is magically safe, it could still be corrupted beyond ZFS' ability to recover it, but it will be detected as such.

danb35 · Mar 21, 2018

Ericloewe said:
That's a problem lesser filesystems like btrfs might have, but ZFS will detect any corruption in data unless there's a hash collision.

...or unless it gets corrupted on first write, in which case it won't know any better.

Ericloewe · Mar 21, 2018

danb35 said:
...or unless it gets corrupted on first write, in which case it won't know any better.

That's the thing, unless the HBA is recalculating the checksums, ZFS is going to read that back and go "nope". Again, the data may be lost, but ZFS will be aware of it.

danb35 · Mar 21, 2018

I'm not sure that "lost" vs "corrupted" data is an improvement. But how is ZFS aware of it immediately? ZFS sends data to write. Malicious HBA messes with the data, which gets written in messed-up form. Does ZFS reread the data immediately to confirm a good write? Sure, it will read it at some point (during a scrub if nothing else), and find the problem then, but if it doesn't immediately check after writing (which would seem to really slow things down), I don't see how ZFS avoids putting bad data on the disk.

Ericloewe · Mar 21, 2018

danb35 said:
But how is ZFS aware of it immediately?

Ah, not immediately, but as soon as it's read. Nothing can keep bad data from reaching the media, but ZFS does keep it from getting out.

Evertb1 · Mar 21, 2018

Grinchy said:
Sorry if this is a stupid question, but after reading this, I'm really concerned about using LSI and get Data Corruption after a Fan Fails and the card runs hot ...

I was a bit concerned but I am not anymore. I have an 140 mm intake fan directly blowing fresh air to the HBA. And I have an extra 80 mm fan mounted on a PCI slot blowing on the HBA. Each of the fans -and I have tested this- keeps the HBA cool enough to stay well within the limits. Of course if both fans fail at the same time all bets are off (and I will be very disappointed in Noctua). But there is no such thing as absolute security.

Black Ninja · Mar 21, 2018

This post really had me thinking about some airflow improvement I could do. Improving by channeling the airflow instead of adding extra fan. I am so pleased with the results I feel I should share it the community.

The chassis is 1U WIO with LSI 9271-4i. The idle temperature before was 62 C, and after the modification dropped to 46 C. This is 16 C difference just by adding my own air shroud !!! I can "torture" my raid card anyway I can and I am sure it won't go above 55 at any given load.

The air shroud for the CPUand RAM section (first 4 fans) is the original part, it comes with it. The one I added was the one that channels the air from last fan to raid card. I hope the pictures will show it well.

Important Announcement for the TrueNAS Community.

Cooling for a Dell PERC H310

Resident Grinch

Guru

Server Wrangler

Resident Grinch

Resident Grinch

Guru

Resident Grinch

MVP

Guru

MVP

Documentation Engineer

Guru

Explorer

Server Wrangler

Hall of Famer

Server Wrangler

Hall of Famer

Server Wrangler

Guru

Guru

Attachments

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Cooling for a Dell PERC H310"

Similar threads