Script to control fan speed in response to hard drive temperatures

Joined
Dec 2, 2015
Messages
730
"man cat" explains that "cat -e" will show a "$" at the end of every line. I'm puzzled by them at the start of the lines though. That is not what I see when I use "cat -e" to view the script on my machine. I wonder if some Windows carriage returns got in there somehow when the script was put on FreeNAS.
 

tmsmith

Dabbler
Joined
Apr 28, 2014
Messages
44
"man cat" explains that "cat -e" will show a "$" at the end of every line. I'm puzzled by them at the start of the lines though. That is not what I see when I use "cat -e" to view the script on my machine. I wonder if some Windows carriage returns got in there somehow when the script was put on FreeNAS.

Just tried to create it in Putty in the CLI and getting the same error message...
 
Joined
Dec 2, 2015
Messages
730
I'm baffled. I was wondering about possible DOS line ending issues, but I did a test on my machine, and those show up as "^M$" in the output of "cat -e". So, that isn't it.

What happens if you go to the directory that contains the script in Putty, and do "perl fancontrol.sh" (without the quotes)?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
I'm puzzled by them at the start of the lines though.

They are not added by cat, they are just the first char of the variables :)

Everything seems ok with this script, but I don't do perl so I can't confirm it 100%.
 

tmsmith

Dabbler
Joined
Apr 28, 2014
Messages
44
What happens if you go to the directory that contains the script in Putty, and do "perl fancontrol.sh" (without the quotes)?

I didn't get any errors when running that. It appears it set everything to 600rpm without checking the actual hard drive temperature. I believe it changed the Fan Mode from Full to Optimal Speed. Doesn't appear to actually do anything if I run it several times.
 
Joined
Dec 2, 2015
Messages
730
What do you get from "which perl"?

What is the output of "smartctl -A /dev/da0"?
 
Joined
Dec 2, 2015
Messages
730
I didn't get any errors when running that. It appears it set everything to 600rpm without checking the actual hard drive temperature. I believe it changed the Fan Mode from Full to Optimal Speed. Doesn't appear to actually do anything if I run it several times.
After studying your original reported error some more, and doing a bit of experimenting on my machine, it looks like that perl script was perhaps being run by sh, rather than perl.

Try changing the first line of the script to:
Code:
#!/usr/bin/env perl


That should let it find perl, rather than worrying about whether you have the correct path to perl.

You should also edit the $min_fan_speed value in your script. You showed that it was set to 1200, but you later reported seeing a fan speed of 600. Set that value to the fan speed at the lowest setting.

For neatness, I would also change the file extension on that script to ".pl" rather than ".sh". You would also need to edit the cron job in the same way.

"sh" is traditionally used to indicate shell scripts, and "pl" would indicate perl scripts. This shouldn't matter, if the shebang line at the start of the script is correct, but it could lead to confusion if anyone looks at the file name and assumes it is an sh script.
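
For anyone who wants to check both suspects at once (the shebang line and DOS line endings), here is a minimal Python sketch. The function name and the sample bytes are made up for illustration; it only inspects the file contents and does not prove which interpreter cron will actually use.

```python
def inspect_script(data: bytes) -> dict:
    """Report the shebang line (if any) and whether CRLF line endings are present."""
    first_line = data.split(b"\n", 1)[0].rstrip(b"\r")
    text = first_line.decode("utf-8", "replace")
    return {
        "shebang": text if text.startswith("#!") else None,
        "has_crlf": b"\r\n" in data,
    }


# Hypothetical file contents, not the real fancontrol script.
unix_style = b"#!/usr/bin/env perl\nuse strict;\n"
dos_style = b"#!/usr/bin/env perl\r\nuse strict;\r\n"

print(inspect_script(unix_style))  # {'shebang': '#!/usr/bin/env perl', 'has_crlf': False}
print(inspect_script(dos_style))   # {'shebang': '#!/usr/bin/env perl', 'has_crlf': True}
```

To run it against a real file, read the file in binary mode and pass the bytes in, for example `inspect_script(open("fancontrol.pl", "rb").read())`.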
 

tmsmith

Dabbler
Joined
Apr 28, 2014
Messages
44
Sorry, I didn't have time over the weekend to touch the server but I will give it a shot this afternoon. Thanks Kevin!
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
PID Fan Control Script with Detailed Logs

EDIT - These scripts and others are now posted and updated in the Resource section
https://forums.freenas.org/index.php?resources/fan-scripts-for-supermicro-boards-using-pid-logic.24/
This post still has a lot of details not posted there, however.

****************************************

Since being inspired by the ideas and facts presented in this thread, especially the tricks that Kevin discovered to control fans, I’ve spent way too much time working on this. My motivation is that the PWM fan control logic in my motherboard is lousy. Much of the time, setting the mode to Standard is inadequate and HeavyIO is overkill. Either way, the temperatures fluctuate a lot, so I don’t see switching between them in a script as an ideal solution.

Instead, adjusting fan duty cycle to maintain a steady temperature seems to be the way to go. There are a couple of problems though:
  1. BMC won’t give up control (EDIT: @PigLover found the solution to this: leaving fan mode on Full. That is incorporated into the script now). It’s hard to develop control logic that works when the BMC independently decides to change duty cycle. This may happen when you set a duty cycle near the upper/lower end, >80% or <30%. Limiting your own settings between those figures doesn’t help much, since the logic also doesn’t work right if you hit a control limit. I’ve taken two approaches to dealing with this:
    • When setting the duty cycle ≥ 50%, first set the mode to HeavyIO, and when setting <50%, first set mode to Standard. This has nothing to do with regulating temperature, it is just less likely the BMC will feel the need to make adjustments in those ranges.
    • Control the temperature so well that fans rarely, if ever, will get near the extremes of duty cycle.
  2. Temperature readings are way too coarse for fine control. Using the temperature of the hottest drive (Tmax) is a good strategy in principle, but you only have 1 C resolution. Let’s say your setpoint (SP) is 36 C. You can successfully keep Tmax in the range 35-37, but you will never do better than that, and it will be very hard to keep it from oscillating. The solution I’ve found is using the mean temperature of all drives (Tmean). Even with slight temperature change, chances are that one of the drives will cross the degree threshold and its temperature reading will change. Using Tmean (with at least 1 decimal place) gives you much more and earlier information about temperature trends in your drives.
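
A small Python sketch of the Tmax versus Tmean point above. The readings are illustrative whole-degree values for seven drives, and the function name is made up for the example; it is not taken from the scripts.

```python
def summarize(temps_c):
    """Return (Tmax, Tmean rounded to 2 decimals) for whole-degree drive temps."""
    return max(temps_c), round(sum(temps_c) / len(temps_c), 2)


# Seven drives, each reported in whole degrees C.
before = [34, 34, 35, 34, 31, 32, 32]
# Two drives each warm by one degree; the hottest drive does not change.
after = [35, 34, 35, 35, 31, 32, 32]

print(summarize(before))  # (35, 33.14)
print(summarize(after))   # (35, 33.43)
```

Tmax reads 35 in both cases, while Tmean already shows the drives warming by about 0.3 C.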
First, a few basics in case there are any fan newbies. There are two fan settings you can control: mode and duty cycle. You can read those plus RPM.
  1. Mode. There are 3-5 modes, depending on the board. Mine has Standard, HeavyIO, and Full. Full is full speed all the time (unless you change it with an ipmitool raw command). Standard and HeavyIO have some degree of speed control based on board temperatures. HeavyIO is a higher speed at a given temperature. The details don’t seem to be well documented, but that doesn't matter, since it works very poorly anyway.
  2. Duty cycle and RPM. Duty cycle is a percentage of full power applied to the fans. It is correlated with RPM, actual fan speed, but you can’t set RPM directly. And when you read RPM it is rounded to the nearest 100.
I read whatever I could understand on PID control. Turns out, as Bidule0hm said, it doesn’t have to be as complicated as they make it out to be. Here’s how it works. Based on temperature error from your setpoint, you calculate three corrections (P, I, and D) to adjust the duty cycle. Each has a tuning constant: Kp, Ki, and Kd. Note that T (time between control cycles) is sometimes included in the calculation of I and D. It makes it a bit more complicated, but I think then your tuning constants don’t break if you change T. However, I’ve never changed T to see what happens.

First you calculate the current error as ERRc = Tmean – SP. Most sources say to calculate error the other way, SP-Tmean, but it makes more sense to have error be positive when temps are too high and you need to increase fan speed.
  • P is for proportional. This is a correction proportional to the current error. Just multiply by the tuning constant. So the formula for P is simply Kp * ERRc.
  • I is for integral. This is a correction for cumulative error. Every cycle, you add the current error (ERRc) * T to the cumulative error (ERR), then multiply by a tuning constant. So ERR = ERRc*T + ERR, then I = Ki * ERR. My understanding is that this helps correct offset, where the temperature stays a bit below or above SP.
  • D is for derivative. That is the change in current error with respect to time, or basically the slope of the error line. In practice it’s even simpler: you just subtract the previous error (ERRp) from ERRc, divide by T, and multiply by another tuning constant. So D = Kd * ((ERRc – ERRp) / T). This really does two things. When you start a scrub or the sun hits your NAS, you get a large positive error. The bigger the increase in error, the bigger D is. In this case, D and P are additive, aggressively increasing duty cycle. But then when the temps are cooling fast, coming back down to SP, P is still positive and D is negative, so D counters P and puts the brakes on the fans, reducing overshoot and subsequent oscillation.
Most of what I’ve read says, in the great majority of cases, D is not needed, and you can just use PI. So I worked with PI for a long time, trying all kinds of tuning, and always had oscillation. By then I was using Tmean instead of Tmax, and I put the derivative term in and it was MUCH more stable. On the other hand, I have not seen any sign of offset, and I does more harm than good, so I ended up setting Ki to 0.
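
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code from spinpid.sh: the function names, the 0-100 duty limits, and the sample numbers (SP, Kp, Ki, Kd, T) are all assumptions chosen to keep the arithmetic easy to follow, and adding the rounded sum of corrections to the current duty cycle is one reading of "corrections to adjust the duty cycle".

```python
def pid_step(t_mean, setpoint, err_prev, err_sum, kp, ki, kd, t):
    """One control cycle. Error is Tmean - SP, so positive means 'too warm'."""
    err_now = t_mean - setpoint
    err_sum = err_sum + err_now * t           # ERR = ERRc*T + ERR
    p = kp * err_now                          # P = Kp * ERRc
    i = ki * err_sum                          # I = Ki * ERR
    d = kd * (err_now - err_prev) / t         # D = Kd * ((ERRc - ERRp) / T)
    return p, i, d, err_now, err_sum


def next_duty(duty_now, p, i, d, low=0, high=100):
    """Add the rounded sum of corrections and keep the result within limits."""
    return max(low, min(high, duty_now + round(p + i + d)))


# Illustrative numbers only: SP 33.0 C, Kp 4, Ki 0, Kd 40, T = 6 time units.
p, i, d, err, err_sum = pid_step(33.5, 33.0, err_prev=0.2, err_sum=0.0,
                                 kp=4, ki=0, kd=40, t=6)
print(round(p, 2), round(i, 2), round(d, 2))  # 2.0 0.0 2.0
print(next_duty(40, p, i, d))                 # 44
```

With the temperature both above the setpoint and still rising, P and D push in the same direction. On the next cycle, pass `err` back in as `err_prev` and carry `err_sum` forward.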

Here are a few graphs showing some preliminary trials. First using Tmax as the process variable, SP is 36. Kp is too high.
Kp only with Tmax.png


Tried adding the I term. Many experiments I won't bore you with. This one uses Tmean, still with bad oscillation. This shows cumulative error too. No improvement.
Kp Ki Tmean.png


And the script and tuning I ended up with, showing Tmax, Tmean, and duty cycle. As a test, this starts with a large error. See how Tmean comes down to SP without overshooting. Then there are minor corrections through the day, and you can see the fans increase as the sun comes in in the afternoon.
KpKd.png


Here is what the log from spinpid.sh looks like.
Code:
Saturday, Jan 14                                                                                                    Fan %     Interim CPU
          da0   da1   da2   da3   ada0  ada1  ada2  Tmax  Tmean   ERRc      P     I      D  CPU  Driver  RPM  MODE  Curr/New  Adjustments
18:01:22  *34   *34   *35   *34   *31   *32   *32   ^35   33.14  -0.43  -1.71  0.00   0.00   53  Drives  600  Full  38/36
18:07:31  *34   *34   *35   *34   *31   *32   *32   ^35   33.14  -0.43  -1.71  0.00   0.00   53  Drives  600  Full  36/34
18:13:41  *34   *34   *35   *34   *31   *32   *32   ^35   33.14  -0.43  -1.71  0.00   0.00   54  Drives  600  Full  34/32
18:19:52  *35   *34   *35   *35   *31   *32   *32   ^35   33.43  -0.14  -0.57  0.00   1.90   55  Drives  500  Full  32/33
18:26:03  *35   *35   *36   *35   *31   *32   *32   ^36   33.71   0.14   0.58  0.00   1.90   55  Drives  500  Full  33/35     38 44 50 56
18:32:11  *35   *35   *36   *35   *31   *33   *32   ^36   33.86   0.29   1.15  0.00   0.95   60  Drives  800  Full  56/58
18:38:21  *34   *34   *35   *34   *30   *32   *32   ^35   33.00  -0.57  -2.28  0.00  -5.71   52  Drives  900  Full  58/50
18:44:30  *33   *33   *34   *33   *30   *32   *32   ^34   32.43  -1.14  -4.57  0.00  -3.81   49  Drives  800  Full  50/42
18:50:40  *33   *33   *34   *33   *30   *32   *32   ^34   32.43  -1.14  -4.57  0.00   0.00   50  Drives  700  Full  42/37
18:56:50  *33   *33   *34   *33   *30   *32   *32   ^34   32.43  -1.14  -4.57  0.00   0.00   51  Drives  600  Full  37/32
19:03:01  *34   *33   *34   *34   *30   *32   *32   ^34   32.71  -0.86  -3.42  0.00   1.90   53  Drives  500  Full  32/30
19:09:11  *34   *34   *35   *34   *30   *32   *32   ^35   33.00  -0.57  -2.28  0.00   1.90   55  Drives  500  Full  30/30     32
19:15:21  *34   *34   *35   *34   *30   *32   *32   ^35   33.00  -0.57  -2.28  0.00   0.00   56  CPU     500  Full  30/32

This is MUCH better than the control built into the boards. Tmean normally stays within 0.3 C of SP unless there is a disturbance, then within 0.5 C. It is damn near perfect. I guess SuperMicro doesn’t do something like this because (a) they’re more interested in protecting the board than the drives, and the board is less sensitive to temperature variation, (b) accessing drive temperatures depends on the OS, and (c) it requires tuning.

Tuning
PID tuning advice on the internet generally does not work well for controlling drive temperatures in my experience.
  1. First run the script spincheck.sh (logs detailed data only, no control) and get familiar with your temperature and fan variations without any intervention.
  2. Now in the settings of spinpid.sh, choose a setpoint that is an actual observed Tmean, given the number of drives you have. It should be the Tmean associated with the Tmax that you want.
  3. Set Ki=0 and leave it there. You probably will never need it.
  4. Start with Kp low. Use a value that results in a rounded correction=1 when error is the lowest value you observe other than 0 (i.e., when ERRc is minimal, Kp ~= 1 / ERRc). However, if you have few drives and thus coarser temperature monitoring, you may need a larger Kp. I would not go below 4.
  5. Set Kd at about Kp*10
  6. Get Tmean within ~0.3 degree of SP before starting script. At this stage you don’t want to test a large error, you want an equilibrated system.
  7. Start script and run for a few hours or so. If Tmean oscillates (best to graph it), you probably need to reduce Kd. If no oscillation but response is too slow, raise Kd.
  8. Stop script and get Tmean at least 1 C off SP. Restart. If there is overshoot and it goes through some cycles, you may need to reduce Kd.
  9. If you have problems, examine P and D in the log and see which is messing you up. Most likely Kd needs tuning. You can try raising Kp, though too high and changes become too aggressive and you get overshoot and oscillation. You can even try using Ki. If you use Ki, make it small, ~ 0.1 or less.
Scripts

There are two bash scripts attached (change extension to .sh after you download) (EDIT: scripts updated 2017-01-15). spincheck.sh logs data only, it does not control anything. spinpid.sh logs and controls. Both scripts log:
  • disk status (spinning or standby)
  • disk temperature (Celsius) if spinning
  • max and mean disk temperature
  • fan rpm (spincheck does FAN1-4 and FANA) and mode
  • current fan duty cycle (spincheck gives zones 0 and 1; spinpid gives new duty cycle also)
  • optional diagnostic variables
The scripts include disks on motherboard as well as on an HBA. They get a list of devices from camcontrol devlist. I edit that to delete my SanDisk flash drives from the list; you may have to change that for your system. Suggestions in the script.

The mode code is primarily based on my board. If you have different modes you may need to make some minor tweaks. I have no idea if any of this is applicable to non-Supermicro boards.

As usual, you are responsible for anything you do on your system. This works for me, but for all I know it could make your box catch fire or cause a zombie apocalypse. I suggest you monitor closely at first.

After testing, if you decide to use it, you can run it as a post-init script (Tasks in the GUI) so it starts automatically after booting. In this case, to avoid ‘windup’ (a large error when starting with possibly cold drives), you may want to add a ‘sleep 1200’ before the main loop. Then it will wait 20 minutes for the drives to warm up before doing anything.
 
Joined
Dec 2, 2015
Messages
730
Thanks for sharing! I'm travelling for the rest of May, but I plan to experiment with this once I'm back home.
 

AlainS

Cadet
Joined
Oct 20, 2015
Messages
3
Hi,

Seems awesome and I tried it but I get this error:
upload_2016-5-26_23-19-43.png


I'm not familiar with the actual scripting (yet) so I have no idea how to fix this...
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I'm on my phone so can't investigate now. But try without sh in front. This is a bash script and I don't know what sh does.
You will then have to specify the directory the script is in or type ./spincheck.sh if you're in the directory.
 

AlainS

Cadet
Joined
Oct 20, 2015
Messages
3
Now I get this output:
upload_2016-5-26_23-41-1.png

I am not using a Supermicro motherboard, but an AsrockRack e3c224d4i 14s, which might be the reason I'm getting these errors.
 

AlainS

Cadet
Joined
Oct 20, 2015
Messages
3
I've solved part of the errors I was getting by removing my Kingston flashdrives from the reported devlist. This is the result I'm getting right now:
 

upload_2016-5-26_23-49-56.png

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I've solved part of the errors I was getting by removing my Kingston flashdrives from the reported devlist. This is the result I'm getting right now:
Please copy the output and paste it in code tags. You're getting drive temps now but the raw commands apparently don't work on your board. May need some research.

I looked up your board and can't find information that it even has PWM fan control, except for the CPU fan. It may not be able to take ipmitool raw commands. You might want to ask AsRockRack about that.
 

PigLover

Dabbler
Joined
May 29, 2016
Messages
11
This is a GREAT thread. Lots of good info that wasn't generally available about SM fan speed control methods.
For my first post on FreeNAS forums I thought I'd address this comment:
BMC won’t give up control.
After playing with this for a while, it appears that if you set the fan speed control to "full" before setting the duty cycle the BMC stops trying to fiddle with the fans and your settings will hold persistently.

For example I've currently added this to the script I run at system boot:

Code:
#set fan mode to "full"
ipmitool raw 0x30 0x45 0x01 0x01
#set fans in "system" zone to 37.5%
ipmitool raw 0x30 0x70 0x66 0x01 0x00 0x24
#set fans in "peripheral" zone to 25%
ipmitool raw 0x30 0x70 0x66 0x01 0x01 0x16


It's less flexible than the PID script posted above but for me (for now) it works great.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
After playing with this for a while, it appears that if you set the fan speed control to "full" before setting the duty cycle the BMC stops trying to fiddle with the fans and your settings will hold persistently.
Strange but true. Good find. I've also noticed in my scripts that if I set mode, then immediately set duty cycle, the second command doesn't work. So I let it sleep a second in between with 'sleep 1'.
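
A rough Python sketch of that ordering: set Full mode, pause, then set the duty cycle. The function and parameter names are made up for illustration. The mode command uses the four-byte form quoted elsewhere in this thread, and the duty byte is passed through untouched because its scaling is board-specific. The dry run below only collects the commands; nothing is sent to the BMC.

```python
import time


def full_mode_then_duty(zone, duty_byte, run, pause_s=1.0, sleep=time.sleep):
    """Set fan mode to Full, wait briefly, then set one zone's duty cycle.

    run is any callable that accepts an argument list (subprocess.run would
    be one choice). The byte values are the Supermicro ones from this thread.
    """
    set_mode = ["ipmitool", "raw", "0x30", "0x45", "0x01", "0x01"]
    set_duty = ["ipmitool", "raw", "0x30", "0x70", "0x66", "0x01",
                f"0x{zone:02x}", f"0x{duty_byte:02x}"]
    run(set_mode)
    sleep(pause_s)  # without a pause the duty command reportedly does not take
    run(set_duty)
    return [set_mode, set_duty]


# Dry run: record the commands instead of executing them, and skip the wait.
sent = []
full_mode_then_duty(zone=0, duty_byte=0x24, run=sent.append, sleep=lambda s: None)
for cmd in sent:
    print(" ".join(cmd))
# ipmitool raw 0x30 0x45 0x01 0x01
# ipmitool raw 0x30 0x70 0x66 0x01 0x00 0x24
```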

I didn't realize the number before duty cycle is a fan zone. How do you know where/what your zones are? I suspect my board has only one zone.
 

PigLover

Dabbler
Joined
May 29, 2016
Messages
11
Fans named with numbers are in the System/CPU zone (e.g., FAN1, FAN2, etc.). This is zone 0.

Fans named with letters are in the peripheral zone (e.g., FANA, FANB, etc.). This is zone 1.
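
As a quick illustration of that naming rule, here is a small Python sketch. The function name is hypothetical, and the rule is only the one described in this post; boards that name their fan headers differently would need their own mapping.

```python
def fan_zone(fan_name: str) -> int:
    """Return 0 for numbered fans (FAN1, FAN2, ...) and 1 for lettered fans (FANA, ...)."""
    name = fan_name.strip().upper()
    suffix = name[3:]
    if not name.startswith("FAN") or not suffix:
        raise ValueError(f"unrecognised fan name: {fan_name!r}")
    if suffix.isdigit():
        return 0  # System/CPU zone
    if suffix.isalpha():
        return 1  # peripheral zone
    raise ValueError(f"unrecognised fan name: {fan_name!r}")


print(fan_zone("FAN1"))  # 0
print(fan_zone("FANA"))  # 1
```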

 

leonroy

Explorer
Joined
Jun 15, 2012
Messages
77
Been trying to change the fan modes on an X9 series motherboard but the settings don't seem to stick. For example the below command to set fan speed to full doesn't change the fan speed:

ipmitool raw 0x30 0x45 0x1 0x1

Even resetting the BMC afterwards doesn't change anything.

The only place which seems to have an effect on fan speed is the BIOS setting which offers Normal, Optimal and Full speeds.

Any ideas?
 

PigLover

Dabbler
Joined
May 29, 2016
Messages
11
What motherboard exactly and what BMC rev?

 