Fan Scripts for Supermicro Boards Using PID Logic

Fan Scripts for Supermicro Boards Using PID Logic 2020-08-20, previous one was missing a file

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
The error message:

Code:
 ./spincheck.sh: line 118: let: Tsum += : syntax error: operand expected (error token is "+= ")
*0   ./spincheck.sh: line 118: let: Tsum += : syntax error: operand expected (error token is "+= ")
That's an odd thing for it to choke on.

Please run (with sudo if needed) smartctl -a "/dev/da11" and post results. I really just need the temperature line from the attributes table. It was getting temps for all the drives till that one. My guess is it is a different brand that outputs smartctl data differently from WD, Seagate, Toshiba, Hitachi.

Also, see item #11 in the Overview section. It's possible da11 and some other remaining devices are not spinning drives and need to be deleted from the output of camcontrol devlist.
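For reference, "let: Tsum += : ... operand expected" is what bash reports when the value on the right-hand side expands to nothing, as it would if no temperature could be parsed for a device. A minimal sketch of a guard, assuming the readings arrive as plain integers; sum_temps is a hypothetical helper, not code from spincheck.sh:

```shell
# Hypothetical helper (not from spincheck.sh): add up drive temperatures,
# skipping readings that are empty or not plain integers.
sum_temps() {
    tsum=0
    count=0
    for t in "$@"; do
        case $t in
            ''|*[!0-9]*) continue ;;   # nothing parsed, or not a number
        esac
        tsum=$(( tsum + t ))
        count=$(( count + 1 ))
    done
    echo "$tsum $count"
}

sum_temps 33 35 "" 32    # the empty reading is skipped; prints: 100 3
```

The second number is the count of valid readings, so a mean can still be taken over the drives that actually reported a temperature.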
 

PnoT

Dabbler
Joined
Apr 12, 2017
Messages
41
That's an odd thing for it to choke on.

Please run (with sudo if needed) smartctl -a "/dev/da11" and post results. I really just need the temperature line from the attributes table. It was getting temps for all the drives till that one. My guess is it is a different brand that outputs smartctl data differently from WD, Seagate, Toshiba, Hitachi.

Also, see item #11 in the Overview section. It's possible da11 and some other remaining devices are not spinning drives and need to be deleted from the output of camcontrol devlist.

Thank you for the response. It was entirely my fault for not adjusting the line to exclude the 2 SSDs in my system. I got caught up in trying to get things running without going back over the install instructions carefully.
 
Joined
May 10, 2017
Messages
838
Many thanks for the script. It works very well for me, but it constantly uses about 25% of the CPU. Is this normal, or do I have something misconfigured?

Hardware used is in the signature.
 

Attachments

  • spinpid.png (59.5 KB)

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
Hopefully someone will respond who knows more about CPU usage. All I can say is the script should be asleep most of the time, waiting for the next cycle. I don't know if that figure you attached integrates all the ups and downs. What's your CPU cycle interval?
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
2 seconds is not unreasonable for CPU_T. I have a wimpy CPU and have it set at 15. I cannot detect CPU usage by spinpid.sh. But maybe spinpid2.sh is different in that respect.

Try the command htop. The processes I know are related to spinpid.sh are the script itself (twice for some reason) and tee. None of them register above 0.0%.

Code:
[jim@Tabernacle ~]$ htop


  1  [#*														  1.0%]   5  [															0.0%]
  2  [#*														  1.1%]   6  [															0.0%]
  3  [															0.0%]   7  [															0.0%]
  4  [															0.0%]   8  [															0.0%]
  Mem[|||||||*******************************************   1.50G/16.0G]   Tasks: 60, 0 thr; 2 running
  Swp[|													17.3M/14.0G]   Load average: 0.19 0.23 0.24
																		  Uptime: 19 days, 08:06:22

  PID USER	  PRI  NI  VIRT   RES S CPU% MEM%   TIME+  Command																					
3488 root	   20   0  246M 33132 S  0.0  0.2  4h34:27 /usr/local/sbin/collectd
3411 root	   20   0  362M  152M S  0.0  0.9  3:35.99 /usr/local/bin/python -R /usr/local/www/freenasUI/manage.py runfcgi method=threaded host=127
	0 root	  -16   0	 0 13856 S  0.0  0.1  3h31:34 kernel
3388 root	   52   0  241M 57384 S  0.0  0.3  1h39:08 python: alertd
9979 root	   25   0 85848  7288 S  0.0  0.0  0:00.03 sshd: jim [priv]
85750 root	   20   0 17852  2524 S  0.0  0.0  0:00.12 bash
85740 root	   20   0 72140  4340 S  0.0  0.0  0:00.02 sudo jexec 2 bash
7113 root	   20   0 12324  1900 S  0.0  0.0  0:07.44 tee -i /mnt/Ark/Jim/spinpid.log
7112 root	   52   0 19892  2648 S  0.0  0.0  0:00.00 /usr/local/bin/bash /mnt/Ark/Jim/bin/spinpid.sh											
7111 root	   21   0 19892  2788 S  0.0  0.0  6:50.93 /usr/local/bin/bash /mnt/Ark/Jim/bin/spinpid.sh
7110 root	   52   0 17064  2380 S  0.0  0.0  0:00.00 sh /etc/rc autoboot
7103 root	   52   0 17064  2380 S  0.0  0.0  0:00.00 sh /etc/rc autoboot
7078 root	   43   0 20740  2288 S  0.0  0.0  0:15.91 /usr/sbin/cron -s
7068 root	   20   0 51712  2952 S  0.0  0.0  0:00.11 /sbin/zfsd
6625 root	   20   0  141M 35912 S  0.0  0.2  3:31.11 /usr/pbi/transmission-amd64/bin/python2.7 /usr/pbi/transmission-amd64/control.py start 192.1
6378 root	   20   0 16624  1128 S  0.0  0.0  0:04.99 /usr/sbin/cron -s
6350 root	   20   0 21704  2108 S  0.0  0.0  3h07:23 /usr/local/sbin/openvpn --cd /usr/local/etc/openvpn --daemon openvpn --config /usr/local/etc
6304 root	   20   0 14528  1136 S  0.0  0.0  0:15.68 /usr/sbin/syslogd -s
5043 root	   20   0 16624   740 S  0.0  0.0  0:05.09 /usr/sbin/cron -s
4986 root	   20   0 14528   756 S  0.0  0.0  0:12.25 /usr/sbin/syslogd -s
3578 root	   52   0  101M 12400 S  0.0  0.1  0:00.17 /usr/local/bin/python /usr/local/libexec/nas/register_mdns.py
3577 root	   52   0 14460  1992 S  0.0  0.0  0:00.00 daemon: /usr/local/libexec/nas/register_mdns.py[3578]
3367 root	   22   0 41860  3780 S  0.0  0.0  0:01.04 /usr/local/sbin/cnid_metad -d -F /usr/local/etc/afp.conf
3366 root	   22   0  101M  7132 S  0.0  0.0  0:03.91 /usr/local/sbin/afpd -d -F /usr/local/etc/afp.conf
3345 root	   20   0 46208  3748 S  0.0  0.0  1:26.89 /usr/local/sbin/netatalk
3332 root	   52   0 50656  3648 S  0.0  0.0  0:00.00 nginx: master process /usr/local/sbin/nginx
3114 root	   22   0 17064  3296 S  0.0  0.0  0:09.36 /bin/sh /usr/local/sbin/pbid
3113 root	   52   0 14460  1992 S  0.0  0.0  0:00.00 daemon: /usr/local/sbin/pbid[3114]
3089 root	   20   0 30732  3112 S  0.0  0.0  0:04.43 /usr/local/sbin/smartd -i 1800 -c /usr/local/etc/smartd.conf -p /var/run/smartd.pid
2672 root	   20   0 61028  5036 S  0.0  0.0  0:02.06 /usr/local/sbin/sshd
2307 root	   20   0 14468  2008 S  0.0  0.0 10:37.64 /usr/sbin/powerd
2304 root	   20   0 30260 18104 S  0.0  0.1  3:20.92 /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f /var/db/ntpd.drift
2237 root	   52   0 20704  3992 S  0.0  0.0  0:04.02 /usr/sbin/rpc.lockd
2234 root	   20   0  276M  4044 S  0.0  0.0  0:03.32 /usr/sbin/rpc.statd
2231 root	   20   0 14476  2024 S  0.0  0.0  6:39.51 nfsd: server
2230 root	   20   0 22720  4024 S  0.0  0.0  0:00.02 nfsd: master
2224 root	   20   0 22744  4224 S  0.0  0.0  0:00.04 /usr/sbin/mountd -rSn /etc/exports /etc/zfs/exports
F1Help  F2Setup F3SearchF4FilterF5Tree  F6SortByF7Nice -F8Nice +F9Kill  F10Quit										  
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
That was it, it was set too low at 2 secs, now at 20 secs I still see some usage but it's much lower, 2 or 3%.

I don't know about your CPUs, but all mine will overheat if the fan controller is only checking every 20 seconds.

My E5-1650v4 can go from cool to 65C in 2 seconds. Any longer and it would hit 100C and shut down/throttle.

@Glorious1 how are you reading CPU temp?
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
My CPU is very low powered and doesn't heat up so fast.

You mean how is the script reading CPU temp?
Code:
   CPU_TEMP=$($IPMITOOL sdr | grep "CPU Temp" | grep -Eo '[0-9]{2,5}')
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
My CPU is very low powered and doesn't heat up so fast.

You mean how is the script reading CPU temp?
Code:
   CPU_TEMP=$($IPMITOOL sdr | grep "CPU Temp" | grep -Eo '[0-9]{2,5}')

Possibly reading the SDR is relatively expensive. In my script I have two options: one is sysctl, which reads the CPU temp from the kernel, and the other is IPMI-based, which uses

my $cpu_temp = `$ipmitool sensor get \"CPU Temp\" | awk '/Sensor Reading/{print \$4}'`;

Because that gets just the value it needs, it's probably more CPU efficient. I certainly found this with the sysctl method. Getting sysctl to focus on just the kernel temps meant it didn't have to work so hard to output, and grep didn't have to work so hard to filter.

And yes, low powered CPUs do heat up slower, but still 20s is probably a bit long. The way to find out is to run mprime and see how quickly the CPU heats up from cold to uncomfortably hot ;)
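A shell version of the single-sensor read above might look like the following. It is only a sketch: read_sensor_value is a hypothetical name, and the sample line in the usage example just imitates the "Sensor Reading" line that the awk pattern expects. Whether this is actually cheaper than reading the whole SDR on a given board is something to measure.

```shell
# Hypothetical helper: pull the numeric value out of `ipmitool sensor get`
# output on stdin, using the same awk pattern as the Perl line above.
read_sensor_value() {
    awk '/Sensor Reading/ { print $4; exit }'
}

# Intended use on a live system ($IPMITOOL is assumed to hold the ipmitool path):
#   CPU_TEMP=$("$IPMITOOL" sensor get "CPU Temp" | read_sensor_value)

# Dry run with an imitation of the relevant output line:
printf ' Sensor Reading        : 47 (+/- 0) degrees C\n' | read_sensor_value
```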
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
For some reference, an i3-4330 with the stock cooler takes very little time to heat up insanely (two seconds sounds right). YMMV with the cooler. In particular, a water cooling loop will take much longer to heat up, because the water being pumped around carries heat away even if the fans haven't spun up yet. I'd be interested in seeing what happens in something like a Noctua NH-D15, with that huge thermal mass... With an IR camera, preferably.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
My 1650v4 is using a Noctua NH U9DX i4

Not quite as huge, but still, 2 seconds. Of course, I tuned the idle speed so that it could run slower, based on the thermal mass ;)

If the idle speed were faster, it would take longer to heat up.

I use a higher idle speed on my XeonD, so it takes much longer to heat up. (Because I hate the sound of its fan running flat out, I wanted to maximize the time it had before it needed 100%.)

Would like to see a thermal image anyway :)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
For fun, I just tried prime95 on my workstation. Xeon E5-1650 v3 with Corsair H100i, two Noctua NF-F12 iPPC3000 PWM fans.

The CPU immediately jumps from the mid 30s to high 50s, within two sample periods (roughly two seconds, most of the change happening in the first period). The fans spin up to 1600RPM and the temperatures stay there. It's certainly less scary than the screeching stock cooler that can't keep the i3-4330 beneath 70 degrees at full tilt.

(The fan control is from Corsair's software, not these scripts)
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
Thanks Stux. I'm sure there is room for greater efficiency in the scripts and I will look into the code you suggested. But I changed the setting on my setup to check CPU every second, and CPU usage is still below the measurable threshold with htop.

So either the reporting graphic that Jonnie Black posted is incorrect (I would still like to see the htop output, JB), or the 2-zone script has something weird going on that's not going on in the 1-zone script. I can't test it myself unfortunately.

In any case, clearly JB should not leave the interval at 15-20 seconds; 1-2 sec may be good.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,358
The CPU immediately jumps from the mid 30s to high 50s, within two sample periods (roughly two seconds, most of the change happening in the first period). The fans spin up to 1600RPM and the temperatures stay there. It's certainly less scary than the screeching stock cooler that can't keep the i3-4330 beneath 70 degrees at full tilt.

Very similar to my results. Which is why I use a 2 second loop and not 1s
 
Joined
May 10, 2017
Messages
838
So either the reporting graphic that Jonnie Black posted is incorrect (I would still like to see the htop output, JB), or the 2-zone script has something weird going on that's not going on in the 1-zone script. I can't test it myself unfortunately.

I can't see a single process using the CPU in htop. It looks like it's the "system", but the usage only starts when the script is running and stops when I stop the script. See the video below. I changed the CPU cycle back to 2 secs. For the first 10 seconds or so the server is idling (I have no plugins/jails running on this server); at the 10 sec. mark I start the script and CPU usage is visible, but there's no process using it; at around the 50 sec. mark I stop the script and CPU usage returns to normal.

https://www.dropbox.com/s/mdig9l3ftlzn1jx/htop.avi?dl=0
 
Joined
May 10, 2017
Messages
838
In my script I have two options, one is sysctl which reads the CPU temp from the kernel and the other is IPMI based which uses

@Stux , just tried your script and don't see the high CPU usage.

@Glorious1 I can still try to debug if you want but if it's just happening to me maybe not worth it.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
Just tested the new script from the download link. It wants to reset the BMC frequently. Not sure why...

Code:
CPU zone 1; Peripheral zone 0
CPU fans min/max duty cycle: 30/100
PER fans min/max duty cycle: 30/100
CPU fans - measured RPMs at 30 0x1.13p+10nd 100 1900uty cycle: /
PER fans - measured RPMs at 30 0x1.c2p+9nd 100 2400uty cycle: /
Drive temperature setpoint (C): 35
Kp=4, Ki=0, Kd=40
Drive check interval (main cycle; minutes): 3
CPU check interval (seconds): 2
CPU reference temperature (C): 38
CPU scalar: 6

Key to drive status symbols:  * spinning;  _ standby;  ? unknown							  Version 2017-04-10

Thursday, Oct 19																				   CPU		 New_Fan%  New_RPM_____________________
		  da0  da1  da2  da3  da4  da5  da6  da7  ada0 ada1 Tmax Tmean   ERRc	  P	 I	  D TEMP MODE	CPU PER   FANA  FAN1  FAN2  FAN3  FAN4
20:54:59  *33  *35  *32  *33  *33  *32  *31  *33  *35  *33  ^35  33.00  -2.00  -8.00  0.00 -26.67   50 Full	 50  30   1400   900  1000   ---  1000
20:58:40  *33  *35  *32  *34  *34  *32  *31  *34  *36  *34  ^36  33.50  -1.50  -6.00  0.00   6.67   51 Full	100  31   1400  1000  1000   ---  1000Sent cold reset command to MC

DUTY_CPU=100; RPM_CPU=1400 -- I reset the BMC because RPMs were too high or low for DUTY_CPU
spinpid2.sh: line 292: printf: Unable: invalid number
spinpid2.sh: line 292: printf: Unable: invalid number
spinpid2.sh: line 292: printf: Unable: invalid number

21:03:22  *33  *35  *32  *34  *34  *32  *31  *34  *36  *34  ^36  33.50  -1.50  -6.00  0.00   0.00   48 Full	 90  30   1800  1000  1000   ---  1000
21:07:05  *33  *35  *32  *34  *34  *32  *31  *34  *36  *34  ^36  33.50  -1.50  -6.00  0.00   0.00   47 Full	 84  30   1800   900   900   ---  1000
21:10:46  *34  *36  *33  *35  *35  *32  *32  *34  *36  *34  ^36  34.10  -0.90  -3.60  0.00   8.00   49 Full	 96  34   2000  1100  1100   ---  1200
21:14:28  *34  *35  *33  *34  *34  *32  *31  *34  *36  *34  ^36  33.70  -1.30  -5.20  0.00  -5.33   47 Full	 84  30   1700  1000  1000   ---  1000

21:18:09  *33  *35  *33  *34  *34  *32  *31  *34  *36  *34  ^36  33.60  -1.40  -5.60  0.00  -1.33   45 Full	 72  30   1600  2400  2400   ---  2700Sent cold reset command to MC

DUTY_PER=30; RPM_PER=2400 -- I reset the BMC because RPMs were too high or low for DUTY_PER
spinpid2.sh: line 292: printf: Unable: invalid number
spinpid2.sh: line 292: printf: Unable: invalid number

21:22:50  *33  *35  *32  *33  *33  *32  *31  *34  *35  *33  ^35  33.10  -1.90  -7.60  0.00  -6.67   48 Full	 90  30   1800  2400  2400   ---  2700Sent cold reset command to MC

DUTY_PER=30; RPM_PER=2400 -- I reset the BMC because RPMs were too high or low for DUTY_PER
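
The P and D columns in the log above are consistent with P = Kp * ERRc and D = Kd * (ERRc - previous ERRc) / interval, where ERRc = Tmean - setpoint and the interval is in minutes. That is inferred from the printed rows, not from the script source. A minimal sketch that reproduces the numbers; pid_terms is a hypothetical name, and the I term is left out because Ki=0 in this run:

```shell
# Hypothetical helper: compute the P and D terms from two consecutive Tmean values.
# Arguments: Kp Kd setpoint Tmean previous_Tmean interval_minutes
pid_terms() {
    awk -v kp="$1" -v kd="$2" -v sp="$3" -v t="$4" -v tprev="$5" -v dt="$6" 'BEGIN {
        err = t - sp              # ERRc: negative when drives are below setpoint
        err_prev = tprev - sp
        printf "%.2f %.2f\n", kp * err, kd * (err - err_prev) / dt
    }'
}

# Values from the 20:58:40 row: Kp=4, Kd=40, setpoint 35, Tmean 33.50 after 33.00, 3 min
pid_terms 4 40 35 33.50 33.00 3    # prints: -6.00 6.67
```

With Kd=40 and a 3-minute cycle, a half-degree change in Tmean already moves the D term by 6.67, which is why D dominates the row-to-row changes in this log.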



Code:
Script to determine fan RPMs associated with varying duty cycles.
Duty will be varied from 100% down in steps of 10.
After setting duty, we wait 5 seconds for equilibration
before reading the fans.  Some points:

1. Set the log path in the script before beginning.
2. There should be no fan control scripts running.
3. At very high and especially low fan speeds, the BMC may take
   over and change the duty.  You can:
	 - ignore it,
	 - adjust fan thresholds before testing, or
	 - edit the script so it starts and/or ends at different values
	   (see comment before while loop for instructions).

Thursday, Oct 19, 20:42:47
								 ___Duty%___  Curr_RPM____________________
						MODE	 Zone0 Zone1  FANA  FAN1  FAN2  FAN3  FAN4
Conditions before test  Full		 6	 6  1700  2400  2400   ---  2700

Are you ready to begin? (y/n)

Thursday, Oct 19, 20:42:52
								 ___Duty%___  Curr_RPM____________________
						MODE	 Zone0 Zone1  FANA  FAN1  FAN2  FAN3  FAN4
Duty cycle 100%		 Full	   100   100  1900  2400  2400   ---  2700
Duty cycle 90%		  Full		90	90  1800  2200  2200   ---  2500
Duty cycle 80%		  Full		80	80  1700  2000  2100   ---  2300
Duty cycle 70%		  Full		70	70  1600  1800  1800   ---  2000
Duty cycle 60%		  Full		60	60  1400  1600  1600   ---  1800
Duty cycle 50%		  Full		50	50  1300  1400  1400   ---  1500
Duty cycle 40%		  Full		40	40  1200  1200  1200   ---  1300
Duty cycle 30%		  Full		30	30  1100   900   900   ---  1000
Duty cycle 20%		  Full		20	20  1000   700   700   ---   700




Thresholds:
Code:
CPU Temp		 | 47.000	 | degrees C  | ok	| 0.000	 | 0.000	 | 0.000	 | 95.000	| 100.000   | 100.000			
System Temp	  | 41.000	 | degrees C  | ok	| -9.000	| -7.000	| -5.000	| 80.000	| 85.000	| 90.000			 
Peripheral Temp  | 43.000	 | degrees C  | ok	| -9.000	| -7.000	| -5.000	| 80.000	| 85.000	| 90.000			 
PCH Temp		 | 49.000	 | degrees C  | ok	| -11.000   | -8.000	| -5.000	| 90.000	| 95.000	| 100.000			
VRM Temp		 | 39.000	 | degrees C  | ok	| -9.000	| -7.000	| -5.000	| 95.000	| 100.000   | 105.000			
DIMMA1 Temp	  | 34.000	 | degrees C  | ok	| 1.000	 | 2.000	 | 4.000	 | 80.000	| 85.000	| 90.000			 
DIMMA2 Temp	  | 35.000	 | degrees C  | ok	| 1.000	 | 2.000	 | 4.000	 | 80.000	| 85.000	| 90.000			 
DIMMB1 Temp	  | 35.000	 | degrees C  | ok	| 1.000	 | 2.000	 | 4.000	 | 80.000	| 85.000	| 90.000			 
DIMMB2 Temp	  | 35.000	 | degrees C  | ok	| 1.000	 | 2.000	 | 4.000	 | 80.000	| 85.000	| 90.000			 
FAN1			 | 1300.000   | RPM		| ok	| 300.000   | 400.000   | 500.000   | 3600.000  | 3700.000  | 3800.000		   
FAN2			 | 1300.000   | RPM		| ok	| 300.000   | 400.000   | 500.000   | 3600.000  | 3700.000  | 3800.000		   
FAN3			 | na		 |			| na	| na		| na		| na		| na		| na		| na				 
FAN4			 | 1300.000   | RPM		| ok	| 300.000   | 400.000   | 500.000   | 3600.000  | 3700.000  | 3800.000		   
FANA			 | 2000.000   | RPM		| nc	| 300.000   | 400.000   | 500.000   | 2000.000  | 2100.000  | 2200.000	 
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,210
Thanks very much lmannyr. This is what comes from trying to make a script for a system you don't have :(. The error in the preliminary printout of settings is an easy fix: at lines 305 and 306, under '# Print settings at beginning of log', the 30% and 100% have to be changed to 30%% and 100%%.
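For anyone applying that fix: in a printf format string a bare % starts a conversion, so "30% and" is read as "% a" (hex float) and "100% duty" as "% d". That would also explain the odd strings in the log, since 0x1.13p+10 is 1100 and 0x1.c2p+9 is 900 in hexadecimal floating-point notation. A minimal sketch of the corrected form; the function name and exact wording are assumptions, not the script's actual lines 305 and 306:

```shell
# Hypothetical stand-in for the settings printout; %% prints a literal percent sign.
print_rpm_settings() {
    printf 'CPU fans - measured RPMs at 30%% and 100%% duty cycle: %s/%s\n' "$1" "$2"
}

print_rpm_settings 1100 1900
# prints: CPU fans - measured RPMs at 30% and 100% duty cycle: 1100/1900
```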

The BMC reset will take more time to figure out. But you've given me all the right data to work with. Thanks

(Actually, if you still have it, could you send me the CPU log from that same run?)

Also, please let me know what those settings are that were messed up in the printout:
Code:
		30%	 100%
PER	   ?	   2400	FANs 1-4
CPU	   ?	   1900	 FANA
 

lmannyr

Contributor
Joined
Oct 11, 2015
Messages
198
Here are the settings. For some reason, the cpu log did not record temps during that time. I'll run the script again and see if it will print the temps during the error.




Code:
#################  FAN SETTINGS ################

# Supermicro says:
# Zone 0 - CPU/System fans, headers with number (e.g., FAN1, FAN2, etc.)
# Zone 1 - Peripheral fans, headers with letter (e.g., FANA, FANB, etc.)
# Some want the reverse (i.e, drive cooling fans on headers FAN1-4 and 
# CPU fan on FANA), so that's the default.  But you can switch to SM way.
ZONE_CPU=1
ZONE_PER=0

# Set min and max duty cycle to avoid stalling or zombie apocalypse
DUTY_PER_MIN=20
DUTY_PER_MAX=100
DUTY_CPU_MIN=20
DUTY_CPU_MAX=100

# Your measured fan RPMs at 30% duty cycle and 100% duty cycle
# RPM_CPU is for FANA if ZONE_CPU=1 or FAN1 if ZONE_CPU=0
# RPM_PER is for the other fan.
RPM_CPU_30=1100   # Your system
RPM_CPU_MAX=2000
RPM_PER_30=900
RPM_PER_MAX=2500
# RPM_CPU_30=500   # My system
# RPM_CPU_MAX=1400
# RPM_PER_30=500
# RPM_PER_MAX=1400
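
One way the two calibration points could feed a "RPMs were too high or low" check is a straight-line estimate of the expected RPM at the current duty, compared against the measured value. This is a guess at the idea, not the logic in spinpid2.sh; the function name and the 25% tolerance are assumptions:

```shell
# Hypothetical check: succeeds (status 0) if the measured RPM is within 25% of a
# linear estimate between the 30% and 100% duty calibration points.
# Arguments: duty measured_rpm rpm_at_30 rpm_at_100
rpm_plausible() {
    expected=$(( $3 + ($1 - 30) * ($4 - $3) / 70 ))
    low=$(( expected * 75 / 100 ))
    high=$(( expected * 125 / 100 ))
    [ "$2" -ge "$low" ] && [ "$2" -le "$high" ]
}

# DUTY_PER=30 with RPM_PER=2400 from the log, against RPM_PER_30=900 and RPM_PER_MAX=2500
rpm_plausible 30 2400 900 2500 || echo "implausible"
```

Under these assumptions, 2400 RPM at 30% duty is far above the 900 RPM estimate, so the check fails.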

 