Fan Scripts for Supermicro Boards Using PID Logic

Fan Scripts for Supermicro Boards Using PID Logic 2020-08-20, previous one was missing a file

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
I wish someone would adapt the base script for SCALE and include that in the resource. I've used the script on CORE with my Supermicro X10SRH-CLN4F for many years and it works really well. Now I want to switch over to SCALE, and the lack of an adapted version scares me a bit. I really don't want to have to tinker with the script while 16 fans run amok at 3,000 rpm in my server.
 

mrcc

Cadet
Joined
Mar 25, 2023
Messages
3

Hi,
I found this script before joining the TrueNAS community and edited it to suit my needs (running TrueNAS SCALE). Basically, I bypassed camcontrol and used lsblk to get the device list (it only picks up Western Digital and Seagate disks, but that is easy to change at line 379), and I changed one function to reflect that.

Bear in mind that I'm a noob: I got it working, but there may be much better solutions.

I have not tried it with HBA cards or SAS drives, only with the built-in ports on a SM X12.... motherboard.
My board has some issues on a fan pin, so I had to change the reference fan to FAN5.

I don't know how to upload changes to GitHub or whatever, and I don't think my changes are good enough for that, but if somebody needs it, here it is:

Code:
#!/usr/bin/bash

# spinpid2.sh for dual fan zones.
VERSION="2020-08-20"
# Run as superuser. See notes at end.

##############################################
#
#  Settings sourced from spinpid2.config
#  in same directory as the script
#
##############################################

DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
source "$DIR/spinpid2.config"

##############################################
# function get_disk_name
# Get disk name from current LINE of DEVLIST
##############################################
# The awk statement works by taking $LINE as input,
# setting '(' as a _F_ield separator and taking the second field it separates
# (ie after the separator), passing that to another awk that uses
# ',' as a separator, and taking the first field (ie before the separator).
# In other words, everything between '(' and ',' is kept.

# camcontrol output for disks on HBA seems to change every version,
# so need 2 options to get ada/da disk name.
function get_disk_name {
#   if [[ $LINE == *",p"* ]] ; then     # for ([a]da#,pass#)
#      DEVID=$(echo "$LINE" | awk -F '(' '{print $2}' | awk -F ',' '{print$1}')
#   else                                # for (pass#,[a]da#)
      DEVID=$LINE                       # DEVLIST already holds bare device names (sda, sdb, ...)
#   fi                                  # commented out, bypassing camcontrol
}

############################################################
# function print_header
# Called when script starts and each quarter day
############################################################
function print_header {
   DATE=$(date +"%A, %b %d")
   let "SPACES = DEVCOUNT * 5 + 42"  # 5 spaces per drive
   printf "\n%-*s %3s %16s %29s \n" $SPACES "$DATE" "CPU" "New_Fan%" "New_RPM__________________________________"
   echo -n "          "
   while read -r LINE ; do
      get_disk_name
      printf "%-5s" "$DEVID"
   done <<< "$DEVLIST"             # while statement works on DEVLIST
   printf "%4s %5s %6s %6s %6s %3s %-7s %s %-4s %5s %5s %5s %5s %5s %5s %5s" "Tmax" "Tmean" "ERRc" "P" "D" "TEMP" "MODE" "CPU" "PER" "FANA" "FANB"  "FAN1" "FAN2" "FAN3" "FAN4" "FAN5"
}

#################################################
# function read_fan_data
#################################################
function read_fan_data {

   # If set by user, read duty cycles, convert to decimal.  Otherwise,
   # the script will assume the duty cycles are what was last set.
    if [ $HOW_DUTY == 1 ] ; then
        DUTY_CPU=$($IPMITOOL raw 0x30 0x70 0x66 0 $ZONE_CPU) # in hex with leading space
        DUTY_CPU=$((0x$(echo $DUTY_CPU)))  # strip leading space and decimalize
        DUTY_PER=$($IPMITOOL raw 0x30 0x70 0x66 0 $ZONE_PER)
        DUTY_PER=$((0x$(echo $DUTY_PER)))
    fi

   # Read fan mode, convert to decimal, get text equivalent.
   MODE=$($IPMITOOL raw 0x30 0x45 0) # in hex with leading space
   MODE=$((0x$(echo $MODE)))  # strip leading space and decimalize
   # Text for mode
   case $MODE in
      0) MODEt="Standard" ;;
      1) MODEt="Full" ;;
      2) MODEt="Optimal" ;;
      4) MODEt="HeavyIO" ;;
   esac

   # Get reported fan speed in RPM from sensor data repository.
   # Takes the pertinent FAN line, then 3 to 5 consecutive digits
   SDR=$($IPMITOOL sdr)
   FAN1=$(echo "$SDR" | grep "FAN1" | grep -Eo '[0-9]{3,5}')
   FAN2=$(echo "$SDR" | grep "FAN2" | grep -Eo '[0-9]{3,5}')
   FAN3=$(echo "$SDR" | grep "FAN3" | grep -Eo '[0-9]{3,5}')
   FAN4=$(echo "$SDR" | grep "FAN4" | grep -Eo '[0-9]{3,5}')
   FAN5=$(echo "$SDR" | grep "FAN5" | grep -Eo '[0-9]{3,5}')
   FANA=$(echo "$SDR" | grep "FANA" | grep -Eo '[0-9]{3,5}')
   FANB=$(echo "$SDR" | grep "FANB" | grep -Eo '[0-9]{3,5}')

}

##############################################
# function CPU_check_adjust
# Get CPU temp.  Calculate a new DUTY_CPU.
# Send to function adjust_fans.
##############################################
function CPU_check_adjust {
   #   Another IPMITOOL method of checking CPU temp:
   #   CPU_TEMP=$($IPMITOOL sdr | grep "CPU Temp" | grep -Eo '[0-9]{2,5}')
   if [[ $CPU_TEMP_SYSCTL == 1 ]]; then
       # Find hottest CPU core
       MAX_CORE_TEMP=0
       for CORE in $(seq 0 $CORES)
       do
           CORE_TEMP="$(sysctl -n dev.cpu.${CORE}.temperature | awk -F '.' '{print$1}')"
           if [[ $CORE_TEMP -gt $MAX_CORE_TEMP ]]; then MAX_CORE_TEMP=$CORE_TEMP; fi
       done
       CPU_TEMP=$MAX_CORE_TEMP
   else
       CPU_TEMP=$($IPMITOOL sensor get "CPU Temp" | awk '/Sensor Reading/ {print $4}')
   fi

   DUTY_CPU_LAST=$DUTY_CPU

   # This will break if settings have non-integers
   let DUTY_CPU="$(( (CPU_TEMP - CPU_REF) * CPU_SCALE + DUTY_CPU_MIN ))"

   # Don't allow duty cycle outside min-max
   if [[ $DUTY_CPU -gt $DUTY_CPU_MAX ]]; then DUTY_CPU=$DUTY_CPU_MAX; fi
   if [[ $DUTY_CPU -lt $DUTY_CPU_MIN ]]; then DUTY_CPU=$DUTY_CPU_MIN; fi

   adjust_fans $ZONE_CPU $DUTY_CPU $DUTY_CPU_LAST

   # Use this short CPU cycle to also allow PER fans to come down
   # if PD < 0 and drives are at least 1 C below setpoint
   # (e.g, after high demand or if 100% at startup).
   # With multiple CPU cycles and no new drive temps, this will
   # drive fans to DUTY_PER_MIN, but that's ok if drives are that cool.
   # However, this is experimental.
    if [[ PD -lt 0 && (( $(bc <<< "scale=2; $Tmean < ($SP-1)") == 1 )) ]]; then
        DUTY_PER_LAST=$DUTY_PER
        DUTY_PER=$(( DUTY_PER + PD ))
        # Don't allow duty cycle below min
        if [[ $DUTY_PER -lt $DUTY_PER_MIN ]]; then DUTY_PER=$DUTY_PER_MIN; fi
        # pass to the function adjust_fans
        adjust_fans $ZONE_PER $DUTY_PER $DUTY_PER_LAST
    fi

    sleep $CPU_T

    if [ $CPU_LOG_YES == 1 ] ; then
        print_interim_CPU | tee -a $CPU_LOG >/dev/null
    fi

    # This will call user-defined function if it exists (see config).
    declare -f -F Post_CPU_check_adjust >/dev/null && Post_CPU_check_adjust
}

##############################################
# function DRIVES_check_adjust
# Print time on new log line.
# Go through each drive, getting and printing
# status and temp.  Calculate max and mean
# temp, then calculate PID and new duty.
# Call adjust_fans.
##############################################
function DRIVES_check_adjust {
   Tmax=0; Tsum=0  # initialize drive temps for new loop through drives
   i=0             # initialize count of spinning drives
   while read -r LINE ; do
      get_disk_name
      /usr/sbin/smartctl -a -n standby "/dev/$DEVID" > /var/tempfile
      RETURN=$?  # have to preserve return value or it changes
      BIT0=$(( RETURN & 1 ))
      BIT1=$(( RETURN & 2 ))
      if [ $BIT0 -eq 0 ]; then
         if [ $BIT1 -eq 0 ]; then
            STATUS="*"  # spinning
         else  # drive found but no response, probably standby
            STATUS="_"
         fi
      else   # smartctl returns 1 (00000001) for missing drive
         STATUS="?"
      fi

      TEMP=""
      # Update temperatures each drive; spinners only
      if [ "$STATUS" == "*" ] ; then
         # Taking 10th space-delimited field for most SATA:
         if grep -Fq "Temperature_Celsius" /var/tempfile ; then
             TEMP=$( cat /var/tempfile | grep "Temperature_Celsius" | awk '{print $10}')
         # Else assume SAS, their output is:
         #     Transport protocol: SAS (SPL-3) . . .
         #     Current Drive Temperature: 45 C
         else
             TEMP=$( cat /var/tempfile | grep "Drive Temperature" | awk '{print $4}')
         fi
         let "Tsum += $TEMP"
         if [[ $TEMP -gt $Tmax ]]; then Tmax=$TEMP; fi   # -gt compares numerically; > inside [[ ]] compares strings
         let "i += 1"
      fi
      printf "%s%-2d  " "$STATUS" "$TEMP"
   done <<< "$DEVLIST"

   DUTY_PER_LAST=$DUTY_PER

   # if no disks are spinning
    if [ $i -eq 0 ]; then
        Tmean=""; Tmax=""; P=""; D=""; ERRc=""
        DUTY_PER=$DUTY_PER_MIN
    else
    # summarize, calculate PD and print Tmax and Tmean
        # Need ERRc value if all drives had been spun down last time
        if [[ $ERRc == "" ]]; then ERRc=0; fi

        Tmean=$(bc <<< "scale=2; $Tsum / $i" )
        ERRp=$ERRc        # save previous error before calculating current
        ERRc=$(bc <<< "scale=2; ($Tmean - $SP) / 1" )
        P=$(bc <<< "scale=3; ($Kp * $ERRc) / 1" )
        D=$(bc <<< "scale=4; $Kd * ($ERRc - $ERRp) / $DRIVE_T" )
        PD=$(bc <<< "$P + $D" )  # add corrections

        # round for printing
        Tmean=$(printf %0.2f "$Tmean")
        ERRc=$(printf %0.2f "$ERRc")
        P=$(printf %0.2f "$P")
        D=$(printf %0.2f "$D")
        PD=$(printf %0.f "$PD")  # must be integer for duty

        let "DUTY_PER = $DUTY_PER_LAST + $PD"

        # Don't allow duty cycle outside min-max
        if [[ $DUTY_PER -gt $DUTY_PER_MAX ]]; then DUTY_PER=$DUTY_PER_MAX; fi
        if [[ $DUTY_PER -lt $DUTY_PER_MIN ]]; then DUTY_PER=$DUTY_PER_MIN; fi
    fi

   # DIAGNOSTIC variables - uncomment for troubleshooting:
   # printf "\n DUTY_PER=%s, DUTY_PER_LAST=%s, DUTY=%s, Tmean=%s, ERRp=%s \n" "${DUTY_PER:---}" "${DUTY_PER_LAST:---}" "${DUTY:---}" "${Tmean:---}" $ERRp

   # pass to the function adjust_fans
   adjust_fans $ZONE_PER $DUTY_PER $DUTY_PER_LAST

   # DIAGNOSTIC variables - uncomment for troubleshooting:
   # printf "\n DUTY_PER=%s, DUTY_PER_LAST=%s, DUTY=%s, Tmean=%s, ERRp=%s \n" "${DUTY_PER:---}" "${DUTY_PER_LAST:---}" "${DUTY:---}" "${Tmean:---}" $ERRp

   # print current Tmax, Tmean
   printf "^%-3s %5s" "${Tmax:---}" "${Tmean:----}"

    # This will call user-defined function if it exists (see config).
    declare -f -F Post_DRIVES_check_adjust >/dev/null && Post_DRIVES_check_adjust
}

##############################################
# function adjust_fans
# Zone, new duty, and last duty are passed as parameters
##############################################
function adjust_fans {
   # parameters passed to this function
   ZONE=$1
   DUTY=$2
   DUTY_LAST=$3

   # Change if different from last duty, or the first time.
   if [[ $DUTY -ne $DUTY_LAST ]] || [[ FIRST_TIME -eq 1 ]]; then
      # Set new duty cycle. "echo -n ``" prevents newline generated in log
      echo -n "$($IPMITOOL raw 0x30 0x70 0x66 1 "$ZONE" "$DUTY")"
   fi
   FIRST_TIME=0
}

##############################################
# function print_interim_CPU
# Sent to a separate file by the call
# in CPU_check_adjust{}
##############################################
function print_interim_CPU {
   RPM=$($IPMITOOL sdr | grep  "$RPM_CPU" | grep -Eo '[0-9]{2,5}')
   # print time on each line
   TIME=$(date "+%H:%M:%S"); echo -n "$TIME  "
   printf "%7s %5d %5d \n" "${RPM:----}" "$CPU_TEMP" "$DUTY"
}

##############################################
# function mismatch_test
# Tests for mismatch
# between fan duty and fan RPMs
##############################################

function mismatch_test {
    MISMATCH=0; MISMATCH_CPU=0; MISMATCH_PER=0

    # ${!RPM_*} gets updated value of the variable RPM_* points to
    if [[ (DUTY_CPU -ge 95 && ${!RPM_CPU} -lt RPM_CPU_MAX) || (DUTY_CPU -lt 25 && ${!RPM_CPU} -gt RPM_CPU_30) ]] ; then
        MISMATCH=1; MISMATCH_CPU=1
        printf "\n%s\n" "Mismatch between CPU Duty and RPMs -- DUTY_CPU=$DUTY_CPU; RPM_CPU=${!RPM_CPU}"
    fi
    if [[ (DUTY_PER -ge 95 && ${!RPM_PER} -lt RPM_PER_MAX) || (DUTY_PER -lt 25 && ${!RPM_PER} -gt RPM_PER_30) ]] ; then
        MISMATCH=1; MISMATCH_PER=1
        printf "\n%s\n" "Mismatch between PER Duty and RPMs -- DUTY_PER=$DUTY_PER; RPM_PER=${!RPM_PER}"
    fi
}

##############################################
# function force_set_fans
# Used each cycle if a mismatch is detected and
# after BMC reset
##############################################
function force_set_fans {
    if [ $MISMATCH_CPU == 1 ]; then
        FIRST_TIME=1  # forces adjust_fans to do it
        adjust_fans $ZONE_CPU $DUTY_CPU $DUTY_CPU_LAST
        echo "Attempting to fix CPU mismatch  "
        sleep 5
    fi
    if [ $MISMATCH_PER == 1 ]; then
        FIRST_TIME=1
        adjust_fans $ZONE_PER $DUTY_PER $DUTY_PER_LAST
        echo "Attempting to fix PER mismatch  "
        sleep 5
    fi
}

##############################################
# function reset_bmc
# Triggered after 2 attempts to fix mismatch
# between fan duty and fan RPMs
##############################################

function reset_bmc {
    TIME=$(date "+%H:%M:%S"); echo -n "$TIME  "
    echo -n "Resetting BMC after second attempt failed to fix mismatch -- "
    $IPMITOOL bmc reset cold
    sleep 120
    read_fan_data
}

#####################################################
# SETUP
# All this happens only at the beginning
# Initializing values, list of drives, print header
#####################################################
# Print settings at beginning of log
printf "\n****** SETTINGS ******\n"
printf "CPU zone %s; Peripheral zone %s\n" $ZONE_CPU $ZONE_PER
printf "CPU fans min/max duty cycle: %s/%s\n" $DUTY_CPU_MIN $DUTY_CPU_MAX
printf "PER fans min/max duty cycle: %s/%s\n" $DUTY_PER_MIN $DUTY_PER_MAX
printf "CPU fans - measured RPMs at 30%% and 100%% duty cycle: %s/%s\n" $RPM_CPU_30 $RPM_CPU_MAX
printf "PER fans - measured RPMs at 30%% and 100%% duty cycle: %s/%s\n" $RPM_PER_30 $RPM_PER_MAX
printf "Drive temperature setpoint (C): %s\n" $SP
printf "Kp=%s, Kd=%s\n" $Kp $Kd
printf "Drive check interval (main cycle; minutes): %s\n" $DRIVE_T
printf "CPU check interval (seconds): %s\n" $CPU_T
printf "CPU reference temperature (C): %s\n" $CPU_REF
printf "CPU scalar: %s\n" $CPU_SCALE

if [ $HOW_DUTY == 1 ] ; then
    printf "Reading fan duty from board \n"
else
    printf "Assuming fan duty as set \n" ; fi

# Check if CPU Temp is available via sysctl (will likely fail in a VM)
CPU_TEMP_SYSCTL=$(($(sysctl -a | grep dev.cpu.0.temperature | wc -l) > 0))
if [[ $CPU_TEMP_SYSCTL == 1 ]]; then
    printf "Getting CPU temperatures via sysctl \n"
    # Get number of CPU cores to check for temperature
    # -1 because numbering starts at 0
    CORES=$(($(sysctl -n hw.ncpu)-1))
else
    printf "Getting CPU temperature via ipmitool (sysctl not available) \n"
fi

CPU_LOOPS=$( bc <<< "$DRIVE_T * 60 / $CPU_T" )  # Number of whole CPU loops per drive loop
I=0; ERRc=0  # Initialize errors to 0
FIRST_TIME=1

# Alter RPM thresholds to allow some slop
RPM_CPU_30=$(echo "scale=0; 1.2 * $RPM_CPU_30 / 1" | bc)
RPM_CPU_MAX=$(echo "scale=0; 0.8 * $RPM_CPU_MAX / 1" | bc)
RPM_PER_30=$(echo "scale=0; 1.2 * $RPM_PER_30 / 1" | bc)
RPM_PER_MAX=$(echo "scale=0; 0.8 * $RPM_PER_MAX / 1" | bc)

# Get list of drives
# DEVLIST1=$(/sbin/camcontrol devlist) # not working in debian, no camcontrol available
DEVLIST1=$(/usr/bin/lsblk -d -o KNAME,TYPE,SIZE,MODEL,SERIAL,HCTL)
# Remove lines with non-spinning devices; edit as needed
# You could use another strategy, e.g., find something in the camcontrol devlist
# output that is unique to the drives you want, for instance only WDC drives:
# if [[ $LINE != *"WDC"* ]] . . .
# DEVLIST="$(echo "$DEVLIST1"|sed '/KINGSTON/d;/ADATA/d;/SanDisk/d;/OCZ/d;/LSI/d;/EXP/d;/INTEL/d;/TDKMedia/d;/SSD/d;/VMware/d;/Enclosure/d;/Card/d;/Flash/d')"
DEVLIST=$(echo "$DEVLIST1"|grep -E '\b(ST|WDC)+'|awk '{print $1}')
DEVCOUNT=$(echo "$DEVLIST" | wc -l)

# These variables hold the name of the other variables, whose
# value will be obtained by indirect reference.  Don't ask.
if [[ ZONE_PER -eq 0 ]]; then
   RPM_PER=FAN5
   RPM_CPU=FANA
else
   RPM_PER=FANA
   RPM_CPU=FAN5
fi

read_fan_data

# If mode not Full, set it to avoid BMC changing duty cycle
# Need to wait a tick or it may not get next command
# "echo -n" to avoid annoying newline generated in log
if [[ MODE -ne 1 ]]; then
   echo -n "$($IPMITOOL raw 0x30 0x45 1 1)"
   sleep 1
fi

# Need to start fan duty at a reasonable value if fans are
# going fast or we didn't read DUTY_* in read_fan_data
# (second test is TRUE if DUTY_* is unset).
if [[ ${!RPM_PER} -ge RPM_PER_MAX || -z ${DUTY_PER+x} ]]; then
   echo -n "$($IPMITOOL raw 0x30 0x70 0x66 1 $ZONE_PER 50)"
   DUTY_PER=50; sleep 1
fi
if [[ ${!RPM_CPU} -ge RPM_CPU_MAX || -z ${DUTY_CPU+x} ]]; then
   echo -n "$($IPMITOOL raw 0x30 0x70 0x66 1 $ZONE_CPU 50)"
   DUTY_CPU=50; sleep 1
fi

# Before starting, go through the drives to report if
# smartctl return value indicates a problem (>2).
# Use -a so that all return values are available.
while read -r LINE ; do
   get_disk_name
   /usr/sbin/smartctl -a -n standby "/dev/$DEVID" > /var/tempfile
   if [ $? -gt 2 ]; then
      printf "\n"
      printf "*******************************************************\n"
      printf "* WARNING - Drive %-4s has a record of past errors,   *\n" "$DEVID"
      printf "* is currently failing, or is not communicating well. *\n"
      printf "* Use smartctl to examine the condition of this drive *\n"
      printf "* and conduct tests. Status symbol for the drive may  *\n"
      printf "* be incorrect *but probably not*.                    *\n"
      printf "*******************************************************\n"
   fi
done <<< "$DEVLIST"

printf "\n%s %36s %s \n" "Key to drive status symbols:  * spinning;  _ standby;  ? unknown" "Version" $VERSION
print_header

# for first round of printing
CPU_TEMP=$(echo "$SDR" | grep "CPU Temp" | grep -Eo '[0-9]{2,5}')

# Initialize CPU log
if [ $CPU_LOG_YES == 1 ] ; then
    printf "%s \n%s \n%17s %5s %5s \n" "$DATE" "Printed every CPU cycle" $RPM_CPU "Temp" "Duty" | tee $CPU_LOG >/dev/null
fi

###########################################
# Main loop through drives every DRIVE_T minutes
# and CPU every CPU_T seconds
###########################################
while true ; do
   # Print header every quarter day.  awk removes any
   # leading 0 so it is not seen as octal
   HM=$(date +%k%M)
   HM=$( echo $HM | awk '{print $1 + 0}' )
   R=$(( HM % 600 ))  # remainder after dividing by 6 hours
   if (( R < DRIVE_T )); then
      print_header;
   fi

#
# Main stuff
#
    echo                                         # start new line
    TIME=$(date "+%H:%M:%S"); echo -n "$TIME  "  # print time on each line

    DRIVES_check_adjust                          # prints drive data also

    sleep 5  # Let fans equilibrate to duty before reading them
    read_fan_data


   printf "%7s %6s %6.6s %4s %-7s %3d %3d %6s %5s %5s %5s %5s %5s %5s %5s" "${ERRc:----}" "${P:----}" "${D:----}" "$CPU_TEMP" $MODEt $DUTY_CPU $DUTY_PER "${FANA:----}" "${FANB:----}" "${FAN1:----}" "${FAN2:----}" "${FAN3:----}" "${FAN4:----}" "${FAN5:----}"

# Test loop for BMC reset.  Exit loop if no mismatch found between duty and rpm,
# or after 2 attempts to fix lead to bmc reset and a third attempt to fix.
# This should happen after reading fans so CPU loops don't result in false mismatch.

    ATTEMPTS=0  # Number of attempts to fix duties
    mismatch_test

    while true; do

        if [ $MISMATCH == 1 ]; then
            force_set_fans
            let "ATTEMPTS += 1"
            read_fan_data
            mismatch_test
        else
            break   # exit loop
        fi

        if [ "$ATTEMPTS" == 2 ]; then
            if [ "$MISMATCH" == 1 ]; then
                reset_bmc
                force_set_fans
                read_fan_data
                mismatch_test
            else
                break   # exit loop
            fi
        fi

        if [ $ATTEMPTS == 3 ]; then
            break
        fi
    done


    # CPU loop
    i=0
    while [ $i -lt "$CPU_LOOPS" ]; do
        CPU_check_adjust
        let i=i+1
    done
done

# For SuperMicro motherboards with dual fan zones.
# Adjusts fans based on drive and CPU temperatures.
# Includes disks on motherboard and on HBA.
# Mean drive temp is maintained at a setpoint using a PID algorithm.
# CPU temp need not and cannot be maintained at a setpoint,
# so PID is not used; instead fan duty cycle is simply
# increased with temp using reference and scale settings.

# Drives are checked and fans adjusted on a set interval, such as 5 minutes.
# Logging is done at that point.  CPU temps can spike much faster,
# so are checked and logged at a shorter interval, such as 1-15 seconds.
# CPUs with high TDP probably require short intervals.

# Logs:
#   - Disk status (* spinning or _ standby)
#   - Disk temperature (Celsius) if spinning
#   - Max and mean disk temperature
#   - Temperature error and PID variables
#   - CPU temperature
#   - RPM for FANA and FAN1-4 before new duty cycles
#   - Fan mode
#   - New fan duty cycle in each zone
#   - In CPU log:
#        - RPM of the first fan in CPU zone (FANA or FAN1)
#        - CPU temperature
#        - new CPU duty cycle

#  Relation between percent duty cycle, hex value of that number,
#  and RPMs for my fans.  RPM will vary among fans, is not
#  precisely related to duty cycle, and does not matter to the script.
#  It is merely reported.
#
#  Percent    Hex    RPM
#     10       A     300
#     20      14     400
#     30      1E     500
#     40      28     600/700
#     50      32     800
#     60      3C     900
#     70      46     1000/1100
#     80      50     1100/1200
#     90      5A     1200/1300
#    100      64     1300

################
# Tuning Advice
################
# PID tuning advice on the internet generally does not work well in this application.
# First run the script spincheck.sh and get familiar with your temperature and fan variations without any intervention.
# Choose a setpoint that is an actual observed Tmean, given the number of drives you have.  It should be the Tmean associated with the Tmax that you want.
# Start with Kp low.  Find the lowest ERRc (which is Tmean - setpoint) in the output other than 0 (don't worry about sign +/-).  Set Kp to 0.5 / ERRc, rounded up to an integer.  My lowest ERRc is 0.14.  0.5 / 0.14 is 3.6, and I find Kp = 4 is adequate.  Higher Kp will give a more aggressive response to error, but the downside may be overshooting the setpoint and oscillation.  Kd offsets that, but raising them both makes things unstable and harder to tune.
# Set Kd at about Kp*10
# Get Tmean within ~0.3 degree of SP before starting script.
# Start script and run for a few hours or so.  If Tmean oscillates (best to graph it), you probably need to reduce Kd.  If no oscillation but response is too slow, raise Kd.
# Stop script and get Tmean at least 1 C off SP.  Restart.  If there is overshoot and it goes through some cycles, you may need to reduce Kd.
# If you have problems, examine P and D in the log and see which is messing you up.

# Uses joeschmuck's smartctl method for drive status (returns 0 if spinning, 2 in standby)
# https://forums.freenas.org/index.php?threads/how-to-find-out-if-a-drive-is-spinning-down-properly.2068/#post-28451
# Other method (camcontrol cmd -a) doesn't work with HBA
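
One possible alternative to matching vendor names in the DEVLIST line (line 379), offered as an untested sketch rather than a drop-in replacement: lsblk can print a ROTA column (1 for rotational devices), so spinning disks could be selected without listing brands. The function name spinning_disks is made up for this example, and the sample rows are invented. ROTA is only as reliable as what the kernel reports; disks behind some USB bridges, RAID controllers, or hypervisors may report the wrong value, so check the output by hand first.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "KNAME TYPE ROTA" rows on stdin and prints
# the kernel names of whole disks that report themselves as rotational.
spinning_disks() {
    awk '$2 == "disk" && $3 == 1 { print $1 }'
}

# Intended use on SCALE (assumption, not run here):
#   DEVLIST=$(lsblk -d -n -o KNAME,TYPE,ROTA | spinning_disks)

# Demonstration with invented sample rows instead of a live lsblk call.
printf '%s\n' 'sda disk 1' 'sdb disk 1' 'nvme0n1 disk 0' 'sr0 rom 1' | spinning_disks
```

With this approach the list no longer depends on the drive brand, but SSDs that wrongly report ROTA=1 would still need to be excluded by name.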
 

Sjöhaga

Dabbler
Joined
Apr 17, 2016
Messages
41
Got a great deal on a refurbished 118G-1400B server with a X10SRG-F motherboard in it. Yes I know - not the typical NAS box :). But I wanted something to try out Scale on and I didn't have anything usable so I am very happy with this deal.
This motherboard comes with eight fans and eight fan headers, 1-4 and A-D. I am probably not going to need all those fans, but something piqued my interest.

I was testing out the spintest script and not all fans were changing speeds. After changing the script to list all eight fans, I saw that fans C & D stayed at full RPM throughout the test.

Does anyone know what raw values might help me here? The fans change when the fan mode is changed, so they should be controllable.

regards
1d
 
Joined
Dec 2, 2015
Messages
730
Are Fan Headers A & B behaving the same as C & D? The manual says that fans A - D are I/O fans, whereas Fans 1 to 4 are system and CPU fans. Other Supermicro boards which I have used only describe System and CPU fans - System fans are controlled by one fan bus command, and CPU by another. So, maybe your board has a third fan bus. Try incrementing the fan header number in the raw command to see if you strike gold.
 

Sjöhaga

Dabbler
Joined
Apr 17, 2016
Messages
41
Of course you are correct: $> ipmitool raw 0x30 0x70 0x66 1 2 nn does control fans C & D.
Cool, let's see if I can get this script working now :)

And yes, this server is not really built for NAS duty; it is designed to be used with two graphics cards, hence all the fans. But it makes for a very nice test rig.
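
For anyone wanting to try the same thing, here is a rough sketch of a wrapper around the raw command above. The raw bytes 0x30 0x70 0x66 1 ZONE DUTY are the ones used in this thread; the function name set_zone_duty and the DRY_RUN switch are invented for this example, and zone 2 is only known to exist on the X10SRG-F discussed here. The BMC may override the duty unless the fan mode is Full, which spinpid2.sh sets at startup.

```shell
#!/usr/bin/env bash

# Assumption: IPMITOOL points at the ipmitool binary, as in spinpid2.config.
IPMITOOL=${IPMITOOL:-ipmitool}

# Hypothetical helper: set one fan zone to a duty cycle between 0 and 100.
# With DRY_RUN=1 (the default here) it only prints the command.
set_zone_duty() {
    zone=$1
    duty=$2
    case "$zone" in ''|*[!0-9]*) echo "zone must be a number" >&2; return 1 ;; esac
    case "$duty" in ''|*[!0-9]*) echo "duty must be a number" >&2; return 1 ;; esac
    if [ "$duty" -gt 100 ]; then
        echo "duty must be 0-100" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$IPMITOOL raw 0x30 0x70 0x66 1 $zone $duty"
    else
        "$IPMITOOL" raw 0x30 0x70 0x66 1 "$zone" "$duty"
    fi
}

# Zones 0 and 1 are the two the script already drives; zone 2 is the one
# that controlled fans C & D on this board.
for z in 0 1 2; do
    set_zone_duty "$z" 50
done
```

Run it once with the default dry run to check the printed commands, then set DRY_RUN=0 to actually send them.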
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
That's great you got control over the other fans. It sounds like you have three fan zones, which I haven't encountered before and the script isn't written for.

Plus, it would have to be changed for Scale; I've never experimented with that but the previous few posts address that.

I have been working on changing the scripts to work in zsh instead of bash, but that's another story. Keep us posted.
 

Slownas

Cadet
Joined
Mar 14, 2021
Messages
2
@sretalla
Thank you for working on the script!
I have successfully run it on TrueNAS SCALE Bluefin, but after I updated to Cobia it no longer works. It seems Cobia has removed the JSON Perl module. Would you have any idea or solution for this?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Sure, it seems you're right and the powers that be have removed that Perl module for some reason (or accident).

Since that imported JSON module is only used once... and I see it's only really used to grab the temperatures from the sensors output... it can certainly be done natively in Perl instead...

I would do it by omitting the -j switch from the sensors command and then parsing the output with regexp... (I also saw an opportunity to radically reduce the code in that section to share the function between SCALE and CORE, so did that too).

If you comment out the JSON declaration on line 539, then replace the get_cpu_temp_sysctl function (line 1467+) with this code:

Code:
sub get_cpu_temp_sysctl {

    #----------------------------------------------------------------------
    # significantly more efficient to filter to dev.cpu than to just grep the whole lot!
    # *cough*
    # significantly more efficient to only spawn one subprocess for sysctl than a pipeline with
    # egrep, awk, sed, kitchensink, garagedooropener, all to do what Perl is good at
    #----------------------------------------------------------------------
    my @core_temps_list;
    my $max_core_temp = 0;
    my @cmd;
    my $pattern = qr/(?|Core \d+:\s+\+(\d+\.\d)°C|dev\.cpu\.\d+\.temperature: (\d+\.\d)C)/;
    if ($operating_system eq 'freebsd' ) {
        @cmd = ('sysctl', '-a', 'dev.cpu');
    }
    elsif ($operating_system eq 'linux' ) {
        @cmd = ('sensors');
    }
    # In list context m//g returns every captured temperature; iterate over
    # those values ($_), because $1 only holds the capture of the last match.
    my @temps = join("\n", run_command(@cmd)) =~ m/$pattern/g;
    foreach (@temps) {
        push(@core_temps_list, $_);
        dprint( 2, "core_temp = $_ C" );
        $max_core_temp = $_ if $_ > $max_core_temp;
    }

      dprint_list( 4, "core_temps_list", @core_temps_list );

      dprint( 1, "CPU Temp: $max_core_temp" );

      # possible that this is 0 if there was a fault reading the core temps
      $last_cpu_temp = $max_core_temp;


    return $max_core_temp;
}


It should work... not extensively tested.
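
For readers using the bash script from earlier in the thread rather than the Perl one, the same idea (plain sensors output, no JSON) might look roughly like this. It is only a sketch: the function name max_core_temp is invented, the sample lines are made up, and it assumes the "Core N:  +45.0°C" layout that the coretemp driver prints on Intel CPUs. Other drivers label their readings differently and would need a different match.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads plain `sensors` output on stdin and prints
# the highest "Core N:" temperature as a whole number of degrees C.
max_core_temp() {
    awk '
        /^Core [0-9]+:/ {
            t = $3                  # third field looks like +45.0°C
            gsub(/[^0-9.]/, "", t)  # keep only digits and the decimal point
            t = int(t)
            if (t > max) max = t
        }
        END { print max + 0 }       # prints 0 if no core lines were found
    '
}

# Intended use (assumption, not run here):  CPU_TEMP=$(sensors | max_core_temp)

# Demonstration with invented sample output.
printf '%s\n' \
    'coretemp-isa-0000' \
    'Package id 0:  +48.0°C  (high = +80.0°C, crit = +100.0°C)' \
    'Core 0:        +45.0°C  (high = +80.0°C, crit = +100.0°C)' \
    'Core 1:        +47.0°C  (high = +80.0°C, crit = +100.0°C)' | max_core_temp
```

As in the Perl version, a result of 0 means no core temperature could be read, so the caller should treat it as a fault rather than a cold CPU.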
 

Sjöhaga

Dabbler
Joined
Apr 17, 2016
Messages
41
Realized when I got the notification today for new posts that I had forgotten to check back with you guys.

After messing around with everything that was needed, I actually got to a point where I am quite happy with the scripts. Of course they are never finished, but they work well enough for my needs.

If anyone wants to use them, please steal everything from my GitHub repo truenas-stuff.
The spinpid scripts are all modified to work on Debian. In theory they should work on both FreeBSD and Linux with only the shebang on the first line needing an edit. To tell the truth, only very minimal testing has been done on FreeBSD, so, ehm, well, I hope it still works :D

Some of the manual config options are removed; it uses the uname and which commands to try to find the things it needs. As my setup is all SSD, I naturally removed the block and instead added a few more search patterns to match some Intel and other SSDs I have.

They are also modified to display 8 fans and 3 fan zones. I may have broken something in the process too, who knows.

The fan control now plays very nicely with my odd Supermicro board. I should do some more tweaking to slow the attack rate, but since it works, I guess that'll never happen.

All credit goes to Glorious1 for making this possible.
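
On slowing the attack rate: in the spinpid2.sh posted earlier in this thread, the duty change per drive cycle is PD = P + D, with P = Kp * ERRc and D = Kd * (ERRc - ERRp) / DRIVE_T, so smaller Kp and Kd give smaller steps. The sketch below reproduces that arithmetic for trying out values. The function name pd_correction and the sample numbers are invented, and it uses awk instead of bc, so rounding may differ slightly from the real script.

```shell
#!/usr/bin/env bash

# Hypothetical helper mirroring the P and D terms of spinpid2.sh.
# Arguments: Tmean SP ERRp Kp Kd DRIVE_T; prints the rounded duty change.
pd_correction() {
    awk -v tmean="$1" -v sp="$2" -v errp="$3" \
        -v kp="$4" -v kd="$5" -v drive_t="$6" 'BEGIN {
            errc = tmean - sp                    # current error
            p = kp * errc                        # proportional term
            d = kd * (errc - errp) / drive_t     # derivative term
            printf "%.0f\n", p + d               # duty change must be an integer
        }'
}

# Drives 0.5 C above the setpoint and still warming (previous error 0.2 C).
pd_correction 36.5 36 0.2 4 40 5
# Same readings with Kp and Kd halved.
pd_correction 36.5 36 0.2 2 20 5
```

Comparing the two calls shows how much gentler each step becomes when both gains are halved.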
 

sanello

Cadet
Joined
Jun 8, 2023
Messages
4
Hello,
Is there a way to exclude drives from the "Setpoint mean drive temperature" calculation? I have two 2.5-inch drives that run significantly cooler than the 3.5-inch drives.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Depending on the script you're running, there's usually an option to edit the filtering of the disk list with some keywords to ensure certain types of disks don't get added in the list of disks to check.

In my version, that code looks like this (for now... I'll be changing it a lot soon to make it easier to have users of the script set their own options):
Code:
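sub get_hd_list {
    my @vals;
    if ($operating_system eq 'freebsd') {
        # CORE: one line per device from camcontrol
        my @freebsdcmd = ('camcontrol', 'devlist');
        foreach (run_command(@freebsdcmd)) {
            # skip any device whose line matches one of these keywords
            next if (/SSD|Verbatim|Kingston|Elements|Enclosure|Virtual|KINGSTON/);
            # device name appears as (passN,adaN) or (adaN,passN)
            if (/\((?:pass\d+,(a?da\d+)|(a?da\d+),pass\d+)\)/) {
                # the name lands in $1 or $2 depending on which order matched
                my $dev = defined $1 ? $1 : $2;
                dprint(2, $dev);
                push(@vals, $dev);
            }
        }
        dprint_list(3, "@vals");
    }
    elsif ($operating_system eq 'linux') {
        # SCALE: sfdisk prints one multi-line chunk per disk
        my @linuxcmd = ('sfdisk', '-l');
        my $joinedcmd = join("\n", run_command(@linuxcmd));
        # chunks are separated by two blank lines (three newlines)
        my @drivechunks = split(/\n{3}/, $joinedcmd);
        foreach (@drivechunks) {
            # skip any disk whose chunk matches one of these keywords
            next if (/SSD|Verbatim|Kingston|Elements|Enclosure|Virtual|KINGSTON|mapper/);
            if (/^Disk\s+\/dev\/(s.+):/) {
                dprint(2, $1);
                push(@vals, $1);
            }
        }
        dprint_list(3, "@vals");
    }
    return @vals;
}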


What you would do in the case above is edit the line for your OS that begins with "next if", which effectively means "skip any drive whose text matches one of these keywords".

You would look at your drives with camcontrol devlist (CORE) or sfdisk -l (SCALE), pick a word from the first part of the disk's entry that is unique to the disks you want excluded, then add another pipe symbol (|) and that text to the list.
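
For the bash scripts in this thread, the same idea can be sketched in shell. This is only an illustration under assumptions: the sample lines are made up in the style of camcontrol devlist output, filter_drives is a hypothetical helper, and HTS7210 stands in for whatever keyword is unique to the drives you want to exclude.

```shell
#!/bin/sh
# Made-up sample in the style of "camcontrol devlist" output.
SAMPLE='<WDC WD40EFRX-68N32N0 82.00A82>    at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus1 target 0 lun 0 (pass1,ada1)
<Samsung SSD 860 EVO 500GB RVT03B6Q> at scbus2 target 0 lun 0 (pass2,ada2)
<HGST HTS721010A9E630 JB0OA3J0>    at scbus3 target 0 lun 0 (pass3,ada3)'

# Keywords to skip, separated by pipes. "HTS7210" is the hypothetical
# extra keyword appended to exclude the 2.5-inch drives.
EXCLUDE='SSD|Verbatim|Kingston|Elements|Enclosure|Virtual|KINGSTON|HTS7210'

# Drop excluded lines, then pull the device name out of the parentheses.
filter_drives() {
    grep -Ev "$EXCLUDE" | sed -nE 's/.*[(,](a?da[0-9]+)[,)].*/\1/p'
}

printf '%s\n' "$SAMPLE" | filter_drives   # prints ada0 and ada1
```

Only ada0 and ada1 survive: the SSD line is dropped by the existing keyword and the 2.5-inch drive by the one that was appended.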
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
Hello,
Is there a way to exclude drives from the "Setpoint mean drive temperature" calculation? I have two 2.5-inch drives that run significantly cooler than the 3.5-inch drives.
@sretalla's method is the most direct. In my script, which is the basis of this thread (see the Download button at the top of the page), that is at line 375, and there is a brief explanation of how to discover the unique characters and add them to that list.

However, there is a much easier alternative. I just discovered that I have neglected to update the scripts posted here with this option and other changes, and will try to do that in the next few days. There is a 'new' configuration setting called NUMKEEP. It is the number of warmest drives you want kept in the mean calculation. So if you set this equal to or less than your number of 3.5-inch drives, cooler drives will be ignored in comparing drive temperatures to the set point.
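
Roughly, the idea can be sketched like this. It is an illustration only, not the code from the script: mean_of_warmest is a hypothetical helper and the temperatures are invented.

```shell
#!/bin/sh
# Illustration only; not the code from the actual script.
NUMKEEP=5

# Read one temperature per line on stdin; print the mean of the $1 warmest.
mean_of_warmest() {
    sort -rn | head -n "$1" |
        awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}

# Invented readings: five 3.5-inch drives and two cooler 2.5-inch drives.
printf '%s\n' 38 40 39 41 37 26 27 | mean_of_warmest "$NUMKEEP"   # prints 39.00
printf '%s\n' 38 40 39 41 37 26 27 | mean_of_warmest 7            # prints 35.43
```

With all seven drives the two cool ones pull the mean down by more than three degrees; keeping only the five warmest removes that bias without touching the drive list.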
 

sanello

Cadet
Joined
Jun 8, 2023
Messages
4
This sounds perfect. Would the log reflect the NUMKEEP config? My target temp for the 5 hottest would be 40 in this instance.

Code:
****** SETTINGS ******
CPU zone 0; Peripheral zone 1
CPU fans min/max duty cycle: 20/100
PER fans min/max duty cycle: 40/100
CPU fans - measured RPMs at 30% and 100% duty cycle: 3700/7200
PER fans - measured RPMs at 30% and 100% duty cycle: 800/2100
Drive temperature setpoint (C): 36.50
Kp=4, Kd=40
Drive check interval (main cycle; minutes): 5
CPU check interval (seconds): 5
CPU reference temperature (C): 40
CPU scalar: 6
Reading fan duty from board
Getting CPU temperatures via sysctl

Key to drive status symbols:  * spinning;  _ standby;  ? unknown                              Version 2020-08-20

Wednesday, Nov 29                                                             CPU         New_Fan%  New_RPM_____________________
          ada0 ada1 ada2 ada3 ada4 ada5 ada6 Tmax Tmean   ERRc      P      D TEMP MODE    CPU PER   FANA  FAN1  FAN2  FAN3  FAN4
20:58:33  *41  *44  *43  *43  *41  *24  *25  ^44  37.28   0.78   3.12   6.24   37 Full     20  62   1500   ---   ---  3200   ---
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
The setting shows in the SETTINGS section at the top of the log. Temperature of all the drives would be shown, as in your log, but Tmean would be the mean of only the 5 hottest.
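
To put numbers on that with the drive temperatures from your log line, here is a small sketch. It assumes ERRc = Tmean - setpoint and P = Kp * ERRc, which is what the log line suggests (37.28 - 36.50 = 0.78 and 4 * 0.78 = 3.12). The helper pid_terms is made up for illustration and is not part of the script.

```shell
#!/bin/sh
# Illustration only: relations inferred from the log line, not copied from the script.

# Print Tmean, ERRc and P for the warmest NUMKEEP temperatures on stdin.
# usage: pid_terms SETPOINT KP NUMKEEP
pid_terms() {
    sort -rn | head -n "$3" | awk -v sp="$1" -v kp="$2" '
        { sum += $1 }
        END {
            tmean = sum / NR
            err = tmean - sp
            printf "Tmean=%.2f ERRc=%.2f P=%.2f\n", tmean, err, kp * err
        }'
}

TEMPS='41 44 43 43 41 24 25'
printf '%s\n' $TEMPS | pid_terms 36.50 4 5   # Tmean=42.40 ERRc=5.90 P=23.60
printf '%s\n' $TEMPS | pid_terms 40 4 5      # Tmean=42.40 ERRc=2.40 P=9.60
```

So with NUMKEEP=5 and the current 36.50 setpoint the error would jump from 0.78 to 5.90; raising the setpoint to 40 brings it back to 2.40.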

Did you get the private message I sent you? I'm asking you to test the 2-zone version I just added this feature to. I have only a single-zone board. It works fine in the single-zone script.
 

sanello

Cadet
Joined
Jun 8, 2023
Messages
4
I have not received anything in my DMs.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
Hmm, I see my post in your profile. Anyway, would you please try it out? I'm attaching it here with the config file. You'll have to edit it with the settings from your current one.
 

Attachments

  • spinscripts_2023-12-01.zip
    11 KB · Views: 149

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I see my post in your profile
A profile post (actually I find it baffling that posts on an individual's profile are allowed for others) isn't a DM... you would need to use the envelope (conversations) icon in the top right of the forum and select the user to start a DM conversation with.
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
Thanks, sretalla. Well, posting the script on the forum gives anyone an opportunity to try it and let me know if it works.
 