Light-weight S.M.A.R.T. monitoring from command line with possible solution

puzavi

Cadet
Joined
Jul 29, 2021
Messages
2
Dear ALL,

Just a bit of history.

Just a few days ago I assembled and installed my first DIY NAS. I took a leftover motherboard and CPU, bought an HBA and hard drives, and now it works. I wanted it to be nearly silent, since it lives in the room where I work. Like many of us, I run SATA HDDs on an HBA in IT mode.

From the very beginning I worried about the health of my HDDs. For my build I selected the Fractal Design Define R5, a fairly big and good case, but I simply forgot to order good system fans. Because of this I had to start up with the two fans that came with the case. I installed them at the front of the case to cool the HDDs. So far I have had no issues with HDD temperatures, but I still worry about them.

I inspected the built-in TrueNAS HDD temperature monitoring facilities and found them inadequate for monitoring temperature in the long run. If temperatures become too high, I may get an email notifying me about cooked/fried HDDs. I made a brave experiment and stopped one of the fans while the system was under moderate load. It took just three minutes for the HDD temperatures to rise by 7 °C, to a value I am not comfortable with. In other words, a fan malfunction in such a system is deadly.

So I started to look for a more appropriate means to monitor HDD temperatures and react accordingly. My very first attempt was a script built on a mix of smartctl -a, grep and awk. There are tons of such examples on the Internet. Unfortunately, such a construction is quite heavy. It takes resources! Even with the boot pool on an SSD, it takes a good deal of resources to run smartctl -a for the 8 HDDs I have in the system. As a next step I moved to smartctl -A, a slightly more efficient variant that is supposed to query only the S.M.A.R.T. log and nothing more. Unfortunately, it is still too heavy. Also, smartctl wakes the drives from IDLE.

I know it is bad for drives to start and stop all the time. That is why I do not allow the drives to spin down, but do allow them to go to IDLE. This way I still get good savings in power consumption without putting too much stress on the mechanical parts.

I spent 3 days looking for a light-weight temperature monitor without any success. I wanted something simple that reports HDD temperatures and lets me react, preferably within a scripted environment. During my searches I found a few references to camcontrol. At first it looked like very dangerous black magic; however, it turned out to be capable of sending raw SATA commands to a drive. As a result, I found a way to send the SATA READ SMART LOG command to my drives and get 512 bytes of response back. Since /bin/bash and friends cannot process binary data with acceptable overhead, I selected Perl as my scripting environment. Please meet disk_temp.pl. It is still a work in progress, but it already contains all the necessary moving pieces.
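To make the raw command a little less magic, here is a rough sketch that splits the 12-byte string passed to camcontrol in disk_temp.pl into named fields. The field layout and the meanings in the comments are one reading of the ATA PASS-THROUGH (12) and SMART command definitions, not something stated in this thread, and the sub name decode_ata_pt12 is made up for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Split an ATA PASS-THROUGH (12) CDB, written as a hex string, into named fields.
# The layout is an assumption based on the SAT specification.
sub decode_ata_pt12 {
    my ($hex) = @_;
    my @b = map { hex($_) } split(' ', $hex);
    die "expected 12 bytes\n" unless @b == 12;
    return (
        opcode       => $b[0],                # A1h = ATA PASS-THROUGH (12)
        protocol     => ($b[1] >> 1) & 0x0F,  # 4 = PIO Data-In
        t_dir        => ($b[2] >> 3) & 1,     # 1 = data flows from the drive
        byt_blok     => ($b[2] >> 2) & 1,     # 1 = length counted in blocks
        t_length     => $b[2] & 3,            # 2 = length taken from sector count
        features     => $b[3],                # D0h = SMART READ DATA subcommand
        sector_count => $b[4],                # 1 block of 512 bytes
        lba_mid      => $b[6],                # 4Fh, first half of the SMART signature
        lba_high     => $b[7],                # C2h, second half of the SMART signature
        command      => $b[9],                # B0h = SMART
    );
}

my %cdb = decode_ata_pt12("A1 08 0E D0 01 00 4F C2 00 B0 00 00");
printf("%-12s = 0x%02X\n", $_, $cdb{$_}) for sort keys %cdb;
```

Read this way, the command asks the drive for one 512-byte block of S.M.A.R.T. data, which matches the 512 bytes requested with -i 512.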

I am not very familiar with Perl, so some parts could probably be rewritten in a more efficient way. However, it works, at least in my environment. It generates an acceptable load on my system and does not interfere with the system's main activities. This version just prints HDD temperatures, while in my environment I use the provided subroutines to monitor the HDDs and shut down the system if the temperature goes higher than 45 °C. Of course, this is only acceptable for home users, but that is who I am :smile:

Of course, I do not guarantee the script will work in other environments. It is just several subroutines that could help build a system that reacts more quickly than bare TrueNAS does.

I will be happy if some part of this code makes some SOHO admin's life easier :cool:
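The shutdown logic itself is not part of the posted script. A minimal sketch of how such a check might look is given here, assuming temperature hashes shaped like the ones Parse_SMART_Temperature returns (attribute ID as key, -1 for "not found"). The sub name drives_over_limit and the sample readings are hypothetical.

```perl
use strict;
use warnings;

# Return the names of drives whose hottest reported temperature exceeds the limit.
# %$temps maps a drive name to a hash of attribute ID => temperature,
# where -1 means the attribute was not found on that drive.
sub drives_over_limit {
    my ($temps, $limit) = @_;
    my @hot;
    foreach my $drive (sort keys %$temps) {
        my @found = grep { $_ != -1 } values %{ $temps->{$drive} };
        next unless @found;                  # nothing reported, nothing to judge
        my ($max) = sort { $b <=> $a } @found;
        push @hot, $drive if $max > $limit;
    }
    return @hot;
}

# Made-up readings, for illustration only.
my %sample = (
    da1 => { 190 => 38, 194 => 37, 231 => -1 },
    da2 => { 190 => 47, 194 => 46, 231 => -1 },
);
my @hot = drives_over_limit(\%sample, 45);
print "Too hot: @hot\n" if @hot;
# A real monitor might call system("shutdown", "-p", "now") here instead of printing.
```

Taking the maximum of all reported attributes is a cautious choice: the check trips as soon as either sensor crosses the limit.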

P.S. I do not process errors here. Error processing makes this code too complicated and is not relevant for demonstration purposes.

P.P.S. Make sure to change the line below to the list of drives appropriate for your system.

my $drives = "da1 da2 da3 da4 da5 da6 da7 da8";

P.P.P.S. Since I know how to communicate using SATL ATA PASS-THROUGH commands, it might be wise to develop a small Perl/Python binary extension to reduce the monitoring overhead further, to a negligible value :rolleyes:

P.P.P.P.S. In a similar way it is possible to extract any S.M.A.R.T. attribute from the S.M.A.R.T. log. Actually, camcontrol is a very flexible instrument. It is possible to do virtually everything with HDDs using it.
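As an illustration of extracting any attribute, the following sketch decodes every populated record from a 512-byte block instead of only the temperature ones. It assumes the commonly used record layout (ID, 2 bytes of flags, current, worst, 6 bytes of raw value, 1 reserved byte) starting at offset 2. The sub name parse_smart_attributes and the test block are invented for the example.

```perl
use strict;
use warnings;

# Decode every populated 12-byte attribute record from a 512-byte SMART data block.
# Assumed layout: byte 0 ID, 1-2 flags, 3 current, 4 worst, 5-10 raw (little-endian).
sub parse_smart_attributes {
    my ($data) = @_;                 # raw 512-byte string, as returned by backticks
    my %attr;
    for my $slot (0 .. 29) {
        my ($id, $flags, $current, $worst, @raw) =
            unpack("C v C C C6", substr($data, 2 + 12 * $slot, 12));
        next if $id == 0;            # unused slot
        my $raw = 0;
        $raw = ($raw * 256) + $_ for reverse @raw;
        $attr{$id} = { flags => $flags, current => $current,
                       worst => $worst, raw => $raw };
    }
    return %attr;
}

# Synthetic block with two records; the values are invented.
my $block = "\0" x 512;
substr($block, 2, 12)  = pack("C v C C C6 C", 190, 0x0022, 67, 55, 33, 0, 0, 0, 0, 0, 0);
substr($block, 14, 12) = pack("C v C C C6 C", 194, 0x0022, 33, 45, 33, 0, 0, 0, 0, 0, 0);

my %attr = parse_smart_attributes($block);
printf("%3d current=%d worst=%d raw=%d\n", $_, @{ $attr{$_} }{qw(current worst raw)})
    for sort { $a <=> $b } keys %attr;
```

On a live system the input would be the unmodified string returned by the camcontrol call rather than the synthetic block.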

--
With best regards,
Puzavi


Code:
#!/usr/local/bin/perl

# Initial experiments
#camcontrol cmd da1 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da2 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da3 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da4 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da5 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da6 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da7 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"
#camcontrol cmd da8 -v -c "A1 08 0E D0 01 00 4F C2 00 B0 00 00" -i 512 "i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1 i1"

use strict;
use warnings;
use bigint;

### Parameters ###
my $drives = "da1 da2 da3 da4 da5 da6 da7 da8";
my $AttrAirTempID = 190;        # Air temperature inside HDD body, value, or 100 - value
my $AttrHDATempID = 194;        # Temperature as reported by HDD built-in sensor
my $AttrTempID    = 231;        # Yet another unknown temperature

###########################################################################################################################
# Parse S.M.A.R.T. data from HDD/SSD for Temperature attributes
# Input:
#     @smart - 512B of smart log in form of Perl array
# Output:
#     %Temp  - hash table with parsed temperatures
sub Parse_SMART_Temperature
{
        #######################################################################################################################
        # S.M.A.R.T. attributes are fixed-size, 12-byte data structures. To find the required structure it is necessary to
        # walk through all the records, looking for the necessary one. We are interested in records with ID 190, 194 and 231,
        # representing:
        # 190 - air temperature inside HDD body, value, or 100 - value
        # 194 - temperature as reported by HDD built-in sensor
        # 231 - yet another temperature
        my @smart = @_; my $idx;
        # Return hash placeholder, if attribute is not found, -1 will remain as temperature value
        my %Temp = ("$AttrAirTempID" => -1, "$AttrHDATempID" => -1, "$AttrTempID" => -1);
        for ($idx = 2; $idx < 361; $idx = $idx + 12) {
                my $AttrID  = $smart[$idx + 0];
                my $Flags   = $smart[$idx + 1] + ($smart[$idx + 2] << 8);
                my $Current = $smart[$idx + 3];
                my $Worst   = $smart[$idx + 4];
                my $Reserved = $smart[$idx + 11];   # reserved byte; thresholds are not stored in this record
                
                if ($AttrID == $AttrAirTempID) {
                        $Temp{"$AttrAirTempID"} = 100 - $Current;
                } elsif ($AttrID == $AttrHDATempID) {
                        $Temp{"$AttrHDATempID"} = $Current;
                } elsif ($AttrID == $AttrTempID) {
                        # I have no such HDD, so I have no idea how to handle this attribute
                }
        }
        return %Temp;
}

###########################################################################################################################
# Query HDD/SSD S.M.A.R.T. data log
# Input:
#     $DeviceID - device to query suitable for camcontrol cmd: da1 => /dev/da1 will be queried
# Output:
#     @smart - 512B Perl array with S.M.A.R.T. data
sub Query_SMART_Log
{
        my $camc_prefix = "camcontrol cmd ";
        my $camc_suffix = " -v -c \"A1 08 0E D0 01 00 4F C2 00 B0 00 00\" -i 512 -";
        my $camc_device = $_[0];
        my $camc_cmd    = $camc_prefix.$camc_device.$camc_suffix;
        my $result      = `$camc_cmd`;
        return unpack("C*", $result);
}

###########################################################################################################################
# Query device model and device serial number
# Input:
#     $DeviceID - device to query suitable for smartctl -a /dev/$DeviceID: da1 => /dev/da1 will be queried
# Output:
#     None, models and serials are stored in %Models and %Serials hash tables
my %Models; my %Serials;
sub Query_Models_Serials
{
        my $smartctl_prefix = "smartctl -a /dev/";
        my $smartctl_suffix = " | grep 'Device Model:\\|Serial Number:' | awk '{print \$3}'";
        my $smartctl_device = $_[0];
        my $smartctl_cmd    = $smartctl_prefix.$smartctl_device.$smartctl_suffix;
        my $result          = `$smartctl_cmd`;
        ($Models{$smartctl_device}, $Serials{$smartctl_device}) = split("\n", $result);
}

###########################################################################################################################
# Main and initialization
my @DriveList = split(" ", $drives);

foreach my $DeviceID (@DriveList) {
        Query_Models_Serials($DeviceID);
}

# Main reporting loop
for ( ; ; ) {
        printf("\033c");
        foreach my $DeviceID (@DriveList) {
                my @smart = Query_SMART_Log($DeviceID);
                my %Temp  = Parse_SMART_Temperature(@smart);
                my $reported = 0;
                printf("/dev/%s:  %s  %s  ", $DeviceID, $Models{$DeviceID}, $Serials{$DeviceID});
                if ($Temp{$AttrAirTempID} != -1) { printf("AIR: %d  ", $Temp{$AttrAirTempID}); $reported = 1; }
                if ($Temp{$AttrHDATempID} != -1) { printf("HDA: %d  ", $Temp{$AttrHDATempID}); $reported = 1; }
                if ($reported == 0) { printf("UNDETECTED"); }
                printf("\n");
                sleep(1);
        }
}
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
My very first attempt was a script built on a mix of smartctl -a, grep and awk. There are tons of such examples on the Internet. Unfortunately, such a construction is quite heavy. It takes resources! Even with the boot pool on an SSD, it takes a good deal of resources to run smartctl -a for the 8 HDDs I have in the system. As a next step I moved to smartctl -A, a slightly more efficient variant that is supposed to query only the S.M.A.R.T. log and nothing more. Unfortunately, it is still too heavy.
What do you mean by "heavy"?

And what HDD temperatures are we talking about?
 

puzavi

My goal is software-based automatic regulation of fan speed. If the HDDs are not transferring tons of data, they do not produce that much heat. This makes it possible to reduce the fan speed and make the system even quieter. A typical script to query HDD temperature with smartctl looks something like this:

Code:
da1_temp=`smartctl -A /dev/da1 | grep "Temperature" | awk '{print $10}'`
da2_temp=`smartctl -A /dev/da2 | grep "Temperature" | awk '{print $10}'`
da3_temp=`smartctl -A /dev/da3 | grep "Temperature" | awk '{print $10}'`
da4_temp=`smartctl -A /dev/da4 | grep "Temperature" | awk '{print $10}'`
da5_temp=`smartctl -A /dev/da5 | grep "Temperature" | awk '{print $10}'`
da6_temp=`smartctl -A /dev/da6 | grep "Temperature" | awk '{print $10}'`
da7_temp=`smartctl -A /dev/da7 | grep "Temperature" | awk '{print $10}'`
da8_temp=`smartctl -A /dev/da8 | grep "Temperature" | awk '{print $10}'`


Running such a script for my 8 Seagate IronWolf 4TB HDDs takes anywhere from 0.24s to 2.3s per run. According to htop, a few CPU load bars jump to 10-15% while smartctl -A is executed and its output filtered. Of course, I am not running a high-end CPU. Mine is an old, but top for its era, 6th generation Core i7, accompanied by 48GB of RAM. So running such a script constantly for closed-loop auto-regulation is not desirable: it would consume a noticeable amount of CPU time.
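For the regulation step itself, one simple possibility is a linear ramp from a minimum fan duty to full speed between two temperatures. The sketch below only computes the target duty. The sub name fan_duty and all four default numbers are placeholders rather than recommendations, and how the duty is actually applied to the fans depends on the motherboard and is not covered.

```perl
use strict;
use warnings;

# Map the hottest drive temperature to a fan duty cycle in percent.
# All four defaults are placeholders chosen for illustration.
sub fan_duty {
    my ($temp_c, %opt) = @_;
    my $t_low    = $opt{t_low}    // 35;    # at or below this: minimum duty
    my $t_high   = $opt{t_high}   // 45;    # at or above this: full speed
    my $duty_min = $opt{duty_min} // 30;
    my $duty_max = $opt{duty_max} // 100;
    return $duty_min if $temp_c <= $t_low;
    return $duty_max if $temp_c >= $t_high;
    # Linear interpolation between the two corner points, rounded to an integer.
    my $frac = ($temp_c - $t_low) / ($t_high - $t_low);
    return int($duty_min + $frac * ($duty_max - $duty_min) + 0.5);
}

printf("%d C -> %d%%\n", $_, fan_duty($_)) for (30, 38, 40, 44, 50);
```

A real control loop would probably add some hysteresis so the fans do not change speed on every one-degree fluctuation.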

The CPU load caused by the proposed script is hardly noticeable, even when it runs constantly. The reason is easy to see: I use camcontrol only, a single and quite small external binary, while the smartctl -A solution requires three. System admins tend to forget that piping output for filtering and processing is not free. With modern CPUs that might not be very noticeable, but why use more resources if there is a more efficient solution? Of course, an even more efficient solution is to create a library and call it from a Python/Perl environment, but that involves binaries and programming, and I wanted to avoid that this time.

My IronWolfs report the actual temperature in two S.M.A.R.T. attributes: 190 and 194. 190 is the air temperature inside the sealed HDD enclosure and 194 is the HDD electronics temperature; at least, I believe so. While it is not a big deal if attribute 194 reads high values, high values of attribute 190 directly influence HDD life expectancy. According to Google's investigation and a few other articles on the Internet, a high air temperature inside the sealed HDD enclosure (attribute 190) significantly reduces the life expectancy of the mechanical parts. The highest allowed temperature readout for my IronWolfs is 60 °C. The HDD warranty is void if the 60 °C limit is crossed.

This might not matter for big enterprises. They are used to throwing away hard drives and other very expensive stuff. However, I am a home user and I would like my hardware to stay alive as long as possible.
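For anyone who prefers to stay with smartctl, part of the piping cost can be removed by doing the filtering inside Perl, so that only one external process is started per drive. This is a sketch under that assumption, not a measurement. The sub name temps_from_smartctl_text is made up, and the sample text merely imitates the usual smartctl -A table layout with invented values.

```perl
use strict;
use warnings;

# Pull raw temperature values out of `smartctl -A` text without grep or awk.
# Returns a hash: attribute ID => leading integer of the RAW_VALUE column.
sub temps_from_smartctl_text {
    my ($text) = @_;
    my %temp;
    foreach my $line (split /\n/, $text) {
        my @col = split ' ', $line;
        next unless @col >= 10 && $col[0] =~ /^\d+$/ && $col[1] =~ /Temperature/;
        $temp{ $col[0] } = $col[9] + 0 if $col[9] =~ /^\d+$/;
    }
    return %temp;
}

# Illustrative sample in the usual table layout; the values are invented.
my $sample = <<'END';
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       120
190 Airflow_Temperature_Cel 0x0022   067   055   040    Old_age   Always       -       33 (Min/Max 28/37)
194 Temperature_Celsius     0x0022   033   045   000    Old_age   Always       -       33 (0 17 0 0 0)
END

my %temp = temps_from_smartctl_text($sample);
print "$_: $temp{$_}\n" for sort { $a <=> $b } keys %temp;

# On a live system the text would come from a single process, for example:
#   my $text = `smartctl -A /dev/da1`;
```

This still launches smartctl once per drive, so it does not address the drive wake-up or the cost of smartctl itself.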

--
Regards,
Puzavi
 