Long Term Monitoring of TrueNAS Drives in particular

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
OK - after realising that names etc. are all case sensitive, I can now see data in influxdb when I look in the Data Explorer.
I am now running the script every 5 minutes to populate some data
 
Joined
Jan 27, 2020
Messages
577
OK, so I managed to get my test system up after a few hours of bashing my head on an old problem that turned out to be caused by an ad-blocker... anyway, it's running.

I have a couple of issues though:

1. It's a VM, so I don't get CPU temps / no sensors. So I have tested it only with the options set for "$cpu_temp_control = 1" and "$cpu_fans_cool_hd = 1". Obviously that's not what others would want to do so I set those ones to 1 instead of 0. Remembering that the original script doesn't actually recommend to control the CPU FAN instead of the CPU_FAN header on most MoBos, just to control the outlet fans behind the CPU, so that's something that may not even be necessary if you have a good CPU FAN and can leave your rear case fans at an optimal speed.

2. I'm using an OpenCorsairLink device (commander pro) passed into the VM as a USB device... so can only test that option, not the other 2 options I have in the script (ASRock or Supermicro)

So if somebody wants to have a look at it and let me know how it goes, Great!

I have it set for everything to go in /root/ for now, but you could put it somewhere in a pool if you prefer and update the locations in the MASTER file.

It's currently set for Linux (SCALE) and for Supermicro fan control.

Edit the MASTER file to fill in your needed variables... I would recommend looking at these 3 and see if you want to mess with influx or not... make sure you create the DB on the influx server with a matching name.

$use_influx = 1
$influxdb_db="freenas"
$influxdb_host="192.168.1.1"

It's important that you also set your fan max speeds to numbers that your fans can actually hit:

$cpu_max_fan_speed = 2200
$hd_max_fan_speed = 1500


You should set the MASTER file to executable first, then run it.
Been running it all day and seems to work well so far. Though I didn't test the influxdb part. Thank you very much for fixing it!
 

NugentS

[Image 1677089217812.png: a partial dashboard within influxdb.]
Thank you - now I get to connect Grafana to influxdb - tick

But the Grafana dashboard is proving to be a head-scratcher.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703

NugentS

I am / was a network engineer. I designed and built networks, mostly up to layer 4. What's above that sort of eludes me / never bothered me. I used to work at a well-known telecoms company on a worldwide private-internet type setup for financial companies.

I have never been involved in any form of programming per se

So - what do I have.
I have a Grafana instance which, after rather longer than I care to admit, is connected to an influxdb instance (which has data in it). When I click Save & test on the data source I get
[Image: 1677170126040.png]

Which is a good sign I guess. It's also set as the default.

If I go to influxdb and "Data Explorer" I can see 3 buckets: truenas, _monitoring & _tasks. Given that truenas is the one I set up and has data from @sretalla 's script in it, that would seem to be the relevant bucket, and the info looks correct, although some is missing. I can see:
[Image: 1677170649889.png]

but sdc, sdd, sdg, sdh, sdi, sdj, sdr, sds, sdt, sdu, sdw & sdy are missing. All of these are SATA or SAS SSDs connected via the backplane, and they don't really concern me that much, as SSDs will thermally throttle and basically look after themselves (mostly).

sda & sde are very old SSDs in the boot pool, which do appear. These are connected to motherboard SATA ports and have poor airflow given where they are (velcro'd into the case in a location with little airflow).
b,e,f,k,l,m,n,o,p,q,v,x are all HDDs connected to the SAS backplane and are the ones that concern me. So I have data for 2 relatively unimportant SSDs and all the HDDs I care about, plus the two NVMe drives.

The two nvme drives are optanes in the same location as the sda & sde - so actually of interest. Whether by accident or design @sretalla 's script actually already filters out the less interesting data!!!!!! [How did you know???]

Back to Grafana: I select new dashboard and I now appear to have to write a query. Now, I know intellectually what a query is, but practically, not a clue.

In influxdb there is a script editor which produces:
Code:
from(bucket: "truenas")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "DiskTemp")
  |> filter(fn: (r) => r["_field"] == "value")
  |> filter(fn: (r) => r["component"] == "NewNASnvme0" or r["component"] == "NewNASnvme1" or r["component"] == "NewNASsda" or r["component"] == "NewNASsdb" or r["component"] == "NewNASsde" or r["component"] == "NewNASsdf" or r["component"] == "NewNASsdl" or r["component"] == "NewNASsdk" or r["component"] == "NewNASsdm" or r["component"] == "NewNASsdn" or r["component"] == "NewNASsdo" or r["component"] == "NewNASsdp" or r["component"] == "NewNASsdq" or r["component"] == "NewNASsdv" or r["component"] == "NewNASsdx")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")


This looks very much like the example query that Grafana offers, so why not...

If I copy and paste that into Grafana, I get data, but not as a graph. I can then turn it into a time series graph at the press of a button, which works, somewhat by mistake. [shows the value of flailing wildly]

And it seems to work.
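
For orientation, the three filters in that query (_measurement, _field and the component tag) line up with how a point is written to influxdb in line protocol. A minimal sketch in Python follows; the script itself is Perl and its exact write string is not shown in this thread, so the function name disk_temp_point and the temperature 35 are hypothetical, and only the names DiskTemp, component and value are taken from the query above.

```python
def disk_temp_point(component, value):
    """Build one line-protocol point: measurement,tag=... field=...

    Returns None when the name or the value is missing, since a point
    with no field value is not valid line protocol.
    """
    if not component or value is None:
        return None
    # Assumption: the component name contains no spaces or commas,
    # which would need escaping in line protocol.
    return f"DiskTemp,component={component} value={value}"


print(disk_temp_point("NewNASsda", 35))  # DiskTemp,component=NewNASsda value=35
print(disk_temp_point("", None))         # None
```

A disk whose point is never written simply has no "component" entry to pick in the Data Explorer.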
 

NugentS

@sretalla
I am curious as to how the script filters out some of the SSDs. It appears to take data from sfdisk -l
But I don't see (or understand) any logic that filters that result for SSDs / HDDs. I have uploaded a text file just in case you are interested enough to have a look / explain it to me!!

But thank you
 

Attachments

  • disklist.zip
    2.4 KB · Views: 67

sretalla

I am curious as to how the script filters out some of the SSDs. It appears to take data from sfdisk -l
It's this line (108):
next if (/Verbatim|Kingston|Elements|Enclosure|Virtual|KINGSTON|mapper/);

If any of the items (separated by the pipes | ) are matched in the header of that disk, it is skipped and won't be in the disk list to be checked...

What you won't see spelled out in the code is that line 161 effectively acts as the next filter:
if (/Temperature_Celsius|Airflow_Temperature_Cel/) { $temp = (split)[9]; }

If that line produces no value, influx will reject the attempt to post data with no value, so no disk appears in influx.

It can be that either the disk reports no temperature, or the regex doesn't match the name of the line that holds the temperature on those disks (we can fix that if you show the output of smartctl -A for that disk).

There are also corresponding lines for the NVMe drives, but I didn't list those here as that part seems to work fine.
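
To illustrate the second filter, here is a rough Python sketch of the same idea (the script itself is Perl). The regex alternation is the one quoted above; the function name temp_from_smartctl and the two sample attribute lines are made up for illustration, and this sketch returns the first matching line, which may differ from what the script does if several lines match.

```python
import re

# Same alternation as the regex quoted above (line 161 of the script).
TEMP_ATTR = re.compile(r"Temperature_Celsius|Airflow_Temperature_Cel")


def temp_from_smartctl(output):
    """Return the 10th whitespace-separated column of the first
    matching attribute line, or None if no line matches."""
    for line in output.splitlines():
        if TEMP_ATTR.search(line):
            fields = line.split()  # like Perl's (split)[9]
            if len(fields) > 9:
                return fields[9]
    return None


# Illustrative attribute lines, not taken from any disk in this thread.
hdd = "194 Temperature_Celsius   0x0022   118   103   000   Old_age   Always   -   32"
other = "194 Some_Other_Temp_Name  0x0022   067   055   000   Old_age   Always   -   33"

print(temp_from_smartctl(hdd))    # 32
print(temp_from_smartctl(other))  # None
```

The second case is the one that matters here: the disk does report a temperature, but under a name the regex doesn't know, so nothing is extracted and nothing reaches influx.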
 

NugentS

HDD = 194 Temperature_Celcius
SSD = 194 Temperature_Internal & 190 Temperature_Case

So would the line be:
Code:
if (/Temperature_Celsius|Airflow_Temperature_Cel|Temperature_Internal/) { $temp = (split)[9]; }

since I don't see the earlier line (the next if) being triggered.

You don't seem to like Kingston?

That line edit seems to work - as the extra data is now in the influxdb database
 

sretalla

HDD = 194 Temperature_Celcius
I don't know if this is something you copied directly, but I'm very interested to see since I made what I thought was a typo and couldn't see it for ages as the cause for nothing working...

Note: Celsius vs Celcius
So would the line be:
Code:
if (/Temperature_Celsius|Airflow_Temperature_Cel|Temperature_Internal/) { $temp = (split)[9]; }
Something like that, but maybe also add Temperature_Celcius if some or all of your drives report it like that.
 

NugentS

Naa - I can't type - sorry
 

Seani

Dabbler
Joined
Oct 18, 2016
Messages
41
OK, so I managed to get my test system up after a few hours of bashing my head on an old problem that turned out to be caused by an ad-blocker... anyway, it's running.

I have a couple of issues though:

1. It's a VM, so I don't get CPU temps / no sensors. So I have tested it only with the options set for "$cpu_temp_control = 1" and "$cpu_fans_cool_hd = 1". Obviously that's not what others would want to do so I set those ones to 1 instead of 0. Remembering that the original script doesn't actually recommend to control the CPU FAN instead of the CPU_FAN header on most MoBos, just to control the outlet fans behind the CPU, so that's something that may not even be necessary if you have a good CPU FAN and can leave your rear case fans at an optimal speed.

2. I'm using an OpenCorsairLink device (commander pro) passed into the VM as a USB device... so can only test that option, not the other 2 options I have in the script (ASRock or Supermicro)

So if somebody wants to have a look at it and let me know how it goes, Great!

I have it set for everything to go in /root/ for now, but you could put it somewhere in a pool if you prefer and update the locations in the MASTER file.

It's currently set for Linux (SCALE) and for Supermicro fan control.

Edit the MASTER file to fill in your needed variables... I would recommend looking at these 3 and see if you want to mess with influx or not... make sure you create the DB on the influx server with a matching name.

$use_influx = 1
$influxdb_db="freenas"
$influxdb_host="192.168.1.1"

It's important that you also set your fan max speeds to numbers that your fans can actually hit:

$cpu_max_fan_speed = 2200
$hd_max_fan_speed = 1500


You should set the MASTER file to executable first, then run it.
This looks like a promising replacement for this script that I have used for years.
The sheer number of config options in your script eludes me a bit. My current working one is more primitive. Could I just take the values from my old script (fan zones and rpms) and slap those values in the corresponding places in your script and ignore the other options? My main objective is to have something in place that will stop my ten 3000 rpm Noctua fans from going completely nuts when I switch from CORE to SCALE.
 

Attachments

  • spinpid2.zip
    7.2 KB · Views: 81

sretalla

Could I just take the values from my old script (fan zones and rpms) and slap those values in the corresponding places in your script and ignore the other options?
You could try, but I would advise having a careful look at the top section of the script in full and filling in things like whether the HD fans cool the CPU and vice versa.

You may also be able to get it going on CORE first and just change the paths and OS when you switch to SCALE.
 

NugentS

@Seani did you get it working on FreeBSD?
I tried running it for giggles, and it generates an awful lot of errors. I need to check through the paths as that is, I suspect, the first cause of issues.

Works perfectly on SCALE

Not so much on CORE:
  1. Changed $operating_system to "freebsd"
  2. #'d out "my @output = run_command(@influxcommand); " just in case to stop it logging rubbish (if it gets that far)
It would appear that the script isn't picking up the device IDs.
"Use of uninitialized value $disk_dev in concatenation (.) or string at ./influxtemps_SCALE.pl line 155."

I don't think the script likes the output of camcontrol devlist as opposed to sfdisk -l

My disks are ada0-7, da0, da1 if that's of any use
 

sretalla

#'d out "my @output = run_command(@influxcommand); " just in case to stop it logging rubbish (if it gets that far)
Just set $use_influx = 0 at the top in the variables and you don't need to mess with the rest of the influx stuff.

I don't think the script likes the output of camcontrol devlist as opposed to sfdisk -l
That would be really odd, as the if statement for freebsd just pushes the same code as was in the original script for CORE (so should be working).
 

sretalla

I need to check through the paths as that is, I suspect, the first cause of issues.
For sure some of those need changing for the thing to work.
 

NugentS

OK - I put stuff back and I now get (the same):
Code:
root@QNAS[/mnt/Tank/SMB/QNAS-Scripts/sretella]# ./influxtemps_SCALE.pl
<HGST HUS726T4TALA6L4 VLGNW40H>    at scbus0 target 0 lun 0 (ada0,pass0)
<HGST HUS726T4TALA6L4 VLGNW40H>    at scbus1 target 0 lun 0 (ada1,pass1)
<ST4000NM0033-9ZM170 SN04>         at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD4000FYYZ-01UL1B2 01.01K03>  at scbus3 target 0 lun 0 (ada3,pass3)
<ST4000NM0033-9ZM170 SN04>         at scbus6 target 0 lun 0 (ada4,pass4)
<ST4000NM0033-9ZM170 SN04>         at scbus7 target 0 lun 0 (ada5,pass5)
<WDC WD4000FYYZ-01UL1B2 01.01K03>  at scbus10 target 0 lun 0 (ada6,pass6)
<HGST HDN724040ALE640 MJAOA5E0>    at scbus14 target 0 lun 0 (ada7,pass7)
<AMicro AM8180 NVME 1.00>          at scbus18 target 0 lun 0 (da0,pass8)
<JMicron Tech 0209>                at scbus19 target 0 lun 0 (da1,pass9)
Use of uninitialized value $disk_dev in concatenation (.) or string at ./influxtemps_SCALE.pl line 155.
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

Use of uninitialized value $name_nospaces in substitution (s///) at ./influxtemps_SCALE.pl line 80.
Use of uninitialized value $name_nospaces in concatenation (.) or string at ./influxtemps_SCALE.pl line 81.
Use of uninitialized value $value in concatenation (.) or string at ./influxtemps_SCALE.pl line 81.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   130  100   100  100    30  44503  13351 --:--:-- --:--:-- --:--:-- 65000
HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=utf-8
X-Influxdb-Build: OSS
X-Influxdb-Version: v2.6.1
X-Platform-Error-Code: invalid
Date: Mon, 06 Mar 2023 16:40:20 GMT
Content-Length: 100



and a lot more, but it seems to just cycle around and around. However, the first issue seems to occur in get_hd_list, which isn't returning any device names: when I uncomment the print $1 and the dprint at lines 97 and 101, they are reported as uninitialised values.

Note: I know nothing about Perl and the very cryptic use of / /\ and \ (and other things) in if statements.
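
For anyone following along, this is roughly what picking device names out of the camcontrol devlist lines above involves. It is a Python sketch rather than the script's Perl, the function name disks_from_devlist is made up, and it assumes the device names sit in a trailing "(name,passN)" group, accepted in either order. It is not the fix that was eventually applied.

```python
import re

# Matches a trailing "(ada0,pass0)" style group at the end of a line.
DEVLIST = re.compile(r"\(([a-z]+\d+),([a-z]+\d+)\)\s*$")


def disks_from_devlist(output):
    """Return the non-pass device name from each camcontrol devlist line."""
    names = []
    for line in output.splitlines():
        m = DEVLIST.search(line)
        if not m:
            continue  # no recognisable device group on this line
        for name in m.groups():
            if not name.startswith("pass"):
                names.append(name)
    return names


# Three of the lines from the output above.
sample = """\
<HGST HUS726T4TALA6L4 VLGNW40H>    at scbus0 target 0 lun 0 (ada0,pass0)
<ST4000NM0033-9ZM170 SN04>         at scbus2 target 0 lun 0 (ada2,pass2)
<AMicro AM8180 NVME 1.00>          at scbus18 target 0 lun 0 (da0,pass8)
"""
print(disks_from_devlist(sample))  # ['ada0', 'ada2', 'da0']
```

If a pattern like this captures nothing, the device variable stays empty, which would be consistent with the "/dev/: Unable to detect device type" message from smartctl above.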
 

sretalla

OK, so now I get it... I was thinking we were talking about the fan script... Although the code is just grabbed from there mostly.

Let me look at it tomorrow and see if I can get an updated version which fully supports both.
 

NugentS

Thank you
 

sretalla

OK, new version, just made a small change to the way disks are collected for CORE.
 

Attachments

  • influxtemps_SCALE.pl.zip
    2.1 KB · Views: 71