Long Term Monitoring of TrueNAS Drives in particular

NugentS · Jan 30, 2023

I was wondering if anyone here had any advice for me - I am branching out into unfamiliar territory (or rather trying to) and getting confused by the options.

TrueNAS produces a whole bunch of stats in the reporting pages, which are great, just not very usable: if I want to look at drive temp for each drive over time (which is my goal), I have to click lots of times to get any info at all. It also seems to go back only 6 months, and I am looking for a full year, preferably on one page in a view that I can just look at from time to time.

What I would like:
Primarily: to be able to see the drive temps over time, for the last year (I have 24 or so drives), with decent granularity, so I can look at a chart by the second (though a minimum of one minute would probably be better), minute, 10 mins, hours, days, weeks, months etc. Hopefully I can see all the drives at once, with the ability to then drill down and look at "problem" drives if I have any.
Secondary: to see how SMART stats on the drives change over time, i.e. chart chosen stats somehow so I can see things like load/unload cycles increasing (or not) over time.
[As you can see it's mostly about drives - I am not so concerned about the rest of the NAS (CPU/Memory/Network), although the ability to add stuff later on is good]

Would anyone care to point me in the right direction, as I am struggling to see where to start?

Graphite / Grafana - which is likely to do what I am looking for, simply?

Thanks

sretalla · Jan 31, 2023

NugentS said:
Would anyone care to point me in the right direction, as I am struggling to see where to start?

Graphite / Grafana - which is likely to do what I am looking for, simply?

I have incorporated/supplemented my fan script with a disk temperature logging script that pushes data to influx for grafana...

nas_fan_control/influxhddtemps.pl at master · sretalla/nas_fan_control

For my SSDs (since I don't have them monitored by the fan control script), I run this via a cron task every 5 minutes.

That script is for CORE, but I have updated most (if not all) of the modules for SCALE as well in the fan script:

nas_fan_control/MASTER_PID_fan_control_SCALE.pl at master · sretalla/nas_fan_control

If you're suitably motivated, you could probably use that code to update the temps script.

If that's something you're not able to do, but really want it, I might find the time to help out.

Patrick M. Hausen · Jan 31, 2023

The collectd in TrueNAS CORE got some feature improvements by iXsystems, so it reports CPU temperatures, HDD/SSD temperatures and a bunch of other stuff. Pretty easy to connect to influxdb and Grafana running in a jail.

NugentS · Jan 31, 2023

@Patrick M. Hausen - except I am on SCALE - and want to use containers, either directly on TrueNAS or preferably on a separate Docker host, as I try to avoid TrueCharts for a number of reasons. Grafana itself is easy enough to install on TN K3S or a separate Docker host. But there are hundreds of different options.

As I understand things I need to configure 3 things as:
TN -> collector/logging app -> Grafana

Grafana is "simply" a display program - it takes data from the collector. The collector is a completely separate app that TN logs to.

Under reports config TN says "Remote Graphite Server Hostname", implying I should be using Graphite (which, at least in the container I have found, graphiteapp/graphite-statsd, seems to include a collector). Although I think that TN uses graphite as that doesn't work for me in what I want to see

Googling seems to imply (as @Patrick M. Hausen also implied) that a popular route is influxdb as the "collector" and grafana as the graphing app.

I am suffering from information overload / decision paralysis

So suggestion:
1. Install grafana. (grafana/grafana-oss:main)
2. Install influxdb container (influxdb:latest)
3. Point TN at the influxdb
4. Point grafana at influxdb. Configure display
5. See what happens, (presumably nothing as I will be missing something obvious)

Does that sound at least moderately sensible?

@sretalla - thank you - I may take you up on that later - but I think I am missing some obvious bits first

Patrick M. Hausen · Jan 31, 2023

Collector: telegraf or collectd - collects and sends data to a
Time series database: graphite or influxdb (has a graphite compatible ingest connector) - which is then used as a data source by an
Analytics and display program: grafana

NugentS · Jan 31, 2023

collectd is provided by TN is it not?
Is that not the bit configured by:

Patrick M. Hausen · Jan 31, 2023

Precisely. I was only referring to the fact that it is not the only source of metrics existing. And for some reason iX have decided to use the graphite plain text format exclusively. So you need to define a graphite plain text connector in influx. Because influx is much more convenient and manageable than graphite. Unfortunately there is no port of the 2.x branch to FreeBSD, yet. But that is not your concern if you are using SCALE.

"Graphite separate instances" is useful, just trust me and activate it - you will have a way easier life when building your graphs in grafana.

Just like Elasticsearch, Logstash and Kibana make the "ELK" stack, Telegraf, Influx and Grafana are considered one stack of related tools. But using the builtin collectd works great, too.

And last: the graphite plain text format is dead easy to generate. So if you want a widget of your current zpool usage, you can use a script like this:
https://www.truenas.com/community/t...ation-in-grafana-influxdb.102139/#post-705943
https://www.truenas.com/community/t...in-grafana-influxdb.102139/page-2#post-706459

Or shove any metric you are interested in into influx that way.
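
To illustrate how little is needed to generate the Graphite plain text format mentioned above, here is a minimal Python sketch. It is not the script from the linked posts; the metric path, the pool name "tank" and the usage figure are made up for illustration, and port 2003 is assumed only because it is the conventional Graphite plaintext port.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines to a Graphite-compatible TCP listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical host name, pool name and value, for illustration only.
    line = graphite_line("servers.mynas.zpool.tank.used_percent", 42.5)
    print(line, end="")
    # Uncomment once something is actually listening on port 2003:
    # send_metrics([line], host="192.168.38.189")
```

Run from cron, a script along these lines could push any number you can compute on the NAS, provided the receiving end accepts Graphite plaintext.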

NugentS · Feb 1, 2023

OK - I figure I am being dumb here
Step 1 - Create influxdb container - tick
Step 2 - login to influxdb - using http://192.168.38.189:8086 - this brings up a nice colourful window
Step 3 - Point TN Graphite Server to http://192.168.38.189:8086 - this might need to be just the IP address or without the http:// - assume a tick for the moment - might need to change

Now @Patrick M. Hausen implies I need to define a "graphite plain text connector in influx". I assume so that influx knows that I am trying to send it something and the format it will be in. The logical place seems to be "source" - but nothing springs out at me. I can upload a file (clearly "no"), or I can use a client library or a telegraf plugin (which would also seem like "no")

Something seems wrong / not obvious - there seem to have been a lot of changes in influxdb recently, such that old docs no longer seem relevant

Docker-compose - just for reference

Code:

  influxdb:
    image: influxdb:2.6.1
    container_name: influxdb
    restart: always
    environment:
      - INFLUXDB_DB=influx
      - INFLUXDB_ADMIN_USER=xxxxx
      - INFLUXDB_ADMIN_PASSWORD=xxxxx
      - reporting-disabled
    ports:
      - 8086:8086
    volumes:
      - influxdb_data:/var/lib/influxdb2

Patrick M. Hausen · Feb 1, 2023

To configure this graphite plain text compatible connector for influx you need to get something like this into influxd.conf:

Code:

[[graphite]]
  enabled = true
  database = "graphite"
  retention-policy = ""
  bind-address = ":2003"
  protocol = "tcp"
  consistency-level = "one"

  separator = "."

  templates = [
    "servers.* .hostname.resource.instance.measurement*",
  ]

"database" is just a name - this is the DB you reference in grafana afterwards. "templates" tells influx what to name the individual fields sent by TrueNAS.

For example with the template I use you will end up with a DB entry like:

Code:

temperature,hostname=freenas2_ettlingen_hausen_com,instance=ada0,resource=disktemp

So you can query in InfluxQL like (in English) "give me the 'temperature' measurement for the host freenas2_..., resource disktemp, instance ada0".
I could have named the parts "foo, bar, baz" instead of "hostname, instance, resource" and you would query for that. It's just names but it is your job to come up with a scheme that makes sense to you.
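
As a rough illustration of the kind of InfluxQL query described above, the following Python sketch only builds the query string. It assumes the tag names from the template in this post, and it assumes the reading is stored in a field called "value" (which, as far as I know, is what the InfluxDB 1.x graphite input uses when the template does not name a field). The function name and defaults are hypothetical.

```python
def disk_temp_query(hostname, instance, days=365, bucket="1h"):
    """Build an InfluxQL query for the mean disk temperature per time bucket.

    No escaping is done here, so only pass trusted host and disk names.
    """
    return (
        'SELECT mean("value") FROM "temperature" '
        f"WHERE \"hostname\" = '{hostname}' AND \"resource\" = 'disktemp' "
        f"AND \"instance\" = '{instance}' AND time > now() - {days}d "
        f"GROUP BY time({bucket})"
    )


if __name__ == "__main__":
    print(disk_temp_query("freenas2_ettlingen_hausen_com", "ada0"))
```

In Grafana you would normally click this together in the query editor rather than type it, but the resulting query has this general shape.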

What TrueNAS sends via collectd looks like this in plain text: "servers.freenas2_ettlingen_hausen_com.disktemp.ada0.temperature <value> <timestamp>".
The template tells influx how to cut that into pieces.
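
The cutting can be pictured with a small Python sketch. This is only an approximation of the template rules written for illustration, not InfluxDB's actual parser: it takes just the template part (the text after the "servers.*" filter), skips empty fields, turns named fields into tags, and lets "measurement*" absorb whatever is left of the path.

```python
def apply_template(metric_path, template, separator="."):
    """Split a Graphite path into (measurement, tags) following a template."""
    parts = metric_path.split(separator)
    fields = template.split(".")
    tags = {}
    measurement = []
    for i, field in enumerate(fields):
        if i >= len(parts):
            break  # path is shorter than the template
        if field == "":
            continue  # empty template field: skip this path element
        if field == "measurement*":
            measurement.extend(parts[i:])  # absorb the rest of the path
            break
        if field == "measurement":
            measurement.append(parts[i])
        else:
            tags[field] = parts[i]
    return separator.join(measurement), tags


if __name__ == "__main__":
    name, tags = apply_template(
        "servers.freenas2_ettlingen_hausen_com.disktemp.ada0.temperature",
        ".hostname.resource.instance.measurement*",
    )
    print(name, tags)
```

Renaming the template fields to "foo.bar.baz" would simply change the tag keys in the result, which is the point made above about choosing your own naming scheme.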

That's how you build your widgets in grafana.

How you get this config snippet into the influx configuration in the context of a Docker/Helm deployment is way beyond my knowledge and one of the very reasons why I prefer jails over everything else. Config file in documented location, edit, test, done. Deploy with Ansible. Double done.

HTH,
Patrick

NugentS · Feb 1, 2023

And influxd.conf is not where other searches say it is. More googling I think

NugentS · Feb 5, 2023

Currently I am bouncing off this. However according to the roadmap the TrueNAS reporting ecosystem will change soon - so I am going to put a stop to this particular project for the moment and wait for the changes and then try again. Meanwhile research continues - slowly

mistermanko · Feb 19, 2023

sretalla said:
I have incorporated/supplemented my fan script with a disk temperature logging script that pushes data to influx for grafana...

nas_fan_control/influxhddtemps.pl at master · sretalla/nas_fan_control

For my SSDs (since I don't have them monitored by the fan control script), I run this via a cron task every 5 minutes.

That script is for CORE, but I have updated most (if not all) of the modules for SCALE as well in the fan script:

nas_fan_control/MASTER_PID_fan_control_SCALE.pl at master · sretalla/nas_fan_control

If you're suitably motivated, you could probably use that code to update the temps script.

If that's something you're not able to do, but really want it, I might find the time to help out.

Did you actually run that on SCALE? Because you didn't change the path for ipmitool to the corresponding default path on SCALE.
And running it throws lots of syntax and math errors.

NugentS · Feb 19, 2023

@sretalla Which version of influx are you talking about? There seem to be major changes in the latest version, which mean (from a TN PoV) it's now TN -> Telegraf -> Influx -> Grafana. I think

[Edit - not to now]

ChrisRJ · Feb 19, 2023

I realize that it doesn't tick all boxes, but wanted to mention that something similar can be achieved with @joeschmuck 's wonderful script.

multi_report.sh version for Core and Scale

TooMuchData submitted a new resource: multi_report.sh versions for Core and Scale - ZPool & SMART status report. Original script by joeschmuck, modified by Bidelu0hm, then by melp. Minor corrections by TooMuchData.

NugentS · Feb 19, 2023

I use that script (even have a credit) but it's not something I can look at whenever I feel like it to see the temps.
My NAS is in my garage, and it can get a bit warm in there

sretalla · Feb 19, 2023

NugentS said:
Which version of influx are you talking about? There seem to be major changes in the latest version, which mean (from a TN PoV) it's now TN -> Telegraf -> Influx -> Grafana. I think

I am just using the FreeBSD port(pkg) in a jail, so whatever that is. (I only use influx there, then run the latest Grafana in a docker container and point it at the influxdb).

sretalla · Feb 19, 2023

mistermanko said:
Did you actually run that on SCALE? Because you didn't change the path for ipmitool to the corresponding default path on SCALE.

A long time ago, I did... but indeed the variables section (including the part where the ipmitool path is specified) was expected to be changed.

joeschmuck · Feb 19, 2023

NugentS said:
I use that script (even have a credit) but it's not something I can look at whenever I feel like it to see the temps.
My NAS is in my garage, and it can get a bit warm in there

The title of this thread is "Long Term Monitoring of TrueNAS Drives in Particular".

NugentS said:
TrueNAS produces a whole bunch of stats in the reporting pages, which are great, just not very usable: if I want to look at drive temp for each drive over time (which is my goal), I have to click lots of times to get any info at all. It also seems to go back only 6 months, and I am looking for a full year, preferably on one page in a view that I can just look at from time to time.

You do know that Multi-Report creates and maintains a CSV file (set up for use in a spreadsheet) that records all the data? If you set up a CRON job for the script that runs HOURLY (what I use), or maybe every 2 hours or every 15 minutes, and use the -s switch, only an entry into the CSV file will occur, no email.

I set up the CSV file so I could perform trend analysis. My drives are well past their prime so I want to monitor this stuff. The CSV file also has a Purge feature that will purge anything over 720 days old to keep the file from getting too large. Of course that is user configurable, as are most of the functions/operations.
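
For anyone wanting to do that kind of trend analysis outside a spreadsheet, here is a minimal Python sketch that summarises temperatures per drive from a CSV. The column names "drive" and "temp_c" and the sample rows are hypothetical; the real Multi-Report CSV has its own header names, so the two column arguments would need adjusting to match.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_temps(rows, drive_col="drive", temp_col="temp_c"):
    """Return {drive: {'min', 'max', 'mean'}} from an iterable of CSV row dicts."""
    readings = defaultdict(list)
    for row in rows:
        try:
            temp = float(row[temp_col])
            drive = row[drive_col]
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        readings[drive].append(temp)
    return {
        drive: {"min": min(temps), "max": max(temps), "mean": mean(temps)}
        for drive, temps in readings.items()
    }


if __name__ == "__main__":
    # Made-up sample data standing in for the real CSV file.
    sample = (
        "date,drive,temp_c\n"
        "2023-02-19 10:00,ada0,31\n"
        "2023-02-19 11:00,ada0,35\n"
        "2023-02-19 10:00,ada1,40\n"
    )
    for drive, stats in summarize_temps(csv.DictReader(io.StringIO(sample))).items():
        print(drive, stats)
```

To use it on a real file, pass `csv.DictReader(open(path, newline=""))` instead of the sample string.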

If you just needed, say, a PDF file of a chart, I'm sure that could be figured out. Not by me, yet, but I'm sure some smart whippersnapper can convert a CSV file into a chart in a PDF file. I'm still working on version 2.1 and making a big change to the simulation capabilities, mainly for me so I can assist others as problems arise. I want to simulate a person's entire drive reports and it's working so far, but it's early. I still have a long way to go. Maybe creating a PDF isn't a bad idea, but I'm not sure what TrueNAS has installed that I could use.

Let me know your thoughts on why it doesn't meet your needs, maybe I can make it work as needed and it would benefit others as well.

Thanks @ChrisRJ for the shoutout. As @NugentS said, or didn't say, he's helped quite a bit in debugging the script and obtaining drive data. Same with @sretalla, @mistermanko, and yourself (I only mentioned the folks posting so far in this thread). I genuinely appreciate all the help you folks and others have given me to create what I hope is a useful script. I know it keeps me engaged.

Cheers

NugentS · Feb 19, 2023

OK - what I am looking to do specifically.
In real time, or close to real time, be able to graph disk temp (mostly HDDs) against time in a single pane of glass, from which I can then, if I see something alarming, drill down and see which disk is melting.

I keep my NAS in a rack in a garage. During the summer it can get very warm in there and the disks in particular can get a bit (lot) toasty.
I do have temp controlled extractor fans - but they can only do so much.
I also have a portable aircon unit in a covered frame (new) to blow cold / cooler air directly at the NAS - this is also temperature controlled, only coming on when the garage reaches a certain temperature
Ideally I would have grafana (or similar) open during the day - so I could watch how my janky cooling solution works in real time. Last year during the worst of the temps I had to turn the NAS off - the temps were making me twitchy.

TN already collects the data - just not in a form that I can just look at without interaction (GUI times out, too many disks to fit on a page etc)

joeschmuck · Feb 19, 2023

Realtime does make it difficult. So while Multi-Report isn't realtime, it can simulate realtime. This "might" be a workaround, but what if you set up CRON to run Multi-Report and maybe send you an email if a threshold is exceeded? Okay, I don't have that feature but I could make that happen easily enough if that would work for you, but only if it would meet your needs. I actually like the idea myself. If I run Multi-Report every 10 minutes, an email is generated only if a threshold is reached. That seems like an easy (I shouldn't jinx myself) change. I took this for action in version 2.1 in my list of things to do. I actually like the idea of just tossing an email when something bad happens. It's not using TrueNAS to monitor, unfortunately, but it's a workaround that will run on both Core and Scale.

This is not "realtime" but it's an option and you can establish how often the data is polled.
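
The threshold idea can be sketched in a few lines of Python. This is not how Multi-Report does or will do it; the function names, the 45 C limit and the addresses are made up for illustration, and actually sending the message (for example with smtplib) is left out.

```python
from email.message import EmailMessage


def over_threshold(temps_c, limit_c):
    """Return (drive, temp) pairs at or above the limit, hottest first."""
    hot = [(drive, temp) for drive, temp in temps_c.items() if temp >= limit_c]
    return sorted(hot, key=lambda item: item[1], reverse=True)


def build_alert(temps_c, limit_c, sender, recipient):
    """Return an EmailMessage if any drive is too hot, otherwise None."""
    hot = over_threshold(temps_c, limit_c)
    if not hot:
        return None  # nothing to report, so no email
    msg = EmailMessage()
    msg["Subject"] = f"Drive temperature alert: {len(hot)} drive(s) at or above {limit_c} C"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{drive}: {temp} C" for drive, temp in hot))
    return msg


if __name__ == "__main__":
    # Hypothetical readings; a real script would take these from smartctl or the CSV.
    readings = {"ada0": 38, "ada1": 47, "ada2": 45}
    alert = build_alert(readings, 45, "nas@example.com", "me@example.com")
    if alert is not None:
        print(alert["Subject"])
        print(alert.get_content())
```

Run every 10 minutes from cron, something like this stays silent until a drive crosses the limit, which matches the "only email when something bad happens" behaviour described above.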
