Scripts to report SMART, ZPool and UPS status, HDD/CPU T°, HDD identification and backup the config

asciutto · Jan 14, 2022

asciutto said:
I got the script to run but it would not email me. Is there something within TrueNAS I'm missing? I have an email setup for the root user as well. I'm very new to all this..

Figured it out! I guess I did not have email configured correctly under the System > Email section. Looks like I can't edit posts with this early of an account.

The guide was excellent, thank you for all the leg work!

LIGISTX · Jan 30, 2022

Anyone know if this will run or plan to update for Scale? I have been running a slightly modified version of https://github.com/edgarsuit/FreeNAS-Report for quite a while, but I am planning to migrate to Scale here in the next few weeks (assuming it goes "release" as it should). I knew enough to slightly edit edgarsuit's to implement encryption of the backup, but that was mostly just picking apart what was already there - I don't really have a clue how to know if it would work under debian, or how one would go about making that work.

Mike05 · Feb 5, 2022

Sjöhaga said:
POH is keeping up in attributes data, but not in the smart log, so the test age division gets funky .

View attachment 49874

But I think this is one of those bugs we don't really need to fix .
POH keep increasing in the attributes data so it's not completely broken.

And now I know I can't replace those drives. I want to know how long they will last :D. 2014 wan't that long ago, was it...

This happens with my setup. I have a couple of WD RED drives that are approaching 7 years and 8 months old. My lastTestHours gives 1466 and POH = 67009. The POH matches the 7 years, 7 months age. So, it seems the "last test hours" is the culprit, wrapping after 65535 as Spearfoot mentioned. In this case, it will reset the lastTestHours to 0 when the POH is just under 7.5 years.

I added a line to the script (commented below) to check if the onHours minus lastTestHours was >= 65536, and if so, add 65536 to the last test hours:

Code:

...
END {
    if (onHours - lastTestHours >= 65536) lastTestHours += 65536; # This line added
    testAge=int((onHours - lastTestHours) / 24);
    yrs=int(onHours / 8760);
    ...

Not sure if this is the best way to handle it, but it seems to work and I no longer get the red warning background in the report. However, 7.5 years is getting high, so maybe I should be looking at replacing those drives anyway

.

DrZombi · Feb 6, 2022

Mike05 said:
Birkir Brimdal: I have had the same error, and it seems to be related to the different format of the zpool status output with TrueNAS 12 vs FreeNAS 11.3.

I made the following changes to the script I have in order to get it working on my TrueNAS 12.0-RELEASE server (search for the first word to find the correct location in the script). Note that I have v1.3 of the script that was modified by melp, mentioned here, which reports in html format. It may be slightly different to Bidule0hm's script. It's the parameters given to awk that are important.

EDIT: I believe Spearfoot has managed this in his latest zpool_report.sh script by checking the BSD release and changing the awk parameters accordingly. They match my updated awk parameters.

1. scrubErrors="$(echo "$statusOutput" | grep "scan" | awk '{print $8}')"
It was ... awk '{print $10}' in the previous output format

2. scrubDate="$(echo "$statusOutput" | grep "scan" | awk '{print $15"-"$12"-"$13"_"$14}')"
It was ... awk '{print $17"-"$14"-"$15"_"$16}' in the previous output format

3. scrubTime="$(echo "$statusOutput" | grep "scan" | awk '{print $6}')"
It was ... awk '{print $8}' in the previous output format

4. if [ "$scrubRepBytes" != "N/A" ] && [ "$scrubRepBytes" != "0B" ]; then scrubRepBytesColor="$warnColor"; else scrubRepBytesColor="$bgColor"; fi
The "0B" was "0" in the previous output format.

These changes should fix the incorrect entries in the "Zpool Status Report Summary" table and prevent the error relating to the illegal operation with the date command.

Hope this helps,
Mike.

That totally saved my day ! I was struggling on debugging this script since too much time ! Thanks a lot man !

Amsoil_Jim · Apr 7, 2022

I recently started getting this error using the script

Code:

Failed conversion of ``28-on-Mon_Mar'' using format ``%Y-%b-%e_%H:%M:%S''
date: illegal time format
usage: date [-jnRu] [-d dst] [-r seconds|file] [-t west] [-v[+|-]val[ymwdHMS]]
[-I[date | hours | minutes | seconds]]
[-f fmt date | [[[[[cc]yy]mm]dd]HH]MM[.ss]] [+format]

Any suggestion on the cause or a fix
Thanks

Magius · Apr 29, 2022

Hello everyone. I've been using these scripts for 10ish(?) years making occasional updates when new FreeNAS versions would break them. A couple weeks ago I was trying to do exactly that when I stumbled across a fork that emails reports in HTML format instead of plain text (https://github.com/edgarsuit/FreeNAS-Report). I liked the look of that output, but unfortunately the script hadn't been touched in a while, was only compatible w/ SATA devices like these scripts used to be, etc.

I decided to update the HTML script, and once finished I contacted the owner to see about updating it on his repo. He told me he hasn't touched the script in ~5 years and wasn't really interested in messing with it anymore, but gave me permission to share it in the forums or fork it, etc.

I'm not sure if this is the best place to put it (I don't have my own Github), but I'll attach the source below along with two example reports from my TrueNAS servers. The gist of what I added to the script, from the changelog:
# v1.4:
# - Rewrote device classification logic
# - Added health and status reporting for SAS and NVME devices
# - Improved handling of SSDs
# - Confirmed correct reporting for Conveyance self test type

If anyone wants to use any of the code to backport into the original text-based versions of the scripts you're more than welcome to. And if you have a server with an NVME device you could test this on I'd appreciate it. I've written several SMART tools before (shameless plug: https://github.com/truenas/py-SMART), so I *think* this code will work for NVME. I just don't have any NVME devices in a TrueNAS system to test with.

Hoping someone else finds this useful. Thanks!

dak180 · Apr 29, 2022

Magius said:
I liked the look of that output, but unfortunately the script hadn't been touched in a while, was only compatible w/ SATA devices like these scripts used to be, etc.

Did you checkout some of the more up to date forks (like mine)?

Magius · Apr 29, 2022

dak180 said:
Did you checkout some of the more up to date forks (like mine)?

I did not, but it seems like I should have, whoops! Clearly this whole html version of the script escaped my notice for several years, so when I found it I figured I'd spend an afternoon fixing it up for my use.

Is there a consensus on the best/newest fork out there? Is yours it? I'll test yours tomorrow and if I can think of any improvements I'll reach out to you. Thanks!

Magius · Apr 29, 2022

I had a free minute to look at the newer scripts. I downloaded the 'bibz' fork of the 'dak180' fork, which looked like the most recently updated, version "1.9". The biggest issue I saw right off the bat is that it still has no support at all for SAS drives. On both of my servers it only detected the SATA drives. It's cool that it also supports NVME, but I suspect far more people are using SAS HDDs in their TrueNAS servers than NVME SSDs..? Supporting SAS seems like a bare minimum capability IMHO, but maybe I can help you guys add that if you're interested? The parsing is trivial since SAS drives don't have SMART attributes, but for that same reason it really requires a new summary table format vs. the SATA HDDs.

Running the "1.9" script on my primary server I got no errors, and the summary table looked about how I'd expect. Nice!
Running the script on my backup server I got a handful of errors but I'm not sure they affected anything? The summary table looked about right but it did not detect my SSDs' reallocated sectors / bad blocks (24 and 35, respectively, it said 0 and 0). I'd definitely want to fix that! Technically it also didn't output my total bytes written for either SSD (it looks like it got 'null' there), but I can't blame it for that. It's a huge pain to figure out how most SSDs report bytes written, and I hadn't even attempted to program that detection into my own script either. Frankly it's a huge pain to reliably parse *any* non-NVME SSD attributes, so if we can just add SAS parsing to the script I'd 100% forgive any weirdness with the SSD tables

If you're interested, here are the errors it threw on my backup server. I'm not sure what they mean without any line numbers:

root@backup[~]# ./new_script.sh -c config.txt
jq: error (at <stdin>:755): Cannot iterate over null (null)
bc: stdin:1: syntax error: * unexpected
bc: stdin:1: syntax error: > unexpected
bc: stdin:1: syntax error: > unexpected
bc: stdin:1: syntax error: * unexpected
jq: error (at <stdin>:755): Cannot iterate over null (null)
bc: stdin:1: syntax error: * unexpected
bc: stdin:1: syntax error: > unexpected
bc: stdin:1: syntax error: > unexpected
bc: stdin:1: syntax error: * unexpected

Do you (@dak180) have any plans to merge in any of the changes from the bibz fork (v1.7 -> v1.9)? If I'm going to contribute to someone's version I'd like to know which baseline I should start from.

Thanks!

Ericloewe · Apr 29, 2022

Given the increasing relevance of NVMe from boot devices to bulk storage and accelerators, and the long-held general aversion to SAS disks (why pay more for the same disk?), I’d hesitate to say that there are more SAS users than NVMe users out there.

Magius · Apr 29, 2022

Ericloewe said:
Given the increasing relevance of NVMe from boot devices to bulk storage and accelerators, and the long-held general aversion to SAS disks (why pay more for the same disk?), I’d hesitate to say that there are more SAS users than NVMe users out there.

That's an interesting take. Not to be argumentative but I've always found used SAS drives to be cheaper than equivalent capacity used SATA, and with the (admittedly arguable) better reliability (5yr warranty vs 1-3yr, better URE spec, etc), SAS seems like a no brainer for your bulk data pools.
I agree NVME are becoming more popular as well, I have them in all my non-TrueNAS machines, but with a lot of people using previous gen server boards to save cost, eg supermicro, a lot of builds don't support NVME.
I guess my main point was that SAS and NVME are both very easy to support in this kind of script, as their health attributes are more standardized, while SATA (especially for SSDs) is very difficult to parse correctly without tons of manufacturer-specific tweaks. It seems odd to me to not build in SAS support, regardless of whether we think it's more or less commonly used than SATA or NVME?

dak180 · Apr 30, 2022

Magius said:
Is there a consensus on the best/newest fork out there? Is yours it?

The network view is the easiest way to check on such things; the version numbers have no real consensus or commonality.

Magius said:
The biggest issue I saw right off the bat is that it still has no support at all for SAS drives.

See this issue for an overview of the difficulties in supporting SAS drives (lack of standards).

Magius · Apr 30, 2022

dak180 said:
See this issue for an overview of the difficulties in supporting SAS drives (lack of standards).

Actually SAS and NVME drives are both far more standardized than SATA. SATA, SSDs particularly, is the one that's difficult to correctly support. Most tools just assume that whatever smartctl says is true, when in many cases smartctl has no idea what the SMART attributes mean so it just applies a label based on what other SATA drives tend to use that attribute number for. This causes all kinds of bad assumptions in tools, and is almost impossible to get right with any consistency across manufacturers. That's why I said I can't blame the current script for reporting 0 reallocated sectors when my drives actually have 20-30, getting that right is a nightmare on SATA.

SAS drives on the other hand don't even have SMART attributes, so there's nothing for smartctl to misinterpret. That's a whole ton of work you just don't have to worry about. Instead you have to probe SCSI logs for the info you want, and there's not much of it per standard. The only things that really matter on SAS drives are the grown defect list and the uncorrectable (read/write/verify) errors in the error log (smartctl -l error). Power on hours is found in the background scan results log (smartctl -l background), and unlike SATA it's always reported in the same format no weird millisecond resolution or whatever a manufacturer wants to report. Those 5 things cover all you really need for SAS, and they're standard on every SAS drive ever made. The only other things worth grabbing IMHO are the number of bytes read/written(and verified, less so) and those three are standardized and found in the same error log as above.

Let me know if you'd like any assistance working this SCSI stuff into the script. Or just steal the parsing code from the one I uploaded above.

dak180 · Apr 30, 2022

Magius said:
Let me know if you'd like any assistance working this SCSI stuff into the script.

I would love it if you could contribute the parsing you have done (see also other outputs in the ticket); I have not been doing much with it because I do not have any SAS drives to test with and the history @jgreco has noted before makes finding documentation difficult at best.

jgreco · Apr 30, 2022

dak180 said:
I would love it if you could contribute the parsing you have done (see also other outputs in the ticket); I have not been doing much with it because I do not have any SAS drives to test with and the history @jgreco has noted before makes finding documentation difficult at best.

Wow, someone reads stuff I wrote. Yay!

dak180 · Apr 30, 2022

jgreco said:
Wow, someone reads stuff I wrote. Yay!

You are currently in the lead as an author of posts that I have bookmarked (definitely a fan of good documentation).

jgreco · Apr 30, 2022

Yeah, but, still, that was somewhat obscure. It was just me blabbing on and on in the middle of a thread somewhere.

Magius · Apr 30, 2022

dak180 said:
I would love it if you could contribute the parsing you have done (see also other outputs in the ticket); I have not been doing much with it because I do not have any SAS drives to test with and the history @jgreco has noted before makes finding documentation difficult at best.

Ha, yeah jgreco really nailed the history there pretty well. There are major pros and cons in comparing the SCSI and ATA standards, but fortunately SCSI is super easy to parse for this kind of tool. To do it the way your (@dak180) script wants it, you want the command 'smartctl -AHijl error /dev/XXX'. That will give you something like the below (done as quote vs. code on purpose so I can mark it up inline).

root@freenas:~ # smartctl -AHijl error /dev/da12
{
"json_format_version": [
1,
0
],
"smartctl": {
"version": [
7,
2
],
"svn_revision": "5155",
"platform_info": "FreeBSD 12.2-RELEASE-p14 amd64",
"build_info": "(local build)",
"argv": [
"smartctl",
"-AHijl",
"error",
"/dev/da12"
],
"exit_status": 0
},
"device": {
"name": "/dev/da12",
"info_name": "/dev/da12",
"type": "scsi",
"protocol": "SCSI" #Note that this can be used to identify 'SCSI', 'ATA' and 'NVMe' devices to determine what kind of parsing to do next
},
"vendor": "HITACHI",
"product": "HUS72303CLAR3000",
"model_name": "HITACHI HUS72303CLAR3000", #This is essentially the "Model Family" that many scripts parse.
"revision": "C442",
"scsi_version": "SPC-4",
"user_capacity": {
"blocks": 5860533168,
"bytes": 3000592982016
},
"logical_block_size": 512,
"rotation_rate": 7200, #Just like SATA, this will tell you if it's an SSD (0) or HDD (!0)
"form_factor": {
"scsi_value": 2,
"name": "3.5 inches"
},
"serial_number": "YXGKV08K", # Self Explanatory
"device_type": {
"scsi_value": 0,
"name": "disk"
},
"local_time": {
"time_t": 1651375013,
"asctime": "Sat Apr 30 23:16:53 2022 EDT"
},
"smart_status": {
"passed": true # Self Explanatory
},
"temperature": {
"current": 37, # Self Explanatory
"drive_trip": 85
},
"power_on_time": {
"hours": 6929, # Self Explanatory
"minutes": 34
},
"scsi_grown_defect_list": 0, # Essentially the same as SATA's "Reallocated Sectors", this is the most important value to parse.
"scsi_error_counter_log": {
"read": {
"errors_corrected_by_eccfast": 0, #These corrected errors don't really matter, I see drives with thousands of them routinely
"errors_corrected_by_eccdelayed": 0,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 0,
"correction_algorithm_invocations": 38430822,
"gigabytes_processed": "169980.507", # Total GB read, if you want it. See below for writes and verifies
"total_uncorrected_errors": 0 # These are the other important error counters (uncorrected read, write & verify) to parse
}, # While they're not quite the same, consider these like the CRC and Offline Unc errors most
"write": { # scripts report for SATA drives.
"errors_corrected_by_eccfast": 0,
"errors_corrected_by_eccdelayed": 0,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 0,
"correction_algorithm_invocations": 735266,
"gigabytes_processed": "110098.841", # See note above
"total_uncorrected_errors": 0 # See note above
},
"verify": {
"errors_corrected_by_eccfast": 0,
"errors_corrected_by_eccdelayed": 0,
"errors_corrected_by_rereads_rewrites": 0,
"total_errors_corrected": 0,
"correction_algorithm_invocations": 3672791,
"gigabytes_processed": "27237.958", # See note above
"total_uncorrected_errors": 0 # See note above
}
}
}

That information essentially gives you parity with what most scripts report for SATA drives, with zero chance of misinterpretation due to wonky SMART attributes. The one big thing missing is the self test log data. Unfortunately, there seems to be a bug in smartctl where it can't return the selftest log in json format? None of my previous tools used json format so I'd never run into this issue, but the data is very easy to parse from the standard output. See below, first for the bug:

root@freenas:~ # smartctl -jl selftest /dev/da12
{
"json_format_version": [
1,
0
],
"smartctl": {
"version": [
7,
2
],
"svn_revision": "5155",
"platform_info": "FreeBSD 12.2-RELEASE-p14 amd64",
"build_info": "(local build)",
"argv": [
"smartctl",
"-jl",
"selftest",
"/dev/da12"
],
"exit_status": 0
},
"device": {
"name": "/dev/da12",
"info_name": "/dev/da12",
"type": "scsi",
"protocol": "SCSI"
}
} #Looks like it just dies after the info section, no selftest log output at all..? I tested and this command works as expected for my SATA drives. I wonder if the smartmontools folks are aware of this?

And here's the standard non-JSON version:

root@freenas:~ # smartctl -l selftest /dev/da12
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 6811 - [- - -]
# 2 Background short Completed - 6644 - [- - -]
# 3 Background short Completed - 6476 - [- - -]
# 4 Background short Completed - 6308 - [- - -]
# 5 Background short Completed - 6068 - [- - -]
# 6 Background short Completed - 5900 - [- - -]
# 7 Background short Completed - 5733 - [- - -]
# 8 Background short Completed - 5565 - [- - -]
# 9 Background short Completed - 5397 - [- - -]
#10 Background short Completed - 5229 - [- - -]
#11 Background short Completed - 5061 - [- - -]
#12 Background short Completed - 4893 - [- - -]
#13 Background short Completed - 4669 - [- - -]
#14 Background short Completed - 4501 - [- - -]
#15 Background short Completed - 4333 - [- - -]
#16 Background short Completed - 4165 - [- - -]
#17 Background short Completed - 3925 - [- - -]
#18 Background short Completed - 3757 - [- - -]
#19 Background short Completed - 3589 - [- - -]
#20 Background short Completed - 3421 - [- - -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]

For the normal summary report that just has the type of the last test and how long ago it completed, everything you need is in that table, so you're done!

If you wanted to get fancier with parsing failed tests, you'll need a decoder ring for what are called Key Code Qualifiers (KCQs), which would show up on the far right where the three dashes are. The labels on top stand for Sense Key (SK), Additional Sense Code (ASC) and ASC Qualifier (ASCQ). There are hundreds of KCQs defined in the SCSI standard, but only a subset can be reported by disk drives (others are reserved for tapes, optical, etc.), and an even smaller subset would ever likely show up in a self test result. For example a write failure would be likely, a "device not ready" would not be. A good reference for the full list is here

Regardless, rather than try to decode the KCQ's in the reporting script, since the text wouldn't fit in a summary table anyway, I'd recommend that if a selftest fails, you just parse the 3 digits between the [ ]'s and print them in a summary table box (labeled 'Selftest KCQ' maybe?) in Warn or Crit color. The user can google or come ask in the forums what the codes mean :) And frankly, there's no reason to even go this far, since the script doesn't do anything similar for failed SATA selftests.

As I mentioned earlier I haven't used Github, so I'm not really sure how to check out your repo and contribute code to do this for you. I'm sure you'll agree the implementation is quite trivial. I could download, edit and send the script back to you(?) Or if you have other contributors any of them should be able to follow along with the above notes and do the updates.

Hope this is helpful. Thanks!

Magius · Apr 30, 2022

Just to follow up so nobody else does it, I searched the open tickets against smartmontools and didn't see anything that looked like this problem, so I created a new ticket with the bug report. https://www.smartmontools.org/ticket/1606

dak180 · May 1, 2022

Magius said:
As I mentioned earlier I haven't used Github, so I'm not really sure how to check out your repo and contribute code to do this for you. I'm sure you'll agree the implementation is quite trivial. I could download, edit and send the script back to you(?) Or if you have other contributors any of them should be able to follow along with the above notes and do the updates.

Could you try out my first pass and comment in the github issue tracking this (I have no sas drives and therefor no way to test myself)?

Important Announcement for the TrueNAS Community.

Scripts to report SMART, ZPool and UPS status, HDD/CPU T°, HDD identification and backup the config

Cadet

Guru

Cadet

Dabbler

Contributor

Explorer

Attachments

Patron

Explorer

Explorer

Server Wrangler

Explorer

Patron

Explorer

Patron

Resident Grinch

Patron

Resident Grinch

Explorer

Explorer

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Scripts to report SMART, ZPool and UPS status, HDD/CPU T°, HDD identification and backup the config"

Similar threads