
multi_report.sh version for Core and Scale 3.0

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Check the last line of your core file...

Otherwise, great work.
 
Joined
Jan 4, 2014
Messages
1,644
@TooMuchData This is still one of my favourite scripts, though it hasn't been maintained for quite some time and cracks have started to appear. A list of known issues can be found here. Are you planning on upgrading and maintaining this script?
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
I modified the script because no one else seemed inclined. I just wanted it to run in Scale. Then I thought I should share it. I have no plans to maintain the script going forward, but would probably fix errors that showed up on any of my four TrueNAS servers. I will look more closely at the list of outstanding issues and get back to you.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
> @TooMuchData This is still one of my favourite scripts, though it hasn't been maintained for quite some time and cracks have started to appear. A list of known issues can be found here. Are you planning on upgrading and maintaining this script?
Have you checked out my update for it?
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
Thanks dak180. I wish I had seen your version previously. I searched, but yours is not called "multi_report", the only title I've known.
I ran yours on Core and got the error "Please specify a config file location." I gather more is needed than just changing the email. I'll leave it to you to resolve.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
> I gather more is needed than just changing email.
I deleted my last post, because it wasn't based on the current version of @dak180's script. If you use the script in the topic/refactor branch, you need to specify a config file, like this: ./report.sh -c report.cfg. Since that file doesn't yet exist, the script will create it. You'll then need to edit report.cfg, set defaultConfig to 0 (to tell the script this isn't just a default file), enter your email address, and make any other desired changes. Then you can run the script with ./report.sh -c report.cfg, and you'll get the report.
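To script the "set defaultConfig to 0" step instead of opening an editor, here is a minimal sketch. It assumes report.cfg stores the flag on a line beginning with defaultConfig= (the generated file's exact format is not shown in this thread), and disable_default_config is a hypothetical helper, not part of @dak180's script.

```shell
#!/bin/sh
# Hypothetical helper: mark a generated report.cfg as user-edited.
# ASSUMPTION: the file contains a line beginning with "defaultConfig=".
disable_default_config() {
    cfg="$1"
    if ! grep -q '^defaultConfig=' "$cfg"; then
        echo "no defaultConfig line found in $cfg" >&2
        return 1
    fi
    # "-i.bak" is accepted by both GNU sed (SCALE) and BSD sed (CORE);
    # the backup copy is removed once the edit succeeds.
    sed -i.bak 's/^defaultConfig=.*/defaultConfig="0"/' "$cfg" && rm -f "$cfg.bak"
}

# Typical sequence on the NAS (commented out because it needs the real script):
# ./report.sh -c report.cfg           # first run creates report.cfg
# disable_default_config report.cfg   # then add your email address by hand
# ./report.sh -c report.cfg           # second run produces the report
```

The email address still has to be filled in by hand, since the name of that setting is not given here.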

@dak180, it looks like everything I'd said was missing was already there, and I just wasn't looking in the right place--my bad. I assume you'll be pushing the refactor branch to master soon enough. But first, it would be good to edit the README to note the syntax.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
> But first, it would be good to edit the README to note the syntax.
That is on my list of things to do when I get a few free moments.
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
> I figured it was a WIP since you hadn't merged the work into master yet.
One major reason not to do that yet is that I have not really gotten any feedback or testing from anyone besides myself, so any thoughts or issues you or anyone else has would be very useful.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Here is an update that I created yesterday and tested on both TrueNAS CORE & SCALE. I'd like any feedback, preferably constructive. I'm sure I could streamline some of the code, but since I'm trying to make it work on both platforms, baby steps. Of course, rename the file from .txt to .sh.

These are the things I like to see: minimal changes needed to the default configuration; a quick way to tell whether I need to read the email at all (hence Good/Critical/Warning in the email Subject line); and all the drives shown, even those not supporting SMART, where that is useful. By default I have all the options enabled except the backup of the FreeNAS config file, since the path to save it to needs to be established first.

Another change I made was to add 7 days to the scrubAgeWarn limit. The scrub is typically scheduled for a specific day of the week, so if you exceed the 31-day limit before that day of the week comes around, you get an alert, and I didn't like that. So I changed the default to 37 days. The temperature and sector values are also personal preferences and should be adjusted to your taste.

I also run a SMART short test daily on all my drives at 9PM except Tuesday; that test lasts no longer than 2 minutes on a good drive, longer on a failing drive. I run a SMART long test on all my drives at 9PM on Tuesday. This means a SMART test runs every day, so my testAgewarn=2, only because the long test could take over a day on my pool (it doesn't since I cleaned the pool up, but it did a few months ago).

I'm not sure I care about the "Seek Error Health" column; I'm not sure how helpful it is. And this is not all my work. I may have started a version of this script back in the day, but it's a team effort to create a well-thought-out script that many people can easily use.
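The two age limits described above (scrubAgeWarn=37, i.e. a 31-day scrub cycle plus 7 days of slack for a weekly schedule, and testAgewarn=2) reduce to a days-since comparison. A minimal sketch, assuming the last-scrub and last-test times are already available as Unix epoch seconds; age_days and age_status are hypothetical helpers, not functions from multi_report.

```shell
#!/bin/sh
# Thresholds quoted in the post: 31-day scrub interval + 7 days of slack,
# and a daily SMART test with room for a long test that runs past one day.
scrubAgeWarn=37
testAgewarn=2

# Hypothetical helper: whole days between two epoch timestamps (later minus earlier).
age_days() {
    echo $(( ($2 - $1) / 86400 ))
}

# Hypothetical helper: print WARNING if the age (days) exceeds the limit (days), else OK.
age_status() {
    if [ "$1" -gt "$2" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

now=$(date +%s)
last_scrub=$(( now - 33 * 86400 ))    # example: last scrub finished 33 days ago
age=$(age_days "$last_scrub" "$now")
echo "scrub age ${age}d: $(age_status "$age" "$scrubAgeWarn")"
```

With a 31-day limit the 33-day-old scrub in the example would already raise a warning; with 37 it stays OK until the weekly scrub day has had a chance to come around.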

# v1.4:
# - Run as a CRON job using /path/multi_report_v1.4.sh
# - Fixed to run automatically on both FreeBSD (CORE) and Linux Debian (aka SCALE) as of this date.
# - All SMART devices will report.
# - Added conditional Subject line (Good/Critical/Warning).
# - Added automatic SSD support.
# --- Some updates may be needed to fit some SSDs. Code in the area of about line 530 will
# --- need to be adjusted to add new attributes for the desired SSD fields.
# - UDMA_CRC_ERROR override: once a drive records this type of error, it cannot be cleared,
# --- so you can now offset it instead of having an alarm condition for old UDMA_CRC_Errors.
# - Added listing of non-SMART-supported drives. Use only if useful to you; some drives will
# --- still output some relevant data, many will not.
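The UDMA_CRC_ERROR override can be pictured as a per-drive baseline subtracted from the raw counter, so only new errors raise an alarm. This is only a sketch: the serial:count list format, the variable crcErrorOffsets and the function crc_effective are hypothetical and do not reflect how v1.4 actually stores its overrides. The example serial and count mirror the sdb note in the test output below.

```shell
#!/bin/sh
# Hypothetical override list: space-separated "serial:known_old_errors" pairs.
crcErrorOffsets="WD-WXQ1E36C19XC:2"

# Print the CRC error count left after subtracting the drive's recorded baseline.
crc_effective() {
    serial="$1"
    raw="$2"
    offset=0
    for entry in $crcErrorOffsets; do
        if [ "${entry%%:*}" = "$serial" ]; then
            offset="${entry#*:}"
        fi
    done
    effective=$(( raw - offset ))
    # Never report a negative count if the baseline was set too high
    if [ "$effective" -lt 0 ]; then
        effective=0
    fi
    echo "$effective"
}

crc_effective WD-WXQ1E36C19XC 2    # only the old errors: prints 0, no alarm
crc_effective WD-WXQ1E36C19XC 5    # 3 new errors since the baseline: prints 3
```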


TEST SCALE OUTPUT (Note: drive sdb has two UDMA_CRC_Errors but is not in an alarm condition because I offset them in the user-defined parameters; it shows a yellow background in the email while the SMART Status has a green background. The boot-pool scrub has a light blue background.)
---------------------------------
Code:
Multi-Report v1.4a

ZPool Status Report Summary
Pool Name | Status | Read Errors | Write Errors | Cksum Errors | Used % | Scrub Repaired Bytes | Scrub Errors | Last Scrub Age | Last Scrub Duration
boot-pool | ONLINE | 0 | 0 | 0 | 24% | N/A | N/A | In Progress | Est Comp: 00:00:28
farm | ONLINE | 0 | 0 | 0 | 4% | N/A | N/A | Canceled | N/A
farmssd | ONLINE | 0 | 0 | 0 | 0% | N/A | N/A | Never Scrubbed | N/A

Hard Drive - SMART Status Report Summary
Device | Serial Number | SMART Status | Temp | Power-On Time | Start/Stop Count | Spin Retry Count | Realloc'd Sectors | Realloc Events | Current Pending Sectors | Offline Uncorrectable Sectors | UltraDMA CRC Errors | Seek Error Health | Last Test Age (days) | Last Test Type
/dev/sdb | WD-WXQ1E36C19XC | PASSED | 27*C | 0y 1m 20d 23h | 2070 | 0 | 0 | 0 | 0 | 0 | 0 | 200% | 0 | Short
/dev/sda | S2X1J90CA48799 | PASSED | 26*C | 0y 10m 4d 20h | 121 | 0 | 0 | 0 | 0 | 0 | 0 | 252% | 0 | Short

SSD Auto Detection Enabled

SSD - SMART Status Report Summary
Device | Serial Number | SMART Status | Temp | Power-On Time | Wear Level | Realloc'd Sectors | Realloc Events | Offline Uncorrectable Sectors | UltraDMA CRC Errors | Read Error Rate | Last Test Age (days) | Last Test Type
/dev/sdc | P02618119268 | PASSED | 0*C | 0y 2m 1d 19h | 100 | 0 | 0 | 0 | 0 | 0 | 0 | Short
########## ZPool status report for boot-pool ##########
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub in progress since Sun Dec 26 19:30:10 2021
        3.49G scanned at 28.8M/s, 2.88G issued at 23.8M/s, 3.49G total
        0B repaired, 82.54% done, 00:00:26 to go
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          sdd        ONLINE       0     0     0

errors: No known data errors

########## ZPool status report for farm ##########
  pool: farm
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub canceled on Sun Dec 26 14:25:45 2021
config:

        NAME                                      STATE     READ WRITE CKSUM
        farm                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ac0a43ea-6119-11ec-be7c-ac1f6b6ad038  ONLINE       0     0     0
            ad39332f-6119-11ec-be7c-ac1f6b6ad038  ONLINE       0     0     0

errors: No known data errors

########## ZPool status report for farmssd ##########
  pool: farmssd
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
config:

        NAME                                    STATE     READ WRITE CKSUM
        farmssd                                 ONLINE       0     0     0
          d7a22396-6119-11ec-be7c-ac1f6b6ad038  ONLINE       0     0     0

errors: No known data errors

########## SMART status report for sdb drive (Western Digital Blue: WD-WXQ1E36C19XC) ##########
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   150   142   021    Pre-fail  Always       -       1500
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2070
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1233
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2142
194 Temperature_Celsius     0x0022   116   099   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error   00%        1211             -
Short offline       Completed without error   00%        1222             -

########## SMART status report for sda drive (Seagate Samsung SpinPoint: S2X1J90CA48799) ##########
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   055   054   000    Old_age   Always       -       6178
  3 Spin_Up_Time            0x0023   091   091   025    Pre-fail  Always       -       2858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       121
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       7416
SNIP -> SNIP

No Errors Logged

Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error   00%        7394             -
Short offline       Completed without error   00%        7405             -

########## SMART status report for sdc drive (Plextor M3/M5/M6/M7 Series: P02618119268) ##########
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
SNIP -> SNIP
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       4125
233 Media_Wearout_Indicator 0x0003   100   100   000    Pre-fail  Always       -       773

No Errors Logged

Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error   00%        1480             -
Short offline       Completed without error   00%        1493             -

########## NON-SMART status report for sdd drive (USB Flash Drive: ) ##########
SMARTCTL DATA
/dev/sdd: Unknown USB bridge [0x125f:0xdb8a (0x1100)]
Please specify device type with the -d option.

FDISK DATA
Disk /dev/sdd: 14.45 GiB, 15518924800 bytes, 30310400 sectors
Disk model: USB Flash Drive
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 05D20A33-6115-11EC-AF54-AC1F6B6AD038

Device       Start       End   Sectors   Size Type
/dev/sdd1       40    532519    532480   260M EFI System
/dev/sdd2   532520  30285863  29753344  14.2G FreeBSD ZFS


 

Attachments

  • multi_report_v1.4a.txt

Dan Tudora

Patron
Joined
Jul 6, 2017
Messages
276
Hello @joeschmuck
I tried the script on TrueNAS CORE 12.0-U6.1 and have a problem:

Code:
nas0sstp# ./multi_report_v1.4a.sh
Failed conversion of ``28-on-Tue_Dec'' using format ``%Y-%b-%e_%H:%M:%S''
date: illegal time format
usage: date [-jnRu] [-d dst] [-r seconds|file] [-t west] [-v[+|-]val[ymwdHMS]]
            [-I[date | hours | minutes | seconds]]
            [-f fmt date | [[[[[cc]yy]mm]dd]HH]MM[.ss]] [+format]
nas0sstp# date
Tue Dec 28 21:09:23 EET 2021
nas0sstp#


Changed the root shell from zsh to sh: nothing changed.
Changed the root shell from sh to bash: nothing changed.
And now I am lost :D
cheers
 

Dan Tudora

Patron
Joined
Jul 6, 2017
Messages
276
Hello again
and now on TrueNAS-SCALE-22.02-RC.2

Code:
root@nas2sstp[~]# ./multi_report_v1.4a.sh
sysctl: cannot stat /proc/sys/kern/disks: No such file or directory
date: invalid option -- 'j'
Try 'date --help' for more information.
date: invalid option -- 'j'
Try 'date --help' for more information.
sed: can't read : No such file or directory
sed: can't read : No such file or directory
sed: can't read : No such file or directory
sed: can't read : No such file or directory
sed: can't read : No such file or directory
sed: can't read : No such file or directory
root@nas2sstp[~]#
root@nas2sstp[~]# date
Tue Dec 28 21:32:01 EET 2021
root@nas2sstp[~]#


> but it's a team effort to create a well thought out script that many people can easily use.
Yes indeed.
cheers
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Dan,
I will send you a private message so we can troubleshoot this. While I'm not certain, I was thinking it could be your configuration for your timezone and maybe language? I can change my test system to anything needed so I'd like to configure it similar to yours. I'll send that private message soon.
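The locale/timezone hunch can be checked independently of the script. The sketch below is an illustration under stated assumptions, not the fix that went into v1.4b: it pins LC_ALL=C so month and weekday names are parsed as English, and picks BSD date -j -f on CORE or GNU date -d on SCALE based on uname -s. The SCALE error above ("date: invalid option -- 'j'") is BSD date syntax reaching GNU date, which is the kind of mix-up this split is meant to avoid; whether it also cures the CORE "Failed conversion" message is untested. to_epoch is a hypothetical name, and the input layout is the one zpool status prints for scrub times.

```shell
#!/bin/sh
# Convert a zpool-style timestamp such as "Sun Dec 26 14:25:45 2021"
# to Unix epoch seconds, using whichever date(1) flavor the OS provides.
to_epoch() {
    case "$(uname -s)" in
        FreeBSD|Darwin)
            # BSD date (TrueNAS CORE; macOS uses the same syntax):
            # -j = do not set the clock, -f = parse input with this format
            LC_ALL=C date -j -f "%a %b %e %H:%M:%S %Y" "$1" +%s
            ;;
        *)
            # GNU date (Linux / TrueNAS SCALE) parses this layout directly with -d
            LC_ALL=C date -d "$1" +%s
            ;;
    esac
}

to_epoch "Sun Dec 26 14:25:45 2021"    # result depends on the local TZ setting
```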
 

TooMuchData

Contributor
Joined
Jan 4, 2015
Messages
188
@joeschmuck has produced version 1.4b that runs nicely for me on both Core and Scale. It includes warnings if SMART tests have not been run recently. I'll post it as the resource update. See history.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Just testing this on a bunch of scratch disks - some of which are SMART fails.
1. Doesn't spot SAS disks - I have one and it does not appear (sdf, which is also a smart fail)
2. All the rest of the disks are marked as Passed, but sdf, sdg, and sdj are all failing extended tests (they have no data on them, so it's not an issue). This is purely a test system.
3. The report title does say *CRITICAL ERROR* but the summary is wrong.
4. In the individual SMART Status Report it says test result Passed - which would appear to be incorrect (for selected drives)

Results are attached in the pdf

smartctl --scan shows all the disks (** = my notes):
root@scalenas[/mnt/ScratchSSD/SMB/Scale-Scripts]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device **SAS
/dev/sdg -d scsi # /dev/sdg, SCSI device **SMART Fail
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device **SMART Fail
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
 

Attachments

  • multi-report.pdf

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
> 4. In the individual SMART Status Report it says test result Passed - which would appear to be incorrect (for selected drives)
That is the drive's built-in SMART self-assessment. I can't fix that because it doesn't come from the script. Alternatively, I could restructure the script with a new column titled "Overall Results" for each drive, but while I like to have all the data available at a glance, too much data is busy. I could also change PASSED to a red background. That is why I like the email Subject line, which tells you that all is good or that something is wrong and you need to look at the data more closely.

> 3. The report title does say *CRITICAL ERROR* but the summary is wrong.
Same as the previous answer: this value comes from the smartctl command results, not a judgement on my part. The exception is the Critical Error message, which is defined by the other items (tempCrit, sectorsCrit, and crcErrors). What caused it here was the UDMA CRC Errors on any one of the three drives that have these errors. You could zero these out in the script.
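How a single drive's values can flip the whole Subject line is easiest to see as "take the worst status across all checks". This is a hypothetical sketch, not the script's code: the threshold numbers are illustrative rather than the script's defaults, and treating crcErrors as "maximum allowed CRC errors" is an assumption. drive_severity and subject_status are made-up names.

```shell
#!/bin/sh
# Illustrative thresholds only (names mirror the post, values are examples).
tempCrit=50
sectorsCrit=9
crcErrors=0

# Hypothetical: classify one drive from temperature (C), reallocated sectors
# and (offset-adjusted) UDMA CRC errors.
drive_severity() {
    if [ "$1" -ge "$tempCrit" ] || [ "$2" -ge "$sectorsCrit" ] || [ "$3" -gt "$crcErrors" ]; then
        echo CRITICAL
    else
        echo OK
    fi
}

# Hypothetical: reduce any number of per-check statuses to the worst one.
subject_status() {
    worst=GOOD
    for s in "$@"; do
        case "$s" in
            CRITICAL) worst=CRITICAL ;;
            WARNING) [ "$worst" = GOOD ] && worst=WARNING ;;
        esac
    done
    echo "$worst"
}

# One drive with 2 CRC errors is enough to make the whole report CRITICAL
subject_status "$(drive_severity 27 0 0)" "$(drive_severity 26 0 2)"    # prints CRITICAL
```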
> 2. All the rest of the disks are marked as Passed. But sdf, sdg, sdj are all failing extended tests (they have no data on them so it's not an issue). This is purely a test system.
You are correct. I can see an improvement here: scan the short/extended results and, if they contain the word "failure", mark the Last Test Type in RED so a user can go locate the actual error message. It's funny, because you have MultiZone errors for these two drives, which are likely a result of the read failures. I could include these values as well, but unfortunately not all MultiZone failures cause this problem.
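The "look for the word failure" idea can be sketched as a small filter over the self-test log text (for example the output of smartctl -l selftest /dev/sdX). selftest_status is a hypothetical helper, and the sample lines are illustrative, written in the style of smartctl's self-test log rather than captured from NugentS's drives.

```shell
#!/bin/sh
# Hypothetical helper: read self-test log text on stdin and print FAILED if
# any entry mentions "failure" (case-insensitive), otherwise OK.
selftest_status() {
    if grep -qi 'failure'; then
        echo FAILED
    else
        echo OK
    fi
}

# Usage on the NAS:  smartctl -l selftest /dev/sdg | selftest_status
printf '%s\n' \
  '# 1  Extended offline    Completed: read failure       90%      5012         1234567' \
  '# 2  Short offline       Completed without error       00%      4990         -' \
  | selftest_status    # prints FAILED
```

Scanning the whole log means an old failure keeps the flag set even after a later test passes; restricting the check to the "# 1" entry (the most recent test) is an alternative design.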
> 1. Doesn't spot SAS disks - I have one and it does not appear (sdf, which is also a smart fail)
I'd love to fix this, but unfortunately I do not have any SAS disks, so I cannot recreate this situation. Maybe you could find out the proper command to list this SAS disk so it can be included in the inventory? Looking at the script, at about line 148 you will see this text:
Code:
# Get Hard Drive listing - MUST support SMART
# variable smartdrives
# A drive is kept if SMART support is enabled and it is not an SSD (SSDs are reported separately).
if [ "$softver" != "Linux" ]; then
    # FreeBSD (CORE): the kernel publishes the disk list in kern.disks
    smartdrives=$(for drive in $(sysctl -n kern.disks); do
        if [ "$(smartctl -i /dev/"${drive}" | grep "SMART support is: Enabled")" ] && ! [ "$(smartctl -i /dev/"${drive}" | grep "Solid State Device")" ]; then
            printf "%s " "${drive}"
        fi
    done | awk '{for (i=NF; i!=0 ; i--) print $i }')   # print the collected names last-to-first, one per line
else
    # Linux (SCALE): characters 11-13 of "Disk /dev/sdX: ..." are the device name
    # (this assumes three-character names such as sda)
    smartdrives=$(for drive in $(fdisk -l | grep "Disk /dev/sd" | cut -c 11-13 | tr '\n' ' '); do
        if [ "$(smartctl -i /dev/"${drive}" | grep "SMART support is: Enabled")" ] && ! [ "$(smartctl -i /dev/"${drive}" | grep "Solid State Device")" ]; then
            printf "%s " "${drive}"
        fi
    done | awk '{for (i=NF; i!=0 ; i--) print $i }')
fi


The specific command I'm looking at is: fdisk -l | grep "Disk /dev/sd"
This should list all the drives, and apparently it may not be listing the SAS drive. If it does show up, then maybe the "cut" command is not working properly. This is Linux and things work differently, so tweaking may be required. Please provide a listing of the drives if it does show up and I should be able to make an adjustment. I don't think I can simulate a SAS drive through ESXi; I'll have to look into it.

Back to the "PASSED" in the Summary Report... I agree it's confusing, but I'm struggling to change it to anything other than what smartctl reports. This is why I think changing the background to red/yellow for alarm conditions on the specific value helps the user scroll through the raw data to locate the problem area.

Thanks for the feedback and comments are welcome.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I'll run some tests this weekend.
Changing the Passed to Red would work

It's useful having a bunch of ratty old disks.

What would also be nice is, when you list the drive results, to list them in order a-z rather than what appears to be a random order (I am sure it isn't).

Warning, my scripting skills are at the "Hello World" level, but I will give things a try. I did hack Spearfoot's FreeNAS-Scripts to spot NVME drives (and it was a truly horrid hack).
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
> It's useful having a bunch of ratty old disks
>
> What would also be nice is, when you list the drive results, to list them in order a-z rather than what appears to be a random order (I am sure it isn't)
I'll bet it is nice having a bunch of ratty old disks. It makes testing more inclusive.

Yes, it's the way the system lists the drives when the commands are executed, but I do understand your comment. I'll have to look into it, but first, fixing the operational issues. Sorting "should" be easy, but I've never done that in a BASH script before; time to learn.
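A plain sort would put sdaa between sda and sdb, so a minimal sketch of a natural drive order sorts by name length first and then alphabetically. sort_drives is a hypothetical helper. Note that the awk step in the script excerpt earlier prints the collected names last-to-first, so it reverses whatever order the system reports; a sort like this would replace that step.

```shell
#!/bin/sh
# Hypothetical helper: print drive names in natural order, one per line.
# Prefixing each name with its length and sorting numerically on that field
# first puts sda..sdz before sdaa, and ada2 before ada10.
sort_drives() {
    for d in "$@"; do
        printf '%s %s\n' "${#d}" "$d"
    done | sort -k1,1n -k2,2 | cut -d' ' -f2
}

sort_drives sdj sda sdaa sdb sdf    # prints sda, sdb, sdf, sdj, sdaa (one per line)
```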
> Warning, my scripting skills are at the "Hello World" level - but I will give things a try. I did hack Spearfoot's FreeNAS-Scripts to spot NVME drives (and it was a truly horrid hack)
That is okay, it's a test system. All I really need is for you to get to the shell and issue the command fdisk -l | grep "Disk /dev/sd", and if this does not show the SAS drive, then just issue the command fdisk -l and see if that shows it. I would imagine the first command would be fine, since the drive was identified as "sdf" when you ran the smartctl --scan command.

Also, while you are at it, could you test this SAS drive issue on TrueNAS Core as well? I'm curious if there is an issue there and I'd like to make sure we test everything and make one good fix for it all.

Cheers,
-Mark
 