multi_report.sh version for Core and Scale 3.0

joeschmuck · Oct 1, 2023

@GrimmReaperNL
I can send you an edited copy of the script which does not check for multiple running scripts, but I'd rather figure out what is actually happening on your system, it's the only one I've heard of having this problem. I will send you an email so we can take this offline, hopefully resolve it, and then post the issue and fix.

ohlin5 · Oct 12, 2023

Just came here to say thank you so much for this script - it already surfaced some reallocated events on one of my drives that I had not seen before. Dead simple and just works. Love it!

ohlin5 · Oct 23, 2023

Deeda said:
Thanks for the update!

I just updated one of my TrueNAS servers that is having the issue with reporting incorrectly on the "Last Test Age". Unfortunately that error still seems to persist. Eg, as you can see from the attachments, it tells me that drive da0 has a last test age of 610, but using the -dump command shows that da0 has completed short and long tests.

Is there any fix for this? I've got my SMART test tasks set to test all disks, and I can see from smartctl that ada0 and ada1 (my boot pool) both show short and long tests, but they both show very high 'last test age' column numbers (highlighted in orange). All my other disks report OK... for some reason it's just my boot pool.

joeschmuck · Oct 23, 2023

ohlin5 said:
Is there any fix for this? I've got my SMART test tasks set to test all disks, and I can see from smartctl that ada0 and ada1 (my boot pool) both show short and long tests, but they both show very high 'last test age' column numbers (highlighted in orange). All my other disks report OK... for some reason it's just my boot pool.

That specific problem was solved early in the year, you have a different version of this problem.

I will need some data in order to provide a proper answer/fix. Every case is always different unfortunately. I suspect the SMART data is being presented in a different format, which does happen since there is no standard for displaying SMART data.

There are two ways to do this, you could run the script from the command line and add the switch -dump email which should send me all the data I need to figure it out, or if you would prefer, you could run the script using just -dump and then open a message in Conversations and attach all the files from the dump. The -dump email is much easier on you. As others know here, I do not share email addresses with anyone and no personal data is transmitted either. And this goes to a dedicated email address only for troubleshooting this script. It's not used for anything else.

If you reach out within the next 3 hours, I should be able to massage the script for your situation and then everyone can benefit from it when I publish version 2.5.

ohlin5 · Oct 23, 2023

joeschmuck said:
That specific problem was solved early in the year, you have a different version of this problem.

I will need some data in order to provide a proper answer/fix. Every case is always different unfortunately. I suspect the SMART data is being presented in a different format, which does happen since there is no standard for displaying SMART data.

There are two ways to do this, you could run the script from the command line and add the switch -dump email which should send me all the data I need to figure it out, or if you would prefer, you could run the script using just -dump and then open a message in Conversations and attach all the files from the dump. The -dump email is much easier on you. As others know here, I do not share email addresses with anyone and no personal data is transmitted either. And this goes to a dedicated email address only for troubleshooting this script. It's not used for anything else.

If you reach out within the next 3 hours, I should be able to massage the script for your situation and then everyone can benefit from it when I publish version 2.5.

I'm sorry I just saw this - I'm going to run it now; you should have an email shortly! Just promise me you won't laugh at some of my old ass equipment LOL ;)

joeschmuck · Oct 24, 2023

Got your email. I will take a look at it today and I should have an answer later on. But off to work I go right now.

EDIT: I can only see what drives you have for hardware, not any old ass equipment unless the drives are old

joeschmuck · Oct 24, 2023

@ohlin5 I examined the data and sent you a very detailed email. For the folks here, the problem was that the SMART TEST LOG Lifetime Hours rolled over at #FFFF (65535 hours). The only way to fix this in the script is to go into Custom Drive Settings and Ignore Last Test Age. Some of you may recall that I would customize the script for some people if they asked, to handle specific drive issues of this type. It was a lot of work, and each one was handled by serial number. The Custom Drive Settings replaced what I was doing manually myself.

Hopefully this case is closed.
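
For anyone wondering why a rollover produces a huge "Last Test Age", here is a minimal sketch of the arithmetic. It assumes the self-test log stores lifetime hours in a 16-bit field, so the logged value wraps after 65535, and that the age is derived by subtracting the logged hour from the current power-on hours. The helper names and sample numbers are made up for illustration and are not taken from multi_report.sh.

```shell
#!/usr/bin/env bash
# Illustration only; these helper names are hypothetical and not from multi_report.sh.

# Value a 16-bit lifetime-hours field shows for a given true hour count.
logged_hours() {
    echo $(( $1 % 65536 ))
}

# Naive age in hours: current power-on hours minus the logged test hour.
naive_test_age_hours() {
    local power_on_hours=$1 logged=$2
    echo $(( power_on_hours - logged ))
}

poh=70000                      # assumed current power-on hours
logged=$(logged_hours 69990)   # the test really ran 10 hours ago
echo "logged test hour: $logged"
echo "naive test age:   $(naive_test_age_hours "$poh" "$logged") hours"
```

With these sample numbers the test really ran 10 hours ago, but the wrapped log entry makes it look tens of thousands of hours old, which is why ignoring Last Test Age for the affected drive is the practical workaround.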

dak180 · Oct 24, 2023

@joeschmuck are you using --log="xselftest,selftest" or --log="selftest"? If the disk supports xselftest then, among other improvements (support for larger disks), I do not believe it has the same issue with timestamps wrapping.

joeschmuck · Oct 24, 2023

dak180 said:
@joeschmuck are you using --log="xselftest,selftest" or --log="selftest"? If the disk supports xselftest then, among other improvements (support for larger disks), I do not believe it has the same issue with timestamps wrapping.

Thanks for that information. I was just using -x --json. It is strange that it would be reported differently, but we will have to see.

limehawk · Oct 24, 2023

How long should the script take to run? We have 33 disks and 4 storage pools. Script seems to hang without any status about what is going on.

Code:

Multi-Report v2.4.4 dtd:2023-08-19 (TrueNAS Core 12.0-U8.1)
Checking for Updates
Current Version 2.4.4 -- GitHub Version 2.4.4
No Update Required


^C

joeschmuck · Oct 24, 2023

First, this script should run fine on TrueNAS 12.0, I did test that out several months ago.

The script is not super fast as it must poll every drive. The longest time I've ever seen is 16 minutes and 28 seconds, and that was for 201 drives (25 are SSD, the rest HDD).

The script first looks for any updates, which it looks like it did, and told you no updates exist. For your system I would expect 5 minutes or less, but I'd let it run for 10 minutes; that should certainly be long enough. If any of the drives are not spinning then you need to allow additional time to spin up the drives, which adds a lot when you have many drives. For most people the script runs in under 2 minutes, and the run time is noted in the output email.

The script should tell you when it's done as well when run from the CLI.

If it has been a long time, I'd recommend you open up another CLI and run 'top' to see if it is still running. Of course the CTRL-C should have kicked it out and left behind some temp files under the /tmp/ directory since it was unable to clean them up. When you reboot, those temporary files will vanish.

If you cannot get the script to complete, please let me know how you have the script installed and how you are running it. You can PM me since this will likely get lengthy.

Also, if we find out the script is hanging for some reason we can't fix, @dak180 has a different version of this script that he's been developing for a while. While I'd hate to lose a customer, it's more important that the end user is satisfied.
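
If you would rather not watch 'top', here is a minimal sketch of the "let it run for 10 minutes" advice as a small bash helper. wait_for_pid is a hypothetical name, not something shipped with multi_report.sh. It only polls a process ID and reports whether the process finished inside the limit. It does not kill anything, so the script can still clean up its own temp files.

```shell
#!/usr/bin/env bash
# Sketch only; wait_for_pid is a hypothetical helper, not part of multi_report.sh.
# Returns 0 if the process exits within the limit (in seconds), 1 if it is still running.
wait_for_pid() {
    local pid=$1 limit=$2 waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use (commented out so nothing is started by accident):
#   ./multi_report.sh &
#   if wait_for_pid "$!" 600; then
#       echo "finished within 10 minutes"
#   else
#       echo "still running after 10 minutes"
#   fi
```

The 600 second limit in the example simply mirrors the 10 minute figure above. Adjust it if your drives need extra time to spin up.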

joeschmuck · Oct 25, 2023

dak180 said:
@joeschmuck are you using --log="xselftest,selftest" or --log="selftest"? If the disk supports xselftest then, among other improvements (support for larger disks), I do not believe it has the same issue with timestamps wrapping.

Didn't work but at least we tried and I appreciate the suggestion. I don't know everything and I'm always willing to take advice. @ohlin5 is up and running with those two alarmed drives for Test Age ignored.

crembz · Nov 7, 2023

I've tried running this on TrueNAS SCALE 23 but am hitting the following error:

./multi_report.sh: line 3282: ( / 8760): syntax error: operand expected (error token is "/ 8760)")

Has anyone else hit this?

GrimmReaperNL · Nov 7, 2023

crembz said:
I've tried running this on TrueNAS SCALE 23 but am hitting the following error:

./multi_report.sh: line 3282: ( / 8760): syntax error: operand expected (error token is "/ 8760)")

Has anyone else hit this?

Ran without issue for me on version .2. I have a special version from Joe, though.

joeschmuck · Nov 7, 2023

crembz said:
I've tried running this on TrueNAS SCALE 23 but am hitting the following error:

./multi_report.sh: line 3282: ( / 8760): syntax error: operand expected (error token is "/ 8760)")

Has anyone else hit this?

To answer your question, yes, a lot of people have received a form of that error. I have been able to modify the script to account for yet another version of the SMART report.

Question: What version of Multi-Report are you running? It should be v2.4.4 dated 2023_08_19. Also, exactly which version of SCALE (full version number please). I'm fairly certain I have tested the few different versions but I can double check, but typically this type of error is due to a value being read that was expected to be present but was not.

If you are running this version, odds are you have "yet another" drive that reports slightly differently. If you could run the script from the command line and add the switch -dump email, it will send me a copy of the data I need to duplicate the problem (in most cases) and resolve it. If you would prefer not to do it this way, since I would then know your email address, you could run the script using just -dump and then PM me with all the attachments you received from the script. I can do it either way; the '-dump email' option is just easier on you. Of course I'm assuming the script does not fail and stop at the error. In the past it has continued to run. But if it stops, we can work through it.
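
For the curious, here is a minimal sketch of how a missing value produces that exact message in bash. 8760 is the number of hours in a 365-day year, so the failing expression looks like an hours-to-years conversion, but that reading is an assumption. hours_to_years is a hypothetical helper for illustration, not the code at line 3282.

```shell
#!/usr/bin/env bash
# Sketch only; hours_to_years is a hypothetical helper, not the script's code.

# Unguarded form: with hours="" bash ends up evaluating "( / 8760)" and reports
# "syntax error: operand expected".
#   years=$(( ( $hours / 8760) ))

# Guarded form: print whole years, or "N/A" when the value is missing or not a number.
hours_to_years() {
    local hours=$1
    if [[ $hours =~ ^[0-9]+$ ]]; then
        echo $(( 10#$hours / 8760 ))   # 10# avoids octal parsing of leading zeros
    else
        echo "N/A"
    fi
}

hours_to_years 17520   # two years of power-on time
hours_to_years ""      # value absent from the SMART data
```

The first call prints 2 and the second prints N/A instead of aborting the arithmetic.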

crembz · Nov 7, 2023

joeschmuck said:
To answer your question, yes, a lot of people have received a form of that error. I have been able to modify the script to account for yet another version of the SMART report.

Question: What version of Multi-Report are you running? It should be v2.4.4 dated 2023_08_19. Also, exactly which version of SCALE (full version number please). I'm fairly certain I have tested the few different versions but I can double check, but typically this type of error is due to a value being read that was expected to be present but was not.

If you are running this version, odds are you have "yet another" drive that reports slightly differently. If you could run the script from the command line and add the switch -dump email, it will send me a copy of the data I need to duplicate the problem (in most cases) and resolve it. If you would prefer not to do it this way, since I would then know your email address, you could run the script using just -dump and then PM me with all the attachments you received from the script. I can do it either way; the '-dump email' option is just easier on you. Of course I'm assuming the script does not fail and stop at the error. In the past it has continued to run. But if it stops, we can work through it.

Thanks Joe,

I actually added two drives which I recently added to a pool into the ignore_drives section and it ran without error so I believe that is the problem.

Let me remove them and rerun the dump script. How will you know if it is from me?

joeschmuck · Nov 8, 2023

crembz said:
How will you know if it is from me?

You will be the only person sending me an email. This email account is used only for multi-report as of now. It has virtually no traffic, which is a good thing. Also, if I had several emails, the problem you experience would tell me. Or I'd just guess. Regardless of who it is, if they send me a dump and there is a problem, I will do what I can to fix it.

It looks to fail on your LITEON SSD, but that was a quick glance. When I return from work in about 10 hours, I will actually examine it and identify the exact cause. Hopefully I will have an updated version today. I will email you the updated version and ask you to run it to verify it is fixed. The new version will be v2.5 Beta because that is the version I'm working on. I have one additional feature I'm adding but I ran into a road block making it work the way I want it to work. It does work now but I want it to do the task a little bit differently. Little things matter. Since you do not have any NVMe drives you would not even see this new feature. I will explain more when I send you an email.

And thanks for reporting the issue.

crembz · Nov 8, 2023

joeschmuck said:
You will be the only person sending me an email. This email account is used only for multi-report as of now. It has virtually no traffic, which is a good thing. Also, if I had several emails, the problem you experience would tell me. Or I'd just guess. Regardless of who it is, if they send me a dump and there is a problem, I will do what I can to fix it.

It looks to fail on your LITEON SSD, but that was a quick glance. When I return from work in about 10 hours, I will actually examine it and identify the exact cause. Hopefully I will have an updated version today. I will email you the updated version and ask you to run it to verify it is fixed. The new version will be v2.5 Beta because that is the version I'm working on. I have one additional feature I'm adding but I ran into a road block making it work the way I want it to work. It does work now but I want it to do the task a little bit differently. Little things matter. Since you do not have any NVMe drives you would not even see this new feature. I will explain more when I send you an email.

And thanks for reporting the issue.

Thanks for that. Yes, I can confirm that adding the feel branded LITEON SSDs to the ignore list allows the script to complete without error. I look forward to the update and thanks heaps! Awesome work.

joeschmuck · Nov 8, 2023

You have email. The main issue was one LITEON drive did not have Power On Hours listed in the .json data. That is a new one for me. The other LITEON drive did have power on hours but it said the drive had been running for over 135 years. Crazy numbers. I added a hot patch to fix the problem for now and will clean it up this weekend and send you the nicer version. I doubt this will take me long, it's just that I don't have the time this moment to make it pretty. Functional is what you get for now. Your only other option would be to add them to the Ignore List, but that should be a last resort. And thanks for reporting the issue. I am sincere that I want to fix any issues if at all possible.

Also, those two LITEON drives are now part of my test database to test out future versions of the script. Thanks for the contribution.
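
As a rough illustration of the second LITEON case, here is a sketch of a sanity check on Power On Hours. The function name and the 50-year ceiling are assumptions picked for the example. This is not the hot patch described above.

```shell
#!/usr/bin/env bash
# Sketch only; the function name and the 50-year ceiling are assumptions for illustration.
MAX_PLAUSIBLE_YEARS=50

# Succeeds only when hours is a non-negative integer within the ceiling.
poh_is_plausible() {
    local hours=$1
    [[ $hours =~ ^[0-9]+$ ]] || return 1
    (( 10#$hours / 8760 <= MAX_PLAUSIBLE_YEARS ))
}

# 26280 hours is 3 years, 1182600 hours is 135 years, "" is a missing value.
for hours in 26280 1182600 ""; do
    if poh_is_plausible "$hours"; then
        echo "'$hours': plausible"
    else
        echo "'$hours': missing or implausible"
    fi
done
```

A drive that fails a check like this could be reported with a placeholder value rather than being added to the Ignore List outright.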

lozyyzens · Nov 18, 2023

OK, so it's my fault... I had been running the previous version with "/bin/sh /mnt/path/script.sh" under cron due to something historic that had caused problems with zsh default shell.
