
multi_report.sh version for Core and Scale 3.0

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
fdisk -l result is fdisk0l.txt attached
fdisk -l | grep "Disk /dev/sd"
Disk /dev/sdn: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdq: 14.75 GiB, 15837691904 bytes, 30932992 sectors
Disk /dev/sdp: 14.75 GiB, 15837691904 bytes, 30932992 sectors
Disk /dev/sdi: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdg: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdj: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdb: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sda: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sde: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdd: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdm: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdf: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdk: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdo: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdl: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdc: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdh: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors

smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device

smartctl -a /dev/sde
root@scalenas[/mnt/ScratchSSD/SMB/Scale-Scripts]# smartctl -a /dev/sde
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.81+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3000NM0023
Revision:             0006
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500625880ab
Serial number:        Z1Z5JDWE0000C507EFJZ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Feb 4 16:51:24 2022 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     25 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 52096:24
Manufactured in week 36 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  350
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2467
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 251129865
  Blocks received from initiator = 893628169
  Blocks read from cache and sent to initiator = 1215639293
  Number of read and write commands whose size <= segment size = 221084023
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 52096.40
  number of minutes until next internal SMART test = 36

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   254785013        2         0  254785015         92      35312.951          90
write:          0        0         0          0          0      24655.735           0
verify:      4981        0         0       4981          0          0.000           0

Non-medium error count:        6

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       -   52080               531 [0x3 0x11 0x0]
# 2  Background short  Failed in segment -->       -   52079           1550955 [0x3 0x11 0x0]
# 3  Background long   Failed in segment -->       -   52078               531 [0x3 0x11 0x0]
# 4  Background long   Completed                   -   52067                 - [-   -    -]
# 5  Background short  Completed                   -   52062                 - [-   -    -]
# 6  Background short  Completed                   -   52035                 - [-   -    -]
# 7  Background short  Completed                   -   52010                 - [-   -    -]
# 8  Background short  Completed                   -   51986                 - [-   -    -]
# 9  Background short  Completed                   -   51962                 - [-   -    -]
#10  Background short  Completed                   -   51938                 - [-   -    -]
#11  Background short  Completed                   -   51914                 - [-   -    -]
#12  Background short  Completed                   -   51890                 - [-   -    -]
#13  Background long   Completed                   -   51870                 - [-   -    -]
#14  Background short  Completed                   -   51842                 - [-   -    -]
#15  Background short  Completed                   -   51818                 - [-   -    -]
#16  Background short  Completed                   -   51794                 - [-   -    -]
#17  Background short  Completed                   -   51770                 - [-   -    -]
#18  Background short  Completed                   -   51746                 - [-   -    -]
#19  Background short  Completed                   -   51722                 - [-   -    -]
#20  Background short  Completed                   -   51698                 - [-   -    -]

Long (extended) Self-test duration: 26000 seconds [433.3 minutes]

I'll check and see if I have a SAS drive in my core box. I think I will need to install one
 

Attachments

  • fdisk0l.txt
    9.6 KB · Views: 280

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
No SAS in Core I am afraid. I'll plug one in when I get home on Sunday - I have a couple of spares hanging around
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
No SAS in Core I am afraid. I'll plug one in when I get home on Sunday - I have a couple of spares hanging around
Don't worry about it right now, let's fix up Scale first, then we can test on Core.

Here is a sorting modification you can easily make on your own:

There are six lines in the section starting around line 159 or so. Look for the following:
done | awk '{for (i=NF; i!=0 ; i--) print $i }' )

Change it to:
done | awk '{for (i=NF; i!=0 ; i--) print $i }' | tr ' ' '\n' | sort | tr '\n' ' ')

Now look for five more places to change it. Note that you are removing the right parenthesis, adding code, and then adding the right parenthesis again. It would be best to copy and paste to ensure spacing and punctuation are preserved.

This works in FreeBSD, I have not tested in Linux yet but the command looked the same so maybe it will work. This will sort the drive order alphabetically.
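The sorting change can be tried on its own before editing the script. This is a minimal sketch: the `drives` list and the `for` loop are hypothetical stand-ins for the script's real drive-discovery loop, and only the pipeline after `done` is taken from the post above.

```shell
# Hypothetical unsorted drive list; the real script builds this from the system.
drives="sdn sdq sdp sdi sdg"

# Same pipeline as the modified line: reverse the fields, put one name per
# line, sort alphabetically, then join the names back with spaces.
sorted=$(for d in $drives; do printf '%s ' "$d"; done \
  | awk '{for (i=NF; i!=0 ; i--) print $i }' | tr ' ' '\n' | sort | tr '\n' ' ')

echo "Sorted drives: $sorted"
```

Note that the final `tr` leaves a trailing space after the last drive name, which is harmless when the result is later split on whitespace.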

I will look at the data you sent me later tonight or tomorrow. I have to go out of town on a business trip Sunday for the week so working this issue won't happen until I get back home.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Also, while you are at it, could you test this SAS drive issue on TrueNAS Core as well?
I have a few SAS disks on CORE, but I'm away from home right now. If you can ping me in a couple of days I can check.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I checked the code changes for the sorting on both Core and Scale, and they seem to work. Tomorrow I will work the read failure message, but since I don't presently have a failed drive, I will have to do some crude fake testing and if it passes, I'll post an update with the two fixes (alphabetical sorting & read failure) and call it 1.4c-beta until I can do the third fix as well and then ensure it all passes the SAS drive listing. Still on the fence about changing the overall report summary. Thinking about making it a more customizable setup for users to select what they desire. It could be a large rewrite.

I have a few SAS disks on CORE, but I'm away from home right now. If you can ping me in a couple of days I can check.
That would be helpful. I will be away from home until next Saturday :cool: at Cape Canaveral, Florida. It's a work trip, unfortunately.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So here is a v1.4c_beta1 version which sorts the drive letters now and it will report most (not all) SMART Extended/Short test failures by looking for the result "Completed:" because the next thing is generally "read failure". This will need to be modified to address other error results of course. This version does not address the SAS drive issue experienced as I'm still unsure as to why it didn't show up. The Device column will now turn red for any failure for a select drive to complement the Subject Title. Also there is some Linux variant coding which makes a zero drive temp problematic, I have not tested this completely, honestly my eyes are bugging out on me right now. Time to pack for a business trip. And the script is getting a bit complex and thus complicated. Adding too many features is a sure sign for trouble. I may need to redesign it to make it easier to modify. Let's see what happens when Scale is released later this month, we may just come up with two separate versions vice a combined version. That would make it more manageable.
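The "Completed:" check described above can be sketched roughly as follows. The two log lines are illustrative samples in the usual ATA self-test log layout, not output from any drive in this thread, and the variable names are made up for the example. The idea is that a passing test reads "Completed without error", so a colon directly after "Completed" marks a test that finished with a failure such as "read failure".

```shell
# Illustrative ATA self-test log lines (sample text, not from a real drive here).
selftest_log='# 1  Extended offline    Completed: read failure       90%     52080         531
# 2  Short offline       Completed without error       00%     52062         -'

# Count log entries whose status is "Completed:" followed by a failure reason.
failed=$(printf '%s\n' "$selftest_log" | grep -c 'Completed:')

echo "Self-tests completed with a failure: $failed"
```

As the post notes, this only catches results phrased this way; SAS drives report "Failed in segment" instead and would need a separate pattern.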
 

Attachments

  • 14c_beta1.txt
    55.3 KB · Views: 226

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
So here is a v1.4c_beta1 version which sorts the drive letters now and it will report most (not all) SMART Extended/Short test failures by looking for the result "Completed:" because the next thing is generally "read failure". This will need to be modified to address other error results of course. This version does not address the SAS drive issue experienced as I'm still unsure as to why it didn't show up. The Device column will now turn red for any failure for a select drive to complement the Subject Title. Also there is some Linux variant coding which makes a zero drive temp problematic, I have not tested this completely, honestly my eyes are bugging out on me right now. Time to pack for a business trip. And the script is getting a bit complex and thus complicated. Adding too many features is a sure sign for trouble. I may need to redesign it to make it easier to modify. Let's see what happens when Scale is released later this month, we may just come up with two separate versions vice a combined version. That would make it more manageable.
For my scratchnas test bench
Two HDD's are now flagged in Red as SMART fails
One HDD is flagged as a Red Fail, possibly because it's showing a single UltraDMA CRC error (and has for a long time)
The SSD's are all flagged as red - but I don't know why. I don't think there is anything wrong with these drives.

And lastly - the SAS drive has appeared at the end of the list. It doesn't appear in the summary but /dev/sde does appear at the end.

The whole thing is below
 

Attachments

  • 1.4c-beta.pdf
    268.1 KB · Views: 227

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So to answer your results:
Drive WD-WMC4N2578099 is likely flagged due to the CRC errors. You can null those out in the script by following the built-in instructions.
The SSD's are likely showing red due to the missing data. The good thing is I should be able to fix most if not all of these with the data you provided, however I really do not feel like maintaining the script to address every SSD quirk and would rather make it fairly simple for the end user to make minor changes to it and some of this is already built into the script. When I return from my trip I will toss out something else to test but also feel free to muck with it yourself. The worst you could do is make a mess and have to replace the script and start over again. If you do make a change that works, please post it so others may use it.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
I have nulled out the UltraDMA Errors, this works for the HDD's but not for the SSD's which display as the screenshot below
1644163036878.png


SSD's are definitely an issue in that they report different stats for different drive types / manufacturers
sdp & sdq are both SMC SATADOM and that Last Test Age is rubbish (but accurate for the other 4)
And wear level is reported by the SATADOMs but not (as such) by the others.

Others may agree / disagree - but I think you are trying to include too much info. [Mind you I would like to see wear level on SSD's - which clearly isn't easy]. What do others think?

My scripting isn't good enough to work out some of the stuff you are doing I am afraid.
 

Attachments

  • 1644162907654.png
    46.3 KB · Views: 208

dak180

Patron
Joined
Nov 22, 2017
Messages
310
Thanks dak180. I wish I had seen your version previously. I searched, but yours is not called "mult_report", the only title I've known.
I ran yours on Core and got error "Please specify a config file location." I gather more is needed than just changing email. I'll leave it to you to resolve.
I updated the Getting Started section a while back; did you ever get back to trying it again?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
@dak180
On Scale (if it's meant to be running on Scale)
glabel is missing, please install
 

dak180

Patron
Joined
Nov 22, 2017
Messages
310
On Scale (if it's meant to be running on Scale)
glabel is missing, please install
I would not mind getting it working on scale but I do not have a way to test currently. So if anyone is interested in helping with that please open an issue or pull request on my github repo.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
@dak180
Issue raised - I have a sacrificial test system. It's running badblocks at the moment so I can't reboot, but I can do just about anything else
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Others may agree / disagree - but I think you are trying to include too much info. [Mind you I would like to see wear level on SSD's - which clearly isn't easy]. What do others think?
I think it depends on what people desire and to be honest, when it comes to SSD's, people who really want this stuff are going to need to learn a little bit about editing the script. Wear Level is a challenge too as how to normalize the various manufacturer values. I'd like to see it run from 100 and decrease to 0. My Samsung SSD's do this and I like it, but I have other SSD's that do not conform.

If there is a consensus of the minimum report required, I could work that. The tricky part for people editing a script is changing the number of columns used, while it is easy (for me) it will be difficult for people unfamiliar with it and unwilling to experiment.

My opinion is I like the columns for the Zpool Summary and the Hard Drive Summary, but when it comes to the SSD's, I need some ideas. Right now you can see what works, what doesn't work, but what is important to you (I mean anyone reading this)?

I have been thinking that maybe (just maybe) I could write a program (A.K.A. application) that runs on Linux (TrueNAS Scale) and make it user configurable. I have no idea what programming language to use for this but I'd like to link into the TrueNAS API or maybe the program could create a custom executable, heck I don't know, but I know working with a very limited scripting language has serious drawbacks. I'm willing to make a better gadget, just need an idea how to. And I can program, but I know I'm not great at it.

Well I made it to Cape Canaveral Florida, it's not hot and sunny :frown:
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@NugentS
Why are your two SSD's with -18 and -19 crcErrors listed in the negative? Did you apply the wrong offset? That doesn't make sense to me, they should be a zero value and yellow.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
I agree. I adjusted the script with the correct figure
UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18
But I think the adjustment is applied twice. I suppose I could try halving the values to 9.5 and 9. The top one works but I think the logic is applied twice
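A quick way to see why a doubled adjustment would produce the negative values reported: the sketch below uses the offset for serial 2011E2940536 from the settings above and assumes the raw UDMA CRC count on that drive equals the offset (19). The `adjust_crc` helper and the clamp at zero are hypothetical, not taken from multi_report.sh.

```shell
UCRC2SN="2011E2940536"
UCRC2CNT=19

# Hypothetical helper: subtract the known offset once and never go below zero.
adjust_crc() {
  adjusted=$(( $1 - $2 ))
  if [ "$adjusted" -lt 0 ]; then adjusted=0; fi
  echo "$adjusted"
}

raw=19   # assumed raw UDMA_CRC_Error_Count, equal to the configured offset

once=$(adjust_crc "$raw" "$UCRC2CNT")      # offset applied a single time
twice=$(( raw - UCRC2CNT - UCRC2CNT ))     # offset applied twice, no clamp

echo "$UCRC2SN offset applied once:  $once"
echo "$UCRC2SN offset applied twice: $twice"
```

Under that assumption, a single subtraction gives 0 and a double subtraction gives -19, which matches the value shown in the report. Halving the offsets would only mask the problem.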
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I agree. I adjusted the script with the correct figure
UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18
But I think the adjustment is applied twice. I suppose I could try halving the values to 9.5 and 9. The top one works but I think the logic is applied twice
Weird, maybe I overlooked something, needless to say it should not adjust the value twice. Probably some looping issue.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@NugentS
To follow up on the issues you notified me of.

1) The SAS drive not being reported in the SMART section: Turns out the drive is SMART enabled, however the first line the script hits signifying this is "SMART support is: Available - device has SMART capability." vice what I'm used to seeing, which is in the second line: "SMART support is: Enabled". Unfortunately, based on the data dump, there are no standard format SMART data ID's to utilize. So while the drive does report via SMART, the format is significantly different than most drives we see here. I'm sure we could make a routine to handle this but I'm not sure it's worth the effort on my part.

2) UDMA duplicate errors for the SSD section: I looked into this and I did not find any obvious error. What is weird is none of your drives with UDMA CRC errors were zeroed out. In the UDMA_CRC_Error_Count Flags section you should have entered the following data:
UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18


3) The changing of the Device column to RED bothers me, so I added a new parameter called "deviceRedFlag". When set to "true" it will display the red indications; when set to false (or anything not true) it will not highlight that device column. It just makes the screen too red for me.

4) The read failures now look to be working for a standard formatted SMART message. I noticed the SAS drive also had read failures.

5) The two SSD's (sdp / sdq) that had about 500 days since last testing, that is true. Check the power on hours and then the last time a Short/Extended test was accomplished. The math is true.

6) I added what I could to populate the blank spaces for the SSD's (Wear Level/Reallocated Sectors/Reallocated Sector Events). If you find one that does not work yet the data is available via smartctl -a, then toss me a message and the dump of that drive so I can adjust for it.
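The math behind item 5 can be sketched as follows. The function name `last_test_age_days` is hypothetical; the sample numbers are the power-on hours (52096) and the most recent self-test LifeTime (52080) from the /dev/sde output earlier in the thread.

```shell
# Hypothetical helper: days since the last self-test, from power-on hours
# and the LifeTime (hours) value of the newest self-test log entry.
last_test_age_days() {
  echo $(( ($1 - $2) / 24 ))
}

power_on_hours=52096    # "Accumulated power on time" for /dev/sde
last_test_hours=52080   # LifeTime (hours) of self-test # 1 for /dev/sde

age_days=$(last_test_age_days "$power_on_hours" "$last_test_hours")
echo "Last test age: $age_days days"
```

Here the gap is 16 hours, so the age is 0 days. By the same arithmetic, a drive whose newest logged test is about 12,000 power-on hours old shows roughly 500 days.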

And I really appreciate the data you have been providing, it really helps.

Attached is the latest version for testing (1.4c-beta2).

I have been thinking about creating a script which reports the device name, power on hours, and a GOOD status if there are no issues with a specific drive, and if a drive has any issues then a full report would be generated for the suspect drive. This would reduce the crap I have to see. Needless to say this would be a switchable option in the script. But first I want to see if I can create simulated SSD's and SAS drives in a VM so I can test a bit better.
 

Attachments

  • 14c_beta2.txt
    56 KB · Views: 279

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
1. Lack of standards makes standard reporting an issue. I understand. I have a total of 3 SAS drives (1 in this server, 2 12TB spares just in case) so I suspect it's an edge case. I will say that the existence of this script and the other smartreport do indicate that IX are missing something, even for their corporate customers

2. This is my file
UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18
UCRC4SN="YOUR_SERIAL_NUMBER_HERE"
UCRC4CNT=1
UCRC5SN="YOUR_SERIAL_NUMBER_HERE"
UCRC5CNT=1
UCRC6SN="YOUR_SERIAL_NUMBER_HERE"
UCRC6CNT=1
UCRC7SN="YOUR_SERIAL_NUMBER_HERE"
UCRC7CNT=1
UCRC8SN="YOUR_SERIAL_NUMBER_HERE"
UCRC8CNT=1
So I have done it correctly

5. The two drives sdp & sdq are SMC SATADOMs. Running SMART on those is proving difficult. Or rather, getting them to admit that SMART has been run is difficult. I have just run a short test on them, which completed successfully, but it still reports 500+ days. I have attached a txt file, sdp.txt, that is a recent output after the SMART test completed successfully. It's annoying. I wondered if the LifeTime (hours) is time since last run (nope, as I ran it again within a few minutes and it says 36 hours). I think it's just bugged. I might raise a ticket with SMC for giggles

"I have been thinking about creating a script which reports the device name, power on hours, and a GOOD status if there are no issues with a specific drive, and if a drive has any issues then a full report would be generated for the suspect drive. This would reduce the crap I have to see. Needless to say this would be a switchable option in the script. But first I want to see if I can create simulated SSD's and SAS drives in a VM so I can test a bit better."

Oh God Yes - that's exactly what I am looking for. A sea of green (or no colour) with a touch of Red indicating that there is something I need to follow up on. That way you wouldn't have to worry about generating stats from irritating gear. I don't even want the full report, just a report like:

Report Header - Status Report for NAS in green or red. Red meaning something to follow up

Zpool Status (exactly as you currently do)
...
...
...
Disk Status - SATA
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...
Disk Status SAS
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...
Disk Status NVME
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...

--------------

Comments on latest Beta. Results attached
Ultra DMA CRC Errors on those two drives are still wrong
No SAS - but we knew that. Can I suggest listing a SAS drive section with a comment that individual SMART results are at the bottom

Most of the bad blocks have now vanished since I finished badblocking the "bad" drives.
sde is still going (because it appears to be slow and I stopped it by mistake) - but it's the only one left. I am going to create a new BadPool and thrash it for a while to see what happens.

Looking at smartctl -i /dev/sda (&sde) I see the following:
sde
Local Time is: Wed Feb 9 23:43:32 2022 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

sda
Model Family: Western Digital Green
Device Model: WDC WD30EZRX-22D8PB0
Serial Number: WD-WCC4N3XFP1F1
LU WWN Device Id: 5 0014ee 2b62cb898
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Feb 9 23:43:46 2022 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Both reports have SMART support is: Enabled as a second line
It would be nice to mention that the drive exists in the summary, even if no stats were available

Unfortunately there is a GUI bug that means I need to reboot before assigning any drives - and I can't till that one drive completes its badblocks run
 

Attachments

  • sdp.txt
    7.1 KB · Views: 212
  • Results Multireport V1.4c-beta2.pdf
    263.1 KB · Views: 183

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
That way you wouldn't have to worry about generating stats from irritating gear.
LOL, yes I still do, I have to be able to detect something is wrong still. Generating a report is the somewhat easy part, detecting the issues between so many devices when there isn't a solid standard is the hard part.

Sorry that the CRC errors are still there, I cannot reproduce those results on my system and I only have one SSD with a single CRC error, not double digits.

That SATA DOM is odd about recording the completion of a SMART test, unfortunately I can't fix that. Out of curiosity, what is the output of "smartctl -x" for this drive?

Both reports have SMART support is: Enabled as a second line
It would be nice to mention that the drive exists in the summary, even if no stats were available
So the SAS drive thing is a script limitation, maybe I could get around it, not sure. The issue is which fields the text resides in. I'm looking for "SMART support is:" and then the next field of data hopefully will be "Enabled", but when "Available" shows up it is not a qualifier. I could restructure the script but I'm trying to figure out a good way to do it. I'm not a great programmer, but I can get the job done in most cases.
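One possible way around the field-position problem is to test every "SMART support is:" line rather than only the first one. This is a rough sketch, not the script's actual logic: the `smart_info` variable is a stand-in holding the two lines that both sde and sda report in this thread, where a real check would read the output of `smartctl -i` for the device.

```shell
# The two "SMART support is:" lines reported by both sde and sda above.
smart_info='SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled'

# Match any "SMART support is:" line whose value starts with "Enabled",
# regardless of whether an "Available" line comes before it.
if printf '%s\n' "$smart_info" | grep -q '^SMART support is:[[:space:]]*Enabled'; then
  smart_enabled=true
else
  smart_enabled=false
fi

echo "SMART enabled: $smart_enabled"
```

Because `grep` scans all lines, the order of the "Available" and "Enabled" lines no longer matters.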

I will say that the existence of this script and the other smartreport do indicate that IX are missing something, even for their corporate customers
I have to agree, I often wonder why a highly customizable report is not part of the basic program. It could be due to so many manufacturers doing their own thing. I wouldn't want to be updating the code to add new configurations all the time either, but making a way for the end user to add new situations would be a good option.

I also thought I had fixed the wearlevel values for the SSD's. I will need to look again at the script, my eyes were getting tired about that time, I probably missed something stupidly simple.
 