multi_report.sh version for Core and Scale 3.0

NugentS · Feb 4, 2022

fdisk -l result is fdisk0l.txt attached
fdisk -l | grep "Disk /dev/sd"


Disk /dev/sdn: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdq: 14.75 GiB, 15837691904 bytes, 30932992 sectors
Disk /dev/sdp: 14.75 GiB, 15837691904 bytes, 30932992 sectors
Disk /dev/sdi: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdg: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdj: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdb: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sda: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sde: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdd: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdm: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdf: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdk: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdo: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdl: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk /dev/sdc: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors
Disk /dev/sdh: 2.73 TiB, 3000592982016 bytes, 5860533168 sectors

smartctl --scan


/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device

smartctl -a /dev/sde


root@scalenas[/mnt/ScratchSSD/SMB/Scale-Scripts]# smartctl -a /dev/sde
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.81+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST3000NM0023
Revision:             0006
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500625880ab
Serial number:        Z1Z5JDWE0000C507EFJZ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Feb  4 16:51:24 2022 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     25 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 52096:24
Manufactured in week 36 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  350
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2467
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 251129865
  Blocks received from initiator = 893628169
  Blocks read from cache and sent to initiator = 1215639293
  Number of read and write commands whose size <= segment size = 221084023
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 52096.40
  number of minutes until next internal SMART test = 36

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   254785013        2         0  254785015         92      35312.951          90
write:         0        0         0         0          0      24655.735           0
verify:     4981        0         0      4981          0          0.000           0

Non-medium error count:        6


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       -   52080               531 [0x3 0x11 0x0]
# 2  Background short  Failed in segment -->       -   52079           1550955 [0x3 0x11 0x0]
# 3  Background long   Failed in segment -->       -   52078               531 [0x3 0x11 0x0]
# 4  Background long   Completed                   -   52067                 - [-   -    -]
# 5  Background short  Completed                   -   52062                 - [-   -    -]
# 6  Background short  Completed                   -   52035                 - [-   -    -]
# 7  Background short  Completed                   -   52010                 - [-   -    -]
# 8  Background short  Completed                   -   51986                 - [-   -    -]
# 9  Background short  Completed                   -   51962                 - [-   -    -]
#10  Background short  Completed                   -   51938                 - [-   -    -]
#11  Background short  Completed                   -   51914                 - [-   -    -]
#12  Background short  Completed                   -   51890                 - [-   -    -]
#13  Background long   Completed                   -   51870                 - [-   -    -]
#14  Background short  Completed                   -   51842                 - [-   -    -]
#15  Background short  Completed                   -   51818                 - [-   -    -]
#16  Background short  Completed                   -   51794                 - [-   -    -]
#17  Background short  Completed                   -   51770                 - [-   -    -]
#18  Background short  Completed                   -   51746                 - [-   -    -]
#19  Background short  Completed                   -   51722                 - [-   -    -]
#20  Background short  Completed                   -   51698                 - [-   -    -]

Long (extended) Self-test duration: 26000 seconds [433.3 minutes]

I'll check and see if I have a SAS drive in my core box. I think I will need to install one

NugentS · Feb 4, 2022

No SAS in Core I am afraid. I'll plug one in when I get home on Sunday - I have a couple of spares hanging around

joeschmuck · Feb 4, 2022

NugentS said:
No SAS in Core I am afraid. I'll plug one in when I get home on Sunday - I have a couple of spares hanging around

Don't worry about it right now; let's fix up Scale first, then we can test on Core.

Here is a sorting modification you can easily make on your own:

There are six lines in the section starting around line 159 or so. Look for the following: done | awk '{for (i=NF; i!=0 ; i--) print $i }' )

Change it to: done | awk '{for (i=NF; i!=0 ; i--) print $i }' | tr ' ' '\n' | sort | tr '\n' ' ')

Now look for five more places to change it. Note that you are removing the right parenthesis, adding code, and then adding the right parenthesis again. It is best to copy and paste so that spacing and punctuation are preserved.

This works in FreeBSD. I have not tested it in Linux yet, but the command looked the same, so maybe it will work. This will sort the drive order alphabetically.
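As a rough illustration of what the added pipeline does, here is a minimal sketch that can be run on its own. The function name sort_drives and the sample drive list are assumptions made for the example; they are not taken from multi_report.sh.

```shell
#!/bin/sh
# Hypothetical helper (not part of multi_report.sh): sort a space-separated
# list of drive names alphabetically using the same pipeline as above.
sort_drives() {
    # awk prints the fields one per line in reverse order, tr makes sure any
    # remaining spaces become newlines, sort orders the names, and the final
    # tr joins them back into a single space-separated line.
    printf '%s\n' "$1" \
        | awk '{for (i=NF; i!=0 ; i--) print $i }' \
        | tr ' ' '\n' | sort | tr '\n' ' '
}

# Sample input in the unsorted order a scan might return.
sort_drives "sdn sdq sdp sdi sda"
echo
```

The result ends with a trailing space because the last newline is also turned into a space. Note that a plain sort is lexical, so on a system with more than 26 drives a name like sdaa would sort directly after sda rather than after sdz.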

I will look at the data you sent me later tonight or tomorrow. I have to go out of town on a business trip Sunday for the week so working this issue won't happen until I get back home.

danb35 · Feb 4, 2022

joeschmuck said:
Also, while you are at it, could you test this SAS drive issue on TrueNAS Core as well?

I have a few SAS disks on CORE, but I'm away from home right now. If you can ping me in a couple of days I can check.

joeschmuck · Feb 4, 2022

I checked the code changes for the sorting on both Core and Scale, and they seem to work. Tomorrow I will work on the read failure message, but since I don't presently have a failed drive, I will have to do some crude fake testing. If it passes, I'll post an update with the two fixes (alphabetical sorting & read failure) and call it 1.4c-beta until I can do the third fix as well and then ensure it all passes the SAS drive listing. Still on the fence about changing the overall report summary. Thinking about making it a more customizable setup so users can select what they desire. It could be a large rewrite.

danb35 said:
I have a few SAS disks on CORE, but I'm away from home right now. If you can ping me in a couple of days I can check.

That would be helpful. I will be away from home until next Saturday, at Cape Canaveral, Florida. It's a work trip, unfortunately.

joeschmuck · Feb 5, 2022

So here is a v1.4c_beta1 version which now sorts the drive letters and will report most (not all) SMART Extended/Short test failures by looking for the result "Completed:", because the next thing is generally "read failure". This will need to be modified to address other error results, of course. This version does not address the SAS drive issue, as I'm still unsure why the drive didn't show up. The Device column will now turn red for any failure on a given drive, to complement the Subject Title. Also, there is some Linux-variant coding which makes a zero drive temp problematic. I have not tested this completely; honestly, my eyes are bugging out on me right now. Time to pack for a business trip. The script is also getting a bit complex and thus complicated. Adding too many features is a sure sign of trouble. I may need to redesign it to make it easier to modify. Let's see what happens when Scale is released later this month; we may just come up with two separate versions vice a combined version. That would make it more manageable.
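The detection idea described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in v1.4c_beta1: the function name count_failed_tests is made up, and the "Failed in segment" pattern is added here only because that is what the SAS drive log posted earlier in the thread shows.

```shell
#!/bin/sh
# Hypothetical helper (not the script's actual code): count self-test log
# lines that look like failures. Reads self-test log text on stdin.
count_failed_tests() {
    # "Completed:" with a colon is followed by a failure reason such as
    # "read failure". The SAS log in this thread uses "Failed in segment".
    # A bare "Completed" (no colon) is treated as a pass.
    grep -c -E 'Completed: |Failed in segment'
}

# Sample lines copied from the SAS drive (sde) self-test log in this thread.
sas_log='# 1  Background long   Failed in segment -->       -   52080               531 [0x3 0x11 0x0]
# 2  Background short  Failed in segment -->       -   52079           1550955 [0x3 0x11 0x0]
# 3  Background long   Failed in segment -->       -   52078               531 [0x3 0x11 0x0]
# 4  Background long   Completed                   -   52067                 - [-   -    -]'

printf '%s\n' "$sas_log" | count_failed_tests
```

Matching on fixed strings like this is fragile, which is the point made in the post: any drive that words its results differently needs another pattern.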

NugentS · Feb 5, 2022

joeschmuck said:
So here is a v1.4c_beta1 version which now sorts the drive letters and will report most (not all) SMART Extended/Short test failures by looking for the result "Completed:", because the next thing is generally "read failure". This will need to be modified to address other error results, of course. This version does not address the SAS drive issue, as I'm still unsure why the drive didn't show up. The Device column will now turn red for any failure on a given drive, to complement the Subject Title. Also, there is some Linux-variant coding which makes a zero drive temp problematic. I have not tested this completely; honestly, my eyes are bugging out on me right now. Time to pack for a business trip. The script is also getting a bit complex and thus complicated. Adding too many features is a sure sign of trouble. I may need to redesign it to make it easier to modify. Let's see what happens when Scale is released later this month; we may just come up with two separate versions vice a combined version. That would make it more manageable.

For my scratchnas test bench:
Two HDD's are now flagged in red as SMART fails.
One HDD is flagged as a red fail, possibly because it's showing a single UltraDMA CRC error (and has for a long time).
The SSD's are all flagged as red, but I don't know why. I don't think there is anything wrong with these drives.

And lastly, the SAS drive has appeared at the end of the list. It doesn't appear in the summary, but /dev/sde does appear at the end.

The whole thing is below

joeschmuck · Feb 5, 2022

So to answer your results:
Drive WD-WMC4N2578099 is likely due to the CRC errors, you can null those out in the script, follow the instructions built in.
The SSD's are likely showing red due to the missing data. The good thing is I should be able to fix most if not all of these with the data you provided. However, I really do not feel like maintaining the script to address every SSD quirk and would rather make it fairly simple for the end user to make minor changes to it; some of this is already built into the script. When I return from my trip I will toss out something else to test, but also feel free to muck with it yourself. The worst you could do is make a mess and have to replace the script and start over again. If you do make a change that works, please post it so others may use it.

NugentS · Feb 6, 2022

I have nulled out the UltraDMA errors. This works for the HDD's but not for the SSD's, which display as in the screenshot below.

SSD's are definitely an issue in that they report different stats for different drive types / manufacturers.
sdp & sdq are both SMC SATADOMs, and their Last Test Age is rubbish (but accurate for the other 4).
And wear level is reported by the SATADOMs but not (as such) by the others.

Others may agree / disagree - but I think you are trying to include too much info. [Mind you I would like to see wear level on SSD's - which clearly isn't easy]. What do others think?

My scripting isn't good enough to work out some of the stuff you are doing, I am afraid.

dak180 · Feb 6, 2022

TooMuchData said:
Thanks dak180. I wish I had seen your version previously. I searched, but yours is not called "mult_report", the only title I've known.
I ran yours on Core and got error "Please specify a config file location." I gather more is needed than just changing email. I'll leave it to you to resolve.

I updated the Getting Started section a while back; did you ever get back to trying it again?

NugentS · Feb 6, 2022

@dak180
On Scale (if it's meant to be running on Scale)
glabel is missing, please install

dak180 · Feb 6, 2022

NugentS said:
On Scale (if it's meant to be running on Scale)
glabel is missing, please install

I would not mind getting it working on Scale, but I do not have a way to test currently. So if anyone is interested in helping with that, please open an issue or pull request on my GitHub repo.

NugentS · Feb 6, 2022

@dak180
Issue raised - I have a sacrificial test system. It's running badblocks at the moment so I can't reboot, but I can do just about anything else.

joeschmuck · Feb 6, 2022

NugentS said:
Others may agree / disagree - but I think you are trying to include too much info. [Mind you I would like to see wear level on SSD's - which clearly isn't easy]. What do others think?

I think it depends on what people desire and, to be honest, when it comes to SSD's, people who really want this stuff are going to need to learn a little bit about editing the script. Wear Level is a challenge too, as it is not obvious how to normalize the various manufacturer values. I'd like to see it run from 100 and decrease to 0. My Samsung SSD's do this and I like it, but I have other SSD's that do not conform.

If there is a consensus of the minimum report required, I could work that. The tricky part for people editing a script is changing the number of columns used, while it is easy (for me) it will be difficult for people unfamiliar with it and unwilling to experiment.

My opinion is I like the columns for the Zpool Summary and the Hard Drive Summary, but when it comes to the SSD's, I need some ideas. Right now you can see what works, what doesn't work, but what is important to you (I mean anyone reading this)?

I have been thinking that maybe (just maybe) I could write a program (A.K.A. application) that runs on Linux (TrueNAS Scale) and make it user configurable. I have no idea what programming language to use for this but I'd like to link into the TrueNAS API or maybe the program could create a custom executable, heck I don't know, but I know working with a very limited scripting language has serious drawbacks. I'm willing to make a better gadget, just need an idea how to. And I can program, but I know I'm not great at it.

Well, I made it to Cape Canaveral, Florida. It's not hot and sunny.

joeschmuck · Feb 6, 2022

@NugentS
Why are your two SSD's with -18 and -19 crcErrors listed in the negative? Did you apply the wrong offset? That doesn't make sense to me, they should be a zero value and yellow.

NugentS · Feb 7, 2022

I agree. I adjusted the script with the correct figure


UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18

But I think the adjustment is applied twice. I suppose I could try halving the values to 9.5 and 9. The top one works but I think the logic is applied twice

joeschmuck · Feb 7, 2022

NugentS said:
I agree. I adjusted the script with the correct figure
UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18
But I think the adjustment is applied twice. I suppose I could try halving the values to 9.5 and 9. The top one works but I think the logic is applied twice

Weird, maybe I overlooked something. Needless to say, it should not adjust the value twice. Probably some looping issue.
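For reference, the intended behaviour can be sketched like this. It is a minimal illustration, assuming the offset is simply subtracted from the raw count; the function name adjust_crc is hypothetical and this is not how multi_report.sh is actually written. If a raw count of 19 had the offset 19 subtracted twice, the result would be 19 - 19 - 19 = -19, which matches the -19 and -18 values reported above.

```shell
#!/bin/sh
# Hypothetical helper, not the script's actual code: subtract the stored
# offset from the raw UDMA CRC error count exactly once, never below zero.
adjust_crc() {
    raw=$1
    offset=$2
    adjusted=$((raw - offset))
    if [ "$adjusted" -lt 0 ]; then
        adjusted=0
    fi
    echo "$adjusted"
}

UCRC2CNT=19                  # offset value from the configuration posted above
adjust_crc 19 "$UCRC2CNT"    # raw count equals the offset
adjust_crc 21 "$UCRC2CNT"    # two new errors since the offset was recorded
```

Shell arithmetic of this kind is integer-only, so halving an offset to 9.5 would not be a valid value here.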

joeschmuck · Feb 9, 2022

@NugentS
To follow up on the issues you notified me of.

1) The SAS drive not being reported in the SMART section: It turns out the drive is SMART enabled; however, the first line the script hits signifying this is "SMART support is: Available - device has SMART capability." vice what I'm used to seeing, which is in the second line: "SMART support is: Enabled". Unfortunately, based on the data dump, there are no standard-format SMART data IDs to utilize. So while the drive does report via SMART, the format is significantly different from most drives we see here. I'm sure we could make a routine to handle this, but I'm not sure it's worth the effort on my part.

2) UDMA duplicate errors for the SSD section: I looked into this and I did not find any obvious error. What is weird is none of your drives with UDMA CRC errors were zeroed out. In the UDMA_CRC_Error_Count Flags section you should have entered the following data:

UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18

3) The changing of the Device column to RED bothers me, so I added a new parameter called "deviceRedFlag". When set to "true" it will display the red indications; when set to false (or anything not true) it will not highlight the Device column. It just makes the screen too red for me.

4) The read failures now look to be working for a standard formatted SMART message. I noticed the SAS drive also had read failures.

5) The two SSD's (sdp / sdq) that had about 500 days since last testing, that is true. Check the power on hours and then the last time a Short/Extended test was accomplished. The math is true.

6) I added what I could to populate the blank spaces for the SSD's (Wear Level/Reallocated Sectors/Reallocated Sector Events). If you find one that does not work, yet the data is available via smartctl -a, then toss me a message and the dump of that drive so I can adjust for it.
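The "Last Test Age" arithmetic mentioned in item 5 can be sketched as follows. This is an assumption about the calculation (power-on hours minus the LifeTime hours of the newest self-test entry, converted to whole days), and the function name last_test_age_days is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: whole days since the last self-test, computed from
# the drive's power-on hours and the LifeTime (hours) of the newest log entry.
last_test_age_days() {
    power_on_hours=$1
    last_test_hours=$2
    echo $(( (power_on_hours - last_test_hours) / 24 ))
}

# Values from the SAS drive (sde) posted earlier in the thread:
# 52096 power-on hours, newest self-test logged at 52080 hours.
last_test_age_days 52096 52080
```

With those numbers the difference is 16 hours, so the age rounds down to 0 days. A drive whose newest logged test sits roughly 12000 hours behind its power-on hours would show about 500 days, which is why a drive that does not record new tests keeps reporting a large age.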

And I really appreciate the data you have been providing, it really helps.

Attached is the latest version for testing (1.4c-beta2).

I have been thinking about creating a script which reports the device name, power on hours, and a GOOD status if there are no issues with a specific drive, and if a drive has any issues then a full report would be generated for the suspect drive. This would reduce the crap I have to see. Needless to say this would be a switchable option in the script. But first I want to see if I can create simulated SSD's and SAS drives in a VM so I can test a bit better.

NugentS · Feb 9, 2022

1. Lack of standards makes standard reporting an issue. I understand. I have a total of 3 SAS drives (1 in this server, 2 12TB spares just in case), so I suspect it's an edge case. I will say that the existence of this script and the other smartreport do indicate that IX are missing something, even for their corporate customers.

2. This is my file


UCRC1SN="WD-WMC4N2578099"
UCRC1CNT=1
UCRC2SN="2011E2940536"
UCRC2CNT=19
UCRC3SN="2011E2940534"
UCRC3CNT=18
UCRC4SN="YOUR_SERIAL_NUMBER_HERE"
UCRC4CNT=1
UCRC5SN="YOUR_SERIAL_NUMBER_HERE"
UCRC5CNT=1
UCRC6SN="YOUR_SERIAL_NUMBER_HERE"
UCRC6CNT=1
UCRC7SN="YOUR_SERIAL_NUMBER_HERE"
UCRC7CNT=1
UCRC8SN="YOUR_SERIAL_NUMBER_HERE"
UCRC8CNT=1

So I have done it correctly

5. The two drives sdp & sdq are SMC SATADOMs. Running SMART on those is proving difficult. Or rather, getting them to admit that SMART has been run is difficult. I have just run a short test on them; it completed successfully, but the report still shows 500+ days. I have attached a txt file, sdp.txt, that is a recent output after the SMART test completed successfully. It's annoying. I wondered if the LifeTime (hours) is time since last run (nope, as I ran it again within a few minutes and it says 36 hours). I think it's just bugged. I might raise a ticket with SMC for giggles.

"I have been thinking about creating a script which reports the device name, power on hours, and a GOOD status if there are no issues with a specific drive, and if a drive has any issues then a full report would be generated for the suspect drive. This would reduce the crap I have to see. Needless to say this would be a switchable option in the script. But first I want to see if I can create simulated SSD's and SAS drives in a VM so I can test a bit better."

Oh God Yes - that's exactly what I am looking for. A sea of green (or no colour) with a touch of Red indicating that there is something I need to follow up on. That way you wouldn't have to worry about generating stats from irritating gear. I don't even want the full report, just a report like:

Report Header - Status Report for NAS in green or red. Red meaning something to follow up

Zpool Status (exactly as you currently do)
...
...
...
Disk Status - SATA
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...
Disk Status SAS
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...
Disk Status NVME
Dev | Serial Number | Time since last SMART | Status (red or Green)
...
...
...

--------------

Comments on latest Beta. Results attached.
Ultra DMA CRC Errors on those two drives are still wrong.
No SAS, but we knew that. Can I suggest listing a SAS drive section with a comment that individual SMART results are at the bottom?

Most of the bad blocks have now vanished since I finished badblocking the "bad" drives.
sde is still going (because it appears to be slow and I stopped it by mistake), but it's the only one left. I am going to create a new BadPool and trash it for a while to see what happens.

Looking at smartctl -i /dev/sda (&sde) I see the following:
sde


Local Time is:        Wed Feb  9 23:43:32 2022 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

sda


Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-22D8PB0
Serial Number:    WD-WCC4N3XFP1F1
LU WWN Device Id: 5 0014ee 2b62cb898
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Feb  9 23:43:46 2022 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Both reports have SMART support is: Enabled as a second line
It would be nice to mention that the drive exists in the summary, even if no stats were available

Unfortunately there is a GUI bug that means I need to reboot before assigning any drives - and I can't till that one drive completes its badblocks run

joeschmuck · Feb 9, 2022

NugentS said:
That way you wouldn't have to worry about generating stats from irritating gear.

LOL, yes I still do, I have to be able to detect something is wrong still. Generating a report is the somewhat easy part, detecting the issues between so many devices when there isn't a solid standard is the hard part.

Sorry that the CRC errors are still there, I cannot reproduce those results on my system and I only have one SSD with a single CRC error, not double digits.

That SATA DOM is odd about recording the completion of a SMART test, unfortunately I can't fix that. Out of curiosity, what is the output of "smartctl -x" for this drive?

NugentS said:
Both reports have SMART support is: Enabled as a second line
It would be nice to mention that the drive exists in the summary, even if no stats were available

So the SAS drive thing is a script limitation; maybe I could get around it, not sure. The issue is the fields in which the text resides. I'm looking for "SMART support is:" and then the next field of data will hopefully be "Enabled", but when "Available" shows up it is not a qualifier. I could restructure the script, but I'm trying to figure out a good way to do it. I'm not a great programmer, but I can get the job done in most cases.
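One possible way around the field-position problem is to check every "SMART support is:" line rather than only the first one. The following is a minimal sketch under that assumption; the function name smart_enabled is hypothetical and this is not the script's current logic.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any "SMART support is:" line reports
# Enabled, regardless of which line comes first. Reads text on stdin.
smart_enabled() {
    grep -q -E '^SMART support is:[[:space:]]+Enabled'
}

# Sample lines copied from the sde output posted earlier in the thread.
sde_info='Local Time is:        Wed Feb  9 23:43:32 2022 GMT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled'

if printf '%s\n' "$sde_info" | smart_enabled; then
    echo "SMART enabled"
else
    echo "SMART not enabled"
fi
```

Because the pattern is anchored to the label and the word Enabled, the "Available" line is simply ignored instead of being read as the answer. The "Temperature Warning:  Enabled" line does not match either.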

NugentS said:
I will say that the existence of this script and the other smartreport do indicate that IX are missing something, even for their corporate customers.

I have to agree, I often wonder why a highly customizable report is not part of the basic program. It could be due to so many manufacturers doing their own thing. I wouldn't want to be updating the code to add new configurations all the time either, but making a way for the end user to add new situations would be a good option.

I also thought I had fixed the wearlevel values for the SSD's. I will need to look again at the script, my eyes were getting tired about that time, I probably missed something stupidly simple.
