Should I trust smartctl or Scrutiny when it comes to S.M.A.R.T monitoring?

scotrod

Dabbler
Joined
Apr 30, 2021
Messages
42
Hi. I am trying to understand whether I should trust the smartctl tool in TrueNAS Scale or Scrutiny (https://github.com/AnalogJ/scrutiny) for S.M.A.R.T. results.

I recently replaced a failing boot SSD with one of my old SSDs. I ran both SHORT and LONG S.M.A.R.T. tests on the new boot disk using smartctl, and the TrueNAS Scale GUI reported that everything is fine. Running `smartctl -a /dev/sde` gives the following output:


root@truenas[~]# smartctl -a /dev/sdf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Intel 540 Series SSDs
Device Model: INTEL SSDSC2KW120H6
Serial Number: CVLT62000C7C120GGN
LU WWN Device Id: 5 5cd2e4 14cbfa88c
Firmware Version: LSBG200
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Apr 27 19:24:08 2023 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  15) minutes.
SCT capabilities:              (0x0039) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       373
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       870h+00m+00.000s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1312
170 Available_Reservd_Space 0x0033   098   098   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   010    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   010    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       34
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       51415
190 Airflow_Temperature_Cel 0x0032   038   051   000    Old_age   Always       -       38 (Min/Max 26/51)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       34
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       383579
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       0
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   059   059   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   081   081   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       383579
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       347127
249 NAND_Writes_1GiB        0x0032   100   100   000    Old_age   Always       -       7373
252 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       80

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 70403103932424 70403103932424 Not_testing
2 70403103932424 70403103932424 Not_testing
3 70403103932424 70403103932424 Not_testing
4 70403103932424 70403103932424 Not_testing
5 70403103932424 70403103932424 Not_testing
Selective self-test flags (0x4008):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@truenas[~]#

For the same disk in Scrutiny, however, I get:
[Attached screenshot: 1682612334543.png]


First of all, this disk should have far more than 36 days of uptime: it was the boot drive in my old computer for years. There is no way that figure is accurate. So either the disk is going bad and its S.M.A.R.T. data is unreliable, or one of these tools is not reading it correctly.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
scotrod said:
one of these tools is not reading it correctly.
Why do you say "one of these tools"? They're both reporting the same thing--Scrutiny is saying 36 days of power-on time; smartctl is saying 870 hours--which is 36 days. The output of those two tools is identical, and both show a disk that's not only "going" bad, but has long since gone bad.
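To make that comparison concrete, here is a minimal Python sketch (not part of smartctl or Scrutiny) that converts the raw power-on value to days and prints the raw counts of a few error-related attributes. The sample lines are copied from the output posted above; the helper names and the `WATCH` list are illustrative assumptions, not an official list of failure indicators.

```python
# Sample attribute lines copied from the smartctl output posted above.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 373
9 Power_On_Hours_and_Msec 0x0032 100 100 000 Old_age Always - 870h+00m+00.000s
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 51415
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Map attribute ID -> (name, raw value string) for each table row."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields; RAW_VALUE may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def power_on_days(raw):
    """Convert a raw value such as '870h+00m+00.000s' or '870' to days."""
    hours = int(raw.split("h")[0])
    return hours / 24


attrs = parse_attributes(SAMPLE)
days = power_on_days(attrs[9][1])
print(f"Power-on time: {days:.2f} days")

# Hypothetical watch list: attributes whose non-zero raw count deserves a look.
WATCH = (5, 187, 199)
for attr_id in WATCH:
    name, raw = attrs[attr_id]
    if int(raw) > 0:
        print(f"{name}: raw value {raw}")
```

With the values from this drive, 870 hours works out to 36.25 days, which matches Scrutiny's 36 days, and attributes 5 and 187 both show non-zero raw counts.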
 

scotrod

danb35 said:
Why do you say "one of these tools"? They're both reporting the same thing--Scrutiny is saying 36 days of power-on time; smartctl is saying 870 hours--which is 36 days. The output of those two tools is identical, and both show a disk that's not only "going" bad, but has long since gone bad.
Pardon me, that's true, yes. The thing is... if that drive is going bad... why is TrueNAS saying SUCCESS after every single S.M.A.R.T test? There are absolutely no indications in the GUI that there is something wrong with the disk.
 

danb35

scotrod said:
why is TrueNAS saying SUCCESS after every single S.M.A.R.T test?
What SMART test? None have been performed on that device during its entire lifetime.
 

scotrod

danb35 said:
What SMART test? None have been performed on that device during its entire lifetime.
[Attached screenshot: 1682622571683.png]
I managed to capture this before rebooting the server. After rebooting, I have **nothing** in the SMART test results menu. I just ran another SHORT test, and it was successful. Also... pardon me if this is a dumb question, but isn't the `smartctl -a` command pulling the output from the latest SMART test on that drive?
 

danb35

scotrod said:
I managed to capture this before rebooting the server
That was for /dev/sde; the SMART data you posted above is for /dev/sdf. That doesn't mean it's necessarily a different device (nothing on that screen identifies the device by serial number, for example), but it's suspicious.
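Because names like `/dev/sde` and `/dev/sdf` can change between boots, the serial number is the safer way to tell which physical disk a report belongs to. The following is a minimal Python sketch that pulls the serial number out of smartctl's information section; the function names are hypothetical, and the sample text is copied from the output posted earlier in the thread.

```python
import re
import subprocess

# Lines copied from the information section posted earlier in the thread.
INFO_SAMPLE = """\
Device Model: INTEL SSDSC2KW120H6
Serial Number: CVLT62000C7C120GGN
Firmware Version: LSBG200
"""


def serial_number(info_text):
    """Return the serial number from smartctl's information section, or None."""
    match = re.search(r"^Serial Number:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def read_info(device):
    """Run `smartctl -i` on a device (needs root and smartmontools installed)."""
    # No check=True: smartctl's exit status is a bitmask, not a plain pass/fail.
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    )
    return result.stdout


print(serial_number(INFO_SAMPLE))
```

On a live system, calling `serial_number(read_info("/dev/sde"))` and `serial_number(read_info("/dev/sdf"))` and comparing each result with the serial shown in the GUI or in Scrutiny would confirm which device each report describes.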
scotrod said:
isn't the smartctl -a command pulling the output from the latest SMART test on that drive?
smartctl -a pulls the current SMART data. The self-test results are included in there (which is how I concluded that none have been run on that device), but it doesn't initiate a SMART self-test.
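The distinction between reading the data and starting a test can be sketched in Python. This is only an illustration under stated assumptions: `smartctl -t short` and `smartctl -l selftest` are real smartctl options, but the helper names are hypothetical, and the log check is a simple heuristic based on the exact message that appears in the output posted above.

```python
# Self-test log section copied from the smartctl output posted above.
SELFTEST_SAMPLE = """\
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
"""


def has_logged_self_tests(smartctl_output):
    """Heuristic: False when smartctl reports an empty self-test log."""
    return "No self-tests have been logged" not in smartctl_output


def self_test_commands(device, kind="short"):
    """Return (command to start a self-test, command to read the self-test log)."""
    start = ["smartctl", "-t", kind, device]
    read_log = ["smartctl", "-l", "selftest", device]
    return start, read_log


print(has_logged_self_tests(SELFTEST_SAMPLE))
start, read_log = self_test_commands("/dev/sdf")
print(" ".join(start))
print(" ".join(read_log))
```

The first command only starts the test and returns immediately; the result shows up in the self-test log once the drive has finished, so the log has to be read again after the recommended polling time has passed.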
 

scotrod

danb35 said:
That was for /dev/sde; the SMART data you posted above is for /dev/sdf. That doesn't mean it's necessarily a different device (nothing on that screen identifies the device by serial number, for example), but it's suspicious.

danb35 said:
smartctl -a pulls the current SMART data. The self-test results are included in there (which is how I concluded that none have been run on that device), but it doesn't initiate a SMART self-test.
That's correct: TrueNAS changed the device name of my boot drive from sde to sdf after the reboot, for reasons unknown to me. What can I do to provide actual SMART test results? Run a test from the TrueNAS GUI, and then what? There are no logs in the GUI that I can pull and post here.

A few minutes ago I found that after putting a SATA (not NVMe) SSD in the M.2 slot, I suddenly lose another drive. After removing this particular SSD, I see the drive again:

Critical
Pool Vault state is DEGRADED: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:
Disk ST3000DM008-2DM166 Z504YCW7 is UNAVAIL

Since my motherboard has another M.2 slot, I tried it as well. Moving the SSD there made TWO of my other HDDs go offline for some reason... I'll start another topic about this; I just wanted to mention it here.
 