Puzzled over SMART result

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
So I have this quite old drive in my array, almost 10 years old! (You've got to be in awe of the reliability of WD Red, really.)

supernas% sudo smartctl -a /dev/da0
Password:
Sorry, try again.
Password:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-WCC4E0281971
LU WWN Device Id: 5 0014ee 2091d4c34
Firmware Version: 80.00A80
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Sep 1 10:03:43 2023 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 117) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (53280) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 532) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 180 174 021 Pre-fail Always - 7983
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 190
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 001 001 000 Old_age Always - 84119
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 190
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 179
193 Load_Cycle_Count 0x0032 179 179 000 Old_age Always - 63114
194 Temperature_Celsius 0x0022 119 098 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 50% 18579 4181840
# 2 Short offline Completed: read failure 40% 18573 4181840
# 3 Short offline Completed: read failure 40% 18567 4181840
# 4 Short offline Completed: read failure 50% 18561 4181840
# 5 Short offline Completed: read failure 40% 18552 4181840
# 6 Short offline Completed: read failure 50% 18546 4181840
# 7 Short offline Completed: read failure 50% 18540 4181840
# 8 Short offline Completed: read failure 50% 18534 4181840
# 9 Short offline Completed: read failure 50% 18528 4181840
#10 Short offline Completed: read failure 50% 18522 4181840
#11 Extended offline Completed without error 00% 18519 -
#12 Short offline Completed without error 00% 18498 -
#13 Short offline Completed without error 00% 18492 -
#14 Short offline Completed without error 00% 18486 -
#15 Short offline Completed without error 00% 18480 -
#16 Short offline Completed without error 00% 18474 -
#17 Short offline Completed without error 00% 18468 -
#18 Short offline Completed without error 00% 18462 -
#19 Short offline Completed without error 00% 18456 -
#20 Short offline Completed without error 00% 18450 -
#21 Short offline Completed without error 00% 18444 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

From the various numbers it all looks good: no read errors, seek errors or reallocated sectors. It does, however, show several self-tests that could not complete because of a read failure. What puzzles me is that the log says these happened a while ago (around 18,500 hours), but I don't trust those hour values.
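One possible explanation for the odd hour values, offered as an assumption rather than anything confirmed in this thread: the self-test log may store LifeTime(hours) in a 16-bit field, so it would wrap around every 65536 hours, while attribute 9 (Power_On_Hours) keeps the full count. A small Python sketch of that arithmetic, using the numbers from the output above (the function names are made up for illustration):

```python
# Assumption: the self-test log's LifeTime(hours) is a 16-bit counter,
# so it wraps every 65536 hours. Attribute 9 is not truncated that way.
WRAP = 1 << 16  # 65536


def wrapped_hours(power_on_hours: int) -> int:
    """Value a 16-bit lifetime counter would show for a given power-on time."""
    return power_on_hours % WRAP


def hours_since_test(power_on_hours: int, logged_lifetime: int) -> int:
    """Hours elapsed since a logged test, allowing for one wrap-around."""
    return (power_on_hours - logged_lifetime) % WRAP


current = 84119      # Power_On_Hours raw value from the attribute table
latest_test = 18579  # LifeTime(hours) of self-test log entry #1

print(wrapped_hours(current))                  # 18583
print(hours_since_test(current, latest_test))  # 4
```

If that assumption holds, entry #1 ran about 4 hours before the smartctl output was taken, not tens of thousands of hours ago, which would also fit the email arriving now.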

But I got this email:
`* Device: /dev/da0 [SAT], Self-Test Log error count increased from 0 to 1.`
 

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
Actually, it may just be a buggy smartctl implementation.

I replaced one such drive last week after it failed (a real failure this time). The drive was under warranty and was replaced by WD.

And I get the same:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 205 205 021 Pre-fail Always - 2725
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 65
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 497
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 4
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 112 100 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 135 135 000 Old_age Always - 65
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 196 196 000 Old_age Offline - 3
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 40% 491 4232304
# 2 Short offline Completed: read failure 40% 485 4232304
# 3 Short offline Completed: read failure 20% 479 4232304
# 4 Short offline Completed: read failure 30% 473 4232304
# 5 Short offline Completed: read failure 30% 464 4232304
# 6 Short offline Completed: read failure 40% 458 4214144
# 7 Short offline Completed: read failure 50% 452 4191312
# 8 Short offline Completed without error 00% 446 -
 
Joined
Oct 22, 2019
Messages
3,641
If a short selftest fails with a read error, that's a sign the drive needs to be replaced. (Especially if you see consistent failures.) Why do you believe it's a buggy smartctl implementation?

As for the "replacement drive" also erroring with short selftests, that's within probability. It could be a refurbished drive they sent you.
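To see how consistent the failures are, the self-test log lines can be tallied. Here is a rough Python sketch, assuming the whitespace-separated layout shown in the smartctl output pasted earlier. `parse_selftest_line` and `summarize` are made-up helper names, not part of smartmontools, and the sample lines are copied from the second log in this thread.

```python
# Sample lines copied from the self-test log posted above.
LOG = """\
# 1 Short offline Completed: read failure 40% 491 4232304
# 2 Short offline Completed: read failure 40% 485 4232304
# 3 Short offline Completed: read failure 20% 479 4232304
# 4 Short offline Completed: read failure 30% 473 4232304
# 5 Short offline Completed: read failure 30% 464 4232304
# 6 Short offline Completed: read failure 40% 458 4214144
# 7 Short offline Completed: read failure 50% 452 4191312
# 8 Short offline Completed without error 00% 446 -
"""


def parse_selftest_line(line):
    """Return (lifetime_hours, failed, lba_or_None) for one log line."""
    parts = line.split()
    lba = parts[-1]        # last column: LBA_of_first_error, or "-"
    hours = int(parts[-2])  # second-to-last column: LifeTime(hours)
    failed = "read failure" in line
    return hours, failed, None if lba == "-" else int(lba)


def summarize(log):
    """Count read failures and list the distinct failing LBAs."""
    entries = [parse_selftest_line(l) for l in log.splitlines() if l.startswith("#")]
    failures = [e for e in entries if e[1]]
    lbas = sorted({e[2] for e in failures if e[2] is not None})
    return len(failures), lbas


count, lbas = summarize(LOG)
print(count, lbas)  # 7 [4191312, 4214144, 4232304]
```

Seven failures in a row across three different LBAs is the kind of pattern meant by "consistent failures" here.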
 

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
Why do you believe it's a buggy smartctl implementation?
Because now I have 3 drives doing it.
This old one, a brand new WD40EFPX (Red 4TB with 256MB cache) I got last week, and the WD40EFZX (Red 4TB with 128MB cache) that WD sent me last week.

The probability that two new drives (out of 2) give SMART errors, yet have never shown a single error in use, is pretty low, don't you think?

Mind you, I only realised last week that the daily cron job for the short test was disabled, and turned it on. So I had never paid attention to this before.
 
Joined
Oct 22, 2019
Messages
3,641
One drive, which you said suffered a "real failure this time" was replaced by a possibly refurbished drive from Western Digital. This replacement (refurbished) drive begins to show errors early on. (Maybe because it's refurbished and/or is a victim of "infant mortality"?)

That's 2 drives with errors.

The other drive is ten years old. It begins to fail selftests.

That's 3 total drives with errors.

It's within the realm of possibility, and not so far-fetched.

If you really want to rule out a "buggy smartctl", you can test them on a different computer or with a different OS, such as Ubuntu Linux or Windows.


The probability that two new drives (out of 2) give SMART errors, yet have never shown a single error in use, is pretty low, don't you think?
With ZFS and HDD selftests, you can have errors on one without the other.

A drive can fail its own internal selftests, yet ZFS does not report any issues, because no data is stored on those sectors.

Conversely, ZFS can report errors when checksums don't match, while the drive's selftest reports no errors.
 

jyavenard

Patron
Joined
Oct 16, 2013
Messages
361
I have a brand-new spare. I'll test it too. We'll see.
 