S.M.A.R.T Short test failed

T.J.Hammer

Dabbler
Joined
Sep 22, 2022
Messages
24
Hi,

I received an alert that a scheduled short test failed during the night.
I ran another short test manually and it also failed.
I have no idea why. My pool status also says unhealthy, but I did not receive any email or anything; there was just a red notification in the top right corner.
I checked the shell and this is what I get:
Warning: the supported mechanisms for making configuration changes
are the TrueNAS WebUI and API exclusively. ALL OTHERS ARE
NOT SUPPORTED AND WILL RESULT IN UNDEFINED BEHAVIOR AND MAY
RESULT IN SYSTEM FAILURE.

root@truenas[~]# smartctl -a /dev/ada1
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Black
Device Model: WDC WD4003FZEX-00Z4SA0
Serial Number: WD-WMC5D0D3CPHH
LU WWN Device Id: 5 0014ee 0596903e2
Firmware Version: 01.01A01
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Sep 29 17:54:54 2022 IDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (45000) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 487) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5
  3 Spin_Up_Time            0x0027   194   141   021    Pre-fail  Always       -       9258
  4 Start_Stop_Count        0x0032   075   075   000    Old_age   Always       -       25325
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   044   044   000    Old_age   Always       -       41594
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1176
 16 Unknown_Attribute       0x0022   018   182   000    Old_age   Always       -       332426432413
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       394
193 Load_Cycle_Count        0x0032   192   192   000    Old_age   Always       -       24930
194 Temperature_Celsius     0x0022   113   097   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     41594         21273032
# 2  Short offline       Completed: read failure       10%     41577         21273032
# 3  Short offline       Completed without error       00%     41554         -
# 4  Extended offline    Completed without error       00%     41531         -
# 5  Short offline       Completed without error       00%     41522         -
# 6  Short offline       Completed without error       00%     41521         -
# 7  Short offline       Completed without error       00%     41505         -
# 8  Short offline       Completed without error       00%     41481         -
# 9  Short offline       Completed without error       00%     41457         -
#10  Short offline       Completed: read failure       10%     41433         94378000
#11  Short offline       Completed without error       00%      6761         -
#12  Short offline       Completed without error       00%      5485         -
1 of 3 failed self-tests are outdated by newer successful extended offline self-test # 4

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
This is the unhealthy message (screenshot 1664463848133.png):


My hardware:
i7-9700
Z390 Aorus Extreme
32 GB DDR4-2666
750 W Gold Antec HCG power supply
6x 4 TB WD HDDs (4 Red, 2 Black)
The TrueNAS version is 13.0-U2.

I have no idea what is wrong or how to fix it. Can anyone help? I'm very new to this and just wanted to create storage for my Plex server and all the family photos.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
What matters here is the Pool>Status screen or the output of zpool status.
In any case, a drive that fails SMART tests should be replaced (RMA if applicable).

WD Black are not optimal for NAS duty, and let's hope these WD Red are not SMR.
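
To make the point about zpool status concrete, here is a minimal Python sketch that scans `zpool status` text for devices whose READ, WRITE, or CKSUM counters are nonzero. The sample text, the pool name `tank`, the device names, and the function name `devices_with_errors` are all hypothetical, and the regular expression is only an assumption about the usual column layout, not a complete parser.

```python
import re

# Hypothetical sample shaped like `zpool status` output (names and counts are made up).
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            disk0   ONLINE       0     0     0
            disk1   ONLINE      33     0     0
            disk2   ONLINE       0     0     0

errors: No known data errors
"""

# One config row: name, state, then the three error counters.
# Counters are kept as strings because zpool may abbreviate large values (e.g. "1.2K").
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)(?:\s.*)?$"
)

def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, *counts = match.groups()
        if any(count != "0" for count in counts):
            flagged[name] = tuple(counts)
    return flagged

print(devices_with_errors(SAMPLE))  # {'disk1': ('33', '0', '0')}
```

On a live system the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`. A nonzero counter is a reason to look at that drive's SMART data, not a verdict by itself.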
 

T.J.Hammer

Thank you for your answer.
I did not buy any of the drives or the other parts.
These are leftovers from old machines that are no longer in use, so I don't know if the Red ones are SMR or not.
RMA is not an option; I think these drives are a few years old now.

Here is the zpool status (sorry, I'm new and don't really know what to post to get help):
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@truenas[~]# zpool status
pool: Alkobi
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 00:00:03 with 0 errors on Thu Sep 22 11:16:08 2022
config:

NAME                                            STATE     READ WRITE CKSUM
Alkobi                                          ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/448c27c5-3a43-11ed-ab80-18c04d6d9038  ONLINE       0     0     0
    gptid/44b90497-3a43-11ed-ab80-18c04d6d9038  ONLINE      33     0     0
    gptid/44ee3275-3a43-11ed-ab80-18c04d6d9038  ONLINE       0     0     0
    gptid/44e5cbb3-3a43-11ed-ab80-18c04d6d9038  ONLINE       0     0     0
    gptid/44c965d6-3a43-11ed-ab80-18c04d6d9038  ONLINE       0     0     0
    gptid/44d2f088-3a43-11ed-ab80-18c04d6d9038  ONLINE       0     0     0

errors: No known data errors

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:01 with 0 errors on Wed Sep 28 03:45:01 2022
config:

NAME        STATE     READ WRITE CKSUM
boot-pool   ONLINE       0     0     0
  nvd0p2    ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#
 

T.J.Hammer

I just had a power outage and the system rebooted. Now it says that the pool is healthy, but I ran a short test on all disks to be sure.
The disk in question is still failing the test and I don't know why.
 

Etorix

Thanks. Good news is that you have a decent raidz2 layout and ZFS has repaired errors—for now.
I would run long SMART tests on the suspect drive, and on all others for good measure.

T.J.Hammer said: "These are leftovers from old machines that are no longer in use, so I don't know if the Red ones are SMR or not. RMA is not an option; I think these drives are a few years old now."
You can still write down the serial numbers (from Pool>Disks or SMART reports) and check coverage on WD website. The only risk is finding that some of these disks are still under warranty after all…
Take this opportunity to see if the Red are EFAX (SMR, bad) or EFRX (CMR, good).
 

Etorix

T.J.Hammer said: "The disk in question is still failing the test and I don't know why."
Whatever the cause, a drive that is failing SMART tests should be replaced and properly disposed of.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
T.J.Hammer said: "The disk in question is still failing the test and I don't know why."
Probably because the disk is dying--what else do you think those tests are supposed to tell you? Try running a long test and see if it comes up with a different result. If not, it's time to replace the disk.
 

T.J.Hammer

Thank you so much for your help!
I tried to check the warranty, but only the 2 Black ones show up, and they have been out of warranty for almost 2 years now.

I can replace all the disks if needed because these are very old disks (at least 4 years of service).
Which disk is best for storing media?
 

Etorix

T.J.Hammer said: "Which disk is best for storing media?"
Any CMR drive, preferably from a NAS or enterprise line: WD Red Plus/Pro or Gold, Seagate IronWolf (Pro) or Exos, Toshiba N300 or MG. Pick the best price per TB.

You may use drives larger than 4 TB. Upon replacing the last drive, the pool will grow to use the added capacity.
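
As a rough illustration of why the pool only grows once the last drive has been replaced, here is a minimal Python sketch of raidz capacity arithmetic. The function name `raidz_usable_tb` is made up for this example, and the figure it returns ignores ZFS metadata, padding, and TB/TiB differences, so treat it as an estimate under those assumptions.

```python
def raidz_usable_tb(disk_sizes_tb, parity=2):
    """Rough usable capacity of one raidz vdev, before ZFS overhead.

    Every member contributes only as much as the smallest disk,
    and `parity` disks' worth of space is used for parity.
    """
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than the parity level")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)

print(raidz_usable_tb([4] * 6))        # six 4 TB disks in raidz2 -> 16
print(raidz_usable_tb([8] * 5 + [4]))  # five disks replaced, one 4 TB left -> 16
print(raidz_usable_tb([8] * 6))        # all six replaced with 8 TB -> 32
```

The middle case is the one to remember: as long as a single 4 TB disk remains in the vdev, the larger disks are only used up to 4 TB each.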
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
T.J.Hammer said: "Which disk is best for storing media?"
Any NAS or enterprise storage disk should be OK. If you want to power down disks, you should check that aspect separately, because enterprise/data center disks usually don't like this (it is simply not part of their typical "life").

Whenever I looked in my home market (Germany), to my surprise the data center-class disks were cheaper (sometimes 30-40%) than NAS disks. And I am not talking OEM/bulk-ware but the full 5 years of warranty.

If you want to stick with 4 TB drives, you need to check for SMR! On the other hand, you now have about 16 TB of net capacity. Are you happy with that capacity? How much of a factor is electricity cost in your country? Do you expect to need more capacity in the foreseeable future?

In general, I would also recommend checking the "Recommended readings" in my signature. Some things may not be relevant right now, but the more you absorb, the better your overall understanding will become.

Good luck!
 

T.J.Hammer

Hi, thanks for all the information :)

I don't think I will need more than 16 TB of storage, but I can try to find a good deal on 8 TB disks through my work supplier.
As for electricity, it's very expensive, but I can't say that's an issue.

If I say that there is an unlimited budget for the disks, which disks should I get?
I have no more space in my case; it can only hold 6 disks, and I try to keep the case as small as possible.
 

danb35

In .us at least, the best $/TB seems to be with the external WD drives. Remove the drives from the enclosures and you're good to go. No warranty that way, but they're cheap enough to buy a spare or two. At 8 TB and under there's a risk of SMR drives, but I don't think SMR is used at higher capacities (it isn't for Seagate, at least).
 

Etorix

Seagate Archive are SMR. Whatever the capacity, check the drives aren't SMR.
 

T.J.Hammer

So I ran a long test overnight and now I see it failed (only on ada1, and again no email, nothing).
After the manual long test there was a scheduled short test for all the drives and all of them passed. ada1 also passed, after it failed 4 short tests yesterday and 1 long test overnight.

Is it common to have no idea what's going on? How can I know for sure that I need to replace the disk?
I already understand that I need to replace all my disks with CMR ones, and I will order 6 new 8 TB WD Red Pro drives, but I'm asking for the future: how can I know whether I need to replace a disk when it's failing and passing at the same time?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You need to look at the output of smartctl -a /dev/ada1 and look at the values. Post the output in [CODE][/CODE].
Generally, if it fails a long test you have to change it, but you also need to monitor the values.

If you look in my signature, there is a link to a script (multireport.sh) that helps with monitoring HDDs.
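
For a sense of what "looking at the values" can mean in practice, here is a minimal Python sketch that reads the attribute table from `smartctl -a` text and reports anything suspicious. It is not the multireport.sh script mentioned above. The shortlist in `WATCHED` and the function names are assumptions chosen for illustration; the sample rows are copied from the output in the first post.

```python
# Attribute IDs often watched on spinning disks (an assumed shortlist, not an official rule).
WATCHED = {5, 197, 198, 199}

# A few rows copied from the smartctl output in the first post.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 5
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

def parse_attributes(text):
    """Map attribute ID -> (name, normalized value, threshold, raw value)."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[int(parts[0])] = (parts[1], int(parts[3]), int(parts[5]), parts[9])
    return attrs

def attribute_warnings(attrs):
    """List human-readable warnings for the parsed attributes."""
    found = []
    for attr_id, (name, value, thresh, raw) in sorted(attrs.items()):
        # A normalized value at or below a nonzero threshold is the drive's own failure flag.
        if thresh > 0 and value <= thresh:
            found.append(f"{name}: normalized value {value} at or below threshold {thresh}")
        # For the watched counters, any nonzero raw value deserves a closer look.
        if attr_id in WATCHED and raw != "0":
            found.append(f"{name}: raw value {raw}")
    return found

print(attribute_warnings(parse_attributes(SAMPLE)))  # []
```

For these rows the list comes back empty even though the drive is failing self-tests, which is why the attribute values and the self-test log have to be read together.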
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
T.J.Hammer said: "So I ran a long test overnight and now I see it failed [...] Is it common to have no idea what's going on? How can I know for sure that I need to replace the disk?"

So you have a drive that's failed short tests and long tests. You've had posters here in this thread tell you that the drive is failing. What more do you need for you to be convinced that this drive needs to be replaced?
 

T.J.Hammer

Jailer said: "So you have a drive that's failed short tests and long tests. You've had posters here in this thread tell you that the drive is failing. What more do you need for you to be convinced that this drive needs to be replaced?"
I don't need convincing. I understand that it's failing and will replace it ASAP. I just wanted to know what to do if it happens again in the future.
This drive is now passing all tests, so if a new drive fails some test, does that mean I need to replace it immediately?
I just don't know which indicator in the test itself says "replace it now".
Is one failed test enough?
 

Davvo

As I said, you need to look at the data if you want a more definitive answer.
If a drive consistently fails long tests, you are certain it won't last.
If a drive failed a few short tests and now is passing both long and short, keep it under observation.
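
The rule of thumb above can be sketched in Python as a small triage over the SMART self-test log. This is only one possible reading of it: the function names, the `long_window` parameter, and the three labels are assumptions made for illustration, and the sample rows are the four newest entries from the log in the first post.

```python
import re

# The four newest entries from the self-test log in the first post (newest first).
SAMPLE = """\
# 1 Short offline Completed: read failure 10% 41594 21273032
# 2 Short offline Completed: read failure 10% 41577 21273032
# 3 Short offline Completed without error 00% 41554 -
# 4 Extended offline Completed without error 00% 41531 -
"""

# Matches Short and Extended entries only; other test types are ignored in this sketch.
LINE = re.compile(
    r"^#\s*(\d+)\s+(Short|Extended)\s+offline\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

def parse_selftests(text):
    """Return [(kind, passed, lifetime_hours)] in the order smartctl prints them."""
    tests = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            _num, kind, status, _remaining, hours, _lba = match.groups()
            tests.append((kind, status == "Completed without error", int(hours)))
    return tests

def triage(tests, long_window=2):
    """Classify a drive as 'replace', 'observe', or 'ok' from its self-test log."""
    # Consistently failing long tests: the newest `long_window` long tests all failed.
    recent_long = [ok for kind, ok, _ in tests if kind == "Extended"][:long_window]
    if recent_long and not any(recent_long):
        return "replace"
    # Any failure still in the log: keep the drive under observation.
    if any(not ok for _, ok, _ in tests):
        return "observe"
    return "ok"

print(triage(parse_selftests(SAMPLE)))  # observe
```

A stricter policy, such as replacing after any failed test as several posters in this thread recommend, would only need the first condition changed.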

 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
T.J.Hammer said: "I don't need convincing. I understand that it's failing and will replace it ASAP. I just wanted to know what to do if it happens again in the future."
I think you have been given a lot of great advice and you probably know this. It looks to me like this drive has served you well with over 41,000 hours of runtime. The only obvious errors I see are 5 raw read errors, and the big one is that it will not pass a SMART Short test. I suspect the head positioning is out of whack or maybe the motor speed is not stable; that is not factual data I could prove, just what I think I understand about hard drives and the data being reported. All we know for certain is that the drive is failing, and you know that already.

T.J.Hammer said: "This drive is now passing all tests, so if a new drive fails some test, does that mean I need to replace it immediately?"
If you keep the drive (I highly recommend you replace it as soon as you can order a replacement), I recommend a daily SMART Short test and a weekly SMART Long/Extended test.
If a drive fails something, you must assess what the failure is and determine whether the drive needs to be replaced. For example, a UDMA CRC error (UDMA_CRC_Error_Count) is often caused by a loose or bad SATA cable, not the hard drive, but the drive reports the issue.

Will this happen again? You bet it will. The hard drives in a system should be considered 'consumable' items. They have a limited life expectancy, and the manufacturers tell you as much with the warranty they provide. Most drives with a 3-year warranty will last about 5 years with proper care, some longer, some much shorter. A hard drive is something you can bet you will need to replace eventually.
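
To put the runtime figure next to the "about 5 years" estimate, here is a small Python sketch that converts the Power_On_Hours raw value into years. The function name is made up for illustration, and it assumes the raw value is counted in hours, which is not the case for every vendor.

```python
HOURS_PER_YEAR = 24 * 365.25  # average year length, leap years included

def power_on_years(hours):
    """Convert a Power_On_Hours raw value (assumed to be in hours) to years."""
    return hours / HOURS_PER_YEAR

# 41594 is the Power_On_Hours raw value from the smartctl output in the first post.
print(f"{power_on_years(41594):.2f} years")  # 4.74 years
```

That is spinning time only; a drive that was powered off for long stretches is older on the calendar than this number suggests.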

As for if this drive is SMR... If this drive has been part of your ZFS pool for quite a while without issue, then it is likely not an SMR drive. That isn't a very scientific approach but it's all I got.
 