
Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
These are the results I get from running smartctl -a /dev/ada3...

Code:
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2                                           
  3 Spin_Up_Time            0x0027   181   178   021    Pre-fail  Always       -       5908                                         
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       304                                         
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   001   001   000    Old_age   Always       -       262140                                       
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6949                                         
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0                                           
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0                                           
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       304                                         
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38                                           
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       372                                         
194 Temperature_Celsius     0x0022   117   112   000    Old_age   Always       -       33                                           
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                           
                                                                                                                                    
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Interrupted (host reset)      90%      6907         -                                                     
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                     
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.                                                         
                                                                            
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
I think I did try that command and didn't see anything out of the ordinary...
Anyway, it would seem that the activity on ada3 has now dropped from a constant 100% to spiking between 60% and 80%.
The rest of the drives are still between 2% and 5% activity.
Please provide the output of that smartctl command. We need to see when you last ran a long test and what the attribute values are. It seems like your disk is dead and you haven't been monitoring it.
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
results posted
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
This drive is dead. You also need to set up a task to run SMART long tests and short tests on all your drives. Combined with email alerts, that keeps you from ending up with this problem.
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
Errr... okay, where does it say that it is dead? And if so, why is the FreeNAS GUI reporting that all is okay? (Or does this status only change if you schedule SMART tests?)
I did send the command to do a long test (the drive stated it's going to take 16 hours for the test to complete).

Is there a guide to setting up the SMART tests? And what is the recommended setup? (A short one every day and a long one on a Sunday or something?)

Also, after the SMART tests are run on the drives, does FreeNAS output the results to a log or txt file? Would I need to go in and check each drive's log via the shell? Or do I need to get the report emailed to me?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
okay, where does it say that it is dead?

Attribute ID 7 (Seek_Error_Rate).

For Seagate products, that value is calculated in a different way, so I cannot compare it to yours. But because you have a WD drive, such a high seek error count means a dead drive.

Here is some (old) info about interpreting that value and the difference between how Seagate and WD use it...

Interpreting Smart results
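
For anyone who wants to check these attributes without reading the table by eye, here is a minimal Python sketch. It parses rows in the format smartctl printed above and flags attributes whose normalized VALUE has reached THRESH, or whose raw count is non-zero. The watch list of attribute IDs is an assumption for illustration (it follows the rule of thumb in this thread for WD drives and does not apply to Seagate, which encodes these raw values differently), and the function names are made up for this example.

```python
import re

# Rows copied from the smartctl output posted earlier in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   001   001   000    Old_age   Always       -       262140
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Assumption: raw counts for these IDs are expected to stay at 0 on this WD drive.
WATCH_RAW_NONZERO = {1, 5, 7, 197, 198}


def parse_attributes(text):
    """Return one dict per line that looks like a SMART attribute row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            })
    return rows


def flag_attributes(rows, watch_ids=WATCH_RAW_NONZERO):
    """Return (name, reasons) pairs for attributes that deserve a closer look."""
    flagged = []
    for row in rows:
        reasons = []
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            reasons.append("normalized value at or below threshold")
        if row["id"] in watch_ids and row["raw"] > 0:
            reasons.append("raw value is non-zero")
        if reasons:
            flagged.append((row["name"], reasons))
    return flagged


for name, reasons in flag_attributes(parse_attributes(SAMPLE)):
    print(name, "->", "; ".join(reasons))
```

With the sample rows this reports Raw_Read_Error_Rate and Seek_Error_Rate, which matches what was pointed out above. Treat it as a starting point, not a verdict on drive health.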
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
So again, I would like to find out: if I started a scrub or a long SMART test, does rebooting or shutting down the server completely stop it, or will it resume or even restart once the server has been restarted?

Also, those SMART results are showing "No Errors Logged"?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
So again, I would like to find out: if I started a scrub or a long SMART test, does rebooting or shutting down the server completely stop it, or will it resume or even restart once the server has been restarted?

Also, those SMART results are showing "No Errors Logged"?
Your read error value should be 0 and your seek error value should be 0 too. That's why it's failing.

Your scrub will continue after a reboot, but a SMART test will not. FreeNAS doesn't log the test results anywhere, but it will send you an email notification if an error count increases or a test fails. The FreeNAS GUI says the disk is still OK because it is still returning good data; it is just taking far too long, though not long enough for ZFS to drop it from the pool. My guess is that soon, or if you try to run a scrub, the pool will go degraded and this disk will get dropped.
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
So, replace the drive ASAP.
And why did the performance improve slightly? It's actually usable again at the moment.
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
I think I did try that command and didn't see anything out of the ordinary...
Anyway, it would seem that the activity on ada3 has now dropped from a constant 100% to spiking between 60% and 80%.
The rest of the drives are still between 2% and 5% activity.

The latency on ada3 was very high (100x higher than normal), which is normally a sign of a fault within the drive. If the latency does not go away, then you should treat it as a soft fault and replace the drive.
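
As a rough illustration of "one disk is far out of line with the rest", here is a small Python sketch that compares each disk's busy percentage against the median of all disks. The disk names other than ada3, the sample figures (loosely based on the percentages quoted above), and the factor of 10 are all assumptions chosen for this example, not values taken from FreeNAS.

```python
from statistics import median

# Hypothetical busy percentages; ada3 sits in the 60-80% range reported above,
# the others in the 2-5% range.
busy_percent = {"ada0": 3, "ada1": 2, "ada2": 5, "ada3": 70}


def find_outliers(samples, factor=10):
    """Return disks whose busy % exceeds `factor` times the median of all disks."""
    mid = median(samples.values())
    return sorted(name for name, pct in samples.items() if pct > factor * mid)


print(find_outliers(busy_percent))
```

A median is used instead of a mean so that the one misbehaving disk does not drag the baseline up with it.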
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
Should I risk attempting the "remove disk, thereby degrading the pool, then re-insert and let it resilver" approach?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
"remove disk, thereby degrading the pool, then re-insert and let it re-silver"
You can remove the disk (offline it first or instead), but don't re-insert it... there will be no miraculous recovery for a disk that has errors like that.
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
As a replacement drive, am I to be looking for a Western Digital RED PRO drive (CMR)?
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
okay, so I have run a smartctl Long test and it came back with one error apparently (FreeNAS is now showing a red status with this warning: "CRITICAL: Jan. 9, 2021, 12:03 p.m. - Device: /dev/ada3, Self-Test Log error count increased from 0 to 1")

Here are the SMART results after the long test:
Code:
/$ smartctl -a /dev/ada3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N3EYE1Y3
LU WWN Device Id: 5 0014ee 2b9970486
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jan  9 13:12:38 2021 CAT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline
data collection:         (40980) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 411) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0027   181   178   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       304
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   001   001   000    Old_age   Always       -       167103
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7132
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       304
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       372
194 Temperature_Celsius     0x0022   114   110   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%      7130         3646336664
# 2  Short offline       Completed without error       00%      7111         -
# 3  Extended offline    Completed without error       00%      7105         -
# 4  Extended offline    Interrupted (host reset)      90%      6907         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



So... I guess I'm turning off the NAS until I can find a replacement drive.
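
In case it helps anyone reading later, here is a minimal Python sketch that reads the self-test log section of the output above and sorts each entry into "ok", "failed" or "incomplete". The classification rule and the function names are assumptions made for this example; smartctl itself only prints the status text.

```python
import re

# Self-test log rows copied from the smartctl output above.
SAMPLE_LOG = """\
# 1  Extended offline    Completed: read failure       10%      7130         3646336664
# 2  Short offline       Completed without error       00%      7111         -
# 3  Extended offline    Completed without error       00%      7105         -
# 4  Extended offline    Interrupted (host reset)      90%      6907         -
"""

# number, description, status, remaining %, lifetime hours, LBA of first error
ENTRY = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Return one dict per self-test log entry."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            entries.append({
                "num": int(m.group(1)),
                "test": m.group(2),
                "status": m.group(3),
                "remaining": int(m.group(4)),
                "hours": int(m.group(5)),
                "lba": m.group(6),
            })
    return entries


def classify(entry):
    """Assumed rule: anything not cleanly completed is failed or incomplete."""
    status = entry["status"]
    if status == "Completed without error":
        return "ok"
    if status.startswith("Completed:") or status.startswith("Fatal"):
        return "failed"
    return "incomplete"


def hours_since_clean_long_test(entries, power_on_hours):
    """Power-on hours since the most recent clean extended test, or None."""
    clean = [e["hours"] for e in entries
             if e["test"] == "Extended offline" and classify(e) == "ok"]
    return power_on_hours - max(clean) if clean else None


entries = parse_selftest_log(SAMPLE_LOG)
for e in entries:
    print(e["num"], e["test"], classify(e), e["lba"])
print(hours_since_clean_long_test(entries, 7132))  # 7132 is Power_On_Hours above
```

For the log above, entry 1 comes out as failed with a first-error LBA of 3646336664, entries 2 and 3 as ok, and the interrupted test as incomplete.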
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
So... I guess I'm turning off the NAS until I can find a replacement drive.

I don't know if I would do that either... The thermal cycle might cause another drive to fail. RAIDZ2 can survive two drive failures, so for now your data is safe, and taking that drive offline should improve performance. I don't think there are any 3 TB CMR drives still in production at this point. You can replace it with a larger CMR drive, and the extra space will simply be wasted.
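
To put numbers on the "extra space will simply be wasted" point, here is a back-of-the-envelope Python sketch. It assumes a six-drive RAIDZ2 vdev (the thread does not say how many drives are in the pool) and ignores ZFS metadata and padding overhead, so the figures are approximate.

```python
def raidz_usable_tb(drive_sizes_tb, parity):
    """Approximate usable space: every member counts as the smallest drive."""
    if len(drive_sizes_tb) <= parity:
        raise ValueError("need more drives than parity devices")
    return (len(drive_sizes_tb) - parity) * min(drive_sizes_tb)


def failures_still_tolerated(parity, failed_drives):
    """How many more drives can fail before the vdev is lost."""
    return max(parity - failed_drives, 0)


# Hypothetical pool: six 3 TB drives in RAIDZ2.
print(raidz_usable_tb([3, 3, 3, 3, 3, 3], parity=2))
# Same pool after swapping one drive for a 6 TB model: capacity is unchanged.
print(raidz_usable_tb([3, 3, 3, 6, 3, 3], parity=2))
# With one drive offline, RAIDZ2 can still lose one more.
print(failures_still_tolerated(parity=2, failed_drives=1))
```

The larger drive only pays off once every drive in the vdev has been replaced with the bigger size.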
 

Snake3y3s

Explorer
Joined
Oct 3, 2017
Messages
96
I live in South Africa. We still get the EFRX drives here, so I think I will be able to get one. I would love to get a 6TB (a minor upgrade for when I replace the rest) or a Pro series... but cash :(
The reason I would switch off the NAS for now is that I honestly don't know when I will be able to afford another drive.
 