Confused - Received E-Mail for Disk Offline, but GUI Shows Online

Status
Not open for further replies.

khelm

Dabbler
Joined
Feb 10, 2012
Messages
10
First I should say that I am fairly new to FreeNAS and I'm still in the learning phase. I've done some searching, but wasn't able to find anything on the subject. I am running:
Build FreeNAS-9.3-STABLE-201506292130

I received an email with a critical alert saying a disk was offline. I looked at the GUI and it showed the disk as online. Here is the timeline of events:

Sun, Aug 2, 2015 at 3:24 AM
Email received.
Subject: Critical Alerts
Message:
Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors

Sun, Aug 2, 2015 at 6:55 AM
Email received.
Subject: Critical Alerts
Message:
Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Device: /dev/da0 [SAT], 2 Offline uncorrectable sectors

Mon, Aug 3, 2015 at 8:30 PM
I looked at the GUI under Storage>Volumes and it showed the volume as "HEALTHY".
I then clicked on the volume, clicked the volume status, and it showed "ONLINE" for all disks, including the suspect da0.

Here are some of the recent console messages copied from the GUI:
Code:
Aug  3 19:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 19:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 19:27:54 MainFileServer nmbd[7286]: [2015/08/03 19:27:54.683783,  0] ../source3/nmbd/nmbd_packets.c:1289(process_dgram)
Aug  3 19:27:54 MainFileServer nmbd[7286]:   process_dgram: ignoring malformed3 (datasize = 494, len=396, off=104) datagram packet sent to name MYHOME<00> from IP 69.11.25.16
Aug  3 19:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 19:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 20:20:45 MainFileServer smbd[18491]:   STATUS=daemon 'smbd' finished starting up and ready to serve connectionsFailed to fetch record!
Aug  3 20:21:54 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 20:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 20:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 20:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 21:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  3 21:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors



smartctl report:
smartctl -a /dev/da0

Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)                                                        
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital RE4-GP                                                                                           
Device Model:     WDC WD2002FYPS-01U1B0                                                                                            
Serial Number:    WD-WCAVY0557782                                                                                                  
LU WWN Device Id: 5 0014ee 25870240e                                                                                               
Firmware Version: 04.05G05                                                                                                         
User Capacity:    2,000,398,934,016 bytes [2.00 TB]                                                                                
Sector Size:      512 bytes logical/physical                                                                                       
Rotation Rate:    5400 rpm                                                                                                         
Device is:        In smartctl database [for details use: -P show]                                                                  
ATA Version is:   ATA8-ACS (minor revision not indicated)                                                                          
SATA Version is:  SATA 2.6, 3.0 Gb/s                                                                                               
Local Time is:    Mon Aug  3 21:25:59 2015 EDT                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                          
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                   
General SMART Values:                                                                                                              
Offline data collection status:  (0x84) Offline data collection activity                                                           
                                        was suspended by an interrupting command from host.                                        
                                        Auto Offline Data Collection: Enabled.                                                     
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                  
Total time to complete Offline                                                                                                     
data collection:                (42360) seconds.                                                                                   
Offline data collection                                                                                                            
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                        
                                        command.                                                                                   
                                        Offline surface scan supported.                                                            
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                            
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                            
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                            
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 482) minutes.
Conveyance self-test routine                                                                                                       
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x303f) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                           
  3 Spin_Up_Time            0x0027   148   147   021    Pre-fail  Always       -       9583                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       779                                         
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       3674                                        
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       55                                          
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4656                                        
194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always       -       36                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2                                           
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   200   153   000    Old_age   Offline      -       0                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed without error       00%      3621         -                                                     
# 2  Extended offline    Completed without error       00%      3533         -                                                     
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.      
 


The upper right-hand corner of the GUI has a red flashing button with "Alert" under it. When I click on it, it brings up a dialog box as shown in the attached "System Alert.jpg". Since there are no time stamps on these, I'm not sure when they are from.

I guess I'm a bit confused by the discrepancies between the emails and what the GUI shows. So should this drive be replaced?

Thank You,
Kerry
 

Attachments

  • System Alert.jpg

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
That email doesn't say your disk is offline; it says your disk has two bad sectors. It's the same as SMART attribute 198 in the output you posted. I'd recommend running a long SMART test and keeping a close eye on the disk--what's showing right now isn't good, but I wouldn't feel that it's critical to replace the disk immediately based on this information.
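
For keeping an eye on the disk, a small script can pull the raw values of the sector-related attributes out of saved smartctl -a output. This is only a minimal sketch: the name raw_values is made up for illustration, the sample rows are copied from the output posted above, and the regular expression assumes the classic attribute table layout shown in this thread.

```python
import re

# Three rows copied from the smartctl -a output posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
# Only the ID, the name and the leading integer of the raw value are captured.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(text):
    """Return {attribute_id: (name, raw_value)} for every attribute row found."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            found[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return found


if __name__ == "__main__":
    values = raw_values(SAMPLE)
    for attr_id in (5, 197, 198):
        name, raw = values[attr_id]
        print(attr_id, name, raw)
```

Running it against output saved after each long test makes it easy to see whether any of the three counts is growing.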

Looks like you've been running the drive a little less than 2 months. The load cycle count looks fairly high for that. You might want to do a search for WDIDLE3.EXE and make sure your drive is set appropriately (300 seconds or off are the recommended settings).
 

khelm

Dabbler
Joined
Feb 10, 2012
Messages
10
I guess seeing the word "Offline" made me panic a bit, before figuring out what it was really saying! So SMART attribute 197 would be sectors marked as bad and unusable?

I'm still a bit confused though because the SMART attribute 198 Offline_Uncorrectable is 0, but 197 Current_Pending_Sector is 2. So does attribute 197 mean that it's unable to move the data from the sector it's having trouble with? This drive has TLER enabled with a default time of 7 seconds. Would TLER be keeping the drive from re-allocating the bad sector because it doesn't have enough time?

Thanks for mentioning WDIDLE3.EXE. While the timer has already been disabled on this drive, the drive was used for a little while back in 2009 before I learned about WDIDLE3 and disabled it.
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Here's how I understand the relevant SMART attributes. You can read more and perhaps confirm or deny my explanation by looking up SMART on Wikipedia.
#197 says the drive has detected errors at some point with 2 sectors, and will reallocate those sectors when they are next written to.
If the reallocation succeeds, #5 will be adjusted to show 2 reallocated sectors.
If the reallocation fails, #198 will be adjusted to show 2 failed reallocations.
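
The three statements above can be written down as a tiny model. This is a sketch of the explanation in this post only, not of what drive firmware actually does: the names Counters and rewrite_pending are invented for illustration, and clearing #197 after a successful reallocation is an assumption, since the post does not say so explicitly.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    reallocated: int = 0    # attribute 5, Reallocated_Sector_Ct
    pending: int = 0        # attribute 197, Current_Pending_Sector
    uncorrectable: int = 0  # attribute 198, Offline_Uncorrectable


def rewrite_pending(c, remap_ok):
    """Model what happens when all pending sectors are next written to."""
    n = c.pending
    if remap_ok:
        # Reallocation succeeded: #5 goes up by the number of pending sectors.
        # Assumption: #197 drops back by the same amount.
        return Counters(c.reallocated + n, c.pending - n, c.uncorrectable)
    # Reallocation failed: #198 goes up. The post does not say what happens
    # to #197 in this case, so it is left unchanged here.
    return Counters(c.reallocated, c.pending, c.uncorrectable + n)


if __name__ == "__main__":
    now = Counters(reallocated=0, pending=2, uncorrectable=0)
    print(rewrite_pending(now, remap_ok=True))
    print(rewrite_pending(now, remap_ok=False))
```

Starting from the values in this thread (#5 = 0, #197 = 2, #198 = 0), the first call ends with #5 = 2 and the second with #198 = 2.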
 

khelm

Dabbler
Joined
Feb 10, 2012
Messages
10
The SMART attribute 197 Current_Pending_Sector is still at 2, but I see a console message every 30 minutes for the unreadable sectors, as shown below. Does this mean that it keeps trying to re-allocate the sectors every 30 minutes, or is it just reporting the same SMART error every 30 minutes?

Console Display:
Code:
Aug  4 10:51:54 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 10:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 12:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 12:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 12:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 12:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 13:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 13:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 13:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 13:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 14:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 14:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 14:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 14:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 15:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 15:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 15:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 15:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 16:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 16:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 16:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 16:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 17:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 17:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
"Does this mean that it keeps trying to re-allocate the sectors every 30 minutes, or is it just reporting the same smart error every 30 minutes?"
It means you have the SMART service set to check the SMART attributes every 30 minutes (the default), and it's finding the same issue each time it checks and reporting it to you.
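
The interval can be confirmed from the console messages themselves. The following is a rough sketch rather than a finished tool: the function name check_intervals_minutes is hypothetical, the log lines are copied from the post above, and the year is passed in by hand because syslog timestamps do not include one.

```python
from datetime import datetime

# A few lines copied from the console output posted above.
LOG = """\
Aug  4 10:51:54 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 10:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:21:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
Aug  4 11:51:55 MainFileServer smartd[3038]: Device: /dev/da0 [SAT], 2 Currently unreadable (pending) sectors
"""


def check_intervals_minutes(log, year=2015):
    """Return the gaps, in whole minutes, between successive smartd checks."""
    times = []
    for line in log.splitlines():
        if "smartd[" not in line:
            continue  # ignore messages from other daemons (nmbd, smbd, ...)
        month, day, clock = line.split()[:3]
        stamp = datetime.strptime(f"{year} {month} {day} {clock}",
                                  "%Y %b %d %H:%M:%S")
        # Each message appears twice in the posted log, at most a second
        # apart, so anything within a minute of the last entry is skipped.
        if not times or (stamp - times[-1]).total_seconds() > 60:
            times.append(stamp)
    return [round((b - a).total_seconds() / 60)
            for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(check_intervals_minutes(LOG))  # [30, 30]
```

A steady run of 30s, with the sector count in the message never changing, matches a periodic check that keeps reporting the same state.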
 

khelm

Dabbler
Joined
Feb 10, 2012
Messages
10
Thanks! I think this is starting to make sense to me now. I guess I've always assumed that when a drive has a problem with a sector, the data was reallocated to another sector and the original marked as bad right away. I should know better than to assume!

I will try a long test and keep an eye on it.

Thanks To All!
Kerry
 

khelm

Dabbler
Joined
Feb 10, 2012
Messages
10
I don't like seeing the current pending sectors. Since I have a few spare drives, I'm just going to swap the drive out. Once the drive is swapped out, I'm going to exercise it and see if I can get the pending sectors to clear. Should I run the scrub before or after replacing the drive, or both? I'm not sure if it makes a difference, but I thought I would mention that the problem drive belongs to a pool made up of 5 vdevs, and all vdevs are pairs of mirrored drives.

Thank You,
Kerry
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Run the scrub before swapping the drive.

Follow the drive replacement directions carefully. Not sure there's any benefit to scrubbing afterwards.

If you have a *nix box handy, you might be able to clear the pending sectors by running badblocks in its write mode (-w, which overwrites the whole drive), since that will write to every sector.
 