Unsure of issue: "One or more devices experienced an unrecoverable error"

Status
Not open for further replies.
Joined
Jun 26, 2012
Messages
260
Is this a problem? I'm not really sure what happened. My NAS went down (nothing was available), so I restarted it. The alert below is what I saw when I logged into the GUI:

The volume Data1 (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.


Not sure what/how to check.
I ran smartctl -a /dev/ada0 and got:
Code:
Error 2 occurred at disk power-on lifetime: 22338 hours (930 days + 18 hours)                                                      
  When the command that caused the error occurred, the device was active or idle.                                                  
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455                                                                 
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                      
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                  
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                  
  60 00 00 ff ff ff 4f 00  41d+06:05:47.039  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+06:05:46.973  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+06:05:46.972  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+06:05:46.972  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+06:05:46.972  READ FPDMA QUEUED                                                                     
                                                                                                                                   
Error 1 occurred at disk power-on lifetime: 22334 hours (930 days + 14 hours)                                                      
  When the command that caused the error occurred, the device was active or idle.                                                  
                                                                                                                                   
  After command completion occurred, registers were:                                                                               
  ER ST SC SN CL CH DH                                                                                                             
  -- -- -- -- -- -- --                                                                                                             
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455                                                                 
                                                                                                                                   
  Commands leading to the command that caused the error were:                                                                      
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                  
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                  
  60 00 00 ff ff ff 4f 00  41d+02:52:33.738  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+02:52:32.264  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+02:52:32.244  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+02:52:32.243  READ FPDMA QUEUED                                                                     
  60 00 00 ff ff ff 4f 00  41d+02:52:32.242  READ FPDMA QUEUED                                                                     
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Short offline       Completed: read failure       90%     14164         149884                                                
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.                    



2 3TB drives in RAID1
8 GB RAM, FreeNAS 9.2.1.3 x64
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Woah, several problems here:

First, you are NOT running scheduled SMART tests! There's only a single short test on that drive, which is a very bad sign. It failed, too, so the drive is toast.

You need to brush up on hard drive maintenance, namely SMART tests and scrubs.

For now, give us the output of zpool status (don't forget the code tags) and manually run a SMART long test on all drives with smartctl -t long /dev/adawhatever
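A minimal sketch of the commands requested above, written in Python only so the drive list can be looped over. The device names /dev/ada0 and /dev/ada1 are the ones that appear in this thread; the helper names are hypothetical. As written it only prints the commands; actually running them needs smartctl installed and root privileges on the NAS.

```python
import subprocess

# Device names as they appear in this thread; adjust for your own system.
DEVICES = ["/dev/ada0", "/dev/ada1"]


def long_test_command(device):
    """Return the argv that starts a SMART extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def report_command(device):
    """Return the argv that prints the full SMART report for a drive."""
    return ["smartctl", "-a", device]


def run_all(commands):
    """Run each command in turn; requires smartctl and root privileges."""
    for argv in commands:
        subprocess.run(argv, check=False)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for dev in DEVICES:
        print(" ".join(long_test_command(dev)))
    print("zpool status")
```

Once the long tests have finished, report_command gives the matching smartctl -a call for each drive.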
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
And, that failed short test was run about a year ago (assuming 24x7 usage).
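The "about a year" estimate can be checked with a little arithmetic. This is only a sketch using the two power-on-hour figures from the smartctl output above, and it assumes (as stated) that the drive has been running 24x7.

```python
HOURS_PER_DAY = 24

# Values copied from the smartctl output earlier in the thread.
hours_at_latest_error = 22338   # "Error 2 occurred at disk power-on lifetime"
hours_at_failed_test = 14164    # LifeTime(hours) of the failed short test

elapsed_hours = hours_at_latest_error - hours_at_failed_test
elapsed_days = elapsed_hours / HOURS_PER_DAY

print(f"{elapsed_hours} power-on hours, roughly {elapsed_days:.0f} days")
```

That works out to 8174 power-on hours, a bit over 340 days of continuous running.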
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Good catch. I was going to say "this disk has been failing for quite a while too".
 
Joined
Jun 26, 2012
Messages
260
Hmm...that does not sound good. I am a true noob here. I am knowledgeable enough to get this thing working, but obviously not knowledgeable enough to do proper maintenance.

I have run the smartctl -t long command noted above. It says it will complete in 5 hours. I am clearly not sure what I am doing here...

Code:
Use smartctl -X to abort test.                                                                                                     
[root@freenas ~]# zpool status                                                                                                     
  pool: CVZData1                                                                                                                   
state: ONLINE                                                                                                                     
status: The pool is formatted using a legacy on-disk format.  The pool can                                                         
        still be used, but some features are unavailable.                                                                          
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the                                                            
        pool will no longer be accessible on software that does not support feature                                                
        flags.                                                                                                                     
  scan: scrub repaired 0 in 5h54m with 0 errors on Sun Jan 18 05:54:51 2015                                                        
config:                                                                                                                            
                                                                                                                                   
        NAME                                            STATE     READ WRITE CKSUM                                                 
        Data1                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0                                                 
            gptid/f565bba0-b456-11e3-89da-94de8001cf84  ONLINE       0     0     0                                                 
            gptid/268bcf06-c195-11e1-b9d6-6c626d8c6b58  ONLINE       0     0     0                                                 
                                                                                                                                   
errors: No known data errors                      


I just typed zpool status on the command line (assuming that is what you wanted...like I said, I am very new to all of this).
I built it, plugged it in and tinkered with settings until it worked. General rules and maintenance are unknown to me...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
When that's done, post the output of smartctl -a /dev/adawhatever for all drives, as well as zpool status again.
 
Joined
Jun 26, 2012
Messages
260

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Additional note, I changed one of my HDDs about a year ago in March.
https://forums.freenas.org/index.ph...er-i-o-error-pagein-failed.19288/#post-110717

Is it possible I replaced the drive and then did not process something correctly?
I know I am not following general maintenance rules etc...but 2 drive failures in 1 year is highly surprising...isn't it?

There's no smoking gun, so let's assume the replacement was correctly performed.

Two failures in one year isn't too surprising, even if we're talking about a simple mirror.
Stuff tends to fail when it operates for long periods, spinning at 5400 RPM with crazy tight tolerances.
 
Joined
Jun 26, 2012
Messages
260
What is a standard SMART test schedule? This is just a home media server.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Personally I use this schedule:
Code:
+=================+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+
|task        day->|01   |02   |03   |04   |05   |06   |07   |08   |09   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |20   |21   |22   |23   |24   |25   |26   |27   |28   |29   |30   |31   |
+=================+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+=====+
|pool scrub       |04:00|     |     |     |     |     |     |     |     |     |     |     |     |     |     |04:00|     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|long smart test  |     |03:00|     |     |     |     |     |     |     |     |     |     |     |     |     |     |03:00|     |     |     |     |     |     |     |     |     |     |     |     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|short smart test |     |     |     |     |     |     |05:00|     |     |     |     |05:00|     |     |     |     |     |     |     |     |     |05:00|     |     |     |     |05:00|     |     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|send smart report|     |     |06:00|     |     |     |     |06:00|     |     |     |     |06:00|     |     |     |     |06:00|     |     |     |     |06:00|     |     |     |     |06:00|     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|send zpool report|     |     |06:01|     |     |     |     |     |     |     |     |     |     |     |     |     |     |06:01|     |     |     |     |     |     |     |     |     |     |     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|send ups report  |     |     |06:02|     |     |     |     |     |     |     |     |     |     |     |     |     |     |06:02|     |     |     |     |     |     |     |     |     |     |     |     |     |
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
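The same schedule can be written down as data, which makes it easier to check what runs on a given day. This is only a sketch transcribed from the table above; the SCHEDULE and tasks_for_day names are made up for illustration and are not part of FreeNAS.

```python
# Day-of-month -> start time for each task, transcribed from the table above.
SCHEDULE = {
    "pool scrub": {1: "04:00", 16: "04:00"},
    "long smart test": {2: "03:00", 17: "03:00"},
    "short smart test": {7: "05:00", 12: "05:00", 22: "05:00", 27: "05:00"},
    "send smart report": {d: "06:00" for d in (3, 8, 13, 18, 23, 28)},
    "send zpool report": {3: "06:01", 18: "06:01"},
    "send ups report": {3: "06:02", 18: "06:02"},
}


def tasks_for_day(day):
    """Return (time, task) pairs scheduled on a day of the month, sorted by time."""
    return sorted(
        (times[day], task) for task, times in SCHEDULE.items() if day in times
    )


if __name__ == "__main__":
    for day in range(1, 32):
        for time, task in tasks_for_day(day):
            print(f"day {day:02d} {time} {task}")
```

Laid out this way, the pattern is easier to see: each long test runs the day after a scrub, and the zpool and UPS reports go out the day after each long test.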


Edit: why have you deleted your post? My post is useless now... For the record, the question was something like: "what is the recommended SMART test interval?"
 
Joined
Jun 26, 2012
Messages
260
Back...thought it was a stupid question relatively easily found by searching...but hey...there are no stupid questions...right?
 
Joined
Jun 26, 2012
Messages
260
Thanks...very helpful
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, no big deal ;)

You're welcome.
 
Joined
Jun 26, 2012
Messages
260
Due to several power outages that kept interrupting the smartctl...this is not done yet.
Running again...
 
Joined
Jun 26, 2012
Messages
260
and yes...I am now ordering a UPS that I have been slow rolling on getting. Ugh.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Hopefully you have your server on a UPS and are powering it down gracefully when the power fails.
 
Joined
Jun 26, 2012
Messages
260
smartctl:
Code:
After command completion occurred, registers were:                                                                              
  ER ST SC SN CL CH DH                                                                                                            
  -- -- -- -- -- -- --                                                                                                            
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455                                                                
                                                                                                                                  
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  60 00 08 ff ff ff 4f 00      00:03:41.426  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:41.426  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:41.426  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:41.426  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:41.426  READ FPDMA QUEUED                                                                    
                                                                                                                                  
Error 33 occurred at disk power-on lifetime: 22381 hours (932 days + 13 hours)                                                    
  When the command that caused the error occurred, the device was active or idle.                                                 
                                                                                                                                  
  After command completion occurred, registers were:                                                                              
  ER ST SC SN CL CH DH                                                                                                            
  -- -- -- -- -- -- --                                                                                                            
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455                                                                
                                                                                                                                  
  Commands leading to the command that caused the error were:                                                                     
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                 
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                 
  60 00 08 ff ff ff 4f 00      00:03:38.484  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:38.484  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:38.484  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:38.484  READ FPDMA QUEUED                                                                    
  60 00 08 ff ff ff 4f 00      00:03:38.484  READ FPDMA QUEUED                                                                    
                                                                                                                                  
SMART Self-test log structure revision number 1                                                                                   
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                   
# 1  Extended offline    Completed: read failure       90%     22392         149884                                               
# 2  Extended offline    Completed: read failure       90%     22381         149884                                               
# 3  Extended offline    Completed: read failure       90%     22377         149884                                               
# 4  Short offline       Completed: read failure       90%     14164         149884                                               
                                                                                                                                  
SMART Selective self-test log data structure revision number 1                                                                    
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                      
    1        0        0  Not_testing                                                                                              
    2        0        0  Not_testing                                                                                              
    3        0        0  Not_testing                                                                                              
    4        0        0  Not_testing                                                                                              
    5        0        0  Not_testing                                                                                              
Selective self-test flags (0x0):                                                                                                  
  After scanning selected spans, do NOT read-scan remainder of disk.                                                              
If Selective self-test is pending on power-up, resume after 0 minute delay.      


zpool status:
Code:
  pool: Data1                                                                                                                  
state: ONLINE                                                                                                                    
status: The pool is formatted using a legacy on-disk format.  The pool can                                                        
        still be used, but some features are unavailable.                                                                         
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the                                                           
        pool will no longer be accessible on software that does not support feature                                               
        flags.                                                                                                                    
  scan: scrub repaired 0 in 5h54m with 0 errors on Sun Jan 18 05:54:51 2015                                                       
config:                                                                                                                           
                                                                                                                                  
        NAME                                            STATE     READ WRITE CKSUM                                                
        Data1                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0                                                
            gptid/f565bba0-b456-11e3-89da-94de8001cf84  ONLINE       0     0     0                                                
            gptid/268bcf06-c195-11e1-b9d6-6c626d8c6b58  ONLINE       0     0     0                                                
                                                                                                                                  
errors: No known data errors 
 
Joined
Jun 26, 2012
Messages
260
I also ran smartctl on ada1:

Code:
Extended self-test routine                                                                                                         
recommended polling time:        ( 406) minutes.                                                                                   
Conveyance self-test routine                                                                                                       
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x703d) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                           
  3 Spin_Up_Time            0x0027   180   180   021    Pre-fail  Always       -       5983                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7989                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9                                           
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       934707                                      
194 Temperature_Celsius     0x0022   116   107   000    Old_age   Always       -       34                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
No self-tests have been logged.  [To run self-tests, use: smartctl -t]                                                             
                                                                                                                                   
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay. 
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
The failures aren't surprising: "Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 934707". I presume this is a WD Green and you didn't run wdidle3.exe on it before using it. I'd buy a replacement for this one too.

If you buy another WD Green, you'll want to run wdidle3 on it/them, so you don't have these huge LCC numbers.
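To put that LCC number in perspective, here is a rough sketch that divides it by the power-on hours reported for the same drive (attributes 193 and 9 in the ada1 output above). It assumes the head parking was spread evenly over the drive's life, which is a simplification.

```python
load_cycle_count = 934707   # raw value of attribute 193 on ada1
power_on_hours = 7989       # raw value of attribute 9 on ada1

cycles_per_hour = load_cycle_count / power_on_hours
seconds_per_cycle = 3600 / cycles_per_hour

print(f"{cycles_per_hour:.0f} load cycles per hour, one every {seconds_per_cycle:.0f} s")
```

That is roughly 117 load cycles per hour, or the heads parking about twice a minute for the whole time the drive has been powered on.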
 
Joined
Jun 26, 2012
Messages
260
So many errors and mistakes on this build as I review the various steps.
Sigh.
 