Level of CKSUM Errors that should be a cause for concern?

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
All,

I recently got the following after a scrub:

Code:
  pool: tankname
 state: ONLINE 
status: One or more devices has experienced an unrecoverable error. An 
        attempt was made to correct the error. Applications are unaffected. 
action: Determine if the device needs to be replaced, and clear the errors 
        using 'zpool clear' or replace the device with 'zpool replace'. 
   see: http://illumos.org/msg/ZFS-8000-9P 
  scan: scrub repaired 576K in 17h13m with 0 errors on Mon Nov 3 17:14:04 2014 
config: 

 NAME                                                STATE READ WRITE CKSUM 
 tankname                                            ONLINE   0     0     0 
  raidz2-0                                           ONLINE   0     0     0 
    gptid/926fbb61-e5d6-11e3-985e-d050990a66d5       ONLINE   0     0     0 
    gptid/9350f34f-e5d6-11e3-985e-d050990a66d5       ONLINE   0     0     0 
    gptid/9430b075-e5d6-11e3-985e-d050990a66d5       ONLINE   0     0     3 
    gptid/9514062d-e5d6-11e3-985e-d050990a66d5       ONLINE   0     0     6 

errors: No known data errors 
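For anyone wanting to total those counters without reading the table by eye, here is a minimal sketch (not from the original post). It assumes the leaf devices are listed by `gptid/...` name, as in the output above, and that the counts are plain integers; very large counts that ZFS abbreviates (for example with a K suffix) are not handled. The function names are made up for illustration.

```shell
# Sum the CKSUM column over leaf devices in `zpool status` output read from stdin.
# Assumption: leaf devices are the lines whose name starts with "gptid/".
cksum_total() {
  awk '$1 ~ /^gptid\// && NF >= 5 { total += $5 } END { print total + 0 }'
}

# Print only the devices with a nonzero CKSUM count, as "name count".
cksum_offenders() {
  awk '$1 ~ /^gptid\// && NF >= 5 && $5 > 0 { print $1, $5 }'
}

# Typical use on a live system (tankname is the pool name from this thread):
#   zpool status tankname | cksum_total
#   zpool status tankname | cksum_offenders
```

Fed the listing above, the total is 3 + 6 = 9, spread over two of the four disks.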


Because ZFS error correction is doing its job and I don't expect 4TB WD Red disks to be perfect forever (or even for a few months, in this case), I don't think the number of errors is high enough to warrant concern on a home system. I'll obviously keep an eye on the disk stats. Here are the smartctl attributes for the two disks:

Code:
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                            
  3 Spin_Up_Time            0x0027   179   179   021    Pre-fail  Always       -       8033                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4084                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       94                                          
194 Temperature_Celsius     0x0022   115   111   000    Old_age   Always       -       37                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0  


and

Code:
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                    
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1                                            
  3 Spin_Up_Time            0x0027   181   181   021    Pre-fail  Always       -       7933                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                            
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                            
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4084                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0                                            
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                            
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37                                          
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       94                                          
194 Temperature_Celsius     0x0022   115   111   000    Old_age   Always       -       37                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                            
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                            
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                            
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                            
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0  
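A rough way to scan attribute tables like the two above is to print only the "watched" attributes whose raw value is nonzero. This is a sketch under assumptions: the watch list (IDs 5, 196, 197, 198, 199) is a commonly cited shortlist rather than any official failure criterion, and the function name `smart_flags` is made up for illustration.

```shell
# Print watched SMART attributes with a nonzero raw value, reading
# `smartctl -A` output from stdin. Output format: "ID NAME RAW_VALUE".
# Assumption: the ten-column ATA attribute layout shown in this thread.
smart_flags() {
  awk '
    BEGIN {
      split("5 196 197 198 199", ids, " ")
      for (i in ids) watch[ids[i]] = 1
    }
    # Attribute rows start with a numeric ID and have at least ten columns.
    $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in watch) && ($10 + 0) > 0 {
      print $1, $2, $10
    }
  '
}

# Typical use (adaX is a placeholder device name):
#   smartctl -A /dev/adaX | smart_flags
```

On both tables above this prints nothing, since every watched raw value is 0. The Raw_Read_Error_Rate of 1 on the second disk is not on this watch list.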

So, comments?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Eh, depending on when you did the last scrub you might want to be concerned. Those checksum errors are nothing more than "silent corruption" that ZFS caught. Catching that many, when they may have developed in 2 weeks, is probably not a good sign. But your SMART data shows everything is good. Maybe you have dirty power and/or a crappy/failing PSU?
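To illustrate what "silent corruption" means here, the sketch below changes one byte of a file in place (no I/O error is raised) and shows that only a stored checksum reveals the change. It is a toy: ZFS keeps its own per-block checksums, and the POSIX `cksum` utility is used purely as a stand-in. The function name is made up for illustration.

```shell
# Toy demonstration of detecting silent corruption with a stored checksum.
# POSIX cksum stands in for the checksums ZFS keeps per block.
demo_silent_corruption() {
  tmp=$(mktemp) || return 1
  printf 'important data\n' > "$tmp"
  stored=$(cksum < "$tmp")          # checksum recorded at write time

  # Simulate bit rot: overwrite one byte in place; the write itself succeeds.
  printf 'X' | dd of="$tmp" bs=1 seek=3 conv=notrunc 2>/dev/null

  current=$(cksum < "$tmp")         # checksum recomputed at read/scrub time
  rm -f "$tmp"

  if [ "$stored" != "$current" ]; then
    echo "checksum mismatch detected"
  else
    echo "no corruption detected"
  fi
}
```

In a redundant pool such as the raidz2 in this thread, a mismatch like this is what lets a scrub repair the bad copy from the remaining good data.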
 

DaPlumber

Eh, lights have flickered a few times this past week, but system is on a UPS. Switched UPS, but still.

Scrub is on a 35-day schedule, but it's a relatively light-duty home system: backups, iTunes library, BitTorrent (Transmission jail), occasional VBox. Shares are AFP only.

I think I'm going to let sleeping datasets lie and "zpool clear". Fixing silent corruption is one of my reasons for requiring ZFS anyway. No, this was not the pool I was doing demo zdb commands on... ;)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
9 errors does sound like a lot, considering the SMART data.

When was the last Long Test run?
 

DaPlumber

Quite a while ago, months even. Is there any way to do long tests without effectively off-lining the drive?
 

Ericloewe

They run fine with the drive online (shared resources so performance is lower, yadda yadda...). Just schedule them from the GUI and/or manually use
Code:
smartctl -t long /dev/adaX

if memory serves me right.

EDIT:

The smartmontools man page confirms the above.
Offline tests have to be explicitly invoked. Online tests are implicit.
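If there are several disks to test, the same command can be wrapped in a small loop. This is only a sketch: the function name and the `DRY_RUN` switch are made up for illustration, and device names such as ada0 and ada1 are placeholders to be replaced with the real ones.

```shell
# Start an extended (long) self-test on each device named as an argument.
# With DRY_RUN=1 the commands are only printed, so the list can be checked first.
start_long_tests() {
  for dev in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "smartctl -t long /dev/$dev"
    else
      smartctl -t long "/dev/$dev"
    fi
  done
}

# Typical use (placeholder device names):
#   DRY_RUN=1 start_long_tests ada0 ada1   # print the commands only
#   start_long_tests ada0 ada1             # actually start the tests
```

The tests run inside the drives themselves, so the loop returns right away; results show up later in the self-test log (`smartctl -l selftest /dev/adaX`).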
 

DaPlumber

Long test scheduled via GUI for once a month. (See Cyberjock: I will so too willingly use the GUI! :p)

Currently in progress, fired off at midnight and after 10 hours is 90% done. Short tests run daily and all are good so far.

I really ought to get to know S.M.A.R.T. better.

BTW, for 9.3, is there a better GUI display for mapping devices to zpools and for viewing S.M.A.R.T. results?
 

Ericloewe

I think the big GUI changes are scheduled for 10.1.
 

cyberjock

I'm trying to avoid discussions for 9.3 outside the 9.3 section of the forum. Of course, you could give 9.3 a try and see for yourself too. ;)
 

DaPlumber

OK, so the "long" or "Extended offline" tests all "Completed without error". :confused:
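For reference, that status can be pulled out of the self-test log with a short filter. This is a sketch that assumes the usual ATA log layout printed by `smartctl -l selftest`, where entry "# 1" is the most recent; the function name is made up for illustration.

```shell
# Print the status text of the most recent extended test, reading
# `smartctl -l selftest` output from stdin.
latest_long_status() {
  awk '/^# *[0-9]+ +Extended offline/ {
         line = $0
         # Drop the entry number and the test description.
         sub(/^# *[0-9]+ +Extended offline +/, "", line)
         # The status text ends where the "remaining" percentage column starts.
         sub(/ +[0-9]+% .*$/, "", line)
         print line
         exit
       }'
}

# Typical use (adaX is a placeholder device name):
#   smartctl -l selftest /dev/adaX | latest_long_status
```

A clean result does not explain the checksum errors; it only means the drive's own surface scan found nothing to report.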
 

DaPlumber

OK, bad forum discipline on my part, my bad. Any chance you could take pity and PM me with the answer or a pointer to the appropriate thread in the correct forum? Because:

Sure I'll run 9.3 Beta in my copious spare time and on one of my vast number of lab systems. :rolleyes::p (I don't think there's a BB sarcasm tag? ;))
 