Kevin Horton
Overnight, I received an Alert email from one of my FreeNAS servers:
FreeNAS @ big_bertha.local
New alerts:
* Device: /dev/ada1, Self-Test Log error count increased from 0 to 1
=============
I logged in via SSH, checked the drive's SMART status, and found that the most recent SMART short test had logged a read failure:
Code:
smartctl -x /dev/ada1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E1HY83CJ
LU WWN Device Id: 5 0014ee 20d3e05cc
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 05:36:38 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 117) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (52560) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 526) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   183   175   021    -    7833
  4 Start_Stop_Count        -O--CK   099   099   000    -    1200
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23413
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    87
192 Power-Off_Retract_Count -O--CK   200   200   000    -    75
193 Load_Cycle_Count        -O--CK   200   200   000    -    1273
194 Temperature_Celsius     -O---K   115   111   000    -    37
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      39  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       50%     23404         3178608
# 2  Short offline       Completed without error       00%     23380         -
# 3  Short offline       Completed without error       00%     23356         -
# 4  Short offline       Completed without error       00%     23332         -
# 5  Extended offline    Completed without error       00%     23318         -
# 6  Short offline       Completed without error       00%     23308         -
# 7  Short offline       Completed without error       00%     23284         -
# 8  Short offline       Completed without error       00%     23260         -
# 9  Short offline       Completed without error       00%     23236         -
#10  Short offline       Completed without error       00%     23212         -
#11  Short offline       Completed without error       00%     23188         -
#12  Short offline       Completed without error       00%     23164         -
#13  Extended offline    Completed without error       00%     23150         -
#14  Short offline       Completed without error       00%     23140         -
#15  Short offline       Completed without error       00%     23116         -
#16  Short offline       Completed without error       00%     23092         -
#17  Short offline       Completed without error       00%     23068         -
#18  Short offline       Completed without error       00%     23044         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    37 Celsius
Power Cycle Min/Max Temperature:     29/39 Celsius
Lifetime    Min/Max Temperature:      3/41 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (217)

Index    Estimated Time   Temperature Celsius
 218    2019-09-24 21:39    36  *****************
 ...    ..(  8 skipped).    ..  *****************
 227    2019-09-24 21:48    36  *****************
 228    2019-09-24 21:49    37  ******************
 ...    ..(420 skipped).    ..  ******************
 171    2019-09-25 04:50    37  ******************
 172    2019-09-25 04:51    36  *****************
 ...    ..( 20 skipped).    ..  *****************
 193    2019-09-25 05:12    36  *****************
 194    2019-09-25 05:13    37  ******************
 ...    ..( 19 skipped).    ..  ******************
 214    2019-09-25 05:33    37  ******************
 215    2019-09-25 05:34    36  *****************
 216    2019-09-25 05:35    36  *****************
 217    2019-09-25 05:36    36  *****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            7  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            7  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4      4659470  Vendor specific
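For anyone who wants to pull the failed entry out of a self-test log like this programmatically, here is a minimal Python sketch. It assumes the log rows look like the ones in this output; the function names and the regular expression are my own illustration, not part of smartmontools. It also converts the reported LBA to a byte offset, assuming the LBA counts 512-byte logical sectors as listed under "Sector Sizes".

```python
import re

# Excerpt of the self-test log rows from the smartctl output above.
SELFTEST_LOG = """\
# 1  Short offline       Completed: read failure       50%     23404         3178608
# 2  Short offline       Completed without error       00%     23380         -
# 5  Extended offline    Completed without error       00%     23318         -
"""

# Hypothetical pattern for one log row: number, test type, status,
# percent remaining, power-on hours, and LBA of first error (or "-").
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def failed_tests(log_text):
    """Return the rows whose status is not 'Completed without error'."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        if match["status"] == "Completed without error":
            continue
        failures.append({
            "num": int(match["num"]),
            "test": match["test"],
            "status": match["status"],
            "hours": int(match["hours"]),
            "lba": None if match["lba"] == "-" else int(match["lba"]),
        })
    return failures


def lba_to_byte_offset(lba, logical_sector_bytes=512):
    """Byte offset of an LBA, assuming it counts logical sectors."""
    return lba * logical_sector_bytes


for failure in failed_tests(SELFTEST_LOG):
    print(failure, lba_to_byte_offset(failure["lba"]))
```

The LBA found this way could then be fed to a targeted re-test of just that region, for example with smartctl's selective self-test option (`-t select,N-M`).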
I ran a long SMART test, and it passed:
Code:
smartctl -x /dev/ada1
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E1HY83CJ
LU WWN Device Id: 5 0014ee 20d3e05cc
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 19:35:14 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52560) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 526) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   183   175   021    -    7833
  4 Start_Stop_Count        -O--CK   099   099   000    -    1200
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23427
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    87
192 Power-Off_Retract_Count -O--CK   200   200   000    -    75
193 Load_Cycle_Count        -O--CK   200   200   000    -    1273
194 Temperature_Celsius     -O---K   115   111   000    -    37
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    9
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      39  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     23423         -
# 2  Short offline       Completed: read failure       50%     23404         3178608
# 3  Short offline       Completed without error       00%     23380         -
# 4  Short offline       Completed without error       00%     23356         -
# 5  Short offline       Completed without error       00%     23332         -
# 6  Extended offline    Completed without error       00%     23318         -
# 7  Short offline       Completed without error       00%     23308         -
# 8  Short offline       Completed without error       00%     23284         -
# 9  Short offline       Completed without error       00%     23260         -
#10  Short offline       Completed without error       00%     23236         -
#11  Short offline       Completed without error       00%     23212         -
#12  Short offline       Completed without error       00%     23188         -
#13  Short offline       Completed without error       00%     23164         -
#14  Extended offline    Completed without error       00%     23150         -
#15  Short offline       Completed without error       00%     23140         -
#16  Short offline       Completed without error       00%     23116         -
#17  Short offline       Completed without error       00%     23092         -
#18  Short offline       Completed without error       00%     23068         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    37 Celsius
Power Cycle Min/Max Temperature:     29/39 Celsius
Lifetime    Min/Max Temperature:      3/41 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (98)

Index    Estimated Time   Temperature Celsius
  99    2019-09-25 11:38    37  ******************
 ...    ..( 92 skipped).    ..  ******************
 192    2019-09-25 13:11    37  ******************
 193    2019-09-25 13:12    39  ********************
 ...    ..( 13 skipped).    ..  ********************
 207    2019-09-25 13:26    39  ********************
 208    2019-09-25 13:27    38  *******************
 ...    ..( 34 skipped).    ..  *******************
 243    2019-09-25 14:02    38  *******************
 244    2019-09-25 14:03    39  ********************
 ...    ..( 12 skipped).    ..  ********************
 257    2019-09-25 14:16    39  ********************
 258    2019-09-25 14:17    38  *******************
 ...    ..( 20 skipped).    ..  *******************
 279    2019-09-25 14:38    38  *******************
 280    2019-09-25 14:39    39  ********************
 ...    ..(  6 skipped).    ..  ********************
 287    2019-09-25 14:46    39  ********************
 288    2019-09-25 14:47    38  *******************
 ...    ..( 20 skipped).    ..  *******************
 309    2019-09-25 15:08    38  *******************
 310    2019-09-25 15:09    39  ********************
 ...    ..(  7 skipped).    ..  ********************
 318    2019-09-25 15:17    39  ********************
 319    2019-09-25 15:18    38  *******************
 ...    ..( 25 skipped).    ..  *******************
 345    2019-09-25 15:44    38  *******************
 346    2019-09-25 15:45    39  ********************
 ...    ..(  5 skipped).    ..  ********************
 352    2019-09-25 15:51    39  ********************
 353    2019-09-25 15:52    38  *******************
 ...    ..( 24 skipped).    ..  *******************
 378    2019-09-25 16:17    38  *******************
 379    2019-09-25 16:18    39  ********************
 ...    ..(  5 skipped).    ..  ********************
 385    2019-09-25 16:24    39  ********************
 386    2019-09-25 16:25    38  *******************
 ...    ..( 24 skipped).    ..  *******************
 411    2019-09-25 16:50    38  *******************
 412    2019-09-25 16:51    39  ********************
 ...    ..(  3 skipped).    ..  ********************
 416    2019-09-25 16:55    39  ********************
 417    2019-09-25 16:56    38  *******************
 ...    ..(  5 skipped).    ..  *******************
 423    2019-09-25 17:02    38  *******************
 424    2019-09-25 17:03    39  ********************
 425    2019-09-25 17:04    39  ********************
 426    2019-09-25 17:05    39  ********************
 427    2019-09-25 17:06    38  *******************
 ...    ..( 27 skipped).    ..  *******************
 455    2019-09-25 17:34    38  *******************
 456    2019-09-25 17:35    37  ******************
 ...    ..(119 skipped).    ..  ******************
  98    2019-09-25 19:35    37  ******************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            7  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            7  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4      4709736  Vendor specific
I'm not seeing anything of concern in the latest SMART output, other than the read error from the failed short test. Am I missing something? Is there anything else I should check?
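One mechanical way to check is to diff the attribute tables from the two runs instead of comparing them by eye. The sketch below is only an illustration: the helper names are hypothetical, and the embedded rows are an excerpt of the two tables above rather than the full output. It reports which attributes changed; what a changed vendor-specific counter actually means is outside what the script can tell you.

```python
# Excerpt of the attribute rows from the first run (after the failed short test).
BEFORE = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23413
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
"""

# The same rows from the second run (after the long test passed).
AFTER = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   068   068   000    -    23427
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    9
"""


def parse_attributes(text):
    """Map attribute name -> VALUE, WORST and RAW_VALUE (kept as strings)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": fields[3],
                "worst": fields[4],
                "raw": fields[7],
            }
    return attrs


def diff_attributes(before, after):
    """Return {name: {field: (old, new)}} for every field that changed."""
    changes = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # attribute missing from the second run
        changed = {k: (old[k], new[k]) for k in old if old[k] != new[k]}
        if changed:
            changes[name] = changed
    return changes


for name, fields in diff_attributes(
    parse_attributes(BEFORE), parse_attributes(AFTER)
).items():
    print(name, fields)
```

On this excerpt the differences are in Seek_Error_Rate (VALUE and WORST), Power_On_Hours (raw value, as expected between runs), and Multi_Zone_Error_Rate (raw value 0 to 9).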
The pool in question is an 8-disk RAIDZ2 pool that is a backup of my main pool. I have another backup on a second local server, and an offsite rsync backup on a two-disk stripe. I have one badblocks-tested spare drive on the shelf. Of course this occurs immediately before I head on the road for seven to ten days. I'm considering adding this drive to the pool as a spare, in case the disk fails while I'm away (I know that I must be careful not to add it to the vdev as a stripe).
If I add the disk as a spare to the backup pool, can I remove it later in case the main pool is the first pool to suffer an actual disk failure?
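For reference, at the ZFS level the two operations in question correspond to `zpool add <pool> spare <device>` and `zpool remove <pool> <device>`; `zpool remove` supports hot spares as long as the spare is not currently in use. The Python sketch below only builds and prints those command lines, so nothing is executed. The pool name `backup` and device `da8` are placeholders I made up, and FreeNAS normally expects pool changes to go through its web UI, so treat this as an illustration of the underlying commands rather than a recommended procedure.

```python
def add_spare_cmd(pool, device, dry_run=True):
    """Build 'zpool add' for a hot spare.

    The 'spare' keyword is always included: leaving it out would add the
    disk as a new top-level (striped) vdev instead of a spare.
    With dry_run=True the -n flag is added, which makes zpool show the
    resulting layout without changing the pool.
    """
    cmd = ["zpool", "add"]
    if dry_run:
        cmd.append("-n")
    cmd += [pool, "spare", device]
    return cmd


def remove_spare_cmd(pool, device):
    """Build 'zpool remove' to take an unused hot spare back out of the pool."""
    return ["zpool", "remove", pool, device]


# Hypothetical pool and device names, for illustration only.
POOL = "backup"
SPARE = "da8"

print(" ".join(add_spare_cmd(POOL, SPARE)))                  # preview the change
print(" ".join(add_spare_cmd(POOL, SPARE, dry_run=False)))   # actually add the spare
print(" ".join(remove_spare_cmd(POOL, SPARE)))               # later: free the disk again
```

Running the dry-run form first is a cheap guard against the stripe mistake: the previewed layout should list the disk under `spares`, not as a new data vdev.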