Hello community,
I recently ran into a problem on FreeNAS 11.1: I got SMART error alerts and the pool went degraded.
The first time, I replaced a drive. Two days later, the error appeared on another drive.
The HDDs are 10 months old, in a rack, in a 17 °C room, on a UPS. They are 8x 4 TB TOSHIBA N300 NAS drives (model HDWQ140).
After a long search through the forums and the update log, I found some tickets about this, and it seems that some SMART error detection issues were fixed in newer FreeNAS versions.
So I upgraded to 11.1-U6. All the errors were gone (as they are each time I clear the errors or reboot the system), but 3 days later the problem came back on another HDD.
Here is /log/messages for this drive:
Here is the volume status:
The computer is a Supermicro with a Xeon E5-2609 and 32 GB of ECC RAM.
The controller is an LSI (Symbios Logic) SAS3008.
The extended smartctl output for the "faulty" drive says:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA HDWQ140
Serial Number: xxxxxxxxxxxx
LU WWN Device Id: 5 000039 78b802216
Firmware Version: FJ1M
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Oct 16 17:48:51 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Disabled
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 451) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate PO-R-- 100 100 050 - 0
2 Throughput_Performance P-S--- 100 100 050 - 0
3 Spin_Up_Time POS--K 100 100 001 - 6844
4 Start_Stop_Count -O--CK 100 100 000 - 10
5 Reallocated_Sector_Ct PO--CK 100 100 050 - 0
7 Seek_Error_Rate PO-R-- 100 100 050 - 0
8 Seek_Time_Performance P-S--- 100 100 050 - 0
9 Power_On_Hours -O--CK 084 084 000 - 6626
10 Spin_Retry_Count PO--CK 100 100 030 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 10
191 G-Sense_Error_Rate -O--CK 100 100 000 - 2
192 Power-Off_Retract_Count -O--CK 100 100 000 - 8
193 Load_Cycle_Count -O--CK 100 100 000 - 96
194 Temperature_Celsius -O---K 100 100 000 - 27 (Min/Max 16/34)
196 Reallocated_Event_Count -O--CK 100 100 000 - 0
197 Current_Pending_Sector -O--CK 100 100 000 - 0
198 Offline_Uncorrectable ----CK 100 100 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 253 000 - 0
220 Disk_Shift -O---- 100 100 000 - 0
222 Loaded_Hours -O--CK 084 084 000 - 6605
223 Load_Retry_Count -O--CK 100 100 000 - 0
224 Load_Friction -O---K 100 100 000 - 0
226 Load-in_Time -OS--K 100 100 000 - 645
240 Head_Flying_Hours P----- 100 100 001 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 51 Comprehensive SMART error log
0x03 GPL R/O 64 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x24 GPL R/O 12288 Current Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa7 GPL VS 8 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 1 (0x0001)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 27 Celsius
Power Cycle Min/Max Temperature: 25/31 Celsius
Lifetime Min/Max Temperature: 16/34 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 5/55 Celsius
Min/Max Temperature Limit: 5/55 Celsius
Temperature History Size (Index): 478 (429)
Index Estimated Time Temperature Celsius
430 2018-10-16 09:52 26 *******
---------REMOVED TEMP INFO BY USER-----------------
... ..( 6 skipped). .. ********
429 2018-10-16 17:49 27 ********
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 2) ==
0x01 0x008 4 10 --- Lifetime Power-On Resets
0x01 0x010 4 6626 --- Power-on Hours
0x01 0x018 6 13742275084 --- Logical Sectors Written
0x01 0x020 6 300534987 --- Number of Write Commands
0x01 0x028 6 6492664357 --- Logical Sectors Read
0x01 0x030 6 256839928 --- Number of Read Commands
0x02 ===== = = === == Free-Fall Statistics (rev 1) ==
0x02 0x010 4 2 --- Overlimit Shock Events
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 6626 --- Spindle Motor Power-on Hours
0x03 0x010 4 6605 --- Head Flying Hours
0x03 0x018 4 96 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 27 --- Current Temperature
0x05 0x010 1 27 N-- Average Short Term Temperature
0x05 0x018 1 26 N-- Average Long Term Temperature
0x05 0x020 1 34 --- Highest Temperature
0x05 0x028 1 16 --- Lowest Temperature
0x05 0x030 1 32 N-- Highest Average Short Term Temperature
0x05 0x038 1 25 N-- Lowest Average Short Term Temperature
0x05 0x040 1 28 N-- Highest Average Long Term Temperature
0x05 0x048 1 26 N-- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 55 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 5 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 30 --- Number of Hardware Resets
0x06 0x018 4 0 --- Number of Interface CRC Errors
0x07 ===== = = === == Solid State Device Statistics (rev 1) ==
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) supported [please try: '-l defects']
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 6 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 6 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
I suspect these last lines point to the cause of the error:
0x0009 4 6 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 6 Device-to-host register FISes sent due to a COMRESET
because the message log says "UNIT ATTENTION (power on, reset, or bus device reset occurred)" (see the screenshot above).
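To check whether the other drives show the same link resets, the non-zero SATA Phy event counters can be pulled out of each drive's smartctl output and compared. The following is only a rough sketch: the function name `nonzero_phy_counters` and the sample text are made up for illustration, and it assumes the counter lines have the same "ID Size Value Description" layout as in the output pasted above.

```python
import re

# One counter line looks like:
# 0x0009 4 6 Transition from drive PhyRdy to drive PhyNRdy
PHY_LINE = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+)$")


def nonzero_phy_counters(smartctl_text):
    """Return {counter_id: (value, description)} for non-zero Phy counters."""
    found = {}
    in_section = False
    for line in smartctl_text.splitlines():
        line = line.strip()
        # Only start parsing once the Phy counter section header is seen,
        # so other tables in the smartctl output are ignored.
        if line.startswith("SATA Phy Event Counters"):
            in_section = True
            continue
        if not in_section:
            continue
        match = PHY_LINE.match(line)
        if match:
            counter_id, _size, value, description = match.groups()
            if int(value) != 0:
                found[counter_id] = (int(value), description)
    return found


# Sample text copied from the output above (shortened).
sample = """\
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0009 4 6 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 6 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
"""

result = nonzero_phy_counters(sample)
for counter_id, (value, description) in result.items():
    print(counter_id, value, description)
```

Feeding it the saved smartctl output of each of the 8 drives would show whether the PhyRdy/COMRESET counters are non-zero on every drive or only on the ones that were reported as faulty.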
What can I do?
The HDDs are probably not bad: the computer is top quality, the HDDs are built for NAS use, the temperatures are good, all the hardware is brand new, and FreeNAS is on the latest 11.1-U6 version.
Help!
Thanks a lot.