What do the SMART messages mean?

Fei · Apr 12, 2016

Hi

Today I upgraded my FreeNAS to the latest 9.3.1 release, and the system showed a critical alert after the reboot.

Code:

[root@160nas] /usr/local/etc# cat /var/log/messages |grep smart
Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada4, FAILED SMART self-check. BACK UP DATA NOW!
Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada4, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada2, 2 Offline uncorrectable sectors

I checked my pool and it doesn't show any problems, so what do these messages mean?

Code:

pool: vol
state: ONLINE
  scan: scrub repaired 0 in 7h28m with 0 errors on Sat Apr  2 04:15:08 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol                                             ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/5d30fdd7-0668-11e4-8362-6805ca089be0  ONLINE       0     0     0
            gptid/5d98120a-0668-11e4-8362-6805ca089be0  ONLINE       0     0     0
            gptid/5decc3f1-0668-11e4-8362-6805ca089be0  ONLINE       0     0     0
            gptid/5e524f24-0668-11e4-8362-6805ca089be0  ONLINE       0     0     0

errors: No known data errors

Bidule0hm · Apr 12, 2016

ada2 and ada4 are starting to have bad sectors. Usually a few bad sectors aren't a problem, but if the number rises you'll want to replace the drives ;)

However, as you have a RAID-Z1, you can't afford to have two drives failing at the same time, so replace ada4 (as it seems to be in a worse state than ada2) before it's too late. I hope you have a backup of your data just in case.

The output of smartctl -a /dev/adaX for each drive would be useful to be more precise and to confirm the answer :)

danb35 · Apr 12, 2016

Fei said:
I checked my pool and it doesn't show any problems, so what do these messages mean?

The messages mean exactly what they say: ada4 has failed a SMART self-test and has too many reallocated sectors, and ada2 has two offline/unreadable sectors. You need to replace ada4 immediately, and should keep an eye on ada2. There's no reason you should expect that this would show up in your pool status--it could, but it's also possible that the bad sectors are in locations with no data at the moment.
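To make the reading of those log lines concrete, here is a minimal Python sketch that groups smartd messages by device. The regular expression is an assumption based only on the line format quoted in this thread (including the optional "[SAT]" tag that shows up later), and the function name `group_smartd_messages` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the smartd syslog lines quoted in this thread, e.g.
# "... smartd[2586]: Device: /dev/ada4, FAILED SMART self-check. ..."
# The optional "[SAT]" tag appears when the disk sits behind an HBA.
SMARTD_LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>/dev/\S+?)(?: \[[^\]]+\])?, (?P<msg>.+)$"
)


def group_smartd_messages(lines):
    """Return {device: [message, ...]} for every smartd line in `lines`."""
    by_device = defaultdict(list)
    for line in lines:
        match = SMARTD_LINE.search(line)
        if match:
            by_device[match.group("dev")].append(match.group("msg").strip())
    return dict(by_device)


# The three lines from the first post.
sample = [
    "Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada4, FAILED SMART self-check. BACK UP DATA NOW!",
    "Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada4, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.",
    "Apr 13 00:20:03 160nas smartd[2586]: Device: /dev/ada2, 2 Offline uncorrectable sectors",
]

grouped = group_smartd_messages(sample)
for dev in sorted(grouped):
    print(dev)
    for msg in grouped[dev]:
        print("   ", msg)
```

Grouping by device makes it easier to see that ada4 carries two separate warnings while ada2 carries only one.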

Fei · Apr 12, 2016

Bidule0hm said:
The output of smartctl -a /dev/adaX for each drive would be useful to be more precise and to confirm the answer :)

Hi

I ran smartctl -a on ada2 and ada4 (output attached).

Bidule0hm · Apr 12, 2016

Ok, ada4 is failing hard, almost 2k reallocated sectors...

No SMART test has ever run on these drives; did you set up the tests in the web GUI?

Then I wanted to know why two drives with less than 6k POH are failing, and I saw this:

194 Temperature_Celsius 0x0022 082 080 000 Old_age Always - 70

You're probably the winner of the hottest drives I've seen on this forum... Drives should be kept under 40 °C; even 5 or 10 °C above that is very concerning, and yours are 30 °C above it. You should stop the server right now and rethink the thermal design, because with a 4x 3 TB RAID-Z1 you'll probably lose your data sooner rather than later.
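As a rough illustration of how such an attribute row can be checked, the Python sketch below splits a `smartctl -a` attribute line into fields and flags two conditions: a normalized value at or below its threshold, and a temperature above the 40 °C limit suggested in this thread. The names `SmartAttribute`, `parse_attribute_row`, `find_problems` and `MAX_TEMP_C` are invented for this example, and the 40 °C figure is forum advice, not a vendor specification.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized value reported by the drive
    worst: int
    thresh: int
    when_failed: str  # "-" when the attribute has never failed
    raw: int


def parse_attribute_row(line):
    """Parse one row of the smartctl attribute table (whitespace separated)."""
    fields = line.split()
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        when_failed=fields[8],
        raw=int(fields[9]),  # assumes the raw value starts with a plain integer
    )


MAX_TEMP_C = 40  # limit suggested in this thread, not a vendor figure


def find_problems(attr):
    """Return a list of human-readable warnings for one attribute."""
    problems = []
    # Simplification: a threshold of 000 is treated as "no threshold".
    if attr.thresh > 0 and attr.value <= attr.thresh:
        problems.append(f"{attr.name}: value {attr.value} <= threshold {attr.thresh}")
    if attr.name == "Temperature_Celsius" and attr.raw > MAX_TEMP_C:
        problems.append(f"{attr.name}: {attr.raw} C is above {MAX_TEMP_C} C")
    return problems


temp = parse_attribute_row(
    "194 Temperature_Celsius 0x0022 082 080 000 Old_age Always - 70"
)
for warning in find_problems(temp):
    print(warning)
```

The same function also flags a row such as attribute 5 on a drive whose normalized value has dropped below its threshold.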

danb35 · Apr 12, 2016

Next time you post SMART results, please just post them inline in code tags, rather than as attachments. Other than that, what @Bidule0hm said--you've got major problems:

You're cooking your drives. You need to get your server out of the attic, get proper airflow, and get the drives cooled right down. Point a household fan at the server, if that's the best you can do. It may be too late for the drives that are in there, but it'll still pay dividends for the replacements.
You need to configure the SMART service to email you alerts, and to warn at a sensible level (i.e., not more than 40 C) for temperature.
You need to schedule regular SMART tests, as these drives have never seen one. I run a short test daily and a long test weekly; you could go as long as a short test weekly and a long test monthly.

Run long SMART tests on all your drives ('smartctl -t long /dev/adaX'). Once those finish (about 6-8 hours), post the results of 'smartctl -x /dev/adaX' for each drive, inline, in code tags.

Fei · Apr 12, 2016

danb35 said:
Run long SMART tests on all your drives ('smartctl -t long /dev/adaX'). Once those finish (about 6-8 hours), post the results of 'smartctl -x /dev/adaX' for each drive, inline, in code tags.

When I run the 'smartctl -t long' test, how do I check whether it has finished?

gpsguy · Apr 13, 2016

If you run smartctl -a /dev/adaX and look at the summary info near the end, the test will say completed if it's finished.
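For anyone who wants to check this from a script, here is a minimal Python sketch that parses one row of the self-test log printed near the end of `smartctl -a`. The regular expression is an assumption based on the column layout seen in the output posted in this thread, and `parse_selftest_row` is a made-up helper name. It only looks at the log row; the "Self-test execution status" field in the same output is worth checking as well, since the newest log row may belong to an older test.

```python
import re

# Columns: Num, Test_Description, Status, Remaining, LifeTime(hours),
# LBA_of_first_error. Description and status may contain single spaces,
# so the split between them relies on a run of two or more spaces.
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_row(line):
    """Return a dict describing one self-test log row, or None if no match."""
    match = SELFTEST_ROW.match(line.strip())
    if match is None:
        return None
    status = match.group("status")
    return {
        "num": int(match.group("num")),
        "test": match.group("test"),
        "status": status,
        "remaining_percent": int(match.group("remaining")),
        "lifetime_hours": int(match.group("hours")),
        # "Completed ..." covers both clean runs and completed-with-failure.
        "finished": status.startswith("Completed"),
        "passed": status == "Completed without error",
    }


row = parse_selftest_row(
    "# 1  Extended offline    Completed without error       00%      5957         -"
)
print(row["test"], "| finished:", row["finished"], "| passed:", row["passed"])
```

A row whose status is "Aborted by host" comes back with `finished` set to False, which matches an interrupted test rather than a completed one.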


Fei · Apr 13, 2016

Hi

I powered off my FreeNAS and added an internal SAS HBA to improve disk I/O (3 Gb/s -> 6 Gb/s), so the device names of the failing disks changed:
ada2 -> da1
ada4 -> da2

Code:

Apr 13 11:38:52 160nas smartd[8655]: Device: /dev/da2 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Apr 13 11:38:52 160nas smartd[8655]: Device: /dev/da2 [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Apr 13 11:38:52 160nas smartd[8655]: Device: /dev/da2 [SAT], previous self-test completed with error (unknown test element)
Apr 13 11:38:52 160nas smartd[8655]: Device: /dev/da1 [SAT], 2 Offline uncorrectable sectors

The SMART test results are below. I found many errors, so I will replace these disks.

Code:

[root@160nas] ~# smartctl -a /dev/da1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD3001FAEX-00MJRA0
Serial Number:    WD-WMC130049225
LU WWN Device Id: 5 0014ee 0036b2aed
Firmware Version: 01.01L01
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 13 22:49:35 2016 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (35040) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 380) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   163   145   021    Pre-fail  Always       -       10841
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       494
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       5958
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       494
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       418
194 Temperature_Celsius     0x0022   091   080   000    Old_age   Always       -       61
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5957         -
# 2  Extended offline    Aborted by host               90%      5947         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Code:

[root@160nas] ~# smartctl -a /dev/da2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p31 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black
Device Model:     WDC WD3001FAEX-00MJRA0
Serial Number:    WD-WCC130244440
LU WWN Device Id: 5 0014ee 207edb281
Firmware Version: 01.01L01
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 13 22:51:11 2016 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  73) The previous self-test completed having
                                        a test element that failed and the test
                                        element that failed is not known.
Total time to complete Offline
data collection:                (35100) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 381) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   165   021    Pre-fail  Always       -       10041
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       501
  5 Reallocated_Sector_Ct   0x0033   133   133   140    Pre-fail  Always   FAILING_NOW 1966
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       5959
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       501
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       82
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       418
194 Temperature_Celsius     0x0022   092   080   000    Old_age   Always       -       60
196 Reallocated_Event_Count 0x0032   166   166   000    Old_age   Always       -       34
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       18

SMART Error Log Version: 1
ATA Error Count: 30 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 30 occurred at disk power-on lifetime: 5313 hours (221 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 cb 66 41 e0  Error: UNC 176 sectors at LBA = 0x004166cb = 4286155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c0 66 41 e0 00      05:48:36.910  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.786  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.661  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.536  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.376  READ DMA

Error 29 occurred at disk power-on lifetime: 5313 hours (221 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 cb 66 41 e0  Error: UNC 176 sectors at LBA = 0x004166cb = 4286155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c0 66 41 e0 00      05:48:36.786  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.661  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.536  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.376  READ DMA
  c8 00 b0 10 66 41 e0 00      05:48:36.375  READ DMA

Error 28 occurred at disk power-on lifetime: 5313 hours (221 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 cb 66 41 e0  Error: UNC 176 sectors at LBA = 0x004166cb = 4286155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c0 66 41 e0 00      05:48:36.661  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.536  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.376  READ DMA
  c8 00 b0 10 66 41 e0 00      05:48:36.375  READ DMA
  c8 00 b0 60 65 41 e0 00      05:48:36.374  READ DMA

Error 27 occurred at disk power-on lifetime: 5313 hours (221 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 cb 66 41 e0  Error: UNC 176 sectors at LBA = 0x004166cb = 4286155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c0 66 41 e0 00      05:48:36.536  READ DMA
  c8 00 b0 c0 66 41 e0 00      05:48:36.376  READ DMA
  c8 00 b0 10 66 41 e0 00      05:48:36.375  READ DMA
  c8 00 b0 60 65 41 e0 00      05:48:36.374  READ DMA
  c8 00 b0 b0 64 41 e0 00      05:48:36.374  READ DMA

Error 26 occurred at disk power-on lifetime: 5313 hours (221 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b0 cb 66 41 e0  Error: UNC 176 sectors at LBA = 0x004166cb = 4286155

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 b0 c0 66 41 e0 00      05:48:36.376  READ DMA
  c8 00 b0 10 66 41 e0 00      05:48:36.375  READ DMA
  c8 00 b0 60 65 41 e0 00      05:48:36.374  READ DMA
  c8 00 b0 b0 64 41 e0 00      05:48:36.374  READ DMA
  c8 00 b0 00 64 41 e0 00      05:48:36.373  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: unknown failure    90%      5951         -
# 2  Extended offline    Completed: unknown failure    90%      5949         -
# 3  Extended offline    Completed: unknown failure    90%      5948         -
# 4  Extended offline    Completed: unknown failure    90%      5948         -
# 5  Extended offline    Completed: unknown failure    90%      5948         -
# 6  Extended offline    Completed: unknown failure    90%      5947         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

danb35 · Apr 13, 2016

danb35 said:
Run long SMART tests on all your drives ('smartctl -t long /dev/adaX'). Once those finish (about 6-8 hours), post the results of 'smartctl -x /dev/adaX' for each drive, inline, in code tags.

Edit: A SAS HBA certainly won't hurt anything, but it won't really benefit you when using spinning rust. We already knew that da2 (the former ada4) was bad, and now it has a failed SMART self-test to confirm it. da1 (the former ada2) doesn't look too bad except for the temperature.

Fei · Apr 13, 2016

Hi danb35

If the da1 SMART test is normal, can I ignore the da1 error message from the web GUI (alert system)?

Code:

/dev/da1 [SAT], 2 Offline uncorrectable sectors

danb35 · Apr 13, 2016

Fei said:
If the da1 SMART test is normal, can I ignore the da1 error message from the web GUI (alert system)?

Two offline sectors don't (IMO) mean the drive needs to be replaced right away, but you should keep an eye on it. If the number starts climbing, you'll want to replace it.

Your last SMART results show the temperatures are lower, but still way too high.
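One simple way to keep an eye on the sector counts is to compare raw values between two readings. The Python sketch below is only an illustration: the function name `climbing_counters` and the choice of watched attributes are assumptions, the baseline uses da1's raw values from the output posted in this thread, and the later reading is invented purely to exercise the function.

```python
# Attributes whose raw value counts bad or suspect sectors.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def climbing_counters(previous, current):
    """Return {name: (old, new)} for watched counters whose raw value grew."""
    return {
        name: (previous.get(name, 0), current[name])
        for name in WATCHED
        if name in current and current[name] > previous.get(name, 0)
    }


# da1 raw values as posted in this thread.
baseline = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 2,
}

# Hypothetical later reading, not taken from any real drive.
later = dict(baseline, Offline_Uncorrectable=5)

print(climbing_counters(baseline, baseline))
print(climbing_counters(baseline, later))
```

An empty result means nothing has grown since the previous reading; any entry in the result is a reason to plan a replacement.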

Fei · Apr 13, 2016

danb35 said:
Two offline sectors don't (IMO) mean the drive needs to be replaced right away, but you should keep an eye on it. If the number starts climbing, you'll want to replace it.

Your last SMART results show the temperatures are lower, but still way too high.

Thanks for your response.
I will work on the temperature issue.
