SMART attribute: temperature value decoding

bestboy · Jun 15, 2014

Hi guys,

today my small mirror pool encountered a checksum error for the first time:

Code:

pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 128K in 8h27m with 0 errors on Sun Jun 15 08:27:10 2014
config:
 
        NAME                                            STATE    READ WRITE CKSUM
        tank                                            ONLINE      0    0    0
          mirror-0                                      ONLINE      0    0    0
            gptid/82f41c52-6036-11e3-b273-002590d6a1c4  ONLINE      0    0    0
            gptid/83c52ace-6036-11e3-b273-002590d6a1c4  ONLINE      0    0    0
          mirror-1                                      ONLINE      0    0    0
            gptid/580753fb-6691-11e3-b8b4-002590d6a1c4  ONLINE      0    0    1
            gptid/58ae0af9-6691-11e3-b8b4-002590d6a1c4  ONLINE      0    0    0
 
errors: No known data errors

mirror-1 consists of 2 2TiB Samsung Spin Point F4 drives that have been moved over from my old NAS. They are kind of worn out and I expected them to fail sooner or later. And thanks to ZFS and its magnificent early warning mechanism, I'm now in the process of ordering 2 new 3TiB WD REDs as a replacement.

So when I got the checksum error in ZFS today, I wanted to take a closer look at the SMART data. I've been using SMART monitoring tasks, but up until now all those SMART tests have been OK. I ran smartctl manually on my drives and noticed that I don't really understand the temperature values. They are not straightforward, and I find them confusing, so I thought I'd ask you guys about them. (BTW: here's the SMART data of the failing drive in case anyone is interested).

So about the temperatures: Here are the SMART attribute values from one of my WD REDs that are in the healthy mirror-0:

Code:

Device Model:    WDC WD30EFRX-68EUZN0
 
[...]
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   176   175   021    -    6158
  4 Start_Stop_Count        -O--CK   100   100   000    -    103
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   094   094   000    -    4423
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    77
192 Power-Off_Retract_Count -O--CK   200   200   000    -    19
193 Load_Cycle_Count        -O--CK   167   167   000    -    100101
194 Temperature_Celsius     -O---K   119   114   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
 
[...]
 
Index    Estimated Time  Temperature Celsius
  10    2014-06-15 08:45    31  ************
...    ..( 63 skipped).    ..  ************
  74    2014-06-15 09:49    31  ************
  75    2014-06-15 09:50    30  ***********
  76    2014-06-15 09:51    31  ************
  77    2014-06-15 09:52    30  ***********
  78    2014-06-15 09:53    31  ************
  79    2014-06-15 09:54    31  ************
  80    2014-06-15 09:55    30  ***********
  81    2014-06-15 09:56    30  ***********
  82    2014-06-15 09:57    31  ************
...    ..(  4 skipped).    ..  ************
  87    2014-06-15 10:02    31  ************
  88    2014-06-15 10:03    30  ***********
...    ..( 17 skipped).    ..  ***********
106    2014-06-15 10:21    30  ***********
107    2014-06-15 10:22    31  ************
108    2014-06-15 10:23    30  ***********
...    ..(  3 skipped).    ..  ***********
112    2014-06-15 10:27    30  ***********
113    2014-06-15 10:28    31  ************
...    ..( 95 skipped).    ..  ************
209    2014-06-15 12:04    31  ************
210    2014-06-15 12:05    32  *************
...    ..( 17 skipped).    ..  *************
228    2014-06-15 12:23    32  *************
229    2014-06-15 12:24    31  ************
...    ..( 50 skipped).    ..  ************
280    2014-06-15 13:15    31  ************
281    2014-06-15 13:16    30  ***********
...    ..(158 skipped).    ..  ***********
440    2014-06-15 15:55    30  ***********
441    2014-06-15 15:56    31  ************
...    ..(  7 skipped).    ..  ************
449    2014-06-15 16:04    31  ************
450    2014-06-15 16:05    30  ***********
451    2014-06-15 16:06    31  ************
...    ..( 35 skipped).    ..  ************
  9    2014-06-15 16:42    31  ************
[...]

Well, as you can see the Temperature_Celsius attribute has a VALUE of 119 and a WORST of 114.
Now, how should I interpret these numbers? They are obviously not temperature values. I doubt that any drive could ever reach 114°C. So how can I decode these manufacturer-specific values?
I'd really like to know my real WORST temperature so I can decide whether I need more cooling. My temperature history looks good atm, but it only covers a short period of time. With the encoded WORST value I don't really know whether my drive has ever exceeded the 40°C limit :(
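
For the history table itself, the highest logged temperature can be pulled out mechanically. A rough sketch, assuming the table has the layout shown above (index, date, time, temperature, bar) and is fed on standard input; the function name `max_history_temp` is made up for illustration:

```shell
# Hypothetical helper: print the highest temperature found in a
# temperature history table read from stdin. ASSUMES rows look like
#   "  10    2014-06-15 08:45    31  ************"
# and ignores the "..( 63 skipped)." filler rows.
max_history_temp() {
    awk '$3 ~ /^[0-9]+:[0-9]+$/ && $4 ~ /^[0-9]+$/ {
             if (!seen || $4 + 0 > max) { max = $4 + 0; seen = 1 }
         }
         END { if (seen) print max }'
}

# Example with a few rows copied from the table above:
max_history_temp <<'EOF'
 209    2014-06-15 12:04    31  ************
 210    2014-06-15 12:05    32  *************
...    ..( 17 skipped).    ..  *************
 281    2014-06-15 13:16    30  ***********
EOF
```

This only covers the window the drive still has logged, so it is no substitute for the WORST value.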

edit: fixed link to the correct SMART data

wywywywy · Jun 17, 2014

194 Temperature_Celsius -O---K 119 114 000 - 31

It does say raw value of 31 C there...

EDIT -

By the way, the way it reads is 150 minus the value you see.

This is because of the way the threshold works - anything higher than the threshold means good, anything lower than the threshold means bad. But when it comes to temperature, higher means bad, so the value needs to be flipped over.
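
A minimal sketch of that rule as a shell function, assuming the offset of 150 holds for the drive in question. It matches the WD Red output above, but it is not a documented constant, and the function name `normalized_to_celsius` is made up for illustration:

```shell
# Hypothetical helper: convert the normalized VALUE or WORST of
# attribute 194 to degrees Celsius, ASSUMING the drive encodes it as
# "offset - temperature" with an offset of 150.
normalized_to_celsius() {
    offset=${2:-150}            # assumed offset; pass another as $2
    echo $(( offset - $1 ))
}

normalized_to_celsius 119       # VALUE from the output above -> 31
normalized_to_celsius 114       # WORST from the output above -> 36
```

The first result agrees with the RAW_VALUE of 31 in the same line, which is the only check available here.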

solarisguy · Jun 17, 2014

Code:

zpool tank clear

You had attributed the error to the drive, but one error could have originated from the SATA port and a cosmic ray hit...

If you based your decision on Multi_Zone_Error_Rate, that was not really warranted.

bestboy · Jun 17, 2014

wywywywy said:
By the way, the way it reads is 150 minus the value you see.

Thank you very much wywywywy. That's just what I needed to know. Considering the offset value of 150 and the WORST value of 114 my highest temperature was 36°C. That's just fine.

bestboy · Jun 17, 2014

solarisguy said:
Code:
zpool tank clear

You had attributed the error to the drive, but one error could have originated from the SATA port and a cosmic ray hit...

If you based your decision on Multi_Zone_Error_Rate, that was not really warranted.

Well, tbh I decided to replace the drives based on the scrub report. I was under the impression that, when ZFS has to intervene and apply corrections, then things are going south. And since all my hardware is new - except for the 2 HDs from my old NAS - I did not consider issues with the controller. I also did not consider ionizing radiation from cosmic rays, decaying Radon in my basement or somesuch. I just jumped to the conclusion that the old drives must be starting to fail...
So, do you think I'm overreacting by replacing those 2 drives due to just 1 checksum error? Would you recommend just clearing the pool, shortening the scrub intervals and seeing what happens next?

solarisguy · Jun 17, 2014

Yes, I think you had jumped the gun :) And I had also made a mistake

it should have been zpool clear tank

You are right, if ZFS intervenes that means something went south. However, the disk is only one component in the entire data path.

Ericloewe · Jun 17, 2014

solarisguy said:
Code:
zpool tank clear

You had attributed the error to the drive, but one error could have originated from the SATA port and a cosmic ray hit...

If you based your decision on Multi_Zone_Error_Rate, that was not really warranted.

I disagree. That parameter is supposed to measure the sectors' error rate, so it does mean that the issue is internal to the drive. A single bitflip would account for a large increase, if any, given the drive's baseline error rate (which is said to be non-zero).

cyberjock · Jun 17, 2014

Based on your SMART output and the error (and assuming it was actually the same disk), what I get is that the disk has no problem but the data was corrupted. This is the proverbial "silent corruption" that ZFS is designed to detect and correct. In your case, it did exactly what it was supposed to do and identified and corrected the issue. The cause of the corruption could be anywhere in the data path from the platter to the CPU. As for the exact cause, it looks like it was a one-time fluke, so I wouldn't be too worried about it at the present time.

Unrelated comment:

Load_Cycle_Count -O--CK 167 167 000 - 100101

That's pretty high. You should check out the guide on fixing your WD Red's head parking time with wdidle or a firmware update. Googling or searching the forums will give you everything you need to know about it and how to fix it. You shouldn't ignore this, as it could shorten the life of your drive in the long term.
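
To put that number in perspective, it can be divided by the Power_On_Hours raw value from the same output (4423). A quick sketch; the function name `load_cycles_per_hour` is made up, and what counts as "too high" is a judgment call that this calculation doesn't settle:

```shell
# Hypothetical helper: average head load cycles per power-on hour,
# computed from the RAW_VALUEs of attributes 193 and 9.
load_cycles_per_hour() {
    awk -v cycles="$1" -v hours="$2" \
        'BEGIN { if (hours > 0) printf "%.1f\n", cycles / hours }'
}

load_cycles_per_hour 100101 4423    # numbers from the WD Red output above
```

For this drive that works out to roughly 22.6 load cycles per hour of operation.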

bestboy · Jun 17, 2014

Thank you guys for your insights. Luckily I was still able to cancel the order for the disks. They had not been shipped yet. I'm probably gonna have to order replacements soon tho, because I'm running out of free space. Still, I'll keep the drives and clear the pool for now.

cyberjock said:
Based on your SMART output and the error (and assuming it was actually the same disk), what I get is that the disk has no problem but the data was corrupted.

Actually, I just noticed a mistake. I uploaded the wrong smartctl output. The link I posted above showed the 2nd Samsung Spinpoint on ada3. However, the checksum error was detected on ada2. Sorry about that.

The drives look very similar tho. If anything, the drive with the checksum error even looks slightly better...
ada2 (<- checksum error)
ada3

cyberjock said:
Unrelated comment:

Load_Cycle_Count -O--CK 167 167 000 - 100101

That's pretty high. You should check out the guide on fixing your WD Red's head parking time with wdidle or a firmware update. Googling or searching the forums will give you everything you need to know about it and how to fix it. You shouldn't ignore this, as it could shorten the life of your drive in the long term.

Eww... I'll look into that. That kind of sucks. Now that I know about this issue, I'm probably not going for more WD REDs when I have to do the replacement of the Samsung Spinpoints...

solarisguy · Jun 17, 2014

Do not worry about buying WD Red model

I have WD Red drives made in April 2014, the model is WD40EFRX. I have downloaded the WD Red SMART Load/Unload utility from http://support.wd.com/product/download.asp?groupid=619&sid=201. It took me some time to connect the drives for the update, since the utility did not see the disks through my USB-SATA bridges (it could update WD Red drives in a WD My Book though...). When I was finally able to run the utility against my drives, it turned out that they did not need to be updated :)

Yatti420 · Jun 17, 2014

100,000 load/unload cycles is too high unless the drives are very old.. With how I believe WD manufactures these drives, I don't think there is anything special about Red vs. Green.. They are the same drives with different firmware and imo not that great of a drive..

There are a lot of disadvantages performance-wise to not making sure that the idle timer is disabled or set so high that it never runs.. I lost my first Green at about this point, around 100k..

solarisguy · Jun 17, 2014

I did not buy any WD Green hard drives this year, but I was under the impression, that due to the firmware differences, newly made WD Reds did not require an update, while WD Greens did. At least there were no reports to the contrary.

cyberjock · Jun 17, 2014

The whole problem with WD Reds was a temporary problem in the firmware. It's since been fixed. But the issue exists as WD Greens are supposed to park themselves.

I'm running 24 WD Greens in my server and I can't praise them enough. I've had 3 failures in over 3 years of 24x7 operation for all of the disks. The sucky part is that now they are all out of warranty so if they start failing the replacement plan will get expensive. ;)

John Morales · Oct 26, 2015

wywywywy said:
By the way, the way it reads is 150 minus the value you see.

This is because of the way the threshold works - anything higher than the threshold means good, anything lower than the threshold means bad. But when it comes to temperature, higher means bad, so the value needs to be flipped over.

I realize that this post is over a year old, but I would like to know where you were able to find that the normalized value for these WD Reds is offset by 150. I've been googling and cannot find any source for this information. But it does look legit.

Code:

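freenas# cat show_drive_temp.sh
#!/bin/bash
# Decode attribute 194 as "150 - normalized value" for each /dev/da? disk.
# The offset of 150 was inferred in this thread and may differ on other drives.
for i in /dev/da?; do
        smartctl -A "$i" | awk -v drive="$i" '$2 ~ /Temperature_Celsius/ { printf("Drive %s: Current %s, Highest %s\n", drive, 150 - $4, 150 - $5)}'
done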
freenas# ./show_drive_temp.sh 
Drive /dev/da0: Current 31, Highest 36
Drive /dev/da1: Current 31, Highest 37
Drive /dev/da2: Current 32, Highest 38
Drive /dev/da3: Current 31, Highest 36
Drive /dev/da4: Current 31, Highest 36
Drive /dev/da5: Current 30, Highest 35
Drive /dev/da6: Current 31, Highest 36
Drive /dev/da7: Current 29, Highest 35

cyberjock · Oct 27, 2015

John Morales said:
I realize that this post is over a year old, but I would like to know where you were able to find that the normalized value for these WD Reds is offset by 150. I've been googling and cannot find any source for this information. But it does look legit.

Easy...

194 Temperature_Celsius 0x0022 114 111 000 Old_age Always - 38

See the 114 and 38? What's 114 + 38? ;)

Ericloewe · Oct 27, 2015

cyberjock said:
Easy...

194 Temperature_Celsius 0x0022 114 111 000 Old_age Always - 38

See the 114 and 38? What's 114 + 38? ;)

Wait a second, that's not 150.

Is that a WD 5400RPM drive?

John Morales · Oct 27, 2015

Ericloewe said:
Wait a second, that's not 150. Is that a WD 5400RPM drive?

lol, I assumed he was trolling me and didn't even bother doing the math... ;)

cyberjock · Oct 29, 2015

Oh shit. I wasn't trolling you. I, for some stupid reason, really thought that the sum was 150. LOL. The joke's on me!

In all seriousness, that is supposed to be how you know the sum (at least, from the info I had). But apparently it's not 150. Could it be 152?

John Morales · Oct 29, 2015

heh, it seems to add up to 150 for me, I'm using WD Reds. I was just curious whether that number was just assumed or whether there is a reference for how WD normalizes its values.

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0022   116   113   000    Old_age   Always       -       34

i.e. x - raw_value = value
solve for x.. got it.

cyberjock · Oct 29, 2015

I don't think there is a reference anywhere. It's just been assumed by solving for x. ;)
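
Solving for x can be scripted as well. A sketch that adds VALUE and RAW_VALUE for attribute 194, assuming the `smartctl -A` column layout quoted in this thread (VALUE in column 4, RAW_VALUE in column 10); the function name `infer_temp_offset` is made up for illustration:

```shell
# Hypothetical helper: infer the offset x in "x - raw_value = value"
# from smartctl -A style lines on stdin. The column positions are an
# ASSUMPTION based on the output quoted in this thread.
infer_temp_offset() {
    awk '$2 == "Temperature_Celsius" { print $4 + $10 }'
}

# The two lines quoted above give different answers (150 and 152):
infer_temp_offset <<'EOF'
194 Temperature_Celsius     0x0022   116   113   000    Old_age   Always       -       34
194 Temperature_Celsius     0x0022   114   111   000    Old_age   Always       -       38
EOF
```

Since the result differs between the two drives, it seems safer to work out the offset per drive before using it to decode WORST, and to read the current temperature straight from RAW_VALUE.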
