Hi guys,
today my small mirror pool encountered a checksum error for the first time:
Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 128K in 8h27m with 0 errors on Sun Jun 15 08:27:10 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/82f41c52-6036-11e3-b273-002590d6a1c4  ONLINE       0     0     0
            gptid/83c52ace-6036-11e3-b273-002590d6a1c4  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/580753fb-6691-11e3-b8b4-002590d6a1c4  ONLINE       0     0     1
            gptid/58ae0af9-6691-11e3-b8b4-002590d6a1c4  ONLINE       0     0     0

errors: No known data errors
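In case anyone wants to script this check, here is a minimal Python sketch that pulls the rows with non-zero READ/WRITE/CKSUM counters out of `zpool status` text. The function name `devices_with_errors` and the pasted sample are my own illustration, and it only handles plain integer counters like the ones above.

```python
def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for config rows with a non-zero counter.

    Sketch only: assumes plain integer counters, as in the output above.
    """
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The counter table starts right after this header row.
        if fields == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        counters = fields[2:5]
        # Skip anything that is not a device row (e.g. the "errors:" line).
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            flagged.append((fields[0], read, write, cksum))
    return flagged


# Sample rows copied from the status output above.
SAMPLE_STATUS = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/580753fb-6691-11e3-b8b4-002590d6a1c4  ONLINE       0     0     1
            gptid/58ae0af9-6691-11e3-b8b4-002590d6a1c4  ONLINE       0     0     0

errors: No known data errors
"""

for name, read, write, cksum in devices_with_errors(SAMPLE_STATUS):
    print(name, "READ", read, "WRITE", write, "CKSUM", cksum)
```

On the sample this flags only the one mirror-1 member with the single checksum error.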
mirror-1 consists of two 2TiB Samsung Spinpoint F4 drives that I moved over from my old NAS. They are fairly worn, and I expected them to fail sooner or later. Thanks to ZFS and its magnificent early-warning mechanism, I'm now ordering two new 3TiB WD Reds as replacements.
So when I got the checksum error in ZFS today, I wanted to take a closer look at the SMART data. I've been using SMART monitoring tasks, but up until now all the SMART tests have passed. I ran smartctl manually on my drives and noticed that I don't really understand the temperature values. They are not straightforward, so I thought I'd ask you guys about them. (BTW: here's the SMART data of the failing drive in case anyone is interested).
So, about the temperatures: here are the SMART attribute values from one of the WD Reds in the healthy mirror-0:
Code:
Device Model:     WDC WD30EFRX-68EUZN0
[...]
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   176   175   021    -    6158
  4 Start_Stop_Count        -O--CK   100   100   000    -    103
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   094   094   000    -    4423
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    77
192 Power-Off_Retract_Count -O--CK   200   200   000    -    19
193 Load_Cycle_Count        -O--CK   167   167   000    -    100101
194 Temperature_Celsius     -O---K   119   114   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
[...]
Index    Estimated Time   Temperature Celsius
  10    2014-06-15 08:45    31  ************
 ...    ..( 63 skipped).    ..  ************
  74    2014-06-15 09:49    31  ************
  75    2014-06-15 09:50    30  ***********
  76    2014-06-15 09:51    31  ************
  77    2014-06-15 09:52    30  ***********
  78    2014-06-15 09:53    31  ************
  79    2014-06-15 09:54    31  ************
  80    2014-06-15 09:55    30  ***********
  81    2014-06-15 09:56    30  ***********
  82    2014-06-15 09:57    31  ************
 ...    ..(  4 skipped).    ..  ************
  87    2014-06-15 10:02    31  ************
  88    2014-06-15 10:03    30  ***********
 ...    ..( 17 skipped).    ..  ***********
 106    2014-06-15 10:21    30  ***********
 107    2014-06-15 10:22    31  ************
 108    2014-06-15 10:23    30  ***********
 ...    ..(  3 skipped).    ..  ***********
 112    2014-06-15 10:27    30  ***********
 113    2014-06-15 10:28    31  ************
 ...    ..( 95 skipped).    ..  ************
 209    2014-06-15 12:04    31  ************
 210    2014-06-15 12:05    32  *************
 ...    ..( 17 skipped).    ..  *************
 228    2014-06-15 12:23    32  *************
 229    2014-06-15 12:24    31  ************
 ...    ..( 50 skipped).    ..  ************
 280    2014-06-15 13:15    31  ************
 281    2014-06-15 13:16    30  ***********
 ...    ..(158 skipped).    ..  ***********
 440    2014-06-15 15:55    30  ***********
 441    2014-06-15 15:56    31  ************
 ...    ..(  7 skipped).    ..  ************
 449    2014-06-15 16:04    31  ************
 450    2014-06-15 16:05    30  ***********
 451    2014-06-15 16:06    31  ************
 ...    ..( 35 skipped).    ..  ************
   9    2014-06-15 16:42    31  ************
[...]
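To get the minimum and maximum out of a history dump like this without reading it by eye, a small Python sketch could look like the following. The name `history_range` and the regular expression are my own assumptions about the row layout. As far as I understand, the "skipped" lines only stand for repeats of the neighbouring temperature, so the sketch ignores them.

```python
import re

# Matches explicit history rows such as "210    2014-06-15 12:05    32  *****".
# Assumption: index, "YYYY-MM-DD HH:MM" timestamp, temperature, then a bar of stars.
ROW = re.compile(r"^\s*(\d+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+(\d+)\s+\*+\s*$")


def history_range(text):
    """Return ((min_temp, timestamp), (max_temp, timestamp)) or None if no rows match."""
    readings = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            readings.append((int(match.group(3)), match.group(2)))
    if not readings:
        return None
    # Tuples compare by temperature first, then by timestamp string.
    return min(readings), max(readings)


# A few rows copied from the history above, including one "skipped" line.
SAMPLE_HISTORY = """\
 209    2014-06-15 12:04    31  ************
 210    2014-06-15 12:05    32  *************
 ...    ..( 17 skipped).    ..  *************
 228    2014-06-15 12:23    32  *************
 229    2014-06-15 12:24    31  ************
 280    2014-06-15 13:15    31  ************
 281    2014-06-15 13:16    30  ***********
"""

coldest, hottest = history_range(SAMPLE_HISTORY)
print("min:", coldest, "max:", hottest)
```

This only covers whatever interval the drive still has in its history buffer, so it says nothing about the lifetime worst case.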
As you can see, the Temperature_Celsius attribute has a VALUE of 119 and a WORST of 114.
Now, how should I interpret these numbers? They are obviously not temperatures in °C; I doubt that any drive could ever reach 114°C. So how can I decode these manufacturer-specific values?
I'd really like to know my real WORST temperature so I can decide whether I need more cooling. My temperature history looks good atm, but it only covers a short period of time. With the encoded WORST value I can't tell whether my drive has ever exceeded the 40°C limit :(
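One guess I can at least test against the row above: VALUE (119) plus RAW_VALUE (31) is exactly 150, so the normalized value might simply be a fixed offset minus the temperature in °C. This is only an assumption drawn from this single sample, not something documented for this drive, and `parse_attribute_row` is a hypothetical helper name. A minimal Python sketch of that arithmetic:

```python
def parse_attribute_row(row):
    """Split one smartctl attribute row into its numeric columns.

    Assumes the column order shown above:
    ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
    Only the first token of RAW_VALUE is read.
    """
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[7]),
    }


# Row 194 copied from the attribute table above.
ROW_194 = "194 Temperature_Celsius -O---K 119 114 000 - 31"
attr = parse_attribute_row(ROW_194)

# ASSUMPTION: normalized value = offset - temperature, inferred from one sample.
offset = attr["value"] + attr["raw"]
implied_worst_c = offset - attr["worst"]

print("offset:", offset)
print("implied worst temperature (C):", implied_worst_c)
```

If that guess holds, a WORST of 114 would correspond to 36°C, which is under my 40°C limit. Since the history output suggests the drive supports SCT, `smartctl -l scttempsts /dev/...` may also report lifetime min/max temperatures directly, which would be a better check than this arithmetic.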
edit: fixed link to the correct SMART data