After about 3 years of running 24/7, my FreeNAS server is now reporting the first signs of failing drive(s).
My setup:
Norco 24-bay 4U case
Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Supermicro X10SRL-F
128 GB Samsung memory (at the time it was cheap enough to go straight to 128, which seems like a good choice now)
24x WD Red series
4 RAIDZ2 vdevs of 6 disks each, in a single pool
Now, in 2 of the vdevs, I get read errors listed for the SMART long (extended) self-test. Disk 1:
Code:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       8
  3 Spin_Up_Time            0x0027   178   178   021    Pre-fail  Always       -       8066
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22952
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       472
194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         19500         -
# 2  Extended offline    Completed: read failure       90%         19406         116357160
# 3  Short offline       Completed without error       00%         19333         -
# 4  Short offline       Completed without error       00%         19165         -
# 5  Extended offline    Completed: read failure       90%         19070         116357160
# 6  Short offline       Completed without error       00%         18997         -
# 7  Short offline       Completed without error       00%         18757         -
# 8  Extended offline    Completed: read failure       90%         18662         116357160
# 9  Short offline       Completed without error       00%         18589         -
#10  Short offline       Completed without error       00%         18422         -
#11  Extended offline    Completed: read failure       90%         18327         116357160
#12  Short offline       Completed without error       00%         18254         -
#13  Short offline       Completed without error       00%         18039         -
#14  Extended offline    Completed without error       00%         17953         -
#15  Short offline       Completed without error       00%         17871         -
#16  Short offline       Completed without error       00%         17703         -
#17  Extended offline    Completed: read failure       20%         17615         1775422536
#18  Short offline       Completed without error       00%         17535         -
#19  Short offline       Completed without error       00%         17294         -
#20  Extended offline    Completed without error       00%         17212         -
#21  Short offline       Completed without error       00%         17127         -
1 of 5 failed self-tests are outdated by newer successful extended offline self-test #14
and disk 2:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       41
  3 Spin_Up_Time            0x0027   172   172   021    Pre-fail  Always       -       8400
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   071   070   000    Old_age   Always       -       21760
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1977
194 Temperature_Celsius     0x0022   119   108   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%         19501         2589409
# 2  Extended offline    Completed: read failure       90%         19406         2589408
# 3  Short offline       Completed: read failure       90%         19333         2589408
# 4  Short offline       Completed: read failure       50%         19165         2589408
# 5  Extended offline    Completed without error       00%         19080         -
# 6  Short offline       Completed without error       00%         18997         -
# 7  Short offline       Completed without error       00%         18758         -
# 8  Extended offline    Completed without error       00%         18672         -
# 9  Short offline       Completed without error       00%         18590         -
#10  Short offline       Completed without error       00%         18422         -
#11  Extended offline    Completed without error       00%         18337         -
#12  Short offline       Completed without error       00%         18255         -
#13  Short offline       Completed without error       00%         18039         -
#14  Extended offline    Completed without error       00%         17954         -
#15  Short offline       Completed without error       00%         17871         -
#16  Short offline       Completed without error       00%         17703         -
#17  Extended offline    Completed without error       00%         17618         -
#18  Short offline       Completed without error       00%         17535         -
#19  Short offline       Completed without error       00%         17295         -
#20  Extended offline    Completed without error       00%         17212         -
#21  Short offline       Completed without error       00%         17127         -
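To make the self-test logs above easier to compare between disks, here is a rough Python sketch that tallies failed tests and the LBAs they stopped at. The names `ENTRY`, `SAMPLE` and `summarize_selftests` are made up for illustration, the pattern only covers the two test types and two statuses that appear in these logs, and the sample is just the first five entries from disk 1.

```python
import re
from collections import Counter

# One entry of the smartctl self-test log, e.g.
# "# 2  Extended offline  Completed: read failure  90%  19406  116357160".
# Assumption: only the test types and statuses seen in the logs above occur.
ENTRY = re.compile(
    r"#\s*(?P<num>\d+)\s+"
    r"(?P<kind>Short offline|Extended offline)\s+"
    r"(?P<status>Completed without error|Completed: read failure)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)"
)

# First five self-test entries of disk 1, copied from the log above.
SAMPLE = """
# 1  Short offline     Completed without error  00%  19500  -
# 2  Extended offline  Completed: read failure  90%  19406  116357160
# 3  Short offline     Completed without error  00%  19333  -
# 4  Short offline     Completed without error  00%  19165  -
# 5  Extended offline  Completed: read failure  90%  19070  116357160
"""


def summarize_selftests(log_text):
    """Return (total entries, failed entries, Counter of first-error LBAs)."""
    total = 0
    failed = 0
    lbas = Counter()
    for match in ENTRY.finditer(log_text):
        total += 1
        if match.group("status") == "Completed: read failure":
            failed += 1
            # A failed test normally reports the LBA where it stopped.
            if match.group("lba") != "-":
                lbas[int(match.group("lba"))] += 1
    return total, failed, lbas


if __name__ == "__main__":
    total, failed, lbas = summarize_selftests(SAMPLE)
    print(f"{failed} of {total} self-tests failed")
    for lba, count in lbas.most_common():
        print(f"  LBA {lba}: {count} failure(s)")
```

On the sample, both failed extended tests stop at the same LBA, which is the pattern visible in the full logs: each disk keeps failing at (nearly) the same spot rather than at scattered locations.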
Should I replace those disks right now, or can I keep them until they go "offline"?
The pool seems to be working fine, but it's a little unclear to me whether ZFS has actually "dropped" the disks and is now running with 2 Z1 vdevs in the pool instead of Z2, or whether these are just notifications of a soon-to-be-dead disk.
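One way to check whether ZFS itself has dropped anything is to look at the per-device STATE column of `zpool status`: a device that ZFS has taken out of service shows a state other than ONLINE there, whereas a failed SMART self-test on its own does not change that column. Below is a minimal Python sketch of that check. The helper names `read_zpool_status` and `unhealthy_devices` are hypothetical, and the parsing assumes the usual five-column NAME/STATE/READ/WRITE/CKSUM layout of the config section.

```python
import subprocess

# Vdev states that mean a device (or the vdev/pool above it) is not fully healthy.
BAD_STATES = {"DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def read_zpool_status():
    """Return the text printed by `zpool status` (needs the zpool CLI on PATH)."""
    result = subprocess.run(
        ["zpool", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout


def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows whose STATE is not healthy.

    Assumption: config rows have at least five whitespace-separated columns
    (NAME STATE READ WRITE CKSUM), which skips the "state: ..." summary line.
    """
    found = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[1] in BAD_STATES:
            found.append((parts[0], parts[1]))
    return found


# Usage on the server itself:
#   for name, state in unhealthy_devices(read_zpool_status()):
#       print(name, state)
```

An empty result would suggest that all four vdevs are still running as full Z2 and that the SMART messages are warnings only, though it says nothing about how long the disks will last.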