Hi everyone,
I've had a couple of sleepless nights, so I'm hoping to get some clarity...
A couple of days ago, my FreeNAS pool went into a degraded state. Looking at zpool status, one disk showed as removed and several showed "too many errors." These were all in the oldest vdevs of my system, and they were all on the chassis backplane (Rosewill) as opposed to the DS4243s the newer disks are sitting in. Long story short, I narrowed it down to the backplane and moved the disks to spare slots in a DS4243. Everything is back online.
Next crisis: I of course took a look at the SMART reports on my disks. I'm a bit embarrassed to admit I didn't know the significance of Load_Cycle_Count. My oldest disks are WD Reds that apparently had aggressive head parking enabled. I checked, and they're out of their warranty period. Five of the drives have LCCs of around 800K! Three of them also have a few read errors. But all six have high LCCs and I'm a bit nervous. Obviously, running WDIDLE3 and the other firmware update for Reds is on my schedule after work today.
As an additional wrinkle, one of the drives "has experienced an unrecoverable error" according to zpool status (3 checksum errors). I'm not sure if that is due to the drive itself or is just a residual problem from the backplane issue.
Here are my thoughts:
1) I'm not sure how long the drives can realistically keep going. If I replace them now, I could be giving up another year or two of use.
2) But if I wait for a drive to actually fail, I'm running the risk that multiple drives go at the same time, especially given the high LCC and the demand put on the drives during the resilvering process.
3) But if I replace them, the resilvering process might hose multiple drives sooner rather than later (I guess this is kind of silly... but delaying gives me a chance to wait for a sale and do a full local backup).
4) If I replace them now, do I just go ahead and replace all six, one at a time? If I do that, what do I do with the six 4TB drives?
5) I have three 4TB Red drives that I kept as spares... I have to decide whether to try to use those, or upgrade to 8 or 10TB drives for the whole vdev.
I have six vdevs of 6 drives each (two with 4TB drives, three with 8TB drives, and one with 10TB drives). While I do have online backups if worst comes to worst, I'd obviously prefer not to have to go that route. Part of me is tempted to see if I can hold out for another sale on the 10TB Easystores, buy 8, stripe them together just to copy my data over, destroy the pool, and create a fresh pool without the old drives. But I don't relish that option, either...
So what would you do?
Here are the relevant SMART values for the six older drives:
Drive 1:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 21
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 51
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 811913

Drive 2:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 52
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 826762

Drive 3:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 10
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 62
9 Power_On_Hours 0x0032 039 039 000 Old_age Always - 45120
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 41
193 Load_Cycle_Count 0x0032 087 087 000 Old_age Always - 340213

Drive 4:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 52
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 797650

Drive 5:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 57
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 32
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 815956

Drive 6:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 15
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 64
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50576
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 38
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 791589
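To put those numbers in perspective, here's a quick back-of-the-envelope script. The (power-on hours, LCC) pairs are copied from the smartctl output above; the ~600K rated load/unload cycle figure is the spec commonly quoted for WD Reds, so treat it as approximate rather than a hard failure point:

```python
# Load cycles per power-on hour for each of the six older drives, and how
# many are past the rated load/unload cycle count.
# (power_on_hours, load_cycle_count) pairs taken from the smartctl dump.
DRIVES = [
    (50571, 811913),
    (50571, 826762),
    (45120, 340213),
    (50571, 797650),
    (50571, 815956),
    (50576, 791589),
]

RATED_CYCLES = 600_000  # commonly quoted WD Red spec; approximate

for hours, lcc in DRIVES:
    rate = lcc / hours            # cycles per power-on hour
    minutes_per_park = 60 / rate  # i.e., one head park every N minutes
    print(f"{lcc:>7} cycles over {hours} h -> "
          f"{rate:.1f}/h (one park every {minutes_per_park:.1f} min)")

over_rated = sum(1 for _, lcc in DRIVES if lcc > RATED_CYCLES)
print(f"{over_rated} of {len(DRIVES)} drives past the rated cycle count")
```

The worst five drives work out to roughly one head park every four minutes for their entire service life, which is why disabling the idle timer matters even this late in the game.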