Hi everyone,
I've had a couple of sleepless nights, so I'm hoping to get some clarity...
A couple of days ago, my FreeNAS pool went into a degraded state. Looking at zpool status, one disk showed as removed and several showed "too many errors." These were all in the oldest vdevs of my system, and they were all on the chassis backplane (Rosewill) as opposed to the DS4243s the newer disks are sitting in. Long story short, I narrowed it down to the backplane and moved the disks to spare slots in a DS4243. Everything is back online.
Next crisis: I of course took a look at the SMART reports on my disks. I'm a bit embarrassed to admit I didn't know the significance of Load_Cycle_Count. My oldest disks are WD Reds that apparently had aggressive head parking enabled. I checked, and they're out of their warranty period. Five of the drives have LCCs of around 800K! Three of them also have a few read errors. But all six have high LCCs and I'm a bit nervous. Obviously, running WDIDLE3 and the other firmware update for Reds is on my schedule after work today.
As an additional wrinkle, one of the drives "has experienced an unrecoverable error" according to zpool status (3 checksum errors). I'm not sure if that is due to the drive itself or is just a residual problem from the backplane issue.
Here are my thoughts:
1) I'm not sure how long the drives can realistically keep going. If I replace them now, I could be giving up another year or two of use.
2) But if I wait for a drive to actually fail, I'm running the risk that multiple drives go at the same time, especially given the high LCC and the demand put on the drives during the resilvering process.
3) But if I replace them, the resilvering process might hose multiple drives sooner rather than later (I guess this is kind of silly... but delaying gives me a chance to wait for a sale and do a full local backup).
4) If I replace them now, do I just go ahead and replace all six, one at a time? If I do that, what do I do with the six 4TB drives?
5) I have three 4TB Red drives that I kept as spares... I have to decide whether to try to use those, or upgrade to 8 or 10TB drives for the whole vdev.
I have six vdevs of 6 drives each (two with 4TB drives, three with 8TB drives, and one with 10TB drives). While I do have online backups if worst comes to worst, I'd obviously prefer not to have to go that route. Part of me is tempted to see if I can hold out for another sale on the 10TB Easystores, buy 8, stripe them together just to copy my data over, destroy the pool, and create a fresh pool without the old drives. But I don't relish that option, either...
So what would you do?
Here are the relevant SMART values for the six older drives:
Drive 1:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 21
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 51
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 811913

Drive 2:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 52
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 826762

Drive 3:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 10
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 62
9 Power_On_Hours 0x0032 039 039 000 Old_age Always - 45120
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 41
193 Load_Cycle_Count 0x0032 087 087 000 Old_age Always - 340213

Drive 4:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 52
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 797650

Drive 5:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 57
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50571
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 32
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 815956

Drive 6:
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 15
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 64
9 Power_On_Hours 0x0032 031 031 000 Old_age Always - 50576
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 38
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 791589
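To put those numbers in perspective, here's a quick back-of-the-envelope script. The (power-on hours, LCC) pairs are copied from the smartctl output above; the ~600K rated load/unload cycle figure is the spec commonly quoted for WD Reds, so treat it as approximate rather than a hard failure point:

```python
# Load cycles per power-on hour for each of the six older drives, and how
# many are past the rated load/unload cycle count.
# (power_on_hours, load_cycle_count) pairs taken from the smartctl dump.
DRIVES = [
    (50571, 811913),
    (50571, 826762),
    (45120, 340213),
    (50571, 797650),
    (50571, 815956),
    (50576, 791589),
]

RATED_CYCLES = 600_000  # commonly quoted WD Red spec; approximate

for hours, lcc in DRIVES:
    rate = lcc / hours            # cycles per power-on hour
    minutes_per_park = 60 / rate  # i.e., one head park every N minutes
    print(f"{lcc:>7} cycles over {hours} h -> "
          f"{rate:.1f}/h (one park every {minutes_per_park:.1f} min)")

over_rated = sum(1 for _, lcc in DRIVES if lcc > RATED_CYCLES)
print(f"{over_rated} of {len(DRIVES)} drives past the rated cycle count")
```

The worst five drives work out to roughly one head park every four minutes for their entire service life, which is why disabling the idle timer matters even this late in the game.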