New 4TB Seagate NAS drive ST4000VN000, want to disable ERC (aka TLER)

Status
Not open for further replies.

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Some people prefer to leave ERC enabled because it will allow the array to heal sectors quicker in applications where they need to avoid interruptions. But, just like the title says, I want to disable ERC on this drive because I have read on the forums that keeping it enabled may force an error during RAIDZ resilvering, which may have otherwise been possible to recover from if the recovery had not been time-limited. I am using FreeNAS at home and don't need to prevent short interruptions that the error recovery may cause.

After some snooping around on the web, I found that many people use the smartctl utility to adjust this parameter. So I used its "scterc" option to set the timeouts to zero, which effectively disables the feature. But upon reboot, the feature was enabled again and set to 7 seconds. I was afraid of this.
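For reference, a minimal sketch of the smartctl invocation described above. It assumes smartctl's `-l scterc,READTIME,WRITETIME` form, where both values are given in tenths of a second and 0 disables the limit. The device name `/dev/ada0` and the helper name `scterc_command` are placeholders for illustration, not taken from the post.

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build a smartctl command that sets the SCT ERC read/write timeouts."""
    # smartctl expects tenths of a second, so 7 seconds becomes 70.
    # A value of 0 disables the time limit.
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# Hypothetical device name; substitute the real one on your system.
print(" ".join(scterc_command("/dev/ada0", 0, 0)))    # disable ERC
print(" ".join(scterc_command("/dev/ada0", 7, 7)))    # 7 second limit
print(" ".join(scterc_command("/dev/ada0", 90, 90)))  # 90 second limit
```

The sketch only builds and prints the commands. Running them requires root, and as noted, the setting may not survive a power cycle.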

Is it possible to disable this feature permanently? Or can it be controlled in FreeNAS?

What are people's opinions of leaving it enabled in a home-use application? Or maybe setting it to a value more like 90 seconds, to give the drive more time to see if it can recover?

Also, a slightly off topic question: What utilities do people run on their new drives to "validate" that they are healthy and ready for use in production?

Thanks for your help!
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Hmm... I just tried setting ERC to 90 seconds. It also will not persist across a reboot. After reboot, I'm back to 7.

Maybe I should just settle for needing to change these all manually before doing a resilver?
 

TheSmoker

Patron
Joined
Sep 19, 2012
Messages
225
You can add it to a startup script, so every time FreeNAS reboots it will set those values the way you want them.
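A rough sketch of what such a startup script might look like, written in Python for illustration. The device list, the function names, and the choice to default to a dry run are all assumptions. How the script gets registered to run at boot depends on the FreeNAS version and is not shown here.

```python
import subprocess


def run_smartctl(cmd):
    """Execute the command for real and return its exit status."""
    return subprocess.run(cmd).returncode


def dry_run(cmd):
    """Print the command instead of executing it."""
    print("would run:", " ".join(cmd))
    return 0


def apply_erc(devices, deciseconds, run=dry_run):
    """Set the same ERC read and write timeout on every listed device."""
    statuses = {}
    for dev in devices:
        # smartctl takes tenths of a second; 0 disables the limit.
        cmd = ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", dev]
        statuses[dev] = run(cmd)
    return statuses


# Hypothetical device names; 0 disables ERC on each of them.
apply_erc(["/dev/ada0", "/dev/ada1"], 0)
```

Passing `run=run_smartctl` would execute the commands instead of printing them. That requires root and smartctl on the PATH.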

 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Not to argue with you over your choice to disable TLER, but you do realize that most people here (I'm not part of this group) argue that a drive without TLER shouldn't be used in ZFS?

You're the first person I've seen to deliberately want to disable TLER. Quite an interesting twist!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I think most of us who like TLER recognize that it is basically a matter of whether or not it is appropriate/desirable for a given purpose.

In a ZFS environment with redundancy, it is unclear why you would prefer the drive come to a halt and beat itself with a brick trying to recover the data. It seems better to just discard the attempt and allow recovery from parity/redundancy. But if your users don't mind the fileserver going catatonic, then by all means...
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
Thanks for everyone's replies. I REALLY appreciate the help I've been given so far on these forums. My FreeNAS education would be much harder without you. I'm getting excited to stop researching and start purchasing hardware, but I'm not quite there yet. Although, I did buy a single drive to play with, already. Maybe I should pay all of this assistance forward by making a build log of my hardware choices and FreeNAS setup or something for the benefit of future newbs.

To answer the question of why I would want to do this: my server is for home purposes, so if it stalls temporarily while someone is using it, that could be OK, as long as it is rare and recovers eventually. The benefit I see is that, when resilvering, if a read error is encountered, the drive will have some time to try to recover from it before it becomes a URE (Unrecoverable Read Error). Maybe it's rare that seven seconds isn't enough to recover but 90 seconds is? Anyway, that's my logic. Please tell me if this is stupid.

The benefit of having ERC/TLER permanently disabled would be that I would not have the risk of forgetting to enable (EDIT: I meant to say disable!) it before resilvering. But I am thinking I might do my future self a favor by making a cheat sheet, based on what I know today, of how to properly resilver, so that I don't make a mistake 4 years from now when something finally fails. If I leave ERC enabled, I could put a warning in my cheat sheet to remember to disable it before resilvering. Then I would avoid the issue from my original post of it resetting after power cycles, and also get the benefit of no unexpected slow-downs during typical use.

Yeah. That's what I'll do.

P.S. I have an unreasonable disdain for the term "resilvering," perhaps because it seems like a word someone would use to try to sound more knowledgeable than others. But I've read that this is the correct term over "rebuilding." I don't know why.

P.P.S. Here is the thread I found today (by accident after an unsuccessful search!) about validating drives before using them in production.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
scurrier said:
The benefit of having ERC/TLER permanently disabled would be that I would not have the risk of forgetting to enable it before resilvering. But, I am thinking I might want to go ahead and do a favor to my future self by making a cheat sheet with my today's knowledge of how to properly resilver so that I don't make a mistake 4 years from now when something finally fails. If I leave ERC enabled, I could make a warning in my cheat sheet to remember to disable it before resilvering. Then I would avoid OP issue of having it reset during power cycles and also get the benefit of no unexpected slow-downs during typical use.

See, your logic with this completely escapes me. So let's say you scrub a drive. Here are the 2 possible scenarios (aside from no errors, of course):

1. TLER disabled: The drive will stop and keep retrying. This will place the scrub on hold while the drive tries to solve its problem. Either it will, it won't, or the drive will be dropped from your pool as it is detached from the SATA/SAS controller. Obviously, the last of those is the worst case. If the drive does recover the data and it passes the checksum, the scrub will not be able to determine there is a fault, and you will have a semi-readable sector that continues to store your precious data. If the drive doesn't solve its problem, then ZFS will record the read error and write the correct data to the sector. Now, assuming your drive firmware is doing its job, it will remap that sector to your spare sectors and you will be left with a drive that reads that sector properly.

2. TLER enabled: The drive will stop and keep retrying for up to the TLER limit. This will place the scrub on hold during this time while the drive tries to solve its problem. Either it will, or it will hit the TLER limit and the sector will be reported as bad. ZFS will then write the correct data to the sector using parity data. Again, assuming your drive firmware is doing its job, it will remap that sector and the data written will end up in your spare sectors area. You will be left with a drive that reads that sector properly.

So, leaving TLER enabled is doing 2 positive things:

1. Preventing a drive from being dropped from the pool.
2. Conservatively failing sectors that might end up reading properly if the drive is given enough time to read the sectors.

So explain to me how disabling TLER is a benefit? Because I think we are both arguing opposite sides of the situation, and only 1 of us can be right.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I have a feeling that you are right and that I am confusing you. Maybe I am using the term resilvering wrong. By resilver, I mean rebuild. Not scrub. As in "when your RAIDZ loses a drive, becomes degraded, and you put a new drive back into it and rebuild."

I've been led to believe that if an unrecoverable read error (URE) is encountered during the rebuild, the whole array will die. You won't just lose the file at that location, you will lose the ZPOOL.

To use your format:

1. TLER disabled: You try to rebuild a degraded array and encounter an error. If the error can be recovered, but only by taking longer than the TLER time limit would have allowed, having TLER disabled just saved your array, because it prevented the read error from becoming unrecoverable due to stopping recovery early.

2. TLER enabled: You try to rebuild a degraded array and encounter an error. If the error cannot be recovered within the TLER time limit, the drive will stop trying to recover and you will lose the array. Maybe it could have been recovered if you had given it more time, though.
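The two cases above can be written down as a toy model. This is only a sketch of the reasoning, not of real drive behavior: the function name and the idea of a known "time needed to recover" are assumptions for illustration, and it ignores the drive's own internal retry limit and the possibility of the controller dropping a stalled drive.

```python
def read_outcome(recovery_seconds, erc_limit_seconds=None):
    """Toy model of a single problematic sector read.

    recovery_seconds: time the drive would need to recover the sector,
        or None if the sector can never be read.
    erc_limit_seconds: the TLER/ERC limit, or None if it is disabled.
    """
    if recovery_seconds is None:
        return "read error"  # no amount of retrying helps
    if erc_limit_seconds is not None and recovery_seconds > erc_limit_seconds:
        return "read error"  # recovery was cut short by the limit
    return "recovered"


# A sector that needs 30 seconds of retries to read:
print(read_outcome(30, erc_limit_seconds=7))     # limit hit first
print(read_outcome(30, erc_limit_seconds=None))  # drive keeps trying
print(read_outcome(30, erc_limit_seconds=90))    # longer limit is enough
```

The only inputs where the setting changes the result are sectors whose recovery time falls above the limit, which is exactly the case being argued about.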
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
A URE won't necessarily cause the whole array to die. Corruption in the right places can cause irreparable and catastrophic damage to your pool. For that reason, RAIDZ1 should be avoided. If you are using RAIDZ2+, then your whole argument is moot, because you'd have a second disk's parity to protect you even with a single failed disk.

TLER being disabled during a resilvering process isn't doing you any favors either. Try doing a disk replacement on a pool that already has a failed disk. We've seen people that have had resilvers claim to take decades because of the long delay in error recovery.

Frankly, if you are in a situation where a TLER setting is going to make the difference between your pool living or dying during a resilver, you've already failed at properly configuring and managing risk.
 

scurrier

Patron
Joined
Jan 2, 2014
Messages
297
I think you might have just boiled it down there for me in your last paragraph. Thanks.
 