Why TLER is actually needed


BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
Hi Folks,

I did it! I caught the entries in /var/log/messages when one of my drives was detached by FreeNAS! This would have been resolved if the drive had TLER, but I wanted to show you what it looks like when a drive gets dropped because its commands time out and it doesn't have TLER.

Code:
May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: request 0xfffffe00008ba9b0:22505 timed out for ccb 0xfffff80008565800 (req->ccb 0xfffff80008565800)
May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: attempting to abort req 0xfffffe00008ba9b0:22505 function 0
May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: request 0xfffffe00008bb7c0:22506 timed out for ccb 0xfffff80008603000 (req->ccb 0xfffff80008603000)
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_wait_req(1) timed out
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_recover_commands: abort timed-out. Resetting controller
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x0
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x0
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: completing timedout/aborted req 0xfffffe00008ba9b0:22505
May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: completing timedout/aborted req 0xfffffe00008bb7c0:22506
May 21 08:34:23 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x1b
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): READ(10). CDB: 28 00 bb 1c 9c 00 00 00 08 00
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): WRITE(10). CDB: 2a 00 ba 5f b1 48 00 00 28 00
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout
May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command
May 21 08:34:24 [REDACTED-HOSTNAME] da0 at mpt0 bus 0 scbus0 target 73 lun 0
May 21 08:34:24 [REDACTED-HOSTNAME] da0: <ATA ST32000542AS CC34> s/n             5XW1AALB detached
May 21 08:34:24 [REDACTED-HOSTNAME] GEOM_ELI: Device da0p1.eli destroyed.
May 21 08:34:24 [REDACTED-HOSTNAME] GEOM_ELI: Detached da0p1.eli on last close.
May 21 08:34:24 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Periph destroyed



Now, I'm very confident this is a TLER-related issue because:
  1. This drive DOES NOT have TLER, but I have two other drives in the pool that do.
  2. This drive has done this before, a few months ago. (I can't replace it yet, sadface.)
  3. My two drives that do have TLER have never once done this.

So, what were the symptoms leading up to this?
  1. Lousy IOPS; my VMs were running like garbage.
  2. I could not identify any bottleneck (before I saw this had happened).
  3. zpool status INITIALLY said everything was fine (probably while it was still retrying against the disk).
  4. EVENTUALLY zpool status showed the disk as "REMOVED".
  5. I did read /var/log/messages, but I had to scroll up a lot to find the relevant lines because NFS messages had mostly buried them.

So, what do I do? Well, since I can't replace the drive just yet, and it is probably "fine", I'm going to reboot. That's the simplest way to let the pool sort itself out, and since no data was likely lost I should be fine. What I really should do is replace the two disks without TLER with ones that have it (such as WD Reds), but unfortunately that's not in the cards right now.
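In case it helps anyone else hunting the same thing, this is roughly what I was looking at (a minimal sketch; "tank" is just a placeholder pool name, and da0/mpt0 are the names from the log above):

Code:
# Ask ZFS which device it thinks is missing ("tank" is a placeholder pool name)
zpool status -v tank

# Dig the CAM/mpt timeout lines out of the system log so the NFS chatter
# doesn't bury them (da0/mpt0 are the names from the log above)
grep -E 'mpt0|da0' /var/log/messages | tail -n 50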

Anyway, I thought all you peeps out there would appreciate seeing what it looks like when your drive gets dropped because its commands time out, something resolvable with TLER.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
"resetting controller" kinda makes it look like it might actually be something else. It'd be helpful to have someone more familiar with the mpt failure modes; I've seen plenty of timeouts that didn't result in a drive being dropped but I don't have any handy to compare to at this moment.

I think you might be jumping to conclusions here a bit.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
I'm all ears for alternative explanations. I'm not entirely sure whether "Resetting controller" means the HBA or the controller on the drive itself. But this has only happened on the drives without TLER.

Yeah, if you have more contributing info, by all means share, please :)


"resetting controller" kinda makes it look like it might actually be something else. It'd be helpful to have someone more familiar with the mpt failure modes; I've seen plenty of timeouts that didn't result in a drive being dropped but I don't have any handy to compare to at this moment.

I think you might be jumping to conclusions here a bit.
 

Bigtexun

Dabbler
Joined
Jul 17, 2014
Messages
33
Keep in mind that TLER is a fairly new feature, and most of us chug along for years before we see drive issues. So just because you have drives that *could* cause this symptom, that doesn't mean that is what the issue really is.

But the fact is, if it is the same drive that keeps dropping, you should consider replacing it soon. But other problems can cause issues like this, such as a bad cable (data or power), a bad drive enclosure, or even a problematic port on your drive controller. Try swapping positions with another drive and see whether the problem moves with the drive or not.

I have some machines with really cheap drives in crappy drive enclosures. Under Windows, after a few months of operation, I started dropping drives from a hardware RAID controller. It seemed to be specific to the drive position, and I replaced everything, including the enclosure (with another one just like it). I went down the list: drives (brand and type), cables, enclosures, controllers, controller models/drivers, motherboard, processor, memory, power supply... and the problem kept coming back. The thing that finally fixed it was removing the enclosure.

Funny thing... later I realized that only my Windows machine had the problem, not my FreeNAS systems. But on my FreeNAS machines I didn't plug the enclosures into the controller; instead I use a SAS expander. I swear the SAS expanders seem to handle the enclosure issue better than the controllers do.

But yes, TLER will "mask" a drive failing error correction, and it sounds great for performance reasons. But if your drive is resetting its controller, that usually means the drive is going bad.

A controller reset is usually on the drive itself. The board in your computer isn't really the drive controller; the drive controller is the board on the drive. Ever since IDE drives hit the market, the controller has lived on the drive rather than on the card in the computer... The card in the computer is still called a controller, but for a JBOD or straight SATA device it is more of an I/O interface. We still call them controllers because we never changed the terminology when IDE changed the electronics. Of course there IS a correct term for them, the HBA (host bus adapter), but most people still call them controllers. In the early days of SCSI the controller was a board added onto the drive, and there were things that came before SCSI, but I digress.

I think you are right that TLER is designed to prevent drive drop-outs like the one you are seeing, but if your drive is failing, it is failing. Make sure it isn't something else like a cable or bad power, but TLER isn't going to make a drive stop failing. If your drive had TLER and it was using the feature frequently, that would still mean the drive is failing. But because you don't have TLER, you don't really know what is failing. What about the SMART stats?

Since I have machines with dozens of drives, I developed some methods of quickly looking at and comparing SMART numbers in a way that makes them easy to see side by side, so a failing drive sticks out like a sore thumb.

On each of my FreeNAS systems I have a file called drives, which is just a list of the drives (one /dev node per line), and another file called attributes, which is a list of the SMART attribute names I care about. I run these command lines to build a report for my drives:

Code:
# Dump full SMART output for every drive listed in "drives" into one temp file
cat drives | xargs -n1 smartctl -a > zzz
# Pull out just the attribute lines named in "attributes", grouped by attribute
cat attributes | xargs -I xxx grep xxx zzz


This outputs an easy-to-read, easy-to-compare table that makes reviewing drive status a breeze...

The drives file looks like this:
/dev/da0
/dev/da1
/dev/da2
/dev/da3
...

The attributes file looks like this:
overall-health
Raw_Read_Error_Rate
Throughput_Performance
Spin_Up_Time
Start_Stop_Count
Reallocated_Sector_Ct
Seek_Error_Rate
Seek_Time_Performance
Power_On_Hours
Spin_Retry_Count
Power_Cycle_Count
G-Sense_Error_Rate
Power-Off_Retract_Count
Load_Cycle_Count
Temperature_Celsius
Reallocated_Event_Count
Current_Pending_Sector
Offline_Uncorrectable
UDMA_CRC_Error_Count
Disk_Shift
Loaded_Hours
Load_Retry_Count
Load_Friction
Load-in_Time
Head_Flying_Hours

I think the attributes may change for different versions of smartctl or different drive vendors; I built mine by editing the SMART output of a single drive.

My temporary file "zzz" is a full report on each drive, which is handy to look at if I want to drill deeper into whatever my quick-table command lines expose.

Of course this works best if your drives are all the same model, so that the numbers have the same meaning. There is a lot of variation between manufacturers and models in what gets reported in SMART and what it looks like.

But TLER is a band-aid feature that will cause you to not notice a problem until the drive is completely dead. The problem with dead drives is they fail in clusters. Drives don't just drop dead, they slowly die. If one drive is dying, you may have several dying. If you wait until a drive is totally dead, you may find it impossible to resilver a new drive because several drives have overlapping unreadable segments. Media failures often occur in the same spots for the same reasons. In the old RAID 5 days, you only had a complete array with ALL drives running; if you lost one drive, you lost data, because the odds were that there was at least one bad sector on another drive. RAID 5 was bad because people thought it meant you could recover from a dead drive. RAID 6 was the first real scheme that could recover from a failed drive, because you still had error correction with one bad drive... but even RAID 6 can fail you.

I've been dealing with unreadable backup tapes, failed RAID 5, and all manner of other ways to discover you've lost your data since the dawn of hard drives. I was one of the guys who recovered files from drives that were full of powdered aluminum after a head crash... I have not had much luck doing that on modern drives; the best plan is to prevent the problem in the first place, like by replacing your failing drives before you have a second failure. TLER is not a fix; it is more a way to cover an error up until it is so bad you have lost parity and some or all of your data. If you have TLER, you had better be well backed up (and not with tapes; tapes only work while the tape drives are still new, and I haven't seen a reliable tape drive since the days of the 9-track), and you had better have alarms that wake you up at night to tell you when TLER events are occurring.

I love FreeNAS, only because it simplifies standing up a FreeBSD system to hold my data... it lets me create SANs and backup solutions. But it is only a tool, and it is only as good as I make it. I believe in being well backed up: in parity, geography, and system redundancy. So my opinions about a failing drive are a little extreme; I have been losing data on hard drives, tapes, and other media since 1978. My backups have backups... So forgive me if I'm a bit old and cranky and ramble on about this like a geezer with dementia. ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Keep in mind that TLER is a fairly new feature,

??!!???!!!!

WHAT??????????

Maybe if you're counting in decades, or if you're being super-pedantic and talking about the actual marketing term "TLER". The current time-bounded recovery options trace directly back to the early days of SCSI, where they were implemented as READ RETRY COUNT and RECOVERY TIME LIMIT ... I'm positive I could grab a 20-year-old SCSI drive from inventory and find this feature.
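For the curious, those fields are still visible on a SCSI/SAS drive today. A minimal sketch, assuming FreeBSD's camcontrol and a SCSI/SAS device at da0 (SATA drives expose the equivalent through SCT ERC instead):

Code:
# Show the read-write error recovery mode page (page 0x01); the read retry
# count and recovery time limit fields live here.
camcontrol modepage da0 -m 1

# Add -e to edit the page in $EDITOR (be careful on a live pool).
camcontrol modepage da0 -m 1 -e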

Make sure it isn't something else like a cable or bad power, but TLER isn't going to make a drive stop failing.

That's like saying the temperature sensor in a drive isn't there to let you know what the weather's like outside. It's literally true but it's a weird thing to say.

TLER has nothing to do with stopping a drive from failing. It has to do with mitigating the damage done to the service level being provided by the server. In some cases you can get a beautiful cascade failure going if the underlying storage is acting crappy: it causes a hypervisor's iSCSI stack to decide the datastore's dead, which then works its way up the chain by timing out a virtual machine's virtual SCSI bus, which a crappy Linux guest fails to handle well, and that corrupts your VM's virtual disk... all because the frickin' lowest-level I/O subsystem couldn't exhibit reasonable behaviour. (Yes, I know I stacked several shouldn't-happen things, but we're actually discussing one right now, so, fair play.)

But TLER is a band-aid feature that will cause you to not notice a problem until the drive is completely dead.

This implies that you have some very strange ideas about TLER and identifying server problems.

You're NOT supposed to notice a failing drive by your NAS having a stroke and freezing for a minute or two. You're SUPPOSED to notice a failing drive when SMART signals that there are undesirable trends in the fail-y sector counts.

TLER is definitely NOT a band-aid feature. It is a feature that's not necessarily important in every case, because in some environments it is fine for the NAS to stall when a sector's having a problem, and if it is your home NAS and it's just storing your family pictures and your backups, then TLER is of questionable value. However, for a departmental file server, or an iSCSI storage server, or an NFS server acting as an application backend, going catatonic for a minute at a time isn't an acceptable behaviour, and TLER/ERC/etc are the primary mechanisms to provide some bounds to the amount of time it takes to complete a disk transaction.
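To make that concrete, here is a minimal sketch of checking and setting the ATA-side equivalent (SCT ERC) with smartmontools. Assumptions: the drive actually supports SCT ERC, da0 is borrowed from the log earlier in the thread, and on many drives the setting does not survive a power cycle:

Code:
# Query the drive's SCT Error Recovery Control settings (the ATA-world TLER)
smartctl -l scterc /dev/da0

# Ask for a 7.0 second read/write recovery limit (the units are 100 ms).
# Desktop firmware that lacks the feature will simply refuse the command.
smartctl -l scterc,70,70 /dev/da0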

You definitely shouldn't be relying on TLER to cause a server to "keep working until it dies" if there's a failing drive, but the point of TLER is definitely to cause a server to "keep working even though a problem has developed" when there's a failing drive. What you've written seems to suggest you're fixated on the "keep working" aspect. That's not it at all. It is what happens *next* that's important. You've assumed that it is "and the admin won't give a fsck and keeps running it until dead."

The problem with dead drives is they fail in clusters. Drives don't just drop dead, they slowly die.

That's a lot more applicable if you drink the homogeneous pool kool-aid. If you're building a heterogeneous pool, like you should, then this isn't as large a risk unless maybe you're improperly cooling your drives (or whacking on your chassis with a hammer) and causing all the drives to experience adverse physical conditions.
 

BloodyIron

Contributor
Joined
Feb 28, 2013
Messages
133
1. TLER isn't new.
2. I've done a bunch more study since my original post, and TLER isn't as essential as I originally understood. But it is if performance is important, e.g. hosting VM images.
3. The problem is that the drop is erratic, and the last time it happened was months ago. I don't see sufficient indication that the drive is near failure. SMART testing (which happens every 5 days or so) shows it's fine.
4. TLER doesn't mask a failure; that's not what it does. It gives your environment the ability to maintain a certain level of performance while an error gets corrected. This does not hide a failing drive in any way, as you still see the SMART stats for remaps and the like. It doesn't hide other types of failures either, like controller issues (which seems to be my case; I diagnosed it improperly at first).
5. Your SMART aggregation method is pretty neat. Surely there's more you can do to streamline it though? (One possible variation is sketched below.)
6. I'm not going to run this until a drive is dead. I've already replaced 2 drives in this system after being informed of PFAs (Pre-Failure Alerts), but that was over a year ago IIRC.
7. Nothing wrong with hearing from someone who's done a whole lot of IT. We may disagree on some things, but it's nice to hear all sorts of interesting things. :)
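On point 5, one possible way to compress the two-file approach into a single loop (a minimal sketch; it assumes the disks show up as /dev/da0 through /dev/da9 and picks just a few attributes as examples):

Code:
# Print a handful of the interesting SMART counters next to each device name
for d in /dev/da?; do
  echo "== $d"
  smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
done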



 