BloodyIron
Contributor
- Joined
- Feb 28, 2013
- Messages
- 133
Hi Folks,
I did it! I caught the logs of /var/log/messages when one of my drives was detatched by FNAS! This would have been resolved if the drive had TLER. But I wanted to show you what it looks like when a drive times out because it doesn't have TLER.
Now, I'm very confident this is a TLER related issue because:
So, what were the symptoms leading up to this?
Anyways, I thought all you peeps out there would appreciate seeing what it looks like when your drive drops due to timing out of commands, something resolvable by TLER.
I did it! I caught the logs of /var/log/messages when one of my drives was detatched by FNAS! This would have been resolved if the drive had TLER. But I wanted to show you what it looks like when a drive times out because it doesn't have TLER.
Code:
May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: request 0xfffffe00008ba9b0:22505 timed out for ccb 0xfffff80008565800 (req->ccb 0xfffff80008565800) May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: attempting to abort req 0xfffffe00008ba9b0:22505 function 0 May 21 08:34:10 [REDACTED-HOSTNAME] mpt0: request 0xfffffe00008bb7c0:22506 timed out for ccb 0xfffff80008603000 (req->ccb 0xfffff80008603000) May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_wait_req(1) timed out May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_recover_commands: abort timed-out. Resetting controller May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x0 May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x0 May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: completing timedout/aborted req 0xfffffe00008ba9b0:22505 May 21 08:34:11 [REDACTED-HOSTNAME] mpt0: completing timedout/aborted req 0xfffffe00008bb7c0:22506 May 21 08:34:23 [REDACTED-HOSTNAME] mpt0: mpt_cam_event: 0x1b May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): READ(10). CDB: 28 00 bb 1c 9c 00 00 00 08 00 May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): WRITE(10). CDB: 2a 00 ba 5f b1 48 00 00 28 00 May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): CAM status: Selection Timeout May 21 08:34:23 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Retrying command May 21 08:34:24 [REDACTED-HOSTNAME] da0 at mpt0 bus 0 scbus0 target 73 lun 0 May 21 08:34:24 [REDACTED-HOSTNAME] da0: <ATA ST32000542AS CC34> s/n 5XW1AALB detached May 21 08:34:24 [REDACTED-HOSTNAME] GEOM_ELI: Device da0p1.eli destroyed. May 21 08:34:24 [REDACTED-HOSTNAME] GEOM_ELI: Detached da0p1.eli on last close. May 21 08:34:24 [REDACTED-HOSTNAME] (da0:mpt0:0:73:0): Periph destroyed
Now, I'm very confident this is a TLER related issue because:
- This drive DOES NOT have TLER, but I have two other drives in the pool that do.
- This drive has done this before, a few months ago. (I can't replace it yet, sadface)
- My two drives that do have TLER have not once done this
So, what were the symptoms leading up to this?
- Shitty IOPS, my VMs were running like poop
- I could not identify any bottleneck (before I saw this happened)
- zpool status INITIALLY said shit was fine (probably while it was retrying with the disk)
- EVENTUALLY zpool status showed the disk as "REMOVED"
- Well, I read /var/log/messages, but I had to scroll up a lot to find it as NFS messages hid it, kinda
Anyways, I thought all you peeps out there would appreciate seeing what it looks like when your drive drops due to timing out of commands, something resolvable by TLER.