New drive removed

CyberGar · Feb 20, 2014

Recently replaced a drive with unreadable errors. RMA sent a new drive. Replaced and resilvered it. Took about 6 hrs. During the resilvering, I saw on the console CAM status: ATA Status Error, Error 5, Retries exhausted. I had been seeing these errors with the old drive also, and guessed that this was just another sign it was failing.
Now I see the dreaded "lost device" line along with ATA Status Error lines... It's pretty ugly on that screen...
In the FreeNAS Shell, I ran zpool status and it claims to be resilvering the drive that is labeled "REMOVED"... The volume is degraded and I don't know if FreeNAS even knows what's going on with that drive right now...
I did try to reboot to see if it would redetect the drive. Perhaps it did, started resilvering it, had it fail during that, and now it's not yet aware that it's failed...

ugh...

Is there something I need to do other than RMA another drive? Are there settings that can help handle a drive that might be slow to respond? I am using WD 3TB Green drives. I know they can spin down. I'm wondering if FreeNAS is not giving them enough time to respond before tossing them out... My other two drives seem to be ticking along fine so far...

Build: FreeNAS-9.2.0-RELEASE-x64
Platform: AMD A4-5300 APU with Radeon
Memory: 15802MB
Load Average: 0.44, 0.35, 0.31

cyberjock · Feb 20, 2014

I'd try a different SATA cable first.

And I'd post the output of your smartctl -a /dev/XXXX to see what is going on internally to the disk.

CyberGar · Feb 20, 2014

Is there a recommendation for a cable?

I cannot run a smartctl because the disk is shown as REMOVED.

Code:

zpool status:
  pool: DataVolume1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 57.9G in 4h11m with 0 errors on Thu Feb 20 10:17:11 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        DataVolume1                                     DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dbc8cc9a-757e-11e2-a365-d43d7e53b87c  ONLINE       0     0     0
            6572269270920856019                         REMOVED      0     0     0  was /dev/gptid/9afb262d-983b-11e3-9c87-d43d7e53b87c
            gptid/ddaa8916-757e-11e2-a365-d43d7e53b87c  ONLINE       0     0     0

errors: No known data errors
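
For anyone who wants to check a pool like this from a script rather than by eye, here is a rough Python sketch that pulls every non-ONLINE row out of the config section of `zpool status` text. The function name `unhealthy_devices`, the constant `SAMPLE_STATUS`, and the short list of states it recognizes are assumptions made for illustration; the sample rows are copied from the output in this post.

```python
# Hypothetical helper: scan the "config:" section of `zpool status` text
# and report rows whose STATE column is something other than ONLINE.
# The set of states below is an assumed, partial list.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}

SAMPLE_STATUS = """\
config:

        NAME                                            STATE     READ WRITE CKSUM
        DataVolume1                                     DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/dbc8cc9a-757e-11e2-a365-d43d7e53b87c  ONLINE       0     0     0
            6572269270920856019                         REMOVED      0     0     0  was /dev/gptid/9afb262d-983b-11e3-9c87-d43d7e53b87c
            gptid/ddaa8916-757e-11e2-a365-d43d7e53b87c  ONLINE       0     0     0

errors: No known data errors
"""


def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows that are not ONLINE."""
    in_config = False
    found = []
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            break
        fields = stripped.split()
        # Skip everything before the config section, blank lines, and the header row.
        if not in_config or len(fields) < 2 or fields[0] == "NAME":
            continue
        name, state = fields[0], fields[1]
        if state in KNOWN_STATES and state != "ONLINE":
            found.append((name, state))
    return found
```

Calling `unhealthy_devices(SAMPLE_STATUS)` returns the pool, the raidz1 vdev, and the REMOVED member, since all three rows carry a non-ONLINE state.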

CyberGar · Feb 21, 2014

Maybe I pissed someone off... The word "REMOVED" was in upper case because that's how it was in the status. Not yelling or trying to be a smartass...

I changed out the SATA cable to the drive.

When I fired it up, FreeNAS had the following warning:
The volume DataVolume1 (ZFS) status is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. Wait for the resilver to complete.

I thought this may be a good sign...

Alas, my hopes were short-lived....

About 5 or 10 minutes into the resilvering, this starts appearing on the screen:

Code:

Feb 21 19:56:12 GarNAS kernel: ahcich1: Timeout on slot 23 port 0
Feb 21 19:56:12 GarNAS kernel: ahcich1: is 00000000 cs 00000000 ss 00800000 rs 00800000 tfd 40 serr 00000000 cmd 0000f717
Feb 21 19:56:12 GarNAS kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 38 d6 8b 40 17 00 00 00 00 00
Feb 21 19:56:12 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 19:56:12 GarNAS kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 21 19:56:43 GarNAS kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 21 19:57:37 GarNAS kernel: ahcich1: Timeout on slot 26 port 0
Feb 21 19:57:37 GarNAS kernel: ahcich1: is 00000000 cs 00000000 ss 04000000 rs 04000000 tfd 40 serr 00000000 cmd 0000fa17
Feb 21 19:57:37 GarNAS kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 38 69 8c 40 17 00 00 01 00 00
Feb 21 19:57:37 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 19:57:37 GarNAS kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 21 19:58:08 GarNAS kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 21 19:58:51 GarNAS kernel: ahcich1: Timeout on slot 29 port 0
Feb 21 19:58:51 GarNAS kernel: ahcich1: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0000fd17
Feb 21 19:58:51 GarNAS kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 21 19:58:51 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 19:58:51 GarNAS kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 21 19:59:22 GarNAS kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 21 20:00:01 GarNAS kernel: ahcich1: Timeout on slot 31 port 0
Feb 21 20:00:01 GarNAS kernel: ahcich1: is 00000000 cs 00000000 ss 80000000 rs 80000000 tfd 40 serr 00000000 cmd 0000ff17
Feb 21 20:00:01 GarNAS kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 00 c1 8d 40 17 00 00 01 00 00
Feb 21 20:00:01 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 20:00:01 GarNAS kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 21 20:00:33 GarNAS kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Feb 21 20:01:18 GarNAS kernel: ahcich1: Timeout on slot 31 port 0
Feb 21 20:01:18 GarNAS kernel: ahcich1: is 00000000 cs 00000000 ss 80000000 rs 80000000 tfd 40 serr 00000000 cmd 0000ff17
Feb 21 20:01:18 GarNAS kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b8 14 eb 40 82 00 00 00 00 00
Feb 21 20:01:18 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 20:01:18 GarNAS kernel: (ada1:ahcich1:0:0:0): Retrying command
Feb 21 20:01:50 GarNAS kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)

I can only assume the new drive is also bad.

Could it be a bad SATA port? It was fine for nearly a year before I started seeing these messages and errors.

It's still attempting to resilver and the zpool status says there are no data errors... This is very confusing...
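
The kernel messages above repeat the same pattern, so a quick tally can make it easier to see which channel and disk are involved. The Python sketch below counts "Timeout on slot" lines per AHCI channel and the commands that were in flight per disk. The names `summarize`, `TIMEOUT_RE`, `COMMAND_RE`, and `SAMPLE_LOG` are made up for this example, and the regular expressions only assume the line format shown in this post.

```python
import re
from collections import Counter

# Patterns assume the exact message layout seen in the log posted above.
TIMEOUT_RE = re.compile(r"kernel: (ahcich\d+): Timeout on slot \d+ port \d+")
COMMAND_RE = re.compile(r"kernel: \((ada\d+):ahcich\d+:[\d:]+\): ([A-Z_0-9]+)\. ACB:")

# Two of the events from the posted log, copied as-is.
SAMPLE_LOG = """\
Feb 21 19:56:12 GarNAS kernel: ahcich1: Timeout on slot 23 port 0
Feb 21 19:56:12 GarNAS kernel: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 38 d6 8b 40 17 00 00 00 00 00
Feb 21 19:56:12 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Feb 21 19:58:51 GarNAS kernel: ahcich1: Timeout on slot 29 port 0
Feb 21 19:58:51 GarNAS kernel: (ada1:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 21 19:58:51 GarNAS kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
"""


def summarize(log_text):
    """Count timeouts per channel and timed-out commands per (disk, command)."""
    timeouts = Counter()
    commands = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = COMMAND_RE.search(line)
        if match:
            commands[(match.group(1), match.group(2))] += 1
    return timeouts, commands
```

If every timeout lands on one channel and one disk, as it does in the sample, that narrows things down to that drive, its cable, or its port, but the tally alone cannot tell those three apart.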

pirateghost · Feb 21, 2014

You should use code tags to post that output

cyberjock · Feb 22, 2014

Nope, you didn't piss anyone off with the caps. I use it the same way to describe "DETACHED", "REMOVED", and "UNAVAIL" statuses. :)

That new error looks like ada1 is failing now. Now you see why I asked for the smartctl -a /dev/XXX command output.

Edit: It could be the cable, but the smartctl outputs would be helpful in making that assessment. The condition of the pool has no bearing on running smartctl commands. If the hard drive is listed by "camcontrol devlist" then you should be able to run smartctl. If you can't, you're probably using a RAID controller that doesn't support passing SMART commands.

CyberGar · Feb 22, 2014

The drive has disappeared from the list (View Disks). I think this drive is just a complete lemon...

Code:

[root@GarNAS ~]# camcontrol devlist
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus2 target 0 lun 0 (ada2,pass2)
< 1100>                            at scbus4 target 0 lun 0 (pass3,da0)

Doesn't show up in device list there...
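
As a sketch of the check described earlier in the thread (a disk has to appear in `camcontrol devlist` before smartctl can talk to it), the Python below parses devlist text and reports which peripheral names are present. `listed_devices`, `DEVLIST_RE`, and `SAMPLE_DEVLIST` are hypothetical names, and the pattern only assumes the line layout shown in this post.

```python
import re

# Assumed layout: <model> at scbusN target N lun N (name1,name2)
DEVLIST_RE = re.compile(
    r"<(?P<model>[^>]*)>\s+at scbus\d+ target \d+ lun \d+ \((?P<names>[^)]*)\)"
)

# Copied from the devlist output posted above.
SAMPLE_DEVLIST = """\
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus2 target 0 lun 0 (ada2,pass2)
< 1100>                            at scbus4 target 0 lun 0 (pass3,da0)
"""


def listed_devices(devlist_text):
    """Map each peripheral name (ada0, pass0, ...) to its model string."""
    devices = {}
    for match in DEVLIST_RE.finditer(devlist_text):
        model = match.group("model").strip()
        for name in match.group("names").split(","):
            devices[name] = model
    return devices
```

With the sample text, `"ada1" in listed_devices(SAMPLE_DEVLIST)` is false, which matches what the output shows: the system no longer sees that disk at all, so there is nothing for smartctl to query.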

Think I will replace with new Seagate or something. Afraid the RMA disks WD is sending are refurbs or something. Can't really trust that.

I read through your PowerPoint for noobs and aside from the ECC RAM, I think I have a pretty good setup going. I've been able to get 90+ MB/s transfers. Just using it for home storage of movies, etc. AND I have an external drive for backup. Trying to do this right and learn some things along the way.

I DO have a RAIDz1 only because I could not afford 6 3TB drives when I built it.

Anyway... I appreciate the help so far and the great info in your Guide!!!!

Current NAS:
MSI FM2-A75MA-E35 Micro ATX Mobo (6x 6Gb/s SATA)
AMD A4-5300 Trinity 3.4GHz FM2 Dual-Core APU
16GB GeIL EVO CORSA DDR3 1600 RAM
3x WD Green WD30EZRX

Motherboard doesn't support ECC, so may have to look at building new rig this year...

Thanks again!

cyberjock · Feb 22, 2014

You do realize that a backup won't save you if you use non-ECC RAM and the RAM goes bad? Consider yourself warned...

And RAIDZ1 is 'dead'. See the link in my signature. Just chatted in the forums yesterday with someone who had a RAIDZ1, one disk failed, and his pool was lost completely due to some kind of corruption that wasn't repairable because there was no redundancy left. He thought he could beat the odds (everyone always thinks that...) and now he'll never do that again. He also has no data, but that's not a ZFS problem.

I'm sorry, but I disagree with you when you say "aside from ECC RAM, I think I have a pretty good setup going".
