READ_FPDMA_QUEUED on WD Red. RMA or wait & watch?

Status
Not open for further replies.

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
One of the 2+ year old 4TB WD Reds in my HP N36L (specs in sig) logged this during a bi-weekly scrub:

Code:
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): Error 5, Retries exhausted


ZFS repaired it (the timestamps match exactly):

Code:
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
   still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
   the pool may no longer be accessible by software that does not support
   the features. See zpool-features(7) for details.
  scan: scrub repaired 128K in 4h56m with 0 errors on Mon Jan 16 04:57:02 2017
config:

   NAME  STATE  READ WRITE CKSUM
   tank  ONLINE  0  0  0
	raidz2-0  ONLINE  0  0  0
	  gptid/39f2dbfd-4794-11e4-8a24-68b59972b65f  ONLINE  0  0  0
	  gptid/3adbcda1-4794-11e4-8a24-68b59972b65f  ONLINE  0  0  0
	  gptid/3bc8e677-4794-11e4-8a24-68b59972b65f  ONLINE  0  0  0
	  gptid/3cb63aab-4794-11e4-8a24-68b59972b65f  ONLINE  0  0  0

However, the disk passed a long SMART test the very next day, even though the error is logged in the SMART data (smartctl -x /dev/ada0).
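For anyone who wants to check what the drive actually reported, the ACB and RES byte dumps in the kernel log above can be decoded by hand. What follows is a minimal Python sketch, assuming the byte order that FreeBSD's CAM layer appears to use when printing these lines (listed in the comments) and assuming 512-byte logical sectors. Neither is stated in this thread, and the function names are made up for illustration.

```python
# Sketch: decode the ACB/RES hex dumps from a FreeBSD CAM error for an NCQ command.
# ASSUMED byte order (not confirmed anywhere in this thread):
#   ACB: command, features, lba_low, lba_mid, lba_high, device,
#        lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
#        sector_count, sector_count_exp
#   RES: status, error, lba_low, lba_mid, lba_high, device,
#        lba_low_exp, lba_mid_exp, lba_high_exp, sector_count, sector_count_exp

STATUS_BITS = ["ERR", "INDEX", "CORR", "DRQ", "SERV", "DF", "DRDY", "BSY"]  # bit 0..7
ERROR_BITS = ["ILI", "NM", "ABRT", "MCR", "IDNF", "MC", "UNC", "ICRC"]      # bit 0..7


def parse_hex(dump):
    """Turn '60 00 f8 ...' into a list of integers."""
    return [int(token, 16) for token in dump.split()]


def flags(value, names):
    """Names of the bits set in an 8-bit register, highest bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


def lba48(b):
    """48-bit LBA from bytes 2..4 (low/mid/high) and 6..8 (the *_exp bytes)."""
    return (b[8] << 40) | (b[7] << 32) | (b[6] << 24) | (b[4] << 16) | (b[3] << 8) | b[2]


def decode_acb(dump):
    b = parse_hex(dump)
    # For NCQ commands the sector count travels in the features fields;
    # a count of 0 means 65536 sectors.
    sectors = ((b[9] << 8) | b[1]) or 65536
    return {"command": b[0], "lba": lba48(b), "sectors": sectors}


def decode_res(dump):
    b = parse_hex(dump)
    return {
        "status": flags(b[0], STATUS_BITS),
        "error": flags(b[1], ERROR_BITS),
        "lba": lba48(b),
    }


if __name__ == "__main__":
    acb = decode_acb("60 00 f8 fb f8 40 04 00 00 01 00 00")
    res = decode_res("41 40 a0 fc f8 40 04 00 00 00 00")
    print(hex(acb["lba"]), acb["sectors"], res["error"], hex(res["lba"]))
```

Under those assumptions, the first log entry is a 256-sector (128 KiB) read starting at LBA 0x04f8fbf8, and the drive flags LBA 0x04f8fca0 inside that range as unreadable (UNC). That lines up with the 128K the scrub reports as repaired.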

As somewhat of an aside, shouldn't this be detected by SMART tests, since the error is logged in SMART data? I got to know about this from the "daily security run output" email containing kernel logs.

Even after extensive googling, I'm not sure whether this is a failing disk, something to do with loose cables or the PSU, or just a one-off disk error that ZFS has fully handled. But considering this is a backup box running RAID-Z2, I guess there's no need to press the RMA button yet? And in any case, since nothing too bad shows up in "smartctl -a /dev/ada0", I guess it wouldn't qualify anyway.
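One rough way to separate "failing disk" from "loose cable" is to look at which SMART counters are non-zero: reallocated, pending, and offline-uncorrectable sectors point at the media, while UDMA CRC errors point at the link. The sketch below parses the attribute table printed by `smartctl -A`. The watched-attribute list and the media/link labels are my own choice, and the sample rows are made up for illustration; they are not this drive's real values.

```python
# Sketch: pick out non-zero SMART counters that hint at media vs. link trouble.
# The attribute names are the usual smartctl labels; the "media"/"link" tags
# are an illustrative grouping, not an official classification.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "link",
}


def nonzero_watched(smartctl_text):
    """Return {attribute: (kind, raw_value)} for watched attributes with raw > 0."""
    hits = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = (WATCHED[name], int(raw))
    return hits


# MADE-UP sample rows in smartctl's table layout, for demonstration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

To use it on a real drive, feed it the saved output of `smartctl -A /dev/ada0` instead of the sample text.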

Any thoughts? Am I reading the situation correctly? Btw, is there a link to WD's RMA policy/process somewhere?

Regards,
Saurav.
 

JackShine

Dabbler
Joined
Nov 13, 2014
Messages
27
Your drives look fine.

SMART is a ridiculously cryptic utility, I just look at the temp. And spin up. And that.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Definitely not normal. It sounds like a bad contact on the SATA connectors (sometimes it's a bad contact on the power connector instead). You should try reseating both ends of the cable and/or changing the cable.
 

saurav

Contributor
Joined
Jul 29, 2012
Messages
139
In this box, the disks in drive bays connect directly to SATA connectors on the backplane, which connects to the mobo through a single miniSAS cable. I can re-seat the disk itself, and maybe the miniSAS too.

HP ProLiant MicroServer - a quick pictorial tour
 

Henius

Cadet
Joined
Jun 10, 2016
Messages
7
Having the same issue. WD Reds are connected via SATA directly to MB. The box was sitting still without any interference.

Code:
> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 00 f0 f1 40 6c 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 b0 47 11 40 3c 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 b0 0a ad 40 51 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command


I'm not sure how to resolve it. The errors are annoying, but the whole pool seems to be healthy.

Any advice?
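Note that the logs in this thread are not all the same kind of error. The first post shows the drive itself reporting an unreadable sector (error: 40, UNC), whereas "Uncorrectable parity/CRC error" and "Timeout on slot" are reported on the host side and more often implicate the cable, backplane, or port. Below is a rough Python sketch that sorts kernel log lines into those three buckets. The bucket names and the interpretation attached to each are assumptions for illustration, not an official mapping.

```python
import re
from collections import Counter

# "error: 40 (UNC )" -> capture the two hex digits of the ATA error register.
ATA_ERROR_RE = re.compile(r"error: ([0-9a-fA-F]{2})\b")
UNC_BIT = 0x40  # uncorrectable data error reported by the drive


def classify(line):
    """Return 'timeout', 'link', 'media', or None for an unrecognised line."""
    if "Timeout on slot" in line:
        return "timeout"   # command never completed: port, cable, or drive hang
    if "Uncorrectable parity/CRC error" in line:
        return "link"      # transfer corrupted between drive and controller
    match = ATA_ERROR_RE.search(line)
    if match and int(match.group(1), 16) & UNC_BIT:
        return "media"     # the drive could not read its own sector
    return None


def summarize(log_text):
    """Count how many lines fall into each bucket."""
    kinds = (classify(line) for line in log_text.splitlines())
    return Counter(kind for kind in kinds if kind is not None)
```

For example, `summarize(open(path).read())` on a saved copy of the kernel log (where `path` is wherever you saved it) gives a quick picture of which failure mode dominates.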
 

vryeksksk

Dabbler
Joined
Apr 20, 2018
Messages
10
I have the exact same issue on one of my 4TB WD Red drives. The logs are spammed with these errors:

Code:
> ahcich8: Timeout on slot 14 port 0
> ahcich8: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 58 serr 00080000 cmd 0004ce17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b8 e0 6b 20 40 69 00 00 00 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> ahcich8: Timeout on slot 31 port 0
> ahcich8: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 58 serr 00080000 cmd 0004df17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 e0 1e 91 40 68 00 00 01 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b0 e0 1f 91 40 68 00 00 00 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> ahcich8: Timeout on slot 31 port 0
> ahcich8: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 58 serr 00080000 cmd 0004df17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68 7a 28 40 69 00 00 01 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00


Meanwhile, the whole pool is healthy and SMART shows nothing of concern.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Have you eliminated cabling, backplanes and the SATA controller?
 

vryeksksk

Dabbler
Joined
Apr 20, 2018
Messages
10
Ericloewe said:
Have you eliminated cabling, backplanes and the SATA controller?
This probably wasn't directed at me but I'll reply too.
I connect directly to the mobo with SATA cables bought in one batch, but I do admit they are cheap as fuck. I will try to replace them with something better and report back "soon".

Oh, and I just noticed that OP said it's one of his 2-year-old drives; my drive is brand new. Of 6 WD Reds bought in a single batch, only one has this problem.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
saurav said:
Even after extensive googling, I'm not sure whether this is a failing disk, something to do with loose cables or the PSU, or just a one-off disk error that ZFS has fully handled. But considering this is a backup box running RAID-Z2, I guess there's no need to press the RMA button yet? And in any case, since nothing too bad shows up in "smartctl -a /dev/ada0", I guess it wouldn't qualify anyway.
I had a server that was causing problems like this constantly. Eight of the 60 drives in the server were affected. We initially replaced three of the drives thinking the drives were at fault, but when the errors persisted, we had the vendor replace the disk controller that ran the affected bank of drives. That still didn't correct the fault and the only thing remaining was the cables between the disks and the controllers. Since that component was integral to the chassis, and the server was under three months old, the vendor replaced the entire chassis. The same disks we were originally having problems with are still working great in the new chassis and it is about 11 months since we had a disk fault of any kind in that system. That probably isn't an option for you, but the cables connecting the drives could be at fault if you are having communications issues.
 

vryeksksk

Dabbler
Joined
Apr 20, 2018
Messages
10
Okay guys, I did some troubleshooting and it turns out it's not the SATA cable or the drive. It's a goddamn SATA port on the motherboard. Now I'm not sure how RMA-worthy that is on an otherwise perfectly working mobo.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Unless you want to install a SAS HBA. You can run SATA drives from a SAS controller. That is what I have in almost all the servers I tend to for work and at home.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm in favor of replacing the motherboard anyway, if it's in warranty.
 

vryeksksk

Dabbler
Joined
Apr 20, 2018
Messages
10
Ericloewe said:
I'm in favor of replacing the motherboard anyway, if it's in warranty.
Yeah, it is. It's a brand new (one month old) ASRock AB350 Pro4.

Now, could you guys give me some tips on how I would go about proving that it's broken?

If the store tests it on Windows, I'm like 99% sure Windows won't complain about the port like FreeNAS does.

Also, I think this problem with the port is what causes my FreeNAS to hang on some operations (my other thread).
Moreover, it looks like this error also causes FreeNAS to send this alert:

Code:
New alerts:
* Device: /dev/ada4, not capable of SMART self-check

Alerts:
* Device: /dev/ada4, not capable of SMART self-check


And once I got this:
Code:
New alerts:
* Device: /dev/ada4, unable to open ATA device
* Device: /dev/ada5, unable to open ATA device

Alerts:
* Device: /dev/ada4, unable to open ATA device
* Device: /dev/ada5, unable to open ATA device

But after a reboot it started working correctly.


Also, another question: shouldn't the dashboard graphs show something? On my system they are not working at all. They either show nothing or show incomplete data, while Netdata shows everything correctly.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, show them the errors and explain that the port is wonky.
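One way to put numbers behind that is to tally failed commands per AHCI channel, then repeat the tally after moving the drive and cable to a different port. If the errors stay with the channel rather than following the drive, that supports the bad-port theory. Here is a minimal Python sketch, assuming the log lines keep the "(adaN:ahcichM:...)" prefix seen in this thread; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the CAM peripheral prefix, e.g. "(ada4:ahcich8:0:0:0)".
PERIPH_RE = re.compile(r"\((ada\d+):(ahcich\d+):\d+:\d+:\d+\)")


def failures_per_channel(log_text):
    """Count 'CAM status:' lines per AHCI channel (one per failed command attempt)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PERIPH_RE.search(line)
        if match and "CAM status:" in line:
            counts[match.group(2)] += 1
    return counts
```

Running it on a log captured before the swap and on one captured after it gives two small tables that are easy to attach to an RMA request alongside the raw errors.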
 