READ_FPDMA_QUEUED on WD Red. RMA or wait & watch?

saurav · Jan 17, 2017

So one of my 2+ year old 4TB WD Reds in my HP N36L (specs in sig) logged this during a bi-weekly scrub:

Code:

Jan 16 04:42:52 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:52 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:56 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:42:59 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:43:03 (ada0:ahcich0:0:0:0): Retrying command
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 fb f8 40 04 00 00 01 00 00
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): RES: 41 40 a0 fc f8 40 04 00 00 00 00
Jan 16 04:43:06 (ada0:ahcich0:0:0:0): Error 5, Retries exhausted

Which ZFS repaired (the timestamps match exactly):

Code:

  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
   still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
   the pool may no longer be accessible by software that does not support
   the features. See zpool-features(7) for details.
  scan: scrub repaired 128K in 4h56m with 0 errors on Mon Jan 16 04:57:02 2017
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/39f2dbfd-4794-11e4-8a24-68b59972b65f  ONLINE       0     0     0
        gptid/3adbcda1-4794-11e4-8a24-68b59972b65f  ONLINE       0     0     0
        gptid/3bc8e677-4794-11e4-8a24-68b59972b65f  ONLINE       0     0     0
        gptid/3cb63aab-4794-11e4-8a24-68b59972b65f  ONLINE       0     0     0
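
Editor's sketch: the ACB and RES bytes in the kernel log above can be decoded to see which sector the drive reported as unreadable. This assumes the byte order FreeBSD CAM appears to use when printing these lines (listed in the comments) and 512-byte logical sectors; both are assumptions, not something stated in the thread. The helper names are made up for illustration.

```python
# Assumed byte order of the kernel log fields (not confirmed by the thread):
#   ACB: cmd feat lba0 lba1 lba2 dev lba3 lba4 lba5 feat_exp count count_exp
#   RES: status error lba0 lba1 lba2 dev lba3 lba4 lba5 count count_exp
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def parse_hex(text):
    """Turn '60 00 f8 ...' into a list of integers."""
    return [int(byte, 16) for byte in text.split()]


def lba48(fields):
    """fields = [lba0, lba1, lba2, device, lba3, lba4, lba5] -> 48-bit LBA."""
    low = fields[0] | fields[1] << 8 | fields[2] << 16
    high = fields[4] | fields[5] << 8 | fields[6] << 16
    return high << 24 | low


def decode_acb(text):
    b = parse_hex(text)
    # For NCQ commands the sector count is carried in the FEATURES fields.
    count = b[1] | b[9] << 8
    return {"command": b[0], "lba": lba48(b[2:9]), "sectors": count or 65536}


def decode_res(text):
    b = parse_hex(text)
    return {"status": b[0], "error": b[1], "lba": lba48(b[2:9])}


acb = decode_acb("60 00 f8 fb f8 40 04 00 00 01 00 00")
res = decode_res("41 40 a0 fc f8 40 04 00 00 00 00")

kib = acb["sectors"] * SECTOR_BYTES // 1024
print(f"request: LBA {acb['lba']}, {acb['sectors']} sectors ({kib} KiB)")
print(f"reported failing LBA: {res['lba']} "
      f"({res['lba'] - acb['lba']} sectors into the request)")
```

Under those assumptions the request is 256 sectors (128 KiB), which lines up with the "scrub repaired 128K" line, and all five retries report the same LBA. A fixed failing LBA with a UNC error looks more like an unreadable sector on the platter than a cabling problem, though that is an inference rather than a diagnosis.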

However, the disk passed a long SMART test the very next day, although the error is logged in the SMART data (smartctl -x /dev/ada0).

As somewhat of an aside, shouldn't this be detected by SMART tests, since the error is logged in SMART data? I got to know about this from the "daily security run output" email containing kernel logs.

Even after extensive google'ing, I'm not sure if this is a failing disk, something to do with loose cables/PSU, or just a one-off disk error that has been totally handled by ZFS. But considering this is a backup & RAID-Z2, I guess there's no need to press the RMA button yet? And in any case since nothing too bad shows up in "smartctl -a /dev/ada0", I guess it won't qualify anyway.

Any thoughts? Am I reading the situation correctly? Btw, is there a link to WD's RMA policy/process somewhere?

Regards,
Saurav.

JackShine · Jan 19, 2017

Your drives look fine.

SMART is a ridiculously cryptic utility, I just look at the temp. And spin up. And that.

Bidule0hm · Jan 19, 2017

Definitely not normal. Sounds like a bad contact on the SATA connectors (sometimes it's a bad contact on the power connector instead). You should try reseating both ends of the cable and/or changing the cable.

saurav · Jan 19, 2017

In this box, the disks in drive bays connect directly to SATA connectors on the backplane, which connects to the mobo through a single miniSAS cable. I can re-seat the disk itself, and maybe the miniSAS too.

HP ProLiant MicroServer - a quick pictorial tour

Henius · Aug 27, 2018

Having the same issue. The WD Reds are connected via SATA directly to the motherboard. The box was sitting still without any interference.

Code:

> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 00 f0 f1 40 6c 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 b0 47 11 40 3c 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 b0 0a ad 40 51 00 00 00 00 00
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada0:ahcich0:0:0:0): Retrying command

Not sure how to resolve it. The errors are annoying, but the whole pool seems to be healthy.

Any advice?

vryeksksk · Aug 28, 2018

I have the exact same issue on one of my 4TB WD Red drives. The logs are spammed with these errors.

Code:

> ahcich8: Timeout on slot 14 port 0
> ahcich8: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 58 serr 00080000 cmd 0004ce17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b8 e0 6b 20 40 69 00 00 00 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> ahcich8: Timeout on slot 31 port 0
> ahcich8: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 58 serr 00080000 cmd 0004df17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 e0 1e 91 40 68 00 00 01 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b0 e0 1f 91 40 68 00 00 00 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
> (ada4:ahcich8:0:0:0): Retrying command
> ahcich8: Timeout on slot 31 port 0
> ahcich8: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 58 serr 00080000 cmd 0004df17
> ahcich8: Error while READ LOG EXT
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68 7a 28 40 69 00 00 01 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): ATA status: 00 ()
> (ada4:ahcich8:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00

Meanwhile the whole pool is healthy and SMART shows nothing of concern.
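
Editor's sketch: the `ahcich8: is ... serr ...` lines are register dumps, and the `serr` value can be decoded bit by bit. The bit names below follow the SError register layout as I recall it from the SATA/AHCI specifications, and reading `cs` as the pending command slot mask is my interpretation of the FreeBSD driver's output; treat both as assumptions. `decode_ahci_dump` is a made-up helper name.

```python
import re

# SError bit names (assumed from the SATA/AHCI SError register layout).
SERR_BITS = {
    0: "ERR.I recovered data integrity error",
    1: "ERR.M recovered communications error",
    8: "ERR.T transient data integrity error",
    9: "ERR.C persistent communication or data integrity error",
    10: "ERR.P protocol error",
    11: "ERR.E internal error",
    16: "DIAG.N PhyRdy change",
    17: "DIAG.I PHY internal error",
    18: "DIAG.W comm wake",
    19: "DIAG.B 10B to 8B decode error",
    20: "DIAG.D disparity error",
    21: "DIAG.C CRC error",
    22: "DIAG.H handshake error",
    23: "DIAG.S link sequence error",
    24: "DIAG.T transport state transition error",
    25: "DIAG.F unknown FIS type",
    26: "DIAG.X exchanged",
}


def set_bits(value):
    """Return the positions of all set bits in a 32-bit register value."""
    return [i for i in range(32) if value >> i & 1]


def decode_ahci_dump(line):
    """Pull the hex registers out of one 'ahcichN: is ... cmd ...' line."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)", line)
    regs = {name: int(value, 16) for name, value in pairs}
    return {
        "pending_slots": set_bits(regs["cs"]),
        "serr": [SERR_BITS.get(i, f"bit {i} (reserved)")
                 for i in set_bits(regs["serr"])],
    }


dump = ("ahcich8: is 00000000 cs 00004000 ss 00000000 rs 00004000 "
        "tfd 58 serr 00080000 cmd 0004ce17")
info = decode_ahci_dump(dump)
print(info["pending_slots"], info["serr"])
```

If the bit table is right, `serr 00080000` is a 10B to 8B decode error, which is a link-level fault between controller and drive. That would fit a cable, backplane, or port problem better than a failing platter, and the `cs` mask matches the "Timeout on slot 14" message.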

Ericloewe · Aug 29, 2018

Have you eliminated cabling, backplanes and the SATA controller?

vryeksksk · Aug 29, 2018

Ericloewe said:
Have you eliminated cabling, backplanes and the SATA controller?

This probably wasn't directed at me but I'll reply too.
I connect directly to the mobo with SATA cables bought from one batch, but I do admit they are cheap as fuck. I will try to replace them with something better and report back "soon".

Oh, and I just noticed that OP said it's one of his 2YO drives; my drive is brand new. Of 6 WD Reds bought in a single batch, only one has this problem.

Chris Moore · Aug 29, 2018

saurav said:
Even after extensive google'ing, I'm not sure if this is a failing disk, something to do with loose cables/PSU, or just a one-off disk error that has been totally handled by ZFS. But considering this is a backup & RAID-Z2, I guess there's no need to press the RMA button yet? And in any case since nothing too bad shows up in "smartctl -a /dev/ada0", I guess it won't qualify anyway.

I had a server that was causing problems like this constantly. Eight of the 60 drives in the server were affected. We initially replaced three of the drives thinking the drives were at fault, but when the errors persisted, we had the vendor replace the disk controller that ran the affected bank of drives. That still didn't correct the fault and the only thing remaining was the cables between the disks and the controllers. Since that component was integral to the chassis, and the server was under three months old, the vendor replaced the entire chassis. The same disks we were originally having problems with are still working great in the new chassis and it is about 11 months since we had a disk fault of any kind in that system. That probably isn't an option for you, but the cables connecting the drives could be at fault if you are having communications issues.

vryeksksk · Sep 12, 2018

Okay guys, I did some troubleshooting and it turns out it's not the SATA cable or the drive. It's a goddamn SATA port on the motherboard. Now I'm not sure how RMA-worthy that is on an otherwise perfectly working mobo.

Ericloewe · Sep 12, 2018

vryeksksk said:
Now I'm not sure how RMA-worthy that is on an otherwise perfectly working mobo.

Very. Definitely have it fixed/replaced.

Chris Moore · Sep 12, 2018

Unless you want to install a SAS HBA. You can run SATA drives from a SAS controller. That is what I have in almost all the servers I tend to for work and at home.

Ericloewe · Sep 12, 2018

I'm in favor of replacing the motherboard anyway, if it's in warranty.

vryeksksk · Sep 12, 2018

Ericloewe said:
I'm in favor of replacing the motherboard anyway, if it's in warranty.

Yeah, it is. It's a brand new (one month old) ASRock AB350 Pro4.

Now could you guys give me some tips on how I would go about proving that it's broken?

If the store tests it on Windows, I'm like 99% sure Windows won't complain about the port like FreeNAS does.

Also, I think this problem with the port is what causes my FreeNAS to hang on some operations (my other thread).
Moreover, it looks like this error also causes FreeNAS to send this alert:

Code:

New alerts:
* Device: /dev/ada4, not capable of SMART self-check

Alerts:
* Device: /dev/ada4, not capable of SMART self-check

And once I got this:

Code:

New alerts:
* Device: /dev/ada4, unable to open ATA device
* Device: /dev/ada5, unable to open ATA device

Alerts:
* Device: /dev/ada4, unable to open ATA device
* Device: /dev/ada5, unable to open ATA device

But after reboot it started working correctly.

Also, another question: shouldn't the dashboard graphs show something? On my system they are not working at all. They either show nothing or show incomplete data, while Netdata shows everything correctly.

Ericloewe · Sep 12, 2018

Well, show them the errors and explain that the port is wonky.
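
Editor's sketch: one way to present the errors is a tally of CAM errors per device and AHCI channel, taken from the kernel log before and after moving the drive to a different port. The regular expression is written against the log lines quoted in this thread; the function name and the sample text are illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "(ada4:ahcich8:0:0:0): CAM status: ATA Status Error"
CAM_STATUS = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+)")


def tally_cam_errors(lines):
    """Count CAM status errors per (device, channel, status text)."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line)
        if match:
            device, channel, status = match.groups()
            counts[(device, channel, status.strip())] += 1
    return counts


sample = """\
> (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b8 e0 6b 20 40 69 00 00 00 00 00
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada4:ahcich8:0:0:0): Retrying command
> (ada4:ahcich8:0:0:0): CAM status: ATA Status Error
> (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
"""

counts = tally_cam_errors(sample.splitlines())
for (device, channel, status), n in sorted(counts.items()):
    print(f"{device} on {channel}: {n} x {status}")
```

In practice the input would be the lines of the system log rather than the sample string. If the errors stay with the same `ahcichN` channel after a known-good drive and cable are attached to it, that supports the case that the port is at fault; the `adaN` name may shift when drives are moved, so the drive serial number is the safer way to track which disk is which.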
