Help diagnosing kernel messages

Status
Not open for further replies.

langlandh

Cadet
Joined
Jan 19, 2014
Messages
8
Hello,

I set up a new system, which seems to be working, and I would like help diagnosing error messages I'm getting via automatic email notification. I've gone through the manual, and some of the archives here, and based on those have tried a few things, outlined below.

The error message arrives in the email with subject "freenas.local security run output". It takes the form:

freenas.local kernel log messages:
> (ada1:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 f8 07 41 40 0e 00 00 00 00 00
> (ada1:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada1:ahcich3:0:0:0): RES: 41 10 f8 07 41 40 0e 00 00 00 00
> (ada1:ahcich3:0:0:0): Retrying command
> (ada1:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 08 1c c4 40 0f 00 00 00 00 00

-or-

> (ada1:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 b0 b8 c2 c8 40 0e 00 00 00 00 00
> (ada1:ahcich3:0:0:0): RES: 41 10 b8 c2 c8 40 0e 00 00 00 00

These blocks repeat a varying number of times per day. The "WRITE_FPDMA_QUEUED" wording is similar to that in this thread,
http://forums.freenas.org/threads/help-my-freenas-has-gone-south.9265/
though my SMART results (see below) don't show any problem that I'm able to interpret.
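For anyone reading the hex fields in these messages, here is a minimal Python sketch that decodes an ACB line into an LBA and a sector count, and names the status and error bits. The field order is an assumption based on how FreeBSD's CAM layer prints ATA commands, and the bit tables are partial; treat it as a reading aid, not a reference decoder.

```python
# Assumed ACB byte order (FreeBSD CAM print format); verify on your release.
ACB_FIELDS = [
    "command", "features", "lba_low", "lba_mid", "lba_high", "device",
    "lba_low_exp", "lba_mid_exp", "lba_high_exp",
    "features_exp", "sector_count", "sector_count_exp",
]

# Partial tables of ATA status and error register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_acb(hex_string):
    """Decode the 12 hex bytes that follow 'ACB:' in a kernel message."""
    values = [int(byte, 16) for byte in hex_string.split()]
    if len(values) != len(ACB_FIELDS):
        raise ValueError("expected 12 bytes, got %d" % len(values))
    f = dict(zip(ACB_FIELDS, values))
    # 48-bit LBA assembled from the low and "exp" (high-order) bytes.
    lba = (f["lba_low"] | (f["lba_mid"] << 8) | (f["lba_high"] << 16)
           | (f["lba_low_exp"] << 24) | (f["lba_mid_exp"] << 32)
           | (f["lba_high_exp"] << 40))
    # For NCQ commands such as WRITE FPDMA QUEUED (0x61), the sector
    # count travels in the features fields.
    sectors = f["features"] | (f["features_exp"] << 8)
    return {"command": f["command"], "lba": lba, "sectors": sectors}


def flag_names(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


first = decode_acb("61 c0 f8 07 41 40 0e 00 00 00 00 00")
print(first, flag_names(0x41, STATUS_BITS), flag_names(0x10, ERROR_BITS))
```

Under those assumptions the first message decodes to a 192-sector write at LBA 239142904, which is well inside the 3,907,029,168 sectors implied by the 2,000,398,934,016-byte capacity in the smartctl output below. That makes an IDNF ("ID not found") error on that address look like a fault somewhere in the path rather than an out-of-range request.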

The messages appear only for ada1. (My system has 4 drives; ada0 through ada3.)
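To confirm that only one device is affected without reading every email by eye, a small Python sketch like this could tally failed commands per device. It assumes each failed command produces exactly one "CAM status:" line in the format shown above; the sample text is copied from the first message.

```python
import re
from collections import Counter

# Matches e.g. "(ada1:ahcich3:0:0:0): CAM status:" and captures "ada1".
DEVICE_RE = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): CAM status:")


def errors_per_device(log_text):
    """Count 'CAM status' lines per device name in a block of log text."""
    return Counter(m.group(1) for m in DEVICE_RE.finditer(log_text))


sample = """\
> (ada1:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 c0 f8 07 41 40 0e 00 00 00 00 00
> (ada1:ahcich3:0:0:0): CAM status: ATA Status Error
> (ada1:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
> (ada1:ahcich3:0:0:0): RES: 41 10 f8 07 41 40 0e 00 00 00 00
> (ada1:ahcich3:0:0:0): Retrying command
"""

print(errors_per_device(sample))
```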

System details:

Build FreeNAS-9.2.0-RELEASE-x64 (ab098f4)
Platform Intel(R) Pentium(R) CPU G2020 @ 2.90GHz
Motherboard: Supermicro X9SCM-B
Memory 16335MB (Kingston KVR16E11/8I, ECC selected for compatibility with motherboard.)
The four drives are set up as a pool using the ZFS mirror option, and are listed as ada0 through ada3.

All of the data on this machine are backed up elsewhere, so I have some leeway about how to proceed.

I've tried two things to understand what the error might mean:

(1) zfs scrub. Results:
  pool: bits
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Sat Jan 11 07:46:06 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        bits                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/85310526-73e9-11e3-bcee-002590d2fee7  ONLINE       0     0     0
            gptid/85f3a2ca-73e9-11e3-bcee-002590d2fee7  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/864970a8-73e9-11e3-bcee-002590d2fee7  ONLINE       0     0     0
            gptid/869e70ea-73e9-11e3-bcee-002590d2fee7  ONLINE       0     0     0

errors: No known data errors

(2)

smartctl -q noserial -a /dev/ada1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red (AF)
Device Model: WDC WD20EFRX-68EUZN0
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Jan 19 10:53:41 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (27360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 276) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   171   169   021    Pre-fail  Always       -       4416
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       406
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7219
194 Temperature_Celsius     0x0022   134   118   000    Old_age   Always       -       13
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 307 -
# 2 Extended offline Completed without error 00% 271 -
# 3 Extended offline Completed without error 00% 247 -
# 4 Extended offline Completed without error 00% 223 -
# 5 Short offline Completed without error 00% 217 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Nothing really stands out for me.
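As a cross-check on that reading, here is a minimal Python sketch that scans the attribute rows and reports any non-zero raw value among a few commonly watched counters. The choice of attribute IDs (5, 197, 198, 199) is my assumption, not an official list, and the sample rows are copied from the report above.

```python
# Assumption: attribute IDs whose raw value is usually expected to be zero.
WATCHED_IDS = {5, 197, 198, 199}


def nonzero_watched(report):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have 10 whitespace-separated columns, ID first.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id, name, raw = int(parts[0]), parts[1], parts[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            flagged.append((attr_id, name, int(raw)))
    return flagged


sample = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

print(nonzero_watched(sample))
```

It returns an empty list for this report. An empty result only means these particular counters are clean; it does not by itself show that the drive or its cabling is healthy.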
This forum has some really knowledgeable people, let me thank you in advance.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
SATA cables? What power supply are you running? Capacitors?.. Standard hardware checks are in order.. I don't like seeing those retries, ever.. I'm a little tired but there may be a pattern forming with the WD Reds.. Have you checked the idle timer using WDIdle?
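For context on the idle-timer question, the SMART report in the first post lists a Load_Cycle_Count of 7219 at 406 Power_On_Hours. A rough Python sketch of the arithmetic follows; the function name is just illustrative, and whether the resulting rate is a concern depends on the drive's rated load/unload cycles, which is not stated in this thread.

```python
def load_cycles_per_hour(load_cycle_count, power_on_hours):
    """Average head load/unload cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# Raw values taken from the smartctl output in the first post.
rate = load_cycles_per_hour(7219, 406)
print(f"{rate:.1f} load cycles per power-on hour")
```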
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
You realize we JUST had this question asked twice last week and like 5 times in the last month?
 

langlandh

Cadet
Joined
Jan 19, 2014
Messages
8
Yatti,
Power supply: Seasonic SSR-360GP 360W. Staggered spinup enabled. Adequate if my understanding of this is correct.
Hardware checks in progress, including cable checks. Will report.

Cyberjock,
Sorry to be such a goof about asking the same question. My understanding of some of the terms (less good than yours) might be a factor.
 

langlandh

Cadet
Joined
Jan 19, 2014
Messages
8
Hardware troubleshooting result: the problem follows the SATA cable. I seem to have been more fortunate than SmallGuy was. None of my 4 new WD Reds deserves a bad diagnosis thus far. Of course, more uptime hours need to accumulate before this conclusion becomes less tentative, but at this stage all indications are "cable".

Thanks Yatti420 for your patient response.
Please consider my request for help complete.

Before I sign off on this thread, I would love to get more insight based on the collective experience of the folks here. My hunch is that if the same faulty cable were used in a typical desktop deployment, the problem could have gone unnoticed for a long time. What was it about this FreeNAS implementation that gave me the heads-up, when so many other systems would have just (kind-of) happily kept chugging?

Also, any insights about what physically happens with cables would be interesting. My guess is the connectors go wrong most often. I took my best magnifier to the connectors of my offending SATA cable. I think I might have observed a small burr-like object at one of the pins. I will try to call in a favor to get this examined with a more capable microscope. But meanwhile, I'd love to hear the experience of the experienced folks here.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
This does keep getting posted.. Varying reasons however, as the OP pointed out.. I'd keep an eye on that drive just in case it's flaky..

I'm not sure about cables, but depending on where you get them from I think assembly can be an issue.. I've found one or two bad ones in my collection over the years and they always cause big problems.. I try to use new ones if possible..
 

langlandh

Cadet
Joined
Jan 19, 2014
Messages
8
Final result. I feel quite chastened now; my hasty initial troubleshooting of the cable was erroneous. A systematic, unhurried sequence of checks isolated the cause: one of the disk drives (a new WD Red, one of the four purchased new for this system).
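The kind of swap testing described here can be sketched as a small Python helper: record which drive, cable, and port were used in each trial and whether errors appeared, then keep only the parts present in every failing trial and absent from every clean one. The trial data below is hypothetical, not my actual test log, and the approach assumes a single faulty part and trials long enough for an intermittent error to show up.

```python
def isolate_suspects(trials):
    """Return (kind, label) parts seen in every failing trial and no clean one.

    trials is a list of (parts, errored) pairs, where parts maps a kind
    such as "drive" to a label and errored is True if errors were logged.
    """
    failing = [set(parts.items()) for parts, errored in trials if errored]
    clean = [set(parts.items()) for parts, errored in trials if not errored]
    if not failing:
        return set()
    suspects = set.intersection(*failing)
    for parts in clean:
        suspects -= parts  # a part that ran clean is treated as exonerated
    return suspects


# Hypothetical swap log; labels are illustrative only.
trials = [
    ({"drive": "D1", "cable": "C1", "port": "P3"}, True),
    ({"drive": "D1", "cable": "C2", "port": "P3"}, True),
    ({"drive": "D2", "cable": "C1", "port": "P3"}, False),
    ({"drive": "D1", "cable": "C2", "port": "P0"}, True),
]

print(isolate_suspects(trials))
```

With only the first two trials, both the drive and the port remain suspects, which is how a hurried test can point at the wrong part.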
 

SmallGuy

Guru
Joined
Jun 7, 2013
Messages
560
Welcome to the club! ;)
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This is the kind of thing I'd like to be able to expand upon in the Hardware forum in the build/burn-in or maybe a new troubleshooting sticky...
 

langlandh

Cadet
Joined
Jan 19, 2014
Messages
8
jgreco said: "This is the kind of thing I'd like to be able to expand upon in the Hardware forum in the build/burn-in or maybe a new troubleshooting sticky..."
One of the things I like best about freenas.org is that it encourages (and shows) "enthusiasts" how to think like "professionals". Your hardware guide is a good example. You and many of the others here influence more than you might realize. Thanks.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Sometimes professionals are encouraged by the excitement of enthusiasts too. Quite frankly I could get burned out on tech if it was just a constant grind. So it's a bit of a two-way street. And you know, helps to have your preconceived notions challenged now and then...
 

marcevan

Patron
Joined
Dec 15, 2013
Messages
432
I've been consistently getting this on ADA2 in my 6x WD Red 3.0 RaidZ2.

Smartctl showed that particular drive was fine while another (ADA4) had definitive issues. I've since replaced ADA4 but ADA2 keeps posting to the console.

I ruled out hardware other than the drives (new cables, new mobo, new RAM, new CPU, new power supply) and the same issue persists, so I'm going to replace that drive when my replacement comes and see if it's what I suspect: not the cable per se, but the SATA attachment point on the drive might be damaged.
 