What does this email mean? Subject: SMART error (FailedReadSmartErrorLog) detected on host: freenas

Status
Not open for further replies.
Joined
Jan 11, 2014
Messages
13
I just set up the system, initialized some storage, and copied a few things over from my old NAS, but I'm getting emails (for 2 of my 5 drives) that read as follows. Did I set something up incorrectly?

X-Google-Original-From: root@freenas.local
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: SMART error (FailedReadSmartErrorLog) detected on host: freenas
Date: Sun, 27 Apr 2014 20:49:04 -0000
X-FreeNAS-Host: freenas.local
X-Mailer: FreeNAS

This message was generated by the smartd daemon running on:

host name: freenas
DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/ada2, Read SMART Error Log Failed

Device info:
WDC WD30EFRX-68AX9N0, S/N:WD-WCC1T1123267, WWN:5-0014ee-20897e3ae, FW:80.00A80, 3.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
That's an email indicating a SMART error, just like the subject line says.

You should examine the device identified in the email's device info and determine whether something is wrong.
 
Joined
Jan 11, 2014
Messages
13
I've run short, long, and conveyance tests on all of my drives and two of them continue to throw this error - ada2 and ada4. There are no SMART errors logged when I do a smartctl -l selftest on the devices.

Should the drives just be swapped out? Is there some other test I should run? I have a spare that I could certainly put in place of ada2 - seems that drive has been up for about 2 years.

Here's the output from smartctl:

Code:
sudo smartctl -l selftest /dev/ada2
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     19713         -
# 2  Short offline       Completed without error       00%     19709         -
# 3  Extended offline    Completed without error       00%     19695         -
# 4  Extended offline    Interrupted (host reset)      10%     19556         -
# 5  Short offline       Completed without error       00%     19546         -
# 6  Extended offline    Aborted by host               10%     16876         -
# 7  Extended offline    Aborted by host               10%     16708         -
# 8  Extended offline    Aborted by host               10%     16541         -
# 9  Extended offline    Aborted by host               10%     16373         -
#10  Extended offline    Aborted by host               10%     16206         -
#11  Extended offline    Completed without error       00%     15545         -
#12  Extended offline    Completed without error       00%     15209         -
#13  Extended offline    Completed without error       00%     15041         -
#14  Extended offline    Completed without error       00%     14873         -
#15  Extended offline    Completed without error       00%     14706         -
#16  Extended offline    Completed without error       00%     14538         -
#17  Extended offline    Completed without error       00%     14371         -
#18  Extended offline    Completed without error       00%     14203         -
#19  Extended offline    Completed without error       00%     14036         -
#20  Extended offline    Completed without error       00%     13867         -
#21  Extended offline    Completed without error       00%     13700         -


Code:
sudo smartctl -l selftest /dev/ada4
Password:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      3903         -
# 2  Short offline       Completed without error       00%      3899         -
# 3  Extended offline    Completed without error       00%      3884         -
 
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Normally that means the disk is fine. You didn't post all of the smart info, so I can't check to see if anything else looks wrong.

But you've 99% ruled out the disks, so now you need to look at the SATA cables, SATA controller, power supply, etc.
 
Joined
Jan 11, 2014
Messages
13
Here's the full output for ada2, which is giving me this message more often than ada4, just to give some closure in case someone's Google search finds this:

Code:
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ0633981
LU WWN Device Id: 5 0014ee 205ca50b4
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May  7 13:59:12 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x84)Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:(51360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time: (   2) minutes.
Extended self-test routine
recommended polling time: ( 494) minutes.
Conveyance self-test routine
recommended polling time: (   5) minutes.
SCT capabilities:       (0x3035)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   169   145   021    Pre-fail  Always       -       8550
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2608
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       19767
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       58
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2604
194 Temperature_Celsius     0x0022   117   101   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     19713         -
# 2  Short offline       Completed without error       00%     19709         -
# 3  Extended offline    Completed without error       00%     19695         -
# 4  Extended offline    Interrupted (host reset)      10%     19556         -
# 5  Short offline       Completed without error       00%     19546         -
# 6  Extended offline    Aborted by host               10%     16876         -
# 7  Extended offline    Aborted by host               10%     16708         -
# 8  Extended offline    Aborted by host               10%     16541         -
# 9  Extended offline    Aborted by host               10%     16373         -
#10  Extended offline    Aborted by host               10%     16206         -
#11  Extended offline    Completed without error       00%     15545         -
#12  Extended offline    Completed without error       00%     15209         -
#13  Extended offline    Completed without error       00%     15041         -
#14  Extended offline    Completed without error       00%     14873         -
#15  Extended offline    Completed without error       00%     14706         -
#16  Extended offline    Completed without error       00%     14538         -
#17  Extended offline    Completed without error       00%     14371         -
#18  Extended offline    Completed without error       00%     14203         -
#19  Extended offline    Completed without error       00%     14036         -
#20  Extended offline    Completed without error       00%     13867         -
#21  Extended offline    Completed without error       00%     13700         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
 
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, you've had roughly 2,600 start/stop cycles (Start_Stop_Count is 2608) but only 81 power cycles.

So either you are putting your disks to sleep or they are stopping and starting because of a power-related problem.

Other than that, everything else looks perfect. The drive is in pristine shape with no errors to speak of.
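
For anyone who wants to run the same comparison on their own drives, here is a rough Python sketch. It assumes you have saved the attribute table from smartctl as text; the sample rows are copied from the ada2 output above, and the helper name `raw_values` is made up for this example. It only does the arithmetic and cannot tell you why the drive is spinning down.

```python
# Sketch: compare spin-up count with power-cycle count from smartctl's
# attribute table. SAMPLE is copied from the ada2 output in this thread;
# raw_values() is a hypothetical helper, not part of smartmontools.

SAMPLE = """\
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2608
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       19767
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2604
"""


def raw_values(text):
    """Map ATTRIBUTE_NAME -> integer RAW_VALUE for each attribute row."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        # Rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


attrs = raw_values(SAMPLE)

# Spin-ups that cannot be explained by the machine being powered on.
extra_spinups = attrs["Start_Stop_Count"] - attrs["Power_Cycle_Count"]

# Average running time between spin-ups over the life of the drive.
hours_per_spinup = attrs["Power_On_Hours"] / attrs["Start_Stop_Count"]

print(f"spin-ups not explained by power cycles: {extra_spinups}")
print(f"average hours between spin-ups: {hours_per_spinup:.1f}")
```

A large gap between the two counters is what points at either a spin-down/sleep setting or a power problem; the script does not distinguish between those.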
 
Joined
Jan 11, 2014
Messages
13
I just want to follow up on my own post in case someone stumbles here in the future. I should also apologize for not posting all of my machine specs in the first post. I'm using an ASRock C2750 with 16GB of Crucial ECC RAM. Apparently this board has had some SATA controller issues that I was unaware of until now (despite trying to do my due diligence when purchasing this $400 piece of hardware).

I've narrowed the problem down: it occurs during heavy SATA loads and ZFS scrubs. It does not appear to be related to the power supply (I tried swapping it out).

You can read some info about it here:

http://www.amazon.com/review/R2BJFL...-glance&nodeID=541966&store=pc#wasThisHelpful

and on the tweaktown forums

The problem I'm now having is that the SATA interface appears to just go down during ZFS scrubs and the whole system becomes unresponsive. I have to do a hard reset or power cycle. I'm going to try re-flashing to an updated BIOS to see if that helps, but I'm pretty frustrated.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
There was some talk of issues with FreeBSD and the Marvell SATA controllers used in that motherboard, but it's the same motherboard used in the FreeNAS Mini and I believe the problems were solved. At least nobody has complained lately...
 
Joined
Jan 11, 2014
Messages
13
I'm planning on updating the BIOS. If that doesn't work, I'll try a 3rd power supply, and if that doesn't work, I'm going to RMA the mobo. Will keep this thread posted...
 
Joined
Oct 6, 2014
Messages
4
I'm getting the same emails too. Same motherboard, same RAM (32GB though). Any solutions? The drives all appear perfect when running tests using smartctl.
 
Joined
Oct 6, 2014
Messages
4
From what I can see in the smartctl man pages, it appears to represent an error with the SMART logs. I ran a short test on all drives and it doesn't appear to be coming up (emailing me) anymore. Hopefully a one-off?
I rebooted my NAS the other day and all my reports under the Reporting tab in the GUI became corrupted and I had to delete them. Reckon this could have anything to do with that? These were the first SMART tests on my schedule since the system reports became corrupted.
 

Nindustries

Patron
Joined
Jun 12, 2013
Messages
269
I haven't encountered corrupted system reports before.. this will probably have another underlying cause.
Checking my BIOS version this evening.
 
Joined
Oct 6, 2014
Messages
4
My BMC version is 0.19.0 and I'll check the BIOS later too. Based on reading your other thread, I think my problem could be bad SATA cables too. I was given 4 cheap cables and am using those; I haven't got around to grabbing some new ones yet. If this solves your issues please let me know, as it should be an easy fix for me too :)
 
Joined
Jan 11, 2014
Messages
13
This was resolved on my system by swapping the power supply and checking cables/connections; I also removed one older drive from the array.
Emails stopped.
 

phonoflux

Dabbler
Joined
Aug 23, 2012
Messages
21
Chiming in here. I'm having the same issues on 9.2.1.8. I also have the same board as others, an ASRock C2750 with 16GB ECC, on a new build that I'm copying data to. I'm using 8 disks on all the white SATA ports with RAIDZ2. This happens when copying data for extended periods, doing dd's, etc. FYI there are 12 ports: 8 white, 4 blue. I read somewhere that I should avoid the blue ports, but I can't find this information any more.

I initially thought it might be due to the old 'spare' SATA cables I used to throw this together while I wait for the RAID card and its cables to show up, so I got 8 new ones yesterday and am still having the same issue. I've done long SMART tests on all but 3 of the 8 drives, as I shut the system down to swap out the SATA cables without checking whether ALL the drives had finished their long tests (they are running now, and I'll be patient this time).

As I'm still mostly in testing/playing mode on the new NAS, I made a point of randomly choosing SATA ports when plugging the drives back in (but still only using the white ports) to see if FreeNAS gives a hoot, and it didn't, which was great.

What is rather interesting is that both before and after the SATA cable swap, the errors were limited to ada2-5. After my random SATA port choosing when plugging them back in, the drive serial numbers reported in the SMART errors after booting back up were different (to be expected), but the errors remained only on ada2-5.

Could I be onto something here? I have the serial numbers of the disks currently plugged into ada2-5 (from the SMART errors), but short of tracing the cables, is there a way to know what ports/controllers they are plugged into? This board has another 4 ports that I could swap those disks over to and see if the errors continue... Diagnostic info follows.

SMART is emailing me copies of what is showing up in the server console too. Examples:
"Device: /dev/ada4, failed to read SMART Attribute Data"
"Device: /dev/ada4, Read SMART Error Log Failed"
"Device: /dev/ada3, failed to read SMART Attribute Data"
"Device: /dev/ada5, Read SMART Error Log Failed"

Smart output from ada5. Similar stuff for the other drives affected, happy to paste them in if it may help.
# smartctl -a /dev/ada5
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-<snip>
LU WWN Device Id: 5 0014ee 2b574fad0
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Nov 2 09:48:56 2014 NZDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (52560) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 526) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   176   021    Pre-fail  Always       -       8141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       39
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   118   115   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        26         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Console output:
Nov 1 22:28:57 phononas2 smartd[3078]: Device: /dev/ada2, failed to read SMART Attribute Data
Nov 1 22:28:57 phononas2 kernel: ahcich2: Timeout on slot 29 port 0
Nov 1 22:28:57 phononas2 kernel: ahcich2: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 40 serr 00000000 cmd 10009d17
Nov 1 23:28:58 phononas2 kernel: ahcich3: Timeout on slot 22 port 0
Nov 1 23:28:58 phononas2 kernel: ahcich3: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 40 serr 00000000 cmd 10009617
Nov 2 01:28:58 phononas2 kernel: ahcich2: Timeout on slot 17 port 0
Nov 2 01:28:58 phononas2 kernel: ahcich2: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 40 serr 00000000 cmd 10009117
Nov 2 01:29:21 phononas2 kernel: ahcich5: Timeout on slot 29 port 0
Nov 2 01:29:21 phononas2 kernel: ahcich5: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 40 serr 00000000 cmd 10009d17
Nov 2 01:58:57 phononas2 kernel: ahcich2: Timeout on slot 18 port 0
Nov 2 01:58:58 phononas2 kernel: ahcich2: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 40 serr 00000000 cmd 10009217
Nov 2 03:28:57 phononas2 kernel: ahcich3: Timeout on slot 5 port 0
Nov 2 03:28:57 phononas2 kernel: ahcich3: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 50 serr 00000000 cmd 10008517
Nov 2 04:28:57 phononas2 kernel: ahcich3: Timeout on slot 13 port 0
Nov 2 04:28:57 phononas2 kernel: ahcich3: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd 40 serr 00000000 cmd 10008d17
Nov 2 04:58:58 phononas2 smartd[3078]: Device: /dev/ada4, failed to read SMART Attribute Data
Nov 2 04:58:58 phononas2 kernel: ahcich4: Timeout on slot 29 port 0
Nov 2 04:58:58 phononas2 kernel: ahcich4: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 40 serr 00000000 cmd 10009d17
Nov 2 06:28:57 phononas2 smartd[3078]: Device: /dev/ada3, failed to read SMART Attribute Data
Nov 2 06:28:57 phononas2 kernel: ahcich3: Timeout on slot 26 port 0
Nov 2 06:28:57 phononas2 kernel: ahcich3: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 40 serr 00000000 cmd 10009a17
Nov 2 06:58:58 phononas2 kernel: ahcich5: Timeout on slot 14 port 0
Nov 2 06:58:58 phononas2 kernel: ahcich5: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 40 serr 00000000 cmd 10008e17
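
To make the pattern in the console output easier to see, here is a small Python sketch that tallies the kernel timeout lines by AHCI channel and by minute of the hour. The log text is copied from the lines above (only the "Timeout on slot" lines); the regular expression is an assumption about this particular syslog layout and may need adjusting for other systems. In this log the smartd failures on ada2, ada3 and ada4 share a timestamp with timeouts on ahcich2, ahcich3 and ahcich4, which suggests that adaN sits on ahcichN here, but that mapping is inferred from the log rather than confirmed.

```python
import re
from collections import Counter

# Timeout lines copied from the console output in this post.
LOG = """\
Nov 1 22:28:57 phononas2 kernel: ahcich2: Timeout on slot 29 port 0
Nov 1 23:28:58 phononas2 kernel: ahcich3: Timeout on slot 22 port 0
Nov 2 01:28:58 phononas2 kernel: ahcich2: Timeout on slot 17 port 0
Nov 2 01:29:21 phononas2 kernel: ahcich5: Timeout on slot 29 port 0
Nov 2 01:58:57 phononas2 kernel: ahcich2: Timeout on slot 18 port 0
Nov 2 03:28:57 phononas2 kernel: ahcich3: Timeout on slot 5 port 0
Nov 2 04:28:57 phononas2 kernel: ahcich3: Timeout on slot 13 port 0
Nov 2 04:58:58 phononas2 kernel: ahcich4: Timeout on slot 29 port 0
Nov 2 06:28:57 phononas2 kernel: ahcich3: Timeout on slot 26 port 0
Nov 2 06:58:58 phononas2 kernel: ahcich5: Timeout on slot 14 port 0
"""

# Assumed line layout: "Mon D HH:MM:SS host kernel: ahcichN: Timeout on slot S ..."
TIMEOUT = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:(?P<minute>\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"(?P<channel>ahcich\d+): Timeout on slot \d+",
    re.MULTILINE,
)

by_channel = Counter()
by_minute = Counter()
for match in TIMEOUT.finditer(LOG):
    by_channel[match["channel"]] += 1
    by_minute[match["minute"]] += 1

for channel, count in sorted(by_channel.items()):
    print(f"{channel}: {count} timeout(s)")
print("timeouts by minute of the hour:", dict(sorted(by_minute.items())))
```

Every timeout in this sample lands on ahcich2 through ahcich5, and none on ahcich0 or ahcich1. The minute tally is included because the timeouts cluster at about half-hour intervals; whether that lines up with smartd's polling schedule is a guess worth checking, not something this log proves.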
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
No, that looks like your AHCI (SATA) controller is having problems. If I were a betting man I'd say you're going to need to RMA your motherboard. :/
 

phonoflux

Dabbler
Joined
Aug 23, 2012
Messages
21
Ah Mr Jock, thanks for replying even if you bring potentially bad news heh. I'll follow the cables and see where they are plugged in and try to see if they are all on one controller. If so, that will give me grounds for thinking it's the controller.

Do you think those errors warrant an RMA?
 