The volume RAID (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.

Status
Not open for further replies.
Joined
Feb 14, 2017
Messages
1
A couple of years ago I built a FreeNAS machine around an HP MicroServer Gen8 (G2020T). It has 4 x 4 TB WD hard disks configured as a single RAIDZ1 pool and boots from a microSD card. It had been working pretty much trouble-free until recently, when it started faulting the RAID pool with the message: The volume RAID (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.

If I reboot the machine and then import the pool, everything is typically OK:

Code:
[root@freenas] ~# zpool status RAID
  pool: RAID
 state: ONLINE
  scan: scrub repaired 0 in 14h11m with 0 errors on Mon Feb 13 08:37:21 2017
config:

  NAME                                            STATE     READ WRITE CKSUM
  RAID                                            ONLINE       0     0     0
    raidz1-0                                      ONLINE       0     0     0
      gptid/8be90dc0-ea53-11e6-a7d1-a0481cb8579c  ONLINE       0     0     0
      gptid/286666d8-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0
      gptid/294514ff-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0
      gptid/2a284dd0-5f35-11e3-aab1-a0481cb8579c  ONLINE       0     0     0

errors: No known data errors


I'm struggling to work out what the problem is. Here's what I've tried:

One of the hard disks had persistently shown a "Current Pending Sector Count" of 1, generating messages like "Device: /dev/ada0, 1 Currently unreadable (pending) sectors". I replaced this disk, but that made no difference.

The SMART status of the disks looks OK; here is a typical example:

Code:
[root@freenas] ~# smartctl -a /dev/ada1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD40EFRX-68WT0N0
Serial Number:  WD-WCC4E0266365
LU WWN Device Id: 5 0014ee 25e702e10
Firmware Version: 80.00A80
User Capacity:  4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Tue Feb 14 18:53:29 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (54780) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 548) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  248  177  021  Pre-fail  Always  -  4591
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  51
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  063  063  000  Old_age  Always  -  27368
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  51
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  38
193 Load_Cycle_Count  0x0032  043  043  000  Old_age  Always  -  473558
194 Temperature_Celsius  0x0022  124  114  000  Old_age  Always  -  28
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  27094  -
# 2  Short offline  Completed without error  00%  27089  -
# 3  Short offline  Completed without error  00%  27065  -
# 4  Short offline  Completed without error  00%  27041  -
# 5  Extended offline  Completed without error  00%  27026  -
# 6  Short offline  Completed without error  00%  26993  -
# 7  Short offline  Completed without error  00%  26969  -
# 8  Short offline  Completed without error  00%  26946  -
# 9  Short offline  Completed without error  00%  26921  -
#10  Short offline  Completed without error  00%  26900  -
#11  Short offline  Completed without error  00%  26873  -
#12  Extended offline  Completed without error  00%  26859  -
#13  Short offline  Completed without error  00%  26826  -
#14  Short offline  Completed without error  00%  26802  -
#15  Short offline  Completed without error  00%  26778  -
#16  Short offline  Completed without error  00%  26754  -
#17  Short offline  Completed without error  00%  26730  -
#18  Extended offline  Completed without error  00%  26715  -
#19  Short offline  Completed without error  00%  26658  -
#20  Short offline  Completed without error  00%  26634  -
#21  Short offline  Completed without error  00%  26610  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Do note the high Load_Cycle_Count, though. This is a known problem with this model, and I upgraded the firmware a while ago to mitigate it.
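
To put that Load_Cycle_Count in perspective, here is a quick back-of-the-envelope calculation using the two raw values from the smartctl output above (attributes 193 and 9). It is only a lifetime average, so it cannot show whether the rate changed after the firmware update.

```python
# Raw values copied from the smartctl output for /dev/ada1 above.
load_cycle_count = 473558   # attribute 193, Load_Cycle_Count
power_on_hours = 27368      # attribute 9, Power_On_Hours

# Lifetime average: how often the heads parked per hour of operation.
cycles_per_hour = load_cycle_count / power_on_hours
minutes_per_cycle = 60 / cycles_per_hour

print(f"{cycles_per_hour:.1f} load cycles per power-on hour")
print(f"one load cycle roughly every {minutes_per_cycle:.1f} minutes")
```

That works out to about 17.3 load cycles per hour, or one roughly every 3.5 minutes, averaged over the life of the drive. Comparing the raw count on two different days would give a better idea of the current rate.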

I'm wondering if it may be a problem with the server/disk controller, but can't think of a way of diagnosing this. Are there any logs I should be looking at?

Any help gratefully received.

Andrew
 

Glorious1

Guru
Joined
Nov 23, 2014
Messages
1,211
I guess it is intermittent, since zpool status showed everything OK. I would run a new long SMART test on all the drives. If the SMART data look OK for all the drives, you may have a SATA cable issue. Several times over two years I have had some kind of problem like that, and reseating all the cables made it go away.
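
As a rough aid for checking "SMART data look OK" across four drives, here is a minimal Python sketch that scans saved smartctl output for a handful of attributes and reports any with a nonzero raw value. The choice of attributes and the media versus link/cable grouping are my assumptions, not something stated in this thread, and raw values can be vendor specific, so treat a hit as a prompt to look closer rather than a diagnosis. The function name is made up for illustration.

```python
# Attribute IDs to watch. The grouping is an assumption for illustration:
# 5/197/198 usually point at the disk surface, 199 at the SATA link.
WATCHED = {
    5: "Reallocated_Sector_Ct (media)",
    197: "Current_Pending_Sector (media)",
    198: "Offline_Uncorrectable (media)",
    199: "UDMA_CRC_Error_Count (link/cable)",
}


def nonzero_watched(smart_text):
    """Return {attribute_id: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # everything else (headers, self-test log rows) is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            found[attr_id] = int(raw)
    return found


# Two rows copied from the ada1 output earlier in the thread.
clean_rows = """\
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
"""

# Hypothetical row: what a drive with one pending sector might report.
pending_row = "197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  1"

print(nonzero_watched(clean_rows))
print(nonzero_watched(pending_row))
for attr_id, raw in nonzero_watched(pending_row).items():
    print(WATCHED[attr_id], "raw value:", raw)
```

To use it, save the output of `smartctl -a` for each drive to a text file once the long tests have finished and pass each file's contents to the function.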
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
"If I reboot the machine and then import the pool, everything is typically OK"

If you reboot the machine, the counters are reset. The all-clear output is fake, in a way.

Start by identifying the offending disk. IO failures with good SMART data suggest controller or cable issues.
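
One way to catch the offending disk is to capture `zpool status RAID` while the pool is still faulted, before rebooting, and look for any row that is not ONLINE or has a nonzero counter. The following is a minimal Python sketch of that check, assuming the usual NAME/STATE/READ/WRITE/CKSUM column layout shown earlier in the thread. The function name and the faulted sample are hypothetical, since the thread never shows the pool in its failed state.

```python
def suspect_vdevs(status_text):
    """Return (name, state, read, write, cksum) rows that are not clean.

    A row is clean when its state is ONLINE and all three counters are "0".
    Counters are compared as strings because zpool may abbreviate large
    values instead of printing a plain integer.
    """
    suspects = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True   # device rows start after this header
            continue
        if not in_config:
            continue
        if not fields:
            break              # a blank line ends the device table
        if len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            suspects.append((name, state, read, write, cksum))
    return suspects


# Hypothetical excerpt for illustration only; the device names are made up.
FAULTED_EXAMPLE = """\
config:

  NAME                STATE     READ WRITE CKSUM
  RAID                DEGRADED     0     0     0
  raidz1-0            DEGRADED     0     0     0
  gptid/example-disk  FAULTED      3    12     0
  gptid/other-disk    ONLINE       0     0     0

errors: No known data errors
"""

for row in suspect_vdevs(FAULTED_EXAMPLE):
    print(row)
```

The pool and raidz rows show up too, because they inherit the degraded state; the leaf row with the nonzero counters is the one to chase. On FreeBSD, `glabel status` should list which adaX partition each gptid label belongs to, which tells you which physical disk, cable and port to inspect.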
 