Reboot caused Disk to go Unavailable

Status
Not open for further replies.

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
Build: FreeNAS-9.2.1.6-RELEASE-x64 (ddd1e39)
Platform: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
Memory: 32402MB ECC

ASRock Rack C226 WS LGA1150
6 x Western Digital Red WD20EFRX 2TB

The RAID is password-protected.

I rebooted today, and when the system came back online, it showed status DEGRADED ...

The system has been running perfectly without errors for about 8 months. I checked status and noticed that one drive was UNAVAIL. So I assumed a cable had come loose.

I opened the system, reseated all cables, rebooted, and then found that 2 drives are now UNAVAIL. They all show up in the BIOS as connected.

The only option shown under the UNAVAIL drive is REPLACE. After rebooting a few more times, I tried using the REPLACE button. Now I have the option to Detach and Replace for that drive, and the other still only shows REPLACE.

What the heck is going on here? I have been googling all morning for an answer about UNAVAIL, but I am behind the curve again. Seems that everything I learned a year ago about FreeNAS and such has left my brain :(
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
I have good temperatures in the server room, and have monitored drive temps ... all these disks were bought together 9 months ago, so it seems unlikely that they would both fail ... one after the other ... while I was messing with the box.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Please post the results of these commands from the command line:
# zpool status
# smartctl -a /dev/ada0
* (example only, your drives may be designated another way)

(copy and paste in between code tags to preserve format within your post)
Thanks!
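If the full `zpool status` output is long, a small filter can pull out just the devices that are not ONLINE. This is only a rough sketch: `unhealthy_vdevs` is a made-up helper name, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM table layout in the `config:` section.

```shell
# Hypothetical helper (not part of FreeNAS or ZFS): read `zpool status`
# output on stdin and print "name state" for every config-table row whose
# STATE column is anything other than ONLINE.
unhealthy_vdevs() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line marks the end of the device table.
        /^errors:/ { in_config = 0 }
        # Device rows have at least five columns: name, state, read, write, cksum.
        in_config && NF >= 5 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage on the NAS itself (assumes zpool is in PATH):
#   zpool status | unhealthy_vdevs
```

The full, unfiltered output is still the thing to paste into the thread; the filter is just a quick way to see which rows to look at first.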
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
Hello, I went ahead and pressed the REPLACE button. zpool status then showed that the system was resilvering the drive ... 2 hours later the drive is online with no issue ... I have started the same process on the second drive ... I will let you know how it works out. Perhaps there was a better/quicker way to bring a drive back online, but here I am ...

Thank you for your time and interest. I will be back in approx. 3 hours, after the second drive has completed resilvering.

In any case, I am curious what caused the hiccup in the first place ... and, presuming that this fixes things, was there a quicker solution?

But, you can wait to answer those questions (if you want) following the 3.5 hour resilvering underway.
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
BTW, these are the results of the commands you asked for:

Code:
[root@freenas] ~# zpool status
  pool: Deerhaven1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug  8 14:59:02 2015
        263G scanned out of 2.94T at 369M/s, 2h7m to go
        43.8G resilvered, 8.74% done
config:

        NAME                                                  STATE     READ WRITE CKSUM
        Deerhaven1                                            DEGRADED     0     0     0
          raidz2-0                                            DEGRADED     0     0     0
            gptid/38009294-1dcb-11e4-82ae-d05099013a39.eli    ONLINE       0     0     0
            gptid/b61781ee-3df2-11e5-b1f9-d05099013a39.eli    ONLINE       0     0     0
            gptid/3965262b-1dcb-11e4-82ae-d05099013a39.eli    ONLINE       0     0     0
            replacing-3                                       UNAVAIL      0     0     0
              16150001156572819795                            UNAVAIL      0     0     0  was /dev/gptid/3a143aab-1dcb-11e4-82ae-d05099013a39.eli
              gptid/e4576c5f-3e07-11e5-b1f9-d05099013a39.eli  ONLINE       0     0     0  (resilvering)
            gptid/3ac8051b-1dcb-11e4-82ae-d05099013a39.eli    ONLINE       0     0     0
            gptid/3b79f677-1dcb-11e4-82ae-d05099013a39.eli    ONLINE       0     0     0

errors: No known data errors


Code:
[root@freenas] ~# smartctl -a /dev/ada1
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD20EFRX-68EUZN0
Serial Number:    WD-WCC4M2786029
LU WWN Device Id: 5 0014ee 25f7c5029
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Aug  8 15:09:30 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (26460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 267) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   175   021    Pre-fail  Always       -       4233
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8785
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   108   100   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8777         -
# 2  Short offline       Completed without error       00%      8770         -
# 3  Short offline       Completed without error       00%      8746         -
# 4  Short offline       Completed without error       00%      8722         -
# 5  Short offline       Completed without error       00%      8698         -
# 6  Short offline       Completed without error       00%      8674         -
# 7  Short offline       Completed without error       00%      8650         -
# 8  Short offline       Completed without error       00%      8626         -
# 9  Extended offline    Completed without error       00%      8609         -
#10  Short offline       Completed without error       00%      8602         -
#11  Short offline       Completed without error       00%      8578         -
#12  Short offline       Completed without error       00%      8554         -
#13  Extended offline    Completed without error       00%      8537         -
#14  Short offline       Completed without error       00%      8530         -
#15  Short offline       Completed without error       00%      8507         -
#16  Short offline       Completed without error       00%      8483         -
#17  Short offline       Completed without error       00%      8471         -
#18  Short offline       Completed without error       00%      8435         -
#19  Short offline       Completed without error       00%      8411         -
#20  Short offline       Completed without error       00%      8387         -
#21  Extended offline    Completed without error       00%      8370         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


That is the info from the first drive that went UNAVAIL and was subsequently resilvered. Drive 3 is currently being resilvered, as you can see from the zpool status output above.
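As a sanity check on the "2h7m to go" figure in that zpool status output, the estimate can be reproduced by hand from the scanned size, total size, and scan rate. The sketch below is a back-of-envelope calculation, not necessarily how ZFS computes it internally; `resilver_eta` is a made-up helper, and it assumes binary units (1T = 1024G, 1G = 1024M).

```shell
# Hypothetical helper: estimate remaining resilver time.
# Arguments: total size in T, scanned size in G, scan rate in M/s.
# Assumption: binary units (1T = 1024G, 1G = 1024M).
resilver_eta() {
    awk -v total_t="$1" -v scanned_g="$2" -v rate_m="$3" 'BEGIN {
        remaining_m = (total_t * 1024 - scanned_g) * 1024   # data left to scan, in M
        secs = int(remaining_m / rate_m)                     # whole seconds at the current rate
        printf "%dh%dm\n", int(secs / 3600), int((secs % 3600) / 60)
    }'
}

# Figures from the zpool status output: 263G scanned out of 2.94T at 369M/s
resilver_eta 2.94 263 369   # prints 2h7m
```

The rate shown by zpool status is an average so far, so the real finish time moves around as the resilver progresses.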
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Are your drives plugged straight into the motherboard, or have you configured some kind of RAID? This behavior looks like you have created a RAID underneath ZFS, which isn't recommended. I could be wrong, though; this is just a guess.
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
Drives are plugged straight into motherboard. The ROM is set to AHCI not RAID. So I don't know what else I could have buggered up.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
eqartimus said: "Drives are plugged straight into motherboard. The ROM is set to AHCI not RAID. So I don't know what else I could have buggered up."
Great! Thanks for the info. Maybe a buggy power supply? Strange things can happen when power is variable.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Your experience is a great example of how RAIDZ2 allows you to deal with two drive failures :cool:
eqartimus said: "In any case, I am curious what caused the hiccup in the first place"
So if I understand you correctly, ada2 and ada3 became unavailable to start with, right?
SweetAndLow said: "Strange things can happen when power is variable."
As the SMART output did not seem to show anything amiss, SweetAndLow may have a good point. Worth looking into for sure.
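For anyone skimming a long `smartctl -a` dump, the attributes usually looked at first are 5, 197 and 198 (reallocated / pending / uncorrectable sectors, which point at the disk surface) and 199 (UDMA CRC errors, which tend to point at cabling). A minimal sketch for pulling out just those rows follows; `smart_key_attrs` is a made-up helper name, and it assumes the ten-column attribute table layout shown in the output posted above.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print
# NAME=RAW_VALUE for attributes 5, 197, 198 and 199.
# Assumes the standard 10-column attribute table (RAW_VALUE is column 10).
smart_key_attrs() {
    awk '$1 ~ /^(5|197|198|199)$/ && NF >= 10 { print $2 "=" $10 }'
}

# Usage on the NAS itself (device name is an example only):
#   smartctl -a /dev/ada1 | smart_key_attrs
```

In the ada1 output posted earlier, all four raw values are 0, which fits the reading that the disk itself looks fine.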
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
Actually ada1 (drive 2) came up from reboot as UNAVAIL. I turned everything off and reseated cables, then ada1 and ada3 showed UNAVAIL. Checking the BIOS, the drives WERE connected. So I clicked REPLACE on ada1, and after some time, it returned. I am in the process of doing the same with ada3 atm.
 

gpsguy

Active Member
Joined
Jan 22, 2012
Messages
4,472
Are all of your HDDs connected to the Intel controller? Or are some on the Marvell controller? If the latter, I would move them to the Intel.

Some folks with the FreeNAS Mini (which has both Intel & Marvell controllers) have run into issues using the Marvell controller.
 

eqartimus

Explorer
Joined
Aug 2, 2014
Messages
66
All healthy now ... go figure. I am happy, but still don't understand why I needed to resilver two drives.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
eqartimus said: "Actually ada1 (drive 2) came up from reboot as UNAVAIL. I turned everything off and reseated cables, then ada1 and ada3 showed UNAVAIL. Checking the BIOS, the drives WERE connected. So I clicked REPLACE on ada1, and after some time, it returned. I am in the process of doing the same with ada3 atm."
The typical PSU has groups of 5 wires for SATA power connectors, consisting of 3.3 V (orange), 5 V (red), 12 V (yellow) and 2 black ground wires.
Is there a chance your two drives in question are powered by a group of five wires that powers just those two drives? If this is indeed the case,
THAT's where I'd start looking.
 