Z1 Pool in Faulted state, any hope?

Status
Not open for further replies.

TempestDash

Cadet
Joined
Jul 12, 2013
Messages
4
I am (embarrassingly) running FreeNAS-8.0.1-BETA4-amd64 because it had been working without issue for well over a year now. Well, suddenly my CIFS shares weren't working so I rebooted the box (checking the control panel didn't appear to yield any memorable error messages, but I forget now what it said). After the boot, my only storage volume had "unknown" in all its specs (size, available, etc). Checking the log showed that it was issuing zpool commands, but the OS reported that it didn't recognize the volume.

I searched online for some help and ran zpool import and discovered this:

Code:
freenas# zpool status
  pool: Storage-A
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
  see: http://www.sun.com/msg/ZFS-8000-72
scrub: none requested
config:
 
        NAME        STATE    READ WRITE CKSUM
        Storage-A  FAULTED      0    0    1  corrupted data
          raidz1    ONLINE      0    0    6
            ada0p2  ONLINE      0    0    0
            ada1p2  ONLINE      0    0    0
            ada3p2  ONLINE      0    0    0
            ada4p2  ONLINE      0    0    0


Is there any hope for me? I ran SMART on the four drives in this array and learned that the 4th drive (ada4p2) needed to be replaced. But it doesn't seem like that's going to make much of a difference, as I can't even get the pool imported into a Degraded state so I can take the disk offline.

Should I just shut it down, replace the disk, and try to import it again? Should I upgrade to the newest FreeNAS version? Should I just give up and lie down in traffic? While I'm annoyed at the loss of 2 TB of video files, I can recover those. I'm a little more miffed at the 1 TB in pictures and 100 GB in documents I've lost...

More info:

ZDB -l
Code:
freenas# zdb -l /dev/ada4p2
--------------------------------------------
LABEL 0
--------------------------------------------
    version=15
    name='Storage-A'
    state=0
    txg=11493913
    pool_guid=10764243186972375573
    hostid=3767420894
    hostname=''
    top_guid=13484309572823447892
    guid=4440550159471886995
    vdev_tree
        type='raidz'
        id=0
        guid=13484309572823447892
        nparity=1
        metaslab_array=23
        metaslab_shift=32
        ashift=9
        asize=5992598208512
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=5911223021406621106
                path='/dev/ada0p2'
                whole_disk=0
                DTL=4140
        children[1]
                type='disk'
                id=1
                guid=15594108829430195818
                path='/dev/ada1p2'
                whole_disk=0
                DTL=4139
        children[2]
                type='disk'
                id=2
                guid=2071351665733571589
                path='/dev/ada3p2'
                whole_disk=0
                DTL=4138
        children[3]
                type='disk'
                id=3
                guid=4440550159471886995
                path='/dev/ada4p2'
                whole_disk=0
                DTL=4137
--------------------------------------------
LABEL 1
--------------------------------------------
    version=15
    name='Storage-A'
    state=0
    txg=11493913
    pool_guid=10764243186972375573
    hostid=3767420894
    hostname=''
    top_guid=13484309572823447892
    guid=4440550159471886995
    vdev_tree
        type='raidz'
        id=0
        guid=13484309572823447892
        nparity=1
        metaslab_array=23
        metaslab_shift=32
        ashift=9
        asize=5992598208512
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=5911223021406621106
                path='/dev/ada0p2'
                whole_disk=0
                DTL=4140
        children[1]
                type='disk'
                id=1
                guid=15594108829430195818
                path='/dev/ada1p2'
                whole_disk=0
                DTL=4139
        children[2]
                type='disk'
                id=2
                guid=2071351665733571589
                path='/dev/ada3p2'
                whole_disk=0
                DTL=4138
        children[3]
                type='disk'
                id=3
                guid=4440550159471886995
                path='/dev/ada4p2'
                whole_disk=0
                DTL=4137
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3
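
A quick way to summarize output like this is to count how many labels unpacked. ZFS keeps four copies of the label on each device (labels 0 and 1 at the front, 2 and 3 at the end), so the output above means only the two front labels on ada4p2 were readable. The helper below is just a sketch; label_summary is a made-up name, and it assumes the zdb -l output format shown above.

```shell
# Sketch only: label_summary is an illustrative name, not part of ZFS.
# Reads `zdb -l` output on stdin and reports how many labels were readable.
label_summary() {
    awk '
        /^LABEL [0-9]+/            { total++ }  # one header per label
        /^failed to unpack label/  { bad++ }    # label could not be read
        END { printf "%d of %d labels readable\n", total - bad, total }
    '
}

# Example (read-only):
#   zdb -l /dev/ada4p2 | label_summary
```

Running the same check against each pool member (ada0p2, ada1p2, ada3p2) would show whether the missing back labels are specific to the failing disk.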


gpart show
Code:
freenas# gpart show
=>        34  3907029101  ada0  GPT  (1.8T)
          34          94        - free -  (47K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  3902834703    2  freebsd-zfs  (1.8T)
 
=>        34  2930277101  ada1  GPT  (1.4T)
          34          94        - free -  (47K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  2926082703    2  freebsd-zfs  (1.4T)
 
=>    63  8027649  ada2  MBR  (3.8G)
      63  1930257    1  freebsd  [active]  (943M)
  1930320      63        - free -  (32K)
  1930383  1930257    2  freebsd  (943M)
  3860640    3024    3  freebsd  (1.5M)
  3863664    41328    4  freebsd  (20M)
  3904992  4122720        - free -  (2.0G)
 
=>        34  3907029101  ada3  GPT  (1.8T)
          34          94        - free -  (47K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  3902834703    2  freebsd-zfs  (1.8T)
 
=>      0  1930257  ada2s1  BSD  (943M)
        0      16          - free -  (8.0K)
      16  1930241      1  !0  (943M)
 
=>        34  2930277101  ada4  GPT  (1.4T)
          34          94        - free -  (47K)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  2926082703    2  freebsd-zfs  (1.4T)


camcontrol devlist
Code:
freenas# camcontrol devlist
<ST2000DL003-9VT166 CC32>          at scbus2 target 0 lun 0 (pass0,ada0)
<WDC WD15EARS-00Z5B1 80.00A80>    at scbus3 target 0 lun 0 (pass1,ada1)
<SanDisk SDCFX3-4096 HDX 4.03>    at scbus4 target 0 lun 0 (pass2,ada2)
<Hitachi HDS5C3020ALA632 ML6OA580>  at scbus6 target 0 lun 0 (pass3,ada3)
<ST31500341AS CC1H>                at scbus7 target 0 lun 0 (pass4,ada4)


Please help if you can.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Ok, here's some advice: I'll try to see you through to the end if you are patient....

1. STOP. No really, STOP. Don't go trying to do things you searched for on the internet or that you think are good things to try. Some of them, once done, can be destructive and cannot be undone.
2. Did you listen to me about #1? This is the most common way people end up losing everything. They go doing stuff they shouldn't and then can't believe it when everything is lost.
3. Do you have a spare hard drive available to replace the failed drive? If not, you should order one now. We may need it for recovering your data. Do not discard your failed drive. You may need it to save your pool.
4. Do you have a spare USB stick? You need to install FreeNAS 8.3.1-p2 64 bit on it. Don't do any settings changes, just install it and leave it at defaults for now.
5. Your current USB stick that has FreeNAS installed should be set aside for now. We may need to use it later, but we don't want to mess with it right now.
6. Do a RAM test on your machine. You can make a boot disk from www.memtest86.com. Just boot from it and let it run. 3 passes with no errors means you have good RAM. If your RAM is bad, your chances of recovery go way down based on past experiences. This could take a few hours, so your best bet is to start it and let it run overnight.
7. If you have backups, you should check on them. If you have thorough backups then your best bet is to destroy and recreate the zpool, as you may spend a lot of time on this and there is no guarantee you'll get good data when we're done. Right now it's sounding like you may not have backups of anything.
8. You posted all the good stuff, but I need one more thing. Provide the output of smartctl -a -q noserial /dev/ada0 through ada4. I want to see how all of your hard drives are doing.
9. Post all of your hardware specs. Especially RAM, CPU, motherboard, and what controller your hard drives are connected to.
10. Don't go trying to do imports and stuff. Just leave everything be for now. Aside from the RAM test, you could even shut down the machine for now and just wait.
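
For step 8, something along these lines would save each drive's report to its own file so it can be pasted later. Treat it as a rough sketch: collect_smart, the output directory, and the SMARTCTL override are names I made up for illustration. It only reads SMART data and does not touch the pool.

```shell
# Sketch only: collect_smart and the SMARTCTL override are illustrative names.
SMARTCTL="${SMARTCTL:-smartctl}"

# Usage: collect_smart OUTDIR DEV...
# Writes the full SMART report (serial numbers suppressed) for each device
# to OUTDIR/smart_DEV.txt.
collect_smart() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        "$SMARTCTL" -a -q noserial "/dev/$dev" > "$outdir/smart_$dev.txt" 2>&1
    done
    # smartctl exits non-zero for an unhealthy disk, so don't treat that as an error.
    return 0
}

# Example:
#   collect_smart /tmp/smart ada0 ada1 ada2 ada3 ada4
```

The /tmp/smart path is an assumption; any writable location that is not on the faulted pool would do.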

Before you lie down in traffic, provide the answers/info I asked for above. Most importantly, don't go trying to run commands to get your pool working. I know you're probably in a panic, praying to whatever deity you believe in, and perhaps wondering if you're still going to be married in a month. But right now is when you need to keep your head clear, not panic, and recognize that if your data is gone, it's already gone. What you are going to do now is fight to get your data back. This is a slow and tedious process but there is a chance you may see your data again.

Edit: If we were able to get at your data, but you were in a position where you'd have to destroy and recreate your zpool from scratch, do you have enough spare storage on other computers in the house to store all of your data? If not, you should consider how you would deal with that scenario if we got to that. If you have a friend that is willing to loan you some hard drives or something you may want to see about calling in any favors...
 

TempestDash

Cadet
Joined
Jul 12, 2013
Messages
4
Well, #1 may be too little too late but I shall do nothing further without say so.

I have a spare disk, not that it is probably going to help me much.
I have a USB drive which I put the memtest on and it's running now. So far no issues.
I've downloaded FreeNAS 8.3.1-p2 64-bit, but haven't yet put it on the USB stick because of the aforementioned memtest.

My current flash drive is actually a CF-card installed in a CF-to-IDE adapter, but I'll set it aside for now regardless. :)

Hardware:
Motherboard: ASUS M2N68-AM PLUS AM3/AM2+/AM2 NVIDIA Geforce 7025/nForce 630a Micro ATX AMD Motherboard
CPU: AMD Athlon 64 X2 5600+ Brisbane 2.9GHz Socket AM2 65W Dual-Core Processor ADO5600IAA5DO - OEM
RAM: G.SKILL 2GB (2 x 1GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) Dual Channel Kit Desktop Memory Model F2-6400CL5D-

If I had a backup, I would be lying down in traffic. As the story always goes, I was about to arrange cloud backup of my array, but that is also now too little too late.

I was also thinking of ditching my antiquated Athlon box in favor of a Synology NAS, so perhaps my FreeNAS installation did this to me out of spite.

I ran the smart checks on all the drives before starting the memtest... and then I had to reload my browser and lost all the diagnostics. I only have the tests from ada3 and ada4 (and part of ada1) at the moment but will provide the rest once the 3 passes are done:

smartctl -a -q noserial /dev/ada1
Code:
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0
  2 Throughput_Performance  0x0005  136  136  054    Pre-fail  Offline      -      94
  3 Spin_Up_Time            0x0007  146  146  024    Pre-fail  Always      -      382 (Average 368)
  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      26
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0
  8 Seek_Time_Performance  0x0005  146  146  020    Pre-fail  Offline      -      29
  9 Power_On_Hours          0x0012  098  098  000    Old_age  Always      -      16769
10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      26
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      40
193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      40
194 Temperature_Celsius    0x0002  166  166  000    Old_age  Always      -      36 (Min/Max 20/47)
196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -q noserial /dev/ada3
Code:
freenas# smartctl -a -q noserial /dev/ada3
smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family:    Hitachi Deskstar 5K3000
Device Model:    Hitachi HDS5C3020ALA632
Firmware Version: ML6OA580
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jul 13 00:36:43 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (21268) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0
  2 Throughput_Performance  0x0005  136  136  054    Pre-fail  Offline      -      94
  3 Spin_Up_Time            0x0007  146  146  024    Pre-fail  Always      -      382 (Average 368)
  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      26
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0
  8 Seek_Time_Performance  0x0005  146  146  020    Pre-fail  Offline      -      29
  9 Power_On_Hours          0x0012  098  098  000    Old_age  Always      -      16769
10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      26
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      40
193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      40
194 Temperature_Celsius    0x0002  166  166  000    Old_age  Always      -      36 (Min/Max 20/47)
196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a -q noserial /dev/ada4
Code:
freenas# smartctl -a -q noserial /dev/ada4
smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family:    Seagate Barracuda 7200.11
Device Model:    ST31500341AS
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jul 13 00:37:50 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      122012285
  3 Spin_Up_Time            0x0003  100  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      68
  5 Reallocated_Sector_Ct  0x0033  001  001  036    Pre-fail  Always  FAILING_NOW 4118
  7 Seek_Error_Rate        0x000f  081  060  030    Pre-fail  Always      -      126494888
  9 Power_On_Hours          0x0032  065  065  000    Old_age  Always      -      31442
10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
12 Power_Cycle_Count      0x0032  100  037  020    Old_age  Always      -      59
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  099  000    Old_age  Always      -      25770196998
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      215
190 Airflow_Temperature_Cel 0x0022  051  044  045    Old_age  Always  In_the_past 49 (0 114 49 46)
194 Temperature_Celsius    0x0022  049  056  000    Old_age  Always      -      49 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a  047  020  000    Old_age  Always      -      122012285
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      2
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      2
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0
240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      179387899083473
241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      3793779357
242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      876807786
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


You can see the failure in ada4 above.
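
To pull the worrying rows out of a report like this without reading the whole table, a small filter can help. This is only a sketch, assuming the ten-column attribute layout smartctl printed above; smart_warnings is a made-up name, and the choice of attributes 5, 197, and 198 as the ones to flag on a non-zero raw value is my own assumption.

```shell
# Sketch only: smart_warnings is an illustrative name.
# Reads `smartctl -a` (or -A) output on stdin and prints attribute rows that
# either have WHEN_FAILED set or show a non-zero raw count for sector problems.
smart_warnings() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1 + 0; name = $2; when_failed = $9; raw = $10
            if (when_failed != "-")
                printf "%d %s: WHEN_FAILED=%s raw=%s\n", id, name, when_failed, raw
            else if ((id == 5 || id == 197 || id == 198) && raw + 0 > 0)
                printf "%d %s: raw=%s\n", id, name, raw
        }
    '
}

# Example (read-only):
#   smartctl -a -q noserial /dev/ada4 | smart_warnings
```

Against the ada4 table above this would list Reallocated_Sector_Ct (FAILING_NOW), Airflow_Temperature_Cel (In_the_past), and the two pending/uncorrectable sector counts.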

It is also worth noting that this situation may have been avoidable, if I had realized that periodic SMART tests hadn't been configured on my install.

I noticed in your signature that you have a newbie's guide to basic NAS concepts and terminology. It looks pretty good. What FreeNAS really needs, though, is a newbie's guide through a common setup, so that things like turning on S.M.A.R.T. tests (and reminding the user to configure their e-mail server) can be placed more prominently for idiots like myself.

Thank you for your quick reply!
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
It is also worth noting that this situation may have been avoidable, if I had realized that periodic SMART tests hadn't been configured on my install.

It is possible that the situation may have been avoidable. Running with only 1 parity disk is risky statistically. RAID5 was considered to be "dead" in 2009.

Your hard drives look good aside from ada4. If the other disks are good that's a good sign for you.

I'm not sure exactly why you are considering switching to a Synology NAS. They have some advantages, but you should also recognize that if we do manage to save your data, it may be only because you are using ZFS. You have far fewer options for recovering a hardware RAID after a disk failure unless you are okay with spending 4-5 figures to get your data back.

I noticed in your signature that you have a newbie's guide to basic NAS concepts and terminology. It looks pretty good. What FreeNAS really needs, though, is a newbie's guide through a common setup, so that things like turning on S.M.A.R.T. tests (and reminding the user to configure their e-mail server) can be placed more prominently for idiots like myself.

I'd agree, but if you do what I did when I first heard about FreeNAS, you shouldn't accidentally skip anything. For me, I:

1. Printed out the manual and read it cover to cover. Did a lot of Google searches since I had a lot of questions.
2. Tested FreeNAS in several VMs, doing all sorts of nasty stuff to it.
3. When I finally put it on real hardware, I did all sorts of nasty stuff again.
4. Reset FreeNAS to defaults and went screen by screen setting up every possible feature I could. I even put in my City, State, Zip Code, etc despite thinking there's no value.

I also created a script that emails me nightly with a brief overview of SMART data along with all of the SMART data on all of the drives. I check out the brief overview and if I have any reason to be concerned I look at the detailed info. I wasn't interested in logging into my server daily/weekly/whatever to check SMART data manually, so I made the server do the hard work.
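
This isn't the actual script, just a minimal sketch of the idea so you can see the shape of it. The function name, the SMARTCTL override, the list of attributes in the overview, the device names, and the e-mail address are all assumptions for illustration.

```shell
# Sketch only: smart_overview and the SMARTCTL override are illustrative names.
SMARTCTL="${SMARTCTL:-smartctl}"

# Print a brief overview per device: overall health plus a few key attributes.
smart_overview() {
    for dev in "$@"; do
        echo "== /dev/$dev =="
        "$SMARTCTL" -H -A "/dev/$dev" |
            grep -E 'overall-health|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Temperature_Celsius' ||
            echo "  (no matching lines)"
    done
}

# Example nightly cron job (recipient and devices are placeholders):
#   smart_overview ada0 ada1 ada3 ada4 | mail -s "SMART overview" admin@example.com
```

The full reports could be appended to the same mail by looping over the drives again with smartctl -a.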

Since you said you did do some stuff, can you write back with what you tried and what output you have or can remember? I'd like to know what you did and didn't do and what kind of results you got, error messages or otherwise.

You have 1/3rd of the recommended minimum RAM for ZFS. If you have more RAM, I'd highly recommend you install it. Insufficient RAM can cause system instability and make ZFS very, very slow. Naturally, if you do add spare RAM, you should run the RAM test again afterward (boo), but it may help. Generally, if you said you were building a ZFS system with 2GB, I'd dismiss you as being silly and not even trying to follow the minimum recommendations of the manual. Plenty of people have had data loss because of insufficient RAM causing system panics.
 

TempestDash

Cadet
Joined
Jul 12, 2013
Messages
4
Wow, I really misunderstood what my risk was in the RAID5 configuration. I had anticipated that I would get a fault one day and just replace the drive and rebuild. I didn't realize the potential for full RAID collapse was as high as that article implies.

I take back what I said earlier. There should just be a big sign pointing newbies away from all of this. There is little chance a person will learn enough, even when trying to be cautious (as I believed I was being), to avoid losing all their data at least once. And I'm a fairly risk-averse person... and I have a CS degree and took sysadmin classes in college. :/

Footprint was why I was considering Synology. My FreeNAS box is in a mid-tower. Though I suppose I could put together a better one in the mITX form factor and use an external SATA drive chassis.

Anyway...

Memtest ran all night (I'm in EST and it was nearly 3am). 5 passes, no errors. I don't have any spare RAM handy, but if it will help during this rebuild process, I live next to a Microcenter. I can see if there are some discounted sticks lying around....

Wait. I take that back. I just rebuilt my primary desktop. I have 8GB of RAM in my basement. I'll go swap that into the machine and re-run memtest.

As far as what I did, almost nothing had any effect except generating error messages (or reporting back that I was using flags zpool didn't understand). I essentially Googled "FreeNAS No Such Pool" and tried everything I could find. Specifically:
Code:
zpool import
zpool -f Storage-A
zpool clear
gpart recover Storage-A
zpool online Storage-A


On the FreeNAS GUI, it still says the Volume is there, so I also tried hitting the button to replace the drive and export the drive, but those did nothing but produce errors.
 

TempestDash

Cadet
Joined
Jul 12, 2013
Messages
4
Aha. I wasn't actually running with 2GB of RAM. When I opened the machine to swap the RAM out I discovered I had 4 GB in there in two 2GB sticks of ADATA DDR2 800. Apparently I had upgraded when I built the FreeNAS box. I also discovered the additional RAM I had on hand was DDR3 and therefore incompatible. Ah well... I guess I'm stuck with the 4GB of ADATA RAM for now.

When I was in the machine, I also discovered I have a PCI SATA RAID card in there. Vantec 6-Port SATA II 150 PCI Host Card with RAID

Two of the drives in there are connected to this additional RAID card.

One of the things I'm interested in if I DO recover my data, is a way to set up a cloud backup to BackBlaze or Carbonite. I don't need all 3 TB of data going there, just one mount point.

The other thing I noticed while in my basement is that I have a stack of smaller hard drives (500GB - 750 GB) just doing nothing. Perhaps they could be added in some way to increase redundancy?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I sent you a PM. We'll continue this in PMs...
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
I was able to recover his data after a few hours with TeamViewer. He was pretty lucky on that one. I thought it was game over when he said he had tried the -f flag. :p
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I'm curious what the underlying fix was. I'm not asking for in-depth detail, but can you provide a high-level overview of what was needed to recover from the metadata corruption?

I see how bad the 4th disk was with over 4,000 bad sectors. I assume smartd wasn't configured, and the disk got this bad without noticing?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
Honestly, I don't remember exactly what we did and what command we ran. :(

I do remember that he didn't have smartd configured. It seems about 1/2 the people that end up with an unmountable zpool don't have smartd set up with emails. :(
 