2 Faulted Drives - Any chance at saving data

Sauce · Mar 10, 2019

Hi,
The volume ARRAY2 state is UNAVAIL: One or more devices are faulted in response to IO failures.
Version FreeNAS-11.1-U1
All drives are SATA-connected hot-swap drives.

I had a failed drive with 1.86K errors, and before I could replace it, another drive faulted with 89 errors this morning.

Here are the results of zpool status -x:

Code:

  pool: ARRAY2
state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub repaired 0 in 0 days 12:54:59 with 1 errors on Sun Mar  3 12:55:00 2019
config:
        NAME                                            STATE     READ WRITE CKSUM
        NEWSTREAM3-ARRAY2                               DEGRADED 8.46K    86 8.62K
          raidz1-0                                      DEGRADED 16.9K    83 17.3K
            gptid/fa5f83ad-2dcc-11e9-90a0-001517f59eb7  FAULTED     42 1.86K     2  too many errors
            gptid/e4671feb-bca1-11e8-94db-001517f59eb7  DEGRADED     5    89     0  too many errors
            gptid/e551d03b-bca1-11e8-94db-001517f59eb7  DEGRADED     0     0     0  too many errors
            gptid/e61f2ab1-bca1-11e8-94db-001517f59eb7  DEGRADED     0     0     0  too many errors
errors: 1538 data errors, use '-v' for a list

Then I ran zpool clear (I hope this didn't make it worse), and now I can see that 2 drives are online and 2 are faulted.

Code:

pool: ARRAY2
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub repaired 0 in 0 days 12:54:59 with 1 errors on Sun Mar  3 12:55:00 2019
config:
        NAME                                            STATE     READ WRITE CKSUM
        NEWSTREAM3-ARRAY2                               UNAVAIL      0     0     0
          raidz1-0                                      UNAVAIL      0     0     0
            gptid/fa5f83ad-2dcc-11e9-90a0-001517f59eb7  FAULTED     18     0     0  too many errors
            gptid/e4671feb-bca1-11e8-94db-001517f59eb7  FAULTED     15     0     0  too many errors
            gptid/e551d03b-bca1-11e8-94db-001517f59eb7  ONLINE       0     0     0
            gptid/e61f2ab1-bca1-11e8-94db-001517f59eb7  ONLINE       0     0     0
errors: 1584 data errors, use '-v' for a list.
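For readers following along: a minimal sketch of how the config table above can be read programmatically, assuming the column layout shown in this post. The helper names (`parse_config`, `to_count`) are made up for illustration and are not part of any ZFS tooling. It simply counts how many leaf devices are not usable, which is the number that matters for a single-parity vdev.

```python
import re

# Config section copied from the second listing in this post.
SAMPLE = """\
config:
        NAME                                            STATE     READ WRITE CKSUM
        NEWSTREAM3-ARRAY2                               UNAVAIL      0     0     0
          raidz1-0                                      UNAVAIL      0     0     0
            gptid/fa5f83ad-2dcc-11e9-90a0-001517f59eb7  FAULTED     18     0     0  too many errors
            gptid/e4671feb-bca1-11e8-94db-001517f59eb7  FAULTED     15     0     0  too many errors
            gptid/e551d03b-bca1-11e8-94db-001517f59eb7  ONLINE       0     0     0
            gptid/e61f2ab1-bca1-11e8-94db-001517f59eb7  ONLINE       0     0     0
errors: 1584 data errors, use '-v' for a list.
"""

# name, state, then the three error counters; trailing notes are ignored.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+(?P<read>\S+)\s+(?P<write>\S+)\s+(?P<cksum>\S+)"
)


def to_count(text):
    """Turn '89' or '1.86K' into an integer (suffixes treated as decimal)."""
    scale = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}
    if text[-1] in scale:
        return int(round(float(text[:-1]) * scale[text[-1]]))
    return int(text)


def parse_config(status_text):
    """Return one dict per row between 'config:' and 'errors:'."""
    rows, in_config = [], False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "config:":
            in_config = True
            continue
        if stripped.startswith("errors:"):
            break
        match = ROW.match(line) if in_config else None
        if match and match["name"] != "NAME":  # skip the header row
            rows.append({
                "name": match["name"],
                "state": match["state"],
                "read": to_count(match["read"]),
                "write": to_count(match["write"]),
                "cksum": to_count(match["cksum"]),
            })
    return rows


rows = parse_config(SAMPLE)
leaves = [r for r in rows if r["name"].startswith("gptid/")]
unusable = [r["name"] for r in leaves if r["state"] != "ONLINE"]
print(f"{len(unusable)} of {len(leaves)} member disks are not ONLINE")
```

With two of four members unusable, a raidz1 vdev (which tolerates one missing member) cannot serve data, which matches the UNAVAIL state reported above.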

The only way out of this pickle I can think of, before giving in and deleting and re-creating the whole pool, is to somehow force it online as if nothing is wrong and try a resilver?

Assuming the data on gptid/e4671feb-bca1-11e8-94db-001517f59eb7 is actually OK and it's a bad cable, is it possible to force it online, as it only has a few errors, and then replace/resilver the badly failed drive first? I have now received a replacement drive for the first failure.

I suspect it is a bad cable, as the drive passed all SMART tests and surface read/write tests when it failed the same way several months ago.

Heracles · Mar 10, 2019

This is why we tell people to avoid RaidZ1 and never think that snapshots are backups...

This is also why I do not like Raid-10. Whenever a drive fails, the entire pool is dependent on a single drive, and any error on that single drive can be unrecoverable.

Some hardware experts can repair broken hard drives, but usually at a cost so high that a residential user cannot consider it.

If you were using RaidZ1 without a backup, I am sorry for you, and I hope the lesson is learned now...

Sauce · Mar 10, 2019

The failed array contains our backups, so it would be a major PITA to lose them, but it would not be a complete disaster.
Yeah, I wish I had replaced the failed drive straight away (though I think it's a cable or something random, not the drive itself), but I didn't have a replacement ready.
To reiterate, I believe both failed drives are actually OK.

Heracles · Mar 10, 2019

Hi again,

Well, if the content itself was backups, the impact is lower than when losing the main and only copy of your data.

So from what you said, I guess your pool was RaidZ1.

You said that a first drive went offline. Maybe from a cable or something else.

You kept the pool working without that failed drive No1.

Then a second drive went offline, and now the pool is gone.

If that is right, know that failed drive No1 is now useless. Because the pool kept working without it, failed drive No1 is now out of sync with the pool and so cannot be used to recover anything. If the drive itself is still good, it would need to be completely re-silvered from the pool before being of any use.

So that leaves failed disk No2. If you can bring that one back online, you may be able to access your pool again in a degraded state. Because the pool stopped working after it went offline, the pool and that drive may still be in sync. If that works, do not waste any time and re-silver the pool with a new hard drive.
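The order of operations described above can be sketched as a small dry-run planner. This is only an illustration under stated assumptions: it prints commands and runs nothing, `recovery_plan` is a made-up helper, the pool name is taken from the `pool:` line in the first post, and `gptid/<new-disk>` is a placeholder for whatever identifier the replacement drive ends up with. The `zpool online` step may be unnecessary if the pool imports by itself after a reboot, and on FreeNAS the replacement is normally done from the GUI rather than by hand.

```python
def recovery_plan(pool, returning_disk, dead_disk, new_disk):
    """Return the CLI steps, in order, as strings (nothing is executed)."""
    return [
        # 1. Ask ZFS to bring the still-in-sync disk (No2) back into the vdev.
        f"zpool online {pool} {returning_disk}",
        # 2. Confirm the pool is now DEGRADED rather than UNAVAIL, and list damaged files.
        f"zpool status -v {pool}",
        # 3. Re-silver onto the new drive in place of the out-of-sync disk (No1).
        f"zpool replace {pool} {dead_disk} {new_disk}",
        # 4. Watch the resilver progress.
        f"zpool status {pool}",
    ]


plan = recovery_plan(
    pool="ARRAY2",
    returning_disk="gptid/e4671feb-bca1-11e8-94db-001517f59eb7",
    dead_disk="gptid/fa5f83ad-2dcc-11e9-90a0-001517f59eb7",
    new_disk="gptid/<new-disk>",  # placeholder, not a real identifier
)
for step in plan:
    print(step)
```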

Once re-silvered, you can add a hot spare to that pool, so FreeNAS will re-silver to that one immediately should it be needed. Not as good as RaidZ2, but better than what you had and did.

But in the end, you need a pool with better protection than RaidZ1, and the only way to have one is to completely migrate your data out of this pool and to a new one.

Good luck,

Chris Moore · Mar 10, 2019

Sauce said:
raidz1

Having two drives faulted in RAIDz1 is a problem. That is why we suggest RAIDz2 which can survive two faulted drives.

Sauce said:
Here are the results of zpool status -x:

When you post listings like this, please use code tags like this [CODE] your text here [/CODE] so that your output is easier to read, like this:

Code:

root@Emily-NAS:/tmp # zpool status
  pool: Backup
 state: ONLINE
  scan: scrub repaired 0 in 0 days 03:31:21 with 0 errors on Sat Mar  9 23:47:00 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        Backup                                          ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/2e919d3d-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/2f292da6-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/2fb95d07-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/30514e6b-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            gptid/41d3312f-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/426b7b47-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/43029d18-2c1a-11e9-af8b-00074306773b  ONLINE       0     0     0
            gptid/af54c9c6-4277-11e9-af8b-00074306773b  ONLINE       0     0     0

errors: No known data errors

Chris Moore · Mar 10, 2019

Sauce said:
I suspect it is a bad cable, as the drive passed all SMART tests and surface read/write tests when it failed the same way several months ago.

Please post your hardware details as this could be the result of some incompatible hardware. Here is a guide:

Forum Guidelines
https://www.ixsystems.com/community/threads/forum-guidelines.45124/

Sauce · Mar 10, 2019

Hi Chris,
Nice to meet you. Consider me converted!

Sauce · Mar 10, 2019

I'll dump all the hardware specs in a minute, but Heracles, what would be the best procedure for Drive No.2?

1. Shut down FreeNAS
2. Check drive (assume I get it online)
3. Boot up FreeNAS
4. Will it just show up again by itself?

Heracles · Mar 10, 2019

Shutting down FreeNAS first is safe and will let you check all the cables and connections.
Next, unplug the power from the server and use anti-static protection before opening the case.
Check all your connections, and if you can do some clean-up with compressed air, that is often a good thing.
Once you are sure all your connections are good, close the case, plug the power back in and reboot.

If that disk was indeed offline because of a poor connection and the sync between it and the pool was preserved, you may have a chance to see FreeNAS load your pool in a degraded state.

But now that I see your zpool status (it was not visible when I first replied, and I did not double-check your first post before my second reply), I would say that you can try it, but the hope is pretty low.

Because you were using RaidZ1, you can re-silver your pool only if all of the remaining N-1 disks in the pool are in perfect shape and do not contain any corruption. Because there is no more redundancy, there is no way for ZFS to recover from an error on any of the remaining drives.
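To illustrate the point about the N-1 disks, here is a toy single-parity model using plain XOR. It is a sketch of the general idea only, not ZFS's actual on-disk layout, and the function names are invented for the example. With one disk missing, the lost block is rebuilt from the survivors. If a survivor is also damaged, the rebuild produces wrong data, and with no second parity there is nothing left to correct it (ZFS checksums would flag the bad block, but could not repair it).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_missing(stripe, missing_index):
    """Rebuild the block at missing_index from every other block in the stripe."""
    survivors = [b for i, b in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = make_stripe(data)  # 3 data + 1 parity, like a 4-disk single-parity vdev

# Case 1: disk 0 is gone, the other three are intact -> exact rebuild.
assert rebuild_missing(stripe, 0) == data[0]

# Case 2: disk 0 is gone AND disk 1 has a corrupted byte -> rebuild is wrong.
damaged = list(stripe)
damaged[1] = b"BBBX"
assert rebuild_missing(damaged, 0) != data[0]

print("one missing disk: recoverable; one missing plus one damaged: not recoverable")
```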

From what I see, I understand that Failed Disk No2 suffered errors even before Failed Disk No1 died. That would mean it contains errors itself. If it does, it may well be impossible to re-silver the pool, because the N-1 disks are not pristine and no more redundancy is available to fix these errors.

So you can try, but I would not celebrate before the pool is back as Healthy and you have confirmed your data is usable...

Still, good luck...

Sauce · Mar 10, 2019

OK, ARRAY2 is back online in a degraded state.
I left out the badly failed drive No1 (I suspect it's healthy, though), and it's showing as UNAVAIL and the button shows REPLACE.
I didn't mark it offline before removing it, however.
Is the best course of action to add in the new drive and hit REPLACE?

Sauce · Mar 10, 2019

RTFM, ARRAY is resilvering...

OK, so I think, as Chris has mentioned, my hardware is incompatible.
When I set up this box, I bought a new RAID card that I thought was supported, but its revision was not.
Due to the cost, I didn't want to waste it, so we installed a custom driver and got it going.
Bad decision in hindsight...

Heracles · Mar 10, 2019

Wow! You are a lucky one!

Happy that the procedure worked for you!

So now that re-silvering is in progress, I would first let it run. If everything was indeed caused by poor connections, re-silvering has a high probability of clearing everything. The only requirement is that the errors on Faulted Drive 1 are not on the same blocks as those on Faulted Drive 2. Statistically, that is very likely, so re-silvering has a high chance of putting your pool back together.
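A rough back-of-the-envelope sketch of that "statistically" remark, assuming bad blocks land uniformly and independently on each drive (real failures often cluster, so treat this as optimistic). The block count and the bad-block counts below are hypothetical numbers chosen for illustration, not measurements from this pool.

```python
from math import comb


def prob_no_overlap(total_blocks, bad_on_a, bad_on_b):
    """Probability that no bad block position on drive B coincides with one on drive A.

    Drive A's bad positions are fixed; drive B's are a uniformly random subset.
    """
    if bad_on_a + bad_on_b > total_blocks:
        return 0.0  # pigeonhole: at least one position must coincide
    return comb(total_blocks - bad_on_a, bad_on_b) / comb(total_blocks, bad_on_b)


# Hypothetical example: 1,000,000 block positions, 18 bad on one drive, 15 on the other.
print(f"chance of no shared bad block: {prob_no_overlap(1_000_000, 18, 15):.6f}")
```

The fewer bad blocks there are relative to the size of the drives, the closer this gets to 1, which is the intuition behind letting the resilver run.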

Still, you have now learned that RaidZ1 is not strong enough protection. The easy and quick way to add redundancy would be to add a hot spare. So once re-silvering is done, add an extra drive to the server and mark it as a hot spare. If your hardware does not support hot-plug, you will need to power off for that. With a hot spare in place, re-silvering should start by itself as soon as required.

Then, you will need to design and build a pool with at least RaidZ2. For that, you need to identify your needs, design the solution, build it properly and then migrate your data to that new pool.

Lucky you, and I hope that luck stays with you until you have moved to a solution stable enough not to require luck anymore.
