How to repair ZFS Pool in UNAVAIL state?

Hendrixx

Dabbler
Joined
Jul 6, 2020
Messages
32
Hi,

I have a (big?) problem right now with my TrueNAS ZFS pool, and I want to recover the data on it.

The problem is that right now it is in a UNAVAIL state:

Code:
root@freenas:/ # zpool import
   pool: ZFS_POOL01
     id: 5460348760061402288
  state: UNAVAIL
status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

    ZFS_POOL01                                      UNAVAIL  insufficient replicas
      raidz2-0                                      ONLINE
        gptid/6ce0a688-ad42-11e8-9918-bc5ff4b16f83  ONLINE
        gptid/6de01fb9-ad42-11e8-9918-bc5ff4b16f83  ONLINE
        gptid/6ee80ae3-ad42-11e8-9918-bc5ff4b16f83  ONLINE
        gptid/6ff80af2-ad42-11e8-9918-bc5ff4b16f83  ONLINE
        gptid/71c378f0-ad42-11e8-9918-bc5ff4b16f83  ONLINE
      raidz2-2                                      UNAVAIL  insufficient replicas
        gptid/84cf1654-1ddf-11ea-8ccd-000e0cd95d8e  UNAVAIL  cannot open
        gptid/84e6d4e6-1ddf-11ea-8ccd-000e0cd95d8e  UNAVAIL  cannot open
        15889172290366759367                        UNAVAIL  cannot open
        gptid/84fe4a22-1ddf-11ea-8ccd-000e0cd95d8e  UNAVAIL  cannot open
    logs   
      gptid/cb2ce328-a906-11ea-9169-000e0cd95d8e    ONLINE



So here's how it started:

I received some errors on one of the drives in the raidz2-2 vdev, so I removed the drive and sent it in for warranty.
I'm still waiting on the replacement disk right now.

After a reboot it tried to resilver, but two other drives in raidz2-2 got errors, so the pool became DEGRADED.
I checked the drives left in raidz2-2 and only one was still in the ONLINE state.
The problem was that the resilver process stopped at about 1-2%, I received a 'pool I/O is currently suspended' message, and I couldn't access the pool anymore.

Besides that, I also received a continuous stream of errors after boot caused by the collectd process.
To stop this I killed the collectd process from the shell, which stopped the flood of errors (until the next reboot).

So I decided to do a fresh install of TrueNAS on my USB stick and try to import the pool again.
The problem now is that when I try to import the pool (ZFS_POOL01), it doesn't work anymore :(

When I try to import the pool, it now says UNAVAIL.
I tried the zpool import -f ZFS_POOL01 command as suggested, plus a few variations (-fFX, -fF, -FX, etc.).
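For reference, the less destructive variants (a sketch only, not advice to force anything while the raidz2-2 devices are missing) would be a read-only import and a dry-run rewind:

```shell
# Sketch only: neither command will succeed while raidz2-2 is UNAVAIL,
# but they are the safer things to retry once the missing disks reappear.

# Import without writing anything to the pool:
zpool import -o readonly=on -f ZFS_POOL01

# Dry run: -n combined with -F only *reports* whether discarding the last
# few transactions would make the pool importable; it changes nothing on disk:
zpool import -nF ZFS_POOL01
```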

Is there any way to recover from this?

Unfortunately, I don't have backups. I was in the process of setting up a second TrueNAS server
to replicate the most important data, but it isn't finished yet, so no backup :(

Note: I used the Seagate SeaTools bootable disk to run a long test on my 3 SAS disks, but it didn't find
any errors. Kind of weird, because TrueNAS marked them as DEGRADED (too many errors).
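As a cross-check of the SeaTools result, smartctl can read the disks' own health and error logs directly through the HBA (the device names here are assumptions; match them against camcontrol devlist):

```shell
# Hypothetical device names (da0 as an example); check camcontrol devlist first.
smartctl -a /dev/da0           # full health report and error counters
smartctl -t long /dev/da0      # start a long self-test on the drive itself
smartctl -l selftest /dev/da0  # read the self-test results afterwards
```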

My setup:
TrueNAS-12.0-RELEASE -> installed on 16GB Bootable USB stick
5x WD SATA 4TB disks (Red) -> connected to the SATA ports on an ASRock B75 Pro3-M motherboard.
LSI 9300-8i SAS Controller -> with 4x Seagate Exos 7e8 6TB
1x SSD 128GB used as LOG disk (ZIL)

ZFS_POOL01 setup:
1x VDEV RAIDZ-2 -> 5x 4TB WD Red disks
1x VDEV RAIDZ-2 -> 4x 6TB Seagate Exos 7e8 disks (added later)
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Unfortunately, I don't have backups. I was in the process of setting up a second TrueNAS server
to replicate the most important data, but it isn't finished yet, so no backup :(

Unfortunately, this is way too common... People always start doing their backups after they have lost everything...

So your first vdev is all clean; nothing to say about that one. Everything wrong is with the second vdev, and that is more than suspicious. Four failed drives in the same vdev at once is statistically all but impossible. As such, the problem is probably with the controller or the cables.

You are better off not forcing anything yourself, or guessing. Your data may be recoverable, but every step is high risk until it is done.

So the first step would be to ensure everything is plugged in properly and completely: cards, cables, etc. If that brings back the pool, great; do your backup right away. If it does not, you will need the help and advice of other senior members. Personally, I have never played with pool recovery and took every measure possible to never have to. I enforced the 3-copies rule from the beginning, so no matter what situation I face, I will always have a backup to go to.
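A quick way to sanity-check the "is everything detected" step from the TrueNAS shell (a sketch; sas3flash is the LSI SAS3 flash utility and may need to be installed separately):

```shell
sas3flash -list        # does the 9300-8i answer, and which firmware is it running?
camcontrol devlist     # which disks does the kernel actually see?
dmesg | grep -iE 'mpr[0-9]|da[0-9]'   # attach/error messages from the HBA driver (mpr)
```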
 

Hendrixx

Dabbler
Joined
Jul 6, 2020
Messages
32
Unfortunately, this is way too common... People always start doing their backups after they have lost everything...

Yeah, I know :(
The problem is that it's a pretty big pool with around 16-17 TB of data.
Mostly movies, music, etc., but also my iTunes library and photo library.
I did make a backup of my photo library, so that's something at least.

I suspected the SAS controller as well, because I do see lots of "DATA PROTECT - Access denied - no access rights" errors
for all 3 SAS disks on the controller.
I already tried reseating all my SAS cables and everything, but without any luck.
The strange thing is that the SeaTools app I used to check these SAS disks on the controller could read all drives without problems
and couldn't find any errors after the long tests.
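Those "DATA PROTECT" messages are SCSI sense errors reported through the HBA, so the kernel log is the place to see which disk and which command they come from (a sketch, assuming the default FreeBSD mpr driver used by the 9300-8i):

```shell
# Pull the DATA PROTECT sense errors, with surrounding context, out of dmesg:
dmesg | grep -i -B 2 -A 2 'data protect'
# Watch the system log live while exercising the disks:
tail -f /var/log/messages
```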

Any ideas how I could save the first vdev, with all its disks still ONLINE?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Any ideas how I could save the first vdev, with all its disks still ONLINE?
You can't. Your files are spread more or less evenly across both vdevs - not per file, but at the block level.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Any ideas how I could save the first vdev, with all its disks still ONLINE?

Unfortunately, there is no such thing as saving one vdev out of the many in a pool. Either the pool has all its vdevs and can mount, or it is missing at least one vdev and cannot mount at all. Vdevs are not independent file systems, each holding one file or another; they hold nothing but unrelated blocks that only ZFS can make sense of, and for that, ZFS needs all of its blocks. Spreading the blocks across all vdevs speeds up reads and writes; for RAID-Zx, it also spreads the redundancy. But the consequence is that should any vdev go missing, the entire structure collapses.
 

Hendrixx

Dabbler
Joined
Jul 6, 2020
Messages
32
Ok ... so basically, if I can't get those drives in the second vdev to work, then all the data is lost?

Or is there anything else i could try?
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Ok ... so basically, if I can't get those drives in the second vdev to work, then all the data is lost?

Pretty much... yes...

Still, considering how you described these drives becoming unavailable, they may well still hold the data needed to put the pool back together. I won't try to tell you how to do it, because I have never worked with that and never will. Other seniors here may have more input for you, so I advise you against forcing anything yourself; avoid shooting in the dark, because you may well shoot yourself in the foot.

Good luck with that, and keep reading the forum in the meantime. There are many threads about recovering damaged pools...
 

Hendrixx

Dabbler
Joined
Jul 6, 2020
Messages
32
Hmm ... ok, I guess I will be spending a lot of time on the forums searching for ways to recover my pool.

Thanks for all the info.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
I guess (sorry, my crystal ball is at the shop ;)) you still have a good chance. Three disks disconnecting all at once hints at a controller or backplane problem. Any chance you can bring in a second known-working server, even if only borrowed for a couple of days?

BTW: what does camcontrol devlist say on your current machine?
 

Hendrixx

Dabbler
Joined
Jul 6, 2020
Messages
32
I guess (sorry, my crystal ball is at the shop ;)) you still have a good chance. Three disks disconnecting all at once hints at a controller or backplane problem. Any chance you can bring in a second known-working server, even if only borrowed for a couple of days?

BTW: what does camcontrol devlist say on your current machine?

I am hoping it is something with my LSI SAS controller.
Unfortunately, at the moment I do not have a complete second working server or anything.
I do have a second desktop, but with low specs (only 6GB of memory).

I have ordered a new breakout cable for the connection from the LSI SAS controller to the SAS drives.
I just want to rule it out for sure, and the cables are not expensive.
I am also thinking of buying a second LSI 9300-8i SAS controller and trying that.

Output from 'camcontrol devlist':
Code:
root@freenas:/ # camcontrol devlist
<SEAGATE ST6000NM0285 EF04>        at scbus0 target 0 lun 0 (pass0,da0)
<SEAGATE ST6000NM0285 EF04>        at scbus0 target 1 lun 0 (pass1,da1)
<SEAGATE ST6000NM0285 EF04>        at scbus0 target 3 lun 0 (pass2,da2)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus1 target 0 lun 0 (pass3,ada0)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus2 target 0 lun 0 (pass4,ada1)
<SSD2SC120GE2DA08B-T 560ABBF0>     at scbus3 target 0 lun 0 (pass5,ada2)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus5 target 0 lun 0 (pass6,ada3)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus7 target 0 lun 0 (pass7,ada4)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus8 target 0 lun 0 (pass8,ada5)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus9 target 0 lun 0 (pass9,ses0)
<AVEX AX1611        CF 1.9C>       at scbus11 target 0 lun 0 (da3,pass10)
<AVEX AX1611        MS 1.9C>       at scbus11 target 0 lun 1 (da4,pass11)
<AVEX AX1611    MMC/SD 1.9C>       at scbus11 target 0 lun 2 (da5,pass12)
<AVEX AX1611        SM 1.9C>       at scbus11 target 0 lun 3 (da6,pass13)
<Kingston DataTraveler 2.0 1.00>   at scbus12 target 0 lun 0 (da7,pass14)


So it looks like my 3 Seagate drives do show up there.
I just don't understand why the pool still won't import.
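The gap between "camcontrol sees the disks" and "zpool cannot open them" usually sits at the partition/label level, so it may be worth checking whether the gptids from the zpool import output are actually visible (a sketch; device names taken from your devlist above):

```shell
# Map gptid labels back to partitions; if a gptid from 'zpool import'
# is missing here, ZFS cannot open that vdev member even though the
# raw disk is present:
glabel status
# Verify the three Seagate disks still carry their GPT partition tables:
gpart show da0 da1 da2
```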
 