Reattach previously replaced disk to a pool

RedClick7000 (Cadet, joined Apr 1, 2021, 1 message)
This issue may be very similar to https://www.truenas.com/community/threads/how-to-put-same-offline-disk-back.10475/ but unfortunately that thread has no final outcome.

Hi everyone,
Due to my lack of knowledge and experience with ZFS, I find myself in serious trouble with a FreeNAS system: after expanding an existing pool, I risk losing almost all of the data.
I'm aware that this wouldn't be an issue at all if there were a full backup of all data. Unfortunately, that has not happened yet due to a tight budget. And yes, this is way too risky. So I find myself desperately trying to recover the pool. Nightmare scenario!

Context:
In order to expand an existing RAID-Z1 pool named "STORE_01" from 3x4TB to 5x4TB, the following steps were taken:
  • Offloaded all data to external drives.
  • Destroyed the old pool.
  • Shut the entire system down and installed 2 new 4TB drives (let's name them HDD4 and HDD5) in addition to the existing 3 drives (HDD1-2-3).
  • Booted and created a new pool with a total of 5x4TB in RAID-Z1 configuration (HDD1,2,3,4,5).
  • Copied the data back into the new pool.
  • Backed up the config.
  • Manually updated the FreeNAS OS to 11.3-U5.
  • Backed up the config again.
  • System status in the GUI said the pool was healthy.
  • Erased the external drives the data had been offloaded to, as this storage was needed elsewhere.
Problem
HDD5 (Z3032RS7), used in the newly created pool, is a drive temporarily borrowed from another system while a new disk was on its way (let's call this factory-new disk HDD6). When HDD6 (ZGY8LNPK) arrived, my intention was to swap the two disks in order to return the borrowed disk (HDD5) to its original system.

Therefore I:
  • Shut down the entire system.
  • Pulled the borrowed drive (HDD5) and swapped it with the newly arrived one (HDD6), using the same drive slot. HDD5 has been left untouched since.
  • Booted the system; the pool status showed degraded (HDD5 disk not present, or similar).
  • Used the "Replace" command in the GUI to integrate the newly inserted disk (HDD6) into the pool.
  • The resilver process on HDD6 started but never finished. (It starts over and over again...)
  • It turns out that HDD1 (ZGY02CYQ) is in a critical state with severe disk read errors.
    The SMART test reports 14% health and the shell is screaming non-stop:
    "ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    RES: 51 40 00 ef 76 03 03 00 00 00 00
    Error 5, Retries exhausted"
    That's why the resilver onto the newly added HDD6 can't finish.
    So this leaves the system with 2 failed drives (HDD1, HDD6) out of 5 in a RAID-Z1 configuration, which can tolerate only 1 failed drive at a time.
Nevertheless, the pool is still available and shows "only" degraded. A lot of data is still accessible.
  • At this stage I backed up all available data. But a significant part is missing, of course.
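The diagnosis above can be repeated from the shell. The following is only a minimal sketch under stated assumptions: it assumes HDD1 is still enumerated as `ada0` (as in the disk table further down) and that `smartctl` is installed. The `run` helper and the `diagnose` function are hypothetical names introduced here; `run` only prints each command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

POOL="STORE_01"
DISK="ada0"   # assumption: HDD1 (ZGY02CYQ) is still enumerated as ada0

diagnose() {
    run smartctl -a "/dev/$DISK"         # full SMART report for the suspect disk
    run smartctl -l error "/dev/$DISK"   # the drive's own ATA error log
    run zpool status -v "$POOL"          # -v lists the files hit by data errors
}

diagnose
```

Run as-is, it only lists the three commands; with `DRY_RUN=0` it executes them, all of which are read-only.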
Attempts
The idea was to put back the borrowed 5th disk (HDD5) in order to recover the pool. As HDD5 has not been touched since it was pulled, I expected that reinserting it would give me HDD2-3-4-5 working, which would allow me to replace the badly damaged HDD1.
So I:
  • Shut down the system.
  • Put back the borrowed 5th disk HDD5 (which had been left untouched in the meantime).
  • Rebooted in the hope that the pool would recognize the replaced disk.
It seems that the disk cannot be reattached to the pool. (The pool stays degraded and HDD5 shows FAULTED.)
The "Online" command on HDD5 via the GUI fails.

(The SMART test of HDD5 reports healthy, though.)
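The GUI "Online" action presumably corresponds to `zpool online`. Here is a minimal sketch of the shell equivalent, together with a look at the ZFS labels still present on the disk. The numeric id and the partition name are taken from the `zpool status` output in this post; that HDD5 comes up as `ada4` again is an assumption. `run` and `reattach_hdd5` are hypothetical names, and whether ZFS accepts the old member again after a replace was started is exactly the open question.

```shell
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

POOL="STORE_01"
GUID="8684723719616545912"   # id shown for the FAULTED member in zpool status
PART="/dev/ada4p2"           # assumption: HDD5 is enumerated as ada4 again

reattach_hdd5() {
    run zdb -l "$PART"                 # dump the ZFS labels on the partition
    run zpool online "$POOL" "$GUID"   # ask ZFS to bring the member back online
    run zpool status -v "$POOL"        # check the resulting pool state
}

reattach_hdd5
```

The `zdb -l` output is worth comparing against a healthy member first (pool guid and txg), since it does not modify anything.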


Further observations
  • When I boot with HDD1-2-3-4-5 --> Pool "STORE_01" shows up as DEGRADED, HDD5=FAULTED, HDD1-2-3-4=ONLINE ...after a few minutes HDD2-3-4 show DEGRADED, but HDD1 always remains ONLINE.
  • When I boot with HDD1-2-3-4-6 --> Pool "STORE_01" shows up as HEALTHY, HDD1-2-3-4-6=ONLINE ...after a few minutes HDD1-2-3-4-6 show DEGRADED.
  • I inspected HDD5 on a workstation with UFS Explorer Professional, and the drive still seems to contain a valid RAID-Z partition with data on it (I can extract random files from it).
  • After 3 days of going back and forth trying to find a solution, I stopped and shut down the system to avoid any further damage, and to get help...
Question
  • Is there any way to get a replaced disk back into a pool in this situation? (HDD5)
    (Pool Export/Import?; Metadata manipulation?; Advanced shell commands to reattach the disk?; ...anything)
  • I'm thinking about borrowing 4 more disks in order to clone every disk in the system before continuing with any further manipulations. This would allow more freedom for more aggressive experiments. But is this even worth it at this stage?
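The two ideas in the questions above (cloning the disks first, and a pool export/import) could look roughly like this from the shell. This is only a sketch under stated assumptions: `adaX` is a placeholder for a borrowed target disk, `/mnt` is assumed to be where FreeNAS mounts pools, and `run`, `clone_disk` and `import_readonly` are hypothetical names. A manual export/import also bypasses the FreeNAS GUI, which may matter.

```shell
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

POOL="STORE_01"

# clone_disk SRC DST: raw copy of a whole disk that continues past read errors.
# conv=noerror,sync keeps going after a failed read and pads that block with
# zeros, so offsets on the copy stay aligned with the source. A smaller bs
# loses less data per bad block but copies more slowly.
clone_disk() {
    run dd if="/dev/$1" of="/dev/$2" bs=65536 conv=noerror,sync
}

# import_readonly: re-import the pool so that nothing can be written to it.
import_readonly() {
    run zpool export "$POOL"
    run zpool import -o readonly=on -R /mnt "$POOL"   # assumption: altroot /mnt
}

clone_disk ada0 adaX   # adaX is a placeholder, not a real device name
import_readonly
```

With only one free slot, the clones would have to be made one disk at a time or on another machine. Double-checking the `of=` target before setting `DRY_RUN=0` is essential, since `dd` overwrites it without asking.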
Commentaries
  • This wouldn't be an issue if a proper backup had been made from the beginning.
  • Scrubbed the whole newly created pool before proceeding with any other operation.
  • If only I had performed a proper disk replacement procedure instead of just pulling disks out.
  • It took me some time at first to figure out that the real issue wasn't the swap of the 5th disk but the 1st disk with the read errors.
  • I read another thread (https://www.truenas.com/community/threads/how-to-put-same-offline-disk-back.10475/) which looks very similar, but it has no final outcome of the recovery attempt.
  • I'm only just learning all of this and have very little experience with ZFS systems.
  • There's still 1 free slot, so I could physically attach the new HDD6 (ZGY8LNPK) and the borrowed HDD5 (Z3032RS7) at the same time, if this is of any importance.

I would appreciate any help!



Hardware information
CPU: Intel(R) Pentium(R) CPU G4560 @ 3.50GHz
Architecture: amd64
Mainboard: ASROCK H270 Performance
Memory: DDR4 16GB ECC
PCI-E SATA-Port Expansion card
Pool-1: RAID-Z1 5x4TB Seagate ST4000 NAS
Pool-2: Stripe 2x2TB

OS-Image: USB-Flashdrive
Version: FreeNAS-11.3-U5

Disk configuration of concerned pool "STORE_01"
POOL "STORE_01"
Disk-Nicknames / ada# / state / Serial Number (comment)
HDD1 / ada0 / read errors / ZGY02CYQ (damaged)
HDD2 / ada1 / healthy / ZDH0Y71R (untouched, ok)
HDD3 / ada2 / healthy / ZGY1JCZC (untouched, ok)
HDD4 / ada3 / healthy / Z3032RDV (untouched, ok)
HDD5 / ada4 / unknown / Z3032RS7 (borrowed spare disk)

HDD6 / ZGY8LNPK (newly received disk, currently removed after failed resilver)

POOL "STORE_02"
HDD7 / ada5 / (pool 2 "STORE_02")
HDD8 / ada6 / (pool 2 "STORE_02")

POOL "STORE_da0"
HDD-USB1 / da0 / (temporary usb attached backup drive)

OS Partition:
FreeNAS-BOOT / da1 / (USB Flashdrive with Boot Image)

Permanent system outputs
"ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
RES: 51 40 00 ef 76 03 03 00 00 00 00
Error 5, Retries exhausted"
...

Command outputs
# zpool status
Code:
# zpool status
  pool: STORE_01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr  2 00:23:45 2021
        344G scanned at 3.18G/s, 447M issued at 4.14M/s, 7.36T total
        0 resilvered, 0.01% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        STORE_01                                        DEGRADED     0     0 1.37K
          raidz1-0                                      DEGRADED     0     0 5.50K
            gptid/84df4518-85ae-11eb-847a-6805ca64a504  ONLINE       0     0 0
            gptid/85a5e45a-85ae-11eb-847a-6805ca64a504  ONLINE       0     0 0
            gptid/86845e6e-85ae-11eb-847a-6805ca64a504  ONLINE       0     0 0
            gptid/874f7135-85ae-11eb-847a-6805ca64a504  ONLINE       0     0 0
            8684723719616545912                         FAULTED      0     0 0  was /dev/ada4p2

errors: 445 data errors, use '-v' for a list

  pool: STORE_02
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        STORE_02                                      ONLINE       0     0     0
          gptid/66745f2f-87fd-11eb-b74e-6805ca64a504  ONLINE       0     0     0
          gptid/67290b6a-87fd-11eb-b74e-6805ca64a504  ONLINE       0     0     0

errors: No known data errors

  pool: STORE_da0
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        STORE_da0                                     ONLINE       0     0     0
          gptid/e153c247-8ffc-11eb-b996-6805ca64a504  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:46 with 0 errors on Fri Mar 26 03:46:46 2021
config:
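
To track which members change state between boots (as in the observations above), the `zpool status` text can be reduced to the entries that are not ONLINE. This is a minimal sketch; `not_online` is a hypothetical helper name, and it relies only on the column layout visible in the output above. It reads from stdin, so it works on saved output as well.

```shell
#!/bin/sh
# Print "NAME STATE" for every pool, vdev, or device that is not ONLINE.
# Usage on a live system: zpool status | not_online
not_online() {
    awk '
        # The NAME/STATE header row starts a device table.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        # The "errors:" line ends it.
        $1 == "errors:" { in_cfg = 0 }
        # Table rows have at least NAME STATE READ WRITE CKSUM.
        in_cfg && NF >= 5 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Demo on a few lines copied from the output in this post.
not_online <<'EOF'
  pool: STORE_01
 state: DEGRADED
config:

        NAME                                            STATE     READ WRITE CKSUM
        STORE_01                                        DEGRADED     0     0 1.37K
          raidz1-0                                      DEGRADED     0     0 5.50K
            gptid/84df4518-85ae-11eb-847a-6805ca64a504  ONLINE       0     0 0
            8684723719616545912                         FAULTED      0     0 0  was /dev/ada4p2

errors: 445 data errors, use '-v' for a list
EOF
```

For the sample lines, this prints the pool, the raidz1-0 vdev, and the FAULTED member with their states, and skips the ONLINE disk.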


gpart status
Code:
# gpart status
  Name  Status  Components
ada0p1      OK  ada0
ada0p2      OK  ada0
ada1p1      OK  ada1
ada1p2      OK  ada1
ada2p1      OK  ada2
ada2p2      OK  ada2
ada3p1      OK  ada3
ada3p2      OK  ada3
ada4p1      OK  ada4
ada4p2      OK  ada4
ada5p1      OK  ada5
ada5p2      OK  ada5
ada6p1      OK  ada6
ada6p2      OK  ada6
  da0p1      OK  da0
  da0p2      OK  da0
  da1p1      OK  da1
  da1p2      OK  da1


camcontrol devlist
Code:
<ST4000VN008-2DR166 SC60>          at scbus0 target 0 lun 0 (pass0,ada0)
<ST4000VN008-2DR166 SC60>          at scbus1 target 0 lun 0 (pass1,ada1)
<ST4000VN008-2DR166 SC60>          at scbus2 target 0 lun 0 (pass2,ada2)
<ST4000VN000-1H4168 SC46>          at scbus3 target 0 lun 0 (pass3,ada3)
<ST4000VN000-1H4168 SC46>          at scbus5 target 0 lun 0 (pass4,ada4)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (pass5,ses0)
<WDC WD20EARS-00MVWB0 51.0AB51>    at scbus7 target 0 lun 0 (pass6,ada5)
<ST2000DM001-1CH164 CC24>          at scbus9 target 0 lun 0 (pass7,ada6)
<ADATA HD710 PRO 9203>             at scbus12 target 0 lun 0 (pass8,da0)
<SanDisk Ultra Fit 1.00>           at scbus13 target 0 lun 0 (pass9,da1)
 