RedClick7000
Cadet
- Joined
- Apr 1, 2021
- Messages
- 1
This issue may be very similar to https://www.truenas.com/community/threads/how-to-put-same-offline-disk-back.10475/ but unfortunately that thread has no final outcome.
Hi everyone,
Owing to my lack of knowledge and experience with ZFS, I find myself in serious trouble with a FreeNAS system after trying to expand an existing pool, and I risk losing almost all of the data.
I'm aware that this wouldn't be an issue at all if there were a full backup of all data. Unfortunately that has not happened yet, due to a tight budget. And yes, that is far too risky. So I find myself desperately trying to recover the mentioned pool - nightmare scenario!
Context:
In order to expand an existing RAID-Z1 pool named "STORE_01" from 3x4TB to 5x4TB, the following steps were taken:
- Offloaded all data to external drives.
- Destroyed the old pool.
- Shut the entire system down and installed 2 new 4TB drives, let's name them HDD4 and HDD5, in addition to the existing 3 drives (HDD1-2-3).
- Booted and created a new pool with a total of 5x4TB in RAID-Z1 configuration (HDD1,2,3,4,5).
- Copied the data back into the new pool.
- Backed up the config.
- Manually updated the FreeNAS OS to 11.3 U5.
- Backed up the config again.
- System status in the GUI said the pool was healthy.
- Erased the external drives the data had been offloaded to, as this storage was needed elsewhere.
HDD5 (Z3032RS7), used in the newly created pool, is a drive temporarily borrowed from another system while a new disk was on its way (let's call this factory-new disk HDD6). When HDD6 (ZGY8LNPK) arrived, my intention was to swap the two disks so that I could return the borrowed disk (HDD5) to its original system.
Therefore I:
- Shut down the entire system.
- Pulled the borrowed drive (HDD5) and swapped it with the newly arrived one (HDD6), using the same drive slot. HDD5 has been left untouched since.
- Booted the system; the pool status showed DEGRADED (HDD5 disk not present, or similar).
- Used the "Replace" command in the GUI to integrate the newly inserted disk (HDD6) into the pool.
- The resilver process on HDD6 started but never finished. (It starts over and over again...)
- It turns out that HDD1 (ZGY02CYQ) is in a critical state with severe disk read errors.
The SMART test reports 14% health and the shell is screaming non-stop:
"ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
RES: 51 40 00 ef 76 03 03 00 00 00 00
Error 5, Retries exhausted"
That's why the resilver onto the newly added HDD6 can't finish.
This leaves the system with 2 failed drives (HDD1, HDD6) out of 5 in a RAID-Z1 configuration, which can tolerate only 1 failed drive at a time.
- At this stage I backed up all data that was still accessible, but a significant part is missing, of course.
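To make the redundancy reasoning above concrete, here is a minimal sketch of the arithmetic. It assumes the usual RAID-Z rule that a vdev with parity level p stays readable only while at most p members are unavailable. The function names are made up for illustration, and the capacity figure ignores ZFS metadata and padding overhead.

```python
def raidz_survives(parity: int, unavailable: int) -> bool:
    """A RAID-Z vdev stays readable while unavailable members <= parity."""
    return unavailable <= parity


def raidz_raw_data_capacity_tb(disks: int, parity: int, disk_tb: float) -> float:
    """Rough data capacity: (disks - parity) * disk size, before ZFS overhead."""
    return (disks - parity) * disk_tb


# Situation from this post: 5x4TB RAID-Z1.
print(raidz_survives(parity=1, unavailable=1))  # only HDD5 pulled
print(raidz_survives(parity=1, unavailable=2))  # HDD1 failing and HDD6 not resilvered
print(raidz_raw_data_capacity_tb(disks=5, parity=1, disk_tb=4.0))
```

In practice HDD1 is still partly readable, so the loss is per block rather than all-or-nothing, which would be consistent with only part of the data being unrecoverable.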
The idea was to put the borrowed 5th disk (HDD5) back in order to recover the pool. As HDD5 had not been touched since it was pulled, I expected that reinserting it would give me HDD2-3-4-5 working, which would allow me to replace the badly damaged HDD1.
So I:
- Shut down the system.
- Put back the borrowed 5th disk, HDD5 (which had been left untouched in the meantime).
- Rebooted, hoping the pool would recognize the reinserted disk.
The "Online" command on HDD5 via the GUI fails.
(The SMART test of HDD5 reports healthy, though.)
Further observations
- When I boot with HDD1-2-3-4-5 --> Pool "STORE_01" shows up as DEGRADED, HDD5 = FAULTED, HDD1-2-3-4 = ONLINE ...after a few minutes HDD2-3-4 show DEGRADED, but HDD1 always remains ONLINE.
- When I boot with HDD1-2-3-4-6 --> Pool "STORE_01" shows up as HEALTHY, HDD1-2-3-4-6 = ONLINE ...after a few minutes HDD1-2-3-4-6 show DEGRADED.
- I inspected HDD5 on a workstation with UFS Explorer Professional, and the drive still seems to contain a valid RAID-Z partition with data on it (I can extract random files from it).
- After 3 days of going back and forth trying to find a solution, I stopped and shut down the system to avoid any further damage, and to get help...
- Is there any way to get a replaced disk (HDD5) back into a pool in this situation?
(Pool export/import? Metadata manipulation? Advanced shell commands to reattach the disk? ...anything)
- I'm thinking about finding a way to borrow 4 more disks in order to clone every disk in the system before continuing with any further manipulations. This would allow more freedom for more aggressive experiments. But is it even worth it at this stage?
- In hindsight, this wouldn't be an issue if a proper backup had been made from the beginning.
- I should have scrubbed the whole pool right after creating it, before proceeding with any other operation.
- I should have performed a proper disk replacement procedure instead of just pulling disks out.
- It took me some time to figure out that the real issue wasn't the swap of the 5th disk but the 1st disk with its read errors.
- I read another thread (https://www.truenas.com/community/threads/how-to-put-same-offline-disk-back.10475/) which looks very similar, but it has no final outcome of the recovery attempt.
- I'm only just learning all of this and have very little experience with ZFS systems.
- There's still 1 free slot, so I could physically attach the new HDD6 (ZGY8LNPK) and the borrowed HDD5 (Z3032RS7) at the same time, if this is of any importance.
Hardware information
CPU: Intel(R) Pentium(R) CPU G4560 @ 3.50GHz
Architecture: amd64
Mainboard: ASROCK H270 Performance
Memory: DDR4 16GB ECC
PCI-E SATA-Port Expansion card
Pool-1: RAID-Z1 5x4TB Seagate ST4000 NAS
Pool-2: Stripe 2x2TB
OS-Image: USB-Flashdrive
Version: FreeNAS-11.3-U5
Disk configuration of concerned pool "STORE_01"
POOL "STORE_01"
Disk-Nicknames / ada# / state / Serial Number (comment)
HDD1 / ada0 / read errors / ZGY02CYQ (damaged)
HDD2 / ada1 / healthy / ZDH0Y71R (untouched, ok)
HDD3 / ada2 / healthy / ZGY1JCZC (untouched, ok)
HDD4 / ada3 / healthy / Z3032RDV (untouched, ok)
HDD5 / ada4 / unknown / Z3032RS7 (borrowed spare disk)
HDD6 / ZGY8LNPK (newly received disk, currently removed after failed resilver)
POOL "STORE_02"
HDD7 / ada5 / (pool 2 "STORE_02")
HDD8 / ada6 / (pool 2 "STORE_02")
POOL "STORE_da0"
HDD-USB1 / da0 / (temporary usb attached backup drive)
OS Partition:
FreeNAS-BOOT / da1 / (USB Flashdrive with Boot Image)
Permanent system outputs
"ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
RES: 51 40 00 ef 76 03 03 00 00 00 00
Error 5, Retries exhausted"
...
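For anyone reading the error lines above, this is a small sketch of how the two hex bytes break down into the flag names the kernel prints. The bit tables are my understanding of the ATA status and error register layout as FreeBSD labels them, so treat them as an assumption rather than a reference; `decode` is a made-up helper name.

```python
# Assumed bit names for the ATA status and error registers (FreeBSD-style labels).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI"}


def decode(value: int, names: dict) -> list:
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True) if value & bit]


# Values taken from the log line: "ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )"
print(decode(0x51, STATUS_BITS))
print(decode(0x40, ERROR_BITS))
```

UNC stands for an uncorrectable data error, meaning the drive could not read the requested sector, which fits the SMART result for HDD1.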
Command outputs
# zpool status
Code:
# zpool status
  pool: STORE_01
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 2 00:23:45 2021
        344G scanned at 3.18G/s, 447M issued at 4.14M/s, 7.36T total
        0 resilvered, 0.01% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        STORE_01                                        DEGRADED     0     0 1.37K
          raidz1-0                                      DEGRADED     0     0 5.50K
            gptid/84df4518-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/85a5e45a-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/86845e6e-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/874f7135-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            8684723719616545912                         FAULTED      0     0     0  was /dev/ada4p2

errors: 445 data errors, use '-v' for a list

  pool: STORE_02
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        STORE_02                                      ONLINE       0     0     0
          gptid/66745f2f-87fd-11eb-b74e-6805ca64a504  ONLINE       0     0     0
          gptid/67290b6a-87fd-11eb-b74e-6805ca64a504  ONLINE       0     0     0

errors: No known data errors

  pool: STORE_da0
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        STORE_da0                                     ONLINE       0     0     0
          gptid/e153c247-8ffc-11eb-b996-6805ca64a504  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:01:46 with 0 errors on Fri Mar 26 03:46:46 2021
config:
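As a reading aid for the listing above, here is a rough sketch that pulls the device rows out of the STORE_01 config table and lists everything that is not ONLINE. The sample text is copied from the output above; the regular expression and the `parse_config` helper are my own assumptions about the column layout, not an official parser.

```python
import re

SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        STORE_01                                        DEGRADED     0     0 1.37K
          raidz1-0                                      DEGRADED     0     0 5.50K
            gptid/84df4518-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/85a5e45a-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/86845e6e-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            gptid/874f7135-85ae-11eb-847a-6805ca64a504  ONLINE       0     0     0
            8684723719616545912                         FAULTED      0     0     0  was /dev/ada4p2
"""

# name, state, read, write, cksum, then an optional trailing note
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(.*))?$")


def parse_config(text):
    """Return one dict per device row, skipping the header line."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match or match.group(1) == "NAME":
            continue
        name, state, _read, _write, cksum, note = match.groups()
        rows.append({"name": name, "state": state, "cksum": cksum, "note": note or ""})
    return rows


rows = parse_config(SAMPLE)
for row in rows:
    if row["state"] != "ONLINE":
        print(row["name"], row["state"], row["cksum"], row["note"])
```

The only member flagged is the numeric GUID that "was /dev/ada4p2", i.e. the slot where HDD5 and HDD6 were swapped. The checksum errors are counted at the pool and raidz1-0 level, not against any individual disk.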
gpart status
Code:
# gpart status
  Name  Status  Components
ada0p1      OK  ada0
ada0p2      OK  ada0
ada1p1      OK  ada1
ada1p2      OK  ada1
ada2p1      OK  ada2
ada2p2      OK  ada2
ada3p1      OK  ada3
ada3p2      OK  ada3
ada4p1      OK  ada4
ada4p2      OK  ada4
ada5p1      OK  ada5
ada5p2      OK  ada5
ada6p1      OK  ada6
ada6p2      OK  ada6
 da0p1      OK  da0
 da0p2      OK  da0
 da1p1      OK  da1
 da1p2      OK  da1
camcontrol devlist
Code:
<ST4000VN008-2DR166 SC60> at scbus0 target 0 lun 0 (pass0,ada0)
<ST4000VN008-2DR166 SC60> at scbus1 target 0 lun 0 (pass1,ada1)
<ST4000VN008-2DR166 SC60> at scbus2 target 0 lun 0 (pass2,ada2)
<ST4000VN000-1H4168 SC46> at scbus3 target 0 lun 0 (pass3,ada3)
<ST4000VN000-1H4168 SC46> at scbus5 target 0 lun 0 (pass4,ada4)
<AHCI SGPIO Enclosure 2.00 0001> at scbus6 target 0 lun 0 (pass5,ses0)
<WDC WD20EARS-00MVWB0 51.0AB51> at scbus7 target 0 lun 0 (pass6,ada5)
<ST2000DM001-1CH164 CC24> at scbus9 target 0 lun 0 (pass7,ada6)
<ADATA HD710 PRO 9203> at scbus12 target 0 lun 0 (pass8,da0)
<SanDisk Ultra Fit 1.00> at scbus13 target 0 lun 0 (pass9,da1)
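To cross-check which physical drive sits behind which adaN name, the device list above can be turned into a lookup table. This is only a sketch: the sample is copied from the listing above, the pattern is my assumption about the line format, and `parse_devlist` is a hypothetical helper. It treats the last word inside the angle brackets as the firmware revision.

```python
import re

DEVLIST = """\
<ST4000VN008-2DR166 SC60> at scbus0 target 0 lun 0 (pass0,ada0)
<ST4000VN008-2DR166 SC60> at scbus1 target 0 lun 0 (pass1,ada1)
<ST4000VN008-2DR166 SC60> at scbus2 target 0 lun 0 (pass2,ada2)
<ST4000VN000-1H4168 SC46> at scbus3 target 0 lun 0 (pass3,ada3)
<ST4000VN000-1H4168 SC46> at scbus5 target 0 lun 0 (pass4,ada4)
<AHCI SGPIO Enclosure 2.00 0001> at scbus6 target 0 lun 0 (pass5,ses0)
<WDC WD20EARS-00MVWB0 51.0AB51> at scbus7 target 0 lun 0 (pass6,ada5)
<ST2000DM001-1CH164 CC24> at scbus9 target 0 lun 0 (pass7,ada6)
<ADATA HD710 PRO 9203> at scbus12 target 0 lun 0 (pass8,da0)
<SanDisk Ultra Fit 1.00> at scbus13 target 0 lun 0 (pass9,da1)
"""

LINE = re.compile(r"^<(.+?)>\s+at scbus(\d+) target (\d+) lun (\d+) \((.+)\)$")


def parse_devlist(text):
    """Map device name (ada0, da1, ...) to a (model, firmware) tuple."""
    devices = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        # Split "model firmware" on the last space inside the angle brackets.
        model, _, firmware = match.group(1).strip().rpartition(" ")
        names = match.group(5).split(",")
        # Prefer the disk/enclosure name over the passN passthrough name.
        device = next((n for n in names if not n.startswith("pass")), names[0])
        devices[device] = (model.strip(), firmware)
    return devices


devices = parse_devlist(DEVLIST)
for name in sorted(devices):
    print(name, *devices[name])
```

In this listing ada3 and ada4 report the older ST4000VN000 model, which is consistent with the disk table where HDD4 and HDD5 carry the Z3032... serial numbers.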