Morning Stupidity

DAXQ

Contributor
Joined
Sep 5, 2014
Messages
106
Morning all - I have a FreeNAS build - older IBM
OS Version: FreeNAS-11.3-U1
Model: System x3650 M3 -[7945AC1]-
Memory: 24 GiB
6 * HGST Travelstar 7K1000 2.5-Inch 1TB 7200 RPM SATA III 32MB Cache Internal Hard Drive
Set up in a RAIDZ3 pool: 5 disks + 1 spare.
---
One of the disks in the pool started giving me errors (failed SMART tests, currently unreadable sectors, etc.), so I decided to replace it. Since my 6th drive was already in the server and online (not in use), I tried replacing the failing disk: I went to Storage > Pools > Gear > Status, clicked the three vertical dots, and selected Replace. In the dialog I selected my 6th disk from the pull-down and clicked Replace.
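(For reference, that GUI Replace action corresponds roughly to a single `zpool replace` at the CLI. A dry-run sketch, with the commands echoed rather than executed since they modify a live pool; the gptid values are the ones from my status output below and would differ on any other system:)

```shell
#!/bin/sh
# Dry-run sketch of the CLI equivalent of the GUI "Replace" action.
# Commands are echoed, not run; drop the "echo" prefix to execute.
POOL="z3_1TDrives"
FAILING="gptid/806a11e8-62f6-11ea-9204-5cf3fcba544c"   # drive throwing SMART errors
SPARE="gptid/10d452b4-43ba-11eb-8926-5cf3fcba544c"     # unused 6th drive

# Start the replacement; ZFS resilvers data onto the new device.
echo zpool replace "$POOL" "$FAILING" "$SPARE"

# Then watch progress with:
echo zpool status "$POOL"
```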

System started Replacing:
Screenshot 2020-12-24 at 10.58.28 AM - Display 2.png

That was 7 days ago.

zpool status -x
  pool: z3_1TDrives
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 24 08:08:36 2020
        2.02T scanned at 669M/s, 243G issued at 78.5M/s, 2.70T total
        47.4G resilvered, 8.77% done, 0 days 09:08:55 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        z3_1TDrives                                       ONLINE       0     0     0
          raidz3-0                                        ONLINE       0     0     0
            gptid/804a0592-62f6-11ea-9204-5cf3fcba544c    ONLINE       0     0     0
            replacing-1                                   ONLINE       0     0    12
              gptid/806a11e8-62f6-11ea-9204-5cf3fcba544c  ONLINE       0   330     0
              gptid/10d452b4-43ba-11eb-8926-5cf3fcba544c  ONLINE       0     0     0
            gptid/81f193c9-62f6-11ea-9204-5cf3fcba544c    ONLINE       0     0     0
            gptid/818034fa-62f6-11ea-9204-5cf3fcba544c    ONLINE       0     0     0
            gptid/8267798a-62f6-11ea-9204-5cf3fcba544c    ONLINE       0     0     0

errors: No known data errors

It just keeps looping: it gets to differing amounts done, and the gigabytes resilvered climb and then drop back down, but the resilver never finishes.
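(One way to tell a resilver that keeps restarting apart from one that is merely slow is to log the percent-done figure over time. A minimal sketch that parses it out of `zpool status` output; the sample line here is the one from the status above:)

```shell
#!/bin/sh
# Extract the "% done" figure from a zpool status scan line, so that
# successive samples can be compared to spot a resilver that restarts.
percent_done() {
    # Reads status text on stdin, prints e.g. "8.77%".
    grep -o '[0-9.]*% done' | awk '{print $1}'
}

# Sample line taken from the status output in this thread:
echo "47.4G resilvered, 8.77% done, 0 days 09:08:55 to go" | percent_done
# On a live system: zpool status z3_1TDrives | percent_done
```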

I have a second pool of smaller drives in the same server, so I am currently copying my data to that pool, expecting I'm going to have to destroy this pool and rebuild it without the bad drive (if that's my only option). I was wondering if anyone has an idea what may have happened, whether I did something incorrectly, or whether there is a better way to resolve this without having to completely rebuild.

Thanks in advance for any advice.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That was 7 days ago.
That is no good... I just started a rebuild this morning.

1608834310745.png


It should have been done in a day. Even on the 12TB drives at work, they finish in 12 hours.

Since you are using laptop drives, I could see it being a little slower, but there is something wrong here.

I would suggest checking the drive health of all the other drives in the pool.
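That check can be scripted with smartctl. A hedged sketch, shown dry-run with the commands echoed; the ada0..ada5 device names are an assumption and will differ depending on the controller:

```shell
#!/bin/sh
# Dry-run sketch: print a SMART health check command for each pool
# member. Device names ada0..ada5 are assumptions for this system;
# remove the "echo" prefix to actually run the checks.
for disk in ada0 ada1 ada2 ada3 ada4 ada5; do
    echo smartctl -H "/dev/$disk"
done
```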

For future reference, the rebuild completes faster if you remove the defective drive and then do the replace. With the old drive online, the resilver runs much slower. I found that out through empirical testing. I can't count the number of times I have replaced drives over the years. I used to build my servers with used drives and I was having failures constantly. Sometimes two or three in the same month. I even had two on the same day one time.
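That offline-first workflow looks roughly like this from the CLI (a dry-run sketch with the commands echoed, not executed; the gptid values are the ones from this thread, and the physical pull happens between the two commands):

```shell
#!/bin/sh
# Dry-run sketch of "remove the defective drive, then replace":
# offline the failing member, pull it physically, then replace it with
# the spare so the resilver reads only from the healthy disks.
POOL="z3_1TDrives"
FAILING="gptid/806a11e8-62f6-11ea-9204-5cf3fcba544c"
SPARE="gptid/10d452b4-43ba-11eb-8926-5cf3fcba544c"

echo zpool offline "$POOL" "$FAILING"
# (physically remove the failing drive here)
echo zpool replace "$POOL" "$FAILING" "$SPARE"
```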
 

DAXQ

Contributor
Joined
Sep 5, 2014
Messages
106
I kinda thought I should have removed the old (broken) drive first, but did not. Is there any way to kill the replace and start over?
 

DAXQ

Contributor
Joined
Sep 5, 2014
Messages
106
Well, I think I have stopped it (dunno fer sure). Gonna let my data copy continue before messing with it much further, but I ended up offlining both disks in the replace/resilver with:

zpool offline z3_1TDrives gptid/806a11e8-62f6-11ea-9204-5cf3fcba544c
zpool offline z3_1TDrives gptid/10d452b4-43ba-11eb-8926-5cf3fcba544c

Both commands completed without errors.

Then I detached the new (newer) drive with:
zpool detach z3_1TDrives gptid/10d452b4-43ba-11eb-8926-5cf3fcba544c

To pull it out of the pool - again completed without error.

When my copy finishes (moving my data from the z3_1TDrives pool to the z3_600GDrives pool), I will reboot the server, pull the failing drive, then run the replace with the non-failing drive by itself in the server and see how it goes.

The status now is:
zpool status z3_1TDrives
  pool: z3_1TDrives
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 24 11:49:23 2020
        2.70T scanned at 1.69G/s, 1.21T issued at 776M/s, 2.70T total
        0 resilvered, 44.87% done, 0 days 00:33:33 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        z3_1TDrives                                     DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/804a0592-62f6-11ea-9204-5cf3fcba544c  ONLINE       0     0     0
            16702323888863933501                        OFFLINE      0   330     0  was /dev/gptid/806a11e8-62f6-11ea-9204-5cf3fcba544c
            gptid/81f193c9-62f6-11ea-9204-5cf3fcba544c  ONLINE       0     0     0
            gptid/818034fa-62f6-11ea-9204-5cf3fcba544c  ONLINE       0     0     0
            gptid/8267798a-62f6-11ea-9204-5cf3fcba544c  ONLINE       0     0     0

errors: No known data errors
root@daxqFN[..._1TDrives/datastore/ill_data/userdata]#
The resilver seems to be continuing to increase its percent done, but the replace is gone. Won't know for sure how it all turns out until Monday.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
You don't have a lot of risk of losing the pool, since it is RAIDZ3; you could just remove the defective drive. Offline it from the GUI, then, if the system supports hot-swap, disconnect it from the system and take it right out. I replaced the drive in my system without a shutdown or reboot, snatched the bad drive right out, so to speak. Some care must be taken.
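A sketch of confirming the OS noticed the swap afterwards (dry-run, commands echoed; these are standard FreeBSD tools rather than anything FreeNAS-specific):

```shell
#!/bin/sh
# Dry-run sketch: commands to confirm a hot-swap was detected.
# Remove the "echo" prefixes to run them for real.
echo camcontrol devlist             # list the disks the kernel currently sees
echo tail -n 20 /var/log/messages   # recent kernel detach/attach messages
```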
 

DAXQ

Contributor
Joined
Sep 5, 2014
Messages
106
I ended up moving everything to the smaller drives (my other Z3 pool has 8 x 600GB IBM-ESXS MBF2600RC drives). I stupidly thought that since I had no spares for this pool I should use the 1T drives, when in reality, if I needed to, I could use a 1T as a spare for this pool of better drives. Just not thinking, I think! Everything has washed out, and I believe I'm better off for it in the end: more space on better drives, and a few 1TB drives that can be used as spares in a pinch. Thanks for all the help and information.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
... I'd always double- and triple-check the tolerance of your PSU, cabling, etc. before hot-swapping drives. That may or may not end badly, depending on the backplane design and so on. Test with an empty pool undergoing bad-blocks testing to make sure the margin is there.
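A destructive bad-blocks pass on a candidate drive might look like this (dry-run sketch, command echoed; `/dev/ada6` is a placeholder for the disk under test, and the write-mode test destroys all data on it):

```shell
#!/bin/sh
# Dry-run sketch of a destructive bad-blocks burn-in on a spare disk.
# -w writes and verifies test patterns (destroys all data); -s shows
# progress. /dev/ada6 is a placeholder; never point this at a pool member.
DISK="/dev/ada6"
echo badblocks -ws "$DISK"
```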
 