SOLVED RAID-Z2 replace disk -- resilvering continuously resetting

jmm20 · Jul 17, 2020

Hello,

FreeNAS setup:

FreeNAS-11.3-U1

Intel Xeon E5-2620 v4
32 GB DDR4 ECC Registered
Supermicro X10SRL-F
Supermicro AOC-3008-8le (HBA card)
Supermicro AOC-SGP-I4 (Ethernet card)
Intel X520-DA2 (10 GbE card)
Chassis Supermicro CSE-846
Backplane BPN-SAS2-846EL1

It has two pools:

POOL A: RAID-Z2 10 x Seagate Ironwolf 6TB
POOL B: RAID-Z2 10 x WD RED 3TB

Last week, two disks in pool A started reporting SMART errors, so I ordered new disks to replace the failing ones.

The disks arrived this week. When I went to replace them, I saw that neither of the two failing disks was being detected by the pool. I marked one of them as OFFLINE and shut down the system to install the new disk. When I powered the system back on, the failing disk that I had not marked as offline was detected by the pool again. So I think that disk is failing (bad sectors and SMART errors) and is only detected by the pool intermittently.

To fix this as soon as possible, I selected the new disk as the replacement for the old disk I had removed from the system, and the replacement process started. That was 2.5 to 3 days ago and the resilver has still not finished. I don't understand how it is possible that when the resilver reaches about 30-40% it restarts from 1% and carries on. I don't know how to solve this or whether I did something wrong.

What should I do? Replace the two failing disks at the same time?

zpool status


  pool: RAID_Z2_6
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jul 17 21:28:56 2020
    4.63T scanned at 1.33G/s, 2.14T issued at 631M/s, 21.5T total
    209G resilvered, 9.95% done, 0 days 08:56:13 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    RAID_Z2_6                                         DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/9fb9450e-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        replacing-1                                   DEGRADED     0     0     0
          3238665428308762357                         OFFLINE      0     0     0  was /dev/gptid/a0e27e05-f423-11e8-9f4f-001b21c02750
          gptid/32cd0f08-c680-11ea-ad45-002590ea28fc  ONLINE       0     0     0
        gptid/a1e91575-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a2f9e145-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a41b38b7-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a537c0f5-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a67c0ef7-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a8366c8f-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/a968311b-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        gptid/aa8328c9-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0

errors: No known data errors

Thanks.
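
To tell a resilver that is restarting apart from one that is merely slow, it can help to log the scan section of `zpool status` at intervals and compare snapshots. The following is a minimal sketch, not a FreeNAS tool: the function names are made up for illustration, the parser only understands the "issued at" scan format shown in the output above, and it assumes that a restart shows up either as a new "in progress since" time or as the percentage going backwards.

```python
import re
from datetime import datetime

# zpool prints sizes with binary-unit suffixes (K, M, G, T, P).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def to_bytes(text):
    """Convert a zpool size such as '2.14T' or '631M' to a byte count."""
    match = re.fullmatch(r"([\d.]+)([KMGTP]?)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return float(number) * UNITS.get(unit, 1)


def parse_resilver(status_text):
    """Return start time, percent issued and seconds left, or None if no resilver is running."""
    started = re.search(r"resilver in progress since (.+)", status_text)
    issued = re.search(
        r"([\d.]+[KMGTP]?) issued at ([\d.]+[KMGTP]?)/s, ([\d.]+[KMGTP]?) total",
        status_text,
    )
    if started is None or issued is None:
        return None
    done, rate, total = (to_bytes(group) for group in issued.groups())
    return {
        "started": datetime.strptime(started.group(1).strip(), "%a %b %d %H:%M:%S %Y"),
        "percent": 100 * done / total,
        # Remaining bytes divided by the current issue rate.
        "eta_seconds": (total - done) / rate if rate else None,
    }


def resilver_restarted(previous, current):
    """Assumption: a restart means a new start time or progress that went backwards."""
    if previous is None or current is None:
        return False
    return current["started"] != previous["started"] or current["percent"] < previous["percent"]


# Scan lines copied from the zpool status output in this post.
SAMPLE = """\
  scan: resilver in progress since Fri Jul 17 21:28:56 2020
    4.63T scanned at 1.33G/s, 2.14T issued at 631M/s, 21.5T total
    209G resilvered, 9.95% done, 0 days 08:56:13 to go
"""

if __name__ == "__main__":
    info = parse_resilver(SAMPLE)
    print(f"{info['percent']:.2f}% issued, about {info['eta_seconds'] / 3600:.1f} h left")
```

The percentage is simply issued divided by total (2.14T of 21.5T), and the time left is the remaining data divided by the issue rate, which is how the figures in the status output relate to each other. In practice the text would come from running `zpool status RAID_Z2_6` every few minutes and passing consecutive results to `resilver_restarted`.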

Redcoat · Jul 17, 2020

Is it possible that the replacement disk(s) you have purchased are SMR drives?

See here https://www.ixsystems.com/community/threads/list-of-known-smr-drives.83993/ to check them out.

Yorick · Jul 17, 2020

not if they’re iron wolves, they’d need to have purchased wd reds 6tb efax as replacement - surely no one would do that in an Ironwolf pool, or would they?

*bites nails, on edge of seat*

jmm20 · Jul 18, 2020

Redcoat said:
Is it possible that the replacement disk(s) you have purchased are SMR drives?

See here https://www.ixsystems.com/community/threads/list-of-known-smr-drives.83993/ to check them out.

The old disks are Seagate IronWolf 6TB ST6000VN0033.

The new disks are Seagate IronWolf 6TB ST6000VN0001.

I looked for information about CMR and SMR drives and did not find anything saying that Seagate IronWolf disks are affected by SMR technology.

In fact, my intention is to replace all the disks in the pool with new Seagate IronWolf 6TB disks.

Yorick said:
not if they’re iron wolves, they’d need to have purchased wd reds 6tb efax as replacement - surely no one would do that in an Ironwolf pool, or would they?

*bites nails, on edge of seat*

I believed the WD Red 6TB drives were not the best choice because they use SMR technology.

Yorick · Jul 18, 2020

Yeah the Ironwolfs are all CMR, that's not what's causing your trouble.

Yorick · Jul 18, 2020

How much room / power do you have in that case?

As I see it your options are:
- Leave the failing drives in, put in their replacements as well and resilver. This can however slow down resilver if there are a lot of errors when trying to read from those drives
- Remove the failing drives, put in their replacements and resilver. Scary because now you are without redundancy - one more drive with bad blocks and you're staring data loss in the face. Arguably with two failing drives you are potentially without redundancy already, however. It would speed up the resilver, as I understand it, when the drives you are replacing have a lot of read errors.
- Wait for the current resilver to finish, then replace the other failing drive. At least you have some semblance of parity going at all times.

How far along is that first resilver now?

jmm20 · Jul 18, 2020

Hi Yorick,

The first resilver started last Wednesday. It keeps resetting: it gets to 31.9% and then restarts from 0%.

I think the best option at the moment is to stop the resilver, pull the other failing drive, and then restart the resilver.

My question is: should I replace the second drive now, or would it be better to wait for the first resilver to finish and replace the second drive afterwards?

What do you think about it?

And most important: how can I stop the resilver safely?

Yorick · Jul 18, 2020

I’ll let someone with more experience chime in, I’ve never actually had a drive fail yet and haven’t encountered the kind of issue you are dealing with.

I think that replacing both failed drives would be best, and I don’t know how to best stop a resilver in progress. A quick Google tells me that detaching the spare will work, but you need to be very careful: Detach one of the working drives and your pool is dead in the water, since you already lost two.

Here’s what I found, and do let someone else with actual first hand experience validate this idea.

>>
According to the docs:

An in-progress spare replacement can be cancelled by detaching the hot spare.

It sounds like you did a manual replace but detaching the new disk might work the same.
>>
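
Since detaching the wrong device is the real danger here, one cautious approach is to work out the detach target from the `zpool status` text rather than typing a gptid by hand. The sketch below only builds the command and never runs it. The helper names are hypothetical, it assumes the indented layout shown in the first post, and whether detaching the new disk actually cancels a manual replace is unconfirmed in this thread.

```python
def replacing_children(status_text):
    """Return (name, state) for each device listed under a 'replacing-N' vdev."""
    children = []
    parent_indent = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if not stripped:
            parent_indent = None  # a blank line ends the device tree
            continue
        indent = len(line) - len(line.lstrip())
        fields = stripped.split()
        if parent_indent is not None:
            if indent > parent_indent and len(fields) >= 2:
                children.append((fields[0], fields[1]))
                continue
            parent_indent = None  # back at the sibling level
        if fields[0].startswith("replacing-"):
            parent_indent = indent
    return children


def detach_command(pool, status_text):
    """Build (but do not run) 'zpool detach' for the new disk in a replacing vdev."""
    online = [name for name, state in replacing_children(status_text) if state == "ONLINE"]
    if len(online) != 1:
        # Refuse to guess: anything else needs a human to look at the pool.
        raise ValueError("expected exactly one ONLINE device under a replacing vdev")
    return ["zpool", "detach", pool, online[0]]


# Device tree excerpt copied from the zpool status output in the first post.
SAMPLE = """\
config:

    NAME                                              STATE     READ WRITE CKSUM
    RAID_Z2_6                                         DEGRADED     0     0     0
      raidz2-0                                        DEGRADED     0     0     0
        gptid/9fb9450e-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0
        replacing-1                                   DEGRADED     0     0     0
          3238665428308762357                         OFFLINE      0     0     0  was /dev/gptid/a0e27e05-f423-11e8-9f4f-001b21c02750
          gptid/32cd0f08-c680-11ea-ad45-002590ea28fc  ONLINE       0     0     0
        gptid/a1e91575-f423-11e8-9f4f-001b21c02750    ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(" ".join(detach_command("RAID_Z2_6", SAMPLE)))
```

Because only devices nested under `replacing-N` are ever considered, the healthy members of `raidz2-0` can never end up in the generated command.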

pschatz100 · Jul 18, 2020

While this may seem obvious, before attempting to replace the old drives, did you burn in the new ones?

Also, check all the power and power cables very carefully. I had a resilvering problem one time, and it was due to a drive power cable that was not securely plugged.
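
For anyone unsure what a burn-in involves, a common pattern is a short SMART self-test, a long SMART self-test, optionally a full write pass with `badblocks`, and then a review of the SMART report. The sketch below just lists those commands in order and runs nothing. The device name `/dev/da10` and the helper `burn_in_plan` are hypothetical, and the exact steps are an assumption rather than something prescribed in this thread.

```python
def burn_in_plan(device, destructive=False):
    """Return the burn-in commands for `device`, in order. Nothing is executed."""
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a device path under /dev/, got {device!r}")
    plan = [
        # Self-tests run in the background on the drive; wait for each one
        # to finish before starting the next step.
        ["smartctl", "-t", "short", device],
        ["smartctl", "-t", "long", device],
    ]
    if destructive:
        # DESTROYS ALL DATA on the device: only for a new, empty disk that
        # is not part of any pool.
        plan.append(["badblocks", "-b", "4096", "-ws", device])
    # Finally, read back the full SMART report and check for new errors.
    plan.append(["smartctl", "-a", device])
    return plan


if __name__ == "__main__":
    for command in burn_in_plan("/dev/da10", destructive=True):
        print(" ".join(command))
```

Keeping the destructive write pass behind an explicit flag is deliberate, since running it against a disk that already belongs to a pool would wipe that disk.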

c77dk · Jul 18, 2020

Try searching the forums - there's a thread about Ironwolfs and lsi controllers. Might be the same problem for you. The solution in that thread was a firmware upgrade of the drives. (on the phone so not easy for me to find the thread)

jmm20 · Jul 20, 2020

pschatz100 said:
While this may seem obvious, before attempting to replace the old drives, did you burn in the new ones?

Also, check all the power and power cables very carefully. I had a resilvering problem one time, and it was due to a drive power cable that was not securely plugged.

No, I didn't burn in the new ones. Should I do it?

c77dk said:
Try searching the forums - there's a thread about Ironwolfs and lsi controllers. Might be the same problem for you. The solution in that thread was a firmware upgrade of the drives. (on the phone so not easy for me to find the thread)

It could be a problem between the LSI card and the new disks, but that would be strange, wouldn't it? The LSI card, backplane, and old disks have worked well for more than a year, and the new disks are the same type as the old ones (not the same model; the part number is different).

I'm going to look for information about IronWolfs and LSI controllers.

HELP:

How should I stop the resilvering process? Manually hot-detaching the disk does not sound like a good option.

Does anyone know how to do it?

Thanks to all for your time and your responses.

jmm20 · Jul 25, 2020

I've finally solved my problem.

The resilver kept resetting because the second drive had too many bad sectors and SMART errors. Once I removed that damaged drive, the system was able to complete the resilver successfully.

I am leaving the solution I found here in case anyone finds it useful in the future.

Thanks to all who have responded to the post.
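
A drive in the state described above can usually be spotted from the attribute table printed by `smartctl -A`. The following is a rough sketch of that check: the choice of attributes to watch is an assumption, the helper name is hypothetical, and the sample table contains made-up values for illustration, not readings from the drives in this thread.

```python
# Attributes that commonly indicate bad sectors (an assumed, not exhaustive, list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def failing_attributes(smartctl_output):
    """Return {attribute: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found


# Illustrative table in the layout of `smartctl -A`; the numbers are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       1264
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13512
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for name, raw in failing_attributes(SAMPLE).items():
        print(f"{name}: {raw}")
```

Running this against each remaining pool member before starting a replacement would show whether a second drive is also reporting bad sectors, which was the situation that kept resetting the resilver here.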
