how many disks can fail in a RAIDZ2

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Hi there, how many disks can fail in a RAIDZ2?

What are the steps for replacing a "Faulted" disk? I cannot take the disk offline. The two SPARE disks are currently online, and I am not sure how this occurred.

So can I remove the faulted disk and replace it with a new disk?

1687770197063.png
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
RAID-Z2 can lose up to 2 disks per vDev before data loss will occur; a third failure loses the pool. You are at one failed disk, "da1p2", at present.
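To put numbers on that, here is a minimal Python sketch of the arithmetic. The function name `remaining_failures` is made up for illustration and is not part of ZFS or TrueNAS; it only counts parity against disks that are currently failed and not yet rebuilt.

```python
def remaining_failures(parity: int, failed: int) -> int:
    """Additional disk failures one RAID-Z vDev can absorb before data loss.

    parity: 1 for RAID-Z1, 2 for RAID-Z2, 3 for RAID-Z3.
    failed: disks in the vDev that are currently failed and not yet rebuilt.
    """
    if parity not in (1, 2, 3):
        raise ValueError("RAID-Z parity level must be 1, 2 or 3")
    if failed < 0:
        raise ValueError("failed disk count cannot be negative")
    if failed > parity:
        raise ValueError("more failed disks than parity: data is already lost")
    return parity - failed


# RAID-Z2 with one faulted disk (da1p2 in this thread)
print(remaining_failures(parity=2, failed=1))
```

Once a replacement or hot spare has fully resilvered, the failed count for that disk goes back to 0 and the vDev is back to its full 2-disk tolerance.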

The manual should have the details on how to replace a disk.

From reading the output, you appear to have exactly 1 hot spare, though the screenshot may have cut off your 2nd one. Your "da13p2" hot spare is currently replacing "da7p2", though I am not sure why, since "da7p2" is ONLINE.

I'd suggest supplying the output of zpool status in code tags to be sure.

Last, 13 disks in a RAID-Z2 vDev is on the edge of excessive. In general, 10 to 12 disks is considered the maximum. Over time a vDev this wide can slow things down, especially if you are storing a great many small files that don't fill an 11 disk data stripe (13 disks minus the 2 disks' worth of parity space).
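As a rough illustration of why small blocks don't fill a wide stripe, here is a Python sketch of how much raw space one block occupies on a 13-wide RAID-Z2. Assumptions: 4 KiB sectors (ashift=12), which this thread does not confirm, and no compression. The formula follows my understanding of how OpenZFS sizes RAID-Z allocations (data sectors, plus parity sectors per row, rounded up to a multiple of parity + 1); the function name `raidz_alloc_bytes` is made up for this example. It shows space efficiency only, not speed.

```python
def raidz_alloc_bytes(psize: int, width: int, parity: int, ashift: int = 12) -> int:
    """Approximate raw bytes one block of psize bytes occupies on a RAID-Z vDev."""
    sector = 1 << ashift                      # 4096 bytes when ashift=12 (assumed)
    data_disks = width - parity               # 11 for a 13-wide RAID-Z2
    data_sectors = -(-psize // sector)        # ceiling division
    rows = -(-data_sectors // data_disks)     # stripe rows the data spans
    total = data_sectors + parity * rows      # each row carries its own parity
    # Allocations are padded to a multiple of (parity + 1) sectors.
    total = -(-total // (parity + 1)) * (parity + 1)
    return total * sector


for size_kib in (4, 16, 128):
    psize = size_kib * 1024
    raw = raidz_alloc_bytes(psize, width=13, parity=2)
    print(f"{size_kib:>3} KiB block -> {raw // 1024} KiB raw, "
          f"{psize / raw:.0%} efficient")
```

Under these assumptions a 4 KiB block touches only 3 of the 13 disks (1 data sector plus 2 parity), so two thirds of the space it occupies is parity, whereas a 128 KiB block gets much closer to the ideal 11/13.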
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Yeah, it looks like one of your hot-spares has kicked in, and based on your output you may only have one, like @Arwen said. I would remove and replace the faulted disk first. Then, once that resilver is complete, offline da7 and replace it with a new disk. Once that’s complete, your hot-spare should return to being a hot-spare (depending on what version of TN you are running).
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Your "da13p2" hot spare is currently replacing "da7p2". Though I am not sure why since "da7p2" is ONLINE.
Probably a historic disk failure since corrected, but without properly returning the spare (using the GUI or zpool detach)
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Probably a historic disk failure since corrected, but without properly returning the spare (using the GUI or zpool detach)
Yeah good point.
 

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Hi there, thanks so much for the response. The manual advises taking a disk offline and then replacing it with a new disk.

If I cannot take da1p2 offline, can I simply remove the disk and replace it?

In addition, what do I do with da13p2? Do I take it offline and then remove it? There are no issues with DA13.

What are the steps for adding a spare disk? That is, do I install the disk and leave it in the SPARE section, so that it comes online when it is required?

1687811361593.png


1687810956508.png


1687810984354.png


1687811004255.png



1687811019924.png


1687811041295.png


Below is the most recent alert, which gives some better information:

Pool primary state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

  • Disk LENOVO ST16000NM002G ZL26NA4D0000C043HHPU is UNAVAIL -DA13
  • Disk 141653064080923613 is FAULTED -DA1 (ZL26M7AC0000C043G8CA)

Checking DA13
1687811667119.png



1687811686225.png


1687811708120.png


1687811728633.png


Below is DA1
1687811877544.png


1687811990050.png



1687812021578.png



1687812058445.png
 


troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Sorry, I have now read a bit more closely.
So I take DA7 offline and replace that disk. After replacing the disk and making it a hot spare, I can then take DA1 offline and replace it with the new disk in DA7.

It appears that DA7 has 1 non-medium error. Does this mean a physical hardware issue?

1687812475662.png


1687812493610.png


1687812518276.png
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
It appears that DA7 has 1 non-medium error. Does this mean a physical hardware issue?
Since it was just a corrected error, I wouldn't worry about it too much.

Keep an eye on that counter though. If it continues to increase, stop trusting that drive.
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Can you send the output of zpool status again now that you have done your replacement? Just wondering if your hot-spare has gone back to its correct position.
 

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Hi there, I powered down the NAS server and reseated DA1, DA7 and DA11. The disks DA1 and DA7 were showing offline after the power on. I was able to bring DA1 and DA7 back online. I wanted to see if they would fail or show issues. The disks are online and I am waiting for the resilvering process to complete.
How do you know that DA13sp2 is replacing DA7SP2? And why is the disk in DA13 showing twice, in two different Spare disk locations?
Can I make DA13 and DA7 hot spares?

I am waiting for a disk replacement for DA1.

1687902332908.png

1687902357076.png


1687902408115.png


1687902678884.png

1687902718529.png
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Looks like your hot-spare is stuck in the pool (this is a bug only recently fixed in the latest version of TN).

Try zpool detach primary gptid/xxxxxxxxxxx (this being the gptid of da13)
 

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Looks like your hot-spare is stuck in the pool (this is a bug only recently fixed in the latest version of TN).

Try zpool detach primary gptid/xxxxxxxxxxx (this being the gptid of da13)
Hi Johnny, the command worked. How do I replace the faulted DA1P2 with the disk in DA13SP2? Do I take DA13P2 offline and then select it as the replacement in the DA1P2 config?

1687986183847.png
 

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
I tried to take DA1P2 offline and this did not work. Now DA13SP2 is showing up under Primary/Spare again. I am waiting for the resilver to complete.
Any suggestions on resolving DA13p2 showing under Primary/Spare, and on using this disk to replace DA1P2, are appreciated.

1687986942590.png
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
Ok, so all you have done by setting DA1 to offline is trigger your hot-spare (DA13) to come back into the pool, which is fine. Do you have a new/spare drive you can use to replace DA1? If not, order one. Once the current resilver is complete, remove DA1 and replace it with your new drive. Once that second resilver is complete, use the above-mentioned zpool detach command to send DA13 back to being a hot-spare again and you should be golden.
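To make the order of operations explicit, here is a small Python sketch that only prints the CLI equivalents of those steps; it does not run anything. The gptid values are placeholders (the real ones come from zpool status), and the function name `spare_recovery_commands` is made up for this example. As far as I know the replace step is normally done through the TrueNAS GUI, which also handles partitioning the new disk, so treat this as an outline rather than a procedure to paste in.

```python
import shlex


def spare_recovery_commands(pool, faulted, new_disk, spare):
    """Return the zpool commands, in order, for replacing a faulted disk
    and then handing the hot spare back. Nothing is executed here."""
    return [
        ["zpool", "status", pool],                      # wait for the current resilver to finish
        ["zpool", "replace", pool, faulted, new_disk],  # swap the faulted disk for the new one
        ["zpool", "status", pool],                      # wait for this resilver to finish too
        ["zpool", "detach", pool, spare],               # return the spare to the spares list
    ]


# Placeholder device names, NOT real gptids from this pool.
for cmd in spare_recovery_commands(
    pool="primary",
    faulted="gptid/FAULTED-DA1",
    new_disk="gptid/NEW-DISK",
    spare="gptid/SPARE-DA13",
):
    print(shlex.join(cmd))
```

The detach comes last on purpose: until the new disk has finished resilvering, the spare is what is keeping the vDev at full redundancy.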
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
@troymp - As @Johnny Fartpants implied, your ZFS Hot Spare drive is working mostly as designed. It's taking over for the failed disk, which restores all of your vDev's redundancy. Once the hot spare is fully synchronized, you can lose up to 2 other disks before data / pool loss.

The way ZFS displays hot spares that are in use can be confusing to new ZFS users. Basically, ZFS mirrors the failing / failed drive with the hot spare and shows the pair as a 2 drive sub-unit in the RAID-Zx vDev, called SPARE in the GUI.


What I meant by "working mostly as designed" is that your hot spare should have kicked in for your next failed disk immediately after being freed up. There was probably some glitch in how OpenZFS handles hot spares, or in how you tried to replace the disk. Nothing much to worry about.
 

troymp

Cadet
Joined
Oct 25, 2022
Messages
8
Hi there, thanks for all the information. I had to perform a reboot, which resolved the duplicate listing of the disk. This then allowed me to remove the problematic disk. The replacement disk is in the drive slot but I have not done anything else. Should I add the new disk as a hot spare, or can I just leave it as is?
 