A little confused regarding spares...

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Can someone help clear up the steps to resolve this so it doesn't show faulted anymore?
I'm also a bit confused as to why it's showing the spares as a stripe in the GUI.

Thanks in advance...

Code:
root@silo:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:12 with 0 errors on Sun Mar 31 03:45:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 0 days 16:29:24 with 0 errors on Sun Mar 17 16:29:49 2019
config:

        NAME                                              STATE     READ WRITE CKSUM
        tank                                              DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/284957e0-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/28aa3d5b-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
          mirror-1                                        DEGRADED     0     0     0
            gptid/5560a54f-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0     0
              gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672  FAULTED     21     5     0  too many errors
              gptid/a25d0618-5b03-11e5-ba30-0cc47a34f672  ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            gptid/8cb54a53-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/b3e42571-5695-11e5-aa4b-0cc47a34f672    ONLINE       0     0     0
          mirror-3                                        DEGRADED     0     0     0
            gptid/f21b968f-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0     0
              gptid/f2802035-54f7-11e5-952d-0cc47a34f672  FAULTED     27   379     0  too many errors
              gptid/5cd20210-5b03-11e5-ba30-0cc47a34f672  ONLINE       0     0     0
          mirror-4                                        ONLINE       0     0     0
            gptid/524b645d-54fc-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/e009e2d6-54fd-11e5-952d-0cc47a34f672    ONLINE       0     0     0
        spares
          10021450293346954392                            INUSE     was /dev/gptid/5cd20210-5b03-11e5-ba30-0cc47a34f672
          12468965327649564557                            INUSE     was /dev/gptid/a25d0618-5b03-11e5-ba30-0cc47a34f672

errors: No known data errors


[Attachment: Capture1.JPG]

[Attachment: Capture2.JPG]
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I don't use spares, so my understanding of it is theoretical.
The spare is substituting, temporarily, for the faulted disk to maintain redundancy. Once you replace the faulted disk, the spare will return to the 'spares' group and the pool will be made whole.
You just need to get some replacement disks.
 
Joined
Jul 3, 2015
Messages
926
This looks pretty normal. As @Chris Moore said, you just need to replace the 'faulted' drives, and once the resilver is complete your hot-spares will return to their original position, waiting to save the day again at some point in the future. Remember that when you do the replacement via the GUI you are replacing the 'faulted' drives and NOT the spares.

Also, hot-spares are only ever temporary replacements, not permanent, and this is probably where you are getting confused.

PS: spares appearing as a stripe is perfectly normal, so long as the vdev is called 'spares'.
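
For reference, this is roughly the CLI equivalent of that replacement. It is a sketch only: the new partition's gptid below is a placeholder, and the GUI remains the supported route since it also partitions the new disk.

Code:
# see which members are FAULTED and which spares are INUSE
zpool status tank

# replace the faulted member of mirror-1 with the freshly prepared disk
# (gptid/NEW-DISK-GPTID is a placeholder for the partition the GUI would create)
zpool replace tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672 gptid/NEW-DISK-GPTID

# once the resilver finishes, the hot-spare should drop back to AVAIL
zpool status tank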
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Thank you!..
This sounds correct based upon what happened when I built this and tested the spares functionality... Your comments fell nicely into place.


@Chris Moore You don't use spares!? That's either a compliment to ZFS or a crushing blow to your data's hopes of true love. :p

@Johnny Fartpants Nailed it.

I don't think the data was in any true danger, and it felt like the time spent clarifying was worth it. This was exactly the stuff I needed to hear.

(Wishful thinking: I was planning on doing these one at a time.. but is that even necessary if they are in different vdevs? Assuming that the spares are not in fact striped, and are instead just taking the place of a failed drive, there shouldn't be any drive contention while rebuilding. I'm already presuming I will be shutting everything down to verify serial numbers* with SMART... unless there is an easier way with an LSI SAS 3 setup in a 10-series Supermicro chassis. *It would be nice to only have to shut down completely once, if it's necessary. I recall that the "LSI 3 -> BSD hot swap" aspects of it seemed to work during the limited testing I did.. but I don't have a clue how to identify the right drive without having eyes on the physical disk labels..)

I'm not against running an update before doing this if there is any relevant reason.. this system is currently running FreeNAS-11.1-U4.

Regards & Thanks!

Rich
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
@Chris Moore You don't use spares!? That's either a compliment to ZFS or a crushing blow to your data's hopes of true love. :p
I maintain a stock of pre-tested cold spares, and I get a daily status email in addition to the automatic email the system sends when there is a fault on a disk. I am also pretty fanatical about replacing a drive at the very first sign of a problem; one bad sector is enough for me to pull the drive and replace it with a spare. Because I am checking the system so often, I don't see the need for hot-spares in my specific situation.

That is not to say that hot-spares don't have a purpose. If I had a system that was remote, or could not be tended regularly, I do see the value. I will give you an example. When I took on the position I have where I work, the job had been vacant for about a year. There were servers that had been running unattended for that entire time; nobody was checking them with any regularity. When I made my initial assessments, I found nine drives that needed to be replaced because they were in a failed state. In one of those servers, the system had swapped in the hot-spare for one failed disk, and two other disks had also failed in the array group. That means that if a single additional disk had failed, they would have lost the entire storage pool, which was around 150TB of data, some of which had no backup at all; the rest would have needed to be restored from the original tar-gzip files that exist on the USB hard drives the data was originally brought in on from 'outside'. I have since optimized the process of unpacking those tar files so we can get one unpacked in about a day, but it was taking about three days per drive before I found a better way. They have a file cabinet filled with those drives. At that time, it would have taken them weeks or even months to recover had that server gone down.

The only reason it survived as long as it did was the hot-spare, but it would never have gotten to that condition if anyone had been monitoring it.
I was planning on doing these one at a time.. but is that even necessary if they are in different vdevs?
No. You can replace multiple drives at the same time. I started two resilvers in one of the servers at work just before I went home today. Each vdev is a separate failure domain. The resilver does take IOPS away from overall pool performance, though.
I'm presuming I will be shutting everything down to verify serial numbers* with SMART... unless there is an easier way with an LSI SAS 3 setup in a 10-series Supermicro chassis.
FreeNAS has a way to illuminate the error light on the drive bay if the system has a SAS backplane with LEDs. Can you give a bit more detail about what chassis these drives are in? If it is a Supermicro, Chenbro, or even QNAP enclosure with a SAS backplane, you can use the command sesutil locate da10 on to turn on the locate LED of the drive bay associated with the drive da10.
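
A quick sketch of that workflow, assuming the drive in question really is da10 and that smartctl (which ships with FreeNAS) is available:

Code:
# blink the locate LED on the bay holding da10
sesutil locate da10 on

# double-check the serial number before pulling anything
smartctl -i /dev/da10 | grep -i serial

# turn the LED back off when finished
sesutil locate da10 off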
You can absolutely swap the drives out with the server running if you have hot-swap bays. I have replaced six drives in one of the servers at work since the last time it was rebooted, over two months ago.
*It would be nice to only have to shut down completely once, if it's necessary.... When I tested this, the "hot swap" aspects of it seemed to work, BTW.... but seeming to work vs actually working may not be one and the same.. unless it has matured..
No need to shutdown.
This is one of the chassis I use at work:
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR60L.cfm
It is easy to replace a drive. The only potential problem is if you have swap space in use on the drive being replaced. If you are not using swap space or if you offline the drive first, you should be fine.
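
A minimal sketch of that pre-check, assuming the outgoing drive is the faulted member of mirror-1 from the original post:

Code:
# see which swap devices are currently in use
swapinfo

# take the outgoing pool member offline before pulling it
zpool offline tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672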
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. Use the command sesutil locate da10 off to turn the locate light off when you are done. I forgot to do that on my system at work because I was running late leaving, and I ride in a carpool, so staying late wasn't an option.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Can you give a bit more detail about what chassis these drives are in? If it is a Supermicro or Chenbro or even QNAP enclosure, with a SAS backplane, you can use the command sesutil locate da10 on, to turn on the locate LED of the drive bay associated with the drive da10.
You can absolutely swap the drives out with the server running if you have hot-swap bays. I have replaced six drives in one of the servers at work since the last time it was rebooted, over two months ago.

Thanks for the detailed replies above.. I know exactly what you mean re: replacing the drives manually. Lately I've been trying to err on the side of caution: transitioning away from parity to mirrors, using hot spares (if the system isn't right under my nose, like you mentioned), configuring snapshots, scheduled scrubs, and SMART tests.

Here is the link to the system I used when I set this one up..
https://www.supermicro.com/products/system/2U/6028/SSG-6028R-E1CR12L.cfm
Looks to be the same as yours... albeit the 2U version.

So, in a nutshell: I can just pull the faulted drives once I use sesutil to locate them, without shutting down, and put in the fresh empty drives. I may have to use replace in the GUI, but maybe not if they just start to resilver on their own. Once the new drives are active and everything is happy and resilvered, the hot-spares return to their roles? Sound about right?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So, in a nutshell: I can just pull the faulted drives once I use sesutil to locate them, without shutting down.
Not exactly. If you used the default configuration process for FreeNAS, you will have a swap partition on every data drive in the pool. The system will grab the first ten drives that come ready and use their swap space, so you might not know whether the drive is being used for swap or not. Because of that, you should go into the "Volume Status" tab, select the drive to be removed, and click the "Offline" button. This should gracefully allow the system to stop using the drive.
[Attachment: 1555257467354.png]
Then you can just snatch it out the front of the unit and put the new drive in.
put in the fresh empty drives. I may have to use replace in the GUI, but maybe not if they just start to resilver on their own.
No, they do not start to resilver on their own. FreeNAS will wait for you to tell it what to do.
You absolutely should use the replace function through the GUI because kicking off a resilver from the command line does not partition the drive. Sometimes starting a replace fails. If it does, it is probably because there is some garbage data on the drive, so you will need to use the GUI to "Wipe" the new drive.
[Attachment: 1555257754384.png]
After the disk wipe, try the replace again. I had to do that for all three of the drives I replaced at work last week. I had three drives (in different vdevs) resilvering at the same time. Doing that spikes CPU and disk activity, which will affect the responsiveness of the server on the network.
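
For the curious, this is approximately what the GUI handles for you during a replace. It is a rough sketch only (da10, the 2 GiB swap size, and the gptid placeholders are assumptions based on the default FreeNAS layout), and the GUI is still the right tool here:

Code:
# wipe any leftover partition table and recreate the default layout
gpart destroy -F da10
gpart create -s gpt da10
gpart add -t freebsd-swap -s 2g da10
gpart add -t freebsd-zfs da10

# then point the pool at the new zfs partition's gptid
zpool replace tank gptid/OLD-FAULTED-GPTID gptid/NEW-PARTITION-GPTID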
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't use spares, so my understanding of it is theoretical.
The spare is substituting, temporarily, for the faulted disk to maintain redundancy. Once you replace the faulted disk, the spare will return to the 'spares' group and the pool will be made whole.
You just need to get some replacement disks.

No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, put in a new disk, and you can mark that as your new spare.
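
For what it's worth, if you did want that outcome (keep the stand-in spare as the permanent vdev member and make the new disk the spare), a sketch using the pool and faulted gptid from the original post and a hypothetical gptid for the new disk:

Code:
# detach the faulted disk; the in-use spare then stays as a permanent member
zpool detach tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672

# add the freshly installed disk as the new hot-spare
zpool add tank spare gptid/NEW-DISK-GPTID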
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, putting in a new disk, and you can mark that as your new spare.
What is the explanation of the pool status from the original post?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What is the explanation of the pool status from the original post?

I can't tell. I generally have a single failure to deal with at a time, and I know the starting state of the system, so I don't really pay a lot of attention to "DEGRADED" or "FAULTED" and just proceed through the workflow. I have no idea what "INUSE" is either.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Not exactly. If you used the default configuration process for FreeNAS, you will have a swap partition on every data drive in the pool. The system will grab the first ten drives that come ready and used their swap space, so you might not know if the drive is being used for swap or not. Because of that, you should go into the "Volume Status" tab and select the drive to be removed and click the "Offline" button. This should gracefully allow the system to stop using the drive.

I guess my thinking was that if a spare is standing in for a faulted drive, the device that experienced the fault would already be offline.. but that is just an assumption based on my classical understanding of the way spares work. Based on the above, though, a drive can technically go into a faulted state but still be operational as a swap device...

As far as the starting state of the system goes, the vdevs were created two devices at a time as mirrors until only two drives remained unused. The last two drives were then added as spares, one at a time.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, putting in a new disk, and you can mark that as your new spare.

I just felt my heart sink a little bit.. because I adjusted my understanding.. I've just reverted it though..

I feel like it's safe to say that it doesn't change the workflow of replacing the faulted drives, just the resulting membership of the devices. However, I'm confused why the spares wouldn't be showing as the new replacement devices in the vdevs, and, just as @Chris Moore stated, I'm confused regarding the pool status. Unless the status indicates it's more degraded than it was when it had the spares available, whereas now it doesn't have spares..

I'll surely report back what actually takes place w/ updated screenshots and zpool status when the faulted devices are replaced.
 

Joined
Jul 3, 2015
Messages
926
Hot-spares are only ever temporary. When the faulted disk is replaced, the replacement disk joins the pool and the original hot-spare goes back to being a hot-spare after the resilver.

INUSE means the hot-spare is in the pool, having temporarily replaced the faulted drive.
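
A quick way to watch that transition, using the pool name from the original post:

Code:
# while the spare is standing in, it shows INUSE under 'spares';
# after the faulted disk is replaced and the resilver completes,
# it should read AVAIL again
zpool status tank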
 

Stephen Hayne

Dabbler
Joined
May 27, 2016
Messages
14
OK, sorry to resurrect this thread, but it came up in the search... please point me to a better one if I have chosen unwisely.

I think I don't really understand what is going on through the GUI (or the command line either, probably). I started getting messages (from a Supermicro 48-bay box - I have 4 of them, 10Gb lagg, btw) that a disk was faulting in one of the mirror pairs in a 24TB volume. I nursed it a bit and then decided to "replace" it, selecting one of the striped spares I had assigned. It complained, so I canceled and chose another disk that was sitting idle in the box. Now I have this:

[Attachment: Capture.PNG]


with the spares (striped) looking like this:

[Attachment: Capture2.PNG]


What the heck happened? And what do I do now? From what I read, the hot spare is now INUSE (spare-0), da5 was the spare I tried to cancel out of putting in during replacement, and da36 is the one I chose that wasn't a spare, but was idle.

I can't select OFFLINE or DETACH for da10 or spare-0. I can select OFFLINE for da5 and da27, and I can select OFFLINE or DETACH for da36.

Please tell me the workflow to get this back to a mirrored pair?

Looks like I should:
1) light up the LED if I can (I don't think my backplane supports it), but I have records of which serial # is in which bay.
2) hot swap da10
3) once resilvering is complete, DETACH da36?

Thank you for any help. I built these boxes 4+ years ago and am only now having errors - they are overprovisioned and only using about 25% capacity... :) This is my first fail.

Stephen
 

Stephen Hayne

Dabbler
Joined
May 27, 2016
Messages
14
Oh - here is the result from zpool status:

Code:
[root@sm2] ~# zpool status
  pool: SM2_infrastructure
 state: ONLINE
  scan: scrub repaired 0 in 0h57m with 0 errors on Sun Mar 31 00:57:06 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        SM2_infrastructure                              ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/11fddb29-c613-11e5-8fa9-002590c44bba  ONLINE       0     0     0
            gptid/12f1687a-c613-11e5-8fa9-002590c44bba  ONLINE       0     0     0

errors: No known data errors

  pool: SM2_ssd_pool
 state: ONLINE
  scan: scrub repaired 0 in 0h32m with 0 errors on Sun Mar 24 00:32:45 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        SM2_ssd_pool                                    ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/97213b77-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/9760c3e1-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/97a07b97-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/97e0dc43-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98231417-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/986614ad-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98aa97b9-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98ee89f7-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/99323483-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/99765dc8-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0

errors: No known data errors

  pool: SM2_stable_pool
 state: DEGRADED
  scan: resilvered 616G in 2h32m with 0 errors on Fri Apr 19 13:27:39 2019
config:

        NAME                                                STATE     READ WRITE CKSUM
        SM2_stable_pool                                     DEGRADED     0     0     0
          mirror-0                                          ONLINE       0     0     0
            gptid/030deda1-eb51-11e8-a953-002590c44bba      ONLINE       0     0     0
            gptid/99b7f4e5-2921-11e7-8bbe-002590c44bba      ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            gptid/2dc44292-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/2e84ea5f-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-2                                          ONLINE       0     0     0
            gptid/2f3a11de-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/2fe8611d-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-3                                          ONLINE       0     0     0
            gptid/30a73711-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/317a924e-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-4                                          ONLINE       0     0     0
            gptid/32516de7-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/332cce85-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-5                                          ONLINE       0     0     0
            gptid/340b05fb-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/34ba9c48-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-6                                          DEGRADED     0     0     0
            replacing-0                                     FAULTED      0     0     0
              spare-0                                       FAULTED      0     0     0
                gptid/155dee1b-4568-11e7-be80-002590c44bba  FAULTED      0    15     0  too many errors
                gptid/5e600107-cb7b-11e5-8fa9-002590c44bba  ONLINE       0     0     0
              gptid/e234cc5c-62c3-11e9-a953-002590c44bba    ONLINE       0     0     0
            gptid/1631a160-4568-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-7                                          ONLINE       0     0     0
            gptid/0d24df0b-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/0df51703-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-8                                          ONLINE       0     0     0
            gptid/490dc8cb-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/49cf2831-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-9                                          ONLINE       0     0     0
            gptid/7618aed4-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/76f4dec8-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
        cache
          gptid/350a3d9c-a9c7-11e5-9ff8-002590c44bba        ONLINE       0     0     0
          gptid/35476667-a9c7-11e5-9ff8-002590c44bba        ONLINE       0     0     0
        spares
          1359512483288840190                               INUSE     was /dev/gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
          gptid/2173a885-4568-11e7-be80-002590c44bba        AVAIL

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h9m with 0 errors on Thu Apr 18 03:54:58 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e69d8811-61fb-11e5-ba41-002590c44bba  ONLINE       0     0     0
            gptid/e5436266-92f2-11e5-8c9d-002590c44bba  ONLINE       0     0     0

errors: No known data errors
[root@sm2] ~#
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi Stephen,

Unfortunately, the output of your zpool status is incomplete... Can you please post the complete output?

Also, one of your vdevs is named mirror-6, but it already contains 3 drives showing as ONLINE. Usually, a mirror is made of 2 drives. It is possible to do 3-way mirrors (or even more), but this is not very common, and none of your other mirrors are like that. Even when 3-way mirrors are used, they are normally used for the entire pool and not for a single vdev in a larger pool.

So please confirm the exact structure of your pool before we figure out what is problematic and what can be done to fix it safely.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
OK, sorry to resurrect this thread, but it came up in the search... pls point me to a better one if I have chosen unwisely.
The thing you should have done is to start a new thread.
I think I don't really understand what is going on through the GUI
I think we are going to need to do some CLI (command line) things to get this fixed.
I nursed it a bit, and then decided to "replace" it,
Hot-spares are only activated by the complete failure of a drive that is having problems. I don't know what you did here, but you appear to have activated a spare and also somehow added another mirror. It is a really strange-looking situation.
It complained, so I canceled, and chose another disk that was sitting idle in the box.
I don't know how this happened, because I don't know exactly what steps you took, but I think we can clear this up. Just don't worry about it too much, because you have fully three disks holding the data for that vdev right now, so you shouldn't have any risk of data loss.
(I don't think my backplane supports it),
If you give the model number of the system, I could tell you for sure, but the 48-bay chassis I am familiar with should do this. You need to be able to SSH in from a terminal; I like Cygwin, but you can use PuTTY if you like. It is just so you can use the command line. The command is sesutil locate da10 on to start the light blinking, then sesutil locate da10 off when you are done. It works in systems that have a SAS expander backplane. I have systems (at work) from Supermicro, Chenbro and QNAP that all work with that.
but I have records of which Serial# is in which bay.
You don't need the serial number, just the da#...
If you need to get the device number from the gptid, you can use glabel status to show the gptids and the matching da#s.
The drive
Code:
 gptid/5e600107-cb7b-11e5-8fa9-002590c44bba 
is the spare. The GUI doesn't make that clear, but you say
da5 was the spare
so I would use glabel status to be certain.

From the command line, you can't address the drive by the da# for this, so you must be able to relate the da# you are working on back to the gptid, because the pool was formed using gptids, not da#s. You should be able to give the command
zpool detach SM2_stable_pool gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
which should return the spare to the spare group.
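
Spelled out a little more (the grep pattern below is just the start of that gptid, used to narrow the output):

Code:
# map the gptid back to a da# so you know which physical disk it is
glabel status | grep 5e600107

# detach the in-use spare; it should drop back to AVAIL in the spares list
zpool detach SM2_stable_pool gptid/5e600107-cb7b-11e5-8fa9-002590c44bba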

Please do that and let me know what the result is.
 