A little confused regarding spares...

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Can someone help clear up the steps to resolve this so it doesn't show faulted anymore?
I'm also a bit confused as to why it's showing the spares as a stripe in the GUI.

Thanks in advance...

Code:
root@silo:~ # zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:12 with 0 errors on Sun Mar 31 03:45:12 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 0 days 16:29:24 with 0 errors on Sun Mar 17 16:29:49 2019
config:

        NAME                                              STATE     READ WRITE CKSUM
        tank                                              DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/284957e0-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/28aa3d5b-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
          mirror-1                                        DEGRADED     0     0     0
            gptid/5560a54f-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0     0
              gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672  FAULTED     21     5     0  too many errors
              gptid/a25d0618-5b03-11e5-ba30-0cc47a34f672  ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            gptid/8cb54a53-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/b3e42571-5695-11e5-aa4b-0cc47a34f672    ONLINE       0     0     0
          mirror-3                                        DEGRADED     0     0     0
            gptid/f21b968f-54f7-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0     0
              gptid/f2802035-54f7-11e5-952d-0cc47a34f672  FAULTED     27   379     0  too many errors
              gptid/5cd20210-5b03-11e5-ba30-0cc47a34f672  ONLINE       0     0     0
          mirror-4                                        ONLINE       0     0     0
            gptid/524b645d-54fc-11e5-952d-0cc47a34f672    ONLINE       0     0     0
            gptid/e009e2d6-54fd-11e5-952d-0cc47a34f672    ONLINE       0     0     0
        spares
          10021450293346954392                            INUSE     was /dev/gptid/5cd20210-5b03-11e5-ba30-0cc47a34f672
          12468965327649564557                            INUSE     was /dev/gptid/a25d0618-5b03-11e5-ba30-0cc47a34f672

errors: No known data errors


[Attachment: Capture1.JPG]

[Attachment: Capture2.JPG]
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I don't use spares, so my understanding of it is theoretical.
The spare is substituting, temporarily, for the faulted disk to maintain redundancy. Once you replace the faulted disk, the spare will return to the 'spares' group and the pool will be made whole.
You just need to get some replacement disks.
 
Joined
Jul 3, 2015
Messages
926
This looks pretty normal. As @Chris Moore said, you just need to replace the 'faulted' drives, and once the resilver is complete your hot-spares will return to their original position, waiting to save the day again at some point in the future. Remember that when you do the replacement via the GUI you are replacing the 'faulted' drives and NOT the spares.

Also, hot-spares are only ever temporary replacements, not permanent, and this is probably where you are getting confused.

PS: spares appearing as a stripe is perfectly normal, so long as the vdev is called 'spares'.
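
For reference, this is roughly the CLI equivalent of that replacement. It is a sketch only: the new partition's gptid below is a placeholder, and the GUI remains the supported route since it also partitions the new disk.

Code:
# see which members are FAULTED and which spares are INUSE
zpool status tank

# replace the faulted member of mirror-1 with the freshly prepared disk
# (gptid/NEW-DISK-GPTID is a placeholder for the partition the GUI would create)
zpool replace tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672 gptid/NEW-DISK-GPTID

# once the resilver finishes, the hot-spare should drop back to AVAIL
zpool status tank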
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Thank you!..
This sounds correct based upon what happened when I built this and tested the spares functionality... Your comments fell nicely into place.


@Chris Moore You don't use spares!? That's either a compliment to ZFS or a crushing blow to your data's hopes of true love. :p

@Johnny Fartpants Nailed it.

I don't think the data was in any true danger, and it felt like the time spent clarifying was worth it. This was exactly the stuff I needed to hear.

(Wishful thinking: I was planning on doing these one at a time.. but is that even necessary if they are in different vdevs? Assuming that the spares are not in fact striped, and are instead just taking the place of a failed drive, there shouldn't be any drive contention while rebuilding. I'm already presuming I will be shutting everything down to verify serial numbers* with SMART... unless there is an easier way with an LSI SAS 3 setup in a 10-series Supermicro chassis. *It would be nice to only have to shut down completely once, if it's necessary. I recall that the "LSI 3 -> BSD hot swap" aspects of it seemed to work during the limited testing I did.. but I don't have a clue how to identify the right drive without having eyes on the physical disk labels..)

I'm not against running an update before doing this if there is any relevant reason.. this system is currently running FreeNAS-11.1-U4.

Regards & Thanks!

Rich
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
@Chris Moore You don't use spares!? That's either a compliment to ZFS or a crushing blow to your data's hopes of true love. :p
I maintain a stock of pre-tested cold spares, and I get a daily status email in addition to the automatic email the system sends when there is a fault on a disk. I am also pretty fanatical about replacing a drive at the very first sign of a problem; one bad sector is enough for me to pull the drive and replace it with a spare. Because I am checking the system so often, I don't see the need for hot-spares in my specific situation.

That is not to say that hot-spares don't have a purpose. If I had a system that was remote, or could not be tended regularly, I do see the value. I will give you an example. When I took on the position I have where I work, the job had been vacant for about a year. There were servers that had been running unattended for that entire time; nobody was checking them with any regularity. When I made my initial assessments, I found nine drives that needed to be replaced because they were in a failed state. In one of those servers, the system had swapped in the hot-spare for one failed disk, and two other disks had also failed in the array group. That means that if a single additional disk had failed, they would have lost the entire storage pool, which was around 150TB of data, some of which had no backup at all; the rest would have needed to be restored from the original tar-gzip files that exist on the USB hard drives the data was originally brought in on from 'outside'. I have since optimized the process of unpacking those tar files so we can get one unpacked in about a day, but it was taking about three days per drive before I found a better way. They have a file cabinet filled with those drives. At that time, it would have taken them weeks or even months to recover had that server gone down.

The only reason it survived as long as it did was the hot-spare, but it would never have gotten to that condition if anyone had been monitoring it.
I was planning on doing these one at a time.. but is that even necessary if they are in different vdevs?
No. You can replace multiple drives at the same time. I started two resilvers in one of the servers at work just before I went home today. Each vdev is a separate failure domain. The resilver does take IOPS away from overall pool performance, though.
I'm presuming I will be shutting everything down to verify serial numbers* with SMART... unless there is an easier way with an LSI SAS 3 setup in a 10-series Supermicro chassis.
FreeNAS has a way to illuminate the error light on the drive bay if the system has a SAS backplane with LEDs. Can you give a bit more detail about what chassis these drives are in? If it is a Supermicro, Chenbro, or even QNAP enclosure with a SAS backplane, you can use the command sesutil locate da10 on to turn on the locate LED of the drive bay associated with the drive da10.
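
A quick sketch of that workflow, assuming the drive in question really is da10 and that smartctl (which ships with FreeNAS) is available:

Code:
# blink the locate LED on the bay holding da10
sesutil locate da10 on

# double-check the serial number before pulling anything
smartctl -i /dev/da10 | grep -i serial

# turn the LED back off when finished
sesutil locate da10 off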
You can absolutely swap the drives out with the server running if you have hot-swap bays. I have replaced six drives in one of the servers at work since the last time it was rebooted, over two months ago.
*It would be nice to only have to shut down completely once, if it's necessary.... When I tested this, the "hot swap" aspects of it seemed to work, BTW.... but seeming to work vs actually working may not be one and the same.. unless it has matured..
No need to shutdown.
This is one of the chassis I use at work:
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR60L.cfm
It is easy to replace a drive. The only potential problem is if you have swap space in use on the drive being replaced. If you are not using swap space or if you offline the drive first, you should be fine.
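
A minimal sketch of that pre-check, assuming the outgoing drive is the faulted member of mirror-1 from the original post:

Code:
# see which swap devices are currently in use
swapinfo

# take the outgoing pool member offline before pulling it
zpool offline tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672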
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. Use the command sesutil locate da10 off to turn the locate light off when you are done. I forgot to do that on my system at work because I was running late leaving, and I ride in a carpool, so staying late wasn't an option.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Can you give a bit more detail about what chassis these drives are in? If it is a Supermicro or Chenbro or even QNAP enclosure, with a SAS backplane, you can use the command sesutil locate da10 on, to turn on the locate LED of the drive bay associated with the drive da10.
You can absolutely swap the drives out with the server running if you have hot-swap bays. I have replaced six drives in one of the servers at work since the last time it was rebooted, over two months ago.

Thanks for the detailed replies above.. I know exactly what you mean re: replacing the drives manually. Lately I've been trying to err on the side of caution: transitioning away from parity to mirrors, using hot spares (if the system isn't right under my nose, like you mentioned), configuring snapshots, scheduled scrubs, and SMART tests.

Here is the link to the system I used when I set this one up..
https://www.supermicro.com/products/system/2U/6028/SSG-6028R-E1CR12L.cfm
Looks to be the same as yours... albeit the 2U version.

So, in a nutshell: I can just pull the faulted drives once I use sesutil to locate them, without shutting down, and put in the fresh empty drives. I may have to use replace in the GUI, but maybe not if they just start to resilver on their own. Once the new drives are active and everything is happy and resilvered, the hot-spares return to their roles? Sound about right?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
So, in a nutshell: I can just pull the faulted drives once I use sesutil to locate them, without shutting down.
Not exactly. If you used the default configuration process for FreeNAS, you will have a swap partition on every data drive in the pool. The system will grab the first ten drives that come ready and use their swap space, so you might not know whether the drive is being used for swap or not. Because of that, you should go into the "Volume Status" tab, select the drive to be removed, and click the "Offline" button. This should gracefully allow the system to stop using the drive.
[Attachment: 1555257467354.png]
Then you can just snatch it out the front of the unit and put the new drive in.
put in the fresh empty drives. I may have to use replace in the GUI, but maybe not if they just start to resilver on their own.
No, they do not start to resilver on their own. FreeNAS will wait for you to tell it what to do.
You absolutely should use the replace function through the GUI because kicking off a resilver from the command line does not partition the drive. Sometimes starting a replace fails. If it does, it is probably because there is some garbage data on the drive, so you will need to use the GUI to "Wipe" the new drive.
[Attachment: 1555257754384.png]
After the disk wipe, try the replace again. I had to do that for all three of the drives I replaced at work last week. I had three drives (in different vdevs) resilvering at the same time. Doing that spikes CPU and disk activity, which will affect the responsiveness of the server on the network.
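
For the curious, this is approximately what the GUI handles for you during a replace. It is a rough sketch only (da10, the 2 GiB swap size, and the gptid placeholders are assumptions based on the default FreeNAS layout), and the GUI is still the right tool here:

Code:
# wipe any leftover partition table and recreate the default layout
gpart destroy -F da10
gpart create -s gpt da10
gpart add -t freebsd-swap -s 2g da10
gpart add -t freebsd-zfs da10

# then point the pool at the new zfs partition's gptid
zpool replace tank gptid/OLD-FAULTED-GPTID gptid/NEW-PARTITION-GPTID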
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I don't use spares, so my understanding of it is theoretical.
The spare is substituting, temporarily, for the faulted disk to maintain redundancy. Once you replace the faulted disk, the spare will return to the 'spares' group and the pool will be made whole.
You just need to get some replacement disks.

No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, put in a new disk, and you can mark that as your new spare.
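
For what it's worth, if you did want that outcome (keep the stand-in spare as the permanent vdev member and make the new disk the spare), a sketch using the pool and faulted gptid from the original post and a hypothetical gptid for the new disk:

Code:
# detach the faulted disk; the in-use spare then stays as a permanent member
zpool detach tank gptid/55bfeb35-54f7-11e5-952d-0cc47a34f672

# add the freshly installed disk as the new hot-spare
zpool add tank spare gptid/NEW-DISK-GPTID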
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, putting in a new disk, and you can mark that as your new spare.
What is the explanation of the pool status from the original post?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What is the explanation of the pool status from the original post?

I can't tell. I generally have a single failure to deal with at a time, and I know the starting state of the system, so I don't really pay a lot of attention to "DEGRADED" or "FAULTED" and just proceed through the workflow. I have no idea what "INUSE" is either.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
Not exactly. If you used the default configuration process for FreeNAS, you will have a swap partition on every data drive in the pool. The system will grab the first ten drives that come ready and used their swap space, so you might not know if the drive is being used for swap or not. Because of that, you should go into the "Volume Status" tab and select the drive to be removed and click the "Offline" button. This should gracefully allow the system to stop using the drive.

I guess my thinking was that if a spare is standing in for a faulted drive, the device that experienced the fault would already be offline.. but that is just an assumption based on my classical understanding of the way spares work. Based on the above, though, a drive can technically go into a faulted state but still be operational as a swap device...

As far as the starting state of the system goes, the vdevs were created two devices at a time as mirrors until only two drives remained unused. The last two drives were then added as spares, one at a time.
 

RchGrav

Dabbler
Joined
Feb 21, 2014
Messages
36
No, the spare becomes the working disk, replacing the failed disk. You pull the failed disk, putting in a new disk, and you can mark that as your new spare.

I just felt my heart sink a little bit.. because I adjusted my understanding.. I've just reverted it though..

I feel like it's safe to say that it doesn't change the workflow of replacing the faulted drives, just the resulting membership of the devices. However, I'm confused why the spares wouldn't be showing as the new replacement devices in the vdevs, and, just as @Chris Moore stated, I'm confused regarding the pool status. Unless the status indicates it's more degraded than it was when it had the spares available, whereas now it doesn't have spares..

I'll surely report back what actually takes place w/ updated screenshots and zpool status when the faulted devices are replaced.
 

Joined
Jul 3, 2015
Messages
926
Hot-spares are only ever temporary. When the faulted disk is replaced, the replacement disk joins the pool and the original hot-spare goes back to being a hot-spare after the resilver.

INUSE means the hot-spare is in the pool, having temporarily replaced the faulted drive.
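
A quick way to watch that transition, using the pool name from the original post:

Code:
# while the spare is standing in, it shows INUSE under 'spares';
# after the faulted disk is replaced and the resilver completes,
# it should read AVAIL again
zpool status tank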
 

Stephen Hayne

Dabbler
Joined
May 27, 2016
Messages
14
OK, sorry to resurrect this thread, but it came up in the search... please point me to a better one if I have chosen unwisely.

I think I don't really understand what is going on through the GUI (or the command line either, probably). I started getting messages (from a Supermicro 48-bay box - I have 4 of them, 10Gb lagg, btw) that a disk was faulting in one of the mirror pairs in a 24TB volume. I nursed it a bit and then decided to "replace" it, selecting one of the striped spares I had assigned. It complained, so I canceled and chose another disk that was sitting idle in the box. Now I have this:

[Attachment: Capture.PNG]


with the spares (striped) looking like this:

[Attachment: Capture2.PNG]


What the heck happened? And what do I do now? From what I read, the hot spare is now INUSE (spare-0), da5 was the spare I tried to cancel out of putting in during replacement, and da36 is the one I chose that wasn't a spare, but was idle.

I can't select OFFLINE or DETACH for da10 or spare-0. I can select OFFLINE for da5 and da27, and I can select OFFLINE or DETACH for da36.

Please tell me the workflow to get this back to a mirrored pair?

Looks like I should:
1) light up the LED if I can (I don't think my backplane supports it), but I have records of which serial # is in which bay.
2) hot swap da10
3) once resilvering is complete, DETACH da36?

Thank you for any help. I built these boxes 4+ years ago and am only now having errors - they are overprovisioned and only using about 25% capacity... :) This is my first fail.

Stephen
 

Stephen Hayne

Dabbler
Joined
May 27, 2016
Messages
14
Oh - here is the result from zpool status:

Code:
[root@sm2] ~# zpool status
  pool: SM2_infrastructure
 state: ONLINE
  scan: scrub repaired 0 in 0h57m with 0 errors on Sun Mar 31 00:57:06 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        SM2_infrastructure                              ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/11fddb29-c613-11e5-8fa9-002590c44bba  ONLINE       0     0     0
            gptid/12f1687a-c613-11e5-8fa9-002590c44bba  ONLINE       0     0     0

errors: No known data errors

  pool: SM2_ssd_pool
 state: ONLINE
  scan: scrub repaired 0 in 0h32m with 0 errors on Sun Mar 24 00:32:45 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        SM2_ssd_pool                                    ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/97213b77-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/9760c3e1-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/97a07b97-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/97e0dc43-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98231417-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/986614ad-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98aa97b9-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/98ee89f7-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/99323483-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0
            gptid/99765dc8-a9c6-11e5-9ff8-002590c44bba  ONLINE       0     0     0

errors: No known data errors

  pool: SM2_stable_pool
 state: DEGRADED
  scan: resilvered 616G in 2h32m with 0 errors on Fri Apr 19 13:27:39 2019
config:

        NAME                                                STATE     READ WRITE CKSUM
        SM2_stable_pool                                     DEGRADED     0     0     0
          mirror-0                                          ONLINE       0     0     0
            gptid/030deda1-eb51-11e8-a953-002590c44bba      ONLINE       0     0     0
            gptid/99b7f4e5-2921-11e7-8bbe-002590c44bba      ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            gptid/2dc44292-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/2e84ea5f-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-2                                          ONLINE       0     0     0
            gptid/2f3a11de-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/2fe8611d-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-3                                          ONLINE       0     0     0
            gptid/30a73711-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/317a924e-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-4                                          ONLINE       0     0     0
            gptid/32516de7-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/332cce85-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-5                                          ONLINE       0     0     0
            gptid/340b05fb-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
            gptid/34ba9c48-a9c7-11e5-9ff8-002590c44bba      ONLINE       0     0     0
          mirror-6                                          DEGRADED     0     0     0
            replacing-0                                     FAULTED      0     0     0
              spare-0                                       FAULTED      0     0     0
                gptid/155dee1b-4568-11e7-be80-002590c44bba  FAULTED      0    15     0  too many errors
                gptid/5e600107-cb7b-11e5-8fa9-002590c44bba  ONLINE       0     0     0
              gptid/e234cc5c-62c3-11e9-a953-002590c44bba    ONLINE       0     0     0
            gptid/1631a160-4568-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-7                                          ONLINE       0     0     0
            gptid/0d24df0b-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/0df51703-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-8                                          ONLINE       0     0     0
            gptid/490dc8cb-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/49cf2831-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
          mirror-9                                          ONLINE       0     0     0
            gptid/7618aed4-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
            gptid/76f4dec8-59d1-11e7-be80-002590c44bba      ONLINE       0     0     0
        cache
          gptid/350a3d9c-a9c7-11e5-9ff8-002590c44bba        ONLINE       0     0     0
          gptid/35476667-a9c7-11e5-9ff8-002590c44bba        ONLINE       0     0     0
        spares
          1359512483288840190                               INUSE     was /dev/gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
          gptid/2173a885-4568-11e7-be80-002590c44bba        AVAIL

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h9m with 0 errors on Thu Apr 18 03:54:58 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        freenas-boot                                    ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e69d8811-61fb-11e5-ba41-002590c44bba  ONLINE       0     0     0
            gptid/e5436266-92f2-11e5-8c9d-002590c44bba  ONLINE       0     0     0

errors: No known data errors
[root@sm2] ~#
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi Stephen,

Unfortunately, the output of your zpool status is incomplete... Can you please post the complete output?

Also, one of your vdevs is named mirror-6, but it already contains 3 drives showing as ONLINE. Usually, a mirror is made of 2 drives. It is possible to do 3-way mirrors (or even more), but this is not very common, and none of your other mirrors are like that. Even when 3-way mirrors are used, they are normally used for the entire pool and not for a single vdev in a larger pool.

So please confirm the exact structure of your pool before we figure out what is problematic and what can be done to fix it safely.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
OK, sorry to resurrect this thread, but it came up in the search... pls point me to a better one if I have chosen unwisely.
The thing you should have done is to start a new thread.
I think I don't really understand what is going on through the GUI
I think we are going to need to do some CLI (command line) things to get this fixed.
I nursed it a bit, and then decided to "replace" it,
Hot-spares are only activated by the complete failure of a drive that is having problems. I don't know what you did here, but you appear to have activated a spare and also somehow added another mirror. It is a really strange-looking situation.
It complained, so I canceled, and chose another disk that was sitting idle in the box.
I don't know how this happened, because I don't know exactly what steps you took, but I think we can clear this up. Just don't worry about it too much, because you have fully three disks holding the data for that vdev right now, so you shouldn't have any risk of data loss.
(I don't think my backplane supports it),
If you give the model number of the system, I could tell you for sure, but the 48-bay chassis I am familiar with should do this. You need to be able to SSH in from a terminal; I like Cygwin, but you can use PuTTY if you like. It is just so you can use the command line. The command is sesutil locate da10 on to start the light blinking, then sesutil locate da10 off when you are done. It works in systems that have a SAS expander backplane. I have systems (at work) from Supermicro, Chenbro and QNAP that all work with that.
but I have records of which Serial# is in which bay.
You don't need the serial number, just the da#...
If you need to get the device number from the gptid, you can use glabel status to show the gptids and the matching da#s.
The drive
Code:
 gptid/5e600107-cb7b-11e5-8fa9-002590c44bba 
is the spare. The GUI doesn't make that clear, but you say
da5 was the spare
so I would use glabel status to be certain.

From the command line, you can't address the drive by the da# for this, so you must be able to relate the da# you are working on back to the gptid, because the pool was formed using gptids, not da#s. You should be able to give the command
zpool detach SM2_stable_pool gptid/5e600107-cb7b-11e5-8fa9-002590c44bba
which should return the spare to the spare group.
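
Spelled out a little more (the grep pattern below is just the start of that gptid, used to narrow the output):

Code:
# map the gptid back to a da# so you know which physical disk it is
glabel status | grep 5e600107

# detach the in-use spare; it should drop back to AVAIL in the spares list
zpool detach SM2_stable_pool gptid/5e600107-cb7b-11e5-8fa9-002590c44bba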

Please do that and let me know what the result is.
 