Identify and correct failed drive. Unsure what to replace and how

mattragusa

Cadet
Joined
Oct 28, 2018
Messages
8
I'm running 9 drives in a RaidZ2 with a hot spare and cache. I have a failure, and the resilvering process keeps restarting at 15%. There are two zpools, plus the boot pool, for a total of 13 HDs, an SSD, and two USB drives.

I'm trying to figure out which drive has failed. I'm too much of an amateur to work it out on my own, but I think I'm close. I've attached a screenshot of the pool status in the GUI.

Help in determining the best course of action and which HD I need to swap out would be enormously appreciated. I can provide whatever other info may be needed.

zpool status is

Code:
root@singularity[~]# zpool status
  pool: AccretionDisk
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov  4 09:47:33 2020
        3.81T scanned at 420M/s, 1010G issued at 109M/s, 54.0T total
        96.2G resilvered, 1.83% done, 5 days 22:10:24 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        AccretionDisk                                     DEGRADED     0     0   0
          raidz2-0                                        DEGRADED     0     0   0
            gptid/02887208-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/83a53ef0-af47-11ea-be93-e0d55e61a68d    ONLINE       0     0   0
            gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d    ONLINE       0     0   4
            gptid/4043df3c-da48-11ea-9092-e0d55e61a68d    ONLINE       0     0   0
            gptid/15789b19-3e12-11ea-9536-e0d55e61a68d    ONLINE       0     0   0
            gptid/23bf9bf9-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/2bcb707c-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/32e64f73-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            spare-8                                       DEGRADED     0     0   0
              3380156371381656963                         UNAVAIL      0     0   0  was /dev/gptid/0605bbc0-445e-11ea-80e4-e0d55e61a68d
              gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d  ONLINE       0     0   0
            gptid/4209a8e3-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
        cache
          gptid/4d2e226e-3e21-11ea-9536-e0d55e61a68d       
    spares
          16699762971903790315                            INUSE     was /dev/gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d

errors: No known data errors


glabel status is

Code:
root@singularity[~]# glabel status
                                      Name  Status  Components
gptid/4d2e226e-3e21-11ea-9536-e0d55e61a68d     N/A  nvd0p1
gptid/83a53ef0-af47-11ea-be93-e0d55e61a68d     N/A  ada0p2
gptid/d1383204-5842-11ea-ada5-e0d55e61a68d     N/A  ada1p2
gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p2
gptid/d7e61ca6-3e30-11ea-9536-e0d55e61a68d     N/A  ada4p2
gptid/23bf9bf9-3d7f-11ea-844a-e0d55e61a68d     N/A  da0p2
gptid/2bcb707c-3d7f-11ea-844a-e0d55e61a68d     N/A  da1p2
gptid/32e64f73-3d7f-11ea-844a-e0d55e61a68d     N/A  da2p2
gptid/4209a8e3-3d7f-11ea-844a-e0d55e61a68d     N/A  da3p2
gptid/15789b19-3e12-11ea-9536-e0d55e61a68d     N/A  da4p2
gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d     N/A  da5p2
gptid/4043df3c-da48-11ea-9092-e0d55e61a68d     N/A  da6p2
gptid/750059f8-3d7c-11ea-b860-e0d55e61a68d     N/A  da7p1
gptid/752b4c07-3d7c-11ea-b860-e0d55e61a68d     N/A  da8p1
gptid/02887208-3d7f-11ea-844a-e0d55e61a68d     N/A  ada3p2
gptid/44b2249a-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p1
gptid/027bde42-3d7f-11ea-844a-e0d55e61a68d     N/A  ada3p1



camcontrol devlist

Code:
root@singularity[~]# camcontrol devlist
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 12 lun 0 (pass0,da0)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 15 lun 0 (pass1,da1)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 16 lun 0 (pass2,da2)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 19 lun 0 (pass3,da3)
<ATA ST8000NM0055-1RM SN05>        at scbus0 target 20 lun 0 (pass4,da4)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 22 lun 0 (pass5,da5)
<ATA ST8000VN004-2M21 SC60>        at scbus0 target 23 lun 0 (pass6,da6)
<ST8000NM0055-1RM112 SN05>         at scbus1 target 0 lun 0 (pass7,ada0)
<ST4000DM005-2DP166 0001>          at scbus2 target 0 lun 0 (pass8,ada1)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (pass9,ada2)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada3,pass10)
<ST4000VN008-2DR166 SC60>          at scbus6 target 0 lun 0 (pass11,ada4)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass12)
<Samsung Flash Drive FIT 1100>     at scbus9 target 0 lun 0 (pass13,da7)
<Samsung Flash Drive FIT 1100>     at scbus10 target 0 lun 0 (pass14,da8)
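
One way to connect the three outputs above is to join them on the gptid label: `zpool status` names the pool member, `glabel status` maps that label to a partition, and `camcontrol devlist` gives the model behind the disk. The following is only a minimal sketch in Python, using a few rows copied from the output above; the function names (`parse_glabel`, `parse_camcontrol`, `locate`) are made up for illustration and are not part of FreeNAS.

```python
import re

# A few rows copied from the `glabel status` output above.
GLABEL = """\
gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p2
gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d     N/A  da5p2
gptid/4043df3c-da48-11ea-9092-e0d55e61a68d     N/A  da6p2
"""

# The matching rows copied from the `camcontrol devlist` output above.
CAMCONTROL = """\
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 22 lun 0 (pass5,da5)
<ATA ST8000VN004-2M21 SC60>        at scbus0 target 23 lun 0 (pass6,da6)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (pass9,ada2)
"""


def parse_glabel(text):
    """Map each gptid label to its disk name (partition suffix removed)."""
    labels = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            labels[parts[0]] = re.sub(r"p\d+$", "", parts[2])
    return labels


def parse_camcontrol(text):
    """Map each disk name (da5, ada2, ...) to its model string."""
    models = {}
    for line in text.splitlines():
        m = re.match(r"<(.+?)>\s+at .*\((.+)\)\s*$", line)
        if m:
            for dev in m.group(2).split(","):
                if not dev.startswith("pass"):  # skip the passthrough node
                    models[dev] = m.group(1).strip()
    return models


def locate(gptid, labels, models):
    """Return (disk, model), or None when the label is not attached at all."""
    disk = labels.get(gptid)
    return None if disk is None else (disk, models.get(disk, "unknown"))


labels = parse_glabel(GLABEL)
models = parse_camcontrol(CAMCONTROL)

# The member showing 4 checksum errors in `zpool status`:
print(locate("gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d", labels, models))
# The UNAVAIL member ("was /dev/gptid/0605bbc0-..."):
print(locate("gptid/0605bbc0-445e-11ea-80e4-e0d55e61a68d", labels, models))
```

With these rows, the member with checksum errors resolves to da5, and the UNAVAIL label resolves to nothing. The same is true of the full `glabel status` listing above, which has no entry for 0605bbc0. That suggests the missing disk is not currently detected by the system at all, so it would have to be found physically, for example by checking which drive's serial number is absent.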
 

Attachments

  • Screenshot 2020-11-04 122534.jpg
Joined
Jan 7, 2015
Messages
1,155
You might be falling victim to the SMR drive nonsense. Most of those drives are on the verified SMR list. SMR causes resilvering issues because of the massive amount of data a resilver has to write to the drive. Notice the 5-day-plus resilver time? This is exactly what happens.
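
The 5-day figure can be roughly reproduced from the scan lines in the `zpool status` output posted above. This is just back-of-the-envelope arithmetic in Python, assuming the T/G/M suffixes are binary units (powers of 1024) and that the issue rate stays constant, which a stalling drive will not guarantee.

```python
# Figures copied from the `zpool status` scan lines in the first post.
TOTAL_TIB = 54.0      # "54.0T total"
ISSUED_GIB = 1010.0   # "1010G issued"
RATE_MIB_S = 109.0    # "issued at 109M/s"


def resilver_eta_days(total_tib, issued_gib, rate_mib_s):
    """Estimate days remaining if the issue rate stays constant."""
    remaining_mib = total_tib * 1024 * 1024 - issued_gib * 1024
    return remaining_mib / rate_mib_s / 86400


def percent_done(total_tib, issued_gib):
    """Share of the pool's data already issued, in percent."""
    return 100 * issued_gib / (total_tib * 1024)


eta = resilver_eta_days(TOTAL_TIB, ISSUED_GIB, RATE_MIB_S)
done = percent_done(TOTAL_TIB, ISSUED_GIB)
print(f"{done:.2f}% done, about {eta:.1f} days remaining")
```

This gives about 1.83% done and about 5.9 days remaining, in line with the "1.83% done, 5 days 22:10:24 to go" that the pool reports.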
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
Just to add - the drives aren't *failing* as such, which is why you're not seeing real fails. They're just being SMR drives, which means stalling and flushing their caches as you put lots of data through them on a resilver, but stalling for long enough that FreeNAS thinks they've died. Then they come back online, and repeat.

The only fix is to replace them with CMR drives. Hopefully they're all new and you're within the limits of return for these...
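
To see how many disks that would mean, the models from the `camcontrol devlist` output in the first post can be checked against a list of known SMR models. The sketch below is illustrative only: `SMR_MODELS` is a hand-maintained set, not something FreeNAS provides, and it contains just ST8000DM004, which Seagate lists as SMR. Any other model should be checked against the vendor's own CMR/SMR list before being added.

```python
# Device -> model for the 8 TB disks, transcribed from the
# `camcontrol devlist` output in the first post.
DISKS = {
    "da0": "ST8000DM004-2CX1",
    "da1": "ST8000DM004-2CX1",
    "da2": "ST8000DM004-2CX1",
    "da3": "ST8000DM004-2CX1",
    "da4": "ST8000NM0055-1RM",
    "da5": "ST8000DM004-2CX1",
    "da6": "ST8000VN004-2M21",
    "ada0": "ST8000NM0055-1RM112",
    "ada2": "ST8000DM004-2CX188",
    "ada3": "ST8000DM004-2CX188",
}

# Assumption: a hand-maintained set of SMR model numbers (hypothetical name).
SMR_MODELS = {"ST8000DM004"}


def smr_suspects(disks, smr_models):
    """Return the device names whose base model number is in smr_models."""
    return sorted(
        dev for dev, model in disks.items()
        if model.split("-")[0] in smr_models
    )


suspects = smr_suspects(DISKS, SMR_MODELS)
print(f"{len(suspects)} of {len(DISKS)} disks match:", ", ".join(suspects))
```

On the listing from the first post, this flags seven of the ten 8 TB disks (ada2, ada3, da0, da1, da2, da3, da5), including the in-use spare on ada2 and the disk showing checksum errors on da5.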
 
Joined
Jan 7, 2015
Messages
1,155
Western Digital pulled some shady business practices, and the other big guys followed suit. WD sold NAS drives under the Red label that were actually SMR tech. They were great at first because there wasn't any data in the pools. But as the pools filled and drives started failing, you went and bought another "Red" drive and it took 9-10 days to resilver, which as we all know is bonkers. The NAS community exposed them for hiding the fact that the drives were SMR and not CMR. Seagate did it too. People were pissed.

So that's my basic understanding of it. I've started using IronWolfs and they seem to be legit. I was formerly using Toshiba drives, but the RMA process sucks bad. I fell victim to this myself, as have lots of us. Good luck.

The good news is that once those slow p.o.s. drives are resilvered and you start replacing them with CMR drives, the resilver times improve dramatically.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
For reference, look at this resource:
 