Identify and correct failed drive. Unsure what to replace and how

mattragusa · Nov 4, 2020

I'm running 9 drives in a RAIDZ2 with a hot spare and cache. I have a failure, and the resilvering process keeps restarting at 15%. There are two zpools plus the boot pool: a total of 13 HDs, an SSD, and two USB drives.

I'm trying to figure out which drive has failed, and I'm too much of an amateur to figure it out, but I think I'm close. I attached the screenshot of the pool status in the GUI.

Help in determining the best course of action and which HD I need to swap out would be enormously appreciated. I can provide whatever other info is needed.

zpool status is

Code:

root@singularity[~]# zpool status
  pool: AccretionDisk
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov  4 09:47:33 2020
        3.81T scanned at 420M/s, 1010G issued at 109M/s, 54.0T total
        96.2G resilvered, 1.83% done, 5 days 22:10:24 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        AccretionDisk                                     DEGRADED     0     0   0
          raidz2-0                                        DEGRADED     0     0   0
            gptid/02887208-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/83a53ef0-af47-11ea-be93-e0d55e61a68d    ONLINE       0     0   0
            gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d    ONLINE       0     0   4
            gptid/4043df3c-da48-11ea-9092-e0d55e61a68d    ONLINE       0     0   0
            gptid/15789b19-3e12-11ea-9536-e0d55e61a68d    ONLINE       0     0   0
            gptid/23bf9bf9-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/2bcb707c-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            gptid/32e64f73-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
            spare-8                                       DEGRADED     0     0   0
              3380156371381656963                         UNAVAIL      0     0   0  was /dev/gptid/0605bbc0-445e-11ea-80e4-e0d55e61a68d
              gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d  ONLINE       0     0   0
            gptid/4209a8e3-3d7f-11ea-844a-e0d55e61a68d    ONLINE       0     0   0
        cache
          gptid/4d2e226e-3e21-11ea-9536-e0d55e61a68d       
        spares
          16699762971903790315                            INUSE     was /dev/gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d

errors: No known data errors

glabel status is

Code:

root@singularity[~]# glabel status
                                      Name  Status  Components
gptid/4d2e226e-3e21-11ea-9536-e0d55e61a68d     N/A  nvd0p1
gptid/83a53ef0-af47-11ea-be93-e0d55e61a68d     N/A  ada0p2
gptid/d1383204-5842-11ea-ada5-e0d55e61a68d     N/A  ada1p2
gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p2
gptid/d7e61ca6-3e30-11ea-9536-e0d55e61a68d     N/A  ada4p2
gptid/23bf9bf9-3d7f-11ea-844a-e0d55e61a68d     N/A  da0p2
gptid/2bcb707c-3d7f-11ea-844a-e0d55e61a68d     N/A  da1p2
gptid/32e64f73-3d7f-11ea-844a-e0d55e61a68d     N/A  da2p2
gptid/4209a8e3-3d7f-11ea-844a-e0d55e61a68d     N/A  da3p2
gptid/15789b19-3e12-11ea-9536-e0d55e61a68d     N/A  da4p2
gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d     N/A  da5p2
gptid/4043df3c-da48-11ea-9092-e0d55e61a68d     N/A  da6p2
gptid/750059f8-3d7c-11ea-b860-e0d55e61a68d     N/A  da7p1
gptid/752b4c07-3d7c-11ea-b860-e0d55e61a68d     N/A  da8p1
gptid/02887208-3d7f-11ea-844a-e0d55e61a68d     N/A  ada3p2
gptid/44b2249a-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p1
gptid/027bde42-3d7f-11ea-844a-e0d55e61a68d     N/A  ada3p1

camcontrol devlist

Code:

root@singularity[~]# camcontrol devlist
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 12 lun 0 (pass0,da0)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 15 lun 0 (pass1,da1)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 16 lun 0 (pass2,da2)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 19 lun 0 (pass3,da3)
<ATA ST8000NM0055-1RM SN05>        at scbus0 target 20 lun 0 (pass4,da4)
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 22 lun 0 (pass5,da5)
<ATA ST8000VN004-2M21 SC60>        at scbus0 target 23 lun 0 (pass6,da6)
<ST8000NM0055-1RM112 SN05>         at scbus1 target 0 lun 0 (pass7,ada0)
<ST4000DM005-2DP166 0001>          at scbus2 target 0 lun 0 (pass8,ada1)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (pass9,ada2)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada3,pass10)
<ST4000VN008-2DR166 SC60>          at scbus6 target 0 lun 0 (pass11,ada4)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass12)
<Samsung Flash Drive FIT 1100>     at scbus9 target 0 lun 0 (pass13,da7)
<Samsung Flash Drive FIT 1100>     at scbus10 target 0 lun 0 (pass14,da8)
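
The three outputs above can be cross-referenced by hand, but a small script makes the chain gptid -> disk -> model easier to follow. The following is only a rough sketch, assuming the text formats shown above; the function names (`parse_glabel`, `parse_camcontrol`, `locate`) are made up for illustration and a few sample lines are pasted in rather than read from the live system. A pool member whose gptid has no entry in `glabel status` is reported as not present, which is the case for the UNAVAIL member here.

```python
import re

# Pool members of interest, copied from the zpool status output above.
ZPOOL_MEMBERS = [
    "gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d",  # member showing 4 checksum errors
    "gptid/0605bbc0-445e-11ea-80e4-e0d55e61a68d",  # UNAVAIL member ("was /dev/gptid/...")
    "gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d",  # hot spare currently in use
]

# A few sample lines copied from the glabel status output above.
GLABEL = """\
gptid/44bea737-3d7f-11ea-844a-e0d55e61a68d     N/A  ada2p2
gptid/c7392dab-58e7-11ea-bf01-e0d55e61a68d     N/A  da5p2
"""

# A few sample lines copied from the camcontrol devlist output above.
CAMCONTROL = """\
<ATA ST8000DM004-2CX1 0001>        at scbus0 target 22 lun 0 (pass5,da5)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (pass9,ada2)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada3,pass10)
"""


def parse_glabel(text):
    """Map each gptid label to its disk name, with the partition suffix stripped."""
    labels = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            labels[parts[0]] = re.sub(r"p\d+$", "", parts[2])  # ada2p2 -> ada2
    return labels


def parse_camcontrol(text):
    """Map each disk name (da5, ada2, ...) to the model string in angle brackets."""
    models = {}
    for line in text.splitlines():
        match = re.match(r"<(.+?)>\s+at .*\((.+)\)", line)
        if not match:
            continue
        for dev in match.group(2).split(","):
            if not dev.startswith("pass"):  # skip the passN passthrough device
                models[dev] = match.group(1)
    return models


def locate(members, glabel_text, camcontrol_text):
    """Return {gptid: (disk, model)}; (None, None) if the label is not present."""
    labels = parse_glabel(glabel_text)
    models = parse_camcontrol(camcontrol_text)
    result = {}
    for gptid in members:
        disk = labels.get(gptid)
        result[gptid] = (disk, models.get(disk)) if disk else (None, None)
    return result


for gptid, (disk, model) in locate(ZPOOL_MEMBERS, GLABEL, CAMCONTROL).items():
    print(gptid, disk or "not present", model or "")
```

With the sample lines, the member with checksum errors resolves to da5, the in-use spare resolves to ada2, and the UNAVAIL gptid has no matching label, which suggests that disk is not currently visible to the system at all.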

JohnDigital · Nov 5, 2020

You might be falling victim to the SMR drive nonsense. Most of those drives are on the verified SMR list. SMR causes resilvering issues because of the massive amount of data a resilver has to write to the drive. Notice the 5-day-plus resilver time? This is exactly what happens.
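
To see where a figure like "5 days 22:10:24 to go" comes from, the numbers on the scan lines of the zpool status output can be plugged into a quick calculation. This is just a back-of-the-envelope sketch, assuming the T/G/M suffixes are binary units (powers of 1024) and that the estimate is simply the remaining data divided by the current issue rate; the function name `resilver_estimate` is made up for illustration.

```python
# Figures copied from the "scan:" lines of the zpool status output in the first post.
TOTAL_TIB = 54.0     # "54.0T total"
ISSUED_GIB = 1010.0  # "1010G issued"
RATE_MIB_S = 109.0   # "at 109M/s"


def resilver_estimate(total_tib, issued_gib, rate_mib_s):
    """Return (percent done, seconds left), assuming binary units throughout."""
    total_mib = total_tib * 1024 * 1024
    issued_mib = issued_gib * 1024
    percent_done = 100.0 * issued_mib / total_mib
    seconds_left = (total_mib - issued_mib) / rate_mib_s
    return percent_done, seconds_left


percent, seconds = resilver_estimate(TOTAL_TIB, ISSUED_GIB, RATE_MIB_S)
days, remainder = divmod(int(seconds), 86400)
hours = remainder // 3600
print(f"{percent:.2f}% done, about {days} days {hours} hours to go")
```

The result lands close to the reported 1.83% and just under six days; the small difference comes from the rounded figures in the status output. The estimate only holds if the issue rate stays constant and the resilver does not restart.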

JaimieV · Nov 8, 2020

Just to add - the drives aren't *failing* as such, which is why you're not seeing real fails. They're just being SMR drives, which means stalling and flushing their caches as you put lots of data through them on a resilver, but stalling for long enough that FreeNAS thinks they've died. Then they come back online, and repeat.

The only fix is to replace them with CMR drives. Hopefully they're all new and you're within the limits of return for these...

JohnDigital · Nov 8, 2020

Western Digital pulled some shady business practices, so the other big guys followed suit. WD sold NAS drives under the Red label that were actually SMR tech. They were great at first because there wasn't any data in the pools. But as the pool filled and drives started failing, you went and bought another "Red" drive and it took 9-10 days to resilver, which as we all know is bonkers. The NAS community exposed them for hiding the fact that the drives were SMR and not CMR. Seagate did it too. People were pissed.

So that's my basic understanding of it. I've started using IronWolfs and they seem to be legit. I was formerly using Toshiba drives, but the RMA process sucks bad. I fell victim to this myself, as have lots of us. Good luck.

The good news is that once those slow p.o.s. drives have finished resilvering and you start replacing them with CMR drives, the resilver times improve dramatically.

sretalla · Nov 9, 2020

For reference, look at this resource:

List of known SMR drives

Hard drives that write data in overlapping, "shingled" tracks, have greater areal density than ones that do not. For cost and capacity reasons, manufacturers are increasingly moving to SMR, Shingled Magnetic Recording. SMR is a form of PMR...

www.truenas.com
