Spare drive kicked in. Which drive needs replacing now?

squeakybadger · Jan 25, 2023

Hi all,

I had a drive error out over the weekend, and the spare kicked in and resilvered.

I've added a new drive and want to replace the bad one, but I'm not sure which drive actually needs replacing.

Code:

state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 7.76T in 22:26:53 with 0 errors on Tue Jan 24 02:39:51 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        PIK                                               ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/f9a1f0a4-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fc5af3f5-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            gptid/fb22ecaf-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fca8490e-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            gptid/fd35db7d-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fd5df32a-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-3                                        ONLINE       0     0     0
            gptid/fafe6574-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fce2c376-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-4                                        ONLINE       0     0     0
            gptid/fbf3c2f9-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fc51679c-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-6                                        ONLINE       0     0     0
            gptid/b97214c0-16c0-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
            gptid/94402ea8-0af9-11eb-b080-90e2ba89e89c    ONLINE       0     0     0
          mirror-7                                        ONLINE       0     0     0
            spare-0                                       ONLINE       0     0     0
              gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0     0
              gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0 2.67K
            gptid/378cdb9b-16c3-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
        logs
          gptid/7eab73f1-b495-11ea-b2c1-90e2ba89e89c      ONLINE       0     0     0
        cache
          gptid/f78b4cbc-b21c-11ea-98b6-90e2ba89e89c      ONLINE       0     0     0
        spares
          gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c      INUSE     currently in use

errors: No known data errors

Is it the one that has 2.67K errors? Because that looks like the spare that is in use.

Attached the GUI disk layout to be clearer - da15p2 is the spare, but also seems to be the one with errors?

Any help appreciated!

Thanks.
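For anyone who wants to read the counters in a `zpool status` listing like the one above without eyeballing columns, here is a minimal Python sketch. It is only an illustration: the function names `parse_count` and `devices_with_errors` are made up for this example, the sample text is a trimmed excerpt of the output posted above, and it assumes the column layout shown there. Abbreviated counters such as `2.67K` are rounded by `zpool`, so the converted value is approximate. The script reports which rows have nonzero counters and whether the device is an in-use spare; it does not decide which drive is faulty.

```python
from decimal import Decimal

# Suffixes zpool uses when it abbreviates large counters (e.g. 2.67K).
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_count(text):
    """Turn '0' or '2.67K' into an int; abbreviated values are approximate."""
    suffix = text[-1].upper()
    if suffix in SUFFIXES:
        return int(Decimal(text[:-1]) * SUFFIXES[suffix])
    return int(text)


def devices_with_errors(status_text):
    """Return (name, read, write, cksum, is_in_use_spare) for rows with errors."""
    rows = []               # every device row found in the config section
    in_use_spares = set()   # names listed as INUSE under "spares"
    in_spares = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields == ["spares"]:
            in_spares = True
            continue
        if in_spares:
            if len(fields) >= 2 and fields[1] == "INUSE":
                in_use_spares.add(fields[0])
            continue
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        try:
            read, write, cksum = (parse_count(f) for f in fields[2:5])
        except (ValueError, ArithmeticError):
            continue  # not a device row
        rows.append((fields[0], read, write, cksum))
    return [(n, r, w, c, n in in_use_spares) for n, r, w, c in rows if r or w or c]


# Trimmed excerpt of the status output posted above.
SAMPLE = """\
        NAME                                              STATE     READ WRITE CKSUM
          mirror-7                                        ONLINE       0     0     0
            spare-0                                       ONLINE       0     0     0
              gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0     0
              gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0 2.67K
            gptid/378cdb9b-16c3-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
        spares
          gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c      INUSE     currently in use
"""

for name, read, write, cksum, is_spare in devices_with_errors(SAMPLE):
    role = "in-use hot spare" if is_spare else "pool member"
    print(f"{name}: read={read} write={write} cksum~{cksum} ({role})")
```

On the excerpt this flags only `gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c`, which is also the device listed as INUSE under `spares`.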

Davvo · Jan 25, 2023

The faulty one should be gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c. Look into the logs: if the spare kicked in, something happened.
Run a scrub to make sure things are OK. Those checksum errors are not reassuring, so check the spare's cables as well.

squeakybadger · Jan 25, 2023

Hi Davvo,

Thanks for helping. I'll get a scrub going tonight and see if it brings anything up before I replace the disk.

Checking the logs, though, it seems the drive disconnected after a controller fault (but it reconnected and is back online).

Code:

Jan 23 04:12:29 pikdrive mps0: IOC Fault 0x40007e23, Resetting
Jan 23 04:12:29 pikdrive mps0: Reinitializing controller
Jan 23 04:12:29 pikdrive mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jan 23 04:12:29 pikdrive mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jan 23 04:12:29 pikdrive (da13:mps0:0:25:0): Invalidating pack
Jan 23 04:12:29 pikdrive da13 at mps0 bus 0 scbus14 target 25 lun 0
Jan 23 04:12:29 pikdrive da13: <ATA ST12000NM0008-2H SN02>  s/n ZHZ4CE2A detached
Jan 23 04:12:37 pikdrive GEOM_MIRROR: Device swap1: provider da13p1 disconnected.
Jan 23 04:12:38 pikdrive (da13:mps0:0:25:0): Periph destroyed
Jan 23 04:12:38 pikdrive da13 at mps0 bus 0 scbus14 target 25 lun 0
Jan 23 04:12:38 pikdrive da13: <ATA ST12000NM0008-2H SN02> Fixed Direct Access SPC-4 SCSI device
Jan 23 04:12:38 pikdrive da13: Serial Number ZHZ4CE2A
Jan 23 04:12:38 pikdrive da13: 600.000MB/s transfers
Jan 23 04:12:38 pikdrive da13: Command Queueing enabled
Jan 23 04:12:38 pikdrive da13: 11444224MB (23437770752 512 byte sectors)

We had a lot of render servers writing to the drive all weekend, but I've not encountered the IOC Fault before, so that is worrying.
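To check whether the controller fault recurs, the log can be scanned for the same two message shapes seen above. The Python sketch below is a rough illustration only: the regular expressions are written against the exact lines quoted in this post and may not match other driver versions or syslog formats, and the names `FAULT_RE`, `DETACH_RE`, and `summarize` are invented for this example.

```python
import re

# Patterns modelled on the two log lines quoted above (an assumption about
# the format; adjust them if your syslog lines look different).
FAULT_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+(?P<ctrl>\w+): "
    r"IOC Fault (?P<code>0x[0-9a-fA-F]+)"
)
DETACH_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+(?P<disk>\w+): "
    r"<[^>]*>\s+s/n\s+(?P<serial>\S+)\s+detached"
)


def summarize(log_text):
    """Return a list of (kind, timestamp, device, detail) tuples."""
    events = []
    for line in log_text.splitlines():
        line = line.strip()
        fault = FAULT_RE.match(line)
        if fault:
            events.append(
                ("controller_fault", fault["when"], fault["ctrl"], fault["code"])
            )
            continue
        detach = DETACH_RE.match(line)
        if detach:
            events.append(
                ("disk_detached", detach["when"], detach["disk"], detach["serial"])
            )
    return events


# Three lines copied from the log excerpt above.
SAMPLE_LOG = """\
Jan 23 04:12:29 pikdrive mps0: IOC Fault 0x40007e23, Resetting
Jan 23 04:12:29 pikdrive mps0: Reinitializing controller
Jan 23 04:12:29 pikdrive da13: <ATA ST12000NM0008-2H SN02>  s/n ZHZ4CE2A detached
"""

for event in summarize(SAMPLE_LOG):
    print(event)
```

Running it over a longer stretch of the log would show whether the fault was a one-off or whether disks keep detaching after controller resets.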

Davvo · Jan 25, 2023

It is indeed worrying. Please provide your hardware specs.
Please also read the following resources.

What's all the noise about HBA's, and why can't I use a RAID controller?

This is relevant to FreeNAS and TrueNAS CORE. Some parts of it might also be relevant to Scale, but I don't really know how reliable the Linux drivers are. 1) An HBA is a Host Bus Adapter. This is a controller that allows SAS and SATA devices to be attached to, and communicate directly with...

www.truenas.com

Multiply your problems with SATA Port Multipliers and cheap SATA controllers

This resource was originally created by user: jgreco on the TrueNAS Community Forums Archive. Please DM this account or comment in this thread to claim it. In the last year or two, we’ve had a resurgence of users asking about SATA Port Multipliers and cheap SATA controllers. Please, do NOT use...

www.truenas.com

squeakybadger · Jan 25, 2023

Hi Davvo,

Hardware specs:

Dell R720
128GB RAM
3TB PCIe SSD Cache Drive
16x12TB Seagate 3.5" Drives
LSI SAS 2008(B2)
2x Dell MD1200 Powervaults (2nd Powervault is where the drive error occurred)

Davvo · Jan 25, 2023

squeakybadger said:
2x Dell MD1200 Powervaults (2nd Powervault is where the drive error occurred)

That's where the issue is, I believe: if I'm not misreading their documentation, those PowerVaults use the H810 RAID controller, which is a big no.
Please read the first resource I linked in my previous post.

That being said, I'm no expert on HBAs, so you might want to wait for a more competent opinion.

Redcoat · Jan 25, 2023

squeakybadger said:
LSI SAS 2008(B2)

This thread looks to confirm based on OP's hardware list: https://www.truenas.com/community/threads/firmware-for-sas2008-b2.68646/

squeakybadger · Jan 27, 2023

Hi Davvo,

Scrub came back clean, and no errors since the weekend. I'll keep an eye out for any more IOC errors, but everything looks OK at the moment, so I'm hoping it's just a random blip.

I'm going to replace

Code:

gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c

over the weekend.

Once the disk is replaced, what is the procedure for removing the spare from that vdev grouping and returning it to the spare state?

I'm going to replace the spare as well once it is out of the pool, as I don't like those checksum errors.

I could do with a TrueNAS update and reboot as well, as it has been running for over a year, but we always seem to have some important jobs running!
