Spare drive kicked in. Which drive needs replacing now?

squeakybadger

Dabbler
Joined
Feb 10, 2020
Messages
13
Hi all,

I had a drive error out over the weekend, and the spare kicked in and resilvered.

I've added a new drive and want to replace the bad one, but I'm not sure which drive to actually replace.

Code:
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 7.76T in 22:26:53 with 0 errors on Tue Jan 24 02:39:51 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        PIK                                               ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            gptid/f9a1f0a4-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fc5af3f5-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            gptid/fb22ecaf-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fca8490e-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            gptid/fd35db7d-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fd5df32a-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-3                                        ONLINE       0     0     0
            gptid/fafe6574-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fce2c376-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-4                                        ONLINE       0     0     0
            gptid/fbf3c2f9-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
            gptid/fc51679c-b21c-11ea-98b6-90e2ba89e89c    ONLINE       0     0     0
          mirror-6                                        ONLINE       0     0     0
            gptid/b97214c0-16c0-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
            gptid/94402ea8-0af9-11eb-b080-90e2ba89e89c    ONLINE       0     0     0
          mirror-7                                        ONLINE       0     0     0
            spare-0                                       ONLINE       0     0     0
              gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0     0
              gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0 2.67K
            gptid/378cdb9b-16c3-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
        logs
          gptid/7eab73f1-b495-11ea-b2c1-90e2ba89e89c      ONLINE       0     0     0
        cache
          gptid/f78b4cbc-b21c-11ea-98b6-90e2ba89e89c      ONLINE       0     0     0
        spares
          gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c      INUSE     currently in use

errors: No known data errors


Is it the one that has 2.67K errors? Because that looks like the spare that is in use.

I've attached the GUI disk layout to make this clearer: da15p2 is the spare, but it also seems to be the one with errors?

Any help appreciated!

Thanks.
 

Attachments

  • replace disk.jpg

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,152
The faulty one should be gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c. Look into the logs (if the spare kicked in, something happened).
Run a scrub to make sure things are OK; those checksum errors are not reassuring, so check the cables of the spare as well.
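
For anyone reading the status output the same way, here is a minimal Python sketch of the reasoning above. It assumes the `zpool status` layout shown in the first post: devices nested under a `spare-N` vdev are either the hot spare (the one also listed under `spares` as INUSE) or the disk it stepped in for. The function name `classify_spare_group` and the trimmed `STATUS` sample are illustrative only, not part of any ZFS tooling.

```python
import re

# Trimmed copy of the mirror-7 and spares sections from the status output above.
STATUS = """\
          mirror-7                                        ONLINE       0     0     0
            spare-0                                       ONLINE       0     0     0
              gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0     0
              gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c  ONLINE       0     0 2.67K
            gptid/378cdb9b-16c3-11eb-92ce-90e2ba89e89c    ONLINE       0     0     0
        spares
          gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c      INUSE     currently in use
"""

# Counters look like "0", "12" or "2.67K"; this filters out "currently in use".
COUNTER_RE = re.compile(r"\d+(\.\d+)?[KMGT]?")


def classify_spare_group(status_text):
    """Return (original_disks, active_spares, devices_with_errors)."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        indent = len(line) - len(line.lstrip())
        rows.append((indent, fields))

    # Devices listed in the "spares" section that are currently standing in.
    in_use = {f[0] for _, f in rows if len(f) > 1 and f[1] == "INUSE"}

    originals, active_spares = [], []
    spare_indent = None
    for indent, f in rows:
        # Rows indented deeper than a spare-N row are its children.
        if spare_indent is not None and indent > spare_indent:
            (active_spares if f[0] in in_use else originals).append(f[0])
            continue
        spare_indent = indent if f[0].startswith("spare-") else None

    # Any device whose READ/WRITE/CKSUM counters are not all zero.
    with_errors = {
        f[0]: f[2:5]
        for _, f in rows
        if len(f) == 5
        and all(COUNTER_RE.fullmatch(c) for c in f[2:5])
        and any(c != "0" for c in f[2:5])
    }
    return originals, active_spares, with_errors


originals, active_spares, with_errors = classify_spare_group(STATUS)
print("disk the spare stepped in for:", originals)
print("hot spare currently in use:", active_spares)
print("devices with non-zero counters:", with_errors)
```

On this sample the disk flagged as the original is the 368d862f one, while the checksum errors sit on the in-use spare, which is why both the replacement and a cable check are suggested.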
 

squeakybadger

Dabbler
Joined
Feb 10, 2020
Messages
13
Hi Davvo,

Thanks for helping. I'll get a scrub going tonight and see if it brings anything up before I replace the disk.

Checking the logs, though, it seems the drive disconnected after a controller fault (it then reconnected and is back online).

Code:
Jan 23 04:12:29 pikdrive mps0: IOC Fault 0x40007e23, Resetting
Jan 23 04:12:29 pikdrive mps0: Reinitializing controller
Jan 23 04:12:29 pikdrive mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
Jan 23 04:12:29 pikdrive mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Jan 23 04:12:29 pikdrive (da13:mps0:0:25:0): Invalidating pack
Jan 23 04:12:29 pikdrive da13 at mps0 bus 0 scbus14 target 25 lun 0
Jan 23 04:12:29 pikdrive da13: <ATA ST12000NM0008-2H SN02>  s/n ZHZ4CE2A detached
Jan 23 04:12:37 pikdrive GEOM_MIRROR: Device swap1: provider da13p1 disconnected.
Jan 23 04:12:38 pikdrive (da13:mps0:0:25:0): Periph destroyed
Jan 23 04:12:38 pikdrive da13 at mps0 bus 0 scbus14 target 25 lun 0
Jan 23 04:12:38 pikdrive da13: <ATA ST12000NM0008-2H SN02> Fixed Direct Access SPC-4 SCSI device
Jan 23 04:12:38 pikdrive da13: Serial Number ZHZ4CE2A
Jan 23 04:12:38 pikdrive da13: 600.000MB/s transfers
Jan 23 04:12:38 pikdrive da13: Command Queueing enabled
Jan 23 04:12:38 pikdrive da13: 11444224MB (23437770752 512 byte sectors)


We had a lot of render servers writing to the drive all weekend, but I've not encountered the IOC Fault before, so that is worrying.
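
To keep an eye out for a repeat, a small Python sketch like the one below could pull the controller faults and drive detach events out of the log text. It assumes the syslog line format shown in the excerpt above; the regexes and the `summarize` helper are illustrative, and other mps driver versions may word these messages differently.

```python
import re

# Three lines copied from the log excerpt above.
LOG = """\
Jan 23 04:12:29 pikdrive mps0: IOC Fault 0x40007e23, Resetting
Jan 23 04:12:29 pikdrive da13: <ATA ST12000NM0008-2H SN02>  s/n ZHZ4CE2A detached
Jan 23 04:12:38 pikdrive da13: Serial Number ZHZ4CE2A
"""

# Timestamp, hostname, then "<controller>: IOC Fault <hex code>".
FAULT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) \S+ (?P<ctrl>\w+): "
    r"IOC Fault (?P<code>0x[0-9a-fA-F]+)"
)
# Timestamp, hostname, then "<dev>: <model>  s/n <serial> detached".
DETACH_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) \S+ (?P<dev>\w+): "
    r"<(?P<model>[^>]*)>\s+s/n (?P<serial>\S+) detached"
)


def summarize(log_text):
    """Return (controller_faults, detached_drives) found in the log text."""
    faults, detached = [], []
    for line in log_text.splitlines():
        m = FAULT_RE.match(line)
        if m:
            faults.append((m["ts"], m["ctrl"], m["code"]))
            continue
        m = DETACH_RE.match(line)
        if m:
            detached.append((m["ts"], m["dev"], m["serial"]))
    return faults, detached


faults, detached = summarize(LOG)
print("controller faults:", faults)
print("detached drives:", detached)
```

The serial number is the useful part: it can be matched against the label on the physical disk. On TrueNAS CORE, `glabel status` should show which gptid belongs to da13.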
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,152
It is indeed worrying; please provide your hardware specs.
Please read the following resources.
 

squeakybadger

Dabbler
Joined
Feb 10, 2020
Messages
13
Hi Davvo,

Hardware specs:

Dell R720
128GB RAM
3TB PCIe SSD Cache Drive
16x12TB Seagate 3.5" Drives
LSI SAS 2008(B2)
2x Dell MD1200 Powervaults (2nd Powervault is where the drive error occurred)
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,152
2x Dell MD1200 Powervaults (2nd Powervault is where the drive error occurred)
That's where the issue is, I believe: if I'm not misreading their documentation, those PowerVaults use the H810 RAID controller, which is a big no.
Please read the first resource I linked in my previous post.

That being said, I'm no expert on HBAs, so you might want to wait for a more competent opinion.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,924

squeakybadger

Dabbler
Joined
Feb 10, 2020
Messages
13
Hi Davvo,

Scrub came back clean, and there have been no errors since the weekend. I'll keep an eye out for any more IOC errors, but everything looks OK at the moment, so I'm hoping it's just a random blip.

I'm going to replace
Code:
gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c
over the weekend.

Once the disk is replaced, what is the procedure for removing the spare from that vdev grouping (and returning it to the spare state)?
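
As a hedged sketch of the two routes the OpenZFS hot-spare documentation describes: replacing the original disk should release the spare back to the spares list once the resilver finishes, while detaching the spare by hand cancels the spare replacement. The Python below only builds the command lines and runs nothing. `NEW_DISK` is a hypothetical placeholder and `plan_spare_release` is an illustrative helper. On TrueNAS the web UI normally wraps these steps, so treat the raw commands as something to check against your own setup.

```python
POOL = "PIK"
ORIGINAL = "gptid/368d862f-16c3-11eb-92ce-90e2ba89e89c"
SPARE = "gptid/3768867b-16c3-11eb-92ce-90e2ba89e89c"
NEW_DISK = "gptid/NEW-DISK-PLACEHOLDER"  # hypothetical: the new disk's real label goes here


def plan_spare_release(pool, original, spare, new_disk=None):
    """Build (but do not run) the zpool commands for either route."""
    if new_disk is not None:
        # Route 1: replace the original disk; the spare is expected to go
        # back to the spares list after the resilver completes.
        first = ["zpool", "replace", pool, original, new_disk]
    else:
        # Route 2: detach the in-use spare, leaving the original disk in place.
        first = ["zpool", "detach", pool, spare]
    # Either way, check the resulting layout afterwards.
    return [first, ["zpool", "status", pool]]


for cmd in plan_spare_release(POOL, ORIGINAL, SPARE, new_disk=NEW_DISK):
    print(" ".join(cmd))
```

If the spare is still shown inside spare-0 after the replacement resilver has finished, the manual detach route is the usual fallback.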


I'm going to replace the spare as well once it is out of the pool, as I don't like those checksum errors.

I could do with a TrueNAS update and reboot as well, as it has been running for over a year, but we always seem to have some important jobs running!
 