SSD Hot Spare invoked, already has errors

Sprint · Mar 11, 2024

Morning all

So this is an odd one. Quick back ground, this is an all SSD pool, consisting of 6x1Tb Samsung Evo 860 Sata SSDs (in 2x 3 wide RaidZ1 vdevs) which has run flawlessly for a number of years. I also had a 7th brand new (when pool was built) 1Tb Samsung Evo 860 assigned as a hot spare. Pool also has slog/dedup/metadata vdevs provided by 2xOptane mirrored.

This morning I noticed that a drive had faulted, it LOOKS like the spare was invoked but that also has errors against it? If it was a HDD I wouldn't be surprised, but a brand new/never seen any work SSD to spit out errors, just doesn't seem right. I'd like to remove the spare SSD from the pool todo some quick test, then (assuming its ok) reconnect it and do a replace of the faulty SSD, but TrueNas isn't able to remove the drive, i get

Code:

[EZFS_BUSY] Pool busy; removal may already be in progress

. I need to tread carefully here, as I really don't want to lose the pool, hence coming to you wonderful people for guidance.

Pool is encrypted, its primary function is as a iSCSI storage for Proxmox VMs, i have backups on a local and offsite backup servers, but I really want to avoid using them as my VMs will lose data.

Code:

pool: SSD_Array
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 0B in 01:13:42 with 0 errors on Sun Mar 10 14:46:58 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        SSD_Array                                         DEGRADED     0     0     0
          raidz1-0                                        DEGRADED     0     0     0
            spare-0                                       UNAVAIL     10    91     0  insufficient replicas
              gptid/64135c7d-5ec1-11ec-b267-ac1f6b781c6e  FAULTED      0   146     0  too many errors
              gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e  FAULTED      6   111     0  too many errors
            gptid/6471b79d-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64e1a36a-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
          raidz1-1                                        ONLINE       0     0     0
            gptid/6489cb08-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64b14982-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64cbc764-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
        dedup
          mirror-2                                        ONLINE       0     0     0
            nvd2p4                                        ONLINE       0     0     0
            nvd3p4                                        ONLINE       0     0     0
        special
          mirror-5                                        ONLINE       0     0     0
            nvd2p3                                        ONLINE       0     0     0
            nvd3p3                                        ONLINE       0     0     0
        logs
          mirror-6                                        ONLINE       0     0     0
            nvd2p2                                        ONLINE       0     0     0
            nvd3p2                                        ONLINE       0     0     0
        spares
          gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e      INUSE     currently in use

errors: No known data errors

Thanks in advance
Sprint

Edit: Running TrueNAS-13.0-U6.1

artlessknave · Mar 12, 2024

Sprint said:
Pool also has slog/dedup/metadata vdevs provided by 2xOptane mirrored.

erm. why? SSDs mostly make these pointless. usually they just add complexity for complexities sake.

Sprint said:
(in 2x 3 wide RaidZ1 vdevs)

also: why? you could have made this a single raidz2, had the same lost to parity, but had basically the same performance, same space, and better parity overall. (SSDs usually move the bottleneck to the network)

Sprint said:
dedup

traditional dedup is known for being brutal on hardware. you havent posted any of your hardware

Sprint said:
[EZFS_BUSY] Pool busy; removal may already be in progress

onto your actual issue: I am not sure, but I would hazard a guess that you need to offline it first, because it's faulted. as, again, no hardware, I have no idea how these are connected. are they through the same controller as everything else? pcie card? onboard SATA?

not much more can be said.

homer27081990 · Mar 16, 2024

Correct me if I am wrong, but, don't you need to offline and resilver for the spare to become part of the pool? (And then get a new spare...)

artlessknave · Mar 16, 2024

homer27081990 said:
don't you need to offline and resilver for the spare to become part of the pool?

the problem is that their spare also failed for some reason. so it cant replace the failed disk with a failed spare, and it seems to be kinda stuck.

homer27081990 · Mar 16, 2024

artlessknave said:
the problem is that their spare also failed for some reason. so it cant replace the failed disk with a failed spare, and it seems to be kinda stuck.

Totally correct. But... should the spare have 2 gptid entries under it? (Sorry if I asked a silly question)

artlessknave · Mar 16, 2024

homer27081990 said:
Totally correct. But... should the spare have 2 gptid entries under it?

I dont think so. something seems very wrong, but I am not sure what.
personally, I would backup and destroy this pool. probably rebuild it as a single raidz2.

homer27081990 · Mar 17, 2024

artlessknave said:
I dont think so. something seems very wrong, but I am not sure what.
personally, I would backup and destroy this pool. probably rebuild it as a single raidz2.

To my surprise, I think I may honestly have something...

Since GPTID is, in essence, a GUID plus disk info, and by looking at all the gptid's there, together with the fact that for most of them, only the first part is different (that must be the serial-generated one), and the spare disk has 2 gptid labels (so, in the same hardware address, when OpenZFS was querying the disk, it got a response that caused it to assign to that disk a second pool disk id (hence the completely different second gptid for spare?).

I would hazard a guess, that those disks are mounted on their own backplane (the first to fail, and the spare) and it was the backplane that failed (broken miniSAS-HD? PSU issues? Corroded thin traces on PCB?...) intermittently, causing the observed behaviour...

Or, I may just be deluding myself. We'll see.

Etorix · Mar 17, 2024

The spare has one GPTID, which can be seen in two places: Under "spares", and under "spare-0", where it has been engaged to replace a failing drive. Nothing weird here.
But since the spare has failed, it should itself be replaced…

For serving iSCSI shares, I would rebuild the pool as striped mirrors. And definitely remove the partitioned Optanes.

homer27081990 · Mar 17, 2024

Etorix said:
The spare has one GPTID, which can be seen in two places: Under "spares", and under "spare-0", where it has been engaged to replace a failing drive. Nothing weird here.
But since the spare has failed, it should itself be replaced…

For serving iSCSI shares, I would rebuild the pool as striped mirrors. And definitely remove the partitioned Optanes.

If you say so. But in any case, we meant the entry spare-0 under the entry raidz1-0.

Important Announcement for the TrueNAS Community.

SSD Hot Spare invoked, already has errors

Sprint

Explorer

artlessknave

Wizard

homer27081990

Patron

artlessknave

Wizard

homer27081990

Patron

artlessknave

Wizard

homer27081990

Patron

Etorix

Wizard

homer27081990

Patron

Similar threads

Important Announcement for the TrueNAS Community.

SSD Hot Spare invoked, already has errors

Explorer

Wizard

Patron

Wizard

Patron

Wizard

Patron

Wizard

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "SSD Hot Spare invoked, already has errors"

Similar threads