SSD Hot Spare invoked, already has errors

Sprint

Explorer
Joined
Mar 30, 2019
Messages
72
Morning all

So this is an odd one. Quick background: this is an all-SSD pool consisting of 6x 1TB Samsung Evo 860 SATA SSDs (in 2x 3-wide RAIDZ1 vdevs), which has run flawlessly for a number of years. I also had a 7th 1TB Samsung Evo 860, brand new when the pool was built, assigned as a hot spare. The pool also has SLOG/dedup/metadata vdevs provided by 2x mirrored Optanes.

This morning I noticed that a drive had faulted. It LOOKS like the spare was invoked, but that also has errors against it? If it were an HDD I wouldn't be surprised, but a brand new SSD that has never seen any work spitting out errors just doesn't seem right. I'd like to remove the spare SSD from the pool to do some quick tests, then (assuming it's OK) reconnect it and do a replace of the faulty SSD, but TrueNAS isn't able to remove the drive. I get
Code:
[EZFS_BUSY] Pool busy; removal may already be in progress
I need to tread carefully here, as I really don't want to lose the pool, hence coming to you wonderful people for guidance.

The pool is encrypted; its primary function is iSCSI storage for Proxmox VMs. I have backups on local and offsite backup servers, but I really want to avoid restoring from them, as my VMs would lose data.

Code:
pool: SSD_Array
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 0B in 01:13:42 with 0 errors on Sun Mar 10 14:46:58 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        SSD_Array                                         DEGRADED     0     0     0
          raidz1-0                                        DEGRADED     0     0     0
            spare-0                                       UNAVAIL     10    91     0  insufficient replicas
              gptid/64135c7d-5ec1-11ec-b267-ac1f6b781c6e  FAULTED      0   146     0  too many errors
              gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e  FAULTED      6   111     0  too many errors
            gptid/6471b79d-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64e1a36a-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
          raidz1-1                                        ONLINE       0     0     0
            gptid/6489cb08-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64b14982-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
            gptid/64cbc764-5ec1-11ec-b267-ac1f6b781c6e    ONLINE       0     0     0
        dedup
          mirror-2                                        ONLINE       0     0     0
            nvd2p4                                        ONLINE       0     0     0
            nvd3p4                                        ONLINE       0     0     0
        special
          mirror-5                                        ONLINE       0     0     0
            nvd2p3                                        ONLINE       0     0     0
            nvd3p3                                        ONLINE       0     0     0
        logs
          mirror-6                                        ONLINE       0     0     0
            nvd2p2                                        ONLINE       0     0     0
            nvd3p2                                        ONLINE       0     0     0
        spares
          gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e      INUSE     currently in use

errors: No known data errors
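
For reference, this is roughly the sequence I was hoping to run from the shell if the GUI keeps refusing. I haven't actually executed any of it, and the smartctl device path is just a placeholder for whichever adaX the spare turns out to be:
Code:
# deactivate the in-use spare so it drops back to the spares list
zpool detach SSD_Array gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e
# remove it from the pool entirely so I can test it outside ZFS
zpool remove SSD_Array gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e
# run a long SMART self-test on it (device name is a placeholder)
smartctl -t long /dev/adaX
# if it checks out, put it back and use it to replace the faulted disk
zpool replace SSD_Array gptid/64135c7d-5ec1-11ec-b267-ac1f6b781c6e gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e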


Thanks in advance
Sprint

Edit: Running TrueNAS-13.0-U6.1
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
The pool also has SLOG/dedup/metadata vdevs provided by 2x mirrored Optanes.
Erm, why? SSDs mostly make these pointless; usually they just add complexity for complexity's sake.
(in 2x 3-wide RAIDZ1 vdevs)
Also: why? You could have made this a single raidz2, lost the same amount to parity, but had basically the same performance, the same space, and better redundancy overall. (SSDs usually move the bottleneck to the network.)
Traditional dedup is known for being brutal on hardware, and you haven't posted any of your hardware details.
[EZFS_BUSY] Pool busy; removal may already be in progress
Onto your actual issue: I am not sure, but I would hazard a guess that you need to offline it first, because it's faulted. As, again, there are no hardware details, I have no idea how these drives are connected. Are they on the same controller as everything else? A PCIe card? Onboard SATA?
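
If I had to guess at the commands, something along these lines (untested, and check the gptid against your own zpool status output first):
Code:
# offline the faulted spare member first, since it's the one throwing errors
zpool offline SSD_Array gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e
# then detach it so it's no longer "in use" and can be removed and tested
zpool detach SSD_Array gptid/0e2850c7-965d-11ec-9cae-ac1f6b781c6e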

Not much more can be said.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321

Correct me if I am wrong, but don't you need to offline and resilver for the spare to become part of the pool? (And then get a new spare...)
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
don't you need to offline and resilver for the spare to become part of the pool?
The problem is that their spare also failed for some reason, so it can't replace the failed disk with a failed spare, and it seems to be kinda stuck.
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
The problem is that their spare also failed for some reason, so it can't replace the failed disk with a failed spare, and it seems to be kinda stuck.
Totally correct. But... should the spare have 2 gptid entries under it? (Sorry if I asked a silly question)
 

artlessknave

Wizard
Joined
Oct 29, 2016
Messages
1,506
Totally correct. But... should the spare have 2 gptid entries under it?
I don't think so. Something seems very wrong, but I am not sure what.
Personally, I would back up and destroy this pool, and probably rebuild it as a single raidz2.
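
Roughly this shape, with your real device names swapped in (just a sketch, and obviously only after the data is safely backed up elsewhere):
Code:
# one 6-wide raidz2 data vdev plus the old spare; daX names are placeholders
zpool create SSD_Array raidz2 da0 da1 da2 da3 da4 da5
zpool add SSD_Array spare da6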
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
I don't think so. Something seems very wrong, but I am not sure what.
Personally, I would back up and destroy this pool, and probably rebuild it as a single raidz2.
To my surprise, I think I may honestly have something...

A GPTID is, in essence, a GUID plus disk info, and looking at all the gptids there, only the first part differs for most of them (that must be the serially-generated part). The spare disk, however, has two gptid labels under it, so at the same hardware address, when OpenZFS queried the disk, it got a response that caused it to assign a second pool disk ID to that disk (hence the completely different second gptid for the spare?).

I would hazard a guess that those two disks (the first to fail, and the spare) are mounted on their own backplane, and it was the backplane that failed intermittently (broken miniSAS-HD cable? PSU issues? Corroded thin traces on the PCB?), causing the observed behaviour...

Or, I may just be deluding myself. We'll see.
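
One way to check that part of the theory, I think, would be to map the gptid labels back to physical devices and see where the two suspect ones actually sit:
Code:
# list GEOM labels so each gptid/... can be matched to its daX/adaX device
glabel status | grep gptid
# then see which controller/port each of those devices hangs off
camcontrol devlist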
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The spare has one GPTID, which can be seen in two places: Under "spares", and under "spare-0", where it has been engaged to replace a failing drive. Nothing weird here.
But since the spare has failed, it should itself be replaced…

For serving iSCSI shares, I would rebuild the pool as striped mirrors. And definitely remove the partitioned Optanes.
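
Roughly this layout, with device names as placeholders:
Code:
# three striped 2-way mirrors; for block storage this gives better IOPS and
# faster resilvers than raidz (daX names are placeholders)
zpool create SSD_Array mirror da0 da1 mirror da2 da3 mirror da4 da5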
 

homer27081990

Patron
Joined
Aug 9, 2022
Messages
321
The spare has one GPTID, which can be seen in two places: Under "spares", and under "spare-0", where it has been engaged to replace a failing drive. Nothing weird here.
But since the spare has failed, it should itself be replaced…

For serving iSCSI shares, I would rebuild the pool as striped mirrors. And definitely remove the partitioned Optanes.
If you say so. But in any case, we meant the entry spare-0 under the entry raidz1-0.
 