RAID 1 Drives Online but Unhealthy

RiceBuqit

Cadet
Joined
Jun 9, 2021
Messages
4
Hi all,

I've been searching since Monday for a way to repair my RAID 1 (mirrored) drives after an unfortunate power outage at the end of last week.

After powering back up, I lost one of the 2 drives for my TrueNAS VM. I managed to get the drive back in and working, but I received the alert below in the Web GUI saying that my pool is online but unhealthy, as one or more of my drives has experienced an unrecoverable error.

[Attached screenshot of the Web GUI alert: 1623307760640.png]

I tried running zpool import -f, but the response was 'no pools available to import'. I understand that this is a bad sign, but I'm sure there's a way to get my pool status back to healthy again, right? Is it possible to completely format the recovered drive and then have the good drive mirror the data back across?

Worst case scenario is I'm willing to start fresh as I've not got much on there that I consider important or that hasn't been backed up elsewhere anyway.

TIA
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
It seems to me that your pool is already imported.

What do you get from zpool status -v ?
 

RiceBuqit

Please see output below:
Code:
root@data[~]# zpool status -v
  pool: ChengJiun_Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 3.73G in 00:00:54 with 0 errors on Thu Jun 10 11:54:47 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        ChengJiun_Data                                  ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a93a9b36-bc74-11eb-be56-6dd8965887d1  ONLINE       0     0     1
            gptid/a94a8d62-bc74-11eb-be56-6dd8965887d1  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Wed Jun  9 03:45:18 2021
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
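
In output like the above, the device behind the alert is the one with a non-zero READ, WRITE or CKSUM counter. A minimal sketch for picking such rows out of `zpool status` output, assuming a POSIX shell with `awk`; the function name `nonzero_errors` is made up for illustration and is not part of ZFS:

```shell
#!/bin/sh
# List vdevs whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` output on stdin.
# `nonzero_errors` is a hypothetical helper name, not a ZFS command.
nonzero_errors() {
    # Device rows have exactly 5 fields: NAME STATE READ WRITE CKSUM.
    # The header row is skipped because its third field is not numeric.
    awk 'NF == 5 && $2 ~ /^[A-Z]+$/ && $3 ~ /^[0-9]/ && ($3 != "0" || $4 != "0" || $5 != "0") {
        printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
    }'
}
```

Used as `zpool status -v | nonzero_errors`. For the output in this post it would print a single line for gptid/a93a9b36-bc74-11eb-be56-6dd8965887d1 with cksum=1.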


Now that the drive is back online and the zpool is in and working, my question remains: how do I get it back to 'Healthy'?

TIA
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
How are you presenting your drives to the TrueNAS VM?

You can temporarily clear the checksum error with zpool clear ChengJiun_Data, since there was only a single checksum error on the first drive in the mirror. However, we need to know whether you're giving the VM direct access to the drives by passing through a PCI adapter, or whether you're using drive direct access. The former is supported; the latter is known to be very fragile.
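
A minimal sketch of clearing the error and then checking that it stays cleared, assuming a POSIX shell. The scrub step is an addition not mentioned in this reply: a scrub re-reads every block and verifies it against its checksum, so a counter that stays at 0 afterwards is better evidence than the clear alone. The `ZPOOL` variable and the function name `clear_and_verify` are made up for illustration; only the pool name comes from this thread.

```shell
#!/bin/sh
# Sketch: clear the logged error, scrub the pool, then show its status.
# POOL is the pool name from this thread. ZPOOL is a hypothetical override
# hook: set ZPOOL="echo zpool" to print the commands instead of running them.
POOL="${POOL:-ChengJiun_Data}"
ZPOOL="${ZPOOL:-zpool}"

clear_and_verify() {
    $ZPOOL clear "$POOL" &&     # reset the READ/WRITE/CKSUM counters
    $ZPOOL scrub "$POOL" &&     # start a scrub; it runs in the background
    $ZPOOL status -v "$POOL"    # show scrub progress and current counters
}
```

After sourcing the file, run `clear_and_verify`, then repeat `zpool status -v` once the scrub has finished. If the CKSUM counter climbs again, the error was not a one-off.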
 

RiceBuqit

Thanks for your response. By clearing the checksum error, I'm basically telling TrueNAS to 'ignore' the issue, right? Is there really no way to actually fix this?

My MB only has one PCIe slot and that's being used by my GPU. I've given TrueNAS direct access because it's the only option I have (I haven't used server grade specs for my build but it's also a sandbox environment for me to get everything right and then port it all over to a higher spec infrastructure).
 

Samuel Tai

The only real way to fix it is to replace the faulty drive with zpool replace. However, since you're using direct access, the drive itself may not be faulty; this could simply be a glitch, with direct access munging a few bits during the read that failed its checksum.

You should look at this resource: https://www.truenas.com/community/t...ide-to-not-completely-losing-your-data.12714/
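
For reference, a rough sketch of what the zpool replace route could look like from the command line, assuming a POSIX shell. The old device is the gptid that showed the checksum error earlier in this thread; the new device argument, the `ZPOOL` variable and the function name `replace_member` are hypothetical. On TrueNAS, disk replacement is normally done through the web UI, so treat this only as an illustration of the underlying command.

```shell
#!/bin/sh
# Sketch: replace one mirror member, then show the resilver status.
# OLD_DEV is the gptid from the status output in this thread.
# ZPOOL is a hypothetical override hook: ZPOOL="echo zpool" gives a dry run.
POOL="${POOL:-ChengJiun_Data}"
OLD_DEV="${OLD_DEV:-gptid/a93a9b36-bc74-11eb-be56-6dd8965887d1}"
ZPOOL="${ZPOOL:-zpool}"

replace_member() {
    new_dev="$1"    # replacement disk or partition, supplied by the caller
    if [ -z "$new_dev" ]; then
        echo "usage: replace_member <new-device>" >&2
        return 1
    fi
    $ZPOOL replace "$POOL" "$OLD_DEV" "$new_dev" &&   # attach new, detach old when done
    $ZPOOL status -v "$POOL"                          # resilver progress appears under scan:
}
```

Running it with `ZPOOL="echo zpool"` first prints the exact commands without touching the pool.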
 

RiceBuqit

Thanks for the link. I understand more now and will try to build a bare-metal version for the final production setup. Everything is a test environment right now to play out my design.

And since the alert is just a glitch, I'll clear it as suggested.
 