RAID 1 Drives Online but UNHealthy

RiceBuqit · Jun 9, 2021

Hi all,

I've been searching since Monday for a way to repair my RAID 1 drives after an unfortunate power outage at the end of last week.

Backstory: I had a power outage at the end of last week, and since powering back up I lost one of the 2 drives for my TrueNAS VM. Whilst I managed to get the drive back in and working, I received the below alert on my Web GUI saying that my pool is online but unhealthy, as one or more of my drives has experienced an unrecoverable error.

I've tried to run the command zpool import -f but I got the response saying 'no pools available to import'. I understand that this is a bad sign but I'm sure there's a way to get my pool status back to healthy again, right? Is it possible to completely format the recovered drive and then have the good drive mirror the information back across?

Worst case scenario is I'm willing to start fresh as I've not got much on there that I consider important or that hasn't been backed up elsewhere anyway.

TIA

sretalla · Jun 10, 2021

It seems to me that your pool is already imported.

What do you get from zpool status -v ?

RiceBuqit · Jun 10, 2021

sretalla said:
It seems to me that your pool is already imported.

What do you get from zpool status -v ?

Please see output below:

Code:

root@data[~]# zpool status -v
  pool: ChengJiun_Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 3.73G in 00:00:54 with 0 errors on Thu Jun 10 11:54:47 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        ChengJiun_Data                                  ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a93a9b36-bc74-11eb-be56-6dd8965887d1  ONLINE       0     0     1
            gptid/a94a8d62-bc74-11eb-be56-6dd8965887d1  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Wed Jun  9 03:45:18 2021
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
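In the output above, the only non-zero counter is CKSUM on the first member of mirror-0. As a minimal sketch for picking that out of longer status listings, the helper below reads `zpool status` text on stdin and prints every row whose READ, WRITE or CKSUM counter is not 0. The name `devices_with_errors` is hypothetical, not a ZFS command, and the parsing assumes the table layout shown in this thread.

```shell
#!/bin/sh
# Sketch: list rows in `zpool status` output with a non-zero error counter.
# devices_with_errors is a hypothetical helper name, not part of ZFS.
devices_with_errors() {
    awk '
        # The device table starts after the NAME/STATE/READ/WRITE/CKSUM header.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the "errors:" line ends the table for this pool.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0; next }
        # Rows carry name, state and three counters; report any counter that is not 0.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

On a live system it could be fed with `zpool status -v ChengJiun_Data | devices_with_errors`.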

Since I'm now able to get the drive back online and the zpool is in and working, my question still remains, how do I get it back to 'Healthy'?

TIA

Samuel Tai · Jun 10, 2021

How are you presenting your drives to the TrueNAS VM?

You can temporarily clear the checksum error with zpool clear ChengJiun_Data, since there was only a single checksum error on the first drive in the mirror. However, we need to know whether you're giving the VM direct access to the drives by passing through a PCI adapter, or whether you're using drive direct access. The former is supported; the latter is known to be very fragile.
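As a rough sketch of what clearing could look like in practice: `zpool clear` only resets the error counters, so one option is to follow it with a scrub, which makes ZFS re-read and verify every block, and then check the status again. The scrub step is an addition here, not something suggested in the thread. `clear_and_scrub` and the `ZPOOL` override are hypothetical names used for illustration; setting `ZPOOL="echo zpool"` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: reset the error counters, start a scrub, then show the pool status.
# clear_and_scrub and ZPOOL are hypothetical; ZPOOL defaults to the real zpool binary.
clear_and_scrub() {
    pool=$1
    zpool_cmd=${ZPOOL:-zpool}
    # Reset the READ/WRITE/CKSUM counters for the pool.
    $zpool_cmd clear "$pool" || return 1
    # Start a scrub; this returns immediately and the scrub runs in the background.
    $zpool_cmd scrub "$pool" || return 1
    # Show the current state; rerun later to see the finished scrub result.
    $zpool_cmd status -v "$pool"
}
```

For a dry run, `ZPOOL="echo zpool"; clear_and_scrub ChengJiun_Data` lists the three commands without touching the pool. If the checksum counter climbs again after a completed scrub, that points to a recurring problem rather than a one-off.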

RiceBuqit · Jun 10, 2021

Samuel Tai said:
How are you presenting your drives to the TrueNAS VM?

You can temporarily clear the checksum error with zpool clear ChengJiun_Data, since there was only a single checksum error on the first drive in the mirror. However, we need to know whether you're giving the VM direct access to the drives by passing through a PCI adapter, or whether you're using drive direct access. The former is supported; the latter is known to be very fragile.

Thanks for your response. By clearing the checksum error, I'm basically telling TrueNAS to 'ignore' the issue, right? Is there really no way to actually fix this?

My MB only has one PCIe slot and that's being used by my GPU. I've given TrueNAS direct access because it's the only option I have (I haven't used server grade specs for my build but it's also a sandbox environment for me to get everything right and then port it all over to a higher spec infrastructure).

Samuel Tai · Jun 10, 2021

RiceBuqit said:
Thanks for your response. By clearing the checksum error, I'm basically telling TrueNAS to 'ignore' the issue, right? Is there really no way to actually fix this?

The only real way to fix it is to replace the faulty drive with zpool replace. However, since you're using direct access, the drive itself may not be faulty, and this could be a glitch from direct access munging a few bits during that checksum read.

You should look at this resource: https://www.truenas.com/community/t...ide-to-not-completely-losing-your-data.12714/

RiceBuqit · Jun 10, 2021

Samuel Tai said:
The only real way to fix it is to replace the faulty drive with zpool replace. However, since you're using direct access, the drive itself may not be faulty, and this could be a glitch from direct access munging a few bits during that checksum read.

You should look at this resource: https://www.truenas.com/community/t...ide-to-not-completely-losing-your-data.12714/

Thanks for the link. I understand more now and will try to build a bare-metal version for final production. Everything is a test environment right now to play out my design.

And since the alert is just a glitch, I'll clear it as suggested.
