ONLINE (Unhealthy) - What to do?

r00tb33r · Apr 7, 2023

My pool reports as ONLINE (Unhealthy). I'm not sure what to do to correct it. I tried running a scrub but that didn't fix it. I left it alone for a couple of months while I was busy with other things (my data is backed up elsewhere).

Code:

NASvol    (System Dataset Pool)    ONLINE (Unhealthy) |  7.33 TiB (18%) Used  |  33.92 TiB Free

Code:

Pool Status
SCRUB
Status: FINISHED
Errors: 0
Date: 02/04/2023 12:00:00 AM

Name            Read  Write  Checksum  Status
/mnt/NASvol        0      0         0  ONLINE
  RAIDZ2           0      0         0  ONLINE
    da0            0      0         1  ONLINE
    da1            0      0       159  ONLINE
    da2            0      0         0  ONLINE
    da3            0      0         0  ONLINE
    da4            0      0         0  ONLINE
    da5            0      0         0  ONLINE
    da6            0      0         0  ONLINE
    da7            0      0         0  ONLINE

I'm confused. The scrub says zero errors.

Seems there are checksum errors on da1? How do I correct that? I'd like to give that disk another shot.

I guess I could replace it with itself and resilver, but is that the best way?

NugentS · Apr 7, 2023

Checksum errors are often (but not always) down to iffy cabling.
Try reseating or replacing the SATA cable between the affected disk and the motherboard.

Arwen · Apr 8, 2023

The nice thing about ZFS is that, if it has redundancy, it will automatically correct a problem. To let you know about it, ZFS leaves a trail in the Read, Write or Checksum counters.

The automatic correction requires spare blocks on the affected storage device. But that does not seem to be the case with your Checksum errors.

As NugentS suggests, replace that disk's SATA cable (and potentially its power cable, depending on your enclosure's wiring). Then run zpool clear NASvol and zpool scrub NASvol to check that the problem is fully fixed.
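
A minimal sketch of that sequence as a shell function, assuming the pool name NASvol from this thread. The names run and recheck_pool and the DRY_RUN switch are made up for illustration; zpool clear, zpool scrub and zpool status are the standard commands. With DRY_RUN=1 it only prints what it would run.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1, print a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: reset the pool's error counters, start a scrub,
# then show the status. zpool scrub returns right away; the scrub itself
# keeps running in the background, so re-check the status later.
recheck_pool() {
    pool="$1"
    run zpool clear "$pool" &&
    run zpool scrub "$pool" &&
    run zpool status -v "$pool"
}
```

To preview, run DRY_RUN=1 recheck_pool NASvol. Without DRY_RUN it needs root. If the Checksum counter on da1 stays at 0 once the scrub finishes, the cabling was most likely the culprit.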

Dice · Apr 11, 2023

One might also have a look at the SMART data of your drives, to get a glimpse into their health.
If we're seeing iffy cabling errors, we'd expect to find some indications in the 199 UDMA_CRC_Error_Count line.

smartctl -a /dev/da1
...and da0 would probably be the first two of interest.
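
To pull just that attribute out of the smartctl output, something like the following might work. crc_errors is a made-up helper name, and it assumes SATA drives with the usual attribute table layout, where the raw value is the tenth column.

```shell
#!/bin/sh
# Hypothetical helper: read smartctl attribute output on stdin and print
# the raw value (10th column) of attribute 199, UDMA_CRC_Error_Count.
crc_errors() {
    awk '$1 == 199 { print $10 }'
}

# Example (needs root and smartmontools installed):
#   smartctl -A /dev/da1 | crc_errors
```

A raw value that is above zero, and especially one that keeps climbing, would point at the cable or connector rather than the disk itself.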

r00tb33r · Apr 12, 2023

Dice said:
One might also have a look at the SMART data of your drives, to get a glimpse into their health.
If we're seeing iffy cabling errors, we'd expect to find some indications in the 199 UDMA_CRC_Error_Count line.

smartctl -a /dev/da1
...and da0 would probably be the first two of interest.

da0:

Code:

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

da1:

Code:

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Also...

Code:

SMART Error Log Version: 1
No Errors Logged

Dice · Apr 13, 2023

Good.
Did you follow Arwen's suggestion?

r00tb33r · Apr 13, 2023

Dice said:
Good.
Did you follow Arwen's suggestion?

Need to look over the cables and tidy up the cable management in the case, button it up, then clear the counters, then run the scrub. So getting there.
