ONLINE (Unhealthy) - What to do?

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
My pool reports as ONLINE (Unhealthy). I'm not sure what to do to correct it. I tried running a scrub but that didn't fix it. I left it alone for a couple of months while I was busy with other things (my data is backed up elsewhere).

Code:
NASvol    (System Dataset Pool)    ONLINE (Unhealthy) |  7.33 TiB (18%) Used  |  33.92 TiB Free

Code:
Pool Status
SCRUB
Status: FINISHED
Errors: 0
Date: 02/04/2023 12:00:00 AM
Name             Read    Write    Checksum    Status
/mnt/NASvol      0       0        0           ONLINE
  RAIDZ2         0       0        0           ONLINE
    da0          0       0        1           ONLINE
    da1          0       0        159         ONLINE
    da2          0       0        0           ONLINE
    da3          0       0        0           ONLINE
    da4          0       0        0           ONLINE
    da5          0       0        0           ONLINE
    da6          0       0        0           ONLINE
    da7          0       0        0           ONLINE


I'm confused. The scrub says zero errors.

Seems there are checksum errors on da1? How do I correct that? I'd like to give that disk another shot.

I guess I could replace it with itself and resilver, but is that the best way?
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Checksum errors are often (but not always) down to iffy cabling.
Try replacing or reseating the SATA cable between the iffy disk and the motherboard.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
The nice thing about ZFS is that, if it has redundancy, it will automatically correct a problem. But to let you know about it, ZFS leaves a trail in the Read, Write, or Checksum counters.

The automatic correction requires spare blocks from the affected storage device. But, that does not seem to be the case with your Checksum errors.

As NugentS suggests, replace that disk's SATA cable (and potentially its power cable, depending on your enclosure's wiring). Then perform a zpool clear NASvol and a zpool scrub NASvol to check that the fix is complete.
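For reference, the clear-then-scrub sequence could be wrapped up roughly like this. It is only a sketch: recheck_pool is a made-up helper name, it assumes a root shell on the NAS, and the scrub itself runs in the background, so the final status call only shows that it has started.

```shell
#!/bin/sh
# Sketch only: recheck_pool is a hypothetical helper name, not a ZFS command.
# Run as root, after the cables have been reseated or replaced.
recheck_pool() {
    pool="$1"
    zpool clear "$pool" || return 1    # reset the Read/Write/Checksum counters
    zpool scrub "$pool" || return 1    # start a scrub; it runs in the background
    zpool status -v "$pool"            # show scrub progress and per-device counters
}

# Usage:
# recheck_pool NASvol
```

Re-running zpool status -v NASvol later shows whether the scrub has finished and whether any counter has moved off zero again.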
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
One might also have a look at the SMART data of your drives, to get a glimpse into their health.
If we're seeing iffy cabling errors, we'd expect to find some indications in the 199 UDMA_CRC_Error_Count line.


smartctl -a /dev/da1
...and da0 would probably be the first two of interest.
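To pull just that attribute out of the smartctl output for every disk, something along these lines might do. It is a rough sketch: crc_count is a made-up helper name, the da0 to da7 device names are taken from the pool status earlier in the thread, and it assumes the drives report attribute 199 under the name UDMA_CRC_Error_Count.

```shell
#!/bin/sh
# Sketch only: crc_count is a hypothetical helper name.
# It reads `smartctl -A` output on stdin and prints the raw value
# (last column) of attribute 199, or nothing if the attribute is absent.
crc_count() {
    awk '$1 == 199 && $2 == "UDMA_CRC_Error_Count" { print $NF }'
}

# Usage (as root), assuming the disks are da0..da7:
# for d in da0 da1 da2 da3 da4 da5 da6 da7; do
#     echo "$d: $(smartctl -A "/dev/$d" | crc_count)"
# done
```

A raw value above zero on a disk would point towards the cable or connector for that disk rather than the platters.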
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
da0:
Code:
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

da1:
Code:
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0



Also...
Code:
SMART Error Log Version: 1
No Errors Logged
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Good.
Did you follow Arwen's suggestion?
 

r00tb33r

Dabbler
Joined
Nov 25, 2017
Messages
26
I still need to look over the cables, tidy up the cable management in the case, and button it up. Then I'll clear the counters and run the scrub. So, getting there.
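Once the scrub has finished, a quick way to see whether any counter has come back might look like the following. This is a sketch: nonzero_counters is a made-up helper name, and it assumes the command-line zpool status layout (NAME STATE READ WRITE CKSUM), which orders the columns differently from the web UI table above.

```shell
#!/bin/sh
# Sketch only: nonzero_counters is a hypothetical helper name.
# It reads `zpool status` output on stdin and prints the name and the
# READ, WRITE and CKSUM columns of every row with a non-zero counter.
nonzero_counters() {
    awk '
        NF >= 5 &&
        $3 ~ /^[0-9.]+[KMGTPE]?$/ &&
        $4 ~ /^[0-9.]+[KMGTPE]?$/ &&
        $5 ~ /^[0-9.]+[KMGTPE]?$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") { print $1, $3, $4, $5 }
    '
}

# Usage:
# zpool status NASvol | nonzero_counters
```

No output means every device row is still at zero across all three counters.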
 