SOLVED Pool "unhealthy" but no disks offline or in error state??

ee21

Dabbler
Joined
Nov 2, 2020
Messages
33
I have a 4-disk pool. Two of the disks had been attached to an HBA card, and the other two to internal SATA connectors. I moved the two on internal SATA over to the HBA, and now my pool shows "unhealthy", yet when I check the status, all disks are online and none show an error:
[Attached images: Screenshot (1).png, Screenshot (2).png]

Am I failing to look somewhere to identify the problem being reported? Or is the issue that the checksum count is something other than "0" for two drives? Can this be fixed, or should I ignore it for now?
 
Joined
Jan 7, 2015
Messages
1,155
You likely have failing disks. Run a SMART test on all your drives.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Are the two drives with checksum errors those that you moved from the motherboard ports to the HBA? Do what John Digital suggests to check the drives themselves, then work on the drive connections that may affect the checksum error response.
 

ee21

Dabbler
Joined
Nov 2, 2020
Messages
33
You likely have failing disks. Run a SMART test on all your drives.

I ran a "long" test on all disks in the pool, and none show an error (same output here for all disks):

Code:
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     33675         -
# 2  Short offline       Completed without error       00%     33651         -
# 3  Extended offline    Completed without error       00%     33647         -
 

ee21

Dabbler
Joined
Nov 2, 2020
Messages
33
Are the two drives with checksum errors those that you moved from the motherboard ports to the HBA? Do what John Digital suggests to check the drives themselves, then work on the drive connections that may affect the checksum error response.

I'm honestly not sure, unfortunately. All four drives are connected to the same SFF-8087 to SATA breakout cable, which connects to the HBA card. Those ports were used by SSDs previously, and I know they all work fine. See my post above regarding SMART errors, of which there were none :/
 
Joined
Jan 7, 2015
Messages
1,155
Show the full report please; there are items in it that give hints to what's wrong (if anything). Your drives are 4 years old and will likely fail soonish. I consider any days over 4 years a practical win.

You can clear the errors and see if they return. I'd do a zpool clear tank (substitute your pool name for tank), then a scrub on that pool. Maybe a cable wasn't seated right or something. A scrub should expose it.
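
After a clear and a scrub, the thing to look at is whether any READ, WRITE, or CKSUM counter in the zpool status output has climbed above zero again. Here is a minimal Python sketch of that check. The sample text is invented for illustration (the original status was only posted as screenshots), the pool and device names are placeholders, and the parser assumes plain integer counters, so abbreviated values such as "1.2K" would be skipped.

```python
import re

# Hypothetical zpool status excerpt, NOT the original poster's output.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     2
            da3     ONLINE       0     0     5
"""

# A config row: name, upper-case state, then three plain integer counters.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank, or non-table line
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE_STATUS))
```

If the same devices show up again after a clean scrub, that points toward those specific drives or their cabling rather than a one-off event during the move.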
 
Joined
Jan 7, 2015
Messages
1,155
Also do a long test; you are doing short tests. Use smartctl -t long /dev/da# (replace # with the disk number).
 
Joined
Jan 7, 2015
Messages
1,155
Nice. Yeah, with drives over 4 years old it's not a matter of if, but when. I think my record so far is an old WD Green drive that lasted about 6 years 24/7.
 

Magius

Explorer
Joined
Sep 29, 2016
Messages
70
Nice. Yeah, with drives over 4 years old it's not a matter of if, but when. I think my record so far is an old WD Green drive that lasted about 6 years 24/7.
Sorry to rez the dead thread but I wouldn't be so harsh on old drives, especially ones as relatively young as 4 years! All drives, whether 1 day old or 10 years old are a matter of when, not if, so there's nothing magical about 4 years. That's only 35k hours if they were run 24/7, which isn't even that long in drive terms. In my opinion that just means they made it safely out of the infant mortality period and are now in the prime of their life, much less likely to spontaneously die than they were for the first couple years :) I use almost all used drives nowadays and most of the ones I buy have 40k+ hours on them before they get to me. My longest running drives that I bought new (both WD Greens, funny enough) are 11-12 years old with nearly 90k hours on them! In my opinion you'll lose a lot more drives in years 1-3 than you will in years 4-6, but I don't have anything other than anecdotal evidence on a few dozen drives to back me up so I won't argue that point too hard.

I will say that I built my older server in 2008, started with three 2TB drives and within a year or two expanded it to ten 2TBs with a hot spare. I've had 11 total drive failures in that server. Four in the first 1-3 years, two in years 4-6 and five in years 6-12. Again, not enough to mean anything statistically, just anecdotal. Strangely I had two failures on the same day in 2016 and two failures one day apart in 2019, those were exciting times! After my drives were out of warranty is when I started replacing them with used drives with 30k+ hours on them, so most of the drives in the system are now in the 60k+ hours range, with the two lucky "originals" just under 90k. Granted this is my backup server today. I replicate the main server to it once a week, but even in the main server I'm using all used drives with 50-55k hours on them. I have a lot of trust in drives that have proven they can run 24/7 for years on end :)
 
Joined
Jan 7, 2015
Messages
1,155
Well pointed out. It's all anecdotal, no doubt. When I've looked at similar drives purchased at the same time as ones I have that start failing, 40-50k hours is where I'm starting to see signs of failure. I think luck is the biggest factor in longevity. I've now replaced every drive I have at least once, and in my case, with a random mix of drives, strangely the Greens (all WDIDLED) lasted the longest. The Reds I've had have all seemed to start failing shortly after the 3-year warranty period ends, as if by design. My Toshiba X series are mostly still going, but their RMA process is a horrid mess, and I have since started the switch to IronWolfs.

I have one 3TB drive in my backup server that has recently been squawking about CRC errors; it's a WD disk with 47k hours. It is otherwise chugging along. It could go a couple more years, with luck.

Code:
Model Family:     Western Digital AV-GP (AF)
Device Model:     WDC WD30EURS-63SPKY0
Serial Number:    WD-WMC1T2754576
LU WWN Device Id: 5 0014ee 658860661
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jan  9 15:52:53 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       175
  3 Spin_Up_Time            0x0027   195   177   021    Pre-fail  Always       -       5216
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       339
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   036   036   000    Old_age   Always       -       47445
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       339
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       266
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3075
194 Temperature_Celsius     0x0022   125   106   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       36

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     47431         -
# 2  Extended offline    Completed without error       00%     47033         -
# 3  Extended offline    Completed without error       00%     46705         -
# 4  Extended offline    Completed without error       00%     46311         -
# 5  Extended offline    Completed without error       00%     45970         -
# 6  Extended offline    Completed without error       00%     45562         -
# 7  Extended offline    Interrupted (host reset)      30%     44912         -
# 8  Extended offline    Completed without error       00%     40961         -
# 9  Short offline       Completed without error       00%     40950         -
#10  Short offline       Completed without error       00%     40948         -
#11  Short offline       Completed without error       00%     40947         -
#12  Short offline       Completed without error       00%     40946         -
#13  Short offline       Completed without error       00%     40945         -
#14  Short offline       Completed without error       00%     40944         -
#15  Short offline       Completed without error       00%     40943         -
#16  Short offline       Completed without error       00%     40942         -
#17  Short offline       Completed without error       00%     40941         -
#18  Short offline       Completed without error       00%     40941         -
#19  Short offline       Completed without error       00%     40940         -
#20  Short offline       Completed without error       00%     40939         -
#21  Short offline       Completed without error       00%     40938         -
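
For anyone skimming a report like the one above, here is a rough Python sketch that pulls out the attributes people usually watch and reports any with a nonzero raw value. The rows are copied from the attribute table above. The watch list (IDs 5, 197, 198, 199) is my own assumption about what matters, not a smartctl feature, and raw values that are not plain integers are ignored.

```python
# Attribute rows copied from the smartctl report above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       4
"""

# Assumed watch list: reallocated, pending, uncorrectable sectors and CRC errors.
WATCHED = {5, 197, 198, 199}


def nonzero_watched(table_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A full attribute row has 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


if __name__ == "__main__":
    print(nonzero_watched(SMART_ROWS))
```

On this drive only UDMA_CRC_Error_Count comes back nonzero. That counter generally tracks errors on the link between drive and controller, so a cable or connector is the usual first suspect rather than the platters.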
 