Checksum errors on pool drive causing "unhealthy" state

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
We started getting checksum errors on one of our drives about a month ago, which pushed the pool first to an unhealthy and eventually to a degraded state. So we figured we had a bad disk, and we replaced it.

The problem recurred with the brand new disk. That seemed odd, so I started googling.

The consensus here is that it's often a faulty cable or bad connection. So, as part of this thread (VMware, iSCSI, dropped connections and lockups), I swapped out my drive controllers AND cables. The problem persists.
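For anyone following along, this is roughly what I've been doing after each hardware swap (the pool and device names here are placeholders, not our actual ones):

    # Show per-device read/write/checksum error counters
    zpool status -v tank

    # After swapping a cable/controller, zero the counters...
    zpool clear tank

    # ...then force a full read of every block to see if new errors appear
    zpool scrub tank

The CKSUM column in zpool status is what keeps climbing for that one drive.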

So, we thought "maybe the backplane is the cause" and we swapped the disk to a different port on a different backplane. The problem stays with the drive.

Now, normally, I would say "ok, so it's a bad drive" and replace it. But when the original drive and a BRAND NEW drive have exactly the same issue, and I've replaced all the connections, I am at a loss.

What else can I do to get to the bottom of this?

NOTE: The original drives are all HGST HUH728080AL4200 8TB disks. When we replaced the original disk, we were unable to find that model brand new -- only refurb, so we replaced it with a model that had the same base specs (performance, size, RPM, sectors [4kn]), a SEAGATE STMPSKD1CLAR8000.

I know it's not optimal to use a different drive, but since the problem is identical to what we saw with the HGST, it's hard to say for certain that the mismatch is the cause. I have not had time to take the original HGST drive and run a long test on it in another machine, but I suspect that the original drive is just fine and something else is causing these checksum errors.
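When I do find the time, the plan is something along these lines on the test machine (the device name is just an example):

    # Start the drive's internal long self-test (takes many hours on an 8TB disk)
    smartctl -t long /dev/da5

    # Check on progress, and see the result once it finishes
    smartctl -a /dev/da5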

I'm not sure whether my other thread directly relates to this, but I will post a link to this one there as well, so people have the full story of what is going on with our system.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I would try replacing all of the drives in that vdev as a test... maybe with another brand, but with the same capacity and interface. A rough sketch of what I mean is below.
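Something like this, one disk at a time, letting each resilver finish before moving on (pool and device names are made up):

    # Swap out one vdev member for a test drive and wait for the resilver
    zpool replace tank da3 da9

    # Watch resilver progress and the error counters
    zpool status -v tank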
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Yeah, it would be nice if I could just buy a bunch of drives to test that... My IT budget doesn't work that way, sadly.

We are planning to build another server (for offsite backups) soon that will be based on 8TB drives, so I may be able to steal a couple of those to test this idea, to keep all drives in a given vdev the same.

However, since this problem started with matched drives and continues with mismatched ones, I am less certain that the cause is the mismatch. Of course, it's possible that I had a drive fail, and then a mismatch caused an identical symptom, but that seems a little far-fetched to me.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
If you have literally changed everything else (HBA... done... cables... done... backplane... done... drives... call it done), the only thing I can see it being now is the PSU. I've seen a PSU send crappy power down one line before... it's rare, but this might be just that type of edge case. I am assuming you have two PSU units. I would:

1. Remove one PSU, replace the drives, and see if the issue (the drives erroring out) returns.
2. If it doesn't, swap in the other PSU, using the same slot (do not use the other slot at this time). If the problem appears with one PSU and disappears with the other, you have a bad PSU.
3. If there is still no issue, move that PSU to the other slot. If the issue appears there, you have your problem, and that would be a mainboard issue.
4. If both PSUs throw errors in the first slot, still switch to the other slot. If the problem disappears there, that first slot is bad.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Actually, it sounds to me like you replaced a failing drive with a refurbished drive. It's quite possible that the refurbished drive is faulty too. Given everything you've replaced, it does seem that the replacement drive is also faulty, which is not that surprising if it was originally a faulty drive itself (which is what a refurbished drive is).
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
Actually, it sounds to me like you replaced a failing drive with a refurbished drive. It's quite possible that the refurbished drive is faulty too. Given everything you've replaced, it does seem that the replacement drive is also faulty, which is not that surprising if it was originally a faulty drive itself (which is what a refurbished drive is).
While your theory is plausible, one thing: a refurbed drive is not automatically faulty. My R520 is nothing but refurbs, and they run fine. I have six more on the shelf that are going to go into another server I have incoming... if they pass burn-in, I'll deploy that server. I also have several smaller used drives (not refurbs) that have actually failed, and not a soft failure... YMMV of course. :)
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
When we bought it, the seller claimed that it was new... but smartctl says it has 18k hours on it. Grrrrrrrr.
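For anyone wondering, that figure comes straight out of the SMART data (device name is just an example):

    # SAS drives report "Accumulated power on time" in the health output
    smartctl -a /dev/da5 | grep -i power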
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
When we bought it, the seller claimed that it was new... but smartctl says it has 18k hours on it. Grrrrrrrr.
So it seems you got one bad drive... and then another. It happens...
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Have you run your new drives through the 'standard' burn-in testing process?
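Meaning, roughly, something like this on each new disk before it goes anywhere near the pool (destructive, so only on a drive with no data on it; the device name is an example):

    # 1. SMART long self-test
    smartctl -t long /dev/da5

    # 2. Full destructive write/read pass over every sector
    #    (-b 4096 for a 4Kn drive; -w WIPES the disk, -s shows progress)
    badblocks -b 4096 -ws /dev/da5

    # 3. Re-check SMART afterwards for reallocated or pending sectors
    smartctl -a /dev/da5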

 