Checksum errors on pool drive causing "unhealthy" state

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
We started getting checksum errors on one of our drives about a month ago, which pushed the pool first to an unhealthy and eventually to a degraded state. So we figured we had a bad disk, and we replaced it.

The problem recurred with the brand new disk. That seemed odd, so I started googling.

The consensus here is that it's often a faulty cable or bad connection. So, as part of this thread (VMware, iSCSI, dropped connections and lockups), I swapped out my drive controllers AND cables. The problem persists.
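For anyone following along, this is roughly what I've been doing after each hardware swap (the pool and device names here are placeholders, not our actual ones):

    # Show per-device read/write/checksum error counters
    zpool status -v tank

    # After swapping a cable/controller, zero the counters...
    zpool clear tank

    # ...then force a full read of every block to see if new errors appear
    zpool scrub tank

The CKSUM column in zpool status is what keeps climbing for that one drive.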

So, we thought "maybe the backplane is the cause" and we swapped the disk to a different port on a different backplane. The problem stays with the drive.

Now, normally, I would say "ok, so it's a bad drive" and replace it. But when the original drive and a BRAND NEW drive have exactly the same issue, and I've replaced all the connections, I am at a loss.

What else can I do to get to the bottom of this?

NOTE: The original drives are all HGST HUH728080AL4200 8TB disks. When we replaced the original disk, we were unable to find that model brand new -- only refurb, so we replaced it with a model that had the same base specs (performance, size, RPM, sectors [4kn]), a SEAGATE STMPSKD1CLAR8000.

I know it's not optimal to use a different drive, but since the problem is identical to what we saw with the HGST, it's hard to say for certain that the mismatch is the cause. I have not had time to take the original HGST drive and run a long test on it in another machine, but I suspect that the original drive is just fine and something else is causing these checksum errors.
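When I do find the time, the plan is something along these lines on the test machine (the device name is just an example):

    # Start the drive's internal long self-test (takes many hours on an 8TB disk)
    smartctl -t long /dev/da5

    # Check on progress, and see the result once it finishes
    smartctl -a /dev/da5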

I'm not sure whether my other thread directly relates to this, but I will post a link to this one there as well, so people have the full story of what is going on with our system.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
I would try replacing all of the drives in that vdev as a test... maybe with another brand, but with the same capacity and interface. A rough sketch of what I mean is below.
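Something like this, one disk at a time, letting each resilver finish before moving on (pool and device names are made up):

    # Swap out one vdev member for a test drive and wait for the resilver
    zpool replace tank da3 da9

    # Watch resilver progress and the error counters
    zpool status -v tank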
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
Yeah, it would be nice if I could just buy a bunch of drives to test that... My IT budget doesn't work that way, sadly.

We are planning to build another server (for offsite backups) soon that will be based on 8TB drives, so I may be able to steal a couple of those to test this idea, to keep all drives in a given vdev the same.

However, since this problem started with matched drives and continues with mismatched ones, I am less certain that the cause is the mismatch. Of course, it's possible that I had a drive fail, and then a mismatch caused an identical symptom, but that seems a little far-fetched to me.
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
If you have literally changed everything else (HBA... done... cables... done... backplane... done... drives... call it done), the only thing I can see it being now is the PSU. I've seen a PSU send crappy power down one line before... it's rare, but this might be just that type of edge case. I am assuming you have two PSU units. I would:

1. Remove one PSU, replace the drives, and see if the issue (the drives erroring out) returns.
2. If it doesn't, swap in the other PSU, using the same slot (do not use the other slot at this time). If the problem appears with one PSU and disappears with the other, you have a bad PSU.
3. If there is still no issue, move that PSU to the other slot. If the issue appears there, you have your problem, and that would be a mainboard issue.
4. If both PSUs throw errors in the first slot, still switch to the other slot. If the problem disappears there, that first slot is bad.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Actually, it sounds to me like you replaced a failing drive with a refurbished drive. It's quite possible that the refurbished drive is faulty too. Given everything you've replaced, it does seem that the replacement drive is also faulty, which is not that surprising if it was originally a faulty drive itself (which is what a refurbished drive is).
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
Actually, it sounds to me like you replaced a failing drive with a refurbished drive. It's quite possible that the refurbished drive is faulty too. Given everything you've replaced, it does seem that the replacement drive is also faulty, which is not that surprising if it was originally a faulty drive itself (which is what a refurbished drive is).
While your theory is plausible, one thing: a refurbed drive is not automatically faulty. My R520 is nothing but refurbs, and they run fine. I have six more on the shelf that are going to go into another server I have incoming... if they pass burn-in, I'll deploy that server. I also have several smaller used drives (not refurbs) that have actually failed, and not a soft failure... YMMV of course. :)
 

2twisty

Contributor
Joined
Mar 18, 2020
Messages
145
When we bought it, the seller claimed that it was new... but smartctl says it has 18k hours on it. Grrrrrrrr.
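For anyone wondering, that figure comes straight out of the SMART data (device name is just an example):

    # SAS drives report "Accumulated power on time" in the health output
    smartctl -a /dev/da5 | grep -i power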
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
When we bought it, the seller claimed that it was new... but smartctl says it has 18k hours on it. Grrrrrrrr.
So it seems you got one bad drive... and then another. It happens...
 

Dice

Wizard
Joined
Dec 11, 2015
Messages
1,410
Have you run your new drives through the 'standard' burn-in testing process?
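Meaning, roughly, something like this on each new disk before it goes anywhere near the pool (destructive, so only on a drive with no data on it; the device name is an example):

    # 1. SMART long self-test
    smartctl -t long /dev/da5

    # 2. Full destructive write/read pass over every sector
    #    (-b 4096 for a 4Kn drive; -w WIPES the disk, -s shows progress)
    badblocks -b 4096 -ws /dev/da5

    # 3. Re-check SMART afterwards for reallocated or pending sectors
    smartctl -a /dev/da5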

 