SOLVED single "Read Errors" on different drives

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
What are the disk manufacturers and models in this second pool?
 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
I only really have one pool at this time.

I had HGST drives

=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS724040ALS640
Revision: A3A0
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches

I had the feeling the Dell (Seagate) drives gave fewer errors, so I am now slowly replacing the ever-unhealthy, degrading and faulting HGSTs with:

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST4000NM0023
Revision: GS0F
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches


I have replaced the SAS cables with brand-new ones and routed them as far away from the power delivery as possible in this server chassis. I am still running into the same issue, with read errors and drives faulting sooner or later. The HGSTs seem to be the worst offenders: the last HGST drive (SDD) is showing 62 read errors and is pending replacement. There is currently a resilver going on after I replaced an HGST with a Dell/Seagate drive (SDI), which is already showing checksum errors as well...

[Image: yRenmTP.png]
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
Well, if it isn't the disks, it must be the controller. Maybe?
 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
What's a good HBA to get that doesn't break the bank? I have 2 mini-SAS connectors going to the backplane.

SDD (HGST) faulted overnight with too many errors… SDI (Dell/Seagate) is accumulating more and more checksum errors.
 

Attachments

  • 2E4A7F47-4AA9-4336-99B0-97915BE48ACA.jpeg

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
The answer to that question is always LSI - it then depends on what flavour you want.
Caveats:
1. Not MegaRAID
2. It needs to be flashed to IT mode

 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
Quick update: brand-new cables, different drives (the Dell drives described above), same situation. Drives get faulted because of read errors.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Unfortunately, it appears your HBA is going bad.
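The reasoning here is that errors spread across several unrelated drives, even after a cable swap, point at something they all share. A minimal sketch of that check, assuming you paste in the text that `zpool status` prints: the sample below is hypothetical (pool name, device names and counts are made up), and the function name `devices_with_errors` is just for illustration.

```python
import re

# Hypothetical sample in the layout `zpool status` prints.
# Pool name, device names and counts are made up for illustration.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdd     FAULTED     62     0     0  too many errors
            sdi     ONLINE       0     0    14
            sdk     ONLINE       3     0     0

errors: No known data errors
"""

POOL_LINE = re.compile(r"^\s*pool:\s+(\S+)")
# Name, state, then the READ / WRITE / CKSUM counters as plain integers.
# Abbreviated counts (e.g. "1.2K") are not handled in this sketch.
DEVICE_ROW = re.compile(
    r"^\s+(\S+)\s+(?:ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)
# Assumed prefixes of grouping vdevs, which are not physical drives.
VDEV_PREFIXES = ("raidz", "mirror", "draid", "spare", "replacing")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for leaf devices with nonzero counters."""
    pool = None
    found = {}
    for line in status_text.splitlines():
        pool_match = POOL_LINE.match(line)
        if pool_match:
            pool = pool_match.group(1)
            continue
        row = DEVICE_ROW.match(line)
        if not row:
            continue
        name = row.group(1)
        if name == pool or name.startswith(VDEV_PREFIXES):
            continue  # skip the pool row and grouping vdevs
        counts = tuple(int(n) for n in row.group(2, 3, 4))
        if any(counts):
            found[name] = counts
    return found


if __name__ == "__main__":
    for dev, (r, w, c) in devices_with_errors(SAMPLE).items():
        print(f"{dev}: read={r} write={w} cksum={c}")
```

If more than one drive shows up in that list and the drives are different models on fresh cables, the shared parts (HBA, backplane, power) become the more likely suspects than any single disk.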
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I note that SCALE at the version you report seems to run ZFS 2.1.2, which may be impacted by this:


 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
Quick update: the new HBA arrived, and the pool is now resilvering (80%) after I replaced a drive showing as degraded. Apart from a few CRC errors when scrubbing before the replace/resilver (which were probably real, because of the ordeal before), no additional errors have appeared yet. With the other HBA, a replace/resilver already led to lots of "read" errors on all drives, so I assume the problem is fixed and was indeed the HBA. I will update one last time in a few days, when I have transferred some data again.
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
As mentioned above, it could still be cooling. I have an LSI HBA and it worked fine for over a year. Then errors started to occur randomly, although they increased when the HBA was taxed during a scrub twice a month. I ended up attaching a 40 mm Noctua fan to the heat sink and the errors went away. I have a Supermicro server case that I modified by taking out the forced-air jet-engine fans. While I have seven fans inside the box, I concluded that the airflow is just not good enough for the HBA.
 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
Okay, final update: it was the HBA. I got the exact same HBA, only the board looks a bit newer than the one my server was shipped with, and this one has not produced any errors since it went into the system, apart from those during the initial resilver/scrub directly after the swap.

This issue is solved. I do not think overheating can be the issue, since the HBA is identical, except maybe that the thermal compound on the "older" one may be completely dried out. Thanks for all the help!
 

revengineer

Contributor
Joined
Oct 27, 2019
Messages
193
I am glad it's working for you again. These HBAs should last forever. If you find that this new one suffers an untimely death you can revisit the idea of a thermal issue. A 40 mm fan is only ~$10, just saying...
 

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
I will actually even replace the thermal compound on the "old" one and see if it runs again without errors as well ;) No worries, your idea/suggestion is duly noted! Thanks!
 