SOLVED single "Read Errors" on different drives

Lesani

Dabbler
Joined
Apr 15, 2022
Messages
25
//EDIT:
Issue was solved after various debug steps by replacing the HBA.

The original post:
//EDIT



I just built a 12x4TB NAS out of old server parts, as many do, and have started filling it with data. After filling it I moved some data between datasets, and that's when two drives each got a single "read" error in "zpool status". I cleared the errors, moved some more data, and another drive or two got a single "read" error. I cleared them again and started a manual scrub to have everything checked; now two other drives each have a single "read" error:

Code:
        NAME                                      STATE     READ WRITE CKSUM
        Mountain                                  ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            68bb695f-fbe9-4c9a-b3b8-efb07da3f40d  ONLINE       0     0     0
            cda91862-9ef3-44ac-af79-0b74a98b6fd7  ONLINE       1     0     0
            f0ba8e22-0782-402e-951b-3fff0ec88c88  ONLINE       0     0     0
            8dd8ae60-8ca9-43bf-8876-6501a89a9be0  ONLINE       0     0     0
            63539e45-6fd8-4e0e-bfa8-809e557d1a3c  ONLINE       0     0     0
            02cf3def-9e0f-416d-b2d0-2ab94189b046  ONLINE       1     0     0
            0aa9247b-2ca4-4d9f-a0c8-280c160bc2a2  ONLINE       0     0     0
            356ca48c-dd14-4432-8f07-5a758b4c07d5  ONLINE       0     0     0
            36be4b26-beda-4b4c-9d5d-a818e80f11f1  ONLINE       0     0     0
            29bf742f-6801-4e4b-a1ef-a44dde8da2fc  ONLINE       0     0     0
            dea5f0c9-632c-48b3-bde2-823ecf8046e3  ONLINE       0     0     0
            f10e9159-936c-45df-a525-84633d7659e7  ONLINE       0     0     0


Because of all the clears, only the last two errors (post-scrub) are visible, but in total I think six drives had one error and one had two errors, all "read".
I have googled and read here on this forum about CRC errors being a cable problem and "write" errors meaning "replace that drive a.s.a.p.", but somehow I haven't found much on "read" errors apart from one thread describing them as "mostly harmless", I believe. So I wanted to ask what's up with this behavior. Is it a controller issue, seeing as it is spread over different drives? It's a "SAS2008 PCI-Express Fusion-MPT SAS-2" card.
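
Since each `zpool clear` resets the counters, the running total per drive is easy to lose. Below is a minimal Python sketch of one way to keep it, assuming the plain-text layout of the `zpool status` table above. The function names and the `SAMPLE` excerpt are illustrative only and are not part of any ZFS tooling.

```python
from collections import Counter

# Excerpt of the `zpool status` device table posted above.
SAMPLE = """\
        NAME                                      STATE     READ WRITE CKSUM
        Mountain                                  ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            68bb695f-fbe9-4c9a-b3b8-efb07da3f40d  ONLINE       0     0     0
            cda91862-9ef3-44ac-af79-0b74a98b6fd7  ONLINE       1     0     0
            02cf3def-9e0f-416d-b2d0-2ab94189b046  ONLINE       1     0     0
"""


def parse_error_counts(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    counts = {}
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, numbers = fields[0], fields[2:5]
        # Skip the header and any row whose counters are not plain integers
        # (abbreviated values such as "1.2K" are ignored by this sketch).
        if not all(n.isdigit() for n in numbers):
            continue
        read, write, cksum = (int(n) for n in numbers)
        if read or write or cksum:
            counts[name] = (read, write, cksum)
    return counts


def accumulate_reads(history, status_text):
    """Add the READ counts from one status snapshot to a running Counter."""
    for name, (read, _write, _cksum) in parse_error_counts(status_text).items():
        history[name] += read
    return history


history = Counter()
accumulate_reads(history, SAMPLE)  # call once before each `zpool clear`
print(dict(history))
```

Feeding it one snapshot before every clear would show whether the errors keep moving between drives (pointing away from a single bad disk) or pile up on the same ones.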
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,949
1. What's the complete hardware spec? See forum rules.
2. Is the HBA flashed to IT mode?
3. Is the HBA well cooled?
 

Lesani

1. ...
Dell R510
2x X5670
128 GB ECC, 1333 MHz
Dual onboard 1 Gb NICs (waiting on a 40 Gb QSFP+ Intel NIC)
2. Yes, it is flashed to IT mode.
3. The server chassis is closed, and the fans are fully populated and spinning.
 

NugentS

I don't see 1 read error on a drive as being serious. However I don't like them coming back somewhat randomly.
Not sure what to suggest
 

NugentS

Given they are read errors, how about stress testing one disk at a time using a non-destructive badblocks run?
 

Lesani

Sure, how do I do that? I am currently running the Dell hardware diagnostics. This is going to take three days if it doesn't stop with errors (the first disk of 12 is halfway done after 3 hours, so 72+ hours predicted). After that I can run any test you deem beneficial.
 

NugentS

Read up on badblocks. If you use it, it will need to be a non-destructive run, as otherwise it will seriously risk the pool.
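
As a rough illustration of the difference between the modes, here is a small Python sketch that only builds (and never runs) a badblocks command line. It assumes the standard badblocks flags: no mode flag for the default read-only test, `-n` for the non-destructive read-write test, `-s` for progress, and `-v` for verbose output. The helper name `badblocks_command` and the device path `/dev/sdX` are placeholders made up for this example.

```python
import shlex


def badblocks_command(device, mode="read-only"):
    """Build (but do not run) a badblocks command line for one disk."""
    mode_flags = {
        "read-only": [],            # badblocks default: reads only
        "non-destructive": ["-n"],  # read-write test that restores the data
    }
    if mode not in mode_flags:
        # Deliberately no entry for the destructive write test (-w),
        # which would overwrite a disk that belongs to the pool.
        raise ValueError(f"unsupported mode: {mode!r}")
    return ["badblocks", "-s", "-v", *mode_flags[mode], device]


# /dev/sdX is a placeholder; substitute the real device node with care.
print(shlex.join(badblocks_command("/dev/sdX", "non-destructive")))
```

Even the non-destructive mode temporarily rewrites blocks, so it seems safest to run it only on a disk the pool is not actively using at the time.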
 

Lesani

Soo... I observed a bit and actually did a full (destructive) badblocks test on the drives my pool is now on. They ran through without any errors.

Still, over the past 8 days of (very little) activity on the pool they have again accumulated a few "read" errors:
FZ5XbfO.png


Don't know what to do anymore, the pool works fine, but I don't feel safe with my data and - in case the read errors don't really matter - I still can't use the e-mail alert feature since I don't know if the alert I'm getting is actually relevant or one of the "read" errors
 

Lesani

I keep having issues with read errors... ALL disks have been checked with badblocks, ALL disks are reported OK in smartctl... I just don't know what to do. They sit without errors if I'm not using the NAS, but when I have lots of read/write operations on it (copying files etc.) I get these read errors, ONLY read errors, never write or checksum...

CITk65x.png
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
@Lesani, can you provide the output of sas2flash? This will confirm if your HBA is in IT mode.

How is the HBA connected to the drives? Can you describe your cabling? This is probably a cabling or EMI problem.
 

Lesani

It only gives these 3 lines:

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

Which makes me wonder... but the seller described it as being in IT mode (server was prebuilt)

You can see the cabling in this photo I took when I got the server; it's those 2 cables from the card in the middle to the backplane. I have re-seated those cables and all drives.

YPb2gCJ.jpg
 

Lesani

EGOXJHi.png
 

Samuel Tai

Could you run sas2flash -cpci 00:04:00:00 -list? The FW version is an IT version, but having a detailed output would confirm.

I'm also concerned about the cable routing next to the power supply. You typically want to keep these cables away from strong magnetic fields.
 

Lesani

Edl7TuZ.png


Seems to be IT mode.

Not sure how else to route them. They are taking the closest route; the rest of the area is blocked by the fans...
I could only try moving them so that, within the bunch of cables routed through the hole next to the fans, they are the farthest from the power supply.
 

Samuel Tai

Try replacing both cables. Does the HBA have adequate ventilation?
 

Lesani

It's in the (closed) 2U server chassis you see above, with all fans populated and spinning.
//edit: in a cool environment//

I'll get new cables and report back. This is new to me. I assume not all SAS cables are made equal; is there anything specific I should look out for when getting new ones?
 

Samuel Tai

Same spec cable should do. Is there any way to route the cables using the same path as that folded ribbon cable?
 

Lesani

Will try rerouting and let it run for a bit, then report back. Thanks for the help in the meantime, @Samuel Tai.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
What are you seeing in dmesg? Are there CAM errors?

I would expect that read errors should also show up with long SMART tests on the drives. Have you done those?
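
One possible way to narrow down the dmesg output is a small keyword filter. The Python sketch below is only an assumption about what is worth looking for: "CAM status" messages from FreeBSD's CAM layer, lines mentioning the `mps` (FreeBSD) or `mpt2sas`/`mpt3sas` (Linux) drivers used for SAS2008-based HBAs, and generic "I/O error" messages. The pattern and the function name are illustrative and may need adjusting to the actual log format.

```python
import re

# Assumed keywords; extend or tighten this list for the system at hand.
PATTERN = re.compile(
    r"CAM status|\bmps\d*\b|mpt[23]sas|I/O error",
    re.IGNORECASE,
)


def suspicious_lines(dmesg_text):
    """Return the lines of a dmesg dump that match any assumed keyword."""
    return [line for line in dmesg_text.splitlines() if PATTERN.search(line)]


# Example of feeding it live output (not executed here):
#   import subprocess
#   text = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   for line in suspicious_lines(text):
#       print(line)
```

Matching lines that appear at the same time as a new "read" error in zpool status would be the ones worth posting.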
 

Lesani

sretalla said: "What are you seeing in dmesg? Are there CAM errors? I would expect that read errors should also show up with long SMART tests on the drives. Have you done those?"

I have not only done SMART tests, I have even done full badblocks runs of all drives... without issues. AFAIK a (long) SMART test only checks within the drive, whereas badblocks also transfers the data over the interface.

Samuel Tai said: "Same spec cable should do. Is there any way to route the cables using the same path as that folded ribbon cable?"

I have now moved the cable as far away from the power supplies as possible, next to the ribbon cable.



Will let it run now, and also report on dmesg entries if the problem shows up again
 