Multiple disks detaching and reattaching right away

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Hello,

I need your help troubleshooting an issue we have with a FreeNAS backup server.
I'll go into more detail below, but in brief: from time to time there are errors accessing some of the disks, which end with the host aborting the request and detaching the disks, usually 2 and sometimes 4 of them, all at the same time... only to re-detect and re-attach them 10-15 seconds later.
So even though we have RAIDZ2, that often causes some corruption.

My question is, what could be causing this type of failure?
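For anyone who wants to reproduce the checks, the events show up in /var/log/messages; roughly something like this (a sketch only: the exact strings logged by the CAM/mps stack may differ, and the grep patterns are just examples):
Code:
# Pull CAM/detach messages for the da devices out of the system log
grep -iE 'da[0-9]+.*(detach|lost device|CAM status|timeout)' /var/log/messages

# List the devices the HBA/expander currently presents
camcontrol devlist

# Check the pool state and per-device error counters
zpool status -v vol1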

Here are some details on the system, which is running FreeNAS-9.10.2-U5 (yes, I know it's a bit old):
board: Supermicro H8QG6
memory: 80 GB
HBA controller: LSI 2308
SAS expander: Intel RES2SV240
volume disks: 20x Seagate NAS HDD ST8000VN0002
other disks: 2x Intel S3700 100GB for ZIL and 2x Samsung SM863 240GB for L2ARC

zpool configuration:
Code:
  pool: vol1
 state: ONLINE
  scan: scrub in progress since Fri Jul  5 12:01:35 2019
        5.67G scanned out of 30.5T at 10.7M/s, (scan is slow, no estimated time)
        0 repaired, 0.02% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    vol1                                            ONLINE       0    94     0
      raidz2-0                                      ONLINE       0   188     0
        gptid/49d0dd58-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ad07498-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4bd35639-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4cd2284c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ddefa41-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ee2bdf9-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4fdf1223-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/50ecf07b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/51f11d2c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/52fb6848-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
      raidz2-1                                      ONLINE       0     0     0
        gptid/54096b3c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/55106d03-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/561d43f4-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5729d98b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/582acafb-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/59506a31-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5a6461cc-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5b71ce0f-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5c893896-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5d9d1ed0-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
    logs
      mirror-2                                      ONLINE       0     0     0
        gptid/70e566a7-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/7146cdd6-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
    cache
      gptid/230deadc-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0
      gptid/2360c322-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0


Attached is a portion of the messages log from when one of these disconnect/reconnect events happened.
Looking at the disks' serial numbers in the logs across all the events, it seems it's always the same 4 disks (although sometimes only 2 of them), and they are all located in the raidz2-0 vdev.

Interestingly, they are also the 4 disks carried by port #3 on the SAS expander, so, suspecting a faulty cable or connection, yesterday I replaced it with a new SFF-8087 terminated cable, but that did not seem to help as the issue still persists.
Disk SMART tests don't show any failure.
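In case it's useful, this is roughly how I match serial numbers to da devices and run/re-check the SMART self-tests from the console (a sketch; da3 below is just a placeholder device, not necessarily one of the suspect disks):
Code:
# Map each disk device to its serial number, to match against the log entries
for d in $(sysctl -n kern.disks); do
  echo "== ${d}"
  smartctl -i /dev/${d} | grep -i 'serial'
done

# Kick off a long self-test on a suspect disk (da3 is a placeholder)
smartctl -t long /dev/da3

# ...and once it has finished, review the results and error counters
smartctl -a /dev/da3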

So, I'm a little puzzled by this multiple-disk failure (which doesn't seem to be permanent anyway, as the disks happily rejoin the pool shortly after). What, in your opinion, could be the cause, or at least, what wouldn't you consider a primary suspect?

Thanks
 

Attachments

  • itbu1_msg1.txt
    14.2 KB

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You should look at the power cabling/supply to that same backplane where you changed the data cable.
 

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Thank you both for the initial suggestions.
I didn't think of checking the power cabling, but it's a good point.
Regarding the expander, the board has 6 ports and they are all used, unfortunately (one port to the HBA, the other 5 to the disks).
Should I decide to replace it, is it a 'transparent' component, i.e. would the system detect any change if I swapped in another one of the same model, or even a different model?
 
Joined
May 10, 2017
Messages
838
Should I decide to replace it, is it a 'transparent' component, i.e. would the system detect any change if I swapped in another one of the same model, or even a different model?

No, you can use any other expander, but I would swap ports first to confirm whether or not that's the problem.
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
Looks like it could be the HBA or the expander. The expander has 5 ports dedicated to disks and 1 to the HBA; switch the expander cable to the other port on the HBA and see if anything changes.

Also swap that cable to different ports on the expander and see if the problem follows.
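One rough way to see whether the problem follows the cable after a swap is to watch the log live for new detach events (again, the exact message wording may differ on your system):
Code:
# Watch the system log live for new disk detach / CAM error events
# after moving the cable, to see if the same disks (or new ones) drop out
tail -F /var/log/messages | grep -iE 'da[0-9]+.*(detach|lost device|CAM status|timeout)'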
 

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Update on the case: I wasn't able to determine whether it's an issue with the power cables, as I couldn't find a schematic of the disk backplane describing which of the many power connectors serves which group of disks. For the time being I've 'patched' the system by rerouting the four cache disks to the motherboard SATA connectors, moving the disks that were suffering connectivity problems to the bays previously used by the cache disks, and leaving the unhealthy bays empty.
After clearing the existing errors and scrubbing, the zpool is finally reported as OK.
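For reference, that was just the standard clear-and-scrub sequence, roughly:
Code:
# Reset the accumulated read/write/checksum error counters on the pool
zpool clear vol1

# Start a full scrub and then check the result when it completes
zpool scrub vol1
zpool status vol1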
In the near future, as soon as the budget allows, we're going to order a new disk backplane (possibly with an integrated expander) and hopefully fix the issue once and for all.
 