Multiple disks detaching and reattaching right away

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Hello,

I need your help troubleshooting an issue we have with a FreeNAS backup server.
I'll go into more detail below, but in brief: from time to time there are errors accessing some of the disks, which end with the host aborting the request and detaching the disks, usually 2 and sometimes 4 of them, all at the same time... only to re-detect and re-attach them 10-15 seconds later.
So even though we have RAIDZ2, that often causes some corruption.

My question is, what could be causing this type of failure?
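For anyone who wants to reproduce the checks, the events show up in /var/log/messages; roughly something like this (a sketch only: the exact strings logged by the CAM/mps stack may differ, and the grep patterns are just examples):
Code:
# Pull CAM/detach messages for the da devices out of the system log
grep -iE 'da[0-9]+.*(detach|lost device|CAM status|timeout)' /var/log/messages

# List the devices the HBA/expander currently presents
camcontrol devlist

# Check the pool state and per-device error counters
zpool status -v vol1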

Here are some details on the system, which is running FreeNAS-9.10.2-U5 (yes, I know it's a bit old):
board: Supermicro H8QG6
memory: 80 GB
HBA controller: LSI 2308
SAS expander: Intel RES2SV240
volume disks: 20x Seagate NAS HDD ST8000VN0002
other disks: 2x Intel S3700 100GB for ZIL and 2x Samsung SM863 240GB for L2ARC

zpool configuration:
Code:
  pool: vol1
 state: ONLINE
  scan: scrub in progress since Fri Jul  5 12:01:35 2019
        5.67G scanned out of 30.5T at 10.7M/s, (scan is slow, no estimated time)
        0 repaired, 0.02% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    vol1                                            ONLINE       0    94     0
      raidz2-0                                      ONLINE       0   188     0
        gptid/49d0dd58-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ad07498-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4bd35639-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4cd2284c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ddefa41-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4ee2bdf9-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/4fdf1223-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/50ecf07b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/51f11d2c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/52fb6848-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
      raidz2-1                                      ONLINE       0     0     0
        gptid/54096b3c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/55106d03-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/561d43f4-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5729d98b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/582acafb-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/59506a31-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5a6461cc-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5b71ce0f-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5c893896-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/5d9d1ed0-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
    logs
      mirror-2                                      ONLINE       0     0     0
        gptid/70e566a7-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
        gptid/7146cdd6-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
    cache
      gptid/230deadc-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0
      gptid/2360c322-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0


Attached is a portion of the messages log from when one of these disconnect/reconnect events happened.
Looking at the disks' serial numbers in the logs across all the events, it seems it's always the same 4 disks (although sometimes only 2 of them), and they are all located in the raidz2-0 vdev.

Interestingly, they are also the 4 disks carried by port #3 on the SAS expander, so, suspecting a faulty cable or connection, yesterday I replaced it with a new SFF-8087 terminated cable, but that did not seem to help as the issue still persists.
Disk SMART tests don't show any failure.
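In case it's useful, this is roughly how I match serial numbers to da devices and run/re-check the SMART self-tests from the console (a sketch; da3 below is just a placeholder device, not necessarily one of the suspect disks):
Code:
# Map each disk device to its serial number, to match against the log entries
for d in $(sysctl -n kern.disks); do
  echo "== ${d}"
  smartctl -i /dev/${d} | grep -i 'serial'
done

# Kick off a long self-test on a suspect disk (da3 is a placeholder)
smartctl -t long /dev/da3

# ...and once it has finished, review the results and error counters
smartctl -a /dev/da3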

So, I'm a little puzzled by this multiple-disk failure (which doesn't seem to be permanent anyway, as the disks happily rejoin the pool shortly after). What, in your opinion, could be the cause, or at least, what wouldn't you consider a primary suspect?

Thanks
 

Attachments

  • itbu1_msg1.txt
    14.2 KB

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
You should look at the power cabling/supply to that same backplane where you changed the data cable.
 

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Thank you both for the initial suggestions.
I didn't think of checking the power cabling, but it's a good point.
Regarding the expander, the board has 6 ports and they are all used, unfortunately (one port to the HBA, the other 5 to the disks).
Should I decide to replace it, is it a 'transparent' component, i.e. would the system detect any change if I swapped in another one of the same model, or even a different model?
 
Joined
May 10, 2017
Messages
838
Should I decide to replace it, is it a 'transparent' component, i.e. would the system detect any change if I swapped in another one of the same model, or even a different model?

No, you can use any other expander, but I would swap ports first to confirm whether or not that's the problem.
 

tfran1990

Patron
Joined
Oct 18, 2017
Messages
294
Looks like it could be the HBA or the expander. The expander has 5 ports dedicated to disks and 1 to the HBA; switch the expander cable to the other port on the HBA and see if anything changes.

Also swap that cable to different ports on the expander and see if the problem follows.
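One rough way to see whether the problem follows the cable after a swap is to watch the log live for new detach events (again, the exact message wording may differ on your system):
Code:
# Watch the system log live for new disk detach / CAM error events
# after moving the cable, to see if the same disks (or new ones) drop out
tail -F /var/log/messages | grep -iE 'da[0-9]+.*(detach|lost device|CAM status|timeout)'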
 

maxxfi

Cadet
Joined
Jul 3, 2019
Messages
6
Update on the case: I wasn't able to determine whether it's an issue with the power cables, as I couldn't find a schematic of the disk backplane describing which of the many power connectors serves which group of disks. For the time being I've 'patched' the system by rerouting the four cache disks to the motherboard SATA connectors, moving the disks that were suffering connectivity problems to the bays previously used by the cache disks, and leaving the unhealthy bays empty.
After clearing the existing errors and scrubbing, the zpool is finally reported as OK.
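For reference, that was just the standard clear-and-scrub sequence, roughly:
Code:
# Reset the accumulated read/write/checksum error counters on the pool
zpool clear vol1

# Start a full scrub and then check the result when it completes
zpool scrub vol1
zpool status vol1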
In the near future, as soon as the budget allows, we're going to order a new disk backplane (possibly with an integrated expander) and hopefully fix the issue once and for all.
 