Multiple simultaneous disk failures?

Status
Not open for further replies.

trschilke

Cadet
Joined
Jan 24, 2016
Messages
5
This morning, I found that the file shares I had mounted from my FreeNAS device were not working. I was unable to connect to the box via the web interface or SSH, so I went to the console and found a string of SCSI errors for multiple drives. Upon reboot, all of the errors returned as soon as the server tried to import the main ZFS storage volume (the root volume is fine).

The errors are:

WRITE(10). CDB: {string of hex}
CAM status: SCSI Status Error
SCSI status: Check Condition
SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
Error 5, Retries exhausted

I am receiving these errors on seven (yes seven!) disks, and all of them were fine yesterday. The disks are of varying ages (a couple years to a couple months) and are connected to three different physical controllers. They all have the same SMART status as well, a copy of which is attached for four of my drives.

I'm at a loss, since I can't believe that this many drives all failed at the exact same time. Can anyone shed some light on my problem, and on whether there's any way to repair it other than RMA-ing all of the drives and restoring from backups?
 

Attachments

  • smartctl da0.txt
    5.8 KB · Views: 351
  • smartctl da1.txt
    5.8 KB · Views: 327
  • smartctl da2.txt
    5.8 KB · Views: 372
  • smartctl da3.txt
    5.8 KB · Views: 302

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215

trschilke

Cadet
Joined
Jan 24, 2016
Messages
5
  • Hardware
    • MB: SUPERMICRO MBD-X10SRA-F-O
    • RAM: Crucial 32GB Kit (16GBx2) DDR4-2133 MT/s (PC4-2133) CL15 DR x4 ECC RDIMM
    • Controllers (HBA, etc): 3xIBM LSI ServeRAID M1015 flashed to IT mode
    • Drives (Storage and OS):
      STORAGE: 23xWD Red 3TB (WD30EFRX) via NORCO RPC-4224 backplanes (one backplane port isn't connecting right)
      OS: Mushkin Enhanced ECO2 2.5" 512GB SATA III Internal Solid State Drive (SSD)
  • FreeNAS Version: FreeNAS-9.3-STABLE-201512121950
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Hmm, first guess is that there may be an issue with the backplane itself. But since you have 23 drives attached, I am hesitant to recommend trying to bypass the backplane with fanout/breakout cables...
 

trschilke

Cadet
Joined
Jan 24, 2016
Messages
5
I thought there might be an issue with the backplane too, but the errors are spread across four backplanes as well.
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
Are the drives showing the issues connected to the same M1015 HBA? I'm trying to determine where the single source of failure would be, since it seems like you have things pretty well set up for redundancy.
 

trschilke

Cadet
Joined
Jan 24, 2016
Messages
5
Mirfster said: "Are the drives showing the issues connected to the same M1015 HBA? Trying to determine where the single source of failure would be since it seems like you have things pretty well setup for redundancy."

They're on different controllers as well...

layout.png
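
For anyone following along, the "single source of failure" check discussed above can be sketched in a few lines of Python. This is only an illustration: the disk-to-HBA and disk-to-backplane mapping below is hypothetical (the real mapping is in the attached layout image), and the function name `shared_components` is made up for this example.

```python
from collections import defaultdict


def shared_components(failing, layout):
    """Return {kind: value} for each component kind that ALL failing disks share."""
    seen = defaultdict(set)
    for disk in failing:
        for kind, value in layout[disk].items():
            seen[kind].add(value)
    # A kind is a candidate single point of failure only if exactly one
    # value of it was seen across every failing disk.
    return {kind: next(iter(vals)) for kind, vals in seen.items() if len(vals) == 1}


# HYPOTHETICAL layout, for illustration only; not the poster's real wiring.
layout = {
    "da0": {"hba": "hba0", "backplane": "bp0"},
    "da5": {"hba": "hba1", "backplane": "bp2"},
    "da9": {"hba": "hba2", "backplane": "bp3"},
}

print(shared_components(["da0", "da5", "da9"], layout))
```

An empty result means none of the listed component types is common to every failing disk, which is the situation described in this post. It says nothing about components that are not in the mapping.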
 

trschilke

Cadet
Joined
Jan 24, 2016
Messages
5
I actually have a couple of backplanes from an old NORCO chassis that I was going to get rid of. Let me see if I can get some of the hard disks relocated and see if that solves the problem.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Of the drives you provided data for, the recorded failures on all but da0 are old. It also appears that you have never run a SMART test on these drives. I would recommend running a long SMART test on all of your drives to see what shakes out of the trees. The errors recorded by the drives do not appear to come from a hardware issue with the drives themselves, but the long SMART test will shed some light on whether you have drive failures that you are not aware of.
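
As a rough sketch of the recommendation above, a small Python helper could build the `smartctl -t long` command for each drive. The device names are an assumption: only da0 through da3 appear in the attachments, so extend the list to match your own system. The function names are made up for this example, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def long_test_commands(devices):
    """Build one 'smartctl -t long' command (as an argv list) per device."""
    return [["smartctl", "-t", "long", f"/dev/{dev}"] for dev in devices]


def start_long_tests(devices, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in long_test_commands(devices):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: keep going even if one drive refuses the test.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Assumed device names; adjust to the drives actually present.
    start_long_tests(["da0", "da1", "da2", "da3"])
```

A long test runs in the background on the drive and can take several hours on a 3TB disk. Once it finishes, the outcome should show up in the self-test log, for example with `smartctl -l selftest /dev/da0`.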
 