Random Drives Failing

Status
Not open for further replies.

BraveStarr

Cadet
Joined
Aug 20, 2018
Messages
5
Hi Everyone,

I am having a bit of a strange issue with my FreeNAS deployment. I have a 45 Drives Solution with 10 x 10TB SATA Drives. These are all in a RAIDz2 pool.

The issue is that drives randomly fail. We have tried replacing the drives and the controllers and still have the same issue. A failed drive shows a high number of write errors, 6 read errors, and no checksum errors. I have found that you can offline the disk and replace it with itself, and then another disk will fail. So to me the disks seem fine.

I have had a bit of a poke around and can see the following error:

SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mpr1:0:9:0): Error 6, Retries exhausted

But I have no idea what is causing it, or if this is even indicative of the root cause.
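
One way to see whether errors like these cluster on a single controller is to tally them by the device prefix in each message. The following is only a minimal sketch, assuming the console messages have been saved as text; the function name is hypothetical, and every sample line other than the `da9` one quoted above is invented for illustration.

```python
import re
from collections import Counter

# Matches the device prefix seen in the log line above, e.g. "(da9:mpr1:0:9:0):"
# Fields are: device, controller, bus, target, LUN.
CAM_PREFIX = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\):"
)

def count_errors_by_controller(log_text):
    """Count matching log lines per (controller, device) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_PREFIX.search(line)
        if match:
            counts[(match.group("ctrl"), match.group("dev"))] += 1
    return counts

# Hypothetical sample: only the da9 line comes from the post, the da3 line is made up.
sample_log = """(da9:mpr1:0:9:0): Error 6, Retries exhausted
(da3:mpr0:0:3:0): Error 6, Retries exhausted
(da9:mpr1:0:9:0): Error 6, Retries exhausted"""

for (ctrl, dev), n in sorted(count_errors_by_controller(sample_log).items()):
    print(ctrl, dev, n)
```

If the counts pile up on one controller, that points at the HBA, its cabling, or its backplane rather than at the individual disks.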

Another peculiarity is that rebooting the unit will clear all the errors and the pool goes back to healthy again.

Any help or pointers would be much appreciated.

Kind Regards,


Ryan
 

Mihalich

Patron
Joined
Mar 14, 2017
Messages
297
Maybe it's a power supply.
 
Joined
Dec 29, 2014
Messages
1,135
I would wonder about cooling. Are the drives and the drive controller getting enough airflow? Excessive heat makes electronics do strange (and bad) things. It could also be cabling in addition to the power supply issue suggested above. I would lean towards heat being the next thing to investigate.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Another peculiarity is that rebooting the unit will clear all the errors and the pool goes back to healthy again.
That's perfectly normal: pool error counters aren't stored persistently, so they are cleared when you reboot.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
10 x 10TB SATA Drives. These are all in a RAIDz2 pool.
:eek:
You may want to consider rearranging that...
I have found that you can offline the disk and replace it with itself, and then another disk will fail
Have you pulled one and done a full burn-in test? It never hurts to be sure!
Please post full system details and model numbers of ALL parts! Also list all power splitters and extension cables!
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Does this only happen to one drive at a time? Have you ever had two drives fail at the same time (within a few minutes)?
 

BraveStarr

Cadet
Joined
Aug 20, 2018
Messages
5
Hi,

First of all thanks for all the replies! :)

Maybe it's a power supply.
So we have swapped the 2 PSUs over to see if that has any impact.

I would wonder about cooling. Are the drives and the drive controller getting enough airflow? Excessive heat makes electronics do strange (and bad) things. It could also be cabling in addition to the power supply issue suggested above. I would lean towards heat being the next thing to investigate.
Airflow seems OK and the air conditioning is pretty effective, but I have put one of our temperature monitoring probes into the chassis to check.

Does this only happen to one drive at a time? Have you ever had two drives fail at the same time (within a few minutes)?
Normally this is one drive at a time, however we had 2 fail over the weekend.

The System Specs are:
Processor: E5-2620 v4
Motherboard: X10SRL (DDR 4)
RAM: 32GB DDR4
HBA: 2 x LSI 9305 12Gb/s (SSD and SAS SATA Compatible)
Pool Hard Drives: Exos 10TB (ST10000NM0086)
Boot Drive: Redundant SSDs (2 x 120GB)

If we have another failure I will post some more details.

Kind regards,


Ryan
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
The System Specs are:
Processor: E5-2620 v4
Motherboard: X10SRL (DDR 4)
RAM: 32GB DDR4
HBA: 2 x LSI 9305 12Gb/s (SSD and SAS SATA Compatible)
Pool Hard Drives: Exos 10TB (ST10000NM0086)
Boot Drive: Redundant SSDs (2 x 120GB)
What are you using for power supplies?
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Oddly enough, I came here to ask about this same thing.

-edit-

I'm having the same errors, with a lot more detail. Starting my own thread per request.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
@wokka, feel free to follow along, but to troubleshoot your system, yes, please start a new thread and include all the same information requested here, ESPECIALLY the power supply information.
 

Mihalich

Patron
Joined
Mar 14, 2017
Messages
297
What about the mobile rack?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
What firmware version are the HBAs running?
 

BraveStarr

Cadet
Joined
Aug 20, 2018
Messages
5
Ok, so 2 drives have failed again overnight:

Code:
  pool: Data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug 21 08:16:25 2018
        50.7G scanned at 3.62G/s, 191M issued at 13.7M/s, 14.1T total
        0 resilvered, 0.00% done, 12 days 12:45:39 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        Data                                                  DEGRADED     0     0     0
          raidz2-0                                            DEGRADED     0     0     0
            replacing-0                                       FAULTED      0     0     0
              gptid/eb9072a9-6a88-11e8-82ff-ac1f6b8d1d8c.eli  FAULTED      6    76     0  too many errors
              gptid/1bdd433c-a512-11e8-b5e8-ac1f6b8d1d8c.eli  ONLINE       0     0     0
            gptid/5ae06b25-a207-11e8-b218-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/451a74d4-8459-11e8-9a57-ac1f6b8d1d8c.eli    FAULTED      6    98     0  too many errors
            gptid/887eb6f6-8e4f-11e8-a560-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/436a8a26-8a62-11e8-b095-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/2cc85fde-7851-11e8-810d-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f08a7e77-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f15e0c4d-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f23472e4-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/dedd30d9-a448-11e8-b218-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/bac631e9-7dd8-11e8-9a57-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/fc9ac9ee-7dd7-11e8-9a57-ac1f6b8d1d8c.eli    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:04 with 0 errors on Fri Aug 17 03:45:04 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors



I have noted that there are always 6 READ errors.
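
Since the counters are cleared on reboot, it may help to record them whenever a drive faults. Below is a minimal sketch, assuming the status output above has been captured as text; the function name is hypothetical and the parsing is based only on the column layout shown in this post.

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with a nonzero counter."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # A device row has at least: NAME STATE READ WRITE CKSUM
        if len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        try:
            read, write, cksum = (int(f) for f in fields[2:5])
        except ValueError:
            # Header rows and prose lines land here; so would abbreviated
            # counters such as "1.2K", which this sketch does not handle.
            continue
        if read or write or cksum:
            rows.append((name, state, read, write, cksum))
    return rows

# A few rows copied from the output above.
sample_status = """
        NAME                                                  STATE     READ WRITE CKSUM
        Data                                                  DEGRADED     0     0     0
              gptid/eb9072a9-6a88-11e8-82ff-ac1f6b8d1d8c.eli  FAULTED      6    76     0  too many errors
            gptid/5ae06b25-a207-11e8-b218-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/451a74d4-8459-11e8-9a57-ac1f6b8d1d8c.eli    FAULTED      6    98     0  too many errors
"""

for row in devices_with_errors(sample_status):
    print(row)
```

Logging these rows with a timestamp on each failure would show whether the read count really is 6 every time and how the write count varies.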

Controller revisions are:

Code:
mpr0: <Avago Technologies (LSI) SAS3224> port 0xd000-0xd0ff mem 0xfb400000-0xfb40ffff irq 32 at device 0.0 on pci4
mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

Code:
mpr1: <Avago Technologies (LSI) SAS3224> port 0xc000-0xc0ff mem 0xfb200000-0xfb20ffff irq 40 at device 0.0 on pci5
mpr1: Firmware: 15.00.00.00, Driver: 18.03.00.00-fbsd
mpr1: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

I have just noticed that the two controllers are not on the same firmware revision and will remedy this shortly.
- Firmware is now updated to 16.00.01.00 on both controllers:
Code:
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

		Adapter Selected is a Avago SAS: SAS3224(A1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS      PCI Addr
----------------------------------------------------------------------------

0     SAS3224(A1)     16.00.01.00   10.00.00.05   08.27.00.00   00:03:00:00
1     SAS3224(A1)     16.00.01.00   10.00.00.05   08.35.00.00   00:04:00:00

		Finished Processing Commands Successfully.
		Exiting SAS3Flash.
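
A firmware mismatch like the one between mpr0 and mpr1 in the boot messages is easy to miss by eye. The following is a minimal sketch of an automated check, assuming the boot messages are available as text in the same "mprN: Firmware: ..., Driver: ..." form shown earlier; both function names are hypothetical.

```python
import re

# Matches lines such as "mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd"
FW_LINE = re.compile(
    r"^(?P<ctrl>mpr\d+): Firmware: (?P<fw>[\d.]+), Driver: (?P<drv>\S+)"
)

def firmware_by_controller(boot_text):
    """Map each controller name to the firmware version it reported."""
    versions = {}
    for line in boot_text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            versions[match.group("ctrl")] = match.group("fw")
    return versions

def firmware_mismatch(versions):
    """True when the controllers do not all report the same firmware version."""
    return len(set(versions.values())) > 1

# The two firmware lines as posted before the update.
boot_messages = """mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd
mpr1: Firmware: 15.00.00.00, Driver: 18.03.00.00-fbsd"""

versions = firmware_by_controller(boot_messages)
print(versions, "mismatch" if firmware_mismatch(versions) else "consistent")
```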


Power supplies are:
2x Emacs MTW-5AD0B2 1200W PSU

Any thoughts?

Kind Regards,


Ryan
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
With a consistent 6 read errors on every failure, it almost has to be a controller or driver issue.
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
I never noticed it was always 6 errors, but I had a drive fail overnight and it has 6 read errors as well. It also had write errors, and more than 6. I've got a new controller on order to test this hypothesis.

What command is being used to check the firmware version?

Thanks!
 

wokka

Dabbler
Joined
Aug 13, 2013
Messages
16
Just curious: did this resolve your problem? I put my new controller in a couple of hours ago and have been trying to hammer my NAS to see if an error will appear. Knock on wood, nothing so far.
 