Random Drives Failing

BraveStarr · Aug 20, 2018

Hi Everyone,

I am having a bit of a strange issue with my FreeNAS deployment. I have a 45 Drives Solution with 10 x 10TB SATA Drives. These are all in a RAIDz2 pool.

The issue is that drives will randomly fail. We have tried replacing the drives and controllers and are still having the same issue. The failed drives show a high number of write errors, 6 read errors, and no checksum errors. I have found you can offline the disk and replace it with itself and then another disk will fail. So to me the disks seem fine.

I have had a bit of a poke around and can see the following error:

SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mpr1:0:9:0): Error 6, Retries exhausted

But I have no idea what is causing it, or if this is even indicative of the root cause.

Another peculiarity is that rebooting the unit will clear all the errors and the pool goes back to healthy again.

Any help or pointers would be much appreciated.

Kind Regards,

Ryan

Mihalich · Aug 20, 2018

Maybe it's a power supply.

Elliot Dierksen · Aug 20, 2018

I would wonder about cooling. Are the drives and the drive controller getting enough airflow? Excessive heat makes electronics do strange (and bad) things. It could also be cabling in addition to the power supply issue suggested above. I would lean towards heat being the next thing to investigate.

danb35 · Aug 20, 2018

BraveStarr said:
Another peculiarity is that rebooting the unit will clear all the errors and the pool goes back to healthy again.

That's perfectly normal--pool errors aren't stored persistently, so when you reboot, the counters are cleared.

kdragon75 · Aug 20, 2018

BraveStarr said:
10 x 10TB SATA Drives. These are all in a RAIDz2 pool.

You may want to consider rearranging that...

BraveStarr said:
I have found you can offline the disk and replace it with itself and then another disk will fail

Have you pulled one and done a full burn-in test? It never hurts to be sure!
Please post full system details and model numbers of ALL parts! Also list all power splitters/extension cables!

kdragon75 · Aug 20, 2018

Does this only happen to one drive at a time? Have you ever had two drives fail at the same time (within a few minutes)?

BraveStarr · Aug 20, 2018

Hi,

First of all thanks for all the replies! :)

Mihalich said:
Maybe it's a power supply.

So we have swapped the 2 PSUs over to see if that has any impact.

Elliot Dierksen said:
I would wonder about cooling. Are the drives and the drive controller getting enough airflow? Excessive heat makes electronics do strange (and bad) things. It could also be cabling in addition to the power supply issue suggested above. I would lean towards heat being the next thing to investigate.

Airflow seems OK and the aircon is pretty effective, but I have put one of our temperature monitoring probes into the chassis to check.

kdragon75 said:
Does this only happen to one drive at a time? Have you ever had two drives fail at the same time (within a few minutes)?

Normally this is one drive at a time; however, we had 2 fail over the weekend.

The System Specs are:
Processor: E5-2620 v4
Motherboard: X10SRL (DDR 4)
RAM: 32GB DDR4
HBA: 2 x LSI 9305 12Gb/s (SSD and SAS SATA Compatible)
Pool Hard Drives: Exos 10TB (ST10000NM0086)
Boot Drive: Redundant SSDs (2 x 120GB)

If we have another failure I will post some more details.

Kind regards,

Ryan

kdragon75 · Aug 20, 2018

BraveStarr said:
The System Specs are:
Processor: E5-2620 v4
Motherboard: X10SRL (DDR 4)
RAM: 32GB DDR4
HBA: 2 x LSI 9305 12Gb/s (SSD and SAS SATA Compatible)
Pool Hard Drives: Exos 10TB (ST10000NM0086)
Boot Drive: Redundant SSDs (2 x 120GB)

What are you using for power supplies?

wokka · Aug 20, 2018

BraveStarr said:
Hi Everyone,

I am having a bit of a strange issue with my FreeNAS deployment. I have a 45 Drives Solution with 10 x 10TB SATA Drives. These are all in a RAIDz2 pool.

The issue is that drives will randomly fail. We have tried replacing the drives and controllers and are still having the same issue. The failed drives show a high number of write errors, 6 read errors, and no checksum errors. I have found you can offline the disk and replace it with itself and then another disk will fail. So to me the disks seem fine.

I have had a bit of a poke around and can see the following error:

SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mpr1:0:9:0): Error 6, Retries exhausted

But I have no idea what is causing it, or if this is even indicative of the root cause.

Another peculiarity is that rebooting the unit will clear all the errors and the pool goes back to healthy again.

Any help or pointers would be much appreciated.

Kind Regards,

Ryan

Oddly enough, I came here to ask about this same thing.

-edit-

I'm having the same errors, with a lot more detail. Starting my own thread per request.

kdragon75 · Aug 20, 2018

@wokka, feel free to sit in, but to troubleshoot, yes, please start a new thread and include all the same information requested here, ESPECIALLY the power supply information.

wokka · Aug 20, 2018

Understood, thanks, will start my own thread.

Mihalich · Aug 20, 2018

What about the mobile rack?

Stux · Aug 20, 2018

Firmware version of HBAs?

BraveStarr · Aug 21, 2018

Ok, so 2 drives have failed again overnight:

Code:

  pool: Data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug 21 08:16:25 2018
        50.7G scanned at 3.62G/s, 191M issued at 13.7M/s, 14.1T total
        0 resilvered, 0.00% done, 12 days 12:45:39 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        Data                                                  DEGRADED     0     0     0
          raidz2-0                                            DEGRADED     0     0     0
            replacing-0                                       FAULTED      0     0     0
              gptid/eb9072a9-6a88-11e8-82ff-ac1f6b8d1d8c.eli  FAULTED      6    76     0  too many errors
              gptid/1bdd433c-a512-11e8-b5e8-ac1f6b8d1d8c.eli  ONLINE       0     0     0
            gptid/5ae06b25-a207-11e8-b218-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/451a74d4-8459-11e8-9a57-ac1f6b8d1d8c.eli    FAULTED      6    98     0  too many errors
            gptid/887eb6f6-8e4f-11e8-a560-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/436a8a26-8a62-11e8-b095-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/2cc85fde-7851-11e8-810d-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f08a7e77-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f15e0c4d-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/f23472e4-6a88-11e8-82ff-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/dedd30d9-a448-11e8-b218-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/bac631e9-7dd8-11e8-9a57-ac1f6b8d1d8c.eli    ONLINE       0     0     0
            gptid/fc9ac9ee-7dd7-11e8-9a57-ac1f6b8d1d8c.eli    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:04 with 0 errors on Fri Aug 17 03:45:04 2018
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors

I have noted that there are always 6 READ errors.
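
In case it helps anyone compare results, here is a rough Python sketch for pulling the non-zero counters out of `zpool status` output. It is only an illustration: the function name `vdev_errors` and the regular expression are my own assumptions, the sample rows are copied from the output above, and it does not handle abbreviated counters such as 1.2K.

```python
import re

# One vdev row of `zpool status`: name, state, then READ / WRITE / CKSUM.
# Assumption: counters are plain integers (abbreviated values like "1.2K" are skipped).
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def vdev_errors(status_text):
    """Return {vdev_name: (read, write, cksum)} for rows with a non-zero counter."""
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines, blank lines, "errors:" summary, etc.
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            errors[name] = counts
    return errors


# Rows copied from the Data pool output in this post.
sample = """\
    replacing-0                                       FAULTED      0     0     0
      gptid/eb9072a9-6a88-11e8-82ff-ac1f6b8d1d8c.eli  FAULTED      6    76     0  too many errors
      gptid/1bdd433c-a512-11e8-b5e8-ac1f6b8d1d8c.eli  ONLINE       0     0     0
    gptid/451a74d4-8459-11e8-9a57-ac1f6b8d1d8c.eli    FAULTED      6    98     0  too many errors
"""

for name, (read, write, cksum) in vdev_errors(sample).items():
    print(name, "read:", read, "write:", write, "cksum:", cksum)
```

Because the counters are cleared on reboot, saving this output before each restart would make it easier to see whether the read count really is 6 every time.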

Controller revisions are:

Code:

mpr0: <Avago Technologies (LSI) SAS3224> port 0xd000-0xd0ff mem 0xfb400000-0xfb40ffff irq 32 at device 0.0 on pci4
mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

Code:

mpr1: <Avago Technologies (LSI) SAS3224> port 0xc000-0xc0ff mem 0xfb200000-0xfb20ffff irq 40 at device 0.0 on pci5
mpr1: Firmware: 15.00.00.00, Driver: 18.03.00.00-fbsd
mpr1: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

I have just noticed that the two controllers are not on the same firmware revision and will remedy this shortly.
- Firmware is now updated to 16.00.01.00

Code:

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

		Adapter Selected is a Avago SAS: SAS3224(A1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0     SAS3224(A1)     16.00.01.00   10.00.00.05   08.27.00.00      00:03:00:00
1     SAS3224(A1)     16.00.01.00   10.00.00.05   08.35.00.00      00:04:00:00

		Finished Processing Commands Successfully.
		Exiting SAS3Flash.

Power supplies are:
2x Emacs MTW-5AD0B2 1200W PSU

Any thoughts?

Kind Regards,

Ryan

kdragon75 · Aug 21, 2018

It almost has to be a controller or driver issue with the consistent 6 read errors.
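
If you want to test that idea, one option is to tally the "Retries exhausted" kernel messages per controller. Below is a rough Python sketch, assuming the messages all look like the `(da9:mpr1:0:9:0)` line you quoted. The function name `errors_by_controller` is made up for illustration, and only the first sample line comes from this thread; the other two are invented.

```python
import re
from collections import Counter

# CAM peripheral path as printed by the kernel, e.g. "(da9:mpr1:0:9:0): ...".
# Assumption: disks attach as daN behind mprN controllers, as in this thread.
CAM_PATH = re.compile(
    r"\((?P<disk>da\d+):(?P<ctrl>mpr\d+):\d+:\d+:\d+\):\s*(?P<msg>.*)"
)


def errors_by_controller(log_text):
    """Count 'Retries exhausted' messages per controller name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_PATH.search(line)
        if match and "Retries exhausted" in match.group("msg"):
            counts[match.group("ctrl")] += 1
    return counts


# First line is from the original post; the rest are hypothetical examples.
sample_log = """\
(da9:mpr1:0:9:0): Error 6, Retries exhausted
(da4:mpr1:0:4:0): Error 6, Retries exhausted
(da2:mpr0:0:2:0): Error 6, Retries exhausted
"""

for controller, count in sorted(errors_by_controller(sample_log).items()):
    print(controller, count)
```

If every failure lands on one mprN, that would point at that HBA or its cabling. If they are spread across both, something shared (driver, firmware, backplane, or power) looks more likely.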

wokka · Aug 21, 2018

I never noticed it was always 6 errors, but I had a drive fail overnight and it has 6 read errors as well. It also had write errors, and more than 6. I've got a new controller on order to test this hypothesis.

What command is being used to check the firmware version?

Thanks!

BraveStarr · Aug 21, 2018

wokka said:
What command is being used to check the firmware version?

I used this to get a concise listing of the firmware revisions:

Code:

sas3flash -listall

You can also run:

Code:

dmesg | grep mpr
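
If you have more than one HBA, it is worth confirming they all report the same firmware. Here is a minimal Python sketch that does this from saved `dmesg` text, assuming the `mprN: Firmware: ..., Driver: ...` line format shown earlier in this thread. The helper names `firmware_versions` and `firmware_mismatch` are hypothetical.

```python
import re

# Matches lines such as "mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd".
FW_LINE = re.compile(r"^(mpr\d+): Firmware: ([\d.]+), Driver: (\S+)")


def firmware_versions(dmesg_text):
    """Return {controller: firmware_version}; a later line for the same controller wins."""
    versions = {}
    for line in dmesg_text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def firmware_mismatch(versions):
    """True when the controllers do not all report the same firmware version."""
    return len(set(versions.values())) > 1


# Lines copied from the dmesg output posted before the firmware update.
sample_dmesg = """\
mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd
mpr1: Firmware: 15.00.00.00, Driver: 18.03.00.00-fbsd
"""

versions = firmware_versions(sample_dmesg)
print(versions)
print("mismatch:", firmware_mismatch(versions))
```

On the box itself you could feed it the output of the `dmesg | grep mpr` command above.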

BraveStarr · Aug 21, 2018

kdragon75 said:
It almost has to be a controller or driver issue with the consistent 6 read errors.

We have received a new controller from 45 Drives, which I will install tomorrow if we have another drive failure.

wokka · Aug 23, 2018

Just curious if this resolved your problem? I put my new controller in a couple of hours ago and have been trying to hammer my NAS to see if an error will appear. Knock on wood, nothing so far.

wokka · Aug 25, 2018

Controller replacement seems to have resolved my issue.
