When a disk drive fails (post mortem)

Status
Not open for further replies.

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Last night apparently one of my drives started failing. This is how it started:

Code:
8:43PM - The following multipaths are not optimal: disk12
1:51AM - The following multipaths are not optimal: disk4, disk12
3:01AM - a huge e-mail containing a bunch of SCSI error messages for a bunch of targets ending with this
(da229:(da228:mpr1:0:352:0): WRITE(10). CDB: 2a 00 42 7f 18 c0 00 00 02 00
mpr1:0:(da228:mpr1:0:352:0): CAM status: CCB request completed with an error
353:(da228:0): mpr1:0:Retrying command
352:0): Retrying command
mpr1: Sending reset from mprsas_send_abort for target ID 357
mpr1: Unfreezing devq for target ID 357
5:11AM - The following multipaths are not optimal: disk4, disk12, disk19
6:42AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
7:04AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
Device: /dev/da223, Read SMART Self-Test Log Failed
7:31AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
Device: /dev/da241, failed to read SMART values
Device: /dev/da223, Read SMART Self-Test Log Failed
9:24AM - FreeNAS gives up and crashes, the watchdog timer on the IPMI host now hard-reboots the system
9:44AM - The volume Volumes (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state
11:45AM - zpool status gives me :
13.2T scanned out of 87.3T at 2.03G/s, 10h23m to go
spare-0									   UNAVAIL	  0	 0	 0
		  1641015999348081297						 UNAVAIL	  0	 0	 0  was /dev/gptid/af7e0f74-6554-11e6-b95f-0cc47aa483b8
		  gptid/d96f0f68-6554-11e6-b95f-0cc47aa483b8  ONLINE	   0	 0	 0  (resilvering)


No more SCSI error messages in the console; the drive has died and taken itself offline.
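
For anyone hitting something similar, here's a rough sketch of what I'd run from a shell to see the same state the alert e-mails were reporting (device names taken from the messages above; just the stock FreeBSD/FreeNAS tools, nothing exotic):

Code:
# Multipath devices that have lost a path show up as DEGRADED instead of OPTIMAL
gmultipath status | grep -v OPTIMAL

# Only report pools that are not healthy
zpool status -x

# Pull the full SMART report from the drive that failed its self-test
smartctl -a /dev/da223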

Lessons learned:
- Drives fail when you're sleeping
- Drives don't always fail gracefully; this one took the server through a little over 12 hours of abuse.
- SMART notices things a bit late; it checks every 10 minutes, yet no SMART errors were asserted for almost 11 hours.
- Drives, even in SAS setups, can influence other drives on the bus. Apparently an entire channel (4 drives/channel in the enclosure) went offline; if I didn't have multipath, my pool could've been toast.
- FreeNAS will crash (unknown as to why) at some point when things get dicey. I know OpenSolaris used to hang itself up (kernel panic) if it couldn't maintain the stability of a ZFS pool; in Solaris this is by design.
- Resilvering, even with mirrors, takes a freaking long time.

Coming up:
Once the resilvering is done, hunt down and replace the defective drive amongst 60 of its brethren.
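
The rough plan for that hunt, sketched out below. sesutil may or may not be on your base system, and the enclosure has to honour the locate LED; otherwise it comes down to matching serial numbers by hand:

Code:
# Grab the serial number of the suspect drive so it can be matched to the label on the caddy
smartctl -i /dev/da223 | grep -i serial

# Map daXXX devices to enclosure slots, then blink the locate LED on the suspect one
sesutil map
sesutil locate da223 on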
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
- FreeNAS will crash (unknown as to why) at some point when things get dicey. I know OpenSolaris used to hang itself up (kernel panic) if it couldn't maintain the stability of a ZFS pool; in Solaris this is by design.

It's possible that you have a swap partition on the drive that failed. If so, and FreeNAS tries to use this swap when the drive is dying, you will likely crash.

There have been discussions and bug reports cited in the forum, but I'm too lazy to look them up right now.
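
If you want to check whether your swap is sitting on the data disks, something like this should show it (by default FreeNAS carves a small freebsd-swap partition, p1, out of every data disk):

Code:
# List the devices currently backing swap space
swapinfo -h

# Show the partition table of one data disk; p1 is the freebsd-swap partition FreeNAS creates
gpart show da223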
 

Joined
Oct 8, 2016
Messages
48
- Drives fail when you're sleeping

Yes. This happens ALWAYS :)

- Drives, even in SAS setups, can influence other drives on the bus. Apparently an entire channel (4 drives/channel in the enclosure) went offline; if I didn't have multipath, my pool could've been toast.

Also happened to me. I've never understood why this happens.
What do you mean by "multipath"? Are you using SAS disks connected to two HBAs?
How old are these disks and the server?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Yes, I'm using multipath, so all the connections between the internal expanders, between the enclosures, and between the server and the enclosures are duplicated. The only SPOF is the actual server itself, but the services and configuration are replicated so they can be taken over by another host as necessary.
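
If it helps to picture it: every dual-ported drive shows up twice at the CAM layer, once per path, and gets collapsed into a single multipath/diskNN node, which is what the alert e-mails complain about. A quick sketch (commands only, output omitted):

Code:
# Every dual-ported drive appears as two daXXX peripherals, one per HBA/expander path
camcontrol devlist

# gmultipath ties each pair of paths back into one multipath/diskNN device and tracks their state
gmultipath status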
 
Joined
Oct 8, 2016
Messages
48
Do you know why a single disk failure could bring down the whole bus?

Is this the same with SATA drives mapped directly to each port, without using any SAS expander or multilane cables?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
SATA has only a single connection, so if you bring a SATA multiplexer to its knees, the entire connection fails (I've had that happen on a cheap 1x5 SATA multiplexer: 1 drive down, all 5 drives 'disconnect'). If you are able to connect each drive to its own SATA controller, you may not have an issue. If you put SATA drives behind a SAS expander (which is allowed), you lose the benefit of multipathing: if that expander goes down for whatever reason, there is no second expander to fall back on even if the enclosure is multipathed, because the SATA drive simply doesn't know it can take another route (and it doesn't have the electrical connections for it either).

SAS has two data paths (multipathing) per drive, so in a well-designed enclosure each of your drives goes to two different expanders; if one path goes down, the other path(s) take over.

I think the bad drive must have been dumping garbage or doing something funky to the expander it was connected to, which brought down all the drives connected to that pathway as well as the expander itself. Luckily, since everything is dual-pathed, the 'good' drives just took another route, hence the warnings about losing the path to the endpoints.
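
Once the bad drive is out, the surviving path should pick everything back up on its own (gmultipath re-assembles paths as the providers reappear), but this is roughly the sanity check I'd do afterwards:

Code:
# Ask CAM to re-probe all buses after the swap/reseat
camcontrol rescan all

# Anything still listed here hasn't regained its second path (healthy devices read OPTIMAL)
gmultipath status | grep -v OPTIMAL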
 
Joined
Oct 8, 2016
Messages
48
Some servers have multiple SATA ports.
In theory, if a disk goes down, the other disks connected to different ports shouldn't notice it and should keep running.
 