When a disk drive fails (post mortem)

Status
Not open for further replies.

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Last night apparently one of my drives started failing. This is how it started:

Code:
8:43PM - The following multipaths are not optimal: disk12
1:51AM - The following multipaths are not optimal: disk4, disk12
3:01AM - a huge e-mail containing a bunch of SCSI error messages for a bunch of targets ending with this
(da229:(da228:mpr1:0:352:0): WRITE(10). CDB: 2a 00 42 7f 18 c0 00 00 02 00
mpr1:0:(da228:mpr1:0:352:0): CAM status: CCB request completed with an error
353:(da228:0): mpr1:0:Retrying command
352:0): Retrying command
mpr1: Sending reset from mprsas_send_abort for target ID 357
mpr1: Unfreezing devq for target ID 357
5:11AM - The following multipaths are not optimal: disk4, disk12, disk19
6:42AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
7:04AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
Device: /dev/da223, Read SMART Self-Test Log Failed
7:31AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
Device: /dev/da241, failed to read SMART values
Device: /dev/da223, Read SMART Self-Test Log Failed
9:24AM - FreeNAS gives up and crashes, the watchdog timer on the IPMI host now hard-reboots the system
9:44AM - The volume Volumes (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state
11:45AM - zpool status gives me :
13.2T scanned out of 87.3T at 2.03G/s, 10h23m to go
spare-0									   UNAVAIL	  0	 0	 0
		  1641015999348081297						 UNAVAIL	  0	 0	 0  was /dev/gptid/af7e0f74-6554-11e6-b95f-0cc47aa483b8
		  gptid/d96f0f68-6554-11e6-b95f-0cc47aa483b8  ONLINE	   0	 0	 0  (resilvering)


No more SCSI error messages in the console; the drive has died and taken itself offline.
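
For anyone hitting something similar, here's a rough sketch of what I'd run from a shell to see the same state the alert e-mails were reporting (device names taken from the messages above; just the stock FreeBSD/FreeNAS tools, nothing exotic):

Code:
# Multipath devices that have lost a path show up as DEGRADED instead of OPTIMAL
gmultipath status | grep -v OPTIMAL

# Only report pools that are not healthy
zpool status -x

# Pull the full SMART report from the drive that failed its self-test
smartctl -a /dev/da223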

Lessons learned:
- Drives fail when you're sleeping
- Drives don't always fail gracefully; this one took the server through a little over 12 hours of abuse.
- SMART notices things a bit late; it checks every 10 minutes, yet no SMART errors were asserted for almost 11 hours.
- Drives, even in SAS setups, can influence other drives on the bus. Apparently an entire channel (4 drives/channel in the enclosure) went offline; if I didn't have multipath, my pool could've been toast.
- FreeNAS will crash (unknown as to why) at some point when things get dicey. I know OpenSolaris used to hang itself up (kernel panic) if it couldn't maintain the stability of a ZFS pool; in Solaris this is by design.
- Resilvering, even with mirrors, takes a freaking long time.

Coming up:
Once the resilvering is done, hunt down and replace the defective drive amongst 60 of its brethren.
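
The rough plan for that hunt, sketched out below. sesutil may or may not be on your base system, and the enclosure has to honour the locate LED; otherwise it comes down to matching serial numbers by hand:

Code:
# Grab the serial number of the suspect drive so it can be matched to the label on the caddy
smartctl -i /dev/da223 | grep -i serial

# Map daXXX devices to enclosure slots, then blink the locate LED on the suspect one
sesutil map
sesutil locate da223 on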
 

droeders

Contributor
Joined
Mar 21, 2016
Messages
179
- FreeNAS will crash (unknown as to why) at some point when things get dicey. I know OpenSolaris used to hang itself up (kernel panic) if it couldn't maintain the stability of a ZFS pool; in Solaris this is by design.

It's possible that you have a swap partition on the drive that failed. If so, and FreeNAS tries to use this swap when the drive is dying, you will likely crash.

There have been discussions and bug reports cited in the forum, but I'm too lazy to look them up right now.
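
If you want to check whether your swap is sitting on the data disks, something like this should show it (by default FreeNAS carves a small freebsd-swap partition, p1, out of every data disk):

Code:
# List the devices currently backing swap space
swapinfo -h

# Show the partition table of one data disk; p1 is the freebsd-swap partition FreeNAS creates
gpart show da223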
 

Joined
Oct 8, 2016
Messages
48
- Drives fail when you're sleeping

Yes. This happens ALWAYS :)

- Drives, even in SAS setups, can influence other drives on the bus. Apparently an entire channel (4 drives/channel in the enclosure) went offline; if I didn't have multipath, my pool could've been toast.

Also happened to me. I've never understood why this happens.
What do you mean by "multipath"? Are you using SAS disks connected to two HBAs?
How old are these disks and the server?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
Yes, I'm using multipath, so all the connections between the internal expanders, between the enclosures, and between the server and the enclosures are duplicated. The only SPOF is the actual server itself, but the services and configuration are replicated so they can be taken over by another host as necessary.
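
If it helps to picture it: every dual-ported drive shows up twice at the CAM layer, once per path, and gets collapsed into a single multipath/diskNN node, which is what the alert e-mails complain about. A quick sketch (commands only, output omitted):

Code:
# Every dual-ported drive appears as two daXXX peripherals, one per HBA/expander path
camcontrol devlist

# gmultipath ties each pair of paths back into one multipath/diskNN device and tracks their state
gmultipath status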
 
Joined
Oct 8, 2016
Messages
48
Do you know why a single disk failure could bring down the whole bus?

Is this the same with SATA drives mapped directly to each port, without using any SAS expander or multilane cables?
 

Evi Vanoost

Explorer
Joined
Aug 4, 2016
Messages
91
SATA has only a single connection, so if you bring a SATA multiplexer to its knees, the entire connection fails (I've had that happen on a cheap 1x5 SATA multiplexer: 1 drive down, all 5 drives 'disconnect'). If you are able to connect each drive to its own SATA controller, you may not have an issue. If you put SATA drives behind a SAS expander (which is allowed), you lose the benefit of multipathing: if that expander goes down for whatever reason, there is no second expander to fall back on even if the enclosure is multipathed, because the SATA drive simply doesn't know it can take another route (and it doesn't have the electrical connections for it either).

SAS has two data paths (multipathing) per drive, so in a well-designed enclosure each of your drives goes to two different expanders; if one path goes down, the other path(s) take over.

I think the bad drive must have been dumping garbage or doing something funky to the expander it was connected to, which brought down all the drives connected to that pathway as well as the expander itself. Luckily, since everything is dual-pathed, the 'good' drives just took another route, hence the warnings about losing the path to the endpoints.
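
Once the bad drive is out, the surviving path should pick everything back up on its own (gmultipath re-assembles paths as the providers reappear), but this is roughly the sanity check I'd do afterwards:

Code:
# Ask CAM to re-probe all buses after the swap/reseat
camcontrol rescan all

# Anything still listed here hasn't regained its second path (healthy devices read OPTIMAL)
gmultipath status | grep -v OPTIMAL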
 
Joined
Oct 8, 2016
Messages
48
Some servers have multiple SATA ports.
In theory, if a disk goes down, the other disks connected to different ports shouldn't notice it and should keep running.
 