Evi Vanoost
Last night apparently one of my drives started failing. This is how it started:
Code:
8:43PM - The following multipaths are not optimal: disk12
1:51AM - The following multipaths are not optimal: disk4, disk12
3:01AM - a huge e-mail containing a bunch of SCSI error messages for a bunch of targets ending with this
    (da229:(da228:mpr1:0:352:0): WRITE(10). CDB: 2a 00 42 7f 18 c0 00 00 02 00
    mpr1:0:(da228:mpr1:0:352:0): CAM status: CCB request completed with an error
    353:(da228:0): mpr1:0:Retrying command
    352:0): Retrying command
    mpr1: Sending reset from mprsas_send_abort for target ID 357
    mpr1: Unfreezing devq for target ID 357
5:11AM - The following multipaths are not optimal: disk4, disk12, disk19
6:42AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
7:04AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
    Device: /dev/da223, Read SMART Self-Test Log Failed
7:31AM - The following multipaths are not optimal: disk30, disk4, disk12, disk19
    Device: /dev/da241, failed to read SMART values
    Device: /dev/da223, Read SMART Self-Test Log Failed
9:24AM - FreeNAS gives up and crashes, the watchdog timer on the IPMI host now hard-reboots the system
9:44AM - The volume Volumes (ZFS) state is DEGRADED: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state
11:45AM - zpool status gives me:
    13.2T scanned out of 87.3T at 2.03G/s, 10h23m to go
    spare-0                                         UNAVAIL      0     0     0
      1641015999348081297                           UNAVAIL      0     0     0  was /dev/gptid/af7e0f74-6554-11e6-b95f-0cc47aa483b8
      gptid/d96f0f68-6554-11e6-b95f-0cc47aa483b8    ONLINE       0     0     0  (resilvering)
No more SCSI error messages in the console; the drive has died and taken itself offline.
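If you want to poke at the same things while this is going on, these are roughly the commands I'm using from the FreeNAS shell (pool and device names are from my box, yours will differ):
Code:
# Overall pool health, resilver progress and which vdev members are UNAVAIL
zpool status -v Volumes

# GEOM multipath state; a degraded path here lines up with the
# "multipaths are not optimal" e-mails FreeNAS sends out
gmultipath status

# Kernel messages from the HBA (mpr) driver, to watch resets/retries as they happen
dmesg | grep mpr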
Lessons learned:
- Drives fail when you're sleeping.
- Drives don't always fail gracefully; this one took the server through a little over 12 hours of abuse.
- SMART notices things a bit late: it checks every 10 minutes, but no SMART errors were asserted for almost 11 hours (there's a quick manual-check sketch after this list).
- Drives, even in SAS setups, can influence other drives on the bus. Apparently an entire channel (4 drives/channel in the enclosure) went offline; if it didn't have multipath, my pool could've been toast.
- FreeNAS will crash (unknown as to why) at some point when things get dicey. I know OpenSolaris used to hang itself up (kernel panic) if it couldn't maintain the stability of a ZFS pool; in Solaris this is by design.
- Resilvering, even with mirrors, takes a freaking long time.
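For what it's worth, kicking off a SMART check by hand looks something like this (smartctl ships with FreeNAS; da223 is the suspect drive on my system):
Code:
# Dump the drive's identity and current SMART attributes
smartctl -a /dev/da223

# Queue a short self-test (a couple of minutes), then re-read the self-test log
smartctl -t short /dev/da223
smartctl -l selftest /dev/da223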
Coming up:
Once the resilvering is done, hunt down and replace the defective drive amongst 60 of its brethren (rough plan sketched below).
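The rough plan, assuming the dead drive really is da223 and that the HBA tools bundled with FreeNAS (sas2ircu/sas3ircu, depending on the controller) behave like they do on my box; on FreeNAS the GUI replace is the usual route and this is just the command-line sketch:
Code:
# Grab the serial number of the suspect drive so I can match it against
# the label on the physical disk once the tray is pulled
smartctl -i /dev/da223 | grep -iE 'serial|model'

# Cross-check which SCSI target the device sits behind
camcontrol devlist | grep da223

# Ask the HBA which enclosure/slot holds that serial number
# (sas3ircu for the SAS3 HBAs behind the mpr driver, sas2ircu for SAS2)
sas3ircu 0 display

# Once the new disk is in: either use the GUI (Storage -> Volume Status -> Replace)
# or point ZFS at it directly; the long number is the GUID of the dead member
# from the zpool status output above
zpool replace Volumes 1641015999348081297 da223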