My ZFS is in a bad place, and I'm out of ideas on how to fix it. Short summary: I'm running RAID-Z2, 6x3TB drives, under encryption through FreeNAS's `geli` layer.
After moving, I unboxed my server and powered it on - all was well, and I went to my folks' for the holidays. Remotely, I noticed my pool reporting a degraded state and saw that a disk had gone offline. That's fine; I ordered a replacement on Amazon and continued in a degraded state. Then a second disk went offline, so I remotely shut down the server.

This motivated me to replace my server hardware, just in case it was actually a controller failing rather than just the disks. New motherboard, but the disks are still not great: the first to go offline (`ada1`) is DOA, and the second (`ada4`) is reporting a handful of bad sectors in SMART scans but seems otherwise functional.

FreeNAS will not unlock the drives through the UI - it hangs indefinitely - so I manually iterate through each drive and run `geli attach -k /data/geli/... /dev/gptid/drive` on all five. They all get attached, and the zpool can be imported, but it is missing `ada1`. I use the FreeNAS UI to provision my replacement drive and replace `ada1`.
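For reference, the manual unlock boils down to roughly this (a sketch only - the keyfile path is a placeholder, not my actual key, and in practice I ran the attach line for each drive by hand rather than as a loop):

Code:
# keyfile path below is a placeholder; the gptids are the five surviving
# members shown in the status output further down
for id in 7537576d-6d37-11e3-9def-0019db684008 \
          61e5d38d-7fd3-11e4-86e0-0019db684008 \
          770f6fe4-6d37-11e3-9def-0019db684008 \
          7884d394-6d37-11e3-9def-0019db684008 \
          78f3188b-6d37-11e3-9def-0019db684008
do
    geli attach -k /data/geli/pool_key.key /dev/gptid/${id}
done

# with all five members attached, the pool can be imported (minus ada1)
zpool import poolname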
Now, in the past I've replaced a drive or two, and what normally happens is the drive gets resilvered over time. This time, however - I suspect because of the staggered way my drives failed - I'm getting this (with map-resolved drive names written beside each GPTID):
Code:
Geom name: gptid/7537576d-6d37-11e3-9def-0019db684008.eli - ada0
Geom name: gptid/770f6fe4-6d37-11e3-9def-0019db684008.eli - ada5
Geom name: gptid/78013b99-6d37-11e3-9def-0019db684008.eli - ada1
Geom name: gptid/7884d394-6d37-11e3-9def-0019db684008.eli - ada2
Geom name: gptid/78f3188b-6d37-11e3-9def-0019db684008.eli - ada3
/dev/gptid/61e5d38d-7fd3-11e4-86e0-0019db684008 - ada4
87023237-f8d4-11e7-b423-309c2342a812 - new ada1

  pool: poolname
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jan 19 19:03:46 2018
        947G scanned at 1.31G/s, 64.3G issued at 91.3M/s, 9.71T total
        10.7G resilvered, 0.65% done, 1 days 06:46:28 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        poolname                                              DEGRADED     0     0   404
          raidz2-0                                            DEGRADED     0     0 1.58K
            gptid/7537576d-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors  <-- ada0
            gptid/61e5d38d-7fd3-11e4-86e0-0019db684008.eli    ONLINE       0     0     0                   <-- ada4
            gptid/770f6fe4-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors  <-- ada5
            replacing-3                                       DEGRADED     0     0     0
              1216402105560805077                             UNAVAIL      0     0     0  was /dev/gptid/78013b99-6d37-11e3-9def-0019db684008.eli
              gptid/87023237-f8d4-11e7-b423-309c2342a812.eli  ONLINE       0     0     0  (resilvering)    <-- ada1
            gptid/7884d394-6d37-11e3-9def-0019db684008.eli    ONLINE       0     0     0                   <-- ada2
            gptid/78f3188b-6d37-11e3-9def-0019db684008.eli    DEGRADED     0     0     0  too many errors  <-- ada3

errors: 7434 data errors, use '-v' for a list
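(For anyone wanting to reproduce the GPTID-to-adaX mapping I've annotated above: I believe `glabel status` lists each gptid label next to the adaXpY partition backing it, which is how the names can be cross-checked.)

Code:
# each gptid label appears alongside the partition (e.g. ada0p2) that provides it
glabel status | grep gptid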
Data errors, drives running in a "degraded" state with too many errors, and the drive that I've replaced just hanging there. That sucks, so I run `zpool clear`, and the command hangs, uninterruptible. Reboot, repeat; over the course of a few days and a few hangs I am slowly making progress, and eventually a `zpool clear` completes without hanging. At the moment, the drive actually seems to be resilvering - a first in the ~5 times I've rebooted and manually unlocked the drives - so maybe some forum magic is already rubbing off on me!
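(For completeness, the commands I keep poking at between reboots are nothing exotic - just status and clear against the pool; the `-v` flag is only there because the output above suggests it for listing the affected files:)

Code:
zpool status -v poolname   # resilver progress plus the list of files with data errors
zpool clear poolname       # the command that keeps hanging on me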
Anyway, to date I've hit a few things that scare me:
- "pool I/O is currently suspended"
- "too many errors"
- Hung resilvering (0% for days).
- Hung `zpool clear`
Can anyone weigh in and let me know whether I'm doing anything right, or if there is something I really should be doing to make this go better? I actually started writing this with the resilver hung at 0%, so I'm a little optimistic that maybe I've kicked it in the right direction this round, but I'd still appreciate any wisdom surrounding the issue, or things I should or should not do.
I'll update as things progress! Thanks in advance.