So here's the good news: this is all in my home lab, no irreplaceable data is in jeopardy, and everything is backed up anyway, so I'm happy to report this isn't one of those panicky "how do I get my data back?" first posts. That said, I do have some peculiar things going on, and I'm puzzled about exactly what's happening, what the likely cause was, and what the "proper" way to resolve it is.
The situation:
I've been in the process of upgrading my home lab recently. Among other upgrades, I got a storage server to replace my old legacy TrueNAS server (see sig). The plan was to get the new server spun up, then move my disk shelves over to it and import the pools. (I've actually done that before with no problems.) This time, the new TrueNAS server took forever to boot, and when I went to import the pools, it didn't find them. I fooled around with it for a few minutes before deciding to roll back to the old config and get the old server back online before my Plex friends and family started complaining. The legacy server also took forever to boot, and when it finally came back online, my main pool was showing as "degraded."
The symptoms:
On my main pool (2x RAIDz2 vdevs, each made of 12x 6TB Seagate SAS drives), both vdevs show up as "degraded." On the first vdev, all of the drives show "online" except one, which shows "failed" with several read errors. On the second vdev, several drives have a status of "degraded" with no errors of any type logged against them; the remaining disks in that vdev are "online."
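For reference, here's roughly the shape of what `zpool status` shows. To be clear, this is an illustrative sketch I typed up to match the symptoms above, not a paste from the live system; the pool name, device names, and error counts are placeholders, and I've used the CLI state names (ONLINE/DEGRADED/FAULTED):

```
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            da0     ONLINE       0     0     0
            ...
            da11    FAULTED     37     0     0  too many errors
          raidz2-1  DEGRADED     0     0     0
            da12    DEGRADED     0     0     0
            da13    DEGRADED     0     0     0
            da14    DEGRADED     0     0     0
            da15    ONLINE       0     0     0
            ...
```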
The questions:
1) What in tarnation is a "degraded" disk? I know that when the pool says "degraded" it means one or more disks aren't "online," and I'm assuming that also means it's relying on parity information (or a mirror) for data availability. So my first vdev makes perfect sense to me: I have a failed drive (I have a few cold spares, I just haven't bothered to replace it yet, and I'm comfortable with that process), and the vdev is currently relying on parity data. After I replace the failed disk and resilver, it should go back to "online." But what is the difference between a "degraded" disk and a "failed" disk, and how is my 2nd vdev still available despite having several "degraded" disks, more than the parity level allows for? RAIDz2 only tolerates two missing disks per vdev, so if "degraded" meant "unreadable," shouldn't the whole vdev be dead?
2) What likely caused this? The failed drive needs no explanation, but the "degraded" drives puzzle me. The drives seem to be "working" by most metrics, and it strikes me as unlikely that so many would fail independently yet simultaneously. Is it possible that I have a bad cable/controller/expander? Is the problem more likely to be on the legacy system (where everything is "working" but the disks have "odd" statuses) or on the new hardware (where the trouble initially started and the pool never showed up)? Or is this just one of those situations where there's no easy troubleshooting beyond swapping out hardware until the problem goes away? (I have extra cables, controllers, and a spare disk shelf.)
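In case it helps with the diagnosis, this is roughly how I've been poking at the drives so far; the device name is just an example, not pulled from the actual system:

```
# Per-drive health: defect lists, error counters, self-test log
smartctl -a /dev/da12

# Kernel log, looking for transport/retry errors that would point at
# cabling or the expander rather than the disks themselves
dmesg | grep -iE "error|retry|timeout"
```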
3) What's the "proper" way to resolve this? Wipe out the pool, recreate everything from scratch, and restore from backup? I don't mind doing that if I have to, but I'd prefer to fix it without taking everything offline if possible. If I track down and fix the hardware issue, will the disks right themselves, or do I need to coax them back to "online" somehow?
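My rough plan, assuming the second vdev's trouble turns out to be cabling or the expander rather than the disks themselves, would be something like the below. Please tell me if any step is wrong or dangerous; the pool and device names are placeholders:

```
# After reseating/replacing the suspect cabling, reset the error state so
# ZFS re-evaluates the affected devices
zpool clear tank

# Replace the genuinely failed disk in the first vdev with a cold spare
# (da11 = failed drive, da24 = replacement; names are placeholders)
zpool replace tank da11 da24

# Once the resilver finishes, verify everything by reading all data back
zpool scrub tank
```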