Degraded disk vs degraded vdev?

ajhieb

Cadet
Joined
Apr 10, 2022
Messages
3
So here's the good news... This is all in my home lab, there is no irreplaceable data in jeopardy, and all data is backed up anyway, so I'm happy to report this isn't one of those panicky "How do I get my data back" first posts. That said, I do have some peculiar things going on and I'm a bit puzzled as to exactly what is going on, what was the likely cause, and what's the "proper" way to resolve the issues.

The situation:
I've been in the process of upgrading my home lab recently. Among other upgrades, I got a storage server to replace my old legacy TrueNAS server. (see sig) The plan was to get the new server spun up, then just move my disk shelves to the new server and import the pools. (I've actually done that before with no problems.) This time, the new TrueNAS server took forever to boot, and when I went to import the pools, it didn't find them. I fooled around with it for a few minutes before deciding to just roll back to the old config and get the old server back online before my Plex friends and family started complaining. The legacy server also took forever to boot, and when it finally came back online, my main pool was showing up as "degraded."
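(In case it's useful to anyone reading along: from a shell, `zpool import` with no arguments scans the attached disks and lists any pools available for import, so it's a quick way to check whether the shelves are even being seen. The pool name below is a placeholder.)

```
# Lists importable pools without actually importing anything.
zpool import
#    pool: tank
#      id: 1234567890123456789
#   state: ONLINE
#  action: The pool can be imported using its name or numeric identifier.
```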

The Symptoms:
On my main pool (2x RAIDz2 vdevs, each made of 12x 6TB Seagate SAS drives)
Looking at the status of the vdevs and disks, both vdevs show up as "degraded." On the first vdev, all of the drives show up as "online" except one that shows "failed" and has several read errors. On the 2nd vdev, several of the drives have a status of "degraded" with no errors of any type on them, and the remaining disks are "online."
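On the CLI, a pool in this shape looks roughly like the sketch below (pool and disk names are placeholders, error counts illustrative; I believe what the GUI calls "failed" shows up as FAULTED here):

```
zpool status tank
  pool: tank
 state: DEGRADED
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          raidz2-0    DEGRADED     0     0     0
            da0       ONLINE       0     0     0
            da1       FAULTED     23     0     0  too many errors
            ...
          raidz2-1    DEGRADED     0     0     0
            da12      DEGRADED     0     0     0
            da13      DEGRADED     0     0     0
            ...
```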

The questions:
1) What in tarnation is a "degraded" disk? I know that when the pool says "degraded" it means it has one or more disks that aren't "online," and I'm assuming that also means it is relying on the parity information (or mirror) for data availability. So my first vdev makes perfect sense to me: I have a failed drive (I have a few cold spares, I just haven't bothered to replace it yet, but I'm comfortable with that process) and the vdev is currently relying on parity data. After I replace the failed disk and resilver, it should go back to "online." (The commands I have in mind are sketched after question 3.) But what is the difference between a "degraded" disk and a "failed" disk, and how is my 2nd vdev still available despite having several (more than the parity level allows for) "degraded" disks in the vdev? Shouldn't it be dead?
2) What likely caused this? The failed drive needs no explanation, but the "degraded" drives puzzle me. The drives seem to be "working" by most metrics, and it strikes me as unlikely that I'd have so many fail independently yet simultaneously. Is it possible that I have a bad cable/controller/expander? Is it more likely to be on the legacy system (where everything is "working" but the disks have "odd" statuses) or the new hardware (where the trouble initially started, and the pool never showed up)? Or is this just one of those situations where there is simply no easy troubleshooting besides swapping out hardware until the problem goes away? (I have extra cables, controllers, and a spare disk shelf.)
3) What's the "proper" way to resolve this problem? Wipe out the pool, recreate everything from scratch, and restore from backup? I don't mind doing that if I have to, but I'd prefer to fix it without taking everything offline if possible. If I figure out the hardware issue, will the disks right themselves, or do I need to coax them back into being online?
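For clarity, here's the shell-level version of what I'm asking about in 1) and 3), as best I understand it (pool and disk names are placeholders):

```
# Question 1: swap the failed disk for a cold spare and kick off a resilver.
zpool replace tank da1 da24

# Question 3: if the "degraded" disks were really victims of a bad
# cable/controller/expander, clear the error state so ZFS trusts them again...
zpool clear tank

# ...or nudge a single disk back individually.
zpool online tank da12
```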
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
A degraded disk is one that's been throwing I/O errors and ZFS is not happy with it.

A degraded vdev is a vdev that contains a degraded or failed disk and is not in its optimal redundancy state.
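As for "degraded" disks with no errors showing: the READ/WRITE/CKSUM counters in zpool status are runtime counters that get zeroed when the pool is re-imported or the box reboots, but the DEGRADED state itself persists until it's cleared. So after your reboots, something like this is expected (device name is a placeholder):

```
# Counters were zeroed by the reboot; the persistent device state was not.
zpool status tank | grep da12
            da12      DEGRADED     0     0     0

# "zpool clear" resets the error state, and the disk returns to ONLINE
# if it behaves afterward.
zpool clear tank da12
```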

You're running Proxmox, so if you haven't followed the guidance in the virtualization stickies, you are creating a situation where weird failures are more likely. Proxmox itself is immature for this use and doesn't work reliably for everyone. Some people have no issues. Some have problems.
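The short version of that guidance as it applies here: the TrueNAS guest needs the entire HBA handed to it with PCIe passthrough so ZFS talks to the disks directly; virtual disks layered on the hypervisor are how pools get eaten. On Proxmox that looks roughly like this (VM ID and PCI address are examples, and it assumes IOMMU is enabled on the host):

```
# Find the HBA's PCI address on the Proxmox host.
lspci | grep -i -e LSI -e SAS

# Hand the whole controller to the TrueNAS VM (VM ID 100 here).
qm set 100 -hostpci0 0000:01:00.0
```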
 

ajhieb

Cadet
Joined
Apr 10, 2022
Messages
3
Thanks for the reply. I figured the "degraded" disk was something like that, but I was just looking for some confirmation. Kinda odd that TrueNAS isn't showing any errors for the degraded disks right now (at least not on the status page), but I'll dig deeper and see what I can find.
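For the digging, the plan is to start with the SMART data on one of the "degraded" drives, along these lines (device name is a placeholder):

```
# Full SMART report; for SAS drives the interesting bits are the grown
# defect list and the read/write error counter log.
smartctl -a /dev/da12
```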

And while I didn't specifically follow the linked thread when putting together my new TrueNAS primary and backup servers, I did consult several online guides specific to the topic, and in hindsight it looks like I'm in the clear. While this is "technically" part of my production network at home, there is nothing really critical relying on TrueNAS. My Plex server going down would cause me some mild irritation, but other than that, it's all home lab duty, so nothing that can't be wiped out and recreated. And what's the fun of having a home lab if you don't have weird problems to troubleshoot?

But if worse comes to worst, I can nix Proxmox on the primary and backup TrueNAS servers and install TrueNAS on bare metal. The only reason I went with Proxmox was to get 3 nodes up so I can do high availability for the stuff I'd like to be reasonably bulletproof. (network services, etc.) If I just have to keep that stuff running on my main VM server, I can live with it. (But I probably can't live with my electricity bill if I added two more servers to the rack to get 3 dedicated VM servers.)
 