Replaced/resilvered HDD and now all HDDs are degraded and the pool is completely gone

tsnx

Cadet
Joined
Dec 15, 2023
Messages
3
OS Version: TrueNAS-SCALE-22.02.4
Motherboard: Supermicro X10SL7-F
CPU: Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
RAM: 4x8GB Crucial DDR3 ECC RAM 1600MHz
Controller Hardware:
8x SAS2 (6Gbps) via Broadcom 2308 flashed into IT mode, 2 SATA (6Gbps), 4 SATA (3Gbps)
HDD Hardware/Layout:
5x3TB SAS HGST HUS724030ALS640 (plugged into the SAS2 ports)
3x3TB SAS WD Enterprise-class WD3001FYYG (plugged into the SAS2 ports)
2x128GB PNY SSDs for boot (plugged into the 2x SATA 6Gbps ports)
I had 10x HGST drives in total, so 2 spares for replacement.
The WD HDDs were out of an old server I had and were kept as temporary backup HDD replacements (10 in total).
All drives were tested and fine before being put to use.
RAID: RAIDZ2 using all 8x3TB SAS HDDs in one pool; the boot SSDs are just mirrored

So this morning I had an email notification from my server telling me 2 HDDs had faulted. One was practically dead and the other had about 1k write errors. The one thing that confused me was that this happened on 12th December but I had only just got the email, so I'm a bit unsure what happened there.
Anyhow, I followed the usual procedure for replacing an HDD. The first replacement went completely fine and only took around an hour or so to resilver. I then started the 2nd one and had to go to work. Once I got home I checked on it and, to my surprise, it was still going, with an INSANE number of errors. I've never seen anything remotely as bad! My heart just dropped. My pool was completely gone and I legit don't know where to go from here. I can't even use the shell because it just hangs when I first access it.

Sadly I'm presuming I've lost all, or at least most, of the data (roughly 6TB). I suspect either there's been a power issue or the HBA somehow crapped itself during the 2nd resilver.

The resilver is still going and all HDDs now show as degraded, aside from the 2 replaced ones, which are online, and 1 other that is faulted (see images).

I do have a backup on another pool (an external HDD) that I plug in and import once every 1-2 weeks or so for my important data, like password vaults and such. I wasn't able to do a full replication because I was limited by ports and by how much storage I needed, so I opted to back up just the important data to an external HDD until I got another capable server.

I have to be the unluckiest guy in the world though. I literally ordered a new 2nd-hand server from eBay yesterday (PowerEdge T430, really good price) to treat myself a bit towards Christmas, and was planning on doing a full migration onto it: replacing all drives one by one with new ones in a couple of days, then using this current server as a backup server.

Okay, back to the point: I've had practically no major issues in the 3 years I've been running this server, aside from the odd HDD starting to get errors and then being replaced.
My main question is: where would you go from here if you were in my shoes, and what would be the best way to MAYBE salvage some data?



Please let me know if I have missed some information in the specs or otherwise. This is my first post and I'll happily provide you with what I can if needed (and if I'm able to). Thank you.
 

Attachments

  • pool-status.png (56.3 KB)
  • pool-storage.png (18.9 KB)
  • shell.png (19.6 KB)
  • truenas-specs.png (14.4 KB)

tsnx

Cadet
Joined
Dec 15, 2023
Messages
3
Just a quick update. Since posting, I've been searching through these forums again for anything remotely similar, and during that time the server web GUI has started refusing to connect. I've not touched or attempted anything, so it just seems like everything that could possibly go wrong is going wrong.

The server is still on and running, with no restarts or the like. At this point I have no idea what's going on. I can't SSH into it either.

All I'm getting now is "Connecting to TrueNAS ..." etc as shown in the picture.

I've not powered it off in case some miracle happens (yeah right) and because it is still resilvering. I just didn't want to mess things up even further until I got more information on the matter from more experienced members.

I was planning on powering it off until the PowerEdge T430 arrives, then attempting things with the new hardware.
I'm guessing it's safe to assume a power-off wouldn't really make anything worse at this point?
 

Attachments

  • webgui-fail.png (11.5 KB)

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I would leave the server powered on. Make sure it has proper ventilation. Even an external house fan might help.

Next, a few more details:
  • What is the make and model of the disks?
  • How are the disks connected? Make & model please.
  • Are there only 8 disks in your RAID-Z2 vdev?
There are known problems with certain disks (SMR), LSI HBAs are known to need cooling (any dust bunnies?), and of course hardware RAID controllers are strongly discouraged.
 

tsnx

Cadet
Joined
Dec 15, 2023
Messages
3
Ah sorry, I put them at the top of the initial post inside the spoiler button thing so the post wasn't massive.

They're all SAS drives.
The HGST drives listed there were bought from a company that sold server hardware, where one of my friends worked. They were "older" drives (2017), by about 2 years or so at that point, but practically unused as far as I can remember.

The older 2014 WD drives, which I bought in roughly 2016, were pretty much the same story but with slightly more cycles; I added a lot of my own over the time I had them running. I never had any issue with them though, so I tossed them aside in the same padded HDD box as temporary backup replacements.

All of them were tested by the sellers before being sent to me, and also by myself once they arrived. Every single one has worked perfectly fine aside from two HGST drives that started to get small read/write errors over 12 months ago, so I replaced them, and then another about 6 months ago. It's been fine since. Every scrub and SMART test has gone without a hitch up until the 12th of December, as explained in the main post, which led me to replacing one of them, then another, then this whole ordeal.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Sorry I overlooked the spoiler tag in your first post.

The drives look good. On occasion LSI HBAs can overheat, causing noticeable errors. I don't remember which type: read, write or checksum. Thus my suggestion to make sure there is cool air flowing into the server.

I don't have any other suggestions at this point.
 