Help identifying server status

Status: Not open for further replies.

Joined: Nov 2, 2016 · Messages: 3
Hi all!

I work at a small production house based in California, and we have a fairly sizable FreeNAS server that everyone works off of. I'm a relatively new hire, and I was recently given all of the login info to take a look at it, because no one has been maintaining it for the last several months. I've never dabbled in FreeNAS, but I have spent a fair amount of time setting up and maintaining Ubuntu web/email/DNS servers, and I have a fairly decent knowledge of storage setups, RAID, and formatting.

So after I was given all of the info, I poked around to see what the status of everything was, and I was greeted by a flashing red warning light for a status. I'll post all of the relevant information below, but my main question is: in its current state, how safe is this array/setup? Is there any fault tolerance left, or if a drive fails, are we out the whole pool? I've already recommended backing the whole thing up, wiping it, testing all of the drives, and getting it back into a known good working state, but how mission-critical is it at this point?

The chassis is a 24-drive setup, each bay containing a WD 4TB enterprise drive. I was told it was originally set up as three RAIDZ1 vdevs pooled together, with the remaining three drives added as hot spares. I first logged in to find raidz1-2 resilvering (not entirely sure why), and that vdev now has 9 drives instead of the original 7. There are two drives not in use or not showing: drive 9 shows up but isn't in use and is available to be added to a pool, and drive 12 is missing entirely. I'm not sure if it's dead, but I cannot find it anywhere, and I'm afraid to pull it because I don't want to risk anything.
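In case it matters, this is the kind of thing I can run from the FreeNAS shell to try to match devices to serial numbers and see whether that missing drive is visible at all (device names below are just examples):

  # list every disk the controller/CAM layer can see;
  # a drive that is physically present but absent here is probably dead or unseated
  camcontrol devlist

  # map gptid labels to their daX device names
  glabel status

  # read the model and serial number off a specific disk so it can be matched to a bay
  smartctl -i /dev/da9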

Below are screenshots of everything that I believe is relevant to our setup:
[Attachments: Screen Shot 2016-11-02 at 9.05.57 AM.png, Screen Shot 2016-11-02 at 9.06.10 AM.png, Screen Shot 2016-11-02 at 9.06.27 AM.png]


When "zpool status -v" is run this is the output:
Screen Shot 2016-11-02 at 9.23.18 AM.png
 

Ericloewe (Server Wrangler, Moderator)
Joined: Feb 15, 2014 · Messages: 20,194
Oh dear, that server is in very poor shape.
  • RAIDZ1 in this kind of scenario is very irresponsible.
  • Previous disks were not replaced correctly, or something of the sort (daX designators instead of gptids; see the note below).
  • The pool has corrupted data.
  • The current resilver seems to be doing weird things.

Basically, you'll have to rebuild this. Hope you have good backups.
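To see what I mean about the labels: FreeNAS normally adds pool members by gptid, so a healthy zpool status shows gptid/... entries rather than raw daX devices. You can map one to the other from the shell (exact output varies by version):

  # list gptid labels alongside the daX devices they live on;
  # a pool member that shows up only as daX in zpool status has no GPT label at all
  glabel status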
 
Joined: Nov 2, 2016 · Messages: 3
I'm definitely going to look into RAIDZ2/3 for the rebuild. What might you recommend for a configuration with the current number/size of drives?
 

danb35 (Hall of Famer)
Joined: Aug 16, 2011 · Messages: 15,504
I've already recommended backing the whole thing up, wiping it, testing all of the drives, and getting it back into a known good working state,
That's good advice, especially with the metadata errors that are showing up. Hopefully you already have backups, and if so, I'd make sure they're walled off from whatever backup you run at this point (i.e., make sure the backup you run today doesn't overwrite the one you had from last week or last month).
but how mission-critical is it at this point?
If the wrong drive fails (i.e., one of the drives in the raidz1-1 set), all your data will go away. That's pretty mission-critical, IMO. You can survive a failure in raidz1-0 or raidz1-2 at the moment, though.
What might you recommend for a configuration with the current number/size of drives?
With 24 disks, off the cuff, I'd recommend three eight-disk RAIDZ2 vdevs. What's the server being used for? That might affect the pool layout suggestion.
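For reference, the layout I have in mind would look roughly like this (pool and device names are placeholders; on FreeNAS you'd actually build it through the GUI volume manager so the disks get proper gptid labels):

  # three 8-disk RAIDZ2 vdevs in a single pool -- sketch only, your device names will differ
  zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
    raidz2 da16 da17 da18 da19 da20 da21 da22 da23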
 
Joined: Nov 2, 2016 · Messages: 3
What would be our best route to back up the data that's currently on there, knowing that we'll be using a new RAIDZ scheme with less total storage? The setup as it is is fairly full, so I don't know whether the standard approach of backing up the configuration plus the data to another server (which we're looking to rent short term) would work, given that our available storage will go down.
 

danb35 (Hall of Famer)
Joined: Aug 16, 2011 · Messages: 15,504
You said you had 24 disks: 21 active in RAIDZ1 vdevs and three spares. I'm suggesting having all 24 disks active in RAIDZ2 vdevs, with no spares--at least, no hot spares (one or two spare disks would likely be a good idea; burn them in and test them thoroughly, then put them aside for when a disk fails). Storage capacity should be identical: three 7-disk RAIDZ1 vdevs give 3 × 6 = 18 data disks, and three 8-disk RAIDZ2 vdevs also give 3 × 6 = 18 data disks. Though if you're more than about 80% full, you really need to add capacity or reduce contents anyway.
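For the burn-in, at a minimum I'd run a long SMART self-test on each spare before putting it on the shelf; something like this from the shell (device name is just an example):

  # start an extended (long) self-test; it runs in the background on the drive itself
  smartctl -t long /dev/da0
  # once it has had time to finish, check the result in the self-test log
  smartctl -l selftest /dev/da0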
 

Ericloewe (Server Wrangler, Moderator)
Joined: Feb 15, 2014 · Messages: 20,194
Though if you're more than about 80% full
It's 96% full, which is absolutely critical.

Basically, OP, you're going to need more drives or bigger drives. That's true even if you were to stay with RAIDZ1, which I'd advise against.
 

Stux (MVP)
Joined: Jun 2, 2016 · Messages: 4,419
Replace one of the vdevs with 6TB or 8TB drives. That will get you some more space, and you'll have a few 4TB drives left over as spares.

Hot spares are a waste here; the same number of drives could have gone into three vdevs of 8-way RAIDZ2.

RAIDZ1 was irresponsible in this scenario.

Long term, it may be worthwhile to think about adding another 24-bay chassis to extend the first one. It sounds like your data requirements are only going to grow.

You need to find roughly 72TB of storage to back up the contents while you repair the pool (21 drives in RAIDZ1 works out to 18 data disks × 4TB, and the pool is nearly full)!

Maybe that extra chassis is a good idea?
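If you do end up renting a box for that, plain ZFS replication over SSH is probably the simplest way to move everything (pool, snapshot, and host names below are placeholders):

  # snapshot the whole pool recursively, then stream it to a ZFS system on the other end
  zfs snapshot -r tank@pre-rebuild
  zfs send -R tank@pre-rebuild | ssh backup-host zfs receive -F backuppool/tank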
 

Bidule0hm (Server Electronics Sorcerer)
Joined: Aug 5, 2013 · Messages: 3,710
I wonder if the SMART tests are correctly configured, too, because I see a lot of checksum errors and no SMART alerts.
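An easy way to check from the shell (sh syntax; the disk list comes from the kernel, so it should cover every drive):

  # print the health verdict and SMART attributes for every disk the kernel knows about
  for d in $(sysctl -n kern.disks); do
    echo "== $d =="
    smartctl -H -A /dev/$d
  done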
 

Ericloewe (Server Wrangler, Moderator)
Joined: Feb 15, 2014 · Messages: 20,194
It's fairly clear to me that the whole thing is best done from scratch. The level of trust in the current configuration is close to zero.
 