Zpool error questions

Status
Not open for further replies.

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Hello everyone, long time lurker, first-time poster here. This is going to be a long post, sorry.

I have had a FreeNAS server running Plex and CrashPlan for the last 2 years, and have been very happy with it...until this past week. Let me preface the rest of this by saying that after a week's worth of reading and googling, I KNOW I made some mistakes with my server build. With that said, I purchased a shiny new server based on cyberjock's recommendations, and parts are arriving today. So I am hoping to dispense with the excessive trout to the face here, as I am trying to reverse what I know now to be poor decisions made 2 years ago.

Anyway...a few days ago, out of the blue, I received an email stating "The volume red5 (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected." As this is the first error I have seen with this server, I quickly looked into things and found not one thing wrong with the machine (well, after it finished a 9-hour scrub; I was unable to access the server until that completed). I ran the various SMART tests, an overnight MEMTEST, and a separate scrub: no errors reported, no corruption found. Admittedly, I am a ZFS newb, so the lack of actual problems was a bit of a head-scratcher. I replaced the crappy SATA controller and the cables just to be safe, and moved on with my day. I was watching movies until late last night via Plex with no problems.

Cut to this morning, and now I have 2 more errors:

The volume red5 (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
The volume red5 (ZFS) state is ONLINE: One or more devices are faulted in response to IO failures.

The server is going through a scrub right now so I cannot do much, but based on the zpool status I was able to run, it is again stating that there are no problems or errors. That makes absolutely no sense to me, and it has led me to post here.

I am building a new server and intend to migrate my zpool to it, but before I even dig into that I have some questions, mostly to satisfy my curiosity. If the simple answer is "trash zpool and rebuild" then fine. But I will obsess over this so I am going to ask anyway:

1. Knowing that there are likely a variety of reasons why a zpool would "glitch" with no discernible actionable errors, is there something specific I could run to figure out whether this is a drive error or just a crappy-hardware error? I am running a prosumer motherboard with 8GB of non-ECC memory (yes, I know), using the built-in SATA controller and a cheap PCI SATA controller for the extra ports. I suspect (and hope) the problem lies solely with the PCI controller.

2. Is there anything I can do to make zfs less "glitchy"? I suspect that having new hardware within the recommended specs is going to mitigate this one, but I have also read plenty of posts that have led me to feel like zfs is one touchy beast.

3. Anyone have any suggestions for a place/service where I can keep a backup of a 24TB dataset? I was thinking of using CrashPlan Cloud, but the amount of time the initial seed will take makes me shudder. I know there are storage experts in here, so I could use the advice. I cannot afford a replication server or more hard drives, unfortunately. And anything has to be better than the towering stack of used small drives I am currently using.

Thanks for the read...

Cain
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
It's hard to say what might have happened without quite a bit more information (the output of 'zpool status -v' would be very helpful, for example, and there's no reason that couldn't be done while the scrub is running). But as to your third question...
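For reference, a few read-only commands that should be safe to run even mid-scrub (a sketch; "red5" is the pool name from this thread, and the awk one-liner is just an illustrative way to summarize the status output):

```shell
# Read-only checks, safe while a scrub is in progress.
# "red5" is the pool name from this thread.
zpool status -v red5        # per-device READ/WRITE/CKSUM counters plus any files with errors
zpool history red5 | tail   # recent pool events (scrubs, imports, property changes)
# Summarize just the checksum column for each ONLINE line of the status output:
zpool status -v red5 | awk 'NF==5 && $2=="ONLINE" {print $1, "cksum="$5}'
```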

Backing up 24 TB of data is just going to be tough. Cloud backup is great, and cheap in the case of CrashPlan, but you have to get the data there (and back, if/when you need to restore). With that much data, unless you're on Google Fiber, you're looking at a really long time to backup or restore. Is all of that 24 TB important enough to be worth backing up? Or is the bulk of it media which could be reacquired with manageable difficulty if needed?

A backup server doesn't have to be very expensive. A Dell PowerEdge T20, for example, starts at under $200. Add an 8 GB stick of RAM and the desired drives, and you're set. The problem in this case is that it only supports four drives, and the only way you'd get all your data on there would be to stripe 4 x 8 TB drives--not recommended. I can't quickly think of an inexpensive 6+-bay server, which is really what you'd want for this.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Yeah the overwhelming bulk is media...not a big deal but I figured I would ask since I was asking questions anyway. It would be time consuming to replace, but time consuming to back up too. So it sounds like I either suck it up or figure out how to buy a replication server. Got it.

As for the rest, I literally cannot run any commands during a scrub, it just sits there with no output. The gist of 'zpool status -v' though is all zeros, and no errors in files found...with a status of 'ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.' I would paste the output if I could get it...otherwise I will post it once I can get into the shell again.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
I will post smartctl output once I can get back in as well.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Wow. If a scrub brings your system to its knees, such that it's completely unresponsive to any other commands, that's bad. Good that you're upgrading with better hardware.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
It's a Q9650 CPU, 8GB, eVGA 780i motherboard, 8x 3TB WD Red + 2x 4TB WD Red. I was a little surprised at it being unresponsive, but after having read hardware recommendations I realized that I put together a monster. *shrug* not a big deal if I can't get the answers, I have a good backup and can just start over, but the nerdist in me really would love to know why it's bouncing like this.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Well, I know that a lot of SATA controllers can be flaky, so that's a reasonably likely source of the problem right there. If the error comes up again, before starting a scrub, do 'zpool status -v' and see what it shows. Once ZFS sees an error, it shouldn't clear it without some intervention.
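A minimal sketch of that workflow (pool name from the thread; 'zpool clear' is the usual intervention that resets the error counters):

```shell
# Capture the state BEFORE scrubbing or clearing; both can reset the
# counters you want to look at.
zpool status -v red5

# 'zpool status -x' prints "all pools are healthy" only when nothing is
# wrong, so it makes a handy one-line health check:
zpool status -x | grep -qv "all pools are healthy" && echo "pool needs attention"

# Only once the output is saved, reset the error counters:
zpool clear red5
```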
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Actually, now that you've said that, I realize I was not actually running a scrub; I was simply assuming ZFS was running one after whatever the "glitch" was that prompted the emails. In any case, it's doing something that involves the hard drives being thrashed for 9 hours and the server being unresponsive. As I am unaware of a system log (via web GUI or CLI), I cannot say what it's actually doing. I did run a manual scrub after the first time this happened, and the output I see is from THAT scrub. So maybe I have unwittingly overwritten whatever errors were there? Like I said, I R newbish.

In any case, I also suspect the crappy Chinese SATA controller, and I doubt there are any tests to rule that out other than not using it. Since I am rebuilding this server anyway, I will stress test the drives before using them again, and hopefully that will be the end of it.
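For anyone wondering the same thing: on FreeBSD-based FreeNAS, kernel messages (including CAM errors) typically land in /var/log/messages, and the "scan:" line of zpool status says what the pool is busy doing. A sketch, assuming default log locations:

```shell
# CAM and drive-related kernel messages end up in the system log:
grep -E "CAM status|ada[0-9]" /var/log/messages | tail -n 20

# The "scan:" line distinguishes a scrub from a resilver (or neither):
zpool status red5 | grep "scan:"
```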
 

styno

Patron
Joined
Apr 11, 2016
Messages
466
Why don't you import the pool in your new server and work from there? Best case: it's the poor SATA controller/cables, or maybe even a dying disk, and ZFS will handle it for you (slowing down the old system while doing so). Worst case: bad memory and no ECC.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Why don't you import the pool in your new server and work from there? Best case: it's the poor SATA controller/cables, or maybe even a dying disk, and ZFS will handle it for you (slowing down the old system while doing so). Worst case: bad memory and no ECC.

That's an excellent point and well taken...I think I will do exactly that. I guess it doesn't really matter why, I'm just the curious sort.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Well, I just got home and found a long string of CAM errors, basically stating that ada3, ada5, and ada7 were unreachable. The first two drives are on the on-board SATA, and the last is on that cheap controller. I hit the reset button, as the drives were still thrashing 11 hours later and the server was unresponsive beyond displaying the errors. Anyway, when the server came back up: no hard drive activity, and no errors either via status or smartctl:

  pool: red5
 state: ONLINE
  scan: scrub repaired 0 in 10h41m with 0 errors on Sun May  1 21:41:15 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        red5                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/14314b1c-3601-11e4-9418-6805ca0f2ab7  ONLINE       0     0     0
            gptid/148be5fc-3601-11e4-9418-6805ca0f2ab7  ONLINE       0     0     0
            gptid/14e45731-3601-11e4-9418-6805ca0f2ab7  ONLINE       0     0     0
            gptid/154f6773-3601-11e4-9418-6805ca0f2ab7  ONLINE       0     0     0
            gptid/15af01a1-3601-11e4-9418-6805ca0f2ab7  ONLINE       0     0     0

errors: No known data errors
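For anyone following along, a sketch of how one might triage drives after CAM timeouts like these (device names ada3/ada5/ada7 from the errors above; the smartctl flags are standard smartmontools options):

```shell
# Triage for the three drives that dropped off the bus.
# Timeouts with otherwise-clean SMART data usually point at cables,
# controller, or power rather than the disks themselves.
for d in ada3 ada5 ada7; do
  echo "== /dev/$d =="
  smartctl -H "/dev/$d"    # overall health verdict
  # The attributes that matter most for failing disks vs. bad cabling:
  smartctl -A "/dev/$d" | grep -Ei "reallocat|pending|uncorrect|udma_crc"
done
```

A climbing UDMA_CRC_Error_Count in particular tends to mean cabling or controller trouble rather than a bad platter, which fits the eventual diagnosis later in this thread.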

I am just going to get this new server going. Thanks for the comments guys.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Get new hardware; your controllers and motherboard are hurting you. You will also want to rebuild your pool using something other than RAIDZ1.
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Agreed, and is exactly what I am doing. These showed up yesterday:

SuperMicro X10SL7-F
Xeon E3-1231V3
Crucial 32GB ECC 1600
Thermaltake 80Plus Gold 650W

Just confirmed my backup was solid; now I need to copy my jails over to a spare drive, gut the 32GB caching SSD from my gaming rig to become my new boot drive, then rebuild from scratch using RAIDZ2.

Unfortunately, I am going to have to figure out how to buy larger hard drives, as I only have 8 bays to work with, and the drives I have now are a dead end. I am thinking I will buy 8TB Reds over time, swap the 3TBs out of the new pool one resilver at a time, then use the 3TBs in my gaming rig, attached to the Areca 1220 I have in it, as my new backup. This will take most of the next year to do, but it's a plan.
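For what it's worth, a rough CLI sketch of that RAIDZ2 layout (the FreeNAS web GUI is the supported way to build a pool; the device names below are placeholders):

```shell
# Hypothetical device names; an 8-wide RAIDZ2 vdev survives any two
# simultaneous drive failures, leaving 6 drives' worth of usable space.
zpool create red5 raidz2 da0 da1 da2 da3 da4 da5 da6 da7
```

With the eventual 8TB Reds, that works out to (8 - 2) x 8 = 48 TB raw usable capacity, before ZFS overhead.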
 

tenjuna

Dabbler
Joined
May 5, 2016
Messages
24
Just to follow up and finalize this: it turns out the "glitch" was the old power supply putting out some weird currents on the rail the drives were on. Once I swapped in the new power supply and put it on battery backup, I didn't have any more problems up until I upgraded the server.

Speaking of the new server, holy gigabytes Batman, this thing is schweeeeeet.

So yeah, everyone should be listening to cyberjock's hardware recommendations...don't be like me and just Superman a server out of your parts closet lol.
 