My First Alert

Status
Not open for further replies.

B34N

Dabbler
Joined
Jul 6, 2014
Messages
32
If there is a place where all of this is covered, please just direct me there.
I received the alert: "WARNING: The volume BeanFreeNAS (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'."

I did a "zpool status" and got back:
Code:
  pool: BeanFreeNAS
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 1h12m with 0 errors on Sun Oct 19 01:12:14 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    BeanFreeNAS                                     ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/f9303580-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0 1.33K
        gptid/f9fb5b7a-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0   786
        gptid/fac43542-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0
        gptid/fb8cc5f6-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0

errors: No known data errors
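
For anyone who wants to pick the affected devices out of output like this automatically, here is a minimal Python sketch. It assumes the five-column config table shown above (NAME, STATE, READ, WRITE, CKSUM) followed by a blank line; the function name `devices_with_errors` is made up for illustration and is not part of any ZFS tooling.

```python
SAMPLE = """
  pool: BeanFreeNAS
 state: ONLINE
config:

    NAME                                            STATE     READ WRITE CKSUM
    BeanFreeNAS                                     ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/f9303580-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0 1.33K
        gptid/f9fb5b7a-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0   786
        gptid/fac43542-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0
        gptid/fb8cc5f6-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            # The header row marks the start of the device table.
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            # Assumption: a blank line ends the device table.
            break
        if len(fields) == 5:
            name, _state, read, write, cksum = fields
            # Counters are compared as text, so "1.33K" counts as nonzero.
            if (read, write, cksum) != ("0", "0", "0"):
                flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, counters in devices_with_errors(SAMPLE).items():
        print(name, counters)
```

In practice the text would come from `zpool status` rather than a pasted sample; here only the first two gptid entries are reported, matching the checksum counts above.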


See my signature for my setup details. It's all very new equipment, so I'd be disappointed to see a failure this early.

I still need to "bulletproof" my system with UPSs and additional backup sources. None of the current data is critical but it would be a major PITA to replace.

Thank you,
B34N
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
What does smartctl say about each of your drives? (smartctl -x /dev/daX) Anything in /var/log/messages or dmesg?
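
If running the command by hand for every drive gets tedious, something like the following Python sketch could collect all the reports in one go. The device names `da0` through `da3` are assumptions (substitute whatever your system actually lists), the helper names are made up for illustration, and smartctl usually needs root.

```python
import shutil
import subprocess

# Hypothetical device names; replace them with the drives on your system.
DEVICES = ["da0", "da1", "da2", "da3"]


def smartctl_command(device):
    """Build the argument list for a full SMART report on one device."""
    return ["smartctl", "-x", f"/dev/{device}"]


def collect_reports(devices):
    """Run smartctl for each device and return {device: report text}."""
    reports = {}
    for device in devices:
        # check=False: smartctl uses a nonzero exit status to flag drive
        # problems, so a nonzero status should not abort the loop.
        result = subprocess.run(
            smartctl_command(device),
            capture_output=True,
            text=True,
            check=False,
        )
        reports[device] = result.stdout
    return reports


if __name__ == "__main__":
    if shutil.which("smartctl") is None:
        print("smartctl not found; install smartmontools first")
    else:
        for device, report in collect_reports(DEVICES).items():
            print(f"===== {device} =====")
            print(report)
```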
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Good news is ZFS is doing its job and all data is still fine.

Bad news is you may have two bad drives. Did you burn them in?
 

B34N

Dabbler
Joined
Jul 6, 2014
Messages
32
What does smartctl say about each of your drives? (smartctl -x /dev/daX) Anything in /var/log/messages or dmesg?
How do I check dmesg?
I see a bunch of "Conversion error: Illegal multibyte sequence" in my messages but I'm not sure what to look for.

Is this how I was supposed to run smartctl?
http://pastebin.com/bS1DSEKm

Good news is ZFS is doing its job and all data is still fine.
Bad news is you may have two bad drives. Did you burn them in?
No, I didn't do a burn-in test. This was my first system build in over a decade and I'm very rusty. I just followed the instructions I could gather from this community and didn't see anything about burn-in testing.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Two things I notice: your drives are getting a little too hot, so you should look into better cooling to keep them under 40°C at full load. You should also run SMART tests on all your drives. To automate this, find cyberjock's post about it.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
How do I check dmesg?
sudo dmesg

Is this how I was supposed to run smartctl?
Yes, and as @SweetAndLow noticed, several of your drives are running pretty hot. 40°C and higher is to be avoided.

That said, I'm not seeing any indications of read/write errors on the disks themselves, but you may have a bad SATA connection somewhere:

Two of your drives show this:
0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
And the other two have counts of 8 and 9 respectively (with fewer power cycles). Similarly, the vendor-specific counter for the first two drives is around 700K, and for the other two 3.2M.
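
To compare these counters across drives without eyeballing four reports, a rough Python sketch like the one below can pull them out of the smartctl text. It assumes each row has the layout of the two lines quoted above (hex ID, size, value, description); `parse_phy_counters` is a made-up name for illustration.

```python
SAMPLE = """\
0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
"""


def parse_phy_counters(text):
    """Return {counter_id: (value, description)} for Phy event counter rows."""
    counters = {}
    for line in text.splitlines():
        # Split into at most four parts so the description stays intact.
        fields = line.split(None, 3)
        if len(fields) != 4 or not fields[0].startswith("0x"):
            continue
        counter_id, _size, value, description = fields
        if not value.isdigit():
            # Skip rows whose value column is not a plain number.
            continue
        counters[int(counter_id, 16)] = (int(value), description.strip())
    return counters


if __name__ == "__main__":
    for counter_id, (value, description) in parse_phy_counters(SAMPLE).items():
        print(f"0x{counter_id:04x}: {value} ({description})")
```

Running the same function over each drive's report makes it easy to see which drives diverge.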
 

B34N

Dabbler
Joined
Jul 6, 2014
Messages
32
Two things I notice: your drives are getting a little too hot, so you should look into better cooling to keep them under 40°C at full load. You should also run SMART tests on all your drives. To automate this, find cyberjock's post about it.
I just added two 80mm fans to the case. One brings in air directly across the four HDs and the other pulls air out of the rear of the case. It was always my plan to install them; I just didn't get to it until now. I'll look for and review cyberjock's post about automating SMART tests.

sudo dmesg
http://pastebin.com/uBRuhGaW
That said, I'm not seeing any indications of read/write errors on the disks themselves, but you may have a bad SATA connection somewhere:

Two of your drives show this:

And the other two have counts of 8 and 9 respectively (with fewer power cycles). Similarly, the vendor-specific counter for the first two drives is around 700K, and for the other two 3.2M.

Should I be worried? Should I reset the counters and see if the alert returns? I can reseat or replace SATA cables. That's an easy and cheap solution.

Thank you.
B34N
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Should I be worried? Should I reset the counters and see if the alert returns? I can reseat or replace SATA cables. That's an easy and cheap solution.
I honestly don't know. I don't see anything in your dmesg/kernel logs that provides any indication of what kind of "unrecoverable error" might have occurred.

The only oddity I saw was in those PHY numbers. It's not hugely uncommon for those numbers to creep up on reboot, but it's a little surprising that yours diverge as much as they do.

@cyberjock, any idea what else might cause an "unrecoverable error"?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There's nothing specific to the error (or any data you've provided) that identifies a definitive culprit. You *can* try checking that your power and data cables are secure. ZFS is protecting your data, which is the good news. Unfortunately the bad news is that if you weren't using ZFS you'd have silent corruption right now.

If I had to guess what was going on, I'd think it was crappy power to your system or bad SATA cables. Now, bad SATA cables normally cause UDMA CRC errors, and you have none. So if I were a betting man, I'd get another power supply and try that first.
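
For reference, the UDMA CRC count lives in SMART attribute 199 on most SATA drives, and a small Python sketch like this one can read it out of the attribute table that smartctl prints. The sample row is hypothetical, not taken from the drives in this thread, and `attribute_raw_value` is a made-up helper name.

```python
# Hypothetical attribute row in the usual smartctl table layout:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
SAMPLE = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0"


def attribute_raw_value(attributes_text, attribute_id):
    """Return the raw value of one SMART attribute, or None if it is absent."""
    for line in attributes_text.splitlines():
        fields = line.split()
        # Assumption: the raw value is the tenth column and starts with
        # an integer, which holds for simple counters like attribute 199.
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None


if __name__ == "__main__":
    crc_errors = attribute_raw_value(SAMPLE, 199)
    if crc_errors is None:
        print("attribute 199 not reported by this drive")
    elif crc_errors > 0:
        print(f"{crc_errors} UDMA CRC errors: check the SATA cables")
    else:
        print("no UDMA CRC errors recorded")
```

A zero here is consistent with looking at the power supply before the cables, as suggested above.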
 