My First Alert

B34N · Nov 8, 2014

If there is a place I can go where all of this is covered, please just direct me there.
I received the alert: "WARNING: The volume BeanFreeNAS (ZFS) status is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'."

I did a "zpool status" and got back:

Code:

  pool: BeanFreeNAS
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 1h12m with 0 errors on Sun Oct 19 01:12:14 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    BeanFreeNAS                                     ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/f9303580-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0 1.33K
        gptid/f9fb5b7a-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0   786
        gptid/fac43542-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0
        gptid/fb8cc5f6-3b6c-11e4-ad9c-002590472d99  ONLINE       0     0     0

errors: No known data errors
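
For readers who want to pick out the affected devices from output like the above without reading the table by eye, here is a minimal sketch. It assumes the column layout shown in this post (NAME, STATE, READ, WRITE, CKSUM); the function name `nonzero_error_devices` is made up for illustration and is not part of ZFS.

```shell
#!/bin/sh
# List rows from `zpool status` output (read on stdin) whose
# READ, WRITE, or CKSUM column is not "0".
# Hypothetical helper; assumes the table layout shown in this thread.
nonzero_error_devices() {
    awk '
        # The config table starts after the header row beginning with NAME.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # The table ends at the "errors:" summary line.
        $1 == "errors:" { in_table = 0 }
        # Counters can be abbreviated (e.g. 1.33K), so compare as strings.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Typical use on the NAS itself (pool name taken from the post above):
#   zpool status BeanFreeNAS | nonzero_error_devices
```

On the output above, this would list only the two gptid devices with checksum counts of 1.33K and 786.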

Look at my signature for my setup details. All very new equipment so I'd be disappointed to see a failure this early.

I still need to "bulletproof" my system with UPSs and additional backup sources. None of the current data is critical but it would be a major PITA to replace.

Thank you,
B34N

pjc · Nov 8, 2014

What does smartctl say about each of your drives? (smartctl -x /dev/daX) Anything in the /var/log/messages or dmesg?

Ericloewe · Nov 9, 2014

Good news is ZFS is doing its job and all data is still fine.

Bad news is you may have two bad drives. Did you burn them in?

B34N · Nov 9, 2014

pjc said:
What does smartctl say about each of your drives? (smartctl -x /dev/daX) Anything in the /var/log/messages or dmesg?

How do I check dmesg?
I see a bunch of "Conversion error: Illegal multibyte sequence" in my messages but I'm not sure what to look for.

Is this how I was supposed to run smartctl?
http://pastebin.com/bS1DSEKm

Ericloewe said:
Good news is ZFS is doing its job and all data is still fine.
Bad news is you may have two bad drives. Did you burn them in?

No, I didn't do a burn-in test. This was my first system build in over a decade and I'm very rusty. I just followed the instructions I could gather from this community and didn't see anything about burn-in testing.

SweetAndLow · Nov 9, 2014

Two things I notice. First, your drives are getting a little too hot; you should look into better cooling so you can keep them under 40°C at full load. Second, you should run SMART tests on all your drives. To automate this, find cyberjock's post about it.
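
As a rough illustration of the temperature check, the sketch below pulls the drive temperature out of `smartctl -A` output. It assumes the drive reports temperature as SMART attribute 194 with the raw value in the tenth column (some drives use attribute 190 instead). The helper names `drive_temp_c` and `is_too_hot` are hypothetical, and the device names in the usage comment are guesses.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 194 (Temperature_Celsius) from
# `smartctl -A` output supplied on stdin. Assumes the raw value is the
# tenth whitespace-separated column, as in the standard attribute table.
drive_temp_c() {
    awk '$1 == 194 { print $10; exit }'
}

# Succeed (exit 0) when the temperature is at or above the limit.
# The default of 40 comes from the advice in this thread.
is_too_hot() {
    temp=$1
    limit=${2:-40}
    [ -n "$temp" ] && [ "$temp" -ge "$limit" ]
}

# Typical use on the NAS (device names are assumptions):
#   for dev in /dev/da0 /dev/da1 /dev/da2 /dev/da3; do
#       t=$(smartctl -A "$dev" | drive_temp_c)
#       if is_too_hot "$t"; then echo "$dev is at ${t}C"; fi
#   done
```

A self-test can be started by hand with `smartctl -t short /dev/daX` or `smartctl -t long /dev/daX`; the results show up in later `smartctl -x` output.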

pjc · Nov 9, 2014

B34N said:
How do I check dmesg?

sudo dmesg

Is this how I was supposed to run smartctl?

Yes, and as @SweetAndLow noticed, several of your drives are running pretty hot. 40°C and higher is to be avoided.

That said, I'm not seeing any indications of read/write errors on the disks themselves, but you may have a bad SATA connection somewhere:

Two of your drives show this:

0x0009 2 4 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 4 Device-to-host register FISes sent due to a COMRESET

And the other two have counts of 8 and 9 respectively (with fewer power cycles). Similarly, the vendor-specific counter for the first two drives is around 700K, and for the other two 3.2M
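
To compare these counters across drives without scrolling through the full `smartctl -x` report, something like the following may help. It is only a sketch: it assumes the four-column layout (ID, size, value, description) seen in the lines quoted above, and `nonzero_phy_counters` is a made-up name.

```shell
#!/bin/sh
# Print SATA Phy event counters with a nonzero value from
# `smartctl -l sataphy` output supplied on stdin.
# Assumed row layout: ID  Size  Value  Description...
nonzero_phy_counters() {
    awk '$1 ~ /^0x[0-9a-fA-F]+$/ && $3 + 0 != 0 {
        # Rebuild the description from the remaining columns.
        desc = $4
        for (i = 5; i <= NF; i++) desc = desc " " $i
        print $1, $3, desc
    }'
}

# Typical use (device names are assumptions):
#   for dev in /dev/da0 /dev/da1 /dev/da2 /dev/da3; do
#       echo "== $dev"; smartctl -l sataphy "$dev" | nonzero_phy_counters
#   done
```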

B34N · Nov 9, 2014

SweetAndLow said:
Two things I notice. First, your drives are getting a little too hot; you should look into better cooling so you can keep them under 40°C at full load. Second, you should run SMART tests on all your drives. To automate this, find cyberjock's post about it.

I just added two 80mm fans to the case. One brings in air directly across the four HDs and another to pull air out of the case in the rear. It was my plan to install them, I just didn’t get to it until now. I'll look for and review cyberjock's post about automating smart tests.

pjc said:
sudo dmesg

http://pastebin.com/uBRuhGaW

pjc said:
That said, I'm not seeing any indications of read/write errors on the disks themselves, but you may have a bad SATA connection somewhere:

Two of your drives show this:

And the other two have counts of 8 and 9 respectively (with fewer power cycles). Similarly, the vendor-specific counter for the first two drives is around 700K, and for the other two 3.2M

Should I be worried? Should I reset the counters and see if the alert returns? I can reseat or replace SATA cables. That's an easy and cheap solution.

Thank you.
B34N

pjc · Nov 9, 2014

B34N said:
Should I be worried? Should I reset the counters and see if the alert returns? I can reseat or replace SATA cables. That's an easy and cheap solution.

I honestly don't know. I don't see anything in your dmesg/kernel logs that provides any indication of what kind of "unrecoverable error" might have occurred.

The only oddity I saw was in those PHY numbers. It's not hugely uncommon for those numbers to creep up on reboot, but it's a little surprising that yours diverge as much as they do.

@cyberjock, any idea what else might cause an "unrecoverable error"?

cyberjock · Nov 9, 2014

There's nothing specific to the error (or any data you've provided) that identifies a definitive culprit. You *can* try checking that your power and data cables are secure. ZFS is protecting your data, which is the good news. Unfortunately the bad news is that if you weren't using ZFS you'd have silent corruption right now.

If I had to guess what was going on, I'd think it was crappy power to your system or bad SATA cables. Now, normally bad SATA cables cause UDMA CRC errors, of which you have none. So if I were a betting man, I'd try another power supply first.
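
For anyone wanting to check the UDMA CRC count on their own drives, a small sketch follows. It assumes the drive exposes it as SMART attribute 199 in the standard `smartctl -A` table, with the raw value in the tenth column. `smart_raw_value` is a hypothetical helper name, and the device name in the usage comment is a guess.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute, selected by numeric ID,
# from `smartctl -A` output supplied on stdin.
# Assumes the raw value is the tenth whitespace-separated column.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Typical use (device name is an assumption):
#   smartctl -A /dev/da0 | smart_raw_value 199
# A count that keeps rising after reseating cables would point back at
# the cabling; a count that stays at zero is consistent with the guess
# above that the cables are not the cause.
```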
