ATA Error Count Rising.... Hard drive flaking - Gen8

schnoodles · Jan 10, 2020

Hey.

Just warning everyone that I built this NAS 4 years ago and haven't touched it much, so I am quite a bit out of the loop.

I am running an HP Gen8 with four 5TB WD Green HDDs on the latest FreeNAS-11.2-U7.

The HDD part of the Dashboard says Healthy for all disks.... However, I am getting an error every few seconds:

Code:

Device: /dev/ada2, ATA error count increased from 59693 to 60134
          Fri, 10 Jan 2020 04:24:35 GMT
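
To put a number on an alert like the one above, here is a minimal Python sketch (not something from the thread itself) that pulls the two counters out of the alert text and reports the jump. The function name `ata_error_delta` and the regular expression are assumptions written against the exact wording quoted above; other smartd messages may be phrased differently.

```python
import re

# Pattern written against the alert wording quoted in this post (an assumption,
# not an official smartd message specification).
ALERT_RE = re.compile(
    r"Device: (?P<device>\S+), ATA error count increased "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)


def ata_error_delta(alert_line):
    """Return (device, newly_logged_errors) for a matching alert, else None."""
    match = ALERT_RE.search(alert_line)
    if match is None:
        return None
    old = int(match.group("old"))
    new = int(match.group("new"))
    return match.group("device"), new - old


if __name__ == "__main__":
    line = "Device: /dev/ada2, ATA error count increased from 59693 to 60134"
    print(ata_error_delta(line))  # ('/dev/ada2', 441)
```

For the alert shown, that is 441 new ATA errors logged between two consecutive smartd checks.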

I ran smartctl -a /dev/ada2 and it returned all sorts of crazy output (when it wants to work):

smartctl -a /dev/ada2 - Pastebin.com

Most of the time I run it, though, I just get:

Code:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

So I am assuming something very wrong is happening here....

I tried to run zpool status and I got:

zpool status - Pastebin.com

So I am really a bit out of my depth here.

My questions really are:

1. Is my HDD completely screwed? Should I kick it off a cliff?
2. If it is screwed, should I just buy a WD 5TB Blue? (I heard they replaced Green. I only bought Green as that was meant to be the NAS HDD.)
3. If I buy a new HDD, can I just plug it in and the RAID will rectify it?

Johnnie Black · Jan 10, 2020

The disk appears to be failing. You should run an extended SMART test to confirm, and you should also have regular SMART tests scheduled, both short and long.

sretalla · Jan 10, 2020

Also, don't use pastebin... it's against the forum rules... please paste the text in code tags here and use spoilers if needed to condense the message.

schnoodles · Jan 10, 2020

Johnnie Black said:
The disk appears to be failing. You should run an extended SMART test to confirm, and you should also have regular SMART tests scheduled, both short and long.

SMART tests run roughly every 2 hours, and this is what gets emailed to me.

Code:

The following alert has been cleared:
* Unable to run alert source 'HasUpdate'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/alert.py", line 465, in __run_source
    alerts = (await alert_source.check()) or []
  File "/usr/local/lib/python3.6/site-packages/middlewared/alert/base.py", line 100, in check
    return await self.middleware.run_in_thread(self.check_sync)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1009, in run_in_thread
    raise result
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1015, in _run_in_thread_wrap
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/../alert/source/update.py", line 34, in check_sync
    path = self.middleware.call_sync("notifier.get_update_location")
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1137, in call_sync
    return fut.result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1069, in _call
    return await run_method(methodobj, *args)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1009, in run_in_thread
    raise result
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1015, in _run_in_thread_wrap
    return f(*args, **kwargs)
  File "/usr/local/www/freenasUI/middleware/notifier.py", line 1526, in get_update_location
    syspath = c.call('systemdataset.config')['path']
  File "/usr/local/www/freenasUI/middleware/notifier.py", line 1526, in get_update_location
    syspath = c.call('systemdataset.config')['path']
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 447, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout


Current alerts:
* Device: /dev/ada2, not capable of SMART self-check
* Device: /dev/ada2, Read SMART Error Log Failed
* Device: /dev/ada2, Read SMART Self-Test Log Failed
* Device: /dev/ada2, ATA error count increased from 59693 to 60134
* Device: /dev/ada2, failed to read SMART Attribute Data

Unfortunately, when I try to run a long test manually through the shell, I get:

Code:

root@rivendell:~ # smartctl -t long /dev/ada2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

The reason I am stumped is that on the dashboard everything looks healthy.

sretalla · Jan 11, 2020

The dashboard won't show a problem until a disk actually fails or causes data loss... your pool is actually healthy because ZFS and low-level disk functions have saved you so far.

SMART tests are there to warn you before failures cause real problems with your data.

You probably shouldn't be running short tests every 2 hours... once a day is enough. Maybe that's why your long tests fail as they can't finish before another short test starts?

joeschmuck · Jan 11, 2020

@sretalla is correct: running Short SMART tests all the time is overkill; once a day is good. A Long test weekly or every two weeks is good as well. You should be able to change the SMART test settings, then open a shell and manually start a Long SMART test.

BUT, it looks like you possibly have bigger problems. You seem to be getting a lot of error messages. Use

Code:

smartctl -x /dev/ada2

to examine the full drive results.

My Advice:
1) Power off the FreeNAS machine.
2) Unplug and reseat the drive data cable; data cables do go bad. If you feel froggy you might swap the data cable between two drives (just at the drive ends if possible) to see if the problem moves to the other drive.
3) Power on the machine.
4) Pay attention to the console during the boot sequence, look for error messages.
5) Open a Shell window or SSH window and run a SMART Short test. Once it is complete, post the SMART results.
6) Next run a SMART Long test and again, once it's complete post the SMART results.

Drive ada2 does have 30 bad sectors (ID 5), but I see no Pending Sector errors, so these could be old. With that said, the drive is still suspect until you can run SMART testing on it.
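
The check described above (ID 5 reallocated sectors versus ID 197 pending sectors) can be sketched roughly as follows. This is a minimal Python example that assumes the usual column layout of the smartctl attribute table, with the raw value in the tenth column. The function names are hypothetical, and the sample rows are illustrative only: the raw values 30 and 0 mirror what is described above, but the other columns are placeholders rather than this drive's actual output.

```python
def raw_values(attribute_table_text):
    """Map SMART attribute ID -> integer raw value for each attribute row."""
    values = {}
    for line in attribute_table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ...") and unrelated lines are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; ignore this row
    return values


def sector_summary(values):
    """Summarise reallocated (ID 5) and pending (ID 197) sector counts."""
    reallocated = values.get(5)
    pending = values.get(197)
    return {
        "reallocated": reallocated,
        "pending": pending,
        "suspect": bool(reallocated) or bool(pending),
    }


# Illustrative rows only; not copied from the drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       30
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(sector_summary(raw_values(SAMPLE)))
```

A non-zero reallocated count with zero pending sectors matches the "could be old" reading, but the sketch still flags the drive as suspect, in line with the advice to keep testing it.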

Good Luck!
