ATA Error Count Rising.... Hard drive flaking - Gen8

schnoodles

Cadet
Joined
Jan 10, 2020
Messages
4
Hey.

Just warning everyone that I built this NAS 4 years ago and haven't touched it much, so I am quite a bit out of the loop.

I am running an HP Gen8 with four 5TB WD Green HDDs on the latest FreeNAS-11.2-U7.

The HDD part of the Dashboard says Healthy for all disks. However, I am getting an error every few seconds:

Code:
Device: /dev/ada2, ATA error count increased from 59693 to 60134
          Fri, 10 Jan 2020 04:24:35 GMT
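
For a sense of scale, the jump reported in that alert can be worked out from the two counters. This is only a minimal sketch, assuming the alert text keeps the format shown above; `ata_error_delta` is a hypothetical helper name, not part of FreeNAS or smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: reads smartd alert lines on stdin and prints
# "<device> <new errors since the previous check>" for each matching line.
ata_error_delta() {
    # Pull out the device, the old counter and the new counter.
    sed -n 's/.*Device: \([^,]*\), ATA error count increased from \([0-9]*\) to \([0-9]*\).*/\1 \2 \3/p' |
    while read -r dev old new; do
        echo "$dev $((new - old))"
    done
}

# Usage:
# echo 'Device: /dev/ada2, ATA error count increased from 59693 to 60134' | ata_error_delta
```

For the alert above this prints `/dev/ada2 441`, i.e. 441 new ATA errors between two checks.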


I ran smartctl -a /dev/ada2 and it returned all sorts of crazy output (when it wants to work).


Most of the time when I run it, though, I just get:

Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


So I am assuming something is very wrong here.

I tried to run zpool status and I got


So I am really a bit out of my depth here.

My questions really are:

1. Is my HDD completely screwed? Should I kick it off a cliff?
2. If it is screwed, should I just buy a WD 5TB Blue? (I heard they replaced Green. I only bought Green as that was meant to be the NAS HDD.)
3. If I buy a new HDD, can I just plug it in and have the RAID rectify it?
 
Joined
May 10, 2017
Messages
838
The disk appears to be failing. You should run an extended SMART test to confirm, and you should also have regular SMART tests scheduled, both short and long.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Also, don't use pastebin... it's against the forum rules... please paste the text in code tags here and use spoilers if needed to condense the message.
 

schnoodles

Cadet
Joined
Jan 10, 2020
Messages
4
Disk appears to be failing, you should run an extended SMART test to confirm, you should also have regular SMART tests scheduled, both short and long.

SMART tests run roughly every 2 hours, and this is what gets emailed to me:

Code:
The following alert has been cleared:
* Unable to run alert source 'HasUpdate'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/alert.py", line 465, in __run_source
    alerts = (await alert_source.check()) or []
  File "/usr/local/lib/python3.6/site-packages/middlewared/alert/base.py", line 100, in check
    return await self.middleware.run_in_thread(self.check_sync)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1009, in run_in_thread
    raise result
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1015, in _run_in_thread_wrap
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/../alert/source/update.py", line 34, in check_sync
    path = self.middleware.call_sync("notifier.get_update_location")
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1137, in call_sync
    return fut.result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1069, in _call
    return await run_method(methodobj, *args)
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1009, in run_in_thread
    raise result
  File "/usr/local/lib/python3.6/site-packages/middlewared/main.py", line 1015, in _run_in_thread_wrap
    return f(*args, **kwargs)
  File "/usr/local/www/freenasUI/middleware/notifier.py", line 1526, in get_update_location
    syspath = c.call('systemdataset.config')['path']
  File "/usr/local/www/freenasUI/middleware/notifier.py", line 1526, in get_update_location
    syspath = c.call('systemdataset.config')['path']
  File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 447, in call
    raise CallTimeout("Call timeout")
middlewared.client.client.CallTimeout: Call timeout


Current alerts:
* Device: /dev/ada2, not capable of SMART self-check
* Device: /dev/ada2, Read SMART Error Log Failed
* Device: /dev/ada2, Read SMART Self-Test Log Failed
* Device: /dev/ada2, ATA error count increased from 59693 to 60134
* Device: /dev/ada2, failed to read SMART Attribute Data


Unfortunately, when I try to run a long test manually through the shell, I get:

Code:
root@rivendell:~ # smartctl -t long /dev/ada2
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


The reason I am stumped is that on the dashboard it looks like this:

[Attachment: harddrive_health_ada2.png]
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
The dashboard won't show a problem until a disk actually fails or causes data loss... your pool is actually healthy because ZFS and low-level disk functions have saved you so far.

SMART tests are there to warn you before failures cause real problems with your data.

You probably shouldn't be running short tests every 2 hours... once a day is enough. Maybe that's why your long tests fail as they can't finish before another short test starts?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@sretalla is correct: running Short SMART tests all the time is overkill, and once a day is good. A Long test weekly or every two weeks is good as well. You should be able to change the SMART test settings, then open a shell and manually start a Long SMART test.

BUT, it looks like you possibly have bigger problems. You seem to be getting a lot of error messages. Use
Code:
smartctl -x /dev/ada2
to examine the full drive results.

My Advice:
1) Power off the FreeNAS machine.
2) Unplug and reseat the drive data cable; data cables do go bad. If you feel froggy, you might swap the data cable between two drives (just at the drive ends if possible) to see if the problem moves to the other drive.
3) Power on the machine.
4) Pay attention to the console during the boot sequence, look for error messages.
5) Open a Shell window or SSH window and run a SMART Short test. Once it is complete, post the SMART results.
6) Next run a SMART Long test and again, once it's complete post the SMART results.
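
Steps 5 and 6 could be sketched roughly like this. It is only a sketch under assumptions: `run_selftest` is a made-up helper name, `/dev/ada2` is taken from the alerts in this thread, and the wait times in the usage lines are placeholders (smartctl prints its own estimate of the test duration when the test starts).

```shell
#!/bin/sh
# Sketch of steps 5 and 6: start a SMART self-test, wait, then print the results.
# DISK defaults to the drive named in the alerts; override it if yours differs.
DISK="${DISK:-/dev/ada2}"

run_selftest() {
    # $1 = test type (short or long)
    # $2 = seconds to wait before reading the results (assumed, adjust to the
    #      duration smartctl reports when the test starts)
    smartctl -t "$1" "$DISK" || return 1   # stop here if the test cannot be started
    sleep "$2"
    smartctl -a "$DISK"                    # full report, including the self-test log
}

# Usage (commented out so that sourcing this file starts nothing):
# run_selftest short 180
# run_selftest long 36000
```

If the drive still answers with "Read Device Identity failed", the first smartctl call fails and the function returns without waiting, which by itself points back at the cable or the drive.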

Drive ada2 does have 30 bad sectors (ID 5), but I see no Pending Sector errors, so these could be old. With that said, the drive is still suspect until you can run SMART testing on it.
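
To keep an eye on those counters between tests, something like the following could pull just the sector-related attributes out of the attribute table. This is a sketch, assuming the usual ten-column `smartctl -A` layout with the raw value in the last column; `sector_counts` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical filter: reads `smartctl -A` output on stdin and prints the ID,
# name and raw value of attributes 5 (reallocated), 197 (pending) and
# 198 (offline uncorrectable).
sector_counts() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}

# Usage:
# smartctl -A /dev/ada2 | sector_counts
```

A reallocated count that stays put with zero pending sectors fits the "could be old" reading; a count that keeps climbing does not.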

Good Luck!
 