testing & re-attaching faulty drive ?

Status
Not open for further replies.

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
FreeNAS version: 9

Hi folks,

I'm pretty new to FreeNAS (or at least I don't have much experience). Right now I have a faulty drive, so I want to do everything right.

I have a RAID-Z2 with 8 drives, and one of them is faulty right now. First of all: the documentation says you can see the resilvering status by calling zpool status, but there I only see a scrub, no resilvering. 1) Is that the same thing in this case, or how can I see the resilvering status? (Please note that I'm on FreeNAS version 9.)

Code:
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
		Sufficient replicas exist for the pool to continue functioning in a
		degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
		repaired.
  scan: scrub in progress since Sun Nov 19 12:47:17 2017
		2.06T scanned out of 16.3T at 478M/s, 8h41m to go
		0 repaired, 12.60% done


2) In parallel with the resilvering process, I can probably run a long SMART test on the faulty drive without making the mess worse, right?
3) If the SMART test is OK, I'd like to re-attach the drive to the RAID-Z, as every drive has a "bad moment" now and then. How can I do that? First I'll probably wait until the resilvering is completely done, and then I just type zpool clear and everything is back to normal?

It would be great if you could help me out :)

Cheers,
benni
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
First of all, what is the error message for the failed drive? Also, post the output of smartctl -a /dev/adaX, where adaX is the drive identifier.
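For example, something like this (ada0 is just a placeholder here; glabel status maps the gptid labels from zpool status to the adaX device names):

Code:
# map gptid labels to adaX device names
glabel status
# full SMART report for the suspect drive
smartctl -a /dev/ada0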

Do not run a SMART test while the drive is running a scrub or resilvering.

EDIT: Also look at my signature for the Hard Drive Troubleshooting Guide. It has a lot of info in it but I really need to know what the failure is to be of much help.
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
Thanks for your fast reply, joeschmuck.

The (short) SMART test says that the drive is fine.

Code:
SMART overall-health self-assessment test result: PASSED
...
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   133   133   054	Pre-fail  Offline	  -	   104
  3 Spin_Up_Time			0x0007   128   128   024	Pre-fail  Always	   -	   545 (Average 543)
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   30
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   113   113   020	Pre-fail  Offline	  -	   42
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   265
 10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   29
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   40
193 Load_Cycle_Count		0x0012   100   100   000	Old_age   Always	   -	   40
194 Temperature_Celsius	 0x0002   146   146   000	Old_age   Always	   -	   41 (Min/Max 20/41)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
No Errors Logged

...
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%		12		 -


In the admin panel, it just states "The volume serien_rz2 (ZFS) state is DEGRADED: One or more devices are faulted in response to persistent errors" and shows 232 write errors.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
It looks like you are not running frequent SMART short and long tests as you should be, so while SMART may not be showing you any errors, that is likely just because the tests aren't being run. What I suspect is that the system wrote to a bad location, or more likely you had a power issue while the system was running, which caused the problem.

After the scrub has completed, run a SMART long test. You should also set up routine weekly long tests on any day other than Sunday (Sunday is the day a scrub may automatically fire off unless you have changed the default schedule), and set up daily SMART short tests.
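The recurring tests should be scheduled in the GUI (Tasks → S.M.A.R.T. Tests), but you can also kick a test off manually from the shell; ada0 below is just a placeholder for whichever drive you want to test:

Code:
# short self-test, usually a couple of minutes
smartctl -t short /dev/ada0
# extended (long) self-test, several hours on a 4TB drive
smartctl -t long /dev/ada0
# check test progress and results afterwards
smartctl -l selftest /dev/ada0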
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
Thanks for the advice. This file server only gets started when I need something, which is once a month or so, so regular tests are not practical.
Also, I got the error message when the server had only been running for an hour or so, so I guess FreeNAS runs some tests by itself?

OK, then... after the SMART long test finishes successfully, what can I do next? zpool clear? And then the faulty drive is back in the game automatically?
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
As of right now you don't even know whether you actually have a faulty drive; right now you have corrupt data. If you power the system up, it starts doing some maintenance behind the scenes, and you keep powering it down before that completes, you are apt to do some harm. FreeNAS was not designed to be powered up and down, but rather to run for long periods of time.

A scrub should take care of the issue; do not run zpool clear until a scrub completes without issues. Let the SMART long test complete and see if there are any errors. I'd also recommend you change your mindset on what is practical, because routine tests are required if you value your data. I'm not trying to be harsh about it, but please understand that this is important; otherwise you are likely to have this problem again.
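Roughly, the sequence from the shell would look like this once everything has finished (serien_rz2 is the pool name from your GUI message):

Code:
# confirm the scrub finished without new errors
zpool status serien_rz2
# only then reset the error counters
zpool clear serien_rz2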
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
OK, it took two SMART runs for it to complete. That's why I'm answering so late.

Code:
...
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error   00%        299              -
# 2  Extended captive    Interrupted (host reset)  90%        274              -
# 3  Extended offline    Completed without error   00%        12               -

Between the runs I did a reboot (to make sure the drive got powered down once). Now the drive has been automatically re-attached to the RAID-Z, but I still get a critical message: "The volume serien_rz2 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected." -- What does that mean? The scrub has finished. Does that mean there's still corrupt data somewhere? And how do I get rid of this message?

And yes, you're right. I didn't want to be a wise guy; I just assumed that FreeNAS already has a lot of automatic checks running in the background, but maybe that's not the case. I do understand the importance of SMART tests and scrubs.

Cheers,
benni
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
So it looks like your hard drive is fine. Did you test all the other hard drives as well? If not, I'd recommend that you do so. Also know that you can test them all at the same time.

What is the output of zpool status? Hopefully it is no longer degraded. What is the current error message you are getting?
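For the record, adding -v will also list any files with permanent errors, which helps judge whether data was actually damaged:

Code:
zpool status -v serien_rz2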
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
Code:
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 10h21m with 0 errors on Sun Nov 19 23:08:30 2017

And the checksum errors went up from 7 to 9, so I guess this is a persistent error anyway, even though the long SMART test was fine (on this drive).
Code:
gptid/xxxxxxxxxxxxxxxxxxxxxxxxxxx  ONLINE  0  0  9

I'll check the others as well.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
Keep in mind that a SMART long test is a read-only test, so it only checks whether the sectors can be read, not whether the data in them is any good. If it can read the sector, it is happy. I'd go ahead and run zpool clear and run another scrub overnight.
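In shell terms, something like this (again using your pool name):

Code:
zpool clear serien_rz2
zpool scrub serien_rz2
# check the result the next morning
zpool status serien_rz2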

Did you test all the drives, just in case you are looking at the wrong one? Honestly, you are not providing much data; I couldn't tell you the make/model or serial number of the drive, nor your pool construction. It's tough to help when you are only providing little bits of data. If you feel you have the correct hard drive, then because it hasn't failed hard, I'd run badblocks on it for a few days to break it in. Maybe you have a drive on the edge, and if it should fail, then you can RMA the drive. You always want to do this stuff on your schedule, not when it all fails.
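A badblocks pass would look roughly like this (ada7 is just an example device name; the write test destroys all data on the disk, so only run it on a drive that has been taken out of the pool):

Code:
# four full write/read passes over the whole disk, shows progress, wipes everything
badblocks -ws -b 4096 /dev/ada7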
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
Thanks for all your help. I executed the long test on all drives and they all seem fine.

I'm not sure if I want to share serial numbers on the net. I have 8 drives, each 4 TB, in a RAID-Z2 (with 2 spare drives). Different models:
6x HGST HMS5C4040BLE640 (each < 400 h lifetime)
1x Seagate ST4000DM000-1F2168 (7,000 h lifetime)
1x Western Digital WDC WD40EZRX-00SPEB0 (10,000 h lifetime)

It's a good idea to run badblocks on it, but then I'd have to disconnect the drive from the ZFS pool (again). I'm not feeling too comfortable doing that.

//edit:
The scrub finds errors again on this drive. OK, now I'd like to run badblocks on the drive ;) I should stop the scrub, pull the drive out of the ZFS pool (how do I do that?), start a scrub/resilvering again, and run badblocks on this drive, so I can RMA this b.tch after just 300 h :/
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,995
I'm not sure if I want to share serial numbers on the net.
The last 4 digits would be fine. It's good to list these here for tracking purposes, and this way someone can catch you if you make a mistake. We all make mistakes, and it's just best to be aware. It also helps us rule out certain drives if we know with some certainty that they've been tested completely.

It's a good idea to run badblocks on it, but then I'd have to disconnect the drive from the ZFS pool (again). I'm not feeling too comfortable doing that.
Unfortunately I feel like this is the best thing for you to do. Do you have your 16TB of data backed up?

I've got four 6TB drives being delivered on Friday; I will run badblocks on them for a while before building my replacement pool. I want to do my best to ensure that if they are going to die from infant mortality, they do it before they become part of my pool. So maybe I'll run them for almost 200 hours like that.

How did you determine which drive is faulty? Can you show the data for that? I just want to make sure that you picked the correct drive.

Have you run the zpool clear yet?
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
I did execute zpool clear, and shortly after that I got an error message saying that the volume is degraded (again).
zpool status brings up this:
Code:
        NAME                        STATE     READ WRITE CKSUM
        serien_rz2                  DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            gptid/43eb-xxxxxxxxx    ONLINE       0     0     0
            gptid/4516-xxxxxxxxx    ONLINE       0     0     0
            gptid/461exxxxxxxxx     ONLINE       0     0     0
            gptid/4762-xxxxxxxxx    ONLINE       0     0     0
            gptid/485f-xxxxxxxxx    ONLINE       0     0     0
            gptid/497e-xxxxxxxxx    ONLINE       0     0     0
            gptid/4a9d-xxxxxxxxx    ONLINE       0     0     0
            gptid/4bba-xxxxxxxxx    DEGRADED     0     0   125  too many errors  (repairing)
..and in the admin panel, under volume status, I see that /dev/hda7p2 has these 125 checksum errors.
/dev/hda7 has the s/n: ----2LRJ

I don't have a backup of my 16 TB - that's why I have this RAID :)) But if I pull this one drive, I still have one spare drive left, so I *guess* I should feel safe doing that.
 

benni2

Dabbler
Joined
Jul 12, 2016
Messages
26
"scan: scrub repaired 540K in 10h29m with 0 errors on Thu Nov 23 03:52:20 2017"

OK... now... how do I remove the drive from the volume?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
in the admin panel, under volume status, I see that /dev/hda7p2 has these 125 checksum errors.
Is this a typo, or is that actually what it shows? Because if it's hda7 rather than ada7, there's some strangeness going on. What disk controller are you using?
I don't have a backup of my 16 TB - that's why I have this RAID
RAID does not substitute for a backup.
I'm not sure if I want to share serial numbers on the net.
Why not? What purpose does it serve to obscure this information? Sure, we don't need the serial numbers as such, but when editing to remove those, people often remove or change other information that's actually relevant.
 