ZFS pool still shows "replacing" after resilver ends

Status
Not open for further replies.

heliop100

Cadet
Joined
Jun 27, 2016
Messages
6
Hi,
I replaced a faulty HD in my raidz1 pool 3 days ago.
The resilver process has ended (3 times, actually), but the pool status is still DEGRADED with a "replacing" virtual device.

Any ideas?
Thanks.

[root@freenas] /# zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 776G in 5h3m with 1 errors on Sun Jun 26 17:39:31 2016
config:

        NAME                                              STATE     READ WRITE CKSUM
        tank                                              DEGRADED     0     0     1
          raidz1-0                                        DEGRADED     0     0     2
            ada0                                          ONLINE       0     0     0
            ada2                                          ONLINE       0     0     0
            ada3                                          ONLINE       0     0     0
            replacing-3                                   DEGRADED     0     0     0
              18003346797264841597                        UNAVAIL      0     0     0  was /dev/ada4
              gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f  ONLINE       0     0     0
            ada5                                          ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/.system/syslog-adb946163d914f088dc14617dbc0bec3:<0x1d>
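For anyone reading output like this, here is a minimal Python sketch that picks out the rows worth looking at: anything that is not ONLINE or that carries a non-zero error counter. The rows are copied from the status output above; the function names `parse_config` and `problem_rows` are made up for this illustration, and the sketch assumes the counters are plain integers (real `zpool status` output can abbreviate large counts, which this does not handle).

```python
# Config rows copied from the `zpool status -v` output in this post.
ZPOOL_CONFIG = """\
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 1
raidz1-0 DEGRADED 0 0 2
ada0 ONLINE 0 0 0
ada2 ONLINE 0 0 0
ada3 ONLINE 0 0 0
replacing-3 DEGRADED 0 0 0
18003346797264841597 UNAVAIL 0 0 0 was /dev/ada4
gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f ONLINE 0 0 0
ada5 ONLINE 0 0 0
"""


def parse_config(text):
    """Return (name, state, read, write, cksum) for each row after the header."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/STATE header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, state = fields[0], fields[1]
        # Assumption: counters are plain integers (no "1.2K"-style abbreviations).
        read, write, cksum = (int(f) for f in fields[2:5])
        rows.append((name, state, read, write, cksum))
    return rows


def problem_rows(rows):
    """Rows that are not ONLINE or that have a non-zero error counter."""
    return [r for r in rows if r[1] != "ONLINE" or any(r[2:5])]


rows = parse_config(ZPOOL_CONFIG)
for name, state, read, write, cksum in problem_rows(rows):
    print(f"{name}: {state} read={read} write={write} cksum={cksum}")
```

On this output the sketch flags the pool, the raidz1-0 vdev, the replacing-3 vdev, and the old UNAVAIL device. None of the individual disks has a non-zero counter; the checksum errors are counted only at the pool and raidz1-0 level.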
 

heliop100

Cadet
Joined
Jun 27, 2016
Messages
6
Can you please elaborate on this?

Also, please post System Configuration info.

Hi,

FreeNAS-9.2.1.8-RELEASE-x64 (e625626)
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
8 GB RAM

system
<HTS721010G9SA00 MCZIC14V> at scbus1 target 0 lun 0 (pass1,ada1)

zfs raidz1 pool
<ST31000528AS CC38> at scbus0 target 0 lun 0 (pass0,ada0)
<ST31000528AS CC34> at scbus3 target 0 lun 0 (pass2,ada2)
<WDC WD10EARS-00Y5B1 80.00A80> at scbus4 target 0 lun 0 (pass3,ada3)
<WDC WD10EARS-00Y5B1 80.00A80> at scbus6 target 0 lun 0 (pass5,ada5)
<ST1000DM003-1SB102 CC43> at scbus5 target 0 lun 0 (pass4,ada4) *****
***** this is the new disk

Thanks.
 

maglin

Patron
Joined
Jun 20, 2015
Messages
299
It hasn't finished the resilver; it's erroring out of it. Did you have scrubs set up on your pool? I would run SMART long tests on the drives in your pool.

You can do these all at the same time.
Code:
smartctl -t long /dev/ada0
smartctl -t long /dev/ada1
smartctl -t long /dev/ada2
smartctl -t long /dev/ada3


The tests should take just over 2 hours for 1TB HDDs. Once they have finished, get the results and post them in code tags.
Code:
smartctl -a /dev/ada0
smartctl -a /dev/ada1
smartctl -a /dev/ada2
smartctl -a /dev/ada3


If your old HDD is still working, you might want to see if you can reattach it to the pool, then scrub the pool, and only then replace the drive. I have a feeling you have some bad sectors, and the data needed to repair them is on the disk that you are attempting to replace.
 

heliop100

Cadet
Joined
Jun 27, 2016
Messages
6
Hi,
I ran the smartctl long test and all HDs show:
SMART overall-health self-assessment test result: PASSED
The complete results are below (***).

The only HD that reports errors is ada2.
On the console I have:
smartd: device /dev/ada2, 1 currently unreadable (pending) sectors
smartd: device /dev/ada2, 1 offline uncorrectable sectors

Should I replace ada2, or is a scrub good enough?

Last night the replacement finished. Now I have:
[root@freenas] ~# zpool status -v
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 776G in 5h13m with 0 errors on Mon Jun 27 17:14:09 2016
config:

        NAME                                              STATE     READ WRITE CKSUM
        tank                                              ONLINE       0     0     0
          raidz1-0                                        ONLINE       0     0     0
            ada0                                          ONLINE       0     0     0
            ada2                                          ONLINE       0     0     0
            ada3                                          ONLINE       0     0     0
            gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f    ONLINE       0     0     0
            ada5                                          ONLINE       0     0     0

errors: No known data errors

Why does it show gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f and not ada4?

Thanks

***
SMART Error Log Version: 1
ATA Error Count: 133 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 133 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:51.271 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:48.499 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT

Error 132 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:48.499 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT

Error 131 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT

Error 130 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.740 READ DMA EXT

Error 129 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.740 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.712 READ DMA EXT

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Self-test routine in progress 50% 26252 -
# 2 Short offline Completed: read failure 90% 25481 512
# 3 Extended offline Aborted by host 90% 254 -
# 4 Short offline Aborted by host 90% 128 -
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Looks like you had multiple HDDs fail around the same time. You have also been using the CLI to modify your pool, which is not what you want to do with FreeNAS.

You should back up all your data, rebuild using RAIDZ2, and only use the GUI.
 

maglin

Patron
Joined
Jun 20, 2015
Messages
299
You can also upgrade the pool from the GUI. I noticed it doesn't have all the feature flags enabled. But rebuilding sure wouldn't hurt. People need to stay out of the CLI for core system changes. It ends up not so good in the end for some people.


 

heliop100

Cadet
Joined
Jun 27, 2016
Messages
6
Hi,

I will replace the drive.

ada2 has a lot of reallocated sectors (and is the oldest drive):
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   090   006    Pre-fail  Always       -       84324937
  5 Reallocated_Sector_Ct   0x0033   092   092   036    Pre-fail  Always       -       339 *****
  7 Seek_Error_Rate         0x000f   090   060   030    Pre-fail  Always       -       1127629449
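As a rough illustration of how these attribute lines can be read, here is a small Python sketch that parses the three rows quoted above and looks at Reallocated_Sector_Ct. The helper names `parse_attributes` and `reallocated_report` are made up for this example, and it assumes the usual `smartctl -A` column order (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). An attribute is normally treated as failed once its normalized value drops to or below its threshold, which may be why a drive can still report PASSED overall while its raw reallocated count keeps climbing.

```python
# Attribute rows copied from the post above (the ***** marker removed).
SMART_LINES = """\
1 Raw_Read_Error_Rate 0x000f 115 090 006 Pre-fail Always - 84324937
5 Reallocated_Sector_Ct 0x0033 092 092 036 Pre-fail Always - 339
7 Seek_Error_Rate 0x000f 090 060 030 Pre-fail Always - 1127629449
"""


def parse_attributes(text):
    """Map attribute name -> dict of id, value, worst, thresh and raw."""
    attrs = {}
    for line in text.splitlines():
        f = line.split()
        # Assumed column order: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) < 10 or not f[0].isdigit():
            continue
        attrs[f[1]] = {
            "id": int(f[0]),
            "value": int(f[3]),
            "worst": int(f[4]),
            "thresh": int(f[5]),
            "raw": int(f[9]),
        }
    return attrs


def reallocated_report(attrs):
    """Summarise Reallocated_Sector_Ct: raw count and whether the threshold tripped."""
    a = attrs["Reallocated_Sector_Ct"]
    return {
        "raw_sectors": a["raw"],
        # Normalized value at or below the threshold counts as a failed attribute.
        "threshold_tripped": a["value"] <= a["thresh"],
    }


print(reallocated_report(parse_attributes(SMART_LINES)))
```

For ada2 this reports 339 reallocated sectors with the threshold not yet tripped (normalized value 92 against a threshold of 36).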

Thanks.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
raidz1.

ada2 had unrecoverable errors while ada4 was failed.

With raidz1 you can't have ANY errors while a device is failed. With raidz2 you can.
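The point can be shown with a toy single-parity model in Python. This is only a sketch: it uses one fixed stripe of four data blocks plus one XOR parity block across five "disks", whereas real RAID-Z uses variable-width stripes and block checksums. The function names are made up for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one XOR parity block (single parity)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe):
    """Rebuild a stripe in which unreadable blocks are marked as None.

    Single parity can recover at most one missing block per stripe.
    """
    missing = [i for i, b in enumerate(stripe) if b is None]
    if len(missing) > 1:
        raise ValueError(
            f"{len(missing)} blocks missing, single parity can rebuild only 1"
        )
    rebuilt = list(stripe)
    if missing:
        survivors = [b for b in stripe if b is not None]
        # XOR of all surviving blocks (data and parity) yields the missing one.
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])  # 5 toy "disks"

# One whole disk gone (like the failed ada4): still recoverable.
one_lost = stripe.copy()
one_lost[3] = None
assert reconstruct(one_lost) == stripe

# Same disk gone plus one unreadable sector elsewhere (like ada2): not recoverable.
two_lost = one_lost.copy()
two_lost[1] = None
try:
    reconstruct(two_lost)
except ValueError as err:
    print("unrecoverable:", err)
```

With a second parity block, as in raidz2, the second case would still be recoverable, which is the difference described above.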
 