zfs replacing after resilver ends

heliop100 · Jun 27, 2016

Hi,
I replaced a faulty HD in my raidz1 pool (3 days ago).
The resilver process has ended (3 times, actually), but the status is still DEGRADED with a replacing virtual dev.

Any idea?
Thanks.

[root@freenas] /# zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 776G in 5h3m with 1 errors on Sun Jun 26 17:39:31 2016
config:

        NAME                                              STATE    READ WRITE CKSUM
        tank                                              DEGRADED    0     0     1
          raidz1-0                                        DEGRADED    0     0     2
            ada0                                          ONLINE      0     0     0
            ada2                                          ONLINE      0     0     0
            ada3                                          ONLINE      0     0     0
            replacing-3                                   DEGRADED    0     0     0
              18003346797264841597                        UNAVAIL     0     0     0  was /dev/ada4
              gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f  ONLINE      0     0     0
            ada5                                          ONLINE      0     0     0

errors: Permanent errors have been detected in the following files:

tank/.system/syslog-adb946163d914f088dc14617dbc0bec3:<0x1d>

Mirfster · Jun 27, 2016

heliop100 said:
with replacing virtual dev.

Can you please elaborate on this?

Also, please post System Configuration info.

heliop100 · Jun 27, 2016

Mirfster said:
Can you please elaborate on this?

Also, please post System Configuration info.

Hi,

FreeNAS-9.2.1.8-RELEASE-x64 (e625626)
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
8 GB RAM

System disk:
<HTS721010G9SA00 MCZIC14V> at scbus1 target 0 lun 0 (pass1,ada1)

ZFS raidz1 pool:
<ST31000528AS CC38> at scbus0 target 0 lun 0 (pass0,ada0)
<ST31000528AS CC34> at scbus3 target 0 lun 0 (pass2,ada2)
<WDC WD10EARS-00Y5B1 80.00A80> at scbus4 target 0 lun 0 (pass3,ada3)
<WDC WD10EARS-00Y5B1 80.00A80> at scbus6 target 0 lun 0 (pass5,ada5)
<ST1000DM003-1SB102 CC43> at scbus5 target 0 lun 0 (pass4,ada4) *****
***** this is the new disk

Thanks.

maglin · Jun 27, 2016

It hasn't finished the resilver. It's erroring out of the resilver. Did you have scrubs set up on your pool? I would run SMART long tests on the drives in your pool.

You can do these all at the same time.

Code:

smartctl -t long /dev/ada0
smartctl -t long /dev/ada1
smartctl -t long /dev/ada2
smartctl -t long /dev/ada3

The tests should take just over 2 hours for 1 TB HDDs. Once they have finished, get the results and post them in code tags.

Code:

smartctl -a /dev/ada0
smartctl -a /dev/ada1
smartctl -a /dev/ada2
smartctl -a /dev/ada3

If your old HDD is still working, you might want to see if you can reattach it to the pool. Then do a scrub of the pool. Then replace the drive. I have a feeling you have some bad sectors, and the data needed to repair them is on the disk that you are attempting to replace.
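
The per-disk commands above can also be generated with a small loop. This is a minimal sketch, not something posted in the thread: the function name `smart_cmds` and the `DISKS` list are illustrative assumptions (the list is taken from the pool members shown in the `zpool status` output, so adjust it to your own system). It only prints the commands, so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Assumed pool member disks, taken from the zpool status output in this thread.
DISKS="ada0 ada2 ada3 ada4 ada5"

# smart_cmds OPTIONS DISK...
# Print (do not run) one smartctl command line per disk.
# OPTIONS is the smartctl option string, e.g. "-t long" or "-a".
smart_cmds() {
    opts=$1
    shift
    for d in "$@"; do
        echo "smartctl $opts /dev/$d"
    done
}

# Dry run: print the commands that would start a long self-test on each disk.
# $DISKS is left unquoted on purpose so it splits into one argument per disk.
smart_cmds "-t long" $DISKS
```

To actually start the tests, the printed lines can be piped to `sh`, and `smart_cmds "-a" $DISKS` produces the matching commands for collecting the results afterwards.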

heliop100 · Jun 28, 2016

Hi,
I ran the smartctl long test and all HDs show:
SMART overall-health self-assessment test result: PASSED
The complete results are below***

The only HD that reports errors is ada2.
On the console I have:
smartd: device /dev/ada2, 1 currently unreadable (pending) sectors
smartd: device /dev/ada2, 1 offline uncorrectable sectors

Should I replace ada2, or is a scrub good enough?

Last night the replacement finished. Now I have:
[root@freenas] ~# zpool status -v
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 776G in 5h13m with 0 errors on Mon Jun 27 17:14:09 2016
config:

        NAME                                              STATE    READ WRITE CKSUM
        tank                                              ONLINE      0     0     0
          raidz1-0                                        ONLINE      0     0     0
            ada0                                          ONLINE      0     0     0
            ada2                                          ONLINE      0     0     0
            ada3                                          ONLINE      0     0     0
            gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f    ONLINE      0     0     0
            ada5                                          ONLINE      0     0     0

errors: No known data errors

Why does it show gptid/9bceee17-3a47-11e6-ab6d-001d7dfce54f and not ada4?

Thanks

***
SMART Error Log Version: 1
ATA Error Count: 133 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 133 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:51.271 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:48.499 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT

Error 132 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:48.499 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT

Error 131 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:43.229 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT

Error 130 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:40.456 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.740 READ DMA EXT

Error 129 occurred at disk power-on lifetime: 25926 hours (1080 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff 4f 00 19:52:37.764 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.753 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.743 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.740 READ DMA EXT
25 00 08 ff ff ff 4f 00 19:52:37.712 READ DMA EXT

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Self-test routine in progress 50% 26252 -
# 2 Short offline Completed: read failure 90% 25481 512
# 3 Extended offline Aborted by host 90% 254 -
# 4 Short offline Aborted by host 90% 128 -

Robert Trevellyan · Jun 28, 2016

heliop100 said:
2 Short offline Completed: read failure 90% 25481 512

I would replace any drive that fails a SMART test.

SweetAndLow · Jun 28, 2016

Looks like you had multiple HDDs fail around the same time. You have also been using the CLI to modify your pool, which is not what you want to do with FreeNAS.

You should back up all your data, rebuild using raidz2, and only use the GUI.

maglin · Jun 28, 2016

You can also upgrade the pool from the GUI. I noticed it doesn't have all the flags set on it. But rebuilding sure wouldn't hurt. People need to stay out of the CLI for core system changes. It ends up not so good in the end for some people.


heliop100 · Jun 29, 2016

Hi,

I will replace the drive.

ada2 has a lot of reallocated sectors (and is the oldest drive):
1 Raw_Read_Error_Rate 0x000f 115 090 006 Pre-fail Always - 84324937
5 Reallocated_Sector_Ct 0x0033 092 092 036 Pre-fail Always - 339 *****
7 Seek_Error_Rate 0x000f 090 060 030 Pre-fail Always - 1127629449

Thanks.
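
For anyone who wants to pull that reallocated-sector figure out of the attribute table automatically, here is a minimal sketch. It assumes the usual ten-column layout of `smartctl -A` output, in which the raw value is the tenth whitespace-separated field; the function name `raw_value` is made up for illustration.

```shell
#!/bin/sh
# raw_value NAME
# Read smartctl attribute lines on stdin and print the raw value
# (10th field) of the attribute whose name (2nd field) equals NAME.
raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example using the attribute line quoted in the post above.
printf '%s\n' \
    '5 Reallocated_Sector_Ct 0x0033 092 092 036 Pre-fail Always - 339' |
    raw_value Reallocated_Sector_Ct
```

On a live system the same function could be fed from `smartctl -A /dev/ada2` instead of the sample line.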

Stux · Jul 11, 2016

raidz1.

ada2 had unrecoverable errors while ada4 was failed.

With raidz1 you can't tolerate ANY errors while a device is failed. With raidz2 you can.
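
The point above can be put as simple arithmetic. The sketch below is a deliberately simplified per-stripe model, not how ZFS is implemented: it assumes a stripe can be rebuilt as long as the number of missing or unreadable columns does not exceed the number of parity columns. The function name `stripe_recoverable` is made up for illustration.

```shell
#!/bin/sh
# stripe_recoverable PARITY MISSING UNREADABLE
# Succeeds (exit status 0) when the stripe can still be rebuilt
# under the simplified model described above.
stripe_recoverable() {
    parity=$1      # parity columns: 1 for raidz1, 2 for raidz2
    missing=$2     # devices that are failed or being replaced
    unreadable=$3  # further columns with an unreadable sector in this stripe
    [ $((missing + unreadable)) -le "$parity" ]
}

# The situation in this thread: ada4 missing, one bad sector on ada2.
if stripe_recoverable 1 1 1; then
    echo "raidz1: recoverable"
else
    echo "raidz1: data error"
fi

if stripe_recoverable 2 1 1; then
    echo "raidz2: recoverable"
else
    echo "raidz2: data error"
fi
```

Under this model the raidz1 case reports a data error and the raidz2 case is recoverable, which matches the permanent error seen during the first resilver attempts.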
