Resilvering horribly slow (takes weeks!)

Status
Not open for further replies.

samweis

Cadet
Joined
Jan 20, 2015
Messages
2
Hi!

I am new to this forum. I have a two-year-old FreeNAS server with four disks. A few weeks ago the first disk started failing, so I ordered four new (larger) disks (and more RAM) to replace the old ones.

I turned off my server, added the new drives, and started an online replacement. After a few hours I looked at zpool status and saw that it was scanning at around 1-2 MB/s! On top of that, a second disk was being resilvered.
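For reference, I started the replacement from the web UI; as far as I understand it, the shell equivalent would be a zpool replace of the old gptid with the new one (labels taken from the status output below, though the exact invocation is my assumption, since I only used the GUI):

Code:
[root@freenas] ~# zpool replace NAS gptid/c1e8972b-53d1-11e1-9d57-f46d0473bc7e gptid/4f646e62-98eb-11e4-9c57-f46d0473bc7e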

Code:
[root@freenas] ~# zpool status
  pool: NAS
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jan 10 18:12:14 2015
  2.02T scanned out of 5.85T at 1.07M/s, (scan is slow, no estimated time)
  493G resilvered, 34.57% done
config:

  NAME                                              STATE     READ WRITE CKSUM
  NAS                                               ONLINE       0     0     0
    raidz1-0                                        ONLINE       0     0     0
      gptid/bfb81bea-53d1-11e1-9d57-f46d0473bc7e    ONLINE       0     0     0
      gptid/c05f49a0-53d1-11e1-9d57-f46d0473bc7e    ONLINE       0     0     0
      gptid/c1249ad5-53d1-11e1-9d57-f46d0473bc7e    ONLINE       0     0     0  (resilvering)
      replacing-3                                   ONLINE       0     0     0
        gptid/c1e8972b-53d1-11e1-9d57-f46d0473bc7e  ONLINE       0     0     0
        gptid/4f646e62-98eb-11e4-9c57-f46d0473bc7e  ONLINE       0     0     0  (resilvering)


The resilver has been running for 1.5 weeks now and is only at 35%. After a reboot the speed drops to around a single MB/s!
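To watch the rate without rerunning the full status by hand, a small sh loop works (the FreeNAS root shell may be csh, hence the explicit sh -c):

Code:
[root@freenas] ~# sh -c 'while :; do zpool status NAS | grep scanned; sleep 60; done'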

Please help if you have a suggestion! I need to speed up the resilvering because two of the disks seem to be failing more and more. By failing I mean errors like the following in /var/log/messages, for both the disk being resilvered and the second one. smartctl -a for the two disks does not look any better (a quick way to pull out the critical attributes is sketched after the log).

Code:
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 c0 00 09 40 09 00 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): RES: 41 40 30 01 09 40 09 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): Retrying command
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 c0 00 09 40 09 00 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): RES: 41 40 30 01 09 40 09 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): Retrying command
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 b0 c0 00 09 40 09 00 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): CAM status: ATA Status Error
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): RES: 41 40 30 01 09 40 09 00 00 00 00
Jan 19 07:24:40 freenas kernel: (ada3:ahcich3:0:0:0): Error 5, Retries exhausted
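
To pull out the most telling SMART attributes quickly, something like this works (ada3 is the disk from the log above; the attribute names are the common ones and can differ slightly by vendor):

Code:
[root@freenas] ~# smartctl -a /dev/ada3 | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'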


Could something be wrong with the tunables? I never changed them manually, only via autotune. My loader.conf is below, and after it a quick check of the resilver-related sysctls.

best regards

Code:
[root@freenas] ~# cat /boot/loader.conf
#
# Boot loader file for FreeNAS.  This relies on a hacked beastie.4th.
#
autoboot_delay="2"
loader_logo="freenas-logo"
loader_menu_title="Welcome to FreeNAS"
loader_brand="freenas-brand"
loader_version=" "

# kernel modules: only load them in loader.conf if they are absolutely
# necessary.

# xhci is qualified as "critical" because this could be a boot device.
# Bug 3273, we are going to disable this by default for now.
#xhci_load="YES"

# If the machine dies at boot before /etc/rc.d/sysctl is run, let the user do
# something.
debug.debugger_on_panic=1
debug.ddb.textdump.pending=1

hw.hptrr.attach_generic=0

# Set the allowed number of mbuf clusters to a sane minimum.
kern.ipc.nmbclusters="262144"

# Retry for 30 seconds before giving up.
vfs.mountroot.timeout="30"

# Put FC ports in initiator mode
hint.isp.0.role=2
hint.isp.1.role=2
hint.isp.2.role=2
hint.isp.3.role=2

# Possible kernel module locations
module_path="/boot/modules;/usr/local/modules"

# Disable IPv6 link local addresses.  We'll enable this per interface
# if IPv6 is configured for the interface later on in boot.
net.inet6.ip6.auto_linklocal="0"

vfs.zfs.vol.mode=2
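
And this is how I checked the resilver-related sysctls; I believe these are the relevant knobs on FreeBSD's ZFS, but that is my assumption:

Code:
[root@freenas] ~# # names per stock FreeBSD ZFS; everything here was set by autotune only
[root@freenas] ~# sysctl vfs.zfs.resilver_delay vfs.zfs.resilver_min_time_ms vfs.zfs.scan_idle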

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Well, if multiple drives are failing, you have a problem. The only thing you can do is keep all load off the server while it resilvers, so no file sharing.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Why are you resilvering two drives at the same time, especially on a RAIDZ1 system? I've never heard of someone doing that, and depending on your system it could take a very long time.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Why are you resilvering two drives at the same time, especially on a RAIDZ1 system? I've never heard of someone doing that, and depending on your system it could take a very long time.

Hehe. He's not. ZFS is doing unexplained things because he's got more than 1 failed disk with only 1 disk's worth of redundancy.

That pool is toast. ZFS just hasn't been told yet. I'm betting it'll eventually kernel panic or reboot, and the pool will be gone forever.

My advice to the OP: copy your data off and give up trying to recover your pool. There's a reason we recommend RAIDZ2, and you just provided another example. You've got more than one disk failing, so recovery is not gonna happen. Because the disks themselves are failing, the resilvers will take some ungodly amount of time to finish (if they ever do), and meanwhile the disks will keep deteriorating until the pool is gone for good.
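
If you have nowhere ZFS-aware to send it, even a plain recursive copy to any other disk or machine is fine. Roughly like this, where /mnt/NAS is the pool mountpoint and /mnt/rescue is a placeholder for wherever the copy goes:

Code:
[root@freenas] ~# rsync -av --progress /mnt/NAS/ /mnt/rescue/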
 

samweis

Cadet
Joined
Jan 20, 2015
Messages
2
Thanks for your replies. It's exactly as cyberjock wrote. I will try to copy off as many files as possible and then scrap the pool. I am glad to have a backup of the important files...
 