Slow resilvering speed after replacing second disk

Status
Not open for further replies.

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
Hi All,

I'm in the process of replacing 4x3TB WD Red drives with 4x4TB ST4000NM0023 drives (HP MicroServer + LSI 9211-8i).
The first drive resilvered in less than 8 hours, but when I replaced the second drive, resilvering became very slow:
Code:
zpool status -v DATA
  pool: DATA
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 22 08:19:48 2018
	1.40T scanned at 933M/s, 43.5G issued at 28.4M/s, 5.36T total
	10.3G resilvered, 0.79% done, 2 days 06:37:05 to go
config:

	NAME                                            STATE     READ WRITE CKSUM
	DATA                                            ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/eedd2914-1754-11e8-9246-70106f3e4588  ONLINE       0     0     0
	    gptid/b51ae28a-17a0-11e8-85bc-70106f3e4588  ONLINE       0     0     1  (resilvering)
	    gptid/c7858df1-050c-11e7-8bb9-70106f3e4588  ONLINE       0     0     0
	    gptid/c8490434-050c-11e7-8bb9-70106f3e4588  ONLINE       0     0     0
	logs
	  ada0                                          ONLINE       0     0     0

errors: No known data errors


zpool list -v
NAME                                             SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
DATA                                            10.9T  5.40T  5.47T         -    21%    49%  1.00x  ONLINE  /mnt
  raidz1                                        10.9T  5.40T  5.47T         -    21%    49%
    gptid/eedd2914-1754-11e8-9246-70106f3e4588      -      -      -         -      -      -
    gptid/b51ae28a-17a0-11e8-85bc-70106f3e4588      -      -      -         -      -      -
    gptid/c7858df1-050c-11e7-8bb9-70106f3e4588      -      -      -         -      -      -
    gptid/c8490434-050c-11e7-8bb9-70106f3e4588      -      -      -         -      -      -
log                                                 -      -      -         -      -      -
  ada0                                            93G  1.26M  93.0G         -     0%     0%
freenas-boot                                    7.31G  1.64G  5.68G         -      -    22%  1.00x  ONLINE  -
  da4p2                                         7.31G  1.64G  5.68G         -      -    22%

What could be the issue here?
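
For reference, the time estimate in the status output can be roughly reproduced from the "issued" figures rather than the "scanned" ones. The sketch below is plain arithmetic in Python on the numbers printed above. The helper names are made up for illustration, and it assumes that zpool reports binary units (1T = 1024G) and that the estimate is simply the remaining bytes divided by the current issue rate, which may not be exactly how ZFS computes it.

```python
# Unit suffixes as printed by zpool status (assumed to be binary multiples).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a zpool-style size such as '43.5G' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def resilver_eta_seconds(total, issued, issued_rate):
    """Remaining data divided by the current issue rate (a simplification)."""
    remaining = to_bytes(total) - to_bytes(issued)
    return remaining / to_bytes(issued_rate)


def format_eta(seconds):
    """Render a duration in seconds as days, hours and minutes."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours:02d}h {rem // 60:02d}m"


# Figures from the status output above: 5.36T total, 43.5G issued at 28.4M/s.
slow = resilver_eta_seconds("5.36T", "43.5G", "28.4M")
print(format_eta(slow))
```

This comes out at a little over 2 days 6 hours, close to the 2 days 06:37:05 shown, so the estimate appears to follow the 28.4M/s issue rate and not the 933M/s scan rate.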
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Well... for starters, you need to state what version of FreeNAS you're running. There were resilvering performance enhancements in FreeNAS 11.1+, so if you're not running that build or higher, upgrading may help. As for why the first was much faster than the second... it could be any number of things, I suppose. Hardware problems with that disk? At this point, I'd just let it finish and then see what stats you can get from that disk.
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
I'm running FreeNAS-11.1-U1. As for a hardware problem: while resilvering, I ran dd if=/dev/zero to a file on that pool, sized at 2x RAM, and it gives me ~160 MB/s read and write speed.
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Ok, well at least you're using an updated version. As for your test... I'm not sure whether the disk that's currently resilvering is included in the data path for new writes yet.
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
bigphil said:
Ok, well at least you're using an updated version. As for your test... I'm not sure whether the disk that's currently resilvering is included in the data path for new writes yet.
What do you mean? It's RAID-Z with a single vdev, so of course it is included.
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
I replaced the 2nd disk, which was resilvering slowly, with a different ST4000NM0023 (also new) and got similar results:
Code:
[root@brunas01 /nonexistent]# zpool status -v DATA
  pool: DATA
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 22 16:48:48 2018
	1.20T scanned at 2.35G/s, 7.01G issued at 37.2M/s, 5.40T total
	1.57G resilvered, 0.13% done, 1 days 18:13:56 to go
config:

	NAME                                            STATE     READ WRITE CKSUM
	DATA                                            ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/eedd2914-1754-11e8-9246-70106f3e4588  ONLINE       0     0     0
	    gptid/c7bb3d65-17e7-11e8-853d-70106f3e4588  ONLINE       0     0     0  (resilvering)
	    gptid/c7858df1-050c-11e7-8bb9-70106f3e4588  ONLINE       0     0     0
	    gptid/c8490434-050c-11e7-8bb9-70106f3e4588  ONLINE       0     0     0
	logs
	  ada0                                          ONLINE       0     0     0

errors: No known data errors
 

bigphil

Patron
Joined
Jan 30, 2014
Messages
486
Check iostat to see how busy the resilvering disk is: run iostat -w 1 -x <device name> and look at %b. What's the value of sysctl -a | grep vfs.zfs.resilver_delay? If it's 0 (which I think is the setting in 11.1-U1), then resilver priority should be the same as other I/O. The value changes to 2 in 11.1-U2, so resilvering gets a lower priority when the pool is not idle. That still doesn't explain why the first disk was fast. Maybe we can get some other input and ideas.
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
da3, the disk I'm resilvering, is jumping between 80 and 100% busy:
Code:
                        extended device statistics
device     r/s   w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t qlen  %b
da0         57     0   2508.0      0.0     0     0     0     0    0   5
da1         57     0   2508.0      0.0     0     0     0     0    0   5
da2         57     0   2508.0      0.0     0     0     0     0    0   5
da3          0   109      0.0   4360.0     0     9     0     9    1 100

sysctl -a | grep vfs.zfs.resilver_delay
vfs.zfs.resilver_delay: 0

What is interesting is that, according to the latency graph, the first Seagate disk (which I already replaced) has 15 ms write latency and the one I'm resilvering now has 10 ms, while the remaining two WD Reds stay at ~1-2 ms. That makes no sense to me, since the ST4000NM0023 has a 128 MB cache and spins at 7200 rpm, while the WD Red is 5400 rpm with 64 MB.
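
To put numbers on that iostat sample, here is a small Python sketch that derives the average I/O size and a rough write utilisation per disk. The dictionary layout and function names are only for illustration, the values are copied from the single one-second sample above, and iostat rounds ms/w to whole milliseconds, so treat the result as approximate.

```python
# One iostat sample per device, copied from the output above.
rows = {
    "da0": {"r/s": 57, "w/s": 0, "kr/s": 2508.0, "kw/s": 0.0, "ms/w": 0, "%b": 5},
    "da1": {"r/s": 57, "w/s": 0, "kr/s": 2508.0, "kw/s": 0.0, "ms/w": 0, "%b": 5},
    "da2": {"r/s": 57, "w/s": 0, "kr/s": 2508.0, "kw/s": 0.0, "ms/w": 0, "%b": 5},
    "da3": {"r/s": 0, "w/s": 109, "kr/s": 0.0, "kw/s": 4360.0, "ms/w": 9, "%b": 100},
}


def avg_io_kb(row, kind):
    """Average request size in KB for reads (kind='r') or writes (kind='w')."""
    ops = row[f"{kind}/s"]
    return row[f"k{kind}/s"] / ops if ops else 0.0


def write_utilisation(row):
    """Writes per second times seconds per write: share of each second spent writing."""
    return row["w/s"] * row["ms/w"] / 1000.0


da3 = rows["da3"]
print("da3 average write size (KB):", avg_io_kb(da3, "w"))
print("da3 write utilisation:", write_utilisation(da3))
print("da0 average read size (KB):", avg_io_kb(rows["da0"], "r"))
```

On this sample da3 is taking about 109 writes of roughly 40 KB each per second at about 9 ms apiece, which already accounts for nearly the whole second (109 x 9 ms is about 0.98 s). That would be consistent with the 100% busy figure and with only around 4 MB/s reaching the disk, while the three source disks sit at 5% busy.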
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
So I replaced the 3rd disk, and there are no issues with it:
Code:
[root@brunas01 /nonexistent]# zpool status -v DATA
  pool: DATA
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Feb 24 13:31:33 2018
	1.18T scanned at 3.97G/s, 50.9G issued at 171M/s, 5.36T total
	12.6G resilvered, 0.93% done, 0 days 09:01:18 to go
config:

	NAME                                            STATE     READ WRITE CKSUM
	DATA                                            ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/eedd2914-1754-11e8-9246-70106f3e4588  ONLINE       0     0     0
	    gptid/c7bb3d65-17e7-11e8-853d-70106f3e4588  ONLINE       0     0     0
	    gptid/8c4b15e2-195e-11e8-976b-70106f3e4588  ONLINE       0     0     0  (resilvering)
	    gptid/c8490434-050c-11e7-8bb9-70106f3e4588  ONLINE       0     0     0
	logs
	  ada0                                          ONLINE       0     0     0
 

hescominsoon

Patron
Joined
Jul 27, 2016
Messages
456
You may have a dodgy cable or a dodgy drive port, then, since it appears to be only the second port that is having problems.
 

level3

Dabbler
Joined
Mar 30, 2017
Messages
12
hescominsoon said:
You may have a dodgy cable or a dodgy drive port, then, since it appears to be only the second port that is having problems.
Indeed, it looks like a problem with Bay 2. I tested this by moving the resilvered disk from Bay 2 to Bay 4 and then starting the resilver of the last disk in the vdev, and I get the same issue at ~20 MB/s. However, when I swapped the disks in bays 2 and 4 again, it still gives me ~20 MB/s even with the disk in Bay 4. It almost looks like resilvering the even-numbered disks goes much slower than the odd-numbered ones.

Is there any command to show error counters on the SAS interface between the HBA and the target/disk?
 