Performance problem

Jano
Hi,

Inspired by a fairly old post: https://forums.freenas.org/index.php?threads/write-performance-issues-mid-level-zfs-setup.13372/
I decided to run some tests on my platform... However, I'm having trouble finding my mistake and would like to ask you for a consultation.

Let's start from the beginning...

My system is based on FreeNAS-9.10.1-U4 (ec9a7d3).
It has:
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
64GB RAM
9 x NIC (Intel PRO/1000, 1Gbit, 1500 MTU) - 6x DATA channels used as iSCSI interfaces to ESXi, 2x BACKUP channels to another FreeNAS system, 1x ADM
RAID10 composed of 9 mirrored pairs of disks (all disks are WD Red 3TB, 5400 rpm)
The disks are attached partly through an LSI 9260 controller and partly via the onboard SATA ports.
There is also a 120GB SSD for L2ARC, but for all these tests it was turned off.

The ZFS parameters are mostly at their defaults:
[root@PROD-1S] ~# zfs get all storage2
NAME PROPERTY VALUE SOURCE
storage2 type filesystem -
storage2 creation Fri Apr 21 13:10 2017 -
storage2 used 20.4T -
storage2 available 3.28T -
storage2 referenced 144G -
storage2 compressratio 1.34x -
storage2 mounted yes -
storage2 quota none local
storage2 reservation none local
storage2 recordsize 128K default
storage2 mountpoint /mnt/storage2 default
storage2 sharenfs off default
storage2 checksum on default
storage2 compression off local
storage2 atime off local
storage2 devices on default
storage2 exec on default
storage2 setuid on default
storage2 readonly off default
storage2 jailed off default
storage2 snapdir hidden default
storage2 aclmode passthrough local
storage2 aclinherit passthrough local
storage2 canmount on default
storage2 xattr off temporary
storage2 copies 1 default
storage2 version 5 -
storage2 utf8only off -
storage2 normalization none -
storage2 casesensitivity sensitive -
storage2 vscan off default
storage2 nbmand off default
storage2 sharesmb off default
storage2 refquota none local
storage2 refreservation none local
storage2 primarycache all default
storage2 secondarycache all default
storage2 usedbysnapshots 0 -
storage2 usedbydataset 144G -
storage2 usedbychildren 20.3T -
storage2 usedbyrefreservation 0 -
storage2 logbias latency default
storage2 dedup off local
storage2 mlslabel -
storage2 sync disabled local
storage2 refcompressratio 1.00x -
storage2 written 144G -
storage2 logicalused 10.5T -
storage2 logicalreferenced 144G -
storage2 volmode default default
storage2 filesystem_limit none default
storage2 snapshot_limit none default
storage2 filesystem_count none default
storage2 snapshot_count none default
storage2 redundant_metadata all default


Composition of pool:

[root@PROD-1S] ~# zpool status storage2
pool: storage2
state: ONLINE
scan: scrub repaired 0 in 114h22m with 0 errors on Sat Apr 29 06:22:48 2017
config:

NAME STATE READ WRITE CKSUM
storage2 ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    gptid/227efce1-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/232e4be0-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-1 ONLINE 0 0 0
    gptid/42c536e5-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/4371d520-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-2 ONLINE 0 0 0
    gptid/ca3f4bae-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/ca992fc6-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-3 ONLINE 0 0 0
    gptid/e2a2ad81-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/e2e2ebe7-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-4 ONLINE 0 0 0
    gptid/fea2fea2-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/fed7a02c-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-5 ONLINE 0 0 0
    gptid/242437c7-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/24666d69-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-6 ONLINE 0 0 0
    gptid/b999ffd4-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/ba714e80-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-7 ONLINE 0 0 0
    gptid/fbc0a7da-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/fc09d437-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-8 ONLINE 0 0 0
    gptid/2a733a0f-3564-11e7-8b69-90e2ba36a9d0 ONLINE 0 0 0
    gptid/2b01f6e4-3564-11e7-8b69-90e2ba36a9d0 ONLINE 0 0 0

errors: No known data errors


System tunables are the standard ones generated automatically by FreeNAS.


So... what I did...

First I turned off iSCSI completely and started checking the READ performance of each drive with a simple dd:

dd if=/dev/xxx of=/dev/null bs=262144

I checked each drive that is part of my pool and the results were more or less the same... 140MB/s max.
Of course, this is sequential read.
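(Side note: to avoid reading an entire 3TB disk, a bounded variant like the one below could be used; count=4096 at bs=262144 reads exactly 1GiB and FreeBSD's dd then prints a "bytes transferred in ... secs (... bytes/sec)" summary at the end. ada0 is just an example device.)

dd if=/dev/ada0 of=/dev/null bs=262144 count=4096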

I compared this with diskinfo:

[root@PROD-1S] ~# diskinfo -t /dev/ada0
/dev/ada0
512 # sectorsize
3000592982016 # mediasize in bytes (2.7T)
5860533168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
5814021 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
WD-WMC4N2099293 # Disk ident.

Seek times:
Full stroke: 250 iter in 6.608834 sec = 26.435 msec
Half stroke: 250 iter in 4.833553 sec = 19.334 msec
Quarter stroke: 500 iter in 7.409172 sec = 14.818 msec
Short forward: 400 iter in 2.446857 sec = 6.117 msec
Short backward: 400 iter in 3.096281 sec = 7.741 msec
Seq outer: 2048 iter in 0.172173 sec = 0.084 msec
Seq inner: 2048 iter in 0.159437 sec = 0.078 msec
Transfer rates:
outside: 102400 kbytes in 0.713659 sec = 143486 kbytes/sec
middle: 102400 kbytes in 0.870385 sec = 117649 kbytes/sec
inside: 102400 kbytes in 1.440545 sec = 71084 kbytes/sec


So... comparable... even a bit too good, but dd started reading at the "beginning of the disk". Data located at the "end of the disk" will probably be closer to 70MB/s.
The test is based on READ operations (as I cannot damage data), but WRITE performance should be comparable.

OK, anyway... in the FreeNAS UI reports I see a nice, stable load on each disk.

Next I ran the same operation on all disks (18 disks) at the same time, to rule out possible interference between disks, controllers, etc.
So I did:

foreach i ( ada0 ada1 ada3 ada4 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7 mfid8 mfid9 mfid10 mfid11 mfid12 mfid13 mfid14 mfid15)
foreach? dd if=/dev/${i} of=/dev/null bs=262144 &
foreach? end


The result was also as expected... every disk delivered the same performance as in the individual tests.
So the physical disk subsystem can transfer up to 140MB/s per disk, on all disks at the same time.
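(For completeness, the same parallel test written for plain sh instead of csh - a rough sketch only; count= bounds each read to about 4GiB so the run does not take hours, and wait blocks until every background dd has finished:)

#!/bin/sh
# read from all disks concurrently, ~4GiB from each, then wait for all dd processes
for i in ada0 ada1 ada3 ada4 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7 \
         mfid8 mfid9 mfid10 mfid11 mfid12 mfid13 mfid14 mfid15; do
    dd if=/dev/${i} of=/dev/null bs=262144 count=16384 &
done
wait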

As I have RAID10 (9x2 drives), the theoretical performance is:
READ: 18 x the read performance of one drive = 2520 MB/s (if I take into account the inner tracks of the disks, where performance drops to about 70MB/s, that gives about 1260MB/s for the RAID)
WRITE: 9 x the write performance of one drive = 1260 MB/s (with the inner-track figure of 70MB/s it gives about 630MB/s for the RAID)

Then I started testing the pool itself.
First I turned off compression on the pool, because using dd with /dev/zero while compression is on gives meaningless results.
Second I generated a test file (only zeros) of 107GB... which seems large enough compared to the ARC size and the memory the system has.
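(Roughly how the file was created - 102400 x 1MiB of zeros, which is about 107GB, written straight to the dataset:)

dd if=/dev/zero of=/mnt/storage2/testfile bs=1048576 count=102400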

During generation of the file I was checking the pool throughput with:
zpool iostat storage2 1
and the disk load with:
gstat -p

I saw all disks working, each of them (in gstat) 99-100% busy.
zpool iostat reported a write speed of about 890MB/s, but not consistently... sometimes it dropped to around 170MB/s:


storage2 7.98T 16.5T 0 1.59K 0 164M
storage2 7.98T 16.5T 1 6.76K 256K 857M
storage2 7.99T 16.5T 0 6.88K 128K 836M
storage2 7.99T 16.5T 0 6.95K 128K 881M
storage2 7.99T 16.5T 2 6.58K 136K 835M
storage2 7.99T 16.5T 0 4.32K 0 517M
storage2 7.99T 16.5T 22 2.82K 20.5K 323M
storage2 7.99T 16.5T 0 6.71K 128K 853M
storage2 7.99T 16.5T 0 6.80K 0 871M
storage2 7.99T 16.5T 0 6.72K 0 860M
storage2 7.99T 16.5T 0 6.35K 0 794M
storage2 7.99T 16.5T 0 2.84K 126K 333M
storage2 7.99T 16.5T 21 3.71K 17.0K 413M
storage2 7.99T 16.5T 0 6.28K 128K 803M
storage2 7.99T 16.5T 0 6.68K 0 854M
storage2 7.99T 16.5T 0 6.80K 0 870M
storage2 7.99T 16.5T 0 6.16K 0 767M
storage2 7.99T 16.5T 0 2.87K 0 330M
storage2 8.00T 16.5T 1 1.69K 132K 172M
storage2 8.00T 16.5T 0 5.85K 0 749M
storage2 8.00T 16.5T 0 6.40K 0 819M
storage2 8.00T 16.5T 0 6.68K 0 843M
storage2 8.00T 16.5T 0 5.26K 0 644M
storage2 8.00T 16.5T 0 1.85K 0 230M
storage2 8.00T 16.5T 0 4.83K 0 574M
storage2 8.00T 16.5T 0 5.96K 128K 762M
storage2 8.00T 16.5T 0 6.70K 0 858M


So, there were two issues which I tried to explain:
1.
Why is write performance significantly slower than theoretical (1260 MB/s vs. a max of 890MB/s)... about 29% below theoretical?
I conclude it is not so bad... even very good - especially if I take into account that the inner tracks of my disks drop to about 70MB/s, which gives only 630MB/s of theoretical write speed.

2.
Why is write performance not stable, but fluctuating between 170 and 890MB/s?
I conclude it may be the result of ZFS buffer flushing (TXG sync).

I ran a test to count delayed writes vs. non-delayed writes with:

dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'


results:
delayed: 89%
not delayed: 11%

I also checked how much dirty data each TXG carries, using this DTrace script (dirty.d):


txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool


results were:

CPU ID FUNCTION:NAME
6 55778 none:txg-syncing 81MB of 4096MB used
3 55778 none:txg-syncing 3696MB of 4096MB used
7 55778 none:txg-syncing 3606MB of 4096MB used
3 55778 none:txg-syncing 3109MB of 4096MB used
6 55778 none:txg-syncing 3068MB of 4096MB used
2 55778 none:txg-syncing 3096MB of 4096MB used
2 55778 none:txg-syncing 3162MB of 4096MB used
1 55778 none:txg-syncing 3149MB of 4096MB used
5 55778 none:txg-syncing 3040MB of 4096MB used
4 55778 none:txg-syncing 3046MB of 4096MB used
0 55778 none:txg-syncing 3061MB of 4096MB used
6 55778 none:txg-syncing 3040MB of 4096MB used
3 55778 none:txg-syncing 3064MB of 4096MB used
1 55778 none:txg-syncing 3279MB of 4096MB used
2 55778 none:txg-syncing 3427MB of 4096MB used
4 55778 none:txg-syncing 3368MB of 4096MB used
5 55778 none:txg-syncing 3317MB of 4096MB used
2 55778 none:txg-syncing 3205MB of 4096MB used
3 55778 none:txg-syncing 3101MB of 4096MB used


All of this suggests that the pool cannot handle such a write workload... the pool tries to delay writes, hoping that a little later the disks will be less busy, but they are not, so the only way to flush the buffers is to throttle the workload and free up some IOPS for that.
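(For anyone who wants to check the same knobs on their own box: the write-throttle parameters should be visible via sysctl on FreeBSD - these are the names as I believe they exist on 9.10 / FreeBSD 10.3, please verify on your version:)

sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.delay_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent

If I read the defaults right, delays kick in once a TXG's dirty data crosses delay_min_dirty_percent of dirty_data_max (60%, so about 2.4GB here), which would be consistent with the ~3GB TXGs in the dirty.d output and the 89% delayed writes.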

To close this part, I should also mention that the CPU was loaded at most 20% during all these tests.


Now it's time for the second part of my tests - READ speed.
As before, I used dd, this time to read the previously created 107GB file.
I was expecting significant disk load and transfers near 1GB/s.

I used:
[root@PROD-1S] /mnt/storage2# dd if=/mnt/storage2/testfile of=/dev/null bs=1048576

The results were a little surprising:
1. disk load (gstat) 30...80%
2. transfer (zpool iostat) from about 300 to 850MB/s
3. dd reported a throughput of 1025152405 bytes/sec = 977 MB/s

dd reported a higher transfer rate (dd was, in effect, the 'client' of ZFS), and it seems I should treat that as the real value, not the numbers reported by zpool iostat.
zpool iostat apparently reports transfers to the pool itself and does not take the ARC into account (am I right??).
So the 'extra performance' comes from the ARC, specifically from the part of the 107GB file that was already cached there.
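(To confirm the ARC theory, the hit/miss counters can be watched while dd runs - sysctl names as I find them on FreeBSD; sampling them twice and diffing gives the hit rate for the interval:)

sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses kstat.zfs.misc.arcstats.size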

After this I got the idea of turning off ZFS PREFETCH, hoping that if ZFS did not do read-ahead on my not-so-fast disks, the result would be more stable.
But the result was terrible... only up to 80MB/s.
So it seems I did not understand correctly how prefetch works (can you explain??).
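(For reference, the knob I mean is the standard prefetch tunable below; on some FreeBSD versions it is writable at runtime, on others it has to be set as a boot-time tunable / in the FreeNAS tunables page. 1 = prefetch off, 0 = default:)

sysctl vfs.zfs.prefetch_disable=1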

In this part of the testing, the READ speed, and especially the 300...850MB/s fluctuation, is the biggest surprise for me.

During all these tests the CPU load again did not exceed 20%.

So, sorry for the long introduction to my final questions...
1. Do you think the tests above are reasonable? Do you see mistakes in the methodology of these tests and/or in the interpretation of the results?
2. Do you agree that WRITE performance cannot get better without changing hardware / adding more disks to the RAID?
3. How can I find the bottleneck causing the fluctuation during READ operations? Is it possible to dtrace the stages of ZFS while reading data?
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
Hi again,

To continue this subject, and maybe to convince other people that FreeNAS/ZFS really are powerful solutions...

Compared to my previous post, in which I analysed a system composed of 18x WD Red 3TB disks and 6x 1Gbit interfaces, there has been a change.
I analysed the bottlenecks of my system and decided to swap the disks for not exactly cheap ones... 18x Seagate EXOS 7E8 4TB (SATA).
The rest of the system stays the same for now (except maybe the upgrade of FreeNAS to 11.x)... and the results are impressive.

Take a look at this picture:
picture1.jpg

It is a dump of iSCSI target performance (during writing and reading). To provoke this load I simply forced all 140 of my VMware VMs to suspend and then started them again. As you can see, the performance is very nice, 650MB/s, and what is very important, it stays stable for about 10 minutes, which means the disks can really handle this load (previously I got a short peak thanks to the FreeNAS write cache and then, once the disks saturated, everything dropped).

Take a look also at the corresponding network saturation during this operation (6x 1Gbit):
picture2.jpg

The full 6Gbit is saturated, and it is likewise stable during the whole operation.

So, to give you some conclusions:
1. Even non-enterprise disks (like WD Red) give you the opportunity to build a very powerful system; under heavy load, however, they can deliver high performance only for some time. That they can at all is thanks to the improvements FreeNAS/ZFS provide. Of course you can play with a larger array to get better performance, but the floor you will never get under is the access time of a single disk.
2. With enterprise disks you can build a really powerful system for serious professional use (in my case my whole production runs on this system, 140 VMs).

I am going to measure my system in detail and attack the next bottleneck (which now seems to be the network :).

Good luck with your own setups... FreeNAS/ZFS is a really nice alternative compared to "professional systems from professional corporations" :)
 