Performance problem

Jano
Hi,

Inspired by a fairly old post: https://forums.freenas.org/index.php?threads/write-performance-issues-mid-level-zfs-setup.13372/
I decided to run some tests on my platform... However, I'm having trouble finding my mistake and would like to ask you for a consultation.

Let's start from the beginning...

My system is based on FreeNAS-9.10.1-U4 (ec9a7d3).
It has:
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
64GB RAM
9 x NIC (Intel PRO/1000, 1Gbit, 1500 MTU) - 6x DATA channels used as iSCSI interfaces to ESXi, 2x BACKUP channels to another FreeNAS system, 1x ADM
RAID10 composed of 9 mirrored pairs of disks (all disks are WD Red 3TB, 5400 rpm)
The disks are attached partly through an LSI 9260 controller and partly via the onboard SATA ports.
There is also a 120GB SSD for L2ARC, but for all these tests it was turned off.

The ZFS parameters are mostly at their defaults:
[root@PROD-1S] ~# zfs get all storage2
NAME PROPERTY VALUE SOURCE
storage2 type filesystem -
storage2 creation Fri Apr 21 13:10 2017 -
storage2 used 20.4T -
storage2 available 3.28T -
storage2 referenced 144G -
storage2 compressratio 1.34x -
storage2 mounted yes -
storage2 quota none local
storage2 reservation none local
storage2 recordsize 128K default
storage2 mountpoint /mnt/storage2 default
storage2 sharenfs off default
storage2 checksum on default
storage2 compression off local
storage2 atime off local
storage2 devices on default
storage2 exec on default
storage2 setuid on default
storage2 readonly off default
storage2 jailed off default
storage2 snapdir hidden default
storage2 aclmode passthrough local
storage2 aclinherit passthrough local
storage2 canmount on default
storage2 xattr off temporary
storage2 copies 1 default
storage2 version 5 -
storage2 utf8only off -
storage2 normalization none -
storage2 casesensitivity sensitive -
storage2 vscan off default
storage2 nbmand off default
storage2 sharesmb off default
storage2 refquota none local
storage2 refreservation none local
storage2 primarycache all default
storage2 secondarycache all default
storage2 usedbysnapshots 0 -
storage2 usedbydataset 144G -
storage2 usedbychildren 20.3T -
storage2 usedbyrefreservation 0 -
storage2 logbias latency default
storage2 dedup off local
storage2 mlslabel -
storage2 sync disabled local
storage2 refcompressratio 1.00x -
storage2 written 144G -
storage2 logicalused 10.5T -
storage2 logicalreferenced 144G -
storage2 volmode default default
storage2 filesystem_limit none default
storage2 snapshot_limit none default
storage2 filesystem_count none default
storage2 snapshot_count none default
storage2 redundant_metadata all default


Composition of pool:

[root@PROD-1S] ~# zpool status storage2
pool: storage2
state: ONLINE
scan: scrub repaired 0 in 114h22m with 0 errors on Sat Apr 29 06:22:48 2017
config:

NAME STATE READ WRITE CKSUM
storage2 ONLINE 0 0 0
  mirror-0 ONLINE 0 0 0
    gptid/227efce1-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/232e4be0-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-1 ONLINE 0 0 0
    gptid/42c536e5-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/4371d520-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-2 ONLINE 0 0 0
    gptid/ca3f4bae-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/ca992fc6-2683-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-3 ONLINE 0 0 0
    gptid/e2a2ad81-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/e2e2ebe7-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-4 ONLINE 0 0 0
    gptid/fea2fea2-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/fed7a02c-275c-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-5 ONLINE 0 0 0
    gptid/242437c7-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/24666d69-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-6 ONLINE 0 0 0
    gptid/b999ffd4-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/ba714e80-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-7 ONLINE 0 0 0
    gptid/fbc0a7da-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
    gptid/fc09d437-275d-11e7-80e6-90e2ba36a9d0 ONLINE 0 0 0
  mirror-8 ONLINE 0 0 0
    gptid/2a733a0f-3564-11e7-8b69-90e2ba36a9d0 ONLINE 0 0 0
    gptid/2b01f6e4-3564-11e7-8b69-90e2ba36a9d0 ONLINE 0 0 0

errors: No known data errors


System tunables are the standard ones generated automatically by FreeNAS.


So... what I did...

First I turned off iSCSI completely and started checking the READ performance of each drive with a simple dd:

dd if=/dev/xxx of=/dev/null bs=262144

I checked each drive that is part of my pool and the results were more or less the same... 140MB/s max.
Of course, this is sequential read.
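(Side note: to avoid reading an entire 3TB disk, a bounded variant like the one below could be used; count=4096 at bs=262144 reads exactly 1GiB and FreeBSD's dd then prints a "bytes transferred in ... secs (... bytes/sec)" summary at the end. ada0 is just an example device.)

dd if=/dev/ada0 of=/dev/null bs=262144 count=4096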

I compared this with diskinfo:

[root@PROD-1S] ~# diskinfo -t /dev/ada0
/dev/ada0
512 # sectorsize
3000592982016 # mediasize in bytes (2.7T)
5860533168 # mediasize in sectors
4096 # stripesize
0 # stripeoffset
5814021 # Cylinders according to firmware.
16 # Heads according to firmware.
63 # Sectors according to firmware.
WD-WMC4N2099293 # Disk ident.

Seek times:
Full stroke: 250 iter in 6.608834 sec = 26.435 msec
Half stroke: 250 iter in 4.833553 sec = 19.334 msec
Quarter stroke: 500 iter in 7.409172 sec = 14.818 msec
Short forward: 400 iter in 2.446857 sec = 6.117 msec
Short backward: 400 iter in 3.096281 sec = 7.741 msec
Seq outer: 2048 iter in 0.172173 sec = 0.084 msec
Seq inner: 2048 iter in 0.159437 sec = 0.078 msec
Transfer rates:
outside: 102400 kbytes in 0.713659 sec = 143486 kbytes/sec
middle: 102400 kbytes in 0.870385 sec = 117649 kbytes/sec
inside: 102400 kbytes in 1.440545 sec = 71084 kbytes/sec


So... comparable... even a bit too good, but dd started reading at the "beginning of the disk". Data located at the "end of the disk" will probably be closer to 70MB/s.
The test is based on READ operations (as I cannot damage data), but WRITE performance should be comparable.

OK, anyway... in the FreeNAS UI reports I see a nice, stable load on each disk.

Next I ran the same operation on all disks (18 disks) at the same time, to rule out possible interference between disks, controllers, etc.
So I did:

foreach i ( ada0 ada1 ada3 ada4 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7 mfid8 mfid9 mfid10 mfid11 mfid12 mfid13 mfid14 mfid15)
foreach? dd if=/dev/${i} of=/dev/null bs=262144 &
foreach? end


The result was also as expected... every disk delivered the same performance as in the individual tests.
So the physical disk subsystem can transfer up to 140MB/s per disk, on all disks at the same time.
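(For completeness, the same parallel test written for plain sh instead of csh - a rough sketch only; count= bounds each read to about 4GiB so the run does not take hours, and wait blocks until every background dd has finished:)

#!/bin/sh
# read from all disks concurrently, ~4GiB from each, then wait for all dd processes
for i in ada0 ada1 ada3 ada4 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7 \
         mfid8 mfid9 mfid10 mfid11 mfid12 mfid13 mfid14 mfid15; do
    dd if=/dev/${i} of=/dev/null bs=262144 count=16384 &
done
wait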

As I have RAID10 (9x2 drives), the theoretical performance is:
READ: 18 x the read performance of one drive = 2520 MB/s (if I take into account the inner tracks of the disks, where performance drops to about 70MB/s, that gives about 1260MB/s for the RAID)
WRITE: 9 x the write performance of one drive = 1260 MB/s (with the inner-track figure of 70MB/s it gives about 630MB/s for the RAID)

Then I started testing the pool itself.
First I turned off compression on the pool, because using dd with /dev/zero while compression is on gives meaningless results.
Second I generated a test file (only zeros) of 107GB... which seems large enough compared to the ARC size and the memory the system has.
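(Roughly how the file was created - 102400 x 1MiB of zeros, which is about 107GB, written straight to the dataset:)

dd if=/dev/zero of=/mnt/storage2/testfile bs=1048576 count=102400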

During generation of the file I was checking the pool throughput with:
zpool iostat storage2 1
and the disk load with:
gstat -p

I saw all disks working, each of them (in gstat) 99-100% busy.
zpool iostat reported a write speed of about 890MB/s, but not consistently... sometimes it dropped to around 170MB/s:


storage2 7.98T 16.5T 0 1.59K 0 164M
storage2 7.98T 16.5T 1 6.76K 256K 857M
storage2 7.99T 16.5T 0 6.88K 128K 836M
storage2 7.99T 16.5T 0 6.95K 128K 881M
storage2 7.99T 16.5T 2 6.58K 136K 835M
storage2 7.99T 16.5T 0 4.32K 0 517M
storage2 7.99T 16.5T 22 2.82K 20.5K 323M
storage2 7.99T 16.5T 0 6.71K 128K 853M
storage2 7.99T 16.5T 0 6.80K 0 871M
storage2 7.99T 16.5T 0 6.72K 0 860M
storage2 7.99T 16.5T 0 6.35K 0 794M
storage2 7.99T 16.5T 0 2.84K 126K 333M
storage2 7.99T 16.5T 21 3.71K 17.0K 413M
storage2 7.99T 16.5T 0 6.28K 128K 803M
storage2 7.99T 16.5T 0 6.68K 0 854M
storage2 7.99T 16.5T 0 6.80K 0 870M
storage2 7.99T 16.5T 0 6.16K 0 767M
storage2 7.99T 16.5T 0 2.87K 0 330M
storage2 8.00T 16.5T 1 1.69K 132K 172M
storage2 8.00T 16.5T 0 5.85K 0 749M
storage2 8.00T 16.5T 0 6.40K 0 819M
storage2 8.00T 16.5T 0 6.68K 0 843M
storage2 8.00T 16.5T 0 5.26K 0 644M
storage2 8.00T 16.5T 0 1.85K 0 230M
storage2 8.00T 16.5T 0 4.83K 0 574M
storage2 8.00T 16.5T 0 5.96K 128K 762M
storage2 8.00T 16.5T 0 6.70K 0 858M


So, there were two issues which I tried to explain:
1.
Why is write performance significantly slower than theoretical (1260 MB/s vs. a max of 890MB/s)... about 29% below theoretical?
I conclude it is not so bad... even very good - especially if I take into account that the inner tracks of my disks drop to about 70MB/s, which gives only 630MB/s of theoretical write speed.

2.
Why is write performance not stable, but fluctuating between 170 and 890MB/s?
I conclude it may be the result of ZFS buffer flushing (TXG sync).

I ran a test to count delayed writes vs. non-delayed writes with:

dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'


results:
delayed: 89%
not delayed: 11%

I also checked how much dirty data each TXG carries, using this DTrace script (dirty.d):


txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool


results were:

CPU ID FUNCTION:NAME
6 55778 none:txg-syncing 81MB of 4096MB used
3 55778 none:txg-syncing 3696MB of 4096MB used
7 55778 none:txg-syncing 3606MB of 4096MB used
3 55778 none:txg-syncing 3109MB of 4096MB used
6 55778 none:txg-syncing 3068MB of 4096MB used
2 55778 none:txg-syncing 3096MB of 4096MB used
2 55778 none:txg-syncing 3162MB of 4096MB used
1 55778 none:txg-syncing 3149MB of 4096MB used
5 55778 none:txg-syncing 3040MB of 4096MB used
4 55778 none:txg-syncing 3046MB of 4096MB used
0 55778 none:txg-syncing 3061MB of 4096MB used
6 55778 none:txg-syncing 3040MB of 4096MB used
3 55778 none:txg-syncing 3064MB of 4096MB used
1 55778 none:txg-syncing 3279MB of 4096MB used
2 55778 none:txg-syncing 3427MB of 4096MB used
4 55778 none:txg-syncing 3368MB of 4096MB used
5 55778 none:txg-syncing 3317MB of 4096MB used
2 55778 none:txg-syncing 3205MB of 4096MB used
3 55778 none:txg-syncing 3101MB of 4096MB used


All of this suggests that the pool cannot handle such a write workload... the pool tries to delay writes, hoping that a little later the disks will be less busy, but they are not, so the only way to flush the buffers is to throttle the workload and free up some IOPS for that.
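(For anyone who wants to check the same knobs on their own box: the write-throttle parameters should be visible via sysctl on FreeBSD - these are the names as I believe they exist on 9.10 / FreeBSD 10.3, please verify on your version:)

sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.delay_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent

If I read the defaults right, delays kick in once a TXG's dirty data crosses delay_min_dirty_percent of dirty_data_max (60%, so about 2.4GB here), which would be consistent with the ~3GB TXGs in the dirty.d output and the 89% delayed writes.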

To close this part, I should also mention that the CPU was loaded at most 20% during all these tests.


Now it's time for the second part of my tests - READ speed.
As before, I used dd, this time to read the previously created 107GB file.
I was expecting significant disk load and transfers near 1GB/s.

I used:
[root@PROD-1S] /mnt/storage2# dd if=/mnt/storage2/testfile of=/dev/null bs=1048576

The results were a little surprising:
1. disk load (gstat) 30...80%
2. transfer (zpool iostat) from about 300 to 850MB/s
3. dd reported a throughput of 1025152405 bytes/sec = 977 MB/s

dd reported a higher transfer rate (dd was, in effect, the 'client' of ZFS), and it seems I should treat that as the real value, not the numbers reported by zpool iostat.
zpool iostat apparently reports transfers to the pool itself and does not take the ARC into account (am I right??).
So the 'extra performance' comes from the ARC, specifically from the part of the 107GB file that was already cached there.
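(To confirm the ARC theory, the hit/miss counters can be watched while dd runs - sysctl names as I find them on FreeBSD; sampling them twice and diffing gives the hit rate for the interval:)

sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses kstat.zfs.misc.arcstats.size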

After this I got the idea of turning off ZFS PREFETCH, hoping that if ZFS did not do read-ahead on my not-so-fast disks, the result would be more stable.
But the result was terrible... only up to 80MB/s.
So it seems I did not understand correctly how prefetch works (can you explain??).
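(For reference, the knob I mean is the standard prefetch tunable below; on some FreeBSD versions it is writable at runtime, on others it has to be set as a boot-time tunable / in the FreeNAS tunables page. 1 = prefetch off, 0 = default:)

sysctl vfs.zfs.prefetch_disable=1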

In this part of the testing, the READ speed, and especially the 300...850MB/s fluctuation, is the biggest surprise for me.

During all these tests the CPU load again did not exceed 20%.

So, sorry for the long introduction to my final questions...
1. Do you think the tests above are reasonable? Do you see mistakes in the methodology of these tests and/or in the interpretation of the results?
2. Do you agree that WRITE performance cannot get better without changing hardware / adding more disks to the RAID?
3. How can I find the bottleneck causing the fluctuation during READ operations? Is it possible to dtrace the stages of ZFS while reading data?
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
Hi again,

To continue this subject, and maybe to convince other people that FreeNAS/ZFS really are powerful solutions...

Compared to my previous post, in which I analysed a system composed of 18x WD Red 3TB disks and 6x 1Gbit interfaces, there has been a change.
I analysed the bottlenecks of my system and decided to swap the disks for not exactly cheap ones... 18x Seagate EXOS 7E8 4TB (SATA).
The rest of the system stays the same for now (except maybe the upgrade of FreeNAS to 11.x)... and the results are impressive.

Take a look at this picture:
picture1.jpg

It is a dump of iSCSI target performance (during writing and reading). To provoke this load I simply forced all 140 of my VMware VMs to suspend and then started them again. As you can see, the performance is very nice, 650MB/s, and what is very important, it stays stable for about 10 minutes, which means the disks can really handle this load (previously I got a short peak thanks to the FreeNAS write cache and then, once the disks saturated, everything dropped).

Take a look also at the corresponding network saturation during this operation (6x 1Gbit):
picture2.jpg

The full 6Gbit is saturated, and it is likewise stable during the whole operation.

So, to give you some conclusions:
1. Even non-enterprise disks (like WD Red) give you the opportunity to build a very powerful system; under heavy load, however, they can deliver high performance only for some time. That they can at all is thanks to the improvements FreeNAS/ZFS provide. Of course you can play with a larger array to get better performance, but the floor you will never get under is the access time of a single disk.
2. With enterprise disks you can build a really powerful system for serious professional use (in my case my whole production runs on this system, 140 VMs).

I am going to measure my system in detail and attack the next bottleneck (which now seems to be the network :).

Good luck with your own setups... FreeNAS/ZFS is a really nice alternative compared to "professional systems from professional corporations" :)
 