Performance on RAIDZ2 - Big Hardware (CPU,RAM) but slow performance with disk not really busy?

m4rv1n · Apr 23, 2016

Hi to all.
I think to get best performance with my hardware, but it seems something slow all the system.

FreeNas Server Specs:

Chassis: Dell PowerEdge C2100 Server
Build: FreeNAS-9.10-STABLE-201604181743 (74ef270)
CPUs: 2 x Intel(R) Xeon(R) CPU Westmere X5650 @2.67GHz, 6 core 12 thread, SSE 4.2, AES
Memory: 49114MB
HBA: pci-e cross-flashed lsi-2118it
Storage Hard Disk(s): 6 - Wester Digital RED 2TB
- 5900RPM 64MB SATAIII (6Gb/s) 3.5"
Volume: Composed of 2 RAIDZ2 w/6 Disks

In all test the cpu is around this CPU: 0.0% user, 0.0% nice, 2.5% system, 0.1% interrupt, 97.3% idle

NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zvol 10.9T 2.61T 8.26T - 12% 24% 1.00x ONLINE /mnt
raidz2 10.9T 2.61T 8.26T - 12% 24%

Some test with compression off:
dd if=/dev/zero of=zerocompoff.dat bs=1024k count=20k
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 46.140274 secs (465424988 bytes/sec)

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 1147 0 0 0.0 1147 142867 1.8 97.1| da0
1 1044 0 0 0.0 1044 129394 2.0 97.1| da1
1 1057 0 0 0.0 1057 131195 2.0 97.5| da2
0 1068 0 0 0.0 1068 132222 1.9 97.2| da3
0 1151 0 0 0.0 1151 143163 1.8 97.4| da4
1 1049 0 0 0.0 1049 129917 2.0 97.3| da5

dd if=zerocompoff.dat of=/dev/null bs=1024k
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 84.695996 secs (253551968 bytes/sec)

dd if=/dev/random of=randomcompoff.dat bs=1024k count=20k
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 397.257573 secs (54057715 bytes/sec)

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 197 0 0 0.0 196 15701 0.6 22.1| da0
0 211 0 0 0.0 210 15836 0.5 20.0| da1
0 210 0 0 0.0 208 15802 0.6 25.9| da2
0 206 0 0 0.0 204 15795 0.7 27.2| da3
0 209 0 0 0.0 207 15798 0.5 19.7| da4
0 196 0 0 0.0 194 15682 0.6 25.1| da5

dd if=/dev/random of=randomcompoff.dat bs=4M count=10000
10000+0 records in
10000+0 records out
41943040000 bytes transferred in 773.239792 secs (54243251 bytes/sec)

Some test with compression on:
dd if=/dev/zero of=zero.dat bs=1024k count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 12.880493 secs (833618566 bytes/sec)

dd if=zero.dat of=/dev/null bs=1024k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 2.798744 secs (3836513229 bytes/sec)

dd if=/dev/random of=random.dat bs=1024k count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 196.514067 secs (54639438 bytes/sec)

The disk speed if I write from zero, without compression, is about 465MB/s with a gstat about 97% for all the disk, instead if I write from random is about 54MB/s and a gstat about 25% for all the disk. All the big value (over 500MB/s) are read from cache or are compressed zero so are not important.
Also resilver speed is slow --> scan: resilvered 791G in 7h37m (29,54 MB/s)

What do you think is the problem?
Thank you

m4rv1n · Apr 24, 2016

./arc_summary.py
System Memory:

0.03% 15.84 MiB Active, 0.77% 366.47 MiB Inact
96.33% 45.00 GiB Wired, 0.00% 0 Bytes Cache
2.87% 1.34 GiB Free, 0.00% 0 Bytes Gap

Real Installed: 48.00 GiB
Real Available: 99.92% 47.96 GiB
Real Managed: 97.40% 46.72 GiB

Logical Total: 48.00 GiB
Logical Used: 96.46% 46.30 GiB
Logical Free: 3.54% 1.70 GiB

Kernel Memory: 464.80 MiB
Data: 94.31% 438.37 MiB
Text: 5.69% 26.43 MiB

Kernel Memory Map: 46.72 GiB
Size: 90.07% 42.07 GiB
Free: 9.93% 4.64 GiB
Page: 1
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
Storage pool Version: 5000
Filesystem Version: 5
Memory Throttle Count: 0

ARC Misc:
Deleted: 5.70m
Mutex Misses: 3.44k
Evict Skips: 3.44k

ARC Size: 92.98% 42.51 GiB
Target Size: (Adaptive) 92.98% 42.51 GiB
Min Size (Hard Limit): 12.50% 5.71 GiB
Max Size (High Water): 8:1 45.72 GiB

ARC Size Breakdown:
Recently Used Cache Size: 93.72% 39.84 GiB
Frequently Used Cache Size: 6.28% 2.67 GiB

ARC Hash Breakdown:
Elements Max: 5.09m
Elements Current: 100.00% 5.09m
Collisions: 6.27m
Chain Max: 7
Chains: 1.04m
Page: 2
------------------------------------------------------------------------

ARC Total accesses: 57.80m
Cache Hit Ratio: 88.01% 50.87m
Cache Miss Ratio: 11.99% 6.93m
Actual Hit Ratio: 86.34% 49.91m

Data Demand Efficiency: 92.44% 31.86m
Data Prefetch Efficiency: 3.35% 2.33m

CACHE HITS BY CACHE LIST:
Most Recently Used: 15.71% 7.99m
Most Frequently Used: 82.39% 41.91m
Most Recently Used Ghost: 5.06% 2.58m
Most Frequently Used Ghost: 0.85% 434.94k

CACHE HITS BY DATA TYPE:
Demand Data: 57.88% 29.45m
Prefetch Data: 0.15% 78.19k
Demand Metadata: 35.93% 18.28m
Prefetch Metadata: 6.03% 3.07m

CACHE MISSES BY DATA TYPE:
Demand Data: 34.76% 2.41m
Prefetch Data: 32.52% 2.25m
Demand Metadata: 8.16% 565.59k
Prefetch Metadata: 24.56% 1.70m
Page: 3
------------------------------------------------------------------------

Page: 4
------------------------------------------------------------------------

DMU Prefetch Efficiency: 24.82m
Hit Ratio: 1.01% 249.97k
Miss Ratio: 98.99% 24.57m

Page: 5
------------------------------------------------------------------------

Page: 6
------------------------------------------------------------------------

ZFS Tunable (sysctl):
kern.maxusers 3405
vm.kmem_size 50160381952
vm.kmem_size_scale 1
vm.kmem_size_min 0
vm.kmem_size_max 1319413950874
vfs.zfs.vol.unmap_enabled 1
vfs.zfs.vol.mode 2
vfs.zfs.sync_pass_rewrite 2
vfs.zfs.sync_pass_dont_compress 5
vfs.zfs.sync_pass_deferred_free 2
vfs.zfs.zio.exclude_metadata 0
vfs.zfs.zio.use_uma 1
vfs.zfs.cache_flush_disable 0
vfs.zfs.zil_replay_disable 0
vfs.zfs.version.zpl 5
vfs.zfs.version.spa 5000
vfs.zfs.version.acl 1
vfs.zfs.version.ioctl 6
vfs.zfs.debug 0
vfs.zfs.super_owner 0
vfs.zfs.min_auto_ashift 9
vfs.zfs.max_auto_ashift 13
vfs.zfs.vdev.write_gap_limit 4096
vfs.zfs.vdev.read_gap_limit 32768
vfs.zfs.vdev.aggregation_limit 131072
vfs.zfs.vdev.trim_max_active 64
vfs.zfs.vdev.trim_min_active 1
vfs.zfs.vdev.scrub_max_active 2
vfs.zfs.vdev.scrub_min_active 1
vfs.zfs.vdev.async_write_max_active 10
vfs.zfs.vdev.async_write_min_active 1
vfs.zfs.vdev.async_read_max_active 3
vfs.zfs.vdev.async_read_min_active 1
vfs.zfs.vdev.sync_write_max_active 10
vfs.zfs.vdev.sync_write_min_active 10
vfs.zfs.vdev.sync_read_max_active 10
vfs.zfs.vdev.sync_read_min_active 10
vfs.zfs.vdev.max_active 1000
vfs.zfs.vdev.async_write_active_max_dirty_percent60
vfs.zfs.vdev.async_write_active_min_dirty_percent30
vfs.zfs.vdev.mirror.non_rotating_seek_inc1
vfs.zfs.vdev.mirror.non_rotating_inc 0
vfs.zfs.vdev.mirror.rotating_seek_offset1048576
vfs.zfs.vdev.mirror.rotating_seek_inc 5
vfs.zfs.vdev.mirror.rotating_inc 0
vfs.zfs.vdev.trim_on_init 1
vfs.zfs.vdev.larger_ashift_minimal 0
vfs.zfs.vdev.bio_delete_disable 0
vfs.zfs.vdev.bio_flush_disable 0
vfs.zfs.vdev.cache.bshift 16
vfs.zfs.vdev.cache.size 0
vfs.zfs.vdev.cache.max 16384
vfs.zfs.vdev.metaslabs_per_vdev 200
vfs.zfs.vdev.trim_max_pending 10000
vfs.zfs.txg.timeout 5
vfs.zfs.trim.enabled 1
vfs.zfs.trim.max_interval 1
vfs.zfs.trim.timeout 30
vfs.zfs.trim.txg_delay 32
vfs.zfs.space_map_blksz 4096
vfs.zfs.spa_slop_shift 5
vfs.zfs.spa_asize_inflation 24
vfs.zfs.deadman_enabled 1
vfs.zfs.deadman_checktime_ms 5000
vfs.zfs.deadman_synctime_ms 1000000
vfs.zfs.debug_flags 0
vfs.zfs.recover 0
vfs.zfs.spa_load_verify_data 1
vfs.zfs.spa_load_verify_metadata 1
vfs.zfs.spa_load_verify_maxinflight 10000
vfs.zfs.ccw_retry_interval 300
vfs.zfs.check_hostid 1
vfs.zfs.mg_fragmentation_threshold 85
vfs.zfs.mg_noalloc_threshold 0
vfs.zfs.condense_pct 200
vfs.zfs.metaslab.bias_enabled 1
vfs.zfs.metaslab.lba_weighting_enabled 1
vfs.zfs.metaslab.fragmentation_factor_enabled1
vfs.zfs.metaslab.preload_enabled 1
vfs.zfs.metaslab.preload_limit 3
vfs.zfs.metaslab.unload_delay 8
vfs.zfs.metaslab.load_pct 50
vfs.zfs.metaslab.min_alloc_size 33554432
vfs.zfs.metaslab.df_free_pct 4
vfs.zfs.metaslab.df_alloc_threshold 131072
vfs.zfs.metaslab.debug_unload 0
vfs.zfs.metaslab.debug_load 0
vfs.zfs.metaslab.fragmentation_threshold70
vfs.zfs.metaslab.gang_bang 16777217
vfs.zfs.free_bpobj_enabled 1
vfs.zfs.free_max_blocks 18446744073709551615
vfs.zfs.no_scrub_prefetch 0
vfs.zfs.no_scrub_io 0
vfs.zfs.resilver_min_time_ms 3000
vfs.zfs.free_min_time_ms 1000
vfs.zfs.scan_min_time_ms 1000
vfs.zfs.scan_idle 50
vfs.zfs.scrub_delay 4
vfs.zfs.resilver_delay 2
vfs.zfs.top_maxinflight 32
vfs.zfs.delay_scale 500000
vfs.zfs.delay_min_dirty_percent 60
vfs.zfs.dirty_data_sync 67108864
vfs.zfs.dirty_data_max_percent 10
vfs.zfs.dirty_data_max_max 4294967296
vfs.zfs.dirty_data_max 4294967296
vfs.zfs.max_recordsize 1048576
vfs.zfs.zfetch.array_rd_sz 1048576
vfs.zfs.zfetch.max_distance 8388608
vfs.zfs.zfetch.min_sec_reap 2
vfs.zfs.zfetch.max_streams 8
vfs.zfs.prefetch_disable 0
vfs.zfs.mdcomp_disable 0
vfs.zfs.nopwrite_enabled 1
vfs.zfs.dedup.prefetch 1
vfs.zfs.l2c_only_size 0
vfs.zfs.mfu_ghost_data_lsize 13633547264
vfs.zfs.mfu_ghost_metadata_lsize 15828674560
vfs.zfs.mfu_ghost_size 29462221824
vfs.zfs.mfu_data_lsize 19100094464
vfs.zfs.mfu_metadata_lsize 792016384
vfs.zfs.mfu_size 20064577024
vfs.zfs.mru_ghost_data_lsize 9467189248
vfs.zfs.mru_ghost_metadata_lsize 771992064
vfs.zfs.mru_ghost_size 10239181312
vfs.zfs.mru_data_lsize 23504714752
vfs.zfs.mru_metadata_lsize 17272320
vfs.zfs.mru_size 23708516864
vfs.zfs.anon_data_lsize 0
vfs.zfs.anon_metadata_lsize 0
vfs.zfs.anon_size 460800
vfs.zfs.l2arc_norw 0
vfs.zfs.l2arc_feed_again 1
vfs.zfs.l2arc_noprefetch 0
vfs.zfs.l2arc_feed_min_ms 200
vfs.zfs.l2arc_feed_secs 1
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_write_boost 40000000
vfs.zfs.l2arc_write_max 10000000
vfs.zfs.arc_meta_limit 12271660032
vfs.zfs.arc_free_target 84942
vfs.zfs.arc_shrink_shift 7
vfs.zfs.arc_average_blocksize 8192
vfs.zfs.arc_min 6135830016
vfs.zfs.arc_max 49086640128
Page: 7
------------------------------------------------------------------------

styno · Apr 24, 2016

I don't have the time to do some reference tests right now, but writing from /dev/random was always slower in my previous tests due to the slow(er) random generator.
This is most likely not zfs/hardware related. What are your real-life transfer speeds?

m4rv1n · Apr 24, 2016

In real-life, the storage is attached to a vmware esx with a 4Gbps fc. The test is ok (test attached)

And this is when I copy and write to the same datastore from esx:

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
3 741 445 28079 5.0 291 28742 1.4 89.9| da0
0 457 201 7631 4.2 253 25763 1.0 54.1| da1
3 560 276 28242 9.8 281 29277 2.4 96.1| da2
1 573 318 21454 5.1 251 25447 1.4 70.6| da3
3 413 149 7946 9.2 261 26621 1.1 68.7| da4
4 604 342 21574 4.7 258 25379 1.5 79.9| da5

The big problem for me is the resilver speed, scan: resilvered 791G in 7h37m (29,54 MB/s)
I get at least 4x this speed (with a maximum of 500MB/s) with same disk but with raidz1 on a microserver hp gen8 with only 6GB of RAM. Of course raidz1 is faster than razidz2 but this is enormous difference.

hugovsky · Apr 24, 2016

tl;dr. How full is your pool?

EDIT: nervermind. I've found it. It doesn't seem a problem.

Robert Trevellyan · Apr 24, 2016

m4rv1n said:
The big problem for me is the resilver speed, scan: resilvered 791G in 7h37m (29,54 MB/s)

What is the workload and what is on the pool? Heavy fragmentation or a very large number of small files could lead to slow scrubs. Another candidate is a failing hard drive.

m4rv1n · Apr 25, 2016

Resilver was made with no other workload. The frag is 12% . About 60GB are big and small files, other 2TB are big files (vmware disk .vmdk).
Smart report are ok on all drive and no drive present a 100% busy , also when I make normal test the system is good (test in the previous post).

Robert Trevellyan · Apr 25, 2016

It's funny, I thought you were concerned about slow scrubs, but reading again I see it was a resilver. A RAIDZ2 pool is pretty busy during resilver, and 29.54MB/s doesn't seem that bad to me.

jgreco · Apr 26, 2016

m4rv1n said:
Resilver was made with no other workload. The frag is 12% . About 60GB are big and small files, other 2TB are big files (vmware disk .vmdk).
Smart report are ok on all drive and no drive present a 100% busy , also when I make normal test the system is good (test in the previous post).

For purposes of a scrub or resilver, a VMware vmdk file is likely to be a highly fragmented thing, so you're best off as thinking of that as a really bad workload.

RAIDZ2 is generally inappropriate for VM storage, though it can be made to work for a small number of simultaneous lightweight VM's. You will only have the IOPS performance of approximately one of the component devices, and that's going to hurt during scrubs etc. The only thing that's really saving you here is the ARC. Once things get fragmented, scrubs will take much longer. I imagine your HP microserver had a much fresher pool, and therefore much less fragmented.

You can try moving the VM datafiles off of the pool and then back on, and you'll probably hit a much higher speed again, until high fragmentation reappears.

m4rv1n · Apr 26, 2016

@Robert Trevellyan this is the status during a scrub

scan: scrub in progress since Tue Apr 26 09:46:40 2016
357G scanned out of 2.64T at 117M/s, 5h41m to go

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 1342 1342 32922 0.4 0 0 0.0 26.6| da0
0 1360 1360 34391 0.4 0 0 0.0 28.6| da1
0 1348 1348 32321 0.3 0 0 0.0 24.7| da2
0 1330 1330 34326 0.5 0 0 0.0 38.2| da3
2 1224 1224 28398 0.4 0 0 0.0 28.6| da4
0 1038 1038 30499 1.2 0 0 0.0 63.1| da5

CPU: 0.0% user, 0.0% nice, 1.2% system, 0.1% interrupt, 98.7% idle
--------------------------------------------------------------------------------------------------------------
@jgreco The vmdk files is backup (simple copy from other datastore to this datastore), so no machine running in this datastore.
The scrub go at 117M/s , so the resilver speed seems to me really poor. Boot test was made without other load on the disk.

Can you explain more why raidz2 is so bad with vmfs? The test inside the VM is the one in the previus image post, so it doesn't look so bad to me. Not?

jgreco · Apr 26, 2016

m4rv1n said:
Can you explain more why raidz2 is so bad with vmfs?

It's bad for the same reason you shouldn't serve up databases. RAIDZ is good at large blocks of sequential storage. It sucks for random access.

The test inside the VM is the one in the previus image post, so it doesn't look so bad to me. Not?

.... so what? ZFS will do its best to do what it can for you. That has nothing to what sort of nightmare is being generated behind the curtains.

m4rv1n · Apr 26, 2016

Ok but I saw I copy full vmdk files so I copy single files of 100GB at time.
Of course zfs will do what it can do, but I want better understand why this big difference between normal use/scrub/resilver when in scrub and resilver process the disk never is busy a lot (only about 25%) and also the cpu is not stressed. Where the flow is blocked?

depasseg · Apr 26, 2016

I'm a little confused. The subject of this post is Big hardware, slow performance. Yet neither of those seem to be the case. You seem fine with the performance of your VM's performance, but are concerned with the speed of a resilver. And the "big hardware" is only 6 disks worth of 5900rpm rust platters. That is not "Big" by any stretch of the imagination.

And then you are comparing sequential writes with high IOPS required random IO (resilver). You've only got the IOPS performance of 1 (ONE) 5900rpm drive. That's it. Don't expect much.
Sequential performance is better - you will get the aggregation of 4 drives (the data drives) worth of writes, not including any caching you might be using.

And finally, why are you so concerned with resilver performance? It shouldn't be happening that often. And the pool should be offline when it is happening. Otherwise it will slow to a crawl.

m4rv1n · Apr 26, 2016

Hi Depasseg, the "big" is not related to disk but to cpu and RAM, my bad, I will modify title. My concern is because the disk is never really busy in this operation (scrub or resilver) but is only about 25% busy, also the cpu is about 99% idle.
With a lowest hardware (1 cpu 4 core and 6GB of ram, 4 disk in raidz) I get at least 4x this speed in scrub and resilver.
This pool have large backup files, the pool of the small hardware have more little files.
I'm concerned with resilver time because when I will be in production I can't put it offline, so it will be more slowly than now.

jgreco · Apr 26, 2016

If you need it to be fast in production during a resilver, then you need more vdevs, and probably mirror vdevs. Three way mirror vdevs work nicely.

m4rv1n · Apr 26, 2016

And with this I will speed up the resilver or only don't will be slower?
I would like to find the problem now with disk not really busy because of big difference with small hardware.

depasseg · Apr 26, 2016

To compare apples to apples, try configuring your big server with RAIDz1 and testing and configuring your small server with RAIDZ2 and test that.

How are you triggering the resilver?

m4rv1n · Apr 26, 2016

I can't destroy all vdev in this moment, but I can say you that when the small server is resilvering it take busy all disk, the big server don't do this. Also for the scrub is the same thing, over 500MB/s in small server vs 117MB/s in big server where the resource of disk are not taken over 25%.
If possible I prefer to make analisys test rather than make test that "don't know why but work".

Here a comparative available between raidz performance of resilver http://louwrentius.com/zfs-resilver-performance-of-various-raid-schemas.html
I trigger the resilver by offline the disk, format disk and replace the missing disk with the formatted disk.

In other thread I see this speed during resilver (under 30MB/s) can come from a bad disk but all the disk is 25% busy and smart status is ok.
What block the server to make the disks busy over 25%? Where is the bootlenk and how can I find it?

depasseg · Apr 26, 2016

m4rv1n said:
If possible I prefer to make analisys test rather than make test that "don't know why but work".

If you want to analyse this, then break down the problem and test it in steps.
If it were me, I would start by creating 6 individual pools (one on each drive), and fill up each with the same data, then run a scrub and monitor zpool iostat -v tank[1-6], zpool status tank[1-6] (for the scrub status) as well as gstat. This will baseline each individual disk. Then put them in mirrored pairs and re-run the scrub, then raidz1 and re-run the scrub and finally raidz2 and re-run.

Robert Trevellyan · Apr 26, 2016

Is this box hosting live VM storage, or just a backup destination for vmdk files? If the latter, RAIDZ2 should be fine. If the former, striped mirrors are preferred.

Important Announcement for the TrueNAS Community.

Performance on RAIDZ2 - Big Hardware (CPU,RAM) but slow performance with disk not really busy?

Explorer

Explorer

Patron

Explorer

Guru

Pony Wrangler

Explorer

Pony Wrangler

Resident Grinch

Explorer

Resident Grinch

Explorer

FreeNAS Replicant

Explorer

Resident Grinch

Explorer

FreeNAS Replicant

Explorer

FreeNAS Replicant

Pony Wrangler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Performance on RAIDZ2 - Big Hardware (CPU,RAM) but slow performance with disk not really busy?"

Similar threads