Which hardware/config is the bottleneck in my system?

Status
Not open for further replies.

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
My NAS hardware and tunables are below, and it's a good build. In theory the pool should be blazingly fast, but it isn't.

A first bottleneck is that it crawls (comparatively speaking) on sequential writes, even locally from the CLI. I think I know why, and I'd like to reduce the bottleneck, but I'm not sure whether I'm facing a config issue or a hardware issue, or how to work out the best place to improve it.

Until recently I was getting a consistent 1GB/sec read and write across the LAN with Samba from SSD (about 970-990MB/sec), so I know the hardware can do 1GB/sec, including LAN+CIFS overhead, on large parallel sequential writes. As far as I know, all that's changed since then is that I upgraded the mirrors from 2-way to 3-way (as they are now) and upgraded 11.0-U4 to 11.1. Since then I've never got my Samba speed much above the values below.

What I'd expect is that writes run at close to fast-NVMe-SSD speed, thanks to a good ZIL, until the txgs fill or have to flush, and after that at worst not much slower than a single HDD for sequential writes (100-200MB/sec?). Reads should be as fast as a read split across 12 HDDs can be (data can be pulled from all 3 disks of each mirror, and files are striped across 4 vdevs since the pool isn't very full), certainly over 1GB/sec for large files.
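
One thing worth double-checking first: as far as I understand it, the SLOG only comes into play for synchronous writes, so a plain local copy may bypass it entirely. This should show what the datasets are actually set to (using the same <DEDUP_DATASET> / <NON-DEDUP_DATASET> placeholders as in the tests below):

# zfs get sync,logbias <DEDUP_DATASET> <NON-DEDUP_DATASET>
### sync=standard means only explicit sync writes hit the SLOG; a local dd/cp is mostly async and goes straight into the txg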

I don't know if this is the "right" way to test, but anyhow: I created a 20GB malloc-backed ramdisk containing an 8GB random file (this was to ensure a full 8GB was written each time, and to keep CPU-hungry /dev/random out of the timed copy). I copied this onto both my deduped dataset and my non-deduped dataset in the same pool, 3 times each with block sizes of 4k, 32k and 1M, making 6 tests in total. I deleted and recreated the random file before each copy, to ensure any caching effects were similar and didn't favour the 2nd or subsequent access. The output was ... dramatic? ... to say the least:

# mdmfs -M -s 20g md /var/tmp/ramdisk
# cd /var/tmp/ramdisk/


### COPYING TO DEDUP DATASET IN MY USUAL POOL

# dd bs=4096 count=2048k if=/dev/random of=./randomjunk ; rm -f <DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=4096 if=./randomjunk of=<DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 168.876257 secs (50865259 bytes/sec)
2m48.87s real 0.49s user 13.91s sys

# dd bs=32768 count=256k if=/dev/random of=./randomjunk ; rm -f <DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=32768 if=./randomjunk of=<DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 177.767098 secs (48321285 bytes/sec)
2m57.76s real 0.07s user 6.68s sys

# dd bs=1048576 count=8k if=/dev/random of=./randomjunk ; rm -f <DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=1048576 if=./randomjunk of=<DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 194.109326 secs (44253075 bytes/sec)
3m14.11s real 0.00s user 4.07s sys


### COPYING TO NON-DEDUP DATASET IN MY USUAL POOL

# dd bs=4096 count=2048k if=/dev/random of=./randomjunk ; rm -f <NON-DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=4096 if=./randomjunk of=<NON-DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 15.409154 secs (557456587 bytes/sec)
15.40s real 0.21s user 13.52s sys

# dd bs=32768 count=256k if=/dev/random of=./randomjunk ; rm -f <NON-DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=32768 if=./randomjunk of=<NON-DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 14.666728 secs (585674903 bytes/sec)
14.66s real 0.11s user 5.02s sys

# dd bs=1048576 count=8k if=/dev/random of=./randomjunk ; rm -f <NON-DEDUP_DATASET>/randomjunk* ; /usr/bin/time -h dd bs=1048576 if=./randomjunk of=<NON-DEDUP_DATASET>/randomjunk
8589934592 bytes transferred in 10.166545 secs (844921743 bytes/sec)
10.16s real 0.00s user 3.47s sys

So clearly dedup processing is the first major issue. I use dedup because I get about 1.3x saving from compression and about 4.5x from dedup (the system was specced, hopefully, with enough RAM and a fast CPU specifically to make dedup practical, after experimenting with the data-saving benefits). But I never expected it to add 3 minutes to a 10-15 second write.
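
For what it's worth, here's how I understand you can sanity-check whether the dedup table (DDT) actually fits in RAM; <POOL> stands in for the real pool name, and the ~320 bytes/entry figure is just the commonly quoted rule of thumb rather than an exact number:

# zpool status -D <POOL>
### the DDT histogram at the end shows total entries plus on-disk and in-core sizes
# zdb -DD <POOL>
### more detailed DDT statistics; entries x ~320 bytes gives a rough in-core DDT footprint
# sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_meta_used
### current ARC size and how much of it is metadata (which is where the DDT has to live)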

Read speed is not very good for the hardware either. For example, reading back a 12GB Blu-ray rip (not previously accessed/cached) from the non-dedup dataset:

# /usr/bin/time -h dd bs=32768 if=./12gb_file_for_testing.avi of=/dev/null
12297826416 bytes transferred in 21.761556 secs (565117054 bytes/sec)
21.76s real 0.03s user 11.12s sys

565MB/s without dedup is about what I'd expect from a single 3-way mirror, but this pool has been 4 striped mirrors (2-way originally, 3-way now) from the start, so it should be a lot faster than that. No idea why.
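
To see whether that read is actually being striped across all four vdevs, the plan is to watch per-vdev activity while re-running the dd read above (diagnostic only; <POOL> is the pool name):

# zpool iostat -v <POOL> 1
### per-vdev and per-disk read bandwidth at 1-second intervals; if only one vdev is busy, the stripe isn't helping
# gstat -p
### per-physical-disk busy %; a single disk pinned near 100% would point at a slow or failing drive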

There might be other bottlenecks. For example:
  • I don't know if my PCIe 2.0 9211s can handle 8 HDDs at full speed, or 4 per port, so I have split the 8 disks across 2 HBAs, but it hasn't helped. Do I need a next-gen SAS3008 card like a 9300/9305/9311, or is this unlikely to be a bottleneck?
  • Do I need more RAM (it can't hurt but it's expensive) or a better CPU?
  • Should I change the dedup checksum algorithm on the dedup dataset? (I'm not sure how to check what I'm using, but it could well be SHA256 rather than fletcher4+verify; see the command sketch just below this list.)
  • The ZIL has a top-end 400GB NVMe card to play with but doesn't seem to be taking much advantage of it, and its txg size is probably tiny compared to the card size. Should I retune the ZIL sysctls to make better use of the ZIL card? Or will that just be great up to 400MB and then time out for an hour flushing afterwards? :)
  • Worth noting: I get occasional timeouts across the LAN on CIFS + iSCSI, but I'm not sure how to diagnose what causes them. This can happen even when copying a single large (sequential?) file from one client. Maybe that's a separate question.
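
For the checksum question above, this should show what the dedup dataset is actually set to (using <DEDUP_DATASET> as in the tests above):

# zfs get dedup,checksum,compression,recordsize <DEDUP_DATASET>
### dedup=on defaults to SHA256 as far as I know; values like "verify" or "sha256,verify" show exactly what's in use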

Basically, this rig should be able to do a lot better, and 40-50MB/s isn't really okay. I can probably improve it by tuning or selective upgrades.

What can I do to either improve what I have, or pin down the upgrades needed for better performance?

My detailed hw and tunables are below.


Hardware spec:

I use mirrored HDD vdevs for my pool, with good hardware, fast CPU, 10G Chelsio, fast ZIL + L2ARC and plenty of RAM (detailed spec below). The system has been configured through the GUI only (original clean install 9.10.2) and doesn't run any services except Samba, iSCSI (idle/no connection) and SSH, no VMs, no extensions.
  • Hardware: Supermicro X10 series, Xeon E5-1620 v4 (quad core 3.5GHz, chosen for good performance on single-threaded tasks like Samba), 96GB ECC 2400, 10G Chelsio T420-CR NIC, two LSI 9211s with the latest firmware
  • Disks:
    Pool:
    vdevs are all 3-way mirrors, configured as (3 x 6TB) + (3 x 6TB) + (3 x 8TB) + (3 x 4TB) 7200rpm enterprise drives, total formatted capacity ~22TB, used ~14TB. Of the 12 HDDs, roughly 4 are SAS and 8 are SATA.
    Boot: 2 mirrored Intel SSDs.
    ZIL: Intel P3700 NVMe (PCIe gen 3 slot)
    L2ARC: Samsung 250GB NVMe (PCIe gen 3 slot)
  • HDD/HBA connections: The HBAs are PCIe gen 2 cards sitting in one gen 2 and one gen 3 slot. 8 of the HDDs (including all the SAS drives) are connected to the 9211s, and the rest are on the motherboard's Intel chipset ports. I also split the 9211-connected disks from 8 on one HBA to 4 on each of the 2 HBAs, in case the older SAS2008 chips were a bottleneck (quick check below).
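
And the quick check mentioned above, to confirm what the two 9211s actually negotiated and which disks sit behind them (device numbering will obviously differ on other boxes):

# pciconf -lvc
### find the two mps (LSI SAS2008) entries and look at their PCI-Express capability line for "link x8" and "speed 5.0"
# camcontrol devlist
### lists every disk with its scbus/controller, so it's easy to see which drives are on each HBA vs the chipset ports
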
Tunables:

kern.ipc.maxsockbuf = 8388608
kern.ipc.nmbclusters = 6282056
kern.ipc.soacceptqueue = 1024
kern.ps_arg_cache_limit = 4096
kern.random.sys.harvest.ethernet = 0
kern.random.sys.harvest.interrupt = 0
kern.random.sys.harvest.point_to_point = 0
net.inet.ip.fastforwarding = 1
net.inet.ip.forwarding = 0
net.inet.raw.maxdgram = 57344
net.inet.tcp.blackhole = 2
net.inet.tcp.delayed_ack = 0
net.inet.tcp.mssdflt = 1448
net.inet.tcp.recvbuf_inc = 524288
net.inet.tcp.recvbuf_max = 16777216
net.inet.tcp.recvspace = 524288
net.inet.tcp.sendbuf_inc = 16384
net.inet.tcp.sendbuf_max = 16777216
net.inet.tcp.sendspace = 524288
net.inet.udp.blackhole = 1
vfs.zfs.arc_max = 92632692326
vfs.zfs.l2arc_headroom = 2
vfs.zfs.l2arc_noprefetch = 0
vfs.zfs.l2arc_norw = 0
vfs.zfs.l2arc_write_boost = 40000000
vfs.zfs.l2arc_write_max = 10000000
vfs.zfs.metaslab.lba_weighting_enabled = 1
vfs.zfs.min_auto_ashift = 12
vfs.zfs.resilver_delay = 0
vfs.zfs.scrub_delay = 0
vfs.zfs.zfetch.max_distance = 33554432
vm.kmem_size = 128656517120
 
Joined
May 10, 2017
Messages
838
I don't know if my PCIe 2.0 9211s can handle 8 HDDs at full speed, or 4 per port, so I have split the 8 disks across 2 HBAs, but it hasn't helped.

I can help with this one: you'll have no bottleneck with 8 HDDs on a 9211 (assuming it's in an x8 PCIe 2.0 slot); only with SSDs would there be one. I've benchmarked mine at ~2560MB/s total, 8 x 320MB/s.
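
Rough arithmetic behind that, ballpark only: an x8 PCIe 2.0 link is 8 lanes x ~500MB/s = ~4GB/s raw, call it ~2.5-3.2GB/s usable in practice, while 8 x 7200rpm HDDs top out around 8 x ~200MB/s = ~1.6GB/s sequential, so the HBA has plenty of headroom. Only 8 SSDs at 400-500MB/s each would start to push the limit.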
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I can help with this one, you'll have no bottleneck with 8 disks on a 9211 (assuming it's on an x8 PCIe 2.0 slot....
Thanks, that's one possibility resolved (and £150 or so saved!)
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Start by disabling all tunables, the changes you have made will only cause problems.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Start by disabling all tunables, the changes you have made will only cause problems.
Is that disable all, or just some? Most were set by the server's own autotune, I assume to optimise for the memory, LAN and storage specifics of this server. I'll have to test with them disabled when I'm home (away for 2 nights).

Presumably you mean disable all, then re-enable one at a time and test? Also I'm guessing I can leave/ignore the NIC/LAN tunables alone for now, since the issue can be reproduced locally using "dd" as above?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Is that disable all, or just some? Most were set by the server's own autotune, I assume to optimise for the memory, LAN and storage specifics of this server. I'll have to test with them disabled when I'm home (away for 2 nights).

Presumably you mean disable all, then re-enable one at a time and test? Also I'm guessing I can leave/ignore the NIC/LAN tunables alone for now, since the issue can be reproduced locally using "dd" as above?
No, disable them and leave them disabled. You don't have any clue what they do, so you shouldn't be using them.
 