Encryption performance with L5630 & X5670 CPUs

Status
Not open for further replies.

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
My FreeNAS box originally had a pair of Xeon L5630 CPUs in it. While the L5630s have been doing OK, every now and then it seems to slow down a little, so I picked up a pair of X5670 CPUs for it at a low price. Before shutting it down to swap the processors, I ran the performance benchmark found here, and the result I got was '4294967296 bytes transferred in 7.993979 secs (537275292 bytes/sec)', which is about 512MB/s. About what I expected. After installing the X5670s, I re-ran the test and got '4294967296 bytes transferred in 7.833480 secs (548283443 bytes/sec)', or about 522MB/s.
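For reference, dd reports decimal bytes per second; converting the two quoted runs to MiB/s makes the comparison easier (a quick awk sketch using the figures from the output above):

```shell
# Convert dd's bytes/sec figures from the two runs above into MiB/s (1 MiB = 1048576 bytes)
l5630=$(awk 'BEGIN { printf "%.1f", 537275292 / 1048576 }')   # original L5630 run
x5670=$(awk 'BEGIN { printf "%.1f", 548283443 / 1048576 }')   # X5670 run
echo "L5630: ${l5630} MiB/s, X5670: ${x5670} MiB/s"
```

That's only about a 2% difference, which is why the CPU swap looks like measurement noise.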

Given the benchmark differences found here, I was more than a little surprised to see virtually no change in the benchmark. Does the benchmark itself have a limit that I'm hitting, or did I do something wrong?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If you want good performance for encryption, you need a CPU that supports it in hardware. Pretty sure Westmere etc doesn't.
 
Joined
Dec 29, 2014
Messages
1,135
If you want good performance for encryption, you need a CPU that supports it in hardware. Pretty sure Westmere etc doesn't.

I believe Westmere does support at least some encryption features https://en.wikipedia.org/wiki/Westmere_(microarchitecture) but I know Nehalem does not. I saw the encryption features pop up in one of my previous FreeNAS builds when I went from an X5500 to an X5600 CPU.

Given the benchmark differences found here, I was more than a little surprised to see virtually no change in the benchmark.

I would strongly suspect that your bottleneck is not CPU. You got a small bump from the newer CPU, but something else is slowing you down. It could be controller, drives, or how your pools are built. What does a zpool status -v look like?
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Per the Intel ARK pages for the X5670 and L5630, they do support AES-NI. Also, running 'dmesg | grep aes' yields:

Code:
root@freenas1:~ # dmesg | grep aes
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard


My understanding of the benchmark is that it tests what the CPUs could theoretically do, with the test running entirely in RAM; the number of drives, speed of drives, RAM, etc., determine what kind of performance you'll ACTUALLY see. So shouldn't the drives, controllers, etc., not factor into this test?
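One way to take geli, the disk stack, and the dd pipeline out of the picture entirely is OpenSSL's built-in cipher benchmark, which runs purely in userland memory (shown here with AES-256-CBC as an example; the cipher the linked benchmark actually uses may differ):

```shell
# Raw in-memory AES benchmark; -evp selects the AES-NI accelerated code path when available
out=$(openssl speed -seconds 1 -evp aes-256-cbc 2>&1)
echo "$out" | tail -n 1
```

Comparing against the same command with AES-NI masked out (the OPENSSL_ia32cap environment variable can disable it) shows how much the hardware instructions contribute.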
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
My understanding of the benchmark is that it tests what the CPUs could theoretically do, with the test running entirely in RAM


You tested the same RAM. That's likely the limiting factor.
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
Yes, but DDR3-1333 (PC3-10600) RAM supposedly has a peak transfer rate of 10,600MB/s. Even the slowest DDR3 RAM has a peak transfer rate of 6,400MB/s, so 512MB/s seems a little off, doesn't it?
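As a sanity check on that peak figure, DDR3's quoted bandwidth is just the transfer rate times the 8-byte bus width, per channel:

```shell
# DDR3-1333 peak bandwidth: ~1333 MT/s x 8 bytes per transfer (per channel)
peak=$(awk 'BEGIN { printf "%.0f", 1333 * 8 }')
echo "DDR3-1333 peak: ~${peak} MB/s per channel"
```

That rounds to the marketed PC3-10600 number, and the triple-channel controller on these Xeons can theoretically do three times that.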
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So, a slight update: it turns out that for whatever reason, the clock multiplier was not set automatically on this Supermicro X8DTE, so the X5670s were still clocked at 2.13GHz (I haven't had to manually set a CPU speed in a LOONG time! lol). The slight bump may have been the result of the extra cores. After fixing the multiplier so the CPUs were actually running at 2.93GHz, I re-ran the test, and this was the result:

Code:
4294967296 bytes transferred in 5.959902 secs (720643926 bytes/sec) 687.25MB/s


So that's a bit more of an improvement, but it still seems low for a test that runs strictly in RAM and the CPUs. Based on what the benchmarks claim a single core of the X5670 can do (1,590,000 MB/s), I'd expect the memory to be the bottleneck, but with DDR3-1333 memory, I'd think that bottleneck would be well north of 1GB/s (provided all processing is kept in the RAM and CPUs, with no HDDs or SSDs involved in the process).
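Converting the new figure the same way, and comparing it against the run at the stock-detected 2.13GHz clock:

```shell
# Corrected-clock run: bytes/sec to MiB/s, plus speedup over the 548283443 bytes/sec run
mib=$(awk 'BEGIN { printf "%.2f", 720643926 / 1048576 }')
gain=$(awk 'BEGIN { printf "%.0f", (720643926 - 548283443) / 548283443 * 100 }')
echo "${mib} MiB/s, ${gain}% faster than the 2.13GHz run"
```

A roughly 31% jump for a roughly 38% clock increase, so throughput is now scaling close to linearly with clock speed, which points at the CPU rather than memory.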
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Yes, but DDR3-1333 (PC3-10600) RAM supposedly has a peak transfer rate of 10,600MB/s. Even the slowest DDR3 RAM has a peak transfer rate of 6,400MB/s, so 512MB/s seems a little off, doesn't it?

Network hiccup at the office while I was drafting a reply... I had this saved as a draft before you found the clock rate issue.

I wouldn't expect to hit even half of peak transfer. But my point was: you carried out the same operations and obtained nearly the same result. The X5670 CPU should perform the operations much faster... And doesn't. Which implies the limiting factor is something other than the CPU. Since the test runs entirely in RAM... what else is there? (And I've learned something about the Supermicro X8s and clock speeds that I didn't know! :) )

But the key thing I found in Wikipedia... The remainder of my draft said:

But that still seems too slow for an AES-NI enabled processor. Wikipedia suggests about 3.5 cycles per byte, which at 2.93GHz is what? About 800MB/sec? That would make for ~600MB/sec for the L5630. Maybe the cores are competing for access to the same RAM? Kernel locality, context switching, etc...
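Working that 3.5 cycles/byte figure through for both CPUs gives a back-of-the-envelope single-core ceiling (real geli throughput also pays for checksumming and kernel overhead, so measured numbers should land below these):

```shell
# Single-core ceiling at ~3.5 cycles/byte for AES-NI (the Wikipedia figure quoted above)
x5670=$(awk 'BEGIN { printf "%.0f", 2.93e9 / 3.5 / 1048576 }')   # X5670 at 2.93 GHz
l5630=$(awk 'BEGIN { printf "%.0f", 2.13e9 / 3.5 / 1048576 }')   # L5630 at 2.13 GHz
echo "X5670: ~${x5670} MiB/s per core, L5630: ~${l5630} MiB/s per core"
```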
 

SubnetMask

Contributor
Joined
Jul 27, 2017
Messages
129
So let's assume that peak transfer is a fringe case and go with real world being 25% of peak. Even with the slowest DDR3 RAM (and what I have is the fastest these CPUs support, DDR3-1333), that's still 1,600MB/s, and a single core of an X5670 is supposedly capable of ridiculously more than that (never mind 12 cores; the 12 hyperthreaded cores don't count towards AES-NI). That's where I'm scratching my head.

Don't get me wrong, I'm not expecting the disk arrays to pull 20GB/s or anything; that's just not possible (without an array of SSDs, and even then, maybe). I could FULLY accept it if the test (which supposedly runs entirely on the motherboard, in the RAM and CPU(s)) said the AES throughput was 2GB/s (just throwing a number out there) but the 'real' performance was 512MB/s, since disks and controllers play heavily into the overall real performance. But the test is supposedly synthetic and theoretical, pushing only the fastest components (CPUs, motherboard, RAM) to their limits and leaving the controller(s) and disks out of the picture, so I'm scratching my head as to why the number is so low for this particular test.

When I ran the tests on the X5670s, the system had 'just' booted, with none of the VMWare guest load on it at all (the L5630 test was done with all the VMWare guests up and running as normal), so contention should have been non-existent for both X5670 tests. Where could there be a bottleneck for this test? The CPUs are pretty darn fast, and the memory is the fastest available for these CPUs... unless the Supermicro motherboards are flaming piles of junk, but I have a hard time seeing that being the case, as Supermicro is a fairly big name...

I mean, overall, it's running well - even prior to fixing the clocking issue, it did seem to be running a bit better with the X5670's than it was with the L5630's, as I had several hiccups suspending and shutting down VMs with the L5630's that weren't there with the X5670's - the numbers from the test are just a bit perplexing to me. Based on the benchmarks 'out there', I expected a bigger jump on the test with the swap since it's supposed to be all within the CPU and RAM (I wouldn't expect the same jump with a test that involves the controller and drives as the CPU and RAM should be able to outperform the disks multiple times over unless you're running a P166 or something equally antiquated).
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Is the test even multi-thread? If it's only using 1 CPU core at a time, 687MB/sec is above 85% of the 3.5 clock per byte figure.
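Comparing the measured 687.25MB/s against that single-core ceiling (~798 MiB/s at 2.93GHz and 3.5 cycles/byte):

```shell
# Fraction of the single-core 3.5 cycles/byte ceiling actually achieved
pct=$(awk 'BEGIN { printf "%.0f", 687.25 / (2.93e9 / 3.5 / 1048576) * 100 }')
echo "~${pct}% of the theoretical single-core figure"
```

That result would be consistent with a single-threaded benchmark already running close to its per-core limit.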

What happens if you run two at the same time?
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
As a real-world case: I have my backup server (a G4600 Pentium on a Supermicro board) attached to 2 volumes (a 5-disk RAIDZ2 and a 3-disk RAIDZ1 with the redundancy disk removed). All are HGST 10TB drives, and scrubbing tops out at a throughput of 1.4GB/s.

This runs around 70% CPU usage.

Running a scrub on my Threadripper 1900X with the same 5-disk RAIDZ2 leads to a max read speed of around 1.1GB/s. CPU usage, I think, was less than 20%.
Core count seems to play a role.

I haven't tried running this benchmark on them but I would be very interested to do so and report back.
 