|
Any ideas on the Penalty for dual CPU and infiniband |
|
June 29, 2018, 13:18 |
Any ideas on the Penalty for dual CPU and infiniband
|
#1 |
New Member
Joshua Brickel
Join Date: Nov 2013
Posts: 26
Rep Power: 13 |
Hi,
I am curious about the penalty involved in having a two-CPU system, or in having CPUs connected via Infiniband. I understand that having more cores will eventually be beneficial, but if your software license limits you to a certain number of cores, this issue becomes more relevant. I was wondering if anyone has an idea of how to understand the performance differences (given the same processor) for the following situations:

1. A single CPU
2. Dual CPUs on the same motherboard
3. Single CPUs on two different motherboards connected via Infiniband

For argument's sake, consider the CFD simulation to be large enough to need all the cores on offer. In the first case we have only half as many cores as in the second and third cases, but we don't have any CPU-to-CPU or CPU-to-Infiniband-to-Infiniband-to-CPU latency issues.

This all came up because I was recently able to run a CFX analysis with 11 million nodes on two different computers:
1. A single 8-core W-2145 (7 cores were actually used during the solution run)
2. A dual-CPU system with two 8-core 2687W v2 (14 cores were used in total for the solution run)

What I found was that the wall time per iteration during the solution was fairly close: the older dual 2687W v2 took 168 s/iter, while the single W-2145 took 180 s/iter. I understand the W-2145 is a newer and faster CPU. But if I could scale this up, a machine with two W-2145 CPUs would give me about 1.87x the performance of the dual 2687W v2: 168/(180/2). However, the W-2145 can't be configured as a dual-socket system (no UPI links). I could instead configure two separate machines connected via Infiniband (which in theory would let me scale out even further later on).

So this got me thinking: what performance penalty does one pay for these types of configuration? I doubt I would see a true 1.87x speedup, but what should I expect (if I devoted the same total number of cores to the solution)?
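For reference, a minimal sketch of the arithmetic behind that 1.87x estimate, assuming (optimistically) that a second W-2145 would exactly halve the iteration time, which is precisely the assumption being questioned:

Code:
```python
# Wall times per iteration measured above
t_dual_e5_2687w_v2 = 168.0   # s/iter, dual 8-core E5-2687W v2 (14 cores used)
t_single_w2145 = 180.0       # s/iter, single 8-core W-2145 (7 cores used)

# Optimistic assumption: a second W-2145 halves the iteration time
t_two_w2145_ideal = t_single_w2145 / 2.0   # 90 s/iter

speedup_vs_dual_e5 = t_dual_e5_2687w_v2 / t_two_w2145_ideal
print(f"Idealized speedup over the dual E5-2687W v2: {speedup_vs_dual_e5:.2f}x")  # ~1.87x
```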
|
July 1, 2018, 05:17 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
First things first: it seems a little odd that the two platforms you tested are so close together. I would expect the dual Xeon 2687W v2 to perform better. Usually this points to a suboptimal memory configuration.
That being said, with a low number of nodes there is no real penalty for a properly implemented MPI application, neither within a node (shared memory) nor between nodes over Infiniband, provided the problem size is large enough. Instead, you often see superlinear speedup in strong scaling due to cache effects.

Before you start connecting two single-socket Xeon W CPUs with Infiniband, I would recommend a dual-socket solution with Xeon Silver or Gold CPUs. They provide 50% more memory bandwidth per CPU compared to the Xeon W series.

Last edited by flotus1; July 1, 2018 at 14:33.
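To illustrate where the "50% more memory bandwidth per CPU" comes from, here is a rough sketch based on published memory channel counts and DDR4-2666 support for both families; real sustained bandwidth will be lower than these theoretical peaks:

Code:
```python
def peak_bandwidth_gb_s(channels: int, transfer_rate_mt_s: float) -> float:
    """Theoretical peak memory bandwidth: channels * transfer rate * 8 bytes per transfer."""
    return channels * transfer_rate_mt_s * 8 / 1000.0  # decimal GB/s

xeon_w = peak_bandwidth_gb_s(channels=4, transfer_rate_mt_s=2666)         # Xeon W-2145: ~85.3 GB/s
xeon_scalable = peak_bandwidth_gb_s(channels=6, transfer_rate_mt_s=2666)  # Xeon Silver/Gold: ~128 GB/s

print(f"Xeon W (4 channels):        {xeon_w:6.1f} GB/s per CPU")
print(f"Xeon Scalable (6 channels): {xeon_scalable:6.1f} GB/s per CPU")
print(f"Ratio: {xeon_scalable / xeon_w:.2f}x")  # 1.5x -> 50% more per CPU
```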
|
July 3, 2018, 13:40 |
Follow up question
|
#3 | |
New Member
Joshua Brickel
Join Date: Nov 2013
Posts: 26
Rep Power: 13 |
First of all, thanks for replying. I have a couple of follow-up questions, if you don't mind...
I'm not quite sure why you are so surprised. The memory bandwidth of the 2687W v2 is 59.7 GB/s, while the newer W-2145 is 85.3 GB/s (as per Intel's website).

Since you recommend a dual-CPU Gold/Silver system over two W-2145 machines hooked together via Infiniband, I take it your experience tells you that higher memory bandwidth generally means better performance. I base this on the W-2145 having 4 memory channels while the 6134 has six. They are otherwise both 8-core CPUs, and the maximum frequency of the W-2145 is even a bit higher.
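As a rough aside, those Intel-listed per-CPU figures can be combined with the socket counts to compare the two systems that were actually tested, which puts some numbers behind the earlier surprise that the results were so close:

Code:
```python
# Per-CPU theoretical peak bandwidth (Intel-listed, decimal GB/s):
# E5-2687W v2: 4 channels of DDR3-1866 -> 4 * 1866 * 8 / 1000 ≈ 59.7 GB/s
# W-2145:      4 channels of DDR4-2666 -> 4 * 2666 * 8 / 1000 ≈ 85.3 GB/s
bw_e5_2687w_v2 = 59.7
bw_w2145 = 85.3

aggregate_dual_e5 = 2 * bw_e5_2687w_v2   # dual-socket system tested above: ~119.4 GB/s
aggregate_single_w = bw_w2145            # single-socket system tested above: 85.3 GB/s

bandwidth_ratio = aggregate_dual_e5 / aggregate_single_w   # ~1.40
time_ratio = 180.0 / 168.0                                 # measured s/iter ratio, ~1.07

print(f"Aggregate bandwidth ratio (dual E5 vs. single W): {bandwidth_ratio:.2f}")
print(f"Measured iteration-time ratio (W vs. dual E5):    {time_ratio:.2f}")
```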
|
July 3, 2018, 14:00 |
|
#4 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
Memory bandwidth in a dual-socket system adds up. 2 CPUs -> 2 times the theoretical memory bandwidth.
This, plus the fact that the Xeon E5 v2 system has more raw compute power and more (and faster) cache, makes me think it should be a bit faster.

You can check for yourself to what extent you are limited by memory bandwidth on your Xeon W CPU: run the same case with 1, 2, 4, 6 and 8 cores. Less-than-linear speedup is then very likely due to a memory bandwidth limit. Run the scaling benchmark at a fixed CPU frequency (i.e. turn off turbo boost), since higher turbo frequencies with fewer active cores can slightly skew the results. See the sketch after this post for evaluating such a test.

My recommendation of a dual-socket system over two single-socket systems hooked together over Infiniband is mostly about cost-effectiveness and convenience:
- You only need one case, motherboard, PSU etc.
- No networking gear required
- No need to set up the Infiniband network
- More memory in one shared-memory system if you need it
- And last but not least, more total memory bandwidth
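A minimal sketch for evaluating such a core-count scaling test. The iteration times below are hypothetical placeholders to be replaced with your own measurements (taken with turbo boost disabled, as suggested above):

Code:
```python
# Hypothetical placeholder timings (s/iter) for the same case run on 1, 2, 4, 6, 8 cores.
# Replace these with your own measured values.
times = {1: 1200.0, 2: 620.0, 4: 340.0, 6: 250.0, 8: 210.0}

t1 = times[1]
print(f"{'cores':>5} {'s/iter':>8} {'speedup':>8} {'efficiency':>10}")
for cores, t in sorted(times.items()):
    speedup = t1 / t                 # strong-scaling speedup relative to 1 core
    efficiency = speedup / cores     # 1.0 = perfect linear scaling
    print(f"{cores:>5} {t:>8.1f} {speedup:>8.2f} {efficiency:>10.2f}")

# Efficiency dropping well below 1.0 as the core count increases points to a
# memory bandwidth limit (assuming a fixed CPU frequency during the test).
```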
|
Tags |
dual cpu, infiniband, latency, single cpu |
|
|