One thread on two cores?

kris_jag · April 15, 2013, 11:15

Hi all,

We recently replaced our four self-built i7-980X 3.33 GHz computers with two Dell PowerEdge R820 servers (each with four Xeon E5-4650 2.70 GHz CPUs). The old i7 computers ran Windows Server 2008 R2 Standard; the R820s run Windows Server 2008 R2 Enterprise. MS HPC is installed on the R820s but is not used yet - we run the Fluent computations over Remote Desktop (RDP).

Unfortunately, the computation time on the new servers is rather poor. According to the CFP2006 Rates benchmarks from SPEC.ORG (parallel floating point calculations), the R820 should be significantly faster than the old computers (table.gif).
I know that they are only benchmarks, but...
At the beginning our R820s were slower than the two-year-old i7s. After some BIOS and operating system tuning by our hardware provider they are more or less the same as the old ones (still a little slower) - instead of 50-100% faster...

******************
That was only background. The main reason for this post is a question - does anyone know why one thread could be calculated by two cores?

Example:
One Fluent calculation running as 4 fl_mpi1400.exe processes (processes.gif)

Each process has 7 threads, but only one of them is "power demanding" (threads.gif).

A processor load of 3.14% equals 100% load on one core (32 cores in the system).

But for these four processes/threads, eight cores are used, at about 50% usage each on average (graphs.gif).
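The percentages quoted above can be checked with a little arithmetic. This is only a minimal Python sketch, assuming 32 cores and four fully busy threads as described in the post; the function names are made up for illustration. Note that 100/32 is exactly 3.125%, which is close to the 3.14% figure reported above.

```python
def share_of_total(busy_threads: int, total_cores: int) -> float:
    """Percent of whole-machine CPU used by fully busy threads."""
    return 100.0 * busy_threads / total_cores


def average_load_per_core(busy_threads: int, cores_used: int) -> float:
    """Average percent load per core if the same work is spread over cores_used cores."""
    return 100.0 * busy_threads / cores_used


# One fully busy thread on a 32-core machine.
one_thread_share = share_of_total(1, 32)

# Four fully busy threads (one per Fluent process) on the same machine.
four_thread_share = share_of_total(4, 32)

# The same four threads' worth of work observed on eight cores.
spread_load = average_load_per_core(4, 8)

print(f"one thread:   {one_thread_share}% of total CPU")
print(f"four threads: {four_thread_share}% of total CPU")
print(f"spread over eight cores: {spread_load}% per core")
```

So four cores' worth of work shown as roughly 50% on eight cores is the same total load, just distributed differently in the graphs.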

Why is one thread calculated by two cores (at least it looks like that)? Does anybody know?

I suppose this could be one of the reasons for the poor performance.

Just one comment: Hyper-Threading is disabled on both the old and the new computers.

Thanks for any help/advice.

kyle · April 15, 2013, 13:58

You are seeing about what I would expect. Even though the 980X came out over three years ago, it is still pretty close to the fastest processor you can buy.

The bottleneck for CFD on unstructured meshes is typically random memory access speed, and your Dell machine has several things working against it in this area...

Both the i7 980X and the E5-4650 are capable of running one memory channel per two cores, but you are using a quad socket machine. That means you have to have 16 memory channels to feed your four CPUs as effectively as the three channel 980X. Your monster 4 socket motherboard might only be capable of running 8 memory channels, or you might not have 16 sticks of memory installed. If I had to guess, your new cluster is running with 8 channels per machine, or 16 total, vs 12 total channels of memory on the old cluster.
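The channel counts in the paragraph above can be put side by side. This is a rough Python sketch, not a measurement: the 8-channel figure for the R820 is the guess made in the post, and the core counts (6 per i7-980X, 32 per R820) are taken from the hardware described in this thread.

```python
def channels_per_core(channels: int, cores: int) -> float:
    """Memory channels available per CPU core."""
    return channels / cores


# (memory channels, cores) per machine
i7_980x = (3, 6)            # triple-channel, six cores
r820_full = (16, 32)        # all 16 channels populated
r820_guess = (8, 32)        # guessed configuration from the post

ratio_i7 = channels_per_core(*i7_980x)
ratio_r820_full = channels_per_core(*r820_full)
ratio_r820_guess = channels_per_core(*r820_guess)

# Cluster totals: four i7 machines vs two R820 servers.
old_cluster_channels = 4 * i7_980x[0]
new_cluster_channels = 2 * r820_guess[0]

print(f"i7-980X:            {ratio_i7} channels per core")
print(f"R820 (16 channels): {ratio_r820_full} channels per core")
print(f"R820 (8 channels):  {ratio_r820_guess} channels per core")
print(f"cluster totals: old {old_cluster_channels}, new {new_cluster_channels}")
```

Under the 8-channel guess, each R820 core would get half the memory channels of an i7-980X core, even though the new cluster has more channels in total.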

But why is the 12 faster than the 16? Two more reasons. You likely have faster memory in the old cluster. The 980X can run overclocked, low latency memory, whereas the Xeon machine is likely using much slower registered ECC memory at a lower frequency. Multi-socket systems are also bad for random memory access. Each socket is directly attached to its own group of memory channels, but for the other 3/4 of the system's memory it must ask another processor to relay the data. The memory for the calculations each processor is handling may not be directly accessible by that processor, and this necessarily adds latency to memory access.

Bottom line is consumer grade "gamer" hardware is significantly faster per dollar for CFD than the big $10,000 machines that Dell and HP want to sell you. CFD puts very different demands on a machine than most applications. Most applications are not memory access bound, so servers are not built to maximize memory access speed. For my startup I built a 15 node, 60 CPU core cluster of i7 machines for $12,000, and it calculates faster than a $100,000 cluster that Dell would spec for you.

Edit - And to answer your question about one thread appearing to be shared by two cores, this does slow down the calculations. What is likely happening is the thread is jumping around to different cores, which you obviously don't want. In Linux you can lock a thread to a specific core, but I am not sure how to do it in Windows. While this will help, I would not expect to see huge gains.
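As an illustration of locking work to a core on Linux, here is a minimal Python sketch using the standard library's `os.sched_setaffinity`. It only pins the calling Python process, not a Fluent process, and it is skipped on platforms where the call is unavailable (for example Windows). The helper name `pin_to_core` is made up for this example. On Windows, Task Manager's "Set affinity" option and the `start /affinity` command are the usual counterparts, though whether they help for Fluent's MPI processes is not something this thread establishes.

```python
import os


def pin_to_core(core: int) -> set:
    """Restrict the calling process (pid 0) to a single core and return the new mask.

    Linux only: os.sched_setaffinity does not exist on every platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not available via os on this platform")
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


pinned = None
if hasattr(os, "sched_getaffinity"):
    original = os.sched_getaffinity(0)      # cores we are currently allowed to use
    first_core = sorted(original)[0]
    pinned = pin_to_core(first_core)        # now restricted to one core
    print(f"allowed before: {sorted(original)}, after pinning: {sorted(pinned)}")
    os.sched_setaffinity(0, original)       # restore the original mask
else:
    print("affinity calls not available here; nothing pinned")
```

For an already running program on Linux, the `taskset` utility does the same thing from the command line.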

kris_jag · April 16, 2013, 12:01

Thanks, Kyle, for the explanation. I was not aware that memory access speed is so important - I thought that the CPU was definitely the most important. We will have to consider our computer upgrades more carefully in the future...

But one thing still intrigues me - why are four processes/threads (whose properties indicate that each uses 100% of one core - 3.14% of total CPU) actually calculated by eight cores (average 50% load on each)?

evcelica · April 16, 2013, 21:57

My computer does the same thing: when I use 4 cores it puts partial load on all six cores, adding up to 66% of the CPU, rather than 4 cores at 100% and 2 cores sitting idle. I wouldn't worry about it; I'm sure it is supposed to act this way.
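One way to picture this is a thread that is always running but gets moved between cores by the scheduler, which is the explanation suggested earlier in the thread. The Python sketch below is a toy illustration under that assumption, with a made-up schedule; it is not a model of how the Windows scheduler actually behaves.

```python
from collections import Counter


def per_core_utilization(schedule: list, n_cores: int) -> list:
    """Percent of time slices each core spent running the thread.

    schedule holds one core id per time slice; the thread runs in every slice.
    """
    busy = Counter(schedule)
    return [100.0 * busy[core] / len(schedule) for core in range(n_cores)]


# Hypothetical schedule: one always-busy thread hopping between core 0 and core 1.
schedule = [0, 1] * 50
utilization = per_core_utilization(schedule, n_cores=2)
total_work = sum(utilization)

# Four cores' worth of work averaged over six cores, as in the post above.
six_core_average = 100.0 * 4 / 6

print(f"per-core load: {utilization}")
print(f"total: {total_work}% of one core")
print(f"4 busy threads over 6 cores: {six_core_average:.1f}% average")
```

Each core's graph shows about half load, yet the thread never used more than one core's worth of time in total.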
