|
February 7, 2013, 11:11 |
Intel Core-i7 Hyperthreading and CFX
|
#1 |
Member
Max
Join Date: May 2011
Location: old europe
Posts: 88
Rep Power: 15 |
Hey,
I've read several discussions about Hyperthreading (HT) in Intel Core-i7 processors and its impact on CFX performance. Since I purchased a new i7 and have enough licenses available, I ran a benchmark. I thought the results could be useful for others, so here they are. The diagram below shows the relative speed (inverse of the CFD solver wall clock time for the CFX benchmark case, normalized with the value for the i7-3770 with HT on and 1 process) on the y-axis. The irregular trend is probably a result of rounding in the reported CFD solver wall clock time... it is only given to two significant figures. Bottom line: if HT is disabled, the maximum speed is achieved with 4 processes. This configuration is just as fast as HT on with 6 processes. So you cannot improve the speed by disabling HT, but you can save licenses. |
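A minimal sketch of the normalization described above, in Python. The timings are made-up placeholders (not the measured values from the chart), and the function names are hypothetical.

```python
def relative_speed(wall_times, baseline):
    """Relative speed = inverse wall clock time, normalized to a baseline run.

    wall_times: dict mapping number of processes -> solver wall clock time [s]
    baseline:   wall clock time [s] of the reference run (e.g. 1 process, HT on)
    """
    return {n: baseline / t for n, t in wall_times.items()}


def fewest_processes_near_max(speeds, tolerance=0.02):
    """Smallest process count whose speed is within `tolerance` of the maximum."""
    best = max(speeds.values())
    return min(n for n, s in speeds.items() if s >= (1.0 - tolerance) * best)


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT measured data.
    times_ht_off = {1: 100.0, 2: 56.0, 3: 42.0, 4: 36.0}
    speeds = relative_speed(times_ht_off, baseline=100.0)
    for n in sorted(speeds):
        print(f"{n} processes: relative speed {speeds[n]:.2f}")
    print("fewest processes near max:", fewest_processes_near_max(speeds))
```

Since the solver reports wall clock time to only two significant figures, a small tolerance like the one above avoids picking a process count that is "fastest" only because of rounding.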
|
February 7, 2013, 12:05 |
|
#2 |
Member
Daniel Ceglarski
Join Date: Sep 2012
Location: Essen, Germany
Posts: 50
Rep Power: 14 |
You do not have to disable HT at all. Just assign the real cores to the threads in the Task Manager. They can be identified by their number, namely cores 0, 2, 4 and 6.
But I think this topic belongs in the hardware forum. |
|
February 7, 2013, 23:58 |
|
#3 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23 |
And how would you assign the real cores to the threads in the Task Manager?
|
|
February 8, 2013, 03:20 |
|
#4 | |
Member
Daniel Ceglarski
Join Date: Sep 2012
Location: Essen, Germany
Posts: 50
Rep Power: 14 |
Quote:
You can separate the real cores from the hyperthreading cores if you shut down all of your applications and bring the computer into the idle state. Then go to the Performance tab of the Task Manager and click on "Resource Monitor...". Each hyperthreading core is marked as "Parked". Usually the cores with an even number (including 0) are the real cores. It is a common mistake to use just one real core and its hyperthreading unit for multithreading on a dual-core Intel CPU. |
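The even-numbered-core rule can be written down as a short Python sketch. It assumes that sibling hyperthreads are numbered adjacently (0/1, 2/3, ...), which is what the post describes but is not guaranteed on every machine. The function names are hypothetical, and the pinning call at the end exists only on Linux.

```python
import os


def physical_core_ids(logical_cores, threads_per_core=2):
    """Logical CPU ids of the first hardware thread on each physical core.

    ASSUMPTION: sibling hyperthreads are numbered adjacently (0/1, 2/3, ...),
    so the "real" cores are 0, 2, 4, ... Check this on your own machine.
    """
    return list(range(0, logical_cores, threads_per_core))


def affinity_mask(core_ids):
    """Bitmask with one bit set per selected logical CPU."""
    mask = 0
    for cpu in core_ids:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    cores = physical_core_ids(8)        # quad core with HT: 8 logical CPUs
    print(cores)                        # [0, 2, 4, 6]
    print(hex(affinity_mask(cores)))    # 0x55
    # Linux only: pin this process (and children it starts) to those CPUs,
    # falling back to the currently allowed set if none of them is available.
    if hasattr(os, "sched_setaffinity"):
        available = os.sched_getaffinity(0)
        os.sched_setaffinity(0, (available & set(cores)) or available)
```

On Windows, the `start` command accepts a hexadecimal affinity mask via `/affinity`, so a mask like the one printed above could in principle be used from a batch script instead of clicking through the Task Manager each time. This is untested with the CFX solver.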
||
February 8, 2013, 03:45 |
|
#5 |
Member
Max
Join Date: May 2011
Location: old europe
Posts: 88
Rep Power: 15 |
I see... good to know!
But I guess you have to do this again every time you start a new simulation, right? So this might be an option for someone doing long-lasting simulations. For me it is unfortunately not, since I have to run a lot of simulations that each take only about 5-10 minutes. |
|
February 8, 2013, 05:30 |
|
#6 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
In my experience, assigning processes to cores just makes it go slower. Best to let the OS sort that out. But benchmark it and find out for yourself, things may have changed.
But if you have a quad core processor, don't be fooled by the hyperthreading virtual cores. You can only run at 4 processes with any reasonable efficiency, and in fact even the last physical core is of marginal benefit. But again, this is different for different machines so benchmark it on your machine and work out what works best on your system. |
|
February 8, 2013, 05:40 |
|
#7 | |
Member
Max
Join Date: May 2011
Location: old europe
Posts: 88
Rep Power: 15 |
Quote:
On the Ivy Bridge i7 that I used for this benchmark, the maximum performance was at 6 processes (see the chart attached to my first post), and on a Sandy Bridge i7 that I used some time ago, the maximum was at 7 processes. |
||
February 8, 2013, 05:47 |
|
#8 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
This is where you have to factor in your specific case. In most commercial applications the additional speed from the last few processes is not worth the cost of the parallel license. In fact the license cost is many times the hardware cost. So the performance per $ puts the optimum at quite a few fewer processes. In fact the optimum is often a single process (or maybe two) per machine and a fast network connection to run distributed parallel.
But for academic applications they often have lots of licenses (as they get them cheaply) so then it makes sense to run whatever is the fastest, even if the last few processes are not really adding much. |
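The performance-per-$ argument can be illustrated with a short Python sketch. All numbers are placeholders chosen for illustration (relative speeds, and costs in arbitrary units), not real prices or benchmark results, and the function names are hypothetical.

```python
def cost_per_unit_speed(speeds, hardware_cost, license_cost_per_process):
    """Total cost divided by relative speed, for each process count."""
    return {
        n: (hardware_cost + n * license_cost_per_process) / s
        for n, s in speeds.items()
    }


def most_cost_effective(speeds, hardware_cost, license_cost_per_process):
    """Process count with the lowest cost per unit of speed."""
    costs = cost_per_unit_speed(speeds, hardware_cost, license_cost_per_process)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Placeholder relative speeds per process count, NOT measured data.
    speeds = {1: 1.0, 2: 1.8, 3: 2.3, 4: 2.5}
    # "Commercial" case: a license costs many times the hardware.
    print(most_cost_effective(speeds, hardware_cost=1.0, license_cost_per_process=10.0))
    # "Academic" case: licenses are cheap compared to the hardware.
    print(most_cost_effective(speeds, hardware_cost=1.0, license_cost_per_process=0.1))
```

With these placeholder numbers the expensive-license case favours a single process per machine, while the cheap-license case favours using all four cores.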
|
February 8, 2013, 06:53 |
|
#9 | |
Member
Daniel Ceglarski
Join Date: Sep 2012
Location: Essen, Germany
Posts: 50
Rep Power: 14 |
Quote:
Moreover, it is inconvenient for me to disable the hyperthreading capability each time I want to simulate a case, since I want to benefit from the additional performance that hyperthreading offers me when I am working with other applications like PowerPoint etc. |
||
February 8, 2013, 07:03 |
|
#10 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
I see. I did exactly the same test years ago with earlier generations of hyperthreading and CFX and found that assigning processes to cores slows things down. Just goes to show you have to test these things for yourself on your system, as some systems behave totally differently to others.
|
|
February 9, 2013, 05:46 |
|
#11 |
Member
Daniel Ceglarski
Join Date: Sep 2012
Location: Essen, Germany
Posts: 50
Rep Power: 14 |
Regarding the additional speed I totally agree with ghorrocks and it really depends on the system.
I have an Ivy Bridge i7-3770 too, but don't benefit from two additional threads compared to only four threads on my quad core. Rather, I get a speed drop in my simulation if I use more threads than there are physical cores. Nevertheless, on the Internet I found: >> Hyper-Threading is now called Simultaneous multithreading or SMT. Customers are recommended to leave SMT enabled on their systems but not over-subscribe physical cores for parallel simulations. While some improvement is possible, the extra performance from the virtual threads is not cost-effective and incommensurate with the additional license costs (which are per process). Basically, if a section of the CPU core is not being used it tries to run a second task on these sections. For example, if one process only needs to do floating point operations while another only needs to do integer operations they can run both concurrently. For FLUENT, there is no consequence to performance if it is turned off. If SMT is on, and you run 16x (instead of 8x; assuming dual cpu quad-core nodes), you can get an additional 20% or so (compared to 8x) improvement. This is not recommended since you only get 20% more for 2x licenses (license is per process). In this scenario, leave SMT on and run 8 way. This is the recommended approach. << This comes from http://www.simutechgroup.com/Technic...e-support.html |
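The "20% more for 2x licenses" arithmetic from the quoted text, worked through in Python. The 8 and 16 process counts and the 20% gain are the figures from the quote; the function name is hypothetical.

```python
def speed_per_license(relative_speed, processes):
    """Relative speed obtained per license, with one license per process."""
    return relative_speed / processes


if __name__ == "__main__":
    base = speed_per_license(1.0, 8)       # 8 processes, one per physical core
    oversub = speed_per_license(1.2, 16)   # 16 processes with SMT, about 20% faster
    print(f"speed per license, 16x relative to 8x: {oversub / base:.2f}")
```

Each license in the 16-way run delivers only about 60% of what a license delivers in the 8-way run, which is why the quoted recommendation is to leave SMT on but run 8-way.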
|
March 9, 2013, 12:16 |
|
#12 |
Member
Shawn
Join Date: Oct 2011
Posts: 56
Rep Power: 15 |
I've done the same tests myself. CFX code has been pretty well optimized for parallel use. Hyperthreading, as you've said, attempts to make use of idle CPU resources by utilizing an independent front end of each physical core to prepare data for the shared math unit of each physical core, but, due to the parallel efficiency of the CFX solver, there is basically no idle CPU time.
I've found that if you have, for example, a quad-core CPU with 4 physical cores and 8 logical cores with hyperthreading enabled, there will be a VERY VERY small, if any, performance difference between running a simulation with 4 cores and hyperthreading off and with 8 cores and hyperthreading enabled. Also, you will get VERY non-linear performance speedup with additional cores with hyperthreading enabled. As Glenn said, if your simulations are not limited by the available licenses, you can leave hyperthreading enabled, but you MUST run your job with ALL the cores available on your system. If you have a limited number of licenses, then DISABLE HYPERTHREADING. Also, if you want to run MULTIPLE simultaneous jobs, DISABLE HYPERTHREADING. |
|
March 15, 2013, 06:13 |
|
#13 |
Senior Member
OJ
Join Date: Apr 2012
Location: United Kindom
Posts: 473
Rep Power: 20 |
I have done this benchmarking of CFX and Fluent, although not on hyperthreading but on physical cores. Thought it may be useful. The cores are physical cores of a cluster, which has 4 “boxes”, each having a quad core Intel i7-2600 processor and 16 GB RAM, connected to each other by Infiniband SDR 4X using RDMA at 10 Gbit/s (latency ~5 microseconds).
The mesh was roughly 4 million and the physics, the same for both codes, was: porous region, second order discretization schemes, 2-equation models. Attached is a snapshot of the results. It is evident that Fluent is not only faster for the same physics and computational resources, but also a lot more efficient in leveraging multicore processing. Nearly twice as efficient. Agreed, CFX, being a coupled solver, reaches convergence faster (smaller no. of iterations). Yet, the difference is a lot. OJ |
|
March 16, 2013, 06:06 |
|
#14 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
These results are strange. CFX usually parallelises very well when properly set up. I do not think this result is typical, and I suspect something is wrong with your benchmark.
|
|
March 16, 2013, 08:56 |
|
#15 |
Senior Member
OJ
Join Date: Apr 2012
Location: United Kindom
Posts: 473
Rep Power: 20 |
The time study was done on one of the models. I have observed the same trend in numerous other models - with smaller and bigger meshes than the one used for this study. Typically, 2000 iterations of Fluent used to be finished within say 2-3 hours. CFX did close to 1000 (+/- 200) iterations in the same timeframe.
Is it fair to say that the porous jump in Fluent and porous interface in CFX have different computational requirements? OJ |
|
March 17, 2013, 05:42 |
|
#16 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
Your other results are as expected:
* CFX takes much longer per iteration than Fluent (this is because the coupled solver in CFX is much more complex than Fluent's default SIMPLE based solver)
* CFX converges faster than Fluent (again, due to CFX's coupled solver).
So your comment that CFX does half as many iterations as Fluent in the same time is as expected. The comment I am surprised about is your comment that the parallel speedup factor is much lower for CFX than it is for Fluent. They should both be similar, and for hardware with few bottlenecks should be close to ideal speedup. If you are reporting CFX is off ideal scaling then I suspect either your benchmark is dodgy or the result is throttled by your hardware somehow. |
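To put the per-iteration cost and the iteration count together, here is a rough Python sketch using the figures quoted earlier in the thread (about 2000 Fluent iterations versus about 1000 CFX iterations in the same wall time). The 2.5 hour wall time is an assumed midpoint of the quoted "2-3 hours", and the function names are hypothetical.

```python
def hours_per_iteration(wall_hours, iterations):
    """Average wall time of one iteration, in hours."""
    return wall_hours / iterations


def iteration_cost_ratio(cost_a, cost_b):
    """How many iterations of solver B fit into one iteration of solver A."""
    return cost_a / cost_b


if __name__ == "__main__":
    wall_hours = 2.5  # ASSUMPTION: midpoint of the "2-3 hours" quoted in the thread
    fluent = hours_per_iteration(wall_hours, 2000)
    cfx = hours_per_iteration(wall_hours, 1000)
    ratio = iteration_cost_ratio(cfx, fluent)
    print(f"one CFX iteration costs about {ratio:.1f} Fluent iterations")
```

On these figures, CFX reaches a converged solution sooner only if it needs fewer than roughly half as many iterations as Fluent for the same case.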
|
March 17, 2013, 10:20 |
|
#17 | ||
Senior Member
OJ
Join Date: Apr 2012
Location: United Kindom
Posts: 473
Rep Power: 20 |
Quote:
Quote:
Or, is it that FLUENT's pace compared to CFX is surprising? OJ |
|||
March 17, 2013, 18:17 |
|
#18 |
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144 |
Doing a 4-way simulation on a single quad core CPU will end up with a speedup factor of around 2.5. This is due to memory bottlenecks on the CPU and motherboard and has little to do with the software. You will find that running 4 totally independent processes simultaneously on a quad core CPU gives a total throughput of only about 2.5 times that of a single process.
Be aware of the new Intel technology, I forget its name, where it runs at a higher CPU clock speed when running single core versus multi core. This can distort speedup benchmarks. To get speedups in the 3.5 and higher range for an ideal 4 times acceleration you need to remove the CPU/motherboard memory bottleneck. An easy (but expensive) way of doing this is by running 4 machines, each using a single core of the CPU. Note you will also need a reasonable network for this to work. Under this setup I would expect both CFX and Fluent to have speedup efficiencies of 95% for the simulation size you have here. |
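The relation between the speedup factors and efficiencies quoted above, as a small Python sketch. The numbers (2.5, 3.5, 95%, 4 processes) are the ones from the post; the function names are hypothetical.

```python
def parallel_efficiency(speedup, processes):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processes


def speedup_from_efficiency(efficiency, processes):
    """Speedup factor implied by a given parallel efficiency."""
    return efficiency * processes


if __name__ == "__main__":
    print(f"speedup 2.5 on 4 processes: {parallel_efficiency(2.5, 4):.1%} efficiency")
    print(f"speedup 3.5 on 4 processes: {parallel_efficiency(3.5, 4):.1%} efficiency")
    print(f"95% efficiency on 4 processes: speedup {speedup_from_efficiency(0.95, 4):.1f}")
```

So a single quad core CPU at a speedup of 2.5 is running at about 62.5% efficiency, while 95% efficiency across 4 machines corresponds to a speedup of about 3.8.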
|
|
|