Intel Core-i7 Hyperthreading and CFX

murx · February 7, 2013, 11:11

Hey,

I've read several discussions about Hyperthreading (HT) in Intel Core-i7 processors and its impact on CFX performance. Since I purchased a new i7 and have enough licenses available, I ran a benchmark. I thought the results could be useful for others, so here they are.

The diagram below shows the relative speed (inverse of the CFD solver wall clock time for the CFX Benchmark case, normalized with the value for the i7-3770 with HT on and 1 process) on the y-axis. The irregular trend is probably a result of round-off in the reported CFD solver wall clock time... it is only given to two significant digits.
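To see how much scatter two-digit timings can introduce, here is a minimal sketch. The wall clock times are hypothetical placeholders, not values from the benchmark, and the function names are only for illustration.

```python
import math

def two_sig_bounds(reported):
    """Range of true values that would round to `reported` at two significant digits."""
    exponent = math.floor(math.log10(abs(reported)))
    half_step = 0.5 * 10 ** (exponent - 1)
    return reported - half_step, reported + half_step

def relative_speed_range(t_ref, t):
    """Lowest and highest possible t_ref / t when both times are only known to two digits."""
    ref_lo, ref_hi = two_sig_bounds(t_ref)
    t_lo, t_hi = two_sig_bounds(t)
    return ref_lo / t_hi, ref_hi / t_lo

# Hypothetical wall clock times in seconds (assumed for illustration only).
t_one_process = 1200.0
t_four_processes = 460.0

low, high = relative_speed_range(t_one_process, t_four_processes)
nominal = t_one_process / t_four_processes
print(f"relative speed: {nominal:.2f} (could be anywhere from {low:.2f} to {high:.2f})")
```

With timings this coarse, a spread of several percent between neighbouring points on the chart would not be surprising.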

Bottom line is:

If HT is disabled, the maximum speed is achieved with 4 processes. This configuration is just as fast as HT on with 6 processes. So you cannot improve the speed by disabling HT, but you can save licenses.

Daniel C · February 7, 2013, 12:05

You do not have to disable HT at all. Just assign the real cores to the threads in the Task Manager. They can be identified by their number, namely cores 0, 2, 4 and 6.

But I think this topic belongs to the hardware forum.

evcelica · February 7, 2013, 23:58

And how would you assign the real cores to the threads in the Task Manager?

Daniel C · February 8, 2013, 03:20

Quote:

Originally Posted by evcelica

And how would you assign the real cores to the threads in the Task Manager?

This is quite easy: go to the Task Manager and click on the Processes tab. Look for the ANSYS CFX solver process or processes (if solving in parallel mode) and right-click. Select the Set Affinity command (I don't know the exact translation because I am using Win7 in German) and tick the desired core for each process.

You can separate the real from the hyperthreading cores if you shut down all of your applications and bring the computer into the idle state. Then go to the Performance tab and click on "Resource Monitor...". Each hyperthreading core is marked with "Parked". Usually the cores with an even number (including 0) are the real cores.

It is a common mistake to use just one real core and its hyperthreading unit for two threads on a dual-core Intel CPU.
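For anyone who would rather not click through the Task Manager for every run, the core selection can be written as an affinity bitmask. The sketch below only builds the mask and a command string for the `/affinity` option of the Windows `start` command, which takes the mask in hexadecimal. The executable name `my_solver.exe` is a hypothetical placeholder, and the even-numbered-cores rule is the assumption stated above, so check it on your own machine.

```python
def affinity_mask(logical_cpus):
    """Return a bitmask with one bit set for each logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask

def start_command(executable, logical_cpus):
    """Build a cmd.exe `start` line that restricts the program to the given CPUs."""
    # `start` expects the affinity mask as a hexadecimal number without a prefix.
    return f'start "" /affinity {affinity_mask(logical_cpus):X} {executable}'

# Assumption from the post above: even-numbered logical CPUs are the real cores.
real_cores = [0, 2, 4, 6]

print(affinity_mask(real_cores))
print(start_command("my_solver.exe", real_cores))  # hypothetical executable name
```

Note that this restricts the whole program to the four selected cores rather than pinning each solver process to its own core, so it is not identical to ticking one core per process by hand.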

murx · February 8, 2013, 03:45

I see... good to know!
But I guess you have to do this again every time you start a new simulation, right? So this might be an option for someone doing long-running simulations. For me it unfortunately is not, since I have to run a lot of simulations which each take only about 5-10 minutes.

ghorrocks · February 8, 2013, 05:30

In my experience assigning processes to cores just makes it go slower. Best to let the OS sort that out. But benchmark it and find out for yourself, things may have changed.

But if you have a quad core processor, don't be fooled by the hyperthreading virtual cores. You can only run at 4 processes with any reasonable efficiency, and in fact even the last physical core is of marginal benefit. But again, this is different for different machines so benchmark it on your machine and work out what works best on your system.

murx · February 8, 2013, 05:40

Quote:

Originally Posted by ghorrocks

But if you have a quad core processor, don't be fooled by the hyperthreading virtual cores. You can only run at 4 processes with any reasonable efficiency, and in fact even the last physical core is of marginal benefit.

My experience tells me that you do get a significant speedup by increasing the number of processes beyond the number of physical cores when hyperthreading is active.
On the Ivy Bridge i7 that I used for this benchmark, the maximum performance was at 6 processes (see the chart attached to my first post), and on a Sandy Bridge i7 that I used some time ago, the maximum was at 7 processes.

ghorrocks · February 8, 2013, 05:47

This is where you have to factor in your specific case. In most commercial applications the additional speed from the last few processes is not worth the cost of the parallel license. In fact the license cost is many times the hardware cost. So the performance per $ puts the optimum at quite a few fewer processes. In fact the optimum is often a single process (or maybe two) per machine and a fast network connection to run distributed parallel.

But for academic applications they often have lots of licenses (as they get them cheaply) so then it makes sense to run whatever is the fastest, even if the last few processes are not really adding much.

Daniel C · February 8, 2013, 06:53

Quote:

Originally Posted by ghorrocks

In my experience assigning processes to cores just makes it go slower. Best to let the OS sort that out. But benchmark it and find out for yourself, things may have changed.....

If you don't assign the cores to the threads, you will find that e.g. Windows 7 will utilize just one core with its corresponding hyperthreading unit for two threads. That is a waste of one core. I have experienced it myself.

Moreover it is inconvenient for me to disable the hyperthreading capability each time I want to simulate a case, since I want to benefit from the additional performance that hyperthreading offers me when I am working with other applications like Power Point etc.

ghorrocks · February 8, 2013, 07:03

I see. I have done exactly the same test years ago with earlier generations of hyperthreading and CFX and found that assigning processes to cores slows things down. Just goes to show you have to test these things for yourself on your system, as some systems behave totally differently to others.

Daniel C · February 9, 2013, 05:46

Regarding the additional speed I totally agree with ghorrocks and it really depends on the system.

I have an Ivy Bridge i7-3770 too, but I don't benefit from two additional threads compared to only four threads on my quad core. Rather, I get a speed drop in my simulation if I use more threads than there are physical cores available.

Nevertheless, on the Internet I found:

>> Hyper-Threading is now called Simultaneous multithreading or SMT. Customers are recommended to leave SMT enabled on their systems but not over-subscribe physical cores for parallel simulations. While some improvement is possible, the extra performance from the virtual threads is not cost-effective and incommensurate with the additional license costs (which are per process).

Basically, if a section of the CPU core is not being used it tries to run a second task on these sections. For example, if one process only needs to do floating point operations while another only needs to do integer operations they can run both concurrently. For FLUENT, there is no consequence to performance if it is turned off. If SMT is on, and you run 16x (instead of 8x; assuming dual CPU quad-core nodes), you can get an additional 20% or so (compared to 8x) improvement. This is not recommended since you only get 20% more for 2x licenses (license is per process). In this scenario, leave SMT on and run 8 way. This is the recommended approach <<

This comes from

http://www.simutechgroup.com/Technic...e-support.html
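The arithmetic behind the quoted recommendation can be sketched in a few lines. The 8-way and 16-way process counts and the 20% gain are the figures from the quote. Treating one process as one license unit is the simplification the quote itself uses, and the function name is just for illustration.

```python
def speed_per_license(relative_speed, processes):
    """Relative solver speed obtained per process, i.e. per license unit."""
    return relative_speed / processes

# Figures from the quoted text: 16 processes are about 20% faster than 8.
eight_way = speed_per_license(1.0, 8)
sixteen_way = speed_per_license(1.2, 16)

ratio = sixteen_way / eight_way
print(f"speed per license, 16-way relative to 8-way: {ratio:.2f}")
```

So in that scenario each license delivers only about 60% of what it delivers in the 8-way run, which is why over-subscribing the physical cores is called not cost-effective.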

Shawn_A · March 9, 2013, 12:16

I've done the same tests myself. CFX code has been pretty well optimized for parallel use. Hyperthreading, as you've said, attempts to make use of idle CPU resources by utilizing an independent front end of each physical core to prepare data for the shared math unit of each physical core, but, due to the parallel efficiency of the CFX solver, there is basically no idle CPU time.

I've found that if you have, for example, a quad-core CPU with 4 physical cores and 8 logical cores with hyperthreading enabled, there will be a VERY VERY small, if any, performance difference between running a simulation with 4 cores hyperthreading off and 8 cores hyperthreading enabled. Also, you will get VERY non-linear performance speedup with additional cores with hyperthreading enabled.

As Glenn said, if your simulations are not limited by the available licenses, you can leave hyperthreading enabled, but you MUST run your job with ALL the cores available on your system. If you have a limited number of licenses, then DISABLE HYPERTHREADING. Also, if you want to run MULTIPLE simultaneous jobs, DISABLE HYPERTHREADING.

oj.bulmer · March 15, 2013, 06:13

I have done this benchmarking of CFX and Fluent, although not on hyperthreading but on physical cores. Thought it may be useful. The cores are physical cores of a cluster, which has 4 "boxes", each having a quad core Intel i7-2600 processor and 16 GB RAM, connected to each other by Infiniband SDR 4X using RDMA at 10 Gbit/s (latency ~5 microseconds).

The mesh was roughly 4 million and the physics, same for both codes, was: porous region, second order discretization schemes, 2-equation models. Attached is the snapshot of the results. It is evident that Fluent is not only faster for the same physics and computational resources, but also a lot more efficient in leveraging multicore processing: nearly twice as efficient.

Agreed, CFX, being a coupled solver, reaches convergence faster (smaller no. of iterations). Yet the difference is a lot.

OJ

ghorrocks · March 16, 2013, 06:06

These results are strange. CFX usually parallelises very well when properly set up. I do not think this result is typical, and I suspect something is wrong with your benchmark.

oj.bulmer · March 16, 2013, 08:56

The time study was done on one of the models. I have observed the same trend in numerous other models - with smaller and bigger meshes than the one used for this study. Typically, 2000 iterations of Fluent used to be finished within say 2-3 hours. CFX did close to 1000 (+/- 200) iterations in the same timeframe.

Is it fair to say that the porous jump in Fluent and porous interface in CFX have different computational requirements?

OJ

ghorrocks · March 17, 2013, 05:42

Your other results are as expected:
* CFX takes much longer per iteration than Fluent (this is because the coupled solver in CFX is much more complex than Fluent's default SIMPLE based solver)
* CFX converges faster than Fluent (again, due to CFX's coupled solver).

So your comment that CFX does half as many iterations as Fluent in the same time is as expected.

The comment I am surprised about is your comment that the parallel speedup factor is much lower for CFX than it is for Fluent. They should both be similar, and for hardware with few bottlenecks should be close to ideal speedup. If you are reporting CFX is off ideal scaling then I suspect either your benchmark is dodgy or the result is throttled by your hardware somehow.

oj.bulmer · March 17, 2013, 10:20

Quote:

So your comment that CFX does half as many iterations as Fluent in the same time is as expected.

Agreed, I should have kept the comparison limited to the code's efficiency in leveraging more cores rather than the iterations part, which would be obvious as you stated. The additional bits of information do distract from the message here.

Quote:

The comment I am surprised about is your comment that the parallel speedup factor is much lower for CFX than it is for Fluent

Well, the increase in speed when processing power was quadrupled (4 to 16 cores), for my exercise, was 3.5 times, which is actually more than the one for murx's exercise, 2.6 times, when he quadrupled the processing power (1 to 4 cores). Now I know that the relationship is not exactly linear, and towards the lower number of cores the speed-boost curve is steeper. So though I didn't do the benchmarking of 1 to 4 cores, I suspect the speed-boost may exceed 3.5 in that area for my case. By that logic, isn't murx's speedup factor smaller than you'd think?

Or, is it that FLUENT's pace compared to CFX is surprising?

OJ

ghorrocks · March 17, 2013, 18:17

Doing a 4-way simulation on a single quad core CPU will end up with a speedup factor of around 2.5. This is due to memory bottlenecks on the CPU and motherboard and has little to do with the software. You will find that running 4 totally independent processes simultaneously on a quad core CPU gives about 2.5 times the throughput of a single process.

Be aware of the new Intel technology, I forget its name, where it runs at a higher CPU clock speed when running single core versus multi core. This can distort speedup benchmarks.

To get speedups in the 3.5 and higher range for an ideal 4 times acceleration you need to remove the CPU/motherboard memory bottleneck. An easy (but expensive) way of doing this is by running 4 machines, each using a single core of the CPU. Note you will also need a reasonable network for this to work. Under this setup I would expect both CFX and Fluent to have speedup efficiencies of 95% for the simulation size you have here.
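To put the speedup figures mentioned in this thread on a common footing, here is a small sketch that converts each one to a parallel efficiency (speedup divided by the factor by which the core count was increased). The numbers are the ones quoted in the posts above; the labels and function name are only for illustration.

```python
def parallel_efficiency(speedup, resource_ratio):
    """Fraction of the ideal speedup actually achieved."""
    return speedup / resource_ratio

# (speedup, factor by which the number of cores was increased), as quoted in the thread.
cases = {
    "typical single quad core, 1 to 4 processes": (2.5, 4),
    "murx, 1 to 4 cores": (2.6, 4),
    "oj.bulmer, 4 to 16 cores": (3.5, 4),
}

for label, (speedup, ratio) in cases.items():
    efficiency = parallel_efficiency(speedup, ratio)
    print(f"{label}: {efficiency:.1%}")
```

That works out to 62.5%, 65% and 87.5% respectively, compared with the 95% expected when the memory bottleneck is removed.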
