Fluent speed up studies

Anna Tian · February 5, 2014, 08:29

Hi,

Has anyone done any speedup studies for Fluent? For example, I have 16 simulations with 10 million grid cells to run and I only have 32 processors. In this case, should I run 4 jobs at the same time with 8 processors each, or would it be better to run 8 jobs at the same time with 4 processors each? Or 2 jobs with 16 processors each? Which way lets me finish all the simulations earliest? Or does this also depend on the kind of CPU I use? Any pointers on that?

This should be quite a general topic that has been discussed several times before, but I didn't find any threads on it. Could someone post a link?

flotus1 · February 5, 2014, 09:01

The theoretical linear speedup can never be reached due to communication losses and parts of the code that cannot be executed in parallel.
So it is fastest to run as many cases at the same time as possible, as long as you don't run out of memory.
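
To put rough numbers on this for the 16-job, 32-core example from the first post, here is a minimal Python sketch. It assumes Amdahl's law with a made-up serial fraction of 5% (`serial_fraction` is purely illustrative, not a measured Fluent value) and ignores memory limits and memory-bandwidth contention between jobs, so treat the ranking as a tendency rather than a prediction.

```python
import math

def amdahl_speedup(n_cores, serial_fraction):
    """Speedup on n_cores predicted by Amdahl's law (idealized model)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

def total_wall_time(n_jobs, total_cores, cores_per_job,
                    t_single=1.0, serial_fraction=0.05):
    """Time to finish all jobs when they run in equal-sized batches.

    t_single is the (hypothetical) run time of one job on one core;
    serial_fraction is an assumed value, not measured for Fluent.
    """
    concurrent = total_cores // cores_per_job   # jobs running side by side
    batches = math.ceil(n_jobs / concurrent)    # rounds needed for all jobs
    t_job = t_single / amdahl_speedup(cores_per_job, serial_fraction)
    return batches * t_job

for cores_per_job in (4, 8, 16):
    t = total_wall_time(n_jobs=16, total_cores=32, cores_per_job=cores_per_job)
    print(f"{cores_per_job:2d} cores per job: total time = {t:.3f} x T_1")
```

Under these assumptions, 8 jobs with 4 cores each finish first, and the gap widens as the serial fraction grows. With a perfectly parallel code (`serial_fraction=0`) all three options would tie.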

Anna Tian · February 5, 2014, 10:52

Ok. Your answer helps me. I think I should refer to something else.

I once saw the results of a speedup study at an academic institute. It showed that when the number of processors goes from 8 to 10, the simulation doesn't speed up as much as when it goes from 6 to 8. They even plotted a curve showing a kink at 8 processors. Below 8, increasing the number of processors is quite effective, but above 8 it is much less so. Is there anywhere that describes this kind of speedup study methodology so that I can follow it, or where I can see test results directly?

Sorry, this area is quite new to me, so I'm not sure whether I'm using the correct technical terms to describe it.

flotus1 · February 5, 2014, 13:24

Speedup is the correct term for this.

The procedure is quite simple.
You run the case of interest on a single core and take the time.
Then you increase the number of cores and run the same case again. For a case run on n cores we will name the time taken $T_n$.
Now the speedup $S_n$ for n cores is simply
$S_n=\frac{T_1}{T_n}$
The ideal speedup would be a straight line with slope 1.
Usually, due to the losses already mentioned, the real curve will be below this line.

Personally, I find the parallel efficiency more intuitive:
$E_n=\frac{T_1}{n \cdot T_n}$
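
As a minimal sketch of how the two formulas could be evaluated from measured run times, assuming Python and a dictionary of timings (the function name `speedup_table` and all the numbers are invented for illustration, not Fluent results):

```python
def speedup_table(timings):
    """Return (n, S_n, E_n) for each core count in timings.

    timings maps number of cores n -> measured wall-clock time T_n
    and must contain the single-core reference time under key 1.
    """
    t1 = timings[1]
    rows = []
    for n in sorted(timings):
        s_n = t1 / timings[n]   # speedup S_n = T_1 / T_n
        e_n = s_n / n           # efficiency E_n = T_1 / (n * T_n)
        rows.append((n, s_n, e_n))
    return rows

# Hypothetical timings in seconds, invented for illustration only.
example_timings = {1: 100.0, 2: 55.0, 4: 30.0, 8: 20.0}

for n, s_n, e_n in speedup_table(example_timings):
    print(f"n={n:2d}  S_n={s_n:5.2f}  E_n={e_n:5.2f}")
```

An efficiency close to 1 means the extra cores are well used; a value that drops quickly with n shows where adding cores stops paying off.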

I should have checked first: I just wrote down part of the Wikipedia article on this topic. It has some more information, though, so it's worth visiting.

Quote:

Is there anywhere that describes this kind of speedup study methodology so that I can follow it, or where I can see test results directly?

I don't quite get what you want. Please rephrase.

Anna Tian · February 5, 2014, 16:14

Thank you for your answer, flotus1. That's very helpful. I meant: could I just use test results that other people have produced, or do I need to do the testing myself? Does it also depend on the algorithm I choose? For different grids with the same number of cells, will a steady-state simulation give the same test results (we set the maximum number of iterations)? Do the test results depend on the CPUs? For example, we have 32 CPUs of type A and another group has 32 CPUs of type B. Will we obtain the same test results?

flotus1 · February 5, 2014, 16:53

Quote:

I meant: could I just use test results that other people have produced, or do I need to do the testing myself?

That also depends on WHY these results are so important to you.
The only time I did such an analysis, it was to compare the parallel efficiency of an in-house CFD code with a commercial one.

Apart from that, as you already supposed, there are many factors affecting the result of such an analysis.
It is impossible to list them all, so let's stick with the conclusion that group A and group B from your example will definitely not get the same result.
They might not even get the same result if they were using the same CPUs, because there could still be many other issues affecting the parallel performance.

Quote:

For different grids with the same number of cells, will a steady-state simulation give the same test results

Not necessarily. Imagine an all-hex mesh and compare it to a polyhedral mesh.
The polyhedral cells have a higher number of faces at the interfaces between the partitions, so the performance loss due to communication will be higher.

Anna Tian · February 5, 2014, 17:16

Sorry that I didn't state the conditions clearly.

1. The software is fixed to be Fluent.

2. I understand the CPU issue: the test results will depend on the CPU, the memory speed and many other things.

3. Only structured grids will be used.

Questions:

Are there any other motivations for doing speedup studies?

For different structured grids with the same number of cells, will a steady-state simulation give the same test results (we set the maximum number of iterations)? I ask this because I'd like to know whether I can do the tests and run my simulations at the same time.

flotus1 · February 5, 2014, 17:55

Quote:

whether I can do the tests and run my simulations at the same time

Do I get this correctly? You want to find out about the parallel speedup of Ansys Fluent for whatever reason,
and now you want to run several parts of the test at the same time?
Like number of cores 1, 2, 4 and 8 all at the same time and later the test with 16 cores? That would be unadvisable.

Quote:

For different structured grids with the same number of cells, will a steady-state simulation give the same test results

There is still the domain decomposition itself that can make a difference, but the results should be comparable.

Anna Tian · February 6, 2014, 06:24

Quote:

Originally Posted by flotus1

Do I get this correctly? You want to find out about the parallel speedup of Ansys Fluent for whatever reason,
and now you want to run several parts of the test at the same time?
Like number of cores 1, 2, 4 and 8 all at the same time and later the test with 16 cores? That would be unadvisable.

Why can't I do them at the same time? I have the same CPUs.

Btw, will the speedup efficiency also depend on the number of cells that I'm running?

flotus1 · February 6, 2014, 07:26

You cannot run them at the same time because it is exactly the influence of parallel load that you are trying to examine. So running the "serial" case on one core in parallel with other simulations will spoil the result.

Let's take a very simple example of a 4-core CPU that is only supplied with one stick of RAM, so we only have one memory channel.
When the test for one core runs alone, this simulation can use the whole memory bandwidth and will be rather fast. That is how it should be done.
If you run another simulation on two cores at the same time, the simulation on one core will only have a fraction of the memory bandwidth available and will run slower.
The time taken will depend on how many other simulations you ran at the same time.
The memory bandwidth is only an example; there are many possible influences besides the CPU type.
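
The effect on a measured speedup can be illustrated with a deliberately crude toy model in Python. It assumes the run time is purely memory-bound and that the single memory channel is shared equally among all busy cores; both are simplifications, and every timing below is invented.

```python
def shared_bandwidth_time(t_alone, active_cores):
    """Run time of a purely memory-bound process when the single memory
    channel is shared equally among active_cores busy cores (toy model)."""
    return t_alone * active_cores

# Hypothetical timings in seconds, invented for illustration only.
t1_alone = 100.0   # serial run with the machine otherwise idle
t4_alone = 40.0    # 4-core run with the machine otherwise idle

# Serial run measured while another simulation occupies two more cores:
t1_disturbed = shared_bandwidth_time(t1_alone, active_cores=3)

clean_speedup = t1_alone / t4_alone
spoiled_speedup = t1_disturbed / t4_alone

print(f"clean   S_4 = {clean_speedup:.2f}")
print(f"spoiled S_4 = {spoiled_speedup:.2f}")
```

In this sketch the spoiled value comes out above the ideal speedup of 4, which is a typical sign that the single-core reference time was contaminated by other jobs.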

Quote:

Btw, will the speedup efficiency also depend on the number of cells that I'm running?

Yes, it will. The speedup is usually better with higher cell counts, so it usually makes no sense to run a mesh with 10,000 cells on 128 cores.
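
A rough way to see why, sketched in Python: compare the work per core (cells) with the data that may have to be exchanged (faces on partition boundaries). The estimate assumes every partition is an ideal cube of hexahedral cells, which real partitioners will not produce exactly, so it is only an order-of-magnitude upper bound; the function names are made up for this example.

```python
def cells_per_core(n_cells, n_cores):
    """Average number of cells each core has to work on."""
    return n_cells / n_cores

def interface_to_cell_ratio(n_cells, n_cores):
    """Rough upper bound on partition-boundary faces per cell, assuming each
    partition is a cube of hexahedral cells (an idealization)."""
    cells = cells_per_core(n_cells, n_cores)
    side = cells ** (1.0 / 3.0)        # cells along one edge of the cube
    boundary_faces = 6.0 * side ** 2   # faces on the six sides of the cube
    return boundary_faces / cells

for n_cells, n_cores in ((10_000, 128), (10_000_000, 8)):
    print(f"{n_cells} cells on {n_cores} cores: "
          f"{cells_per_core(n_cells, n_cores):.0f} cells per core, "
          f"boundary faces per cell = {interface_to_cell_ratio(n_cells, n_cores):.3f}")
```

In the small case each core holds fewer than 80 cells and there are more partition-boundary faces than cells, so communication dominates. For 10 million cells on 8 cores, as in the first post, the boundary faces amount to only a few percent of the cell count.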

Anna Tian · February 6, 2014, 12:16

How many cores per million cells are recommended by Fluent support? What value do you prefer to use?

flotus1 · February 6, 2014, 12:38

I don't know if there is an official recommendation from Fluent for the minimum number of cells per core.
I run Fluent cases on the maximum number of cores or licenses available regardless of the cell count, except for very simple problems.

Anna Tian · February 9, 2014, 09:08

Quote:

Originally Posted by flotus1

Do I get this correctly? You want to find out about the parallel speedup of Ansys Fluent for whatever reason,
and now you want to run several parts of the test at the same time?
Like number of cores 1, 2, 4 and 8 all at the same time and later the test with 16 cores? That would be unadvisable.

Why is it unadvisable? Because different simulations using different CPUs will somehow interact with each other? How?

flotus1 · February 9, 2014, 16:33

Quote:

Originally Posted by Anna Tian

Why is it unadvisable? Because different simulations using different CPUs will somehow interact with each other? How?

You already answered the question yourself.
So did I a few posts ago.

Quote:

Originally Posted by flotus1

You cannot run them at the same time because it is exactly the influence of parallel load that you are trying to examine. So running the "serial" case on one core in parallel with other simulations will spoil the result.

Let's take a very simple example of a 4-core CPU that is only supplied with one stick of RAM, so we only have one memory channel.
When the test for one core runs alone, this simulation can use the whole memory bandwidth and will be rather fast. That is how it should be done.
If you run another simulation on two cores at the same time, the simulation on one core will only have a fraction of the memory bandwidth available and will run slower.
The time taken will depend on how many other simulations you ran at the same time.
The memory bandwidth is only an example; there are many possible influences besides the CPU type.
