
CFX performance scaling on multicore local server

July 18, 2016, 17:35
#1
New Member
Sachin Aggarwal (saggarw2)
Join Date: Aug 2014
Posts: 4
Hi,

I have been running a frozen rotor problem on a local server, using 19 cores in parallel with Intel MPI Local Parallel. My company recently purchased a new server so that we can run on 32 cores and make full use of our HPC licenses. When I run the same simulation on the new server with 32 cores, it runs slower, but when I run it with the same 19 cores it runs slightly faster than on the old server. Are there any settings I am missing? My simulation has more than 14 million elements, so as I understand it, it should be large enough to avoid multi-threading issues. Can anyone give me some insight or guidance on this issue?

I really appreciate the help. Thank you,

Sachin Aggarwal

July 18, 2016, 21:31
#2
Super Moderator
Glenn Horrocks (ghorrocks)
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,854
Effective multi-processor simulations require good interconnects, memory busses and much more.

If you have 32 cores in a single machine it will need to be carefully designed for multi-threaded operation or it will not give you the performance you expect.

Also check you have not crippled the machine. Check that:
* the BIOS is current
* the motherboard, hard drive, ethernet and other drivers are correct and current
* the firmware is current in the hard drive and other gizmos
* you have not run out of memory
* you are not sharing the machine with other users
* your antivirus or another background process is not causing problems.

July 19, 2016, 15:32
#3
New Member
Sachin Aggarwal (saggarw2)
Join Date: Aug 2014
Posts: 4
Hi Glenn,

Thank you for your reply.
We completed the installation of this machine only last week, so the hardware and software should be in good order; it was configured by the Dell technician himself. The server has 192 GB of RAM and only 12-15 GB is in use while the simulation is running. Virtualization is turned on in the server; I cannot say whether that affects the solution time. I am not sure about the antivirus, but I will check with IT. The server has 44 cores in an FC630 blade configuration installed in an FX2 blade chassis.

I hope this helps.

Other than this, I ran a step-up study on my set-up, increasing the number of cores in steps of 4, and found that 28 cores is the most time-efficient, rather than 32. Any thoughts on that?

Thank you very much for your help.

Regards,

Sachin Aggarwal

July 19, 2016, 20:45
#4
Super Moderator
Glenn Horrocks (ghorrocks)
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,854
If 28 cores runs better than 32, it suggests there is a bottleneck in your system which is preventing it from scaling efficiently to larger numbers of cores.

Do not assume that because it was installed by a technician and it is the latest stuff, it is suitable for large multi-processor simulation. Most large multi-processor systems are designed for servers and web servers, and they have very different demands compared to multi-processor simulations.

Also - make sure your simulation is suitable for lots of partitions. How many nodes per core? What physics are you modelling?

Here are some examples of things which have caught me out in the past on multi-processor simulations:
1) A workstation straight from the vendor (Dell) ran at half the speed I expected based on spec.org results. I found the BIOS did not support the CPU; when I upgraded to the latest BIOS it supported the CPU and the speed doubled to the expected value.
2) A high-end workstation straight from the vendor ran a different simulation software at a fraction of the speed expected. It turned out the motherboard was unsuitable for multi-processor operation, as the FSB was not fast enough for the memory throughput. This was despite having the best CPU and lots of memory. We had to downgrade the machine to a CAD workstation and buy more suitable machines, for which I checked the technical details carefully.
3) How is the CPU to memory and CPU to CPU interconnect done on this machine?
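
As an illustration of the bottleneck argument, here is a toy Python model of solve time versus core count. It is only a sketch: the function name and every parameter value are hypothetical, and `saturation_cores=28` is picked purely so the curve resembles the observation in this thread, not because anything was measured. The first term stops improving once a shared resource (for example memory bandwidth) is saturated, and the second term is a communication cost that grows with the number of partitions.

```python
def toy_solve_time(cores, work=1000.0, saturation_cores=28, overhead_per_core=0.2):
    """Toy wall-clock model; all parameter values are hypothetical.

    work / min(cores, saturation_cores): compute time that stops shrinking
    once a shared resource such as memory bandwidth is saturated.
    overhead_per_core * cores: communication cost that grows with partitions.
    """
    effective_cores = min(cores, saturation_cores)
    return work / effective_cores + overhead_per_core * cores


# Evaluate the model in steps of 4 cores, as in the step-up study.
times = {n: toy_solve_time(n) for n in range(4, 45, 4)}
best = min(times, key=times.get)

for n, t in times.items():
    marker = "  <- fastest" if n == best else ""
    print(f"{n:>2} cores: {t:6.2f}{marker}")
```

In this model, adding cores beyond the saturation point only adds overhead, so the run gets slower. Real machines are more complicated, but a curve of this shape points to a hardware limit rather than to the simulation itself.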

July 20, 2016, 03:31
#5
Senior Member
Maxim (-Maxim-)
Join Date: Aug 2015
Location: Germany
Posts: 413
Quote:
Originally Posted by ghorrocks View Post
3) How is the CPU to memory and CPU to CPU interconnect done on this machine?
This is a key point. I don't know much about HPC hardware, but as far as I understand, any bottleneck can slow the whole thing down. So an upgrade along the lines of "I already have those 19 cores (why such an odd number?) of the older CPU/RAM generation and we just add some more of the newer generation" isn't really helpful.

My CFD hardware guy showed me benchmarks of multi-core CPUs where apparently two 4-core Xeons on one mainboard (each with the same amount of RAM) are faster than one 8-core Xeon with the same amount of RAM. So is your 32-core cluster four computers, each with 2 x 4-core Xeons on one mainboard, plus InfiniBand for the connection?

Your question might also be suitable for the hardware section of this forum - this is where the hardware guys are hiding.

July 21, 2016, 17:14
#6
New Member
Sachin Aggarwal (saggarw2)
Join Date: Aug 2014
Posts: 4
Hi Glenn,

Thank you for your reply.

I am working with my IT department to find the answers to your questions. They told me that the BIOS is up to date. They will look into the FSB, the motherboard and the interconnect. When I get the answers I will let you know.

About the problem itself: I am simulating a high speed wind turbine with a frozen rotor interface. The model has 14+ million elements and 8+ million nodes. As far as I know, the rule of thumb is 50-100K nodes per core. This makes me believe that I should be able to use 80 cores without any loss of performance. I am using rotational periodicity to halve the problem size, and the default partitioner, MeTiS, to partition the model.
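
The rule-of-thumb arithmetic can be checked with a short Python sketch. It assumes that the "8+ million nodes" figure (rounded down to 8,000,000 here) is the mesh actually being solved; the helper names are made up for this example.

```python
TOTAL_NODES = 8_000_000  # "8+ million nodes" from the post, rounded down (assumption)


def nodes_per_partition(total_nodes, cores):
    """Average number of mesh nodes each core receives."""
    return total_nodes // cores


def max_cores(total_nodes, min_nodes_per_core):
    """Largest core count that still keeps min_nodes_per_core on every core."""
    return total_nodes // min_nodes_per_core


# Core counts mentioned in this thread.
for cores in (19, 28, 32):
    print(f"{cores} cores -> {nodes_per_partition(TOTAL_NODES, cores):,} nodes per core")

cores_at_100k = max_cores(TOTAL_NODES, 100_000)  # conservative end of the rule
cores_at_50k = max_cores(TOTAL_NODES, 50_000)    # aggressive end of the rule
print(f"Rule of thumb allows roughly {cores_at_100k} to {cores_at_50k} cores")
```

At 32 cores each partition still holds about 250,000 nodes, well above the 50-100K range, so by this rule the partition size should not be what limits scaling between 28 and 32 cores.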

Thank You,

Sachin Aggarwal

July 21, 2016, 17:18
#7
New Member
Sachin Aggarwal (saggarw2)
Join Date: Aug 2014
Posts: 4
Hi Maxim,

We did not add new CPUs to the old ones; we replaced the old server with a new one. The old server was a 20-core machine, and I was using 19 of the 20 cores for simulations because using all 20 bogged it down. The new machine has 44 cores in total, and my intention was to use 32 of them. I hope this clears things up.

Thank You,

Sachin Aggarwal

July 21, 2016, 19:19
#8
Super Moderator
Glenn Horrocks (ghorrocks)
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,854
You do not appear to be modelling any physics which causes multi-processor issues.

Can you show a graph of simulation speed versus number of cores? Also, how does your simulation speed compare to the spec.org result for your machine?
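
One way to prepare the data for such a graph is sketched below in Python. The function name is hypothetical and the timings are placeholder numbers, not measurements from this thread; replace them with wall-clock times taken from your own runs at each core count.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows relative to the smallest core count.

    timings maps core count -> wall-clock time for the same job.
    speedup    = time at the baseline core count / time at this core count
    efficiency = speedup divided by the ratio of cores to baseline cores
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


# PLACEHOLDER numbers only, not data from this thread.
example_timings = {4: 100.0, 8: 50.0, 16: 50.0}

for cores, speedup, efficiency in scaling_table(example_timings):
    print(f"{cores:>3} cores  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

An efficiency close to 1.0 means the extra cores are being used well; a sharp drop at some core count marks where the bottleneck starts to bite.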


Tags
cfx 17.1, intel local parallel mpi, multi-cores, solve time

