
MPI Distributed parallel

Old   June 29, 2017, 09:00
Default MPI Distributed parallel
  #1
Senior Member
 
Jiri
Join Date: Mar 2014
Posts: 221
Rep Power: 13
Jiricbeng is on a distinguished road
I use ANSYS 18.1.
- When using IBM MPI distributed parallel on 1 computer (32 cores), an iteration takes 30 minutes.
- When using IBM MPI distributed parallel on 2 computers (40 cores), an iteration again takes 30 minutes.
- When using IBM MPI distributed parallel on 3 computers (46 cores), an iteration again takes 30 minutes.

- When using Intel MPI local parallel on 1 computer (32 cores), an iteration takes 11 minutes.

The 32-core computer has AMD processors, not Intel!

- Using Intel MPI distributed parallel does not work.

Does anyone know why an IBM MPI distributed parallel iteration takes the same time even when 3 computers are used? Where is the problem?

Thank you for any reply.
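For reference, a minimal sketch of how runs like these can be launched from the Windows command line with cfx5solve (default 18.1 install path; the host names, core counts and start-method string are examples only, check the CFX-Solver Manager Run Definition drop-down for the exact method names on your system):

Code:
rem Distributed run across two machines, 32 + 8 partitions (illustrative values only)
"C:\Program Files\ANSYS Inc\v181\CFX\bin\cfx5solve.exe" -def myCase.def ^
    -start-method "IBM MPI Distributed Parallel" ^
    -par-dist "computer1*32,computer2*8"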

Old   June 29, 2017, 19:20
Default
  #2
Super Moderator
 
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144
ghorrocks is just really nice
You need to work out where the bottleneck is. Consider:
* The network (a quick way to sanity-check it is sketched below)
* The computers themselves
* Whether the MPI distribution is actually running properly on your hardware and network
* Whether the motherboard, BIOS and driver versions are correct
* Whether the simulation you are running is limited by CPU, hard drive, memory or something else
* Other software and/or users on the machines, including virus checkers
* Relative speed differences between the MPI implementations
* Relative speed differences between the different machines

Optimising distributed parallel simulations is a complex task.
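A rough way to check the network leg is to time a large file copy between two of the machines and read the transfer speed robocopy prints in its summary; a sketch with placeholder share and file names:

Code:
rem Copy a large file to a share on another node and look at the "Speed" lines in the summary.
rem \\node2\temp and big_test_file.dat are placeholders for your own share and test file.
robocopy C:\temp \\node2\temp big_test_file.dat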

Old   June 30, 2017, 06:43
Default
  #3
Senior Member
 
Jiri
Join Date: Mar 2014
Posts: 221
Rep Power: 13
Jiricbeng is on a distinguished road
Thank you for the reply. Yes, it is a very complex task.

I have another question:
Is it possible to meaningfully reduce the iteration time by using another partitioning type, for example Circumferential partitioning instead of MeTiS?

Old   July 1, 2017, 08:08
Default
  #4
Super Moderator
 
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,872
Rep Power: 144
ghorrocks is just really nice
The ideal partition has small interface areas with the other partitions, as data needs to be transferred at these boundaries, so making them smaller increases parallel efficiency. So if another partitioning method generates better partition shapes than MeTiS, it will improve run time.

But in my experience, for most cases the speed differences between different partitioning methods are minimal. The partitions need to be really bad before it makes a difference.

If you have multiple domains you might consider the coupled partitioner, which takes the multiple domains into account better. Also, some simulations don't like nasty features aligning with partition boundaries (e.g. shock waves, free surfaces).

Old   August 2, 2017, 01:26
Default
  #5
Member
 
Join Date: Jan 2015
Posts: 62
Rep Power: 11
Christophe is on a distinguished road
I've done similar benchmark studies, distributing CFX across multiple high-end workstations linked with dual 10 Gb Ethernet. My simulation time flatlined after adding a third computer. A true Linux cluster with no IT security software and a 56 Gb InfiniBand interconnect is the way to go. Also, Platform MPI performs worse than Intel MPI, so fix your Intel distributed parallel setup. Turn on verbose mode by adding -v to startmethods.ccl and see if you can find the error that way.
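In recent releases the file in question typically lives under the CFX etc folder as start-methods.ccl; a sketch of backing it up and locating the Intel MPI section before editing (default 18.1 path assumed, adjust to your install):

Code:
rem Back up the start methods file, then find the Intel MPI entries before adding -v
cd /d "C:\Program Files\ANSYS Inc\v181\CFX\etc"
copy start-methods.ccl start-methods.ccl.bak
findstr /n /i /c:"Intel MPI" start-methods.ccl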



Old   August 3, 2017, 11:49
Default
  #6
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
I've run on up to 5 machines with 20 Gbps DDR InfiniBand and IBM Platform MPI and seen linear speedup. Granted, I was only running ~4 cores per machine, but they were fast cores with fast memory (4.2 GHz CPUs and quad-channel 2133 MHz RAM).

I agree that Intel MPI is the default and is supposed to be better, but I could never get it to work.

Old   August 3, 2017, 13:10
Default
  #7
Member
 
Join Date: Jan 2015
Posts: 62
Rep Power: 11
Christophe is on a distinguished road
I've had issues with Intel MPI too. Did you cache the username and password for all of the machines? Also, I think startmethods.ccl had a line in it that was freezing things for distributed Intel MPI, which I had to delete. Unfortunately, I have not found ANSYS tech support to be very helpful on any of these issues.
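A minimal sketch of caching credentials for Intel MPI on Windows, assuming the bundled Intel MPI is used; the version sub-folder is a placeholder, use whatever is actually present under commonfiles\MPI\Intel in your installation:

Code:
rem Prompts for the Windows account and password and caches them for Intel MPI on this node.
rem Run it on every machine that will take part in the distributed run.
"C:\Program Files\ANSYS Inc\v181\commonfiles\MPI\Intel\<version>\winx64\bin\mpiexec.exe" -register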



Old   August 8, 2017, 09:04
Default
  #8
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
Thanks for the advice, Christophe. I did cache a username and password on the head node. I can't remember if I did it on all nodes or not; probably not, as you don't have to with IBM MPI, only on the head node. I'll make sure to cache it on at least 2 nodes and see if those 2 work in distributed mode.
Actually, I believe I did get Intel MPI working, but only over TCP; it wouldn't work over native InfiniBand. I might give it another shot if I ever get less swamped at work (I wish).

Old   September 24, 2017, 08:07
Default
  #9
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
I think it's better to post my question here rather than creating a new thread.
First of all, I use ANSYS Mechanical (FEA), not CFX.
I am trying to set up a distributed network, with no result so far.
Here is my configuration:

Node 1
Supermicro, 2x Xeon X5670, 16 GB of DDR3 ECC RAM, Win10

Node 2
HP ProLiant DL180 G6, 2x Xeon X5670, 96 GB of DDR3 ECC RAM, Win10

Both systems are compatible, I suppose.
Same CPUs, same chipset, same type of RAM (DDR3 ECC).
Same OS, same installation path for ANSYS and Intel MPI.

Right now I have the two systems connected together through Gigabit Ethernet to test the connectivity. I have already ordered a pair of InfiniBand 4x QDR cards and a suitable cable.

I have followed all the instructions found here: https://www.sharcnet.ca/Software/Ans..._setupdan.html

Regarding MPI, I installed the software on both systems and finally cached my Windows password on both systems. The Windows account name and password are the same on the 2 systems.
Only the computer name (host) is different:
Hostname1 = SUPERMICRO
Hostname2 = HP

I populated the hostnames in ANS_ADMIN 18.1 successfully (see attached pic). I did the same on both machines!

Then I go to the Mechanical APDL Product Launcher (from the SUPERMICRO machine) to run an official benchmark.
See the attached picture for the File Management tab.
See the attached picture for the HPC tab.

Then I click "Run" but nothing happens.

I think I have done all the necessary steps, but nothing happens.
Can anyone help?
Thanks in advance!
Attached Images
File Type: jpg 1.jpg (25.6 KB, 73 views)
File Type: jpg file management.jpg (117.7 KB, 85 views)
File Type: jpg hpc.jpg (145.2 KB, 54 views)
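One way to get more information than the launcher gives is to start the same benchmark from a command prompt, where error messages stay visible. A sketch, assuming a default 18.1 install; the working directory, core counts and file names are placeholders:

Code:
rem Distributed ANSYS batch run started from the head node (SUPERMICRO); adjust paths and counts.
cd /d C:\ANSYS_WD
"C:\Program Files\ANSYS Inc\v181\ansys\bin\winx64\ansys181.exe" -b -dis ^
    -machines SUPERMICRO:12:HP:12 -i benchmark.dat -o benchmark.out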

Old   September 25, 2017, 10:50
Default
  #10
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
- You have to have the same working directory on both systems. Make sure the full folder path is present on both systems, and solve from that directory. On all my machines I use G:\D_ANSYS\ANSYS_WD, which is on my RAID SSD drives. These working directories must be shared folders, so they can be accessed remotely.

- You must have remote desktop privileges on all nodes.

- You have to share the installation directory with all nodes, usually C:\Program Files\ANSYS Inc.

- Did you set the machine type to win x64 when you configured the hosts file?

- You have to allow firewall exceptions for all the correct programs; see the ANSYS installation documentation for a complete list (a sketch of doing this with netsh follows at the end of this post).

Something like all the exe files in:
C:\Program Files\ANSYS Inc\v181\commonfiles\MPI\IBM\9.1.4.2\winx64\bin
C:\Program Files\ANSYS Inc\v181\commonfiles\MPI\IBM\9.1.4.2\winx64\sbin

For CFX, the solver MPI executables in:
C:\Program Files\ANSYS Inc\v181\CFX\bin\winnt-amd64
C:\Program Files\ANSYS Inc\v181\CFX\bin\winnt-amd64\double

Allow them on all network profiles, including public networks if your InfiniBand network is classed as public.

Check your working directory for any error clues.
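A sketch of adding those firewall exceptions with netsh, saved as a .bat file and run from an elevated prompt on each node; it assumes the default 18.1 paths quoted above and should be repeated for the sbin and CFX solver folders:

Code:
rem Add an inbound allow rule for every .exe in the IBM MPI bin folder (batch-file %% syntax).
pushd "C:\Program Files\ANSYS Inc\v181\commonfiles\MPI\IBM\9.1.4.2\winx64\bin"
for %%F in (*.exe) do netsh advfirewall firewall add rule name="ANSYS MPI %%F" dir=in action=allow program="%%~fF" profile=any
popd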

Old   September 25, 2017, 16:30
Default
  #11
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
Hello Erik, and many thanks for your help.
I've done most of what you have recommended above, but no result yet.
I went through the ANSYS License Management Center -> View Licensing Interconnect Log on the SUPERMICRO machine and got this interesting info:

Unable to retrieve IP for host HP (see attached pic for the full message).

I also attach a picture of the manual IP configuration I did on my machines.
I set the SUPERMICRO's IP as 192.168.1.1 and the HP's as 192.168.1.2.
I have left the DNS fields blank though...
The local network works as it should, but it seems that the SUPERMICRO node cannot connect to the HP node in ANSYS.

UPDATE
nslookup on SUPERMICRO:
Default server: 192.168.2.1
Address: 192.168.2.1

nslookup on HP:
DNS request timed out (timeout was 2 seconds).
Default server: UNKNOWN
Address: fec0:0:0:ffff::1

Weird... but I feel I am pretty close.
Attached Images
File Type: jpg connect error.jpg (46.4 KB, 40 views)
File Type: jpg hp-ip.jpg (60.4 KB, 42 views)
File Type: jpg supermicro-ip.jpg (62.4 KB, 34 views)
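Given those nslookup results, a quick way to see whether the problem is name resolution or basic connectivity is to ping each node by name and by address from both machines; a sketch using the host names and IPs given above:

Code:
rem Run on SUPERMICRO, then repeat the equivalent checks from HP.
ping HP
ping 192.168.1.2
nslookup HP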

Old   September 25, 2017, 17:17
Default
  #12
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
I also attach screenshots of ipconfig /all for both machines.
Attached Images
File Type: jpg ipconfig-SUPERMICRO.jpg (140.2 KB, 34 views)
File Type: jpg ipconfig-HP.jpg (166.5 KB, 20 views)

Old   September 26, 2017, 11:52
Default
  #13
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
You have to do ALL of what I said, not most; these things have to be done, among others.
Did you set the environment variable for the license path? If both machines get their license from the same license server, you don't have to do this. Otherwise you have to make the slave node point to the head node's license; specifics are in the documentation.
ANSYSLIC_DIR on the slave node must point to the head node's licensing directory.

Also, when you specify the hosts in the ANSYS launcher, make sure the head node is on top or it won't work, but it looks like you did that.

Try reinstalling IBM MPI, or you could try Intel MPI.

Old   September 26, 2017, 11:57
Default
  #14
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
Hi Erik, how can I set the environment variables?

Old   September 26, 2017, 12:02
Default
  #15
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
ANSYS181_DIR=\\head_node_machine_name\ANSYS Inc\v181\ansys
ANSYSLIC_DIR=\\head_node_machine_name\ANSYS Inc\Shared Files\Licensing


Start >> right-click Computer >> Properties >> Advanced system settings >> Advanced tab >> Environment Variables, then set them as system variables.
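The same variables can also be set system-wide from an elevated command prompt with setx; a sketch using the example head-node paths above (adjust the machine name to your own):

Code:
setx ANSYS181_DIR "\\head_node_machine_name\ANSYS Inc\v181\ansys" /M
setx ANSYSLIC_DIR "\\head_node_machine_name\ANSYS Inc\Shared Files\Licensing" /M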

Old   September 26, 2017, 12:07
Default
  #16
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
Maybe this helps.
I attach a screenshot of the license status page from the HP (slave) node.
It says that the head node's license server is down or not responding.
The reverse is shown on the head node's license status page.
Attached Images
File Type: jpg license status hp.jpg (173.1 KB, 41 views)

Old   September 26, 2017, 12:32
Default
  #17
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
Can you run locally at all? Is your license down completely?
Did you share out the ANSYS install path? Make it a shared folder.

Old   September 26, 2017, 12:43
Default
  #18
Member
 
Join Date: Jun 2010
Posts: 77
Rep Power: 16
Echidna is on a distinguished road
Quote:
Originally Posted by evcelica View Post
Can you run a local run at all? Is your license down completely?
Did you share out privileges to the ANSYS install path? Make it a shared folder.
Yes, I can run locally as normal.
I set the environment variables as you suggested.
The installation folder is shared.
Something strange is happening too:
I go to the ANSYS License Manager again and I get this error:

ERROR Unable to retrieve IP for host THIS. Host not found: THIS

The host is named HP, not THIS...

It is driving me crazy...

Old   September 26, 2017, 15:17
Default
  #19
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23
evcelica is on a distinguished road
Try using the static IPs instead of the computer names.
Turn off the firewall completely during testing, just to see whether it is something in the firewall blocking you.
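A sketch of both suggestions, assuming the host names and static IPs from earlier in the thread; run from an elevated command prompt on both machines:

Code:
rem Map the node names to their static IPs in the hosts file so they resolve without DNS.
echo 192.168.1.1  SUPERMICRO>>C:\Windows\System32\drivers\etc\hosts
echo 192.168.1.2  HP>>C:\Windows\System32\drivers\etc\hosts
rem For testing only: disable the firewall on all profiles; re-enable it when finished.
netsh advfirewall set allprofiles state off
rem netsh advfirewall set allprofiles state on   (run this when testing is done)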

Old   July 22, 2019, 09:58
Default parallel processing
  #20
New Member
 
Alireza
Join Date: Jul 2019
Posts: 5
Rep Power: 7
a1281366 is on a distinguished road
One question:

Hello dear friends. We have several idle systems that we would like to add to our network of other active systems, so that 4 heavy software applications can run in parallel with maximum processing power, to save time and money. The question is: is there any software that can handle all of the above? Best regards.
