August 14, 2018, 17:34 |
Slow performance on Azure cloud
|
#1 |
New Member
Faraz
Join Date: Mar 2018
Posts: 25
Rep Power: 8 |
I am testing the 1.7 million cell motorbike tutorial on Azure. Long story short, it solves in 190 seconds on one 36-CPU machine (F72s_v2). When I run it on two machines (72 CPUs), it takes 600 seconds! When I run top on both machines, I see 32 threads of simpleFoam running at 100%, so it seems to be working correctly. Is the network too slow for MPI? I have run OpenFOAM on a cluster with a 10 GigE network and it seemed to scale well. I am also aware there are HPC-specific nodes on Azure, but I want to get my feet wet with the basics before trying those. |
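To put a number on how bad this is, here is a minimal Python sketch that turns the two timings above into a speedup and a parallel efficiency relative to the single-node run. The function names are hypothetical helpers, not part of OpenFOAM or any MPI tool, and the only inputs are the 190 s and 600 s figures from this post.

```python
def speedup(t_ref, t_new):
    """Speedup of the new run relative to the reference run (above 1 is faster)."""
    return t_ref / t_new


def relative_efficiency(t_ref, cores_ref, t_new, cores_new):
    """Speedup divided by the factor by which the core count grew."""
    return speedup(t_ref, t_new) / (cores_new / cores_ref)


# Timings reported in the post: one F72s_v2 node vs. two nodes.
t_one_node, cores_one_node = 190.0, 36
t_two_nodes, cores_two_nodes = 600.0, 72

s = speedup(t_one_node, t_two_nodes)
e = relative_efficiency(t_one_node, cores_one_node, t_two_nodes, cores_two_nodes)

print(f"speedup from 36 to 72 cores: {s:.2f}x")
print(f"efficiency relative to one node: {e:.0%}")
```

A speedup below 1 means the second node makes the run slower, which suggests that communication between the nodes costs more than the extra cores gain.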
|
August 15, 2018, 04:46 |
|
#2 |
Senior Member
anonymous
Join Date: Jan 2016
Posts: 416
Rep Power: 14 |
Hi!
Isn't 72 cores too many for 1.7 million cells? You will have more processor faces than internal ones... What if you try with 17 million cells? (But I think 72 cores will still be a lot for 17 million cells.) |
|
August 15, 2018, 10:59 |
|
#3 |
New Member
Faraz
Join Date: Mar 2018
Posts: 25
Rep Power: 8 |
The same model scales well up to 160 cores on my bare-metal cluster. Beyond 160 cores I see minimal speedup.
Last edited by feacluster; August 15, 2018 at 14:04. |
|
August 20, 2018, 13:19 |
|
#4 |
New Member
Faraz
Join Date: Mar 2018
Posts: 25
Rep Power: 8 |
I ran the Ohio State benchmarks to measure latency on Azure:
http://mvapich.cse.ohio-state.edu/benchmarks/ I got around 60 microseconds, which is quite high. On our InfiniBand cluster I get 1-2 microseconds. I suppose that is the reason for my slowdown when using multiple nodes? I know people run OpenFOAM on Amazon AWS. I am not sure what their latency numbers are, but I would imagine they are in the same ballpark? |
|
August 20, 2018, 15:57 |
|
#5 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Quick answer: 10 GigE is known to have considerably higher latency than InfiniBand. 60 µs vs 1-2 µs sounds about right.
Searching online for "azure infiniband" does return several results, namely on how to use InfiniBand on Azure.
|
|
August 20, 2018, 22:54 |
|
#6 |
Member
Join Date: Jan 2015
Posts: 62
Rep Power: 11 |
I have a bunch of high-end workstations with a 10 GigE interconnect and did a lot of benchmarking. Once I added a third machine to go above 50 cores, returns diminished sharply. I also have a cluster with 56 Gb/s InfiniBand, and it scaled perfectly with the same model I tested on the workstations.
I would definitely stay away from 10 GigE if possible. I think a lot of instances on Azure have at least 25 Gbps. |
|
August 21, 2018, 03:41 |
|
#7 |
Senior Member
Anton Kidess
Join Date: May 2009
Location: Germany
Posts: 1,377
Rep Power: 30 |
They could have 10000 Gbps and still show bad scaling. It's the latency that counts. |
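To illustrate the point, here is a rough Python sketch of the simple latency-plus-bandwidth model for the time to send one message. The latencies (60 µs, 2 µs) and link speeds (10, 10000 and 56 Gbps) are the figures quoted in this thread; the 8 KiB message size is purely an assumption for illustration, and real MPI traffic in a solver is a mix of sizes.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bits_per_s):
    """Estimated time for one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes * 8 / bandwidth_bits_per_s


MESSAGE_BYTES = 8 * 1024  # assumed message size, not measured

# Latency and bandwidth figures as quoted in the thread.
ethernet_10g = transfer_time(MESSAGE_BYTES, 60e-6, 10e9)
ethernet_10000g = transfer_time(MESSAGE_BYTES, 60e-6, 10000e9)
infiniband_56g = transfer_time(MESSAGE_BYTES, 2e-6, 56e9)

for name, t in [
    ("10 Gbps, 60 us latency", ethernet_10g),
    ("10000 Gbps, 60 us latency", ethernet_10000g),
    ("56 Gbps, 2 us latency", infiniband_56g),
]:
    print(f"{name}: {t * 1e6:.2f} us per message")
```

Under these assumptions, raising the bandwidth a thousandfold barely changes the per-message time, because the fixed 60 µs dominates, while the low-latency link stays well over ten times faster.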
|
August 23, 2018, 14:41 |
|
#8 |
New Member
Faraz
Join Date: Mar 2018
Posts: 25
Rep Power: 8 |
I re-ran my OpenFOAM benchmark on the Azure HPC compute instances (H16r). The performance scaled well, same as on my bare-metal cluster.
These are 16-core machines with an InfiniBand network. The catch is that they are only compatible with Intel MPI, so you need your own cluster edition Intel compiler. I was able to compile on my bare-metal cluster and copy the executable to Azure, and it ran fine. |
|
November 20, 2018, 10:21 |
|
#9 |
New Member
hypersonic
Join Date: Mar 2009
Posts: 5
Rep Power: 17 |
Hi
I'm running ANSYS on Azure (8*H16r), but can only get RDMA InfiniBand at 1X DDR (5 Gbit/s). Has anyone had any luck getting 4X FDR (56 Gbit/s) to work with Intel MPI, as they say is possible on the Azure HPC page? Thanks |
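For readers decoding the link names: they follow the usual InfiniBand pattern of a lane count and a generation, assuming the first one in the post means 1X DDR. The small Python sketch below multiplies the two to get the nominal signaling rate. The per-lane figures are the standard nominal values; usable data rates are lower because of line encoding, and the helper name is hypothetical.

```python
# Nominal per-lane signaling rates in Gbit/s for common InfiniBand generations.
PER_LANE_GBITS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "FDR": 14.0625}


def link_rate_gbits(lanes, generation):
    """Nominal signaling rate of a link with the given lane count and generation."""
    return lanes * PER_LANE_GBITS[generation]


print(link_rate_gbits(1, "DDR"))  # 5.0
print(link_rate_gbits(4, "FDR"))  # 56.25
```

Read this way, the reported link is running on a single lane of an older generation, roughly an eleventh of the advertised 4X FDR rate.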
|
November 20, 2018, 21:35 |
|
#10 | |
New Member
Faraz
Join Date: Mar 2018
Posts: 25
Rep Power: 8 |
I am no expert in InfiniBand, but I always try to check the basics first. Try running these benchmarks and report the results: http://mvapich.cse.ohio-state.edu/benchmarks/
|
||
November 22, 2018, 14:24 |
|
#11 | |
Member
Fatih Ertinaz
Join Date: Feb 2011
Location: Istanbul
Posts: 64
Rep Power: 15 |
Hello
As Faraz has already pointed out, the performance bottleneck you encounter is most likely due to network latency.
Hope this helps |
||
November 25, 2019, 12:14 |
|
#12 |
Member
Jairo A. Gutiérrez S
Join Date: Nov 2014
Posts: 60
Rep Power: 12 |
I think you should limit to 50k cells per processor.
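As a quick check of that rule of thumb against the numbers in this thread, the Python sketch below computes cells per core for the 1.7 million cell case at the core counts mentioned earlier (36, 72 and 160). It reads the 50k figure as a target number of cells per core, which is an assumption about what the post means, and the helper names are hypothetical.

```python
TOTAL_CELLS = 1_700_000          # motorbike case size from the first post
TARGET_CELLS_PER_CORE = 50_000   # rule of thumb from the post above


def cells_per_core(total_cells, cores):
    """Average number of cells each core handles after decomposition."""
    return total_cells / cores


def cores_for_target(total_cells, target_cells_per_core):
    """Core count that gives roughly the target number of cells per core."""
    return max(1, round(total_cells / target_cells_per_core))


for cores in (36, 72, 160):
    print(f"{cores} cores: {cells_per_core(TOTAL_CELLS, cores):.0f} cells per core")

suggested = cores_for_target(TOTAL_CELLS, TARGET_CELLS_PER_CORE)
print(f"about {suggested} cores for {TARGET_CELLS_PER_CORE} cells per core")
```

By this measure a single 36-core node is already close to the target. The same case was reported earlier in the thread to scale well up to 160 cores on a low-latency bare-metal cluster, so the usable cells-per-core floor seems to depend heavily on the interconnect.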
|