
Dark side of Amdahl's law


Posted November 25, 2011 at 16:19 by SergeAS
Updated November 27, 2011 at 06:44 by SergeAS
Tags mpi, parallel

Looking through the latest HPCSource, I came across a very well-known picture illustrating Amdahl's law:
S_p = \cfrac{1}{\alpha + \cfrac{1 - \alpha}{p}}


which shows how the speedup of a parallel code depends on the number of processor cores for different fractions of time (\alpha \ne 0) spent in the part that was not parallelized. This diagram caught my interest because, just before, I had tested the scalability of one of my old 2D research parallel solvers.
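For instance, even with a 95% parallel fraction (\alpha = 0.05), the formula caps the speedup on 100 cores at

S_{100} = \cfrac{1}{0.05 + \cfrac{0.95}{100}} \approx 16.8

and at 1/\alpha = 20 no matter how many cores are added.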
We call it T-DEEPS2D (Transient, Density-based Explicit Effective Parallel Solver for 2D compressible NS). For the test I used the problem of simulating a von Karman vortex street near a 2D cylinder on a uniform mesh of 500x500 nodes.

This solver uses several variants of balanced 1D domain decomposition with halo exchanges. The measurements were performed on two clusters with different numbers of nodes and different types of processors. The executables were produced by the same compiler (Intel C++ 10.1), but for the different processors. The solver uses non-blocking MPI calls (there are also versions of the solver with blocking MPI calls and an OpenMP version, but more on that later). Both clusters use InfiniBand as the interconnect.

My attention was drawn to the speedup figures in the graphs: according to my tests, it turned out that on 100 cores my parallel solver code does much better than 95%. So I took the formula of Amdahl's law and, knowing the real speedup and the number of cores, built the dependence of \alpha on the number of cores. See the results below:
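(For reference, solving Amdahl's law for \alpha given a measured speedup S_p on p cores gives

\alpha = \cfrac{\cfrac{p}{S_p} - 1}{p - 1}

which is the rearrangement that gives \alpha from a measured speedup and core count.)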
[The speedup, performance and parallel-factor plots are attached below.]
It turns out that \alpha is not constant but depends on the number of cores/subdomains used ... and tends to zero.
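For context, the 1D halo exchange with non-blocking MPI calls mentioned above follows roughly the pattern below. This is only a minimal sketch under my own assumptions about the data layout (row-major storage, one halo row per neighbour); the names are illustrative, not the actual T-DEEPS2D code.

Code:
#include <mpi.h>
#include <vector>

// Minimal sketch: 1D (row-wise) halo exchange using non-blocking MPI calls.
// u holds (local_rows + 2) * ncols values: one halo row at the top and bottom.
void exchange_halos(std::vector<double>& u, int local_rows, int ncols,
                    int rank, int nprocs, MPI_Comm comm)
{
    const int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    const int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];

    // Post receives into the halo rows (row 0 and row local_rows + 1).
    MPI_Irecv(&u[0],                        ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(&u[(local_rows + 1) * ncols], ncols, MPI_DOUBLE, down, 1, comm, &req[1]);

    // Send the first and last interior rows to the neighbouring subdomains.
    MPI_Isend(&u[ncols],                    ncols, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(&u[local_rows * ncols],       ncols, MPI_DOUBLE, down, 0, comm, &req[3]);

    // Interior points could be updated here to overlap communication and computation.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}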

For other solvers the picture is likely to be different.


Does anyone have any ideas on this?
Attached Thumbnails
HyperFLOW2D-speedup.png
HyperFLOW2D-speed.png
HyperFLOW2D-parallel-factor.png
VonKarmanVorticeStreet2.gif
Total Comments 2

Comments

  1. Hi, newbie here, why does SDR show better results than DDR in your tests?
    Posted March 28, 2012 at 10:20 by lakeat
  2. This is because one cluster used InfiniBand DDR and has NUMA nodes with a smaller cache (Opteron 285), while the other cluster used SDR with SMP nodes (Xeon) and a larger cache.

    In addition, each node (except the first and last) performs two data exchanges over InfiniBand and three exchanges through internal memory (SMP or NUMA).

    PS: If I use one process per node (exchanges over InfiniBand only), the scalability is even better. It's a paradox, but a fact.
    Posted March 28, 2012 at 11:20 by SergeAS
 
