Problem with parallelization speedup using many CPUs
|
January 19, 2009, 06:01 |
#1 |
Member
Martin Aunskjaer
Join Date: Mar 2009
Location: Denmark
Posts: 53
Rep Power: 17 |
It seems you have a bandwidth problem on your interconnect. One possible explanation is that your case is too small for an efficient parallel run on more than 2 CPUs. Small cases require low-latency interconnects due to heavy network traffic. I am not familiar with the benchmark case, but my suggestion is to make it larger. Keep us posted.
|
|
January 19, 2009, 06:13 |
#2 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
Thanks for the reply, I will try increasing the size of the case and report what happens.
/Andreas |
|
January 19, 2009, 06:22 |
#3 |
Member
Velan
Join Date: Mar 2009
Location: India
Posts: 50
Rep Power: 17 |
Hi
Why not try a smaller case, with grid sizes of 16x16x16 and 16x128x128? Decompose the grid in the x direction and, if possible, compare the results like:
nproc realtime
1 ??secs
2 ??
4 ??
and do the same thing for the second case as well (128 cells in the x direction). I found a problem between AMD and Intel on this issue (bandwidth); I will post the results later. |
|
January 19, 2009, 12:32 |
#4 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
Martin: I have now tried expanding the case from its original 12 000 cells to about 300 000 cells. I have run this for 2 h on 1, 2 and 4 CPUs to see what happens.
My results are then:
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99
This means that the execution/clock time ratio now drops even earlier (already at 2 CPUs) and I get no speedup even between 1 and 2 CPUs. When I try this on my PC there is a clear speedup, between 1.5 and 2. I also wonder if this is the result of slow networking; can it be fixed or do I need to use another cluster?
Velan: Thanks for your reply, but I am not sure exactly what you mean. As I understand your posting, you want me to check a very small case to see if the problem has to do with low memory or something like that. I will try following your instructions and post results tomorrow.
/Andreas |
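The relation between the ClockTime and ExecutionTime columns can be made explicit with a few lines of Python. This is only a sketch using the numbers quoted above, under the assumption that OpenFOAM reports ExecutionTime as CPU time and ClockTime as wall time:

```python
def busy_fraction(clock_time, execution_time):
    """Fraction of the wall-clock time a process actually spends computing.
    Assumes ExecutionTime = CPU time and ClockTime = wall time, as in the
    OpenFOAM solver output quoted above."""
    return execution_time / clock_time

# Numbers copied from the 300 000 cell run above: {CPUs: (Clock, Execution)}
runs = {1: (7222, 7207.49), 2: (7187, 3610.08), 4: (7187, 1804.03)}

for n, (clock, execution) in sorted(runs.items()):
    print(f"{n} CPU: busy {busy_fraction(clock, execution):.0%} of the wall time")
```

Each process is busy roughly 1/N of the wall time, which is why the simulated time reached in a fixed 2 h run stays flat instead of growing with N.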
|
January 20, 2009, 06:53 |
#5 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
I have now done some asking around at the cluster support. The network uses Gigabit Ethernet (1000Base-T) with a theoretical speed of 1000 Mbit/s, which according to support should imply that it transfers data at 125 Mbytes/s between processors IF the switch is not overloaded. The time to initiate communication between nodes (latency) is 0.35 ms.
Do these numbers say anything to anyone? My main concern is whether the cluster is fast enough... I have also tried out the case suggested by Velan. The time simulated when run for 2 h decreases (i.e. speedup lower than 1) when comparing 1 with 2 or 4 CPUs, even for this small case. The ratio between execution time and clock time also drops below 50% for 2 CPUs and becomes even lower for 4 CPUs. For an oodles case with 16x16x16 cells:
CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235
Thanks in advance for any reply
Andreas |
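To get a feel for what a 0.35 ms latency means, here is a minimal latency-plus-bandwidth message cost model; the 16x16 patch of doubles is a made-up illustration, not a size taken from the actual case:

```python
def transfer_time(nbytes, latency_s=0.35e-3, bandwidth_bps=125e6):
    """Time for one message: fixed latency plus payload / bandwidth,
    using the interconnect figures quoted above (0.35 ms, 125 Mbyte/s)."""
    return latency_s + nbytes / bandwidth_bps

# Hypothetical example: one 16x16 patch of 8-byte doubles per exchanged field
patch_bytes = 16 * 16 * 8
print(f"{transfer_time(patch_bytes) * 1e3:.3f} ms per message")  # → 0.366 ms
```

For small messages the 0.35 ms latency dwarfs the transfer itself, which is why several posters point at latency rather than bandwidth.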
|
January 20, 2009, 07:04 |
#6 |
Member
Christian Winkler
Join Date: Mar 2009
Location: Mannheim, Germany
Posts: 63
Rep Power: 17 |
Some comments on scaling behavior:
Intel Harpertown and Woodcrest CPUs (quad-core Xeons) have a significant memory bottleneck between cores and RAM. I work on a cluster with dual quad-core nodes connected by InfiniBand. Result: only use half of the cores per node! If I use all cores the ExecutionTime doubles! Otherwise scaling is good enough.
Regarding your problem: bandwidth is not that important for CFD (but it does not hurt); latency is the key. You might want to use a low-latency MPI implementation for Gigabit: GAMMA http://www.disi.unige.it/project/gam...mma/index.html
I haven't tried it myself, but they have performance data for OF 1.4 which looks quite nice. Have a look at OpenFOAM-1.5.x/src/Pstream/gamma
Best regards
Christian |
|
January 20, 2009, 07:17 |
#7 |
Member
Martin Aunskjaer
Join Date: Mar 2009
Location: Denmark
Posts: 53
Rep Power: 17 |
Can this be all latency induced? It appears that for N>1 CPUs, each executes for about the time one would expect but sits doing nothing for (N-1)/N of the total wall clock time.
As I understand it, this is a cluster of single-core, single-CPU nodes with a Gbit interconnect. This is exactly what I'm planning to invest in for OF, albeit for turbFoam. However, I suspect turbFoam and oodles only differ in the turbulence models, so this result is not at all encouraging. I have no further ideas at this time, other than maybe trying other cases that use other sparse matrix solvers. |
|
January 20, 2009, 09:25 |
#8 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
Thanks for your answers, I will look into the latency issue.
Martin: Yes, the cluster consists of 200 AMD Opteron 148 single-CPU nodes. I will report on any progress.
Regards
Andreas |
|
January 20, 2009, 12:27 |
#9 |
Member
Martin Aunskjaer
Join Date: Mar 2009
Location: Denmark
Posts: 53
Rep Power: 17 |
Having thought a bit about it, the fact that you see no difference using floatTransfer might indicate that this is not a bandwidth problem; rather it might indeed be a latency problem. You might want to examine your network performance.
Also, have a look at these threads for possible further assistance: http://www.cfd-online.com/OpenFOAM_D...es/1/2970.html http://www.cfd-online.com/OpenFOAM_D...es/1/5473.html (posts from Sep 27 and onwards in the latter). |
|
January 21, 2009, 10:32 |
#10 |
Member
Carsten Thorenz
Join Date: Mar 2009
Location: Germany
Posts: 34
Rep Power: 17 |
Dear Andreas,
maybe I didn't understand your table:
> My results are then:
> 1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
> 2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
> 4 CPU Clock=7187 Execution=1804.03 Speedup 0.99
But this looks like rather perfect speed-up to me?! 7207.49 s on 1 CPU, 3610.08 s on 2 CPUs, 1804.03 s on 4 CPUs. Perfect. Dumb question: are you sure you're interpreting the numbers correctly?
Bye, Carsten |
|
January 21, 2009, 11:21 |
#11 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
@Martin:
Thanks for the info. I will continue to check what I can do about the MPI and latency.
@Carsten: Sorry, maybe my table was not very clear. What I did was run each case for 2 h (giving ClockTime approx. 7200 for all cases). Still, the 4 CPU case only spends 1804.03 s computing, while the 1 CPU case spends almost the whole 2 h computing. The remaining time is probably spent waiting for communication between the nodes. What I mean in the last column is that the different cases reach almost the same simulation time at the end of the 2 h. So, as I see it, I would have very good speedup if the nodes didn't spend so much time waiting, i.e. if ExecutionTime were equal to ClockTime, but they are not.
Regards
Andreas |
|
January 21, 2009, 14:12 |
#12 |
Senior Member
Niels Gjoel Jacobsen
Join Date: Mar 2009
Location: Copenhagen, Denmark
Posts: 1,903
Rep Power: 37 |
Hi Andreas
Somewhere on the Forum (I cannot recall where), I read that you need to have O(1e4) cells per processor to get reasonable results; otherwise the transfer of BCs from processor to processor will eat all your time. I have a rather pragmatic way of looking at it: surface area / volume, i.e. the number of processor patch faces divided by the number of cells on each processor. If this ratio is large (and the effort in solving the Poisson eq. is small), then you must expect a rather large standstill when running on multiple processors. Thus I would suggest that you run the exact same case, just with 32 * 32 * 32 cells. For the 2 processor case, you would get half of the "surface area / volume", and thus hopefully better scaling. I am not at all an expert, but I hope this is helpful. Best regards, Niels
__________________
Please note that I do not use the Friend-feature, so do not be offended, if I do not accept a request. |
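Niels's surface-area-to-volume metric can be sketched numerically. The code below assumes a simple slab decomposition along x of an n^3 grid; that is an illustrative assumption, not necessarily what decomposePar produces:

```python
def surface_to_volume(n, nprocs):
    """Processor-patch faces per cell on one processor, for an n^3 grid
    cut into nprocs slabs along x (an assumed, simplest decomposition).
    Interior slabs see two processor patches; with 2 slabs each sees one."""
    patches_seen = 2 if nprocs > 2 else 1
    patch_faces = patches_seen * n * n
    cells_on_proc = n**3 // nprocs
    return patch_faces / cells_on_proc

# Doubling the resolution halves the ratio, as Niels predicts:
print(surface_to_volume(16, 2))  # → 0.125
print(surface_to_volume(32, 2))  # → 0.0625
```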
|
January 21, 2009, 17:09 |
#13 |
Member
Ola Widlund
Join Date: Mar 2009
Location: Sweden
Posts: 87
Rep Power: 17 |
Hi,
I have no benchmark with OF, but we run Fluent (similar algorithm and MPI implementation) on a cluster with 50 nodes, 2 AMD Opterons per node, and Gbit Ethernet. Our rule of thumb is to have a minimum of 200-300 thousand cells per CPU, otherwise latency will kill us...
/Ola |
|
January 22, 2009, 03:54 |
#14 |
Member
Carsten Thorenz
Join Date: Mar 2009
Location: Germany
Posts: 34
Rep Power: 17 |
Hi Andreas, I don't want to sound stubborn, but I think you misinterpret your results, or your set-up is wrong, or I still don't understand your set-up.
What you're saying is that for a mesh of 16x16x16 = 4096 cells you have a speedup of
CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235
Then, for 300 000 cells you have a speedup of
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99
From your interpretation this means that the speed-up is worse for larger grids. This is very, very improbable in my experience. So, I would try the following:
- use a testcase that is big enough (1e6 cells if you have enough RAM; this reduces the latency impact)
- adjust the testcase so that it produces as little result output as possible (IO can slow everything down)
- adjust "endTime" of the testcase so that it runs ~30 min on 1 CPU (what did you mean by "giving ClockTime"?)
- execute all runs with the "time" command in front of the foam solver and in front of mpirun (e.g. "time mpirun -np 4 -hostfile mymachines time /pathtofoam/mysolver -case mycase -parallel"; syntax may differ for you)
- decompose and run it for 2 and 4 CPUs
- post the results
Maybe you can log in on each of your client nodes while the job is running and execute "top" on it. Check how much time is spent on your job, on system, on idle, on wait. Activating "Sleeping in Function" in "top" may help (hit f y in top) to identify the culprit. As already stated by Martin, latency can be a big issue on Gbit Ethernet networks. But I wouldn't expect it to be this severe.
Bye, Carsten |
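The wall-time-versus-CPU-time comparison that the `time` command gives you can also be reproduced from a script. The sketch below is a hypothetical Python stand-in (Unix only), not part of Carsten's recipe; on the cluster, `cmd` would be the full mpirun invocation:

```python
import os
import subprocess
import sys
import time

def timed_run(cmd):
    """Run cmd and report wall time vs. CPU time used by the child --
    a rough stand-in for prefixing the solver with `time` (Unix only)."""
    wall_start = time.monotonic()
    t0 = os.times()
    subprocess.run(cmd, check=True)
    t1 = os.times()
    wall = time.monotonic() - wall_start
    cpu = (t1.children_user - t0.children_user) + (t1.children_system - t0.children_system)
    print(f"wall {wall:.2f} s, child CPU {cpu:.2f} s")
    return wall, cpu

# Hypothetical usage with a throwaway workload:
timed_run([sys.executable, "-c", "sum(range(10**6))"])
```

A child CPU time far below the wall time is exactly the "busy only 1/N of the time" symptom seen in the tables above.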
|
January 22, 2009, 04:20 |
#15 |
Member
Christian Winkler
Join Date: Mar 2009
Location: Mannheim, Germany
Posts: 63
Rep Power: 17 |
Hi,
I have the feeling that your job is running on a single node. That would explain everything! In detail: you have four nodes with a single CPU core in each node. Now you start your job with 4 processes. The idea is that your system distributes those 4 processes to the 4 physical CPU cores. If this distribution is not done properly, all 4 processes are assigned to ONE CPU core. Then you get the behavior you see: each process uses up a fourth of the whole clock time. I made this mistake once myself. That would be a bug in your job submission script or in the batch scheduling system. So you should check if each node is running your job. Just log in during the run using ssh and do "top".
Best regards
Christian |
|
January 22, 2009, 04:26 |
#16 |
Member
Christian Winkler
Join Date: Mar 2009
Location: Mannheim, Germany
Posts: 63
Rep Power: 17 |
Some more comments on cell numbers per process and speedup:
I have some cases that I partitioned to have as few as 10,000 cells per CPU, and even they show reasonable speedup with OF (also with Fluent, but not as good as with OF). But the bigger the chunks of the mesh per CPU, the better the scale-up, as stated before.
Best regards
Christian |
|
January 22, 2009, 06:06 |
#17 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
Hi again,
@Carsten: I think I understand now what you mean. In each table I have normalized the simulation time to that of 1 CPU for comparison; the clock time is posted in order to compare it to the execution time. Looking at the actual numbers, the small case has a much longer simulation time. I have made a 1M cell case with low output. Then I set the maximum wall time when submitting to the cluster to 2 h for all cases and see how far they get in simulation time. I have not used the "time" command. What is it for? Also see below.
@Christian: Maybe you are right; I also recently got an email from another user who suggested this. It would most definitely make sense of my low efficiency. My log files say that more than one CPU is used: bench_doc_1.o693801 bench_doc_2.o693802 (I only have 1 and 2 CPUs yet, but for them I still have the problem), but I do not know if they are really sharing the load. The cluster support has told me that they do not allow access to single nodes, so I cannot go there directly and use "top" as I do on my own computer. But it would probably be possible to get the information from the running program; I just do not know how. Any suggestions would be very welcome.
Regards
Andreas |
|
January 22, 2009, 06:33 |
#18 |
Member
Christian Winkler
Join Date: Mar 2009
Location: Mannheim, Germany
Posts: 63
Rep Power: 17 |
Ok, your job is running on a single CPU :-))
Have a look at your log. First it says:
... names of assigned nodes
dn209
dn208
echo dn209 is main node
...
And then from OF:
[0] Date : Jan 22 2009
[0] Time : 08:25:34
[0] Host : dn209
[0] PID : 20938
[1] Date : Jan 22 2009
[1] Time : 08:25:34
[1] Host : dn209
[1] PID : 20939
[1] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Case : myBench2/oodles_pitzDaily_2
[0] Nprocs : 2
[0] Slaves :
[0] 1
[0] (
[0] dn209.20939
[0] )
[0]
[1] Case : myBench2/oodles_pitzDaily_2
[1] Nprocs : 2
The host process is on dn209, and the slave process is on dn209 as well!!! It should be on dn208, right? Could you mail your job script?
Best regards
Christian |
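A log header like the one above can be checked automatically. Here is a small sketch; the regex for the `[rank] Host : name` lines is an assumption based only on the format of the excerpt shown:

```python
import re

def hosts_by_rank(log_text):
    """Extract {rank: host} from the "[n] Host : name" lines of an
    OpenFOAM parallel log header like the excerpt above."""
    return {int(m.group(1)): m.group(2)
            for m in re.finditer(r"\[(\d+)\]\s*Host\s*:\s*(\S+)", log_text)}

log = """\
[0] Host : dn209
[1] Host : dn209
"""
hosts = hosts_by_rank(log)
if len(set(hosts.values())) < len(hosts):
    print("WARNING: multiple ranks on one host:", hosts)
```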
|
January 22, 2009, 06:42 |
#19 |
New Member
Andreas Håkansson
Join Date: Mar 2009
Location: Lund, Sweden
Posts: 12
Rep Power: 17 |
Great, I hope you are right; then I have hopes of fixing this!
My submit script is: bench2.scr
Regards
Andreas |
|
January 22, 2009, 07:02 |
#20 |
Member
Christian Winkler
Join Date: Mar 2009
Location: Mannheim, Germany
Posts: 63
Rep Power: 17 |
Hi Andreas,
Try this command:
mpirun -np $nrnodes -hostfile $PBS_NODEFILE $solver $PBS_O_WORKDIR $casename -parallel
In my experience the -hostfile option actually must not be used with Open MPI, but you should try it anyway. You are using LAM, right?
Best regards
Christian |
|