|
August 29, 2006, 23:02 |
I have a case of about 1M cell
|
#1 |
Senior Member
Xiaofeng Liu
Join Date: Mar 2009
Location: State College, PA, USA
Posts: 118
Rep Power: 17 |
I have a case of about 1M cells. I run it in parallel on 32 partitions (16 nodes x 2 cores = 32).
I want to simulate a process which takes about 1800 s in reality. I ran the case on the supercomputer for 12 hours, and it only simulated about 600 s. One thing I noticed is the time information in the log file:

at time step n: ExecutionTime = 25185.6 s ClockTime = 43755 s
at time step n+1: ExecutionTime = 25194.1 s ClockTime = 43769 s

The ClockTime is almost twice the ExecutionTime. Does ExecutionTime mean CPU time, and ClockTime mean CPU time plus communication time between nodes? Does that mean the program spent a lot of time just waiting for data transfer?
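For reference, here is a minimal Python sketch (my own, assuming the solver log keeps printing the usual "ExecutionTime = ... s ClockTime = ... s" lines shown above) that extracts both values per time step and prints their ratio, i.e. roughly the fraction of wall-clock time spent computing rather than waiting:

Code:
import re
import sys

# Match lines like: "ExecutionTime = 25185.6 s  ClockTime = 43755 s"
pattern = re.compile(r"ExecutionTime\s*=\s*([\d.]+)\s*s\s*ClockTime\s*=\s*([\d.]+)\s*s")

def report(logfile):
    prev_exec, prev_clock = 0.0, 0.0
    for line in open(logfile):
        m = pattern.search(line)
        if not m:
            continue
        exec_t, clock_t = float(m.group(1)), float(m.group(2))
        d_exec, d_clock = exec_t - prev_exec, clock_t - prev_clock
        if d_clock > 0:
            # ratio close to 1 means the CPUs were busy most of the wall time
            print(f"step: exec {d_exec:6.1f} s  clock {d_clock:6.1f} s  "
                  f"ratio {d_exec / d_clock:5.2f}")
        prev_exec, prev_clock = exec_t, clock_t

if __name__ == "__main__":
    report(sys.argv[1])  # e.g. python logtimes.py <your solver log>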
__________________
Xiaofeng Liu, Ph.D., P.E., Assistant Professor Department of Civil and Environmental Engineering Penn State University 223B Sackett Building University Park, PA 16802 Web: http://water.engr.psu.edu/liu/ |
|
August 29, 2006, 23:07 |
For time step n+1:
ExecutionT
|
#2 |
Senior Member
Xiaofeng Liu
Join Date: Mar 2009
Location: State College, PA, USA
Posts: 118
Rep Power: 17 |
For time step n+1:
ExecutionTime = 8.5 s, ClockTime = 14 s. So about 5.5 s is spent on communication?
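(As a rough worked example of the same bookkeeping, purely my own arithmetic: for that step the ratio is 8.5 / 14 ≈ 0.61, so roughly 61% of the wall-clock time went into computation and the remaining ~5.5 s per step into communication, latency and waiting for slower partitions.)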
__________________
Xiaofeng Liu, Ph.D., P.E., Assistant Professor Department of Civil and Environmental Engineering Penn State University 223B Sackett Building University Park, PA 16802 Web: http://water.engr.psu.edu/liu/ |
|
August 30, 2006, 04:16 |
The 'missing' time is probably
|
#3 |
Senior Member
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,419
Rep Power: 26 |
The 'missing' time is probably spent waiting for communication. This is due to imperfect load balancing, and to the communication time and latency itself. What interconnect do you have?
|
|
August 30, 2006, 08:00 |
It's a common problem: You nee
|
#4 |
Member
Ola Widlund
Join Date: Mar 2009
Location: Sweden
Posts: 87
Rep Power: 17 |
It's a common problem: you need a lot of cells in each partition for the CPUs to spend more time iterating than waiting... From my experience using Fluent on a 16-node (32 CPU) cluster, you should have 100,000 to 200,000 cells in each partition to get decent parallel efficiency. I'm rather surprised it went so "well" for you! (We have an ordinary Gigabit Ethernet; with a high-speed interconnect it would be better, but you still lose a lot.)
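A quick back-of-envelope check of the original case against that rule of thumb (my own sketch, using the numbers quoted earlier in the thread):

Code:
# Cells per partition for the ~1M-cell case on 32 partitions, compared with
# the suggested 100,000-200,000 cells per partition on Gigabit Ethernet.
total_cells = 1_000_000   # ~1M cells (first post)
partitions = 32           # 16 nodes x 2 cores

print(total_cells / partitions)   # ~31250 cells per partition
print(total_cells // 200_000,     # 5 ...
      total_cells // 100_000)     # ... to 10 partitions would fit the rule

So with 32 partitions the case sits well below the suggested range, which is consistent with the large gap between ClockTime and ExecutionTime.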
/Ola |
|
October 13, 2006, 15:34 |
Hi, Xiaofeng,
What computer
|
#5 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Hi, Xiaofeng,
What computer were you running the parallel case on? I recently tested my cluster (2 dual-CPU workstations + 2 dual-core workstations). When I used all CPUs/cores, that is a total of 8, I got 45%-50% efficiency (executionTime/ClockTime). When I used only 1 CPU (or 1 core) from each workstation, I got 65%-70% efficiency. However, the executionTime in the 4-CPU run is longer than in the 8-CPU/core case, so in real time the 8-CPU run is still "slightly" faster than the 4-CPU run. I was told that even 70% efficiency is not good.

Each workstation has 1 gigabit NIC connected to a Linksys gigabit switch (SD2008) which supports non-blocking switching and jumbo frames. I might want to try out GAMMA. But is there a way to improve efficiency without GAMMA? What is the typical parallel efficiency people get? Any suggestions?

pei |
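To keep the two notions of "efficiency" in this discussion apart, here is a small illustrative Python sketch (the run times are made-up placeholders, not actual measurements): executionTime/ClockTime measures how busy the CPUs are within one run, while parallel efficiency compares wall-clock time against a serial run.

Code:
# Two different "efficiencies": executionTime/ClockTime within one run
# (how busy the CPUs were), and speedup / parallel efficiency relative to a
# serial run. All times below are made-up placeholders for illustration.
runs = {
    # label: (cores, executionTime [s], clockTime [s])
    "serial": (1, 1000.0, 1010.0),
    "4-core": (4,  340.0,  500.0),
    "8-core": (8,  210.0,  450.0),
}

serial_clock = runs["serial"][2]
for label, (n, exec_t, clock_t) in runs.items():
    busy = exec_t / clock_t           # fraction of wall time spent computing
    speedup = serial_clock / clock_t  # wall-clock speedup vs. the serial run
    par_eff = speedup / n             # classical parallel efficiency
    print(f"{label}: busy {busy:.0%}, speedup {speedup:.1f}, "
          f"parallel efficiency {par_eff:.0%}")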
|
October 14, 2006, 08:07 |
Hi,
I was looking at the be
|
#6 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Hi,
I was looking at the benchmark results posted on the OpenFOAM wiki. I noticed that for the interFoam case (case #4), when run on the Waltons cluster, the 3-CPU and 4-CPU runs were actually 50% slower than the serial run (1 CPU). Is this real? pei |
|
October 15, 2006, 17:06 |
Hi Pei!
(about case #4 on the
|
#7 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Pei!
(about case #4 on the Wiki) Yep. I'm afraid so. The case is just too small (18 MB according to the table at the top; I don't know how many cells right now). If you look at the other small cases on that machine, they don't scale that well either. (Partly the network on that machine can be blamed, but not entirely.)
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
October 16, 2006, 16:36 |
Hi, Bernhard,
How is memory
|
#8 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Hi, Bernhard,
How is the memory usage determined? The case I am testing has about 1,158,000 hex cells. I am trying to find out what could be the cause(s) of the low executionTime/ClockTime ratio. I ran a case on a dual-core AMD workstation; the executionTime/ClockTime ratio is about 1, but the speedup is only about 1.3. This could be due to both cores accessing the same memory bus. I am hoping to improve the executionTime/ClockTime ratio. Any suggestions? Pei |
|
October 17, 2006, 11:04 |
Hi Pei!
@memory usage: For t
|
#9 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Pei!
@memory usage: For the benchmark cases the memory usage was "measured" by reading the amount of resident memory from the operating system every 5 seconds and reporting the maximum value that occurred during the benchmark. In general I think the rule of thumb is approximately 800 bytes/cell (double precision); more if you use additional models. Bernhard
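Applied to the 1,158,000-cell case mentioned above, that rule of thumb gives a rough, purely indicative estimate (my own arithmetic):

Code:
# ~800 bytes/cell rule of thumb (double precision, no extra models).
cells = 1_158_000
bytes_per_cell = 800

total_bytes = cells * bytes_per_cell
print(f"~{total_bytes / 1024**2:.0f} MiB for the whole case")         # ~883 MiB
print(f"~{total_bytes / 32 / 1024**2:.1f} MiB per partition on 32")   # ~27.6 MiB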
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|