Parallel computing quad core

January 14, 2009, 12:14 (Prad, #1)

Hi

I am running my CFD code on a quad-core machine. It terminates with the following error: "rank 1 in job 78 host_40793 caused collective abort of all ranks exit status of rank 1: killed by signal 11"

My machine architecture is as follows. Processor (CPU): Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz

OS: Linux 2.6.22.19-0.1-default x86_64

System: openSUSE 10.3 (x86_64)

KDE: 3.5.7 "release 72.9"

MPI library: mpich

I have already run this code on an IBM cluster with the AIX operating system, using 8 processors, and there was no problem (even with a higher number of control volumes).

When I reduce the number of control volumes, it works even on the quad core.

For example, I can run on a single processor with 1 million control volumes, but I cannot use more than a few thousand (say 40 K) control volumes on four processors all together. My code is block structured (it reads blocks and the divisions on each line). It is written in Fortran 77, and I am using gcc and gfortran to compile it.

In short, I am able to run on four processors of the quad-core machine with a very small number of control volumes. But when I increase the number of control volumes, it crashes with the above error.

I don't know much about MPI, so it would be great if somebody could throw some light on how to run on a quad-core machine with a higher number of control volumes.

Please let me know if you need more information to answer.

Regards Prad
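
For readers unfamiliar with the message above: "killed by signal 11" means the operating system terminated rank 1 with signal number 11. On Linux this number corresponds to SIGSEGV, a segmentation fault (an invalid memory access). A minimal Python sketch for looking up the name on your own system follows; the helper name `describe_signal` is made up for illustration.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"unknown signal {signum}"


if __name__ == "__main__":
    # Signal number taken from the MPI error message in the post above.
    print(describe_signal(11))
```

The name only says what kind of failure occurred, not where in the code it happened.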

January 14, 2009, 13:54 (Velan, #2)

I had a similar problem. I ran my job with 1 or 2 million grid points on one machine, but the same job would not run on another machine. I spent a lot of time on it and found one solution. It is not a machine problem; it is a compiler problem.

Check the link below (it may help you): http://www.clusterresources.com/pipe...er/004457.html

The job that ran was built with PGI compilers, and the job that gave the problem was built with Intel compilers. This error usually comes up in do loops. PGI sends the data as vectors, but Intel won't do it. With Intel compilers the job will not run even though your job doesn't require much memory. So try some other compilers.

Hope this provides you some help

- Velan

January 23, 2009, 08:59 (chandra, #3)

Well, are you sure you can use MPI on a quad-core system? As far as I know, quad-core systems are shared-memory systems suitable for OpenMP. MPI is suitable for distributed-memory systems, like Linux clusters.

Please check it out in detail.

January 23, 2009, 09:05 (Prad, #4)

Hi Chandra, as I mentioned, I am able to use MPI on the quad-core machine, and it runs with a small number of grid points (or control volumes). Moreover, my colleagues are able to run their codes on similar quad-core machines, but their solver is different.

I think the problem may be with the compiler, as Velan mentioned, or with the memory allocation, or with something else I have not been able to figure out.

Regards Prad

January 23, 2009, 09:12 (Prad, #5)

Hi Velan,

I tried to change to PGI, but I have some problems with the makefile, so I am not able to compile with PGI. I am not sure how to write a makefile suitable for PGI.

Can you suggest how to set up the makefile for PGI? I changed the existing file; below you can see how my new makefile for PGI looks. What is wrong with it?

SYSNAME = pgi.x86Linux
DEFTARGET = fast
USECPP = FALSE
MOVEFOBJS = FALSE
MOVECOBJS = FALSE
USEINLINE = FALSE
AUTOINLINE = FALSE
EXTRAPAROBJS = FALSE
FDEFINES = -DU77 -DX86LINUX $(XTRADEF)
CPPDFLAGS = -traditional-cpp -E -P -M
CPPD = gcc
CPPDTYP = CPPGNU
CPPFLAGS = -traditional-cpp -E -P
CPP = gcc
CPPTYP = CPPGNU
MACHOPT =
EXPSUB =
EXPFILE =
FFLAGSFAST = -fast -tp p6 -Mdalign -c -byteswapio
FFLAGSPROF = -pg $(FFLAGSFAST)
FFLAGSDEBUG = -g -c -byteswapio -Mbounds
FFLAGSPAR = (need only be set if EXTRAPAROBJS is TRUE)
FFLAGSPARPRF = (need only be set if EXTRAPAROBJS is TRUE)
FFLAGSPARDBG =
FFLAGOBJNAM = -o
FC = pgf77
FCPAR = (need only be set if EXTRAPAROBJS is TRUE)
CDEFINES = -DSUBNAMUNDERSCORE -DGNU
CFLAGSFAST = -c
CFLAGSPROF = -pg $(CFLAGSFAST)
CFLAGSDEBUG = -g -c
CFLAGSPAR = (need only be set if EXTRAPAROBJS is TRUE)
CFLAGSPARPRF = (need only be set if EXTRAPAROBJS is TRUE)
CFLAGSPARDBG = (need only be set if EXTRAPAROBJS is TRUE)
CC = gcc
CCPAR = (need only be set if EXTRAPAROBJS is TRUE)
LIBS =
LIBSPROF = $(LIBS)
LIBSPAR =
LIBSPARPROF = $(LIBSPAR)
LDFLAGSFAST =
LDFLAGSPROF = -pg
LDFLAGSDEBUG =
LDFLAGSPAR =
LDFLAGSPARPRF = -pg
LDFLAGSPARDBG =
LINK = pgf77
LINKPAR = mpif77

January 23, 2009, 09:20 (chandra, #6)

If so, the compiler may be the problem. I've also faced problems in the past because of the compiler. When I changed my compiler from GCC to Intel's ICC for my OpenMP code, the same code ran very well on the same machine. So, if possible, please try changing the compiler and re-run the code.

January 23, 2009, 09:26 (Tom, #7)

Have you tried running your code in debug mode? I've run MPI using the Intel compiler without any problems (you may need to type ulimit -s unlimited before running the code, though!).
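
The `ulimit -s unlimited` hint above concerns the per-process stack limit. Below is a minimal sketch, in Python rather than the shell, that reports the soft and hard stack limits of the current process and compares them with the size of a hypothetical work array of 1 million double precision values. Whether the code discussed here really places large arrays on the stack is an assumption, not something established in this thread, and the helper names are illustrative.

```python
import resource


def format_limit(value: int) -> str:
    """Render an rlimit value the way `ulimit -s` does (KiB or 'unlimited')."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits() -> tuple:
    """Return the (soft, hard) stack limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


def array_kib(n_values: int, bytes_per_value: int = 8) -> int:
    """Size in KiB of an array of n_values entries (8 bytes = double precision)."""
    return n_values * bytes_per_value // 1024


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("stack limit (soft, hard):", soft, hard)
    # Hypothetical example: one work array holding 1 million doubles.
    print("1 million doubles:", array_kib(1_000_000), "KiB")
```

The limit applies to the shell in which `ulimit` is typed and to processes started from that shell, so it has to be set in the same session that launches the run.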

Also, as chandra says above, it's not a good idea to run MPI on an Intel quad core (my experience is that it will actually run slower than a single core, due to each CPU flushing the shared cache and reloading it with its own data).

The shared cache is really a big problem on Intel quad cores, since you only tend to get good scaling when the data that all 4 cores are using fits into cache at the same time.

January 23, 2009, 10:20 (Velan, #8)

Hi Prad,

I used the Rocks version of PGI, which is very simple to compile with. For a fast reply, post your queries and errors at

http://www.pgroup.com/userforum/index.php

They will help you in more detail with how to compile it.

January 23, 2009, 14:51 (Jed, #9)

The memory bandwidth issue is pretty fundamental, not specifically an MPI issue. You should get fine multicore performance for matrix assembly and residual evaluation. Everything will be poor for sparse linear algebra since one core can pretty much saturate the memory bandwidth for the entire socket. Getting significant benefit from multiple cores in the sparse matrix kernels requires quite a lot of tricks, see http://crd.lbl.gov/~oliker/papers/SIAMPP08-oliker.pdf and note that several techniques that make the final pthreads implementation impressive can also be applied to the MPI version.
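
The bandwidth argument above can be made concrete with a rough roofline-style estimate. This is a swapped-in back-of-the-envelope model, not something taken from the linked paper. It assumes a sparse matrix-vector product in CSR storage with 8-byte values and 4-byte column indices, counts 2 flops per nonzero, and ignores vector and row-pointer traffic. The peak and bandwidth figures in the example are hypothetical placeholders, not measurements of any machine in this thread.

```python
def spmv_intensity(value_bytes: int = 8, index_bytes: int = 4) -> float:
    """Flops per byte of matrix data for CSR SpMV: one multiply and one add
    per nonzero, which costs one stored value plus one column index."""
    return 2.0 / (value_bytes + index_bytes)


def attainable_gflops(cores: int,
                      peak_per_core_gflops: float,
                      socket_bandwidth_gb_s: float,
                      intensity: float) -> float:
    """Roofline-style bound: the lower of compute peak and memory-fed rate.
    The memory bandwidth belongs to the socket and does not grow with cores."""
    compute_bound = cores * peak_per_core_gflops
    memory_bound = socket_bandwidth_gb_s * intensity
    return min(compute_bound, memory_bound)


if __name__ == "__main__":
    intensity = spmv_intensity()
    # Hypothetical machine: 10 GFlop/s peak per core, 6 GB/s per socket.
    for cores in (1, 2, 4):
        rate = attainable_gflops(cores, 10.0, 6.0, intensity)
        print(cores, "core(s):", round(rate, 3), "GFlop/s")
```

Under these assumed numbers the bound is the same for 1, 2 and 4 cores, which is the sense in which one core can saturate the socket.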

The current advice is to get a Nehalem (Core i7) if you want better memory bandwidth. Otherwise, just buy sockets; the number of cores and their speed are much less relevant than the number of sockets and the speed of the bus.

January 24, 2009, 20:23 (Tom, #10)

Well, since I've got unlimited access to 3 supercomputers which are (essentially) free of the problems I described to the original poster, that's not particularly good advice - I just use a quad core at home for messing about. Basically, on a quad core it's a bad idea to use MPI and, as Intel have reported in their own research, once your data exceeds a certain size you essentially aren't any better off using more than 2 cores.

January 25, 2009, 12:00 (TG, #11)

MPI on quad cores is neither good nor bad. It's just a means to communicate between different processes. If your algorithm is memory bandwidth intensive, no means of inter-process communication will keep the pipes full and your performance will suffer. If your algorithm is compute intensive and your memory bandwidth needs are low, it will work just fine. It's not MPI that is the problem - it's the algorithm that determines whether it will scale well on quads or not.

January 25, 2009, 14:52 (Tom, #12)

"If your algorithm is compute intensive and your memory bandwidth needs are low, it will work just fine"

That's the point (along with the fact that MPI send/receive can cause problems with the shared cache): most CFD calculations are going to run into the bandwidth problem on quad cores fairly quickly.

A simple example is to use Jacobi iteration (a highly scalable, "bit reproducible" algorithm) to solve Poisson's equation on an Intel quad core: you'll get perfect scaling on a 360x360 grid. Now redo the calculation on a 720x720 grid and you'll find it difficult to even halve the computational time (basically, two cores is almost optimal).

In contrast, you also get the occasional "super scaling" by going from 1 core to 2 (try the same problem on a 720x360 grid!) and no further improvement for 3 or 4 cores.

This is just something that you need to be wary of when your code has to run efficiently on a number of different parallel architectures.
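
To put rough numbers on the grid sizes mentioned above, here is a minimal pure-Python sketch of one Jacobi sweep for Poisson's equation together with a working-set estimate. It assumes three double precision arrays per grid (old solution, new solution, right-hand side); the actual storage in the code Tom ran is not known, and the function names are illustrative.

```python
def working_set_bytes(nx: int, ny: int, arrays: int = 3,
                      bytes_per_value: int = 8) -> int:
    """Bytes touched per sweep, assuming `arrays` full-size arrays of doubles."""
    return nx * ny * arrays * bytes_per_value


def jacobi_sweep(u, f, h):
    """One Jacobi sweep for -laplace(u) = f on a uniform grid with spacing h.
    Boundary values of u are kept fixed; a new grid is returned."""
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            new[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i]
                                + u[j][i - 1] + u[j][i + 1]
                                + h * h * f[j][i])
    return new


if __name__ == "__main__":
    for nx, ny in ((360, 360), (720, 360), (720, 720)):
        mb = working_set_bytes(nx, ny) / 1e6
        print(f"{nx}x{ny}: {mb:.2f} MB")
```

Under the three-array assumption the working sets come to about 3.1 MB, 6.2 MB and 12.4 MB. For comparison, the Q9550 named in the first post has 12 MB of L2 cache in total, organized as 6 MB shared by each pair of cores. Whether this alone accounts for the scaling behaviour described above is not established here.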

January 29, 2009, 18:47 (hahnpv, #13)

Hi Prad,

It is entirely possible there is something wrong in your algorithms that is a function of the number of cores and the number of control volumes.

I have my own home-brew CFD code which I recently parallelized in MPI and ran into similar issues. I wouldn't blame your Intel box until you run the exact same case on the supercomputer with the same number of ranks.

I agree with the arguments posed by Tom and others but that shouldn't cause it to crash in this manner.

Philip
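
One way a failure can depend on both the number of ranks and the number of control volumes is a fixed array bound: standard Fortran 77 has no dynamic allocation, so arrays are usually dimensioned at compile time. The sketch below is purely hypothetical and is not taken from Prad's code. It distributes blocks over ranks with a simple greedy rule and reports which ranks would exceed an assumed compile-time capacity (`max_cells_per_rank`) once ghost cells are included. All names and the assignment rule are assumptions for illustration.

```python
def cells_with_halo(ni: int, nj: int, nk: int, ghost: int = 1) -> int:
    """Cells stored for one block, including `ghost` halo layers per side."""
    return (ni + 2 * ghost) * (nj + 2 * ghost) * (nk + 2 * ghost)


def assign_blocks(blocks, n_ranks: int, ghost: int = 1):
    """Greedy assignment: largest block first, always to the lightest rank.
    Returns the number of stored cells per rank."""
    loads = [0] * n_ranks
    ordered = sorted(blocks, key=lambda b: cells_with_halo(*b, ghost),
                     reverse=True)
    for block in ordered:
        lightest = loads.index(min(loads))
        loads[lightest] += cells_with_halo(*block, ghost)
    return loads


def ranks_over_capacity(loads, max_cells_per_rank: int):
    """Ranks whose stored cells exceed the assumed compile-time array size."""
    return [rank for rank, load in enumerate(loads)
            if load > max_cells_per_rank]


if __name__ == "__main__":
    # Hypothetical mesh: one large block and one small block on 2 ranks.
    loads = assign_blocks([(20, 10, 10), (10, 10, 10)], n_ranks=2)
    print("cells per rank:", loads)
    print("ranks over capacity:", ranks_over_capacity(loads, 2000))
```

A check of this kind only narrows the search. Compiling with array bounds checking, for example the `-Mbounds` flag already present in the FFLAGSDEBUG line of the makefile above, is a more direct way to find an out-of-range access.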

February 9, 2009, 15:28 (Prad, #14)

Hi Philip,

I was away for the last two weeks. I think your suggestion is the most applicable to my case. This is a very old code. Many people have worked on it and added a lot of stuff, so now it is a really huge code with lots of problems. But I am supposed to work with this code only. It is also block structured, so it reads the mesh as blocks. Sometimes the same number of mesh points with a different number of blocks also causes problems at compilation. Older computers sometimes allow you to compile with a higher number of mesh points than newer processors and newer operating systems.

I ran the code on the supercomputer. It works much better there, and it doesn't give any compilation problems or run-time problems. The only problem is on local machines, both single-core and quad-core.

Can you elaborate on the issues you faced and how you solved them? That may be helpful to me. Thanks in advance, Prad

