Code Saturne mass sources MPI problem

April 19, 2010, 08:06   #1
Patrick Wollny (Pat84)

Hello,

I have written a routine to simulate effusion cooling holes using mass sources in the user subroutine "ustsma.f90", but when I try to run the case on more than one processor, the calculation is killed by MPI.

The error message in the terminal looks like this:
Code:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 13747 on
node bk-003z-kp exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
Error running the calculation.

Check Kernel log (listing) and error* files for details


  ********************************************
         Error in calculation stage.
  ********************************************
The listing shows this error message:

Code:
-------------------------------------------------------------


ALL PHASES  : 1D-WALL THERMAL MODULE NOT ACTIVATED 
                 NFPT1D =          0

-------------------------------------------------------------


PHASE      1 : HEAD LOSS TREATMENT NOT ACTIVATED 
                 NCEPDC =          0

-------------------------------------------------------------

SIGSEGV signal (forbidden memory area access) intercepted!

Call stack:
   1: 0x414f2a     <ustsma_+0x3278>                 (cs_solver)
   2: 0x7fdcf8c99632 <caltri_+0x3882>                 (libsaturne.so.0)
   3: 0x7fdcf8c76476 <cs_run+0x856>                   (libsaturne.so.0)
   4: 0x7fdcf8c7674d <main+0x1fd>                     (libsaturne.so.0)
   5: 0x7fdcf6d2e586 <__libc_start_main+0xe6>         (libc.so.6)
   6: 0x40d1a9     ?                                (?)
End of stack
Is it possible that I am not allowed to use the "save" attribute in variable declarations? That would be fatal, because I need to keep variables for the following time steps.

Does anyone have an idea?

Thanks, Patrick

April 19, 2010, 10:28   #2
Patrick Wollny (Pat84)

Hello again,

I have made a simple test case to analyse the problem, and it turns out to be a bit tricky.

In that case I do not save any variables. With one processor the calculation runs fine, but in a multi-processor run the process ends with an error at a very early stage.

Here is the error message appearing in the listing file:

Code:
ALL PHASES  : 1D-WALL THERMAL MODULE NOT ACTIVATED 
                 NFPT1D =          0

-------------------------------------------------------------


PHASE      1 : HEAD LOSS TREATMENT NOT ACTIVATED 
                 NCEPDC =          0

-------------------------------------------------------------


PHASE      1 : MASS SOURCE TERMS TREATMENT ACTIVATED 
                 ON A TOTAL OF          4 CELLS
-------------------------------------------------------------



===============================================================



                       MAIN CALCULATION                      
                       ================                      


===============================================================




===============================================================


 INSTANT    0.200000000E-04   TIME STEP NUMBER               1
 ============================================================= 


 --- Phase:          1
 ---------------------------------                           
 Property   Min. value  Max. value                           
 ---------------------------------                           
  Density   0.1161E+02  0.1161E+02
  visc.lam  0.4693E-04  0.4693E-04
  visc.tur  0.7198E-01  0.7198E-01
 ---------------------------------                           

 --- Diffusivity:                                            
 ---------------------------------------                     
 Scalar   Number  Min. value  Max. value                     
 ---------------------------------------                     
 Temp          1  0.6379E-01  0.6379E-01
 ---------------------------------------                     

SIGFPE signal (floating point exception) intercepted!

Call stack:
   1: 0x40f5e7     <ustsma_+0x157>                  (cs_solver)
   2: 0x7f4cdb7ef2c1 <tridim_+0x2ac1>                 (libsaturne.so.0)
   3: 0x7f4cdb639d4c <caltri_+0x4f9c>                 (libsaturne.so.0)
   4: 0x7f4cdb615476 <cs_run+0x856>                   (libsaturne.so.0)
   5: 0x7f4cdb61574d <main+0x1fd>                     (libsaturne.so.0)
   6: 0x7f4cd96cd586 <__libc_start_main+0xe6>         (libc.so.6)
   7: 0x40cb39     ?                                (?)
End of stack
A floating point exception? Why was there no floating point exception in the previous single-processor calculation?!


If you want to try it yourself, you can find the SRC files and the mesh in the attachments.

Edit:

This is the corresponding line from the working single-processor calculation:
Code:
-------------------------------------------------------------


PHASE      1 : MASS SOURCE TERMS TREATMENT ACTIVATED 
                 ON A TOTAL OF          2 CELLS
-------------------------------------------------------------
You can see that there are only 2 cells with mass sources, while in the 2-processor case above we got 4 cells (2x2).


The same with 4 processors:
Code:
-------------------------------------------------------------


PHASE      1 : MASS SOURCE TERMS TREATMENT ACTIVATED 
                 ON A TOTAL OF          8 CELLS
-------------------------------------------------------------
8 cells (4x2), where 2 cells would be correct.
Attached Files: SRC.tar.gz (43.6 KB), student-test-1.tar.gz (76.8 KB)

Last edited by Pat84; April 19, 2010 at 10:59.

April 19, 2010, 11:31   #3
Yvan Fournier

Hello,

There is an error in your programming of ustsma.f90:

Instead of looping on cells selected through getcel (see reference example), you seem to be interested only in lstelt(1), and later on icetsm(1) or icetsm(2).

This means your programming is mesh-numbering dependent, and in parallel, if the subdomains assigned to some processors do not have any cells in the source term area, arrays such as lstelt and icetsm are empty for those processors, but you are still trying to access those arrays.

Try to follow the reference examples, at least for the way of looping on cells (with your own values, of course).
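For illustration, a minimal sketch of that pattern, following the loop in the reference ustsma.f90 (the selection string is only an example):
Code:
  ieltsm = 0
  call getcel('X > 250.0 and X < 500.0', nlelt, lstelt)
  do ilelt = 1, nlelt
    ii = lstelt(ilelt)
    ieltsm = ieltsm + 1
    if (iappel.eq.2) icetsm(ieltsm) = ii
  enddo
On ranks whose subdomain contains none of the selected cells, getcel returns nlelt = 0, so the loop body is never executed and no empty array is accessed.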

Also, version 2.0-rc1, announced on the forum at http://code-saturne.info, has a version of ustsma.f90 with English comments, which may help (it also has several bug fixes compared to beta-2 and more functionality in the GUI).

Best regards,

Yvan Fournier

April 19, 2010, 18:29   #4
Patrick Wollny (Pat84)

Hello Yvan,

thank you for your attention. In the base ustsma.f90 the cells are selected via "CALL getfbr" or "CALL getcel".

Copy & Paste :
Code:
  iutile = 0
  if(iutile.eq.1) then

    if(iphas.eq.1) then

      ieltsm = 0

!     Cells whose centre satisfies 250 < x < 500

      CALL GETCEL('X < 500.0 and X > 250.0',NLELT,LSTELT)
      do ilelt = 1, nlelt
        ii = lstelt(ilelt)
        ieltsm = ieltsm + 1
        if (iappel.eq.2) icetsm(ieltsm) = ii
      enddo


!     Boundary cells having a face with colour 3

      CALL GETFBR('3',NLELT,LSTELT)
      do ilelt = 1, nlelt
        ifac = lstelt(ilelt)
        ii   = ifabor(ifac)
!       This cell is only counted if it has not already been seen;
!         for that we take the negation of the previous test
        if (.not.(xyzcen(1,ii).lt.500.d0.and.                     &
                  xyzcen(1,ii).gt.250.d0)    )then
          ieltsm = ieltsm + 1
          if (iappel.eq.2) icetsm(ieltsm) = ii
        endif
      enddo

    else
      ieltsm = 0
    endif

  endif
In my simple example I make a call for every cell, but in my effusion cooling routine I make one call for all boundary faces and pick the right cell with an algorithm that searches for the smallest distance between a point and a boundary face centre (I need the cell next to the wall -> ifabor).

Is it also possible to loop from i = 1, nfabor and find the right cell with an if test (if (xyzcen(1,i).lt.1234 .and. xyzcen(1,i).gt.1234 .and. xyzcen(2,i).lt. ...))? Would that still run with multiple processors and MPI?

April 19, 2010, 19:27   #5
Yvan Fournier

Hello Patrick,

I am not sure I understand what you are trying to do, but in any case,
getcel, getfbr, and getfac are intended to obtain a list of cells, boundary faces, or interior faces, respectively, matching a given selection criterion (the string argument in the call). These functions return the number of local (i.e. for that processor) elements matching the criterion and fill a list of those elements. Used this way, if no local elements match the criterion, nlelt = 0, and loops on nlelt are empty (avoiding any problem of accessing empty arrays).

The different possible selection criteria are described in the user documentation PDF (that of version 2.0-rc1 can be generated with "make pdf" for the Kernel). You may use either group names or geometric primitives, such as "x < 2", plane[...], box[...], sphere[...].

So if an element matching such a criterion exists on any processor, using the correct selection criterion will find it on the processor handling it and return an empty list on the others.

If you need something more complex, such as filtering candidate cells using getcel and then requiring the cell closest to a given point, you may use code similar to the one used for probes. In that case, take a look at the code calling "findpt" in the main (non-user) code, and possibly at the utility function prototypes in include/base/cs_parall.h or the examples in users/base/usproj.f90.

Note: if you have only a few holes, you may use a criterion such as

getcel('box[xmin1, ymin1, zmin1, xmax1, ymax1, zmax1] or box[xmin2, ymin2, zmin2, xmax2, ymax2, zmax2]', nlelt, lstelt)

but if you have many holes, it may be cheaper to simply loop on cells and select the matching ones yourself, for example (if the holes are regularly spaced along x and can be tested with a "modulo"-type condition):

nlelt = 0
do iel = 1, ncel
  ! dx is the (assumed) regular hole spacing along x, x1 the first hole position
  if (abs(mod(xyzcen(1,iel) - x1, dx)) .lt. 1.d-2) then
    nlelt = nlelt + 1
    lstelt(nlelt) = iel
  endif
enddo

Look for similar coding in the examples from uskpdc.f90.

Best regards,

Yvan Fournier

April 20, 2010, 05:41   #6
Patrick Wollny (Pat84)

Hi Yvan,

My code has been ready since the beginning of last weekend. I search for my cells with this algorithm:

Code:
      ! For each hole centre (nbohr holes, coordinates in mitts), find the
      ! closest boundary face: cdgfbo holds the face centres, minabst the
      ! smallest distance found so far, mittfaces the corresponding face.
      if (iappel.eq.1) then
        call getfbr('all[]', nlelt, lstelt)
        do ilelt = 1, nlelt
          ifac = lstelt(ilelt)
          do zz = 1, nbohr
            if (ilelt.eq.1) then
              minabst(zz) = sqrt((mitts(zz,1)-cdgfbo(1,ifac))**2          &
                               + (mitts(zz,2)-cdgfbo(2,ifac))**2          &
                               + (mitts(zz,3)-cdgfbo(3,ifac))**2)
              mittfaces(zz) = ifac
            else
              abstand = sqrt((mitts(zz,1)-cdgfbo(1,ifac))**2              &
                           + (mitts(zz,2)-cdgfbo(2,ifac))**2              &
                           + (mitts(zz,3)-cdgfbo(3,ifac))**2)
              if (abstand.lt.minabst(zz)) then
                minabst(zz) = abstand
                mittfaces(zz) = ifac
              endif
            endif
          enddo
        enddo
With one processor the cells are found correctly and no problem appears.
The second cell, on the opposite side, I find by searching for the face that lies on the hole axis; that is done with an LU decomposition.

The best would be if I could perform the first and second call (iappel) on a single processor, because splitting a hole apart (the inlet and outlet sources of a cooling hole ending up on different parts of the grid, on different CPUs) would break my calculation.

April 20, 2010, 07:42   #7
Yvan Fournier

Hello Patrick,

Just a minor comment on your program: since you loop on all boundary faces,

call getfbr('all[]', nlelt, lstelt)
do ielt = 1, nlelt
ifac = lstelt(ielt)
...
enddo

can simply be replaced by

do ifac = 1, nfabor
...
enddo

which may be slightly faster, and should give you the same results (possibly with a change of numbering).

For the more important stuff, there is no way to guarantee that the domain partitioning will not split a hole between 2 processors. You may check the partitioning graphically (the postprocessing output has a time-independent field showing the partitioning, whether in EnSight, MED, or CGNS format), and even if your calculation crashed when computing the source terms, that field should be available, as it is output at an earlier stage.

Partitioning tools like METIS or SCOTCH can be given weights to help them keep neighboring elements on the same partition, but I am not sure this can be guaranteed 100%, so we do not currently offer this possibility in Code_Saturne.

If you have no more than 100-200 holes with only a few values per hole, you could handle holes the way we handle probes.

- For each hole and each processor, find the cell centre closest to the hole centre, using a call to findpt (see findpt.f90), which returns both a cell number (node) and the associated processor number (ndrang), identically on all ranks.
Keep both in an array, either in a new common block or in a section of ituser.

In your case, you may not need just hole centers but rather hole inlet centers and hole outlet centers; the idea is the same.

- For each point/hole, you may broadcast the associated variable values from the processor containing the point to all others, using a call to parbcr (see the example in usproj.f90). The arguments to parbcr are: owner_rank, n_values, values, where owner_rank for each point is the associated processor number that was saved in an array previously.

Look at the code in src/base/ecrhis.f90 (probe values output) for illustration if you need.

- At this stage, all processors should have a copy of all the hole values.
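In rough form, those two steps look something like this (a sketch only: holecel, holernk and holeval are placeholder arrays, mitts/nbohr are the hole arrays from your routine, and the exact findpt argument list should be checked against findpt.f90):
Code:
  ! locate, for each hole centre in mitts, the closest cell and its owner rank
  do zz = 1, nbohr
    call findpt(ncelet, ncel, xyzcen,                                 &
                mitts(zz,1), mitts(zz,2), mitts(zz,3),                &
                node, ndrang)
    holecel(zz) = node    ! local cell number on the owning rank
    holernk(zz) = ndrang  ! rank that owns this cell
  enddo

  ! ... later, once the owning rank has computed holeval(zz) ...
  do zz = 1, nbohr
    ! broadcast 1 real value from the owner rank to all other ranks
    if (irangp.ge.0) call parbcr(holernk(zz), 1, holeval(zz))
  enddo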

Now if the holes spread over several cells and you need to map individual inlet cells to outlet cells, this may not be enough, and you would need to go into Code_Saturne's C code and use fvm_locator type functions (used in parallel coupling with SYRTHES4, coupling of Code_Saturne with itself, turbomachinery applications, and cooling towers). These are designed to work in parallel, but require slightly more advanced coding, at least calling your own C functions from user Fortran subroutines, as this API is not available from the Fortran part of the code.

Best regards,

Yvan Fournier

April 20, 2010, 08:37   #8
Patrick Wollny (Pat84)

Hello,

I think the best way to solve my problem is to execute my subroutine on only a single processor. Is there a subroutine / method to force the program, when running on a multi-processor machine, to execute parts / individual subroutines (here ustsma) on only one processor? If so, how can I make the result available to all processors afterwards?

best regards

Patrick

April 21, 2010, 06:29   #9
Patrick Wollny (Pat84)

Hello again,

I have tried using findpt and parbcr with the simple case I attached in my second post.

It works, but it is not very convenient to handle once there are many more sources. Finding the right face together with its processor number, so that each processor handles its own hole inlets and outlets, would need no communication and would be more convenient. The problem is that the partitioner must then keep the inlet and outlet of a hole together. Or is it possible to do the partitioning of the grid myself? It would then have to be possible to feed Code_Saturne with the partitioning data from METIS.

Please also answer my question in the post above, so that I have more options for parallelising my subroutine.

Best regards,
Patrick

April 21, 2010, 09:02   #10
Yvan Fournier

Hello,

For your previous question, you could call a subroutine on only a given processor (including paral.h and using "if (irangp.eq.0) then ..." for example), but each processor holds only a part of the data, so that would basically be useless for your needs.
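The bare mechanism would look something like this (a sketch; myval and nval are placeholder names, and as said it only helps if rank 0 actually holds all the data it needs):
Code:
  if (irangp.le.0) then
    ! serial run (irangp = -1) or rank 0 of a parallel run
    ! ... compute myval(1:nval) here ...
  endif
  if (irangp.ge.0) then
    ! parallel run: broadcast nval real values from rank 0 to all ranks
    call parbcr(0, nval, myval)
  endif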

Partitioning the grid yourself would thus be the best solution, and you have 2 alternatives:

- either modify src/appli/cs_partition.c in the Code_Saturne preprocessor's tree (requires some programming).

- or modify src/base/cs_preprocessor_data.c from the Code_Saturne Kernel code (just copy the modified version with your user subroutines).

In this case, you need to modify the _cell_rank_by_sfc function, replacing the call to fvm_io_num_create_from_coords and the following loop defining cell_rank (lines 1661-1669 in Code_Saturne 2.0-rc1) to define your own cell rank (using 0 to p-1 numbering). For a given cell i, the cell center's x, y, and z coordinates are obtained respectively by cell_center[i*3], cell_center[i*3 +1], and cell_center[i*3 +2] (that array is computed above the area you should modify).

You may thus define your own partitioning based on cell coordinates quite easily, but be aware that the quality of your partitioning (i.e. good load balancing as well as minimizing number of neighbors and number of faces on process boundaries) has a major impact on code performance.

If you define your own partitioning in this way, you will also need to tell the code not to run the partitioner in the runcase script by setting EXEC_PARTITION=no around line 215 (it is set to yes by default),
and you need to make sure that a previous PARTITION_OUTPUT is not in your case's DATA directory (otherwise, the code will use that partitioning instead, though you could also modify cs_preprocessor_data.c further to avoid this).

Best regards,
Yvan Fournier
