parallel computing fails on two nodes with openfoam7

February 27, 2020, 11:31   #1
cake_rmy
New Member | Join Date: Feb 2020 | Posts: 1
I am trying to run a parallel computing case in my environment. I have two nodes: one is the master and the other runs as the slave node. I do not use an NFS filesystem; OpenFOAM 7 is installed with CentOS 7 on each computer individually, with the same hardware and the same versions of the OS and OpenFOAM. The OpenMPI version is 4.0.2.
I can successfully run the parallel case on each computer individually with multiple CPUs, but running the case across the two nodes concurrently fails.
The tutorial test case is motorBike, run with the code shipped with OpenFOAM 7.
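
Because there is no shared filesystem, the decomposed case has to be available on both nodes. For reference, here is a minimal sketch of the system/decomposeParDict entries that are typically used for such a distributed (non-NFS) run; the method and the roots paths below are placeholders, not my actual settings:

// system/decomposeParDict (sketch only; paths are placeholders)
numberOfSubdomains  6;
method              scotch;

// hold the case data on each node's local disc instead of NFS
distributed         yes;

// one entry per slave process (every process except the master), each giving
// the local directory under which that node's copy of the case lives;
// here assuming ranks 0-2 land on node1 and ranks 3-5 on node2
roots
(
    "/local/run/on/node1"
    "/local/run/on/node1"
    "/local/run/on/node2"
    "/local/run/on/node2"
    "/local/run/on/node2"
);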

The command line is:

mpirun --allow-run-as-root -np 6 --hostfile machines simpleFoam -parallel
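
For reference, the machines hostfile uses the standard Open MPI hostfile format; a sketch with placeholder hostnames and slot counts (not copied from my actual file):

# machines -- Open MPI hostfile; hostnames and slot counts are placeholders
node1 slots=3
node2 slots=3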

The error message is as follows:

[root@A23865399 motorBike]# mpirun --allow-run-as-root -np 6 --hostfile machines simpleFoam -parallel
[A23865398:96825:0:96825] ud_iface.c:763 Fatal: transport error: Endpoint timeout
[A23865398:96826:0:96826] ud_iface.c:763 Fatal: transport error: Endpoint timeout
[A23865398:96824:0:96824] ud_iface.c:763 Fatal: transport error: Endpoint timeout
==== backtrace (tid: 96825) ====
0 0x00000000000474b0 ucs_fatal_error_message() ???:0
1 0x0000000000047655 ucs_fatal_error_format() ???:0
2 0x000000000003bf81 uct_ud_iface_dispatch_async_comps_do() ???:0
3 0x0000000000043e54 uct_ud_mlx5_ep_t_delete() ???:0
4 0x000000000001eaa2 ucp_worker_progress() ???:0
5 0x0000000000003697 mca_pml_ucx_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/pml/ucx/pml_ucx.c:515
6 0x0000000000036d0c opal_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/opal/runtime/opal_progress.c:231
7 0x00000000000bead9 wait_completion() hcoll_collectives.c:0
8 0x000000000001c96d comm_allreduce_hcolrte_generic() common_allreduce.c:0
9 0x000000000001d08b comm_allreduce_hcolrte() ???:0
10 0x0000000000013a2b hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
11 0x00000000000cb1cc hmca_bcol_base_init() ???:0
12 0x0000000000049c88 hmca_coll_ml_init_query() ???:0
13 0x00000000000bf897 hcoll_init_with_opts() ???:0
14 0x0000000000004e53 mca_coll_hcoll_comm_query() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
15 0x00000000000789fd query_2_0_0() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/base/coll_base_comm_select.c:449
16 0x00000000000adc5d ompi_mpi_init() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/runtime/ompi_mpi_init.c:957
17 0x000000000006ad6d PMPI_Init_thread() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mpi/c/profile/pinit_thread.c:67
18 0x0000000000005d60 Foam::UPstream::init() ???:0
19 0x000000000033675b Foam::argList::argList() ???:0
20 0x000000000041d180 main() ???:0
21 0x0000000000022505 __libc_start_main() ???:0
22 0x000000000041fd6a _start() ???:0
=================================
[A23865398:96825] *** Process received signal ***
[A23865398:96825] Signal: Aborted (6)
[A23865398:96825] Signal code: (-6)
[A23865398:96825] [ 0] /lib64/libc.so.6(+0x363b0)[0x7fc85a58c3b0]
[A23865398:96825] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fc85a58c337]
[A23865398:96825] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fc85a58da28]
[A23865398:96825] [ 3] /lib64/libucs.so.0(ucs_fatal_error_message+0x55)[0x7fc84d0174b5]
[A23865398:96825] [ 4] /lib64/libucs.so.0(+0x47655)[0x7fc84d017655]
[A23865398:96825] [ 5] /lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121)[0x7fc84cba5f81]
[A23865398:96825] [ 6] /lib64/ucx/libuct_ib.so.0(+0x43e54)[0x7fc84cbade54]
[A23865398:96825] [ 7] /lib64/libucp.so.0(ucp_worker_progress+0x22)[0x7fc84d777aa2]
[A23865398:96825] [ 8] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7fc84dbb0697]
[A23865398:96825] [ 9] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libopen-pal.so.40(opal_progress+0x2c)[0x7fc85579dd0c]
[A23865398:96825] [10] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0xbead9)[0x7fc84755cad9]
[A23865398:96825] [11] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0x1c96d)[0x7fc8474ba96d]
[A23865398:96825] [12] /opt/mellanox/hcoll/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7fc8474bb08b]
[A23865398:96825] [13] /opt/mellanox/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x13a2b)[0x7fc83f14aa2b]
[A23865398:96825] [14] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7fc8475691cc]
[A23865398:96825] [15] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7fc8474e7c88]
[A23865398:96825] [16] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7fc84755d897]
[A23865398:96825] [17] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x103)[0x7fc8477dce53]
[A23865398:96825] [18] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(mca_coll_base_comm_select+0x2dd)[0x7fc857c449fd]
[A23865398:96825] [19] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(ompi_mpi_init+0xc6d)[0x7fc857c79c5d]
[A23865398:96825] [20] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(PMPI_Init_thread+0x7d)[0x7fc857c36d6d]
[A23865398:96825] [21] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/openmpi-system/libPstream.so(_ZN4Foam8UPstream4initERiRPPcb+0x20)[0x7fc85a34cd60]
[A23865398:96825] [22] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbbb+0xbdb)[0x7fc85b67d75b]
[A23865398:96825] [23] simpleFoam[0x41d180]
[A23865398:96825] [24] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc85a578505]
[A23865398:96825] [25] simpleFoam[0x41fd6a]
[A23865398:96825] *** End of error message ***
==== backtrace (tid: 96826) ====
0 0x00000000000474b0 ucs_fatal_error_message() ???:0
1 0x0000000000047655 ucs_fatal_error_format() ???:0
2 0x000000000003bf81 uct_ud_iface_dispatch_async_comps_do() ???:0
3 0x0000000000043e54 uct_ud_mlx5_ep_t_delete() ???:0
4 0x000000000001eaa2 ucp_worker_progress() ???:0
5 0x0000000000003697 mca_pml_ucx_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/pml/ucx/pml_ucx.c:515
6 0x0000000000036d0c opal_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/opal/runtime/opal_progress.c:231
7 0x00000000000bead9 wait_completion() hcoll_collectives.c:0
8 0x000000000001c96d comm_allreduce_hcolrte_generic() common_allreduce.c:0
9 0x000000000001d08b comm_allreduce_hcolrte() ???:0
10 0x0000000000013a2b hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
11 0x00000000000cb1cc hmca_bcol_base_init() ???:0
12 0x0000000000049c88 hmca_coll_ml_init_query() ???:0
13 0x00000000000bf897 hcoll_init_with_opts() ???:0
14 0x0000000000004e53 mca_coll_hcoll_comm_query() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
15 0x00000000000789fd query_2_0_0() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/base/coll_base_comm_select.c:449
16 0x00000000000adc5d ompi_mpi_init() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/runtime/ompi_mpi_init.c:957
17 0x000000000006ad6d PMPI_Init_thread() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mpi/c/profile/pinit_thread.c:67
18 0x0000000000005d60 Foam::UPstream::init() ???:0
19 0x000000000033675b Foam::argList::argList() ???:0
20 0x000000000041d180 main() ???:0
21 0x0000000000022505 __libc_start_main() ???:0
22 0x000000000041fd6a _start() ???:0
=================================
[A23865398:96826] *** Process received signal ***
[A23865398:96826] Signal: Aborted (6)
[A23865398:96826] Signal code: (-6)
[A23865398:96826] [ 0] /lib64/libc.so.6(+0x363b0)[0x7f16632503b0]
[A23865398:96826] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f1663250337]
[A23865398:96826] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f1663251a28]
[A23865398:96826] [ 3] /lib64/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f1651c034b5]
[A23865398:96826] [ 4] /lib64/libucs.so.0(+0x47655)[0x7f1651c03655]
[A23865398:96826] [ 5] /lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121)[0x7f1651791f81]
[A23865398:96826] [ 6] /lib64/ucx/libuct_ib.so.0(+0x43e54)[0x7f1651799e54]
[A23865398:96826] [ 7] /lib64/libucp.so.0(ucp_worker_progress+0x22)[0x7f1652363aa2]
[A23865398:96826] [ 8] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f165279c697]
[A23865398:96826] [ 9] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libopen-pal.so.40(opal_progress+0x2c)[0x7f165e461d0c]
[A23865398:96826] [10] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0xbead9)[0x7f1650211ad9]
[A23865398:96826] [11] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0x1c96d)[0x7f165016f96d]
[A23865398:96826] [12] /opt/mellanox/hcoll/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7f165017008b]
[A23865398:96826] [13] /opt/mellanox/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x13a2b)[0x7f1647e0fa2b]
[A23865398:96826] [14] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f165021e1cc]
[A23865398:96826] [15] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f165019cc88]
[A23865398:96826] [16] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f1650212897]
[A23865398:96826] [17] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x103)[0x7f1650491e53]
[A23865398:96826] [18] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(mca_coll_base_comm_select+0x2dd)[0x7f16609089fd]
[A23865398:96826] [19] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(ompi_mpi_init+0xc6d)[0x7f166093dc5d]
[A23865398:96826] [20] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(PMPI_Init_thread+0x7d)[0x7f16608fad6d]
[A23865398:96826] [21] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/openmpi-system/libPstream.so(_ZN4Foam8UPstream4initERiRPPcb+0x20)[0x7f1663010d60]
[A23865398:96826] [22] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbbb+0xbdb)[0x7f166434175b]
[A23865398:96826] [23] simpleFoam[0x41d180]
[A23865398:96826] [24] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f166323c505]
[A23865398:96826] [25] simpleFoam[0x41fd6a]
[A23865398:96826] *** End of error message ***
==== backtrace (tid: 96824) ====
0 0x00000000000474b0 ucs_fatal_error_message() ???:0
1 0x0000000000047655 ucs_fatal_error_format() ???:0
2 0x000000000003bf81 uct_ud_iface_dispatch_async_comps_do() ???:0
3 0x0000000000043e54 uct_ud_mlx5_ep_t_delete() ???:0
4 0x000000000001eaa2 ucp_worker_progress() ???:0
5 0x0000000000003697 mca_pml_ucx_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/pml/ucx/pml_ucx.c:515
6 0x0000000000036d0c opal_progress() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/opal/runtime/opal_progress.c:231
7 0x00000000000bead9 wait_completion() hcoll_collectives.c:0
8 0x000000000001ca54 comm_allreduce_hcolrte_generic() common_allreduce.c:0
9 0x000000000001d08b comm_allreduce_hcolrte() ???:0
10 0x0000000000013a2b hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
11 0x00000000000cb1cc hmca_bcol_base_init() ???:0
12 0x0000000000049c88 hmca_coll_ml_init_query() ???:0
13 0x00000000000bf897 hcoll_init_with_opts() ???:0
14 0x0000000000004e53 mca_coll_hcoll_comm_query() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
15 0x00000000000789fd query_2_0_0() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mca/coll/base/coll_base_comm_select.c:449
16 0x00000000000adc5d ompi_mpi_init() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/runtime/ompi_mpi_init.c:957
17 0x000000000006ad6d PMPI_Init_thread() /var/tmp/OFED_topdir/BUILD/openmpi-4.0.2rc3/ompi/mpi/c/profile/pinit_thread.c:67
18 0x0000000000005d60 Foam::UPstream::init() ???:0
19 0x000000000033675b Foam::argList::argList() ???:0
20 0x000000000041d180 main() ???:0
21 0x0000000000022505 __libc_start_main() ???:0
22 0x000000000041fd6a _start() ???:0
=================================
[A23865398:96824] *** Process received signal ***
[A23865398:96824] Signal: Aborted (6)
[A23865398:96824] Signal code: (-6)
[A23865398:96824] [ 0] /lib64/libc.so.6(+0x363b0)[0x7f3668fb63b0]
[A23865398:96824] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f3668fb6337]
[A23865398:96824] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f3668fb7a28]
[A23865398:96824] [ 3] /lib64/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f36578be4b5]
[A23865398:96824] [ 4] /lib64/libucs.so.0(+0x47655)[0x7f36578be655]
[A23865398:96824] [ 5] /lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121)[0x7f365744cf81]
[A23865398:96824] [ 6] /lib64/ucx/libuct_ib.so.0(+0x43e54)[0x7f3657454e54]
[A23865398:96824] [ 7] /lib64/libucp.so.0(ucp_worker_progress+0x22)[0x7f365c1a1aa2]
[A23865398:96824] [ 8] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f365c5da697]
[A23865398:96824] [ 9] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libopen-pal.so.40(opal_progress+0x2c)[0x7f36641c7d0c]
[A23865398:96824] [10] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0xbead9)[0x7f3655eccad9]
[A23865398:96824] [11] /opt/mellanox/hcoll/lib/libhcoll.so.1(+0x1ca54)[0x7f3655e2aa54]
[A23865398:96824] [12] /opt/mellanox/hcoll/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7f3655e2b08b]
[A23865398:96824] [13] /opt/mellanox/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x13a2b)[0x7f364db74a2b]
[A23865398:96824] [14] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f3655ed91cc]
[A23865398:96824] [15] /opt/mellanox/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f3655e57c88]
[A23865398:96824] [16] /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f3655ecd897]
[A23865398:96824] [17] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x103)[0x7f365614ce53]
[A23865398:96824] [18] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(mca_coll_base_comm_select+0x2dd)[0x7f366666e9fd]
[A23865398:96824] [19] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(ompi_mpi_init+0xc6d)[0x7f36666a3c5d]
[A23865398:96824] [20] /usr/mpi/gcc/openmpi-4.0.2rc3/lib64/libmpi.so.40(PMPI_Init_thread+0x7d)[0x7f3666660d6d]
[A23865398:96824] [21] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/openmpi-system/libPstream.so(_ZN4Foam8UPstream4initERiRPPcb+0x20)[0x7f3668d76d60]
[A23865398:96824] [22] /root/renmingyan/openfoam/OpenFOAM-7/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbbb+0xbdb)[0x7f366a0a775b]
[A23865398:96824] [23] simpleFoam[0x41d180]
[A23865398:96824] [24] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3668fa2505]
[A23865398:96824] [25] simpleFoam[0x41fd6a]
[A23865398:96824] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 96826 on node node2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[root@A23865399 motorBike]#
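
Since the backtrace points at the UCX UD InfiniBand transport and the Mellanox hcoll collectives during MPI_Init, these are the kinds of isolation runs one could try. This is only a sketch using generic Open MPI / UCX options, not an OpenFOAM-specific fix, and parameter names may differ between MPI builds:

# 1) Take the Mellanox hcoll collectives out of the picture
mpirun --allow-run-as-root -np 6 --hostfile machines \
    --mca coll ^hcoll simpleFoam -parallel

# 2) Restrict UCX to TCP + shared memory, bypassing the InfiniBand UD
#    transport that reports the endpoint timeout
mpirun --allow-run-as-root -np 6 --hostfile machines \
    -x UCX_TLS=tcp,self,sm simpleFoam -parallel

# 3) Avoid UCX and hcoll entirely and fall back to the plain TCP byte
#    transfer layer, to confirm the case itself runs across both nodes
mpirun --allow-run-as-root -np 6 --hostfile machines \
    --mca coll ^hcoll --mca pml ob1 --mca btl tcp,self,vader \
    simpleFoam -parallel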

January 31, 2021, 04:32   #2
AT90
New Member | Join Date: Jan 2019 | Posts: 2
Did you find a solution to this? I am facing a similar problem: the job is also killed during an all_reduce operation.

