August 11, 2022, 11:47
Weird performance problem between hosts
#1
New Member
Join Date: Aug 2022
Posts: 1
Rep Power: 0
Hi,
First of all, I was tasked to debug this problem as a Linux administrator; I have very little experience with CFD/OpenFOAM/OpenMPI/HPC.

The problem: I have 5 nodes (servers), and all run the same OpenFOAM test locally (standalone). All nodes are 100% identical, yet one node is somehow a "supernode" and runs the test much faster (37 seconds vs. 205 seconds).

The specification:
Ubuntu 20.04
OpenFOAM 8 (8-1c9b5879390b), from: http://dl.openfoam.org/ubuntu focal main
mpirun (Open MPI) 4.0.3
32 GB memory
2x Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz

My research: stress-ng reports exactly the same numbers on all nodes for every test (stress-ng --cpu 8 --cpu-method all --metrics-brief --perf -t 100). That is why I suspect MPI. I then compare 2 nodes: a normal node (5 of them) and the supernode, and do the following. All tests have an endTime of 0.1.

Code:
decomposePar -allRegions > log.decomposePar
mpirun -n 24 chtMultiRegionFoam -parallel > log.chtMultiRegionFoam
cat log.chtMultiRegionFoam | grep "ExecutionTime" | tail -n 1

When I diff the two log files between hosts, the only difference is the execution time; all other parameters are identical. When starting the run, the normal node outputs:

Code:
No OpenFabrics connection schemes reported that they were able to be used
on a specific port.  As such, the openib BTL (OpenFabrics support) will be
disabled for this port.

  Local host:           normalnode
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

ExecutionTime = 205.16 s  ClockTime = 208 s

The supernode outputs:

Code:
No OpenFabrics connection schemes reported that they were able to be used
on a specific port.  As such, the openib BTL (OpenFabrics support) will be
disabled for this port.

  Local host:           supernode
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: rich-cicada
  Location: mtl_ofi_component.c:629
  Error: No such file or directory (2)
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
23 more processes have sent help message help-mtl-ofi.txt / OFI call fail

ExecutionTime = 37.04 s  ClockTime = 37 s

The supernode outputs an extra line: "23 more processes have sent help message help-mtl-ofi.txt / OFI call fail".

My questions:
mlx4_0 is the Mellanox NIC (no InfiniBand) on these nodes. As I understand it, MPI should use a shared-memory transport when all ranks run on the same host. Can the firmware version of the NIC cause this difference?
Is the Open MPI OFI error related? Even though it is an error, the run is still much faster.
What could I investigate further?

Thanks
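(Not part of the original post.) One way to test the transport hypothesis is to pin both nodes to the same on-node communication path and compare timings. This is a sketch using standard Open MPI 4.x MCA parameters; the solver and process count are taken from the commands above:

```shell
# List the BTL/MTL/PML components this Open MPI build actually provides
ompi_info | grep -E 'btl|mtl|pml'

# Force the shared-memory ("vader") and self BTLs, select the ob1 PML,
# and exclude the OFI MTL, so every rank on either node uses the same
# on-node transport regardless of what the NIC/libfabric stack reports
mpirun -n 24 \
       --mca pml ob1 \
       --mca btl self,vader \
       --mca mtl ^ofi \
       chtMultiRegionFoam -parallel > log.chtMultiRegionFoam

# Turn up component-selection verbosity to see which transport was
# actually chosen on each node
mpirun -n 24 --mca btl_base_verbose 100 chtMultiRegionFoam -parallel
```

If both nodes then report comparable ExecutionTime values, the difference lies in transport selection (e.g. the OFI MTL silently falling back on one node) rather than in the hardware.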
July 7, 2023, 05:06
#2
New Member
David von Rüden
Join Date: Oct 2019
Location: Germany
Posts: 6
Rep Power: 7
Hello,
have you found an answer to your question? I am facing the same warnings and am now wondering whether they cause performance issues. Thank you in advance!

EDIT: I just fixed my issue by switching from SYSTEMOPENMPI to OPENMPI in the etc/bashrc.
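For reference, a sketch of the change described above. WM_MPLIB is the standard OpenFOAM build-configuration variable; the exact path to etc/bashrc depends on your install, so adjust as needed:

```shell
# In OpenFOAM's etc/bashrc, switch the MPI selection from the
# system-provided Open MPI to the ThirdParty-built one:
#
#   export WM_MPLIB=SYSTEMOPENMPI    # before
#   export WM_MPLIB=OPENMPI          # after
sed -i 's/^export WM_MPLIB=SYSTEMOPENMPI/export WM_MPLIB=OPENMPI/' \
    "$WM_PROJECT_DIR/etc/bashrc"

# Re-source the environment so the new setting takes effect,
# then recompile anything that links against MPI (e.g. Pstream)
source "$WM_PROJECT_DIR/etc/bashrc"
```

This swaps which MPI library OpenFOAM links against, which can change the transports (BTL/MTL) that get selected at runtime.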
Tags |
openmpi |