Tutorial: Running STARCCM on Ubuntu with SLURM and OpenMPI over Infiniband
July 11, 2020, 06:35
Erik Lönroth
Location: Sweden
This is a tutorial on running a reference StarCCM+ job on Ubuntu 18.04 using the snap version of SLURM with OpenMPI 4.0.4 over Infiniband.
You can use this to perform scaling studies, track down issues and optimize performance, or use it as you like. Much of this will work on other operating systems too.

This is the workbench used:
* Hardware: 2 hosts with 2x20 cores, 187 GB RAM.
* Infiniband: Mellanox MT28908 Family [ConnectX-6]
* OS: Linux 4.15.0-109-generic (x86_64), Ubuntu 18.04.4
* SLURM 20.04 (https://snapcraft.io/slurm)
* OpenMPI: 4.0.4 (ucx, openib)
* StarCCM+: STAR-CCM+14.06.012
* A reference model that is small enough for your computers and large enough to run over 2 nodes.

Let's get started.

Modify ulimits on all nodes

This is done by editing /etc/security/limits.d/30-slurm.conf
Code:
* soft nofile 65000
* hard nofile 65000
* soft memlock unlimited
* hard memlock unlimited
* soft stack unlimited
* hard stack unlimited

Then open a systemd override for the SLURM service:
Code:
$ sudo systemctl edit snap.slurm.slurmctld.service

Put the following in the override:
Code:
[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity

* Make sure the login nodes have the correct ulimits after a login.
* Validate that all worker nodes also have the correct ulimit values when running through SLURM. For example:
Code:
$ srun -N 1 bash -c 'ulimit -a'

(ulimit is a shell builtin, so it is run through bash here.)

Compile OpenMPI 4.0.4

At the time of writing, this is the latest version. This is my configure line, but you can compile it differently for your needs.
Code:
$ ./configure --without-cm --with-ib --prefix=/opt/openmpi-4.0.4

After building and installing, check which MCA components are available:
Code:
/opt/openmpi-4.0.4/bin/ompi_info | grep -E 'btl|ucx'
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.4)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.4)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.4)

The two that matter here are:
* MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.4)
* MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.4)

The rest are not important at this point, but you might know better; please let me know. You can see in the job script later where these modules are referenced.

Validate that ucx_info sees your Infiniband device and ib_verbs transports

In my case, I have a Mellanox device (shown with ibv_devices), so I should see it with ucx_info:
Code:
ucx_info -d | grep -1 mlx5_0
#
# Memory domain: mlx5_0
# Component: ib
--
# Transport: rc_verbs
# Device: mlx5_0:1
#
--
# Transport: rc_mlx5
# Device: mlx5_0:1
#
--
# Transport: dc_mlx5
# Device: mlx5_0:1
#
--
# Transport: ud_verbs
# Device: mlx5_0:1
#
--
# Transport: ud_mlx5
# Device: mlx5_0:1
#

Modify the STARCCM+ installation

My version of StarCCM+ uses an old ucx and calls /usr/bin/ucx_info. At some point during startup it fails because it cannot find libibcm.so.1 when using our custom OpenMPI. Perhaps there is a way to force StarCCM+ to look for ucx_info on the system, but I have not found one.

To have StarCCM+ ignore its own ucx, simply remove the ucx from the installation tree and replace it with an empty directory.
Code:
rm -rf /opt/STAR-CCM+14.06.012/ucx/1.5.0-cda-001/linux-x86_64*
mkdir -p /opt/STAR-CCM+14.06.012/ucx/1.5.0-cda-001/linux-x86_64-2.17/gnu7.1/lib

Time to write the job script
Code:
#!/bin/bash
#SBATCH -J starccmref
#SBATCH -N 2
#SBATCH -n 80

set -o xtrace
set -e

# StarCCM+
export PATH=$PATH:/opt/STAR-CCM+14.06.012/star/bin

# OpenMPI
export OPENMPI_DIR=/opt/openmpi-4.0.4
export PATH=${OPENMPI_DIR}/bin:$PATH
export LD_LIBRARY_PATH=${OPENMPI_DIR}/lib

# Report on the versions for logs
which ompi_info
which mpirun
ompi_info | grep btl
ompi_info | grep ucx

# Kill any leftovers from previous runs
kill_starccm+

# Exported so that the starccm+ process inherits the license setting
export CDLMD_LICENSE_FILE="27012@license.server.com"
SIM_FILE=SteadyFlowBackwardFacingStep_final.sim
STAR_CLASS_PATH="/software/Java/poi-3.7-FINAL"
NODE_FILE="nodefile"

# Assemble a nodelist using this python lib
hostListbin=/software/hostlist/python-hostlist-1.18/hostlist
$hostListbin --append=: --append-slurm-tasks=$SLURM_TASKS_PER_NODE -e $SLURM_JOB_NODELIST > $NODE_FILE

# Start
starccm+ -machinefile ${NODE_FILE} \
  -power \
  -batch ./starccmSim.java \
  -np $SLURM_NTASKS \
  -ldlibpath $LD_LIBRARY_PATH \
  -classpath $STAR_CLASS_PATH \
  -fabricverbose \
  -mpi openmpi \
  -mpiflags "--mca pml ucx --mca btl openib --mca pml_base_verbose 10 --mca mtl_base_verbose 10" \
  ./${SIM_FILE}

# Kill off any rogue processes
kill_starccm+

Submit the job script:
Code:
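$ sbatch -p debug -n 80 ./starccmubuntu.sh

Jobs are submitted with sbatch (squeue only lists the queue); debug is assumed here to be the name of your partition.

You can watch your Infiniband counters to see that a significant amount of traffic is sent over the wire, which indicates that you have succeeded.
Code: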
watch -d cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets

I hope you can make use of this, and also that StarCCM+ will soon support Ubuntu straight out of the box.

Last edited by erik_lonroth; July 11, 2020 at 18:06.
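A possible follow-up to the counter check: watch only highlights changes on screen, so if you want a number for your job logs, something like the sketch below might help. It reads the counter twice and prints how much it grew in between. The function name counter_delta and the COUNTER variable are made up for this example and are not part of any tool; the default path is the one from the watch command, so adjust the device and port to match your system.

```shell
#!/bin/bash
# Sketch only: print how much a sysfs-style counter file grows over an interval.
# counter_delta and COUNTER are hypothetical names chosen for this example.

counter_delta() {
    local file=$1
    local interval=${2:-1}   # seconds between the two reads, defaults to 1
    local before after
    before=$(cat "$file") || return 1
    sleep "$interval"
    after=$(cat "$file") || return 1
    echo $(( after - before ))
}

# Default path taken from the watch command in the post (assumes mlx5_0, port 1).
COUNTER=${COUNTER:-/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets}

# Only report when the counter exists, so the script is harmless on hosts without the device.
if [ -r "$COUNTER" ]; then
    echo "port_rcv_packets increase over 5 s: $(counter_delta "$COUNTER" 5)"
fi
```

Run it on a worker node while the job is active; a value that stays near zero would suggest the traffic is not going over that port.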
Tags: infiniband, openmpi, slurm, starccm, ubuntu