September 1, 2019, 18:41 |
Floating Point overflow and MPI tuning parms
I am working with STAR-CCM+ 2019.1.1 Build 14.02.012 on:

CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (the version shipped with STAR-CCM+)
Cisco UCS cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698, 7 nodes, 280 cores
enic RPM kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22; loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15; loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed

On runs shorter than about 5 hours, everything works flawlessly and is quite fast. However, on 280 cores the longer jobs die at or around 5 hours into the run with a floating point exception; the same job completes fine on 140 cores. I am using PBS Pro with a 99-hour wall time (a job-script sketch is at the end of this post).

------------------
Turbulent viscosity limited on 56 cells in Region A
floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------

I have been doing some reading, and some say that other MPI implementations are more stable with STAR-CCM+. I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

I am also trying to make Open MPI work. I have Open MPI compiled and it runs, but only with a very small number of CPUs; anything over about 2 cores per node hangs indefinitely. I compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is the version my STAR-CCM+ release supports. I am pointing STAR-CCM+ at the Open MPI I installed so it can use the Cisco usNIC fabric, which I can verify using Cisco native tools (some additional checks I use are at the end of this post). Note that STAR-CCM+ also ships with its own Open MPI, but I suspect mine needs tuning, just as Intel MPI did.

With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:

reference: https://software.intel.com/en-us/for...y/topic/542591
reference: https://software.intel.com/en-us/art...ced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parms I can scale to 280 cores and it runs very fast, right up until it hits the floating point exception about 5 hours into the job.

I am struggling to find equivalent tuning parms for Open MPI. I have listed every MCA parameter Open MPI exposes and have tried setting the following, with no success (the sketch at the end of this post shows how I am applying them):

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1

Does anyone have any advice or ideas on:
1.) the floating point overflow issue, and
2.) equivalent tuning parms for Open MPI?

Many thanks in advance.
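
P.S. For context, this is roughly how the Intel MPI runs are submitted. A minimal sketch of my PBS job script, assuming 7 nodes x 40 cores; treat the select line, the sim file name, and the -mpi value as placeholders for my actual setup:

#!/bin/bash
#PBS -N starccm-280core
#PBS -l select=7:ncpus=40:mpiprocs=40
#PBS -l walltime=99:00:00

# DAPL UD tuning that fixed the >100-core hangs (full list above)
export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
# ... plus the remaining I_MPI_DAPL_UD_* exports listed above ...

cd "$PBS_O_WORKDIR"

# Batch run across all 280 ranks; PBS supplies the host list
starccm+ -batch run \
         -np 280 \
         -machinefile "$PBS_NODEFILE" \
         -mpi intel \
         mysim.sim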
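
For the Open MPI experiments, I am applying the MCA settings in the two standard ways Open MPI supports; the mechanisms are stock Open MPI, but the values are just my current guesses. The parameter file is the convenient one here, since STAR-CCM+ invokes mpirun itself:

# Option 1: per-user parameter file, read automatically at startup
# (~/.openmpi/mca-params.conf)
btl = usnic,self,vader
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
# ... and so on for the other btl_usnic_* values listed above ...

# Option 2: the same settings on an mpirun command line (standalone tests)
mpirun --mca btl usnic,self,vader \
       --mca btl_usnic_sd_num 8208 \
       --mca btl_usnic_rd_num 8208 \
       -np 280 --hostfile "$PBS_NODEFILE" ./my_mpi_test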
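
And alongside the Cisco native tools, these are the quick checks I use to confirm the usNIC stack is actually in play (all standard commands; the /sys paths assume the drivers export a version string, which mine do):

# Loaded kernel module versions vs. what modinfo reports
modinfo enic | grep '^version'
cat /sys/module/enic/version
modinfo usnic_verbs | grep '^version'
cat /sys/module/usnic_verbs/version

# Confirm the Open MPI build actually contains the usnic BTL
ompi_info | grep -i usnic

# Dump every btl_usnic_* MCA parameter with its current value
ompi_info --param btl usnic --level 9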