
STAR-CCM+ on AWS ParallelCluster not distributing workload across multiple nodes



Old   October 28, 2020, 00:48
Default STAR-CCM+ on AWS ParallelCluster not distributing workload across multiple nodes
  #1
New Member
 
Dave Wagoner
Join Date: Oct 2020
Posts: 2
Rep Power: 0
dwagoner is on a distinguished road
Summary
1. STAR-CCM+ jobs submitted will not run on more than one node in a cluster
2. When jobs run on the first compute node (“local node”) they are significantly CPU-throttled. When the same jobs run on other nodes, they consume all available CPU as intended.


Detail
STAR-CCM+ runs on AWS ParallelCluster in batch mode. This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm. The implementation is documented here:
https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

The document is missing some critical details that can take some investigation to work out. These undocumented details include:

  • FSx Lustre file system DNS names are not in the public DNS, as is the case for other EC2 resources such as ALBs, NLBs, S3 buckets, EC2 instances, etc. Instead, they can be resolved only by the AWS-provided DNS service inside the VPC where the FSx Lustre file system is created. The pcluster utility uses a CloudFormation template and may fail if the FSx Lustre name cannot be resolved, because the file system then cannot be mounted on the master and the compute nodes. Resolution is handled by the resolver at x.x.x.2, i.e. the base address of the VPC CIDR block plus two. The critical piece is that this address must be the DNS server in the DHCP options set of the VPC in which the cluster is being created. The FSx Lustre name can be resolved manually by specifying x.x.x.2 as the DNS server, but that manual step cannot be injected in the middle of the CloudFormation run that builds the cluster. (A quick resolution check is included in the sketch after this list.)
  • The master and compute nodes may not consistently mount the FSx Lustre partition, even when name resolution works. The Lustre client modules do not always get loaded; installing the modules that match the running kernel seems to get them in place so that the FSx partition can be mounted:
    apt update
    apt-get install -y lustre-client-modules-$(uname -r)
  • The master should have local entries for the compute nodes in /etc/hosts. Log in to each of the compute nodes, copy the relevant line from its /etc/hosts, and add it to the master's /etc/hosts. You may need to update the security group of the compute nodes to allow access to port 22 from the rest of your environment, and you may need to add ssh keys directly onto the compute nodes for the default non-root user (e.g. “ubuntu”, “centos”, or “ec2-user”).
  • The first compute node attempts to launch “remote” jobs on the other compute nodes, so the entries just added to the master's /etc/hosts should be copied to the compute nodes as well for consistency. Out of sheer paranoia, run “ssh compute-node-name uptime” from the first compute node to each of the others to confirm that the entries are correct, the host keys have been accepted, and the ssh keys are present (this check is part of the sketch after this list). Do this as the default non-root user, not as root.
  • Update the kernel parameter for ptrace; the default is “1” and it needs to be “0”. The persistent setting lives in /etc/sysctl.d/10-ptrace.conf and normally takes effect after a reboot; however, doing an init 6 on individual nodes causes pcluster to replace them rather than simply reboot them. Either stop the cluster with pcluster stop cluster-name and start it again, or set the value dynamically with:
    sysctl -w kernel.yama.ptrace_scope=0

    Without this, Open MPI is likely to emit various complaints about btl_vader_single_copy_mechanism. (Thanks to Dennis.Kingsley@us.fincantieri.com for this valuable tidbit.)
  • DNS resolution for the FSx Lustre partition may not work consistently after the first system boot. It may be prudent to replace the DNS name in /etc/fstab with its resolved IP address after the cluster has been created.
  • When using a machine file (specified with the “-machinefile” option), do not use the FQDN - each entry must match the output of “/bin/hostname” on the corresponding compute node.
  • Be sure to include the parameter “-mpi openmpi”. Without it, you are likely to get error messages like those below, and the suggestion of updating /etc/security/limits.conf has nothing to do with the real problem:

    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: ibv_create_qp(left ring) failed
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: probably you need to increase pinnable memory in /etc/security/limits.conf
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: ibv_ring_createqp() failed
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: Can't initialize RDMA device
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: Internal Error: Cannot initialize RDMA protocol
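
To tie the items above together, here is a rough sanity-check sketch that can be run from the master as the default non-root user. The FSx DNS name, the 10.0.0.2 resolver address, and the list of three compute node hostnames are placeholders (adapted from the logs in this thread); substitute your own values.

    # 1. The FSx Lustre name must resolve via the VPC resolver (x.x.x.2).
    #    The file system name and 10.0.0.2 are placeholders; adjust to your
    #    file system ID, region, and VPC CIDR. Requires dig (dnsutils).
    dig +short fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com @10.0.0.2

    # 2. Are the Lustre client modules loaded and the partition mounted?
    lsmod | grep -i lustre
    mount | grep /fsx

    # 3. Is every compute node reachable over ssh as the non-root user, and
    #    does its short hostname match what the machinefile will contain?
    for h in compute-st-c5n18xlarge-1 compute-st-c5n18xlarge-2 compute-st-c5n18xlarge-3; do
        ssh "$h" '/bin/hostname; uptime'
    done

    # 4. ptrace_scope must be 0 on every node; persist it in
    #    /etc/sysctl.d/10-ptrace.conf so it survives a cluster stop/start.
    for h in compute-st-c5n18xlarge-1 compute-st-c5n18xlarge-2 compute-st-c5n18xlarge-3; do
        ssh "$h" 'sysctl kernel.yama.ptrace_scope'
    done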

Disclaimer on my background: I am not a user of STAR-CCM+, just an IT guy trying to help our mechanical engineers by setting this up. I have a large sample job (a .sim file) that I use for testing.

The above constitute the useful items I have been able to collect to date. Where I need some guidance is on:
1. getting workloads to actually run on multiple nodes instead of only 1
2. getting the first compute node to not throttle its CPU usage.


I created a cluster using three large compute nodes (48 vCPUs) and a master.

Test cases:
  1. I submitted a job with no -machinefile parameter, to let the workload placement default. I set -np (number of slots) to 48.
    All processes were placed on the first compute node and CPU was throttled to nearly zero. (I saw 2-3% when doing this test with slightly smaller compute nodes).

    Starting local server: /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ -power -podkey XXX -licpath 1999@flex.cd-adapco.com -np 48 -mpi openmpi -machinefile /fsx/machinefile -server /fsx/X.sim

  2. I changed the order of the compute nodes in the machine file, placing the second compute node at the top. Processes were started on the second compute node and were not throttled - the whole system ran at 100% CPU (desirable in this case). The first compute node executed the following statement to start the run on the second compute node:

    Starting remote server: ssh compute-st-c5d24xlarge-2 echo "Remote PID : $$"; exec /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ -power -podkey XX -licpath 1999@flex.cd-adapco.com -np 48 -mpi openmpi -machinefile /fsx/machinefile -server -rsh ssh /fsx/XXX.sim

In both cases, I had hopes of distributing the workload across all available compute nodes and running on CPUs in an unconstrained fashion.

Any hints from folks who have traveled this road?
bluebase and Nikpap like this.

Old   October 29, 2020, 01:19
Default Resolved
  #2
New Member
 
Dave Wagoner
Join Date: Oct 2020
Posts: 2
Rep Power: 0
dwagoner is on a distinguished road
With the help of Dennis.Kingsley@us.fincantieri.com, I was able to get STAR-CCM+ workloads split across multiple nodes and running with unconstrained CPU utilization.

The one additional change beyond those enumerated earlier in the thread is that the machine file MUST contain the master (head) node, and that entry must appear first in the list of machines in the file, for example:
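
For illustration only (the head-node name below is hypothetical, and the compute-node names are simply the ones that appear in the logs above), a machine file in that form would look something like:

    # /fsx/machinefile -- head node first, then the compute nodes, each
    # spelled exactly as /bin/hostname reports on that machine
    cat > /fsx/machinefile <<'EOF'
    ip-10-0-0-10
    compute-st-c5n18xlarge-1
    compute-st-c5n18xlarge-2
    compute-st-c5n18xlarge-3
    EOF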

The next challenge to pursue is the optimal number of nodes in a cluster. After getting a healthy workload running, I noted that the vast majority of the CPU time is system time rather than user time. A bit of stracing showed that a great deal of polling was being done to coordinate interprocess communication, and there is also a significant amount of network traffic between the nodes, which itself consumes CPU (see the sketch below for one way to observe this). Adding nodes may increase overhead and actually decrease throughput - a topic for subsequent testing.
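
A minimal way to reproduce that observation (mpstat comes from the sysstat package; <pid> is a placeholder for whichever starccm+ rank you attach to):

    mpstat -P ALL 5      # per-CPU breakdown of %usr vs %sys, sampled every 5 s
    strace -c -p <pid>   # per-syscall time summary for one rank; Ctrl-C to stop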
Nikpap likes this.

Old   October 30, 2020, 17:32
Default
  #3
cwl
Senior Member
 
Chaotic Water
Join Date: Jul 2012
Location: Elgrin Fau
Posts: 438
Rep Power: 18
cwl is on a distinguished road
I'd like to thank you for sharing your experience - these notes might save loads of time for someone in the future.

Old   May 25, 2021, 03:39
Default
  #4
New Member
 
Philip Morris Jones
Join Date: Jul 2014
Posts: 1
Rep Power: 0
philip_m_jones is on a distinguished road
A lot of the above makes little sense in the context of a posting in cfd-online.

Obvious disclaimer: I work for Siemens, who write STAR-CCM+, and I build and run clusters on AWS regularly and interact with the Amazon team.

When you are running CFD on a cluster you need to have a correctly configured cluster and then you need to operate the CFD code in a manner that reflects the cluster you are using.

If you want to run on AWS then ParallelCluster is a good way to get a cluster set up and running in very little time. If you have problems with running ParallelCluster then there are forums that are applicable to that.

Once you have a working cluster then you have material that is related to CFD.

STAR-CCM+ is batch-system aware. I see some confusion over batch systems:

"This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm."

SGE and Slurm are both batch systems and are mutually exclusive, so you can have one or the other but not both.
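
If in doubt, the standard client commands show which one a given cluster is actually running (only the command matching the installed scheduler will succeed):

    sinfo    # Slurm: lists partitions and nodes
    qhost    # SGE: lists execution hosts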

Once you have one batch system (and if you are coming to this fresh, note that the latest versions of ParallelCluster are dropping SGE and adopting Slurm as the default), you simply run STAR-CCM+ with the appropriate flag, either:

-bs sge
-bs slurm

These flags mean that STAR-CCM+ picks up the resources allocated to it by the batch system and starts the relevant processes:

Starting STAR-CCM+ parallel server
MPI Distribution : IBM Platform MPI-09.01.04.03
Host 0 -- ip-10-192-12-64.ec2.internal -- Ranks 0-35
Host 1 -- ip-10-192-12-123.ec2.internal -- Ranks 36-71
Host 2 -- ip-10-192-12-189.ec2.internal -- Ranks 72-107
Host 3 -- ip-10-192-12-161.ec2.internal -- Ranks 108-143
Process rank 0 ip-10-192-12-64.ec2.internal 46154
Total number of processes : 144
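
For reference, a minimal Slurm submission sketch. The SBATCH values, the pod key, the .sim file, and the "-batch run" invocation are illustrative placeholders rather than values from this thread; the install path is the one shown in the earlier posts.

    #!/bin/bash
    #SBATCH --job-name=starccm
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=36
    #SBATCH --exclusive

    # -bs slurm lets STAR-CCM+ read its node/core allocation from Slurm,
    # so no -np or -machinefile is needed here.
    /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ \
        -power -podkey <your-pod-key> -licpath 1999@flex.cd-adapco.com \
        -bs slurm -batch run /fsx/your_case.sim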
cwl and arvindpj like this.


Tags
aws, fsx, parallel cluster, starccm+



