StarCCMS+ on AWS Parallel Cluster not distributing workload across multiple nodes |
|
October 28, 2020, 00:48 |
StarCCMS+ on AWS Parallel Cluster not distributing workload across multiple nodes
|
#1 |
New Member
Dave Wagoner
Join Date: Oct 2020
Posts: 2
Rep Power: 0 |
Summary
1. STAR-CCM+ jobs submitted will not run on more than one node in the cluster.
2. When jobs run on the first compute node ("local node"), they are significantly CPU-throttled. When the same jobs run on other nodes, they consume all available CPU as intended.

Detail

STAR-CCM+ runs on AWS ParallelCluster in batch mode. This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm. The implementation is documented here: https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

That document is missing some critical details that may take some investigation to determine. These undocumented details include:
Disclaimer on my background: I am not a user of STAR-CCM+, just an IT guy trying to help our mechanical engineers by setting this up. I have a large sample job (a .sim file) that I use for testing. The above constitute the useful items I have been able to collect to date.

Where I need some guidance is on:
1. getting workloads to actually run on multiple nodes instead of only one
2. getting the first compute node to not throttle its CPU usage

I created a cluster using three large compute nodes (48 vCPUs) and a master. Test cases:
In both cases, I had hoped to distribute the workload across all available compute nodes and to run the CPUs unconstrained. Any hints from folks who have traveled this road?
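For context on what a submission looks like, here is a rough sketch of an SGE job script along the lines of the AWS blog referenced above. The parallel environment name, install path, and .sim file name are placeholders for illustration, not details from this thread:

```shell
#!/bin/bash
# Hypothetical SGE job script -- PE name and paths are site-specific
# placeholders, not values from the AWS blog or this thread.
#$ -N starccm-test
#$ -pe mpi 144        # request 144 slots (e.g. 3 nodes x 48 vCPUs)
#$ -cwd -j y

# Install location varies per site.
STARCCM=/opt/starccm/bin/starccm+

# -batch runs without the GUI; -np uses the slot count SGE granted
# ($NSLOTS is set by SGE); -bs sge tells STAR-CCM+ to read its host
# list from the SGE allocation.
"$STARCCM" -batch -np "$NSLOTS" -bs sge large-sample.sim
```

This is a sketch of the shape of such a script, not a drop-in file; licensing flags in particular differ per installation and are omitted here.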
|
October 29, 2020, 01:19 |
Resolved
|
#2 |
New Member
Dave Wagoner
Join Date: Oct 2020
Posts: 2
Rep Power: 0 |
With the help of Dennis.Kingsley@us.fincantieri.com, I was able to get STAR-CCM+ workloads split across multiple nodes and running with unconstrained CPU utilization.
The one additional change beyond those enumerated earlier in the thread is that the machine file used MUST contain the master (head) node, and this entry must appear first in the list of machines in that file.

The next challenge to pursue is the optimal number of nodes in a cluster. After getting a healthy workload running, I noted that the vast majority of the CPU time is system time rather than user time. A bit of stracing showed that a great deal of polling was being done to coordinate interprocess communication and activity. There is also a significant amount of network traffic between the nodes, which requires CPU to drive. Adding nodes may increase overhead and actually decrease throughput - a topic for subsequent testing.
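To make the ordering rule concrete, here is a minimal sketch of building and using such a machine file. The host names are illustrative placeholders; the flags shown (-batch, -np, -machinefile) are standard STAR-CCM+ command-line options:

```shell
# Hypothetical machine file. The key point from the fix above:
# the master (head) node must be the FIRST line of the file.
cat > machines.txt <<EOF
ip-10-0-0-10
ip-10-0-1-21
ip-10-0-1-22
ip-10-0-1-23
EOF

# Pass the file explicitly instead of relying on batch-system
# detection (run on the cluster itself; host names above are fake).
starccm+ -batch -np 144 -machinefile machines.txt large-sample.sim
```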
|
October 30, 2020, 17:32 |
|
#3 |
Senior Member
Chaotic Water
Join Date: Jul 2012
Location: Elgrin Fau
Posts: 438
Rep Power: 18 |
I'd like to thank you for sharing your experience - these notes might save loads of time for someone in the future.
|
|
May 25, 2021, 03:39 |
|
#4 |
New Member
Philip Morris Jones
Join Date: Jul 2014
Posts: 1
Rep Power: 0 |
A lot of the above makes little sense in the context of a posting in cfd-online.
Obvious disclaimer: I work for Siemens, who write STAR-CCM+, and I build and run clusters on AWS regularly and interact with the Amazon team.

When you are running CFD on a cluster, you need a correctly configured cluster, and then you need to operate the CFD code in a manner that reflects the cluster you are using. If you want to run on AWS, ParallelCluster is a good way to get a cluster set up and running in very little time. If you have problems with running ParallelCluster, there are forums applicable to that. Once you have a working cluster, then you have material that is related to CFD.

STAR-CCM+ is batch-system aware. I see some confusion over batch systems: "This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm." SGE and Slurm are both batch systems and are mutually exclusive, so you can have one or the other but not both. Once you have one batch system (and if you are coming to this fresh, the latest versions of ParallelCluster are dropping SGE and adopting Slurm as the default), you simply run STAR-CCM+ with the appropriate flag: either

-bs sge

or

-bs slurm

These flags mean that STAR-CCM+ picks up the resources allocated to it via the batch system and starts the relevant processes:

Starting STAR-CCM+ parallel server
MPI Distribution : IBM Platform MPI-09.01.04.03
Host 0 -- ip-10-192-12-64.ec2.internal -- Ranks 0-35
Host 1 -- ip-10-192-12-123.ec2.internal -- Ranks 36-71
Host 2 -- ip-10-192-12-189.ec2.internal -- Ranks 72-107
Host 3 -- ip-10-192-12-161.ec2.internal -- Ranks 108-143
Process rank 0 ip-10-192-12-64.ec2.internal 46154
Total number of processes : 144
|
Tags |
aws, fsx, parallel cluster, starccm+ |
|
|