A cluster of 2 is almost 5x slower than individual node |
June 21, 2020, 11:16 |
|
#1 |
Member
|
I put together a cluster of two Raspberry Pi 4 boards and ran MPPICFoam (nearly stock; the only change is that it does not write out the volume fields) across 8 processors, four on each board. The time step is around 5e-6 s. There is no NFS: each board has its own root directory, with its own latestTime, system, and constant subdirectories, on its local microSD card.
I then compared this with running on only 4 processors on one board alone. The lone board took 11 s and the cluster took 52 s per time step. It makes no difference whether Distributed is yes or no, and this happens well before the run gets to the point of writing data to disk. To pin down whether the communication link between the boards is what bogs down the cluster, I list nuggets of info below, and I estimate that the data exchanged over Ethernet costs only 0.045 s per time step, which is nowhere near the 52 s that the 2-node cluster produced. Can someone shed some light on this puzzle? Perhaps the TCP/IP packet size is too small? How do I control this under OpenMPI? Or perhaps it is because one of the 4 processors (each board has only 4) has to switch back and forth between MPI and the solver? (I haven't tried 3+3.)
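A bytes-over-bandwidth estimate like the one above leaves out per-message latency, which can matter when a time step involves many small MPI messages. The following is a minimal Python sketch of the usual latency-plus-bandwidth model, together with the slowdown ratio from the two timings quoted in the post. The function names and every input to the example call (message count, byte count, latency, bandwidth) are illustrative placeholders, not measurements from this cluster.

```python
def comm_time_per_step(n_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-plus-bandwidth estimate of communication time for one time step.

    Each message pays a fixed latency; the payload pays bytes / bandwidth.
    """
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


def slowdown(cluster_s, single_s):
    """How many times slower the cluster is than the single board."""
    return cluster_s / single_s


if __name__ == "__main__":
    # Timings quoted in the post: 52 s (two boards) vs 11 s (one board).
    print(f"slowdown: {slowdown(52.0, 11.0):.2f}x")

    # Placeholder inputs (assumptions): replace with measured values.
    t = comm_time_per_step(
        n_messages=10_000,            # hypothetical messages per step
        total_bytes=5_000_000,        # hypothetical bytes per step
        latency_s=200e-6,             # hypothetical per-message latency
        bandwidth_bytes_per_s=117e6,  # hypothetical usable Ethernet bandwidth
    )
    print(f"estimated communication time per step: {t:.3f} s")
```

If the message count per step is large, the latency term can exceed the bandwidth term by a wide margin, so it is worth counting messages as well as bytes.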
|
|
June 23, 2020, 22:59 |
|
#2 |
Senior Member
Join Date: Nov 2010
Location: USA
Posts: 1,232
Rep Power: 25 |
It might be useful to run OpenFOAM's profiling functions or another system-information-gathering tool such as sar or valgrind.
https://www.openfoam.com/documentati...rofiling.htmla
My initial guess would be some latency issue on the path from the Pi's SoC > PCIe > Ethernet controller.
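Among other things, sar reports per-interface network counters. Where it is not installed, a rough equivalent can be read from /proc/net/dev on Linux. Below is a minimal Python sketch that samples those counters twice and reports byte and packet rates. The interface name "eth0" and the function names are assumptions for illustration; check the actual interface name on each board.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev text.

    Returns {interface: (rx_bytes, rx_packets, tx_bytes, tx_packets)}.
    """
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Receive columns come first (bytes, packets, ...), transmit start at index 8.
        counters[name.strip()] = (
            int(fields[0]), int(fields[1]), int(fields[8]), int(fields[9])
        )
    return counters


def sample_rates(interface="eth0", interval_s=1.0, path="/proc/net/dev"):
    """Return (rx B/s, rx packets/s, tx B/s, tx packets/s) over one interval."""
    def read():
        with open(path) as fh:
            return parse_net_dev(fh.read())[interface]

    before = read()
    time.sleep(interval_s)
    after = read()
    return tuple((a - b) / interval_s for a, b in zip(after, before))


# Example (run while the solver is running; "eth0" is an assumed name):
#   rx_bps, rx_pps, tx_bps, tx_pps = sample_rates("eth0", 1.0)
```

A high packet rate combined with a low byte rate would point toward many small messages, where latency rather than bandwidth is the limit.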
|
June 25, 2020, 14:34 |
|
#3 |
Member
|
Thanks for the tip off.
I ran it and got no profiling results. It turns out I need to set the compile option to 'Prof': Code:
export WM_COMPILE_OPTION=Prof
I am now running on 4 processors on one board first to see what I get. Anyway, even running on just the one board, the four processors together take about 13 seconds per time step of MPPICFoam, whereas running 4 threads on my 8-year-old Dell Inspiron laptop (Intel x64 arch, 2 cores) takes only 4 seconds. The 3x worse performance of the ARM SoC on the RPi 4 is shocking. The next thing I need to do is compile OpenFOAM on the Dell laptop with 'Prof' and re-run mpirun on 4 processors to see where the difference lies.
What is sar? A web search only turns up page after page about the virus. I tried valgrind and got an error message; a search on it brought up a 2016 discussion thread between the developers (https://bugs.kde.org/show_bug.cgi?id=303877) that seems to indicate they fixed the problem for some specific architectures. Apparently not for armhf.
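To compare seconds per time step between the Pi and the laptop on the same footing, the per-step cost can be pulled from the solver log. OpenFOAM solvers normally print a cumulative "ExecutionTime = ... s" line after each step. The sketch below takes differences between consecutive values. It assumes the log has that usual format; the function name and the log file name in the comment are hypothetical.

```python
import re

# Matches lines such as: "ExecutionTime = 26.5 s  ClockTime = 27 s"
_EXEC_RE = re.compile(r"^ExecutionTime = ([0-9.eE+-]+) s", re.MULTILINE)


def per_step_times(log_text):
    """Return the time spent in each step, from cumulative ExecutionTime lines.

    The first reported value is used only as a baseline, so a log with
    N ExecutionTime lines yields N - 1 per-step times.
    """
    cumulative = [float(value) for value in _EXEC_RE.findall(log_text)]
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Example (hypothetical file name):
#   with open("log.MPPICFoam") as fh:
#       steps = per_step_times(fh.read())
#   print(sum(steps) / len(steps))
```

Averaging over several steps, rather than reading a single one, smooths out start-up effects in the first few iterations.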
|
June 25, 2020, 14:52 |
|
#4 |
Senior Member
Join Date: Nov 2010
Location: USA
Posts: 1,232
Rep Power: 25 |
You can read more about sar here:
https://linux.die.net/man/1/sar
https://www.geeksforgeeks.org/sar-co...m-performance/
It would be interesting to compute the maximum memory bandwidth available to your old laptop and to the Pi, and see how that correlates with the performance difference. You could also measure the memory bandwidth while the application is running with PCM (https://github.com/opcm/pcm). I don't know if it can be built for ARM, though.
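As a very rough first look at memory bandwidth on both machines, a large buffer copy can be timed with nothing but the Python standard library. This is only a sketch: it is single-threaded, it measures a plain memory copy rather than a proper STREAM benchmark, and the function name and default sizes are arbitrary choices. Treat the result as a crude relative figure between the two machines rather than an absolute one.

```python
import time


def copy_bandwidth_gb_per_s(size_mb=256, repeats=5):
    """Crude single-core memory-copy bandwidth estimate in GB/s.

    Copies a buffer of size_mb megabytes `repeats` times and uses the
    fastest run. Both the bytes read and the bytes written are counted.
    """
    n_bytes = size_mb * 1024 * 1024
    # Fill with non-zero bytes so every page of the source is really backed by RAM.
    src = b"\x01" * n_bytes
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytearray(src)  # one pass reading src, one pass writing dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        del dst
    return 2 * n_bytes / best / 1e9


if __name__ == "__main__":
    print(f"approx. copy bandwidth: {copy_bandwidth_gb_per_s():.2f} GB/s")
```

The buffer should be much larger than the last-level cache, otherwise the copy is served from cache and the figure overstates what the solver would see.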
|