OpenFOAM benchmarks on various hardware

meshingpumpkins · October 5, 2020, 04:54

THX to all the contributors here!

Here are my results. As expected, they are pretty similar to the other Epyc 7542 results.

I am pretty happy with this setup.

System:

2x AMD EPYC 7542 32-core 16*32GB 3200MHz/ Ubuntu 18 / ESI 2006

Result:

Code:

Cores    time (s)    speedup
1    784,73    1              
4    171,79    4,567960882    
8     89,61    8,757169959    
12    66,93    11,72463768    
16    43,94    17,85912608    
20    40,56    19,34738659    
24    35,06    22,38248716    
28    34,21    22,93861444    
32    30,04    26,12283622    
36    29,57    26,53804532    
40    27,56    28,47351234    
44    27,79    28,23785534    
48    24,38    32,18744873    
52    25,07    31,30155564    
56     25,5    30,77372549    
60    24,48    32,05596405    
64    23,46    33,44970162

Kailee71 · October 7, 2020, 07:46

Hi Meshingpumpkins,

I think your last column might need correction; it needs to be the inverse... You divided the runtime by 100, but what you should do for it/s is to divide 100 by runtime... But no biggy...
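As a minimal sketch of the conversion described above: the snippet assumes each benchmark run performs 100 solver iterations (implied by "divide 100 by runtime"), and the function name is only illustrative. The runtimes are taken from the first post.

```python
# Assumption: one benchmark run = 100 solver iterations,
# as implied by "divide 100 by runtime" in the post above.
ITERATIONS = 100


def iterations_per_second(runtime_s, iterations=ITERATIONS):
    """Throughput in iterations per second for one benchmark run."""
    return iterations / runtime_s


# Runtimes in seconds from the 2x EPYC 7542 table (cores -> time).
runtimes = {1: 784.73, 32: 30.04, 64: 23.46}

for cores, runtime in runtimes.items():
    print(f"{cores:>3} cores: {iterations_per_second(runtime):.3f} it/s")
```

Dividing the runtime by 100 instead gives seconds per iteration, which is the inverse of this value.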

Also: As on many platforms the throughput is limited by memory bandwidth, and often max performance is reached well before all cores are utilized - would it not be interesting to somehow get power consumption into the results? Not trivial, I know, but if you get 90% of performance with 60% of used cores, this would be an interesting investigation, no?
Cheers,

Kai.

meshingpumpkins · October 7, 2020, 08:55

Quote:

Originally Posted by Kailee71

Hi Meshingpumpkins,

I think your last column might need correction; it needs to be the inverse... You divided the runtime by 100, but what you should do for it/s is to divide 100 by runtime... But no biggy...

Also: As on many platforms the throughput is limited by memory bandwidth, and often max performance is reached well before all cores are utilized - would it not be interesting to somehow get power consumption into the results? Not trivial, I know, but if you get 90% of performance with 60% of used cores, this would be an interesting investigation, no?
Cheers,

Kai.

Thx. You are absolutely right.

About the performance vs. power consumption question:

This is interesting. But if a server has multiple users, you would share the cores anyway. In my opinion it also depends on the use case. One could say that for parameter studies it would be a better idea to run each case on half the cores to increase efficiency.

[Attached image: speedup_1_s6.jpg]

JBeilke · October 7, 2020, 10:09

Quote:

Originally Posted by Kailee71

Hi Meshingpumpkins,

I think your last column might need correction; it needs to be the inverse... You divided the runtime by 100, but what you should do for it/s is to divide 100 by runtime... But no biggy...

Kai.

Speedup in the last line means runtime on 1 core divided by runtime on 64 cores. And the original table is right.

meshingpumpkins · October 7, 2020, 10:21

Quote:

Originally Posted by JBeilke

Speedup in the last line means runtime on 1 core divided by runtime on 64 cores. And the original table is right.

Sorry: Kai was right. I corrected the first post and deleted the last column.

Kailee71 · October 7, 2020, 12:02

... however, the it/s was an interesting metric! Could you add it back in?

;-)

Kailee71 · October 14, 2020, 14:08

Hello all,

I recently got an HP DL380p with (only) 2x E5-2630v1, 16x 2Rx4GB 1333 DDR3 (64GB total), and through the excellent iLO I can also monitor power draw (see attached image). I'll do some comparisons of bare metal Ubuntu 20.04, ESXi 6.7, Win10 WSL, and lastly Freenas with an Ubuntu VM (just for kicks). All will use OF7 from .org, natively installed through their Ubuntu repository, so no software optimizing at all.

To start off, here's Ubuntu 20.04:

Code:

SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	147	2447	0.100
2	158	1557	0.068
4	207	906	0.052
6	223	636	0.039
8	240	522	0.035
12	275	422	0.032

Sim
Cores	Pwr(W)	Time(s)	kWh
1	162	1252	0.056
2	184	645	0.033
4	256	290	0.021
6	288	211	0.017
8	320	176	0.016
12	358	149	0.015

Interesting to me is the last column: for SHM, and also for the sim itself, it seems that using all cores is advisable with this setup. Of course, these CPUs only have 6 cores each, so they are not being bottlenecked by the quad-channel memory. However, extrapolating from these values, we seem to be close to a sweet spot power-draw wise: energy use is flattening off at 12 used cores, whereas I would expect another 2 or even 4 cores per CPU to still reduce runtimes, if maybe only a little with 10c/CPU over 8c/CPU.

However: I take away from this that running all 12 cores reduces cost of the benchmark by 2/3 for SHM, and nearly 3/4 for the sim, when compared to running single-core.
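The kWh column and the quoted savings can be reproduced with a short sketch. It assumes the Pwr(W) column is the average wall power over the whole run, so that energy is simply power times runtime; the function names are illustrative only. The values are the "Sim" rows of the bare-metal table above.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def energy_kwh(power_w, time_s):
    """Energy of one run, assuming power_w is the average draw over time_s."""
    return power_w * time_s / JOULES_PER_KWH


def saving_vs_single_core(runs, cores):
    """Fraction of energy saved relative to the single-core run."""
    return 1.0 - energy_kwh(*runs[cores]) / energy_kwh(*runs[1])


# cores -> (average power in W, runtime in s), "Sim" rows, Ubuntu 20.04 bare metal
sim = {
    1: (162, 1252),
    2: (184, 645),
    4: (256, 290),
    6: (288, 211),
    8: (320, 176),
    12: (358, 149),
}

for cores, (power_w, time_s) in sim.items():
    print(
        f"{cores:>2} cores: {energy_kwh(power_w, time_s):.3f} kWh, "
        f"saving vs 1 core: {saving_vs_single_core(sim, cores):.0%}"
    )
```

Note that this counts total power at the wall, so any fixed baseline draw of the machine is included in every row.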

Results using VMs on the same hardware will follow over the next few days.

Cheers,

Kai.

wkernkamp · October 15, 2020, 22:48

Kai,

What is your idle power?

Will

Kailee71 · October 16, 2020, 04:17

Hi Will,

Idle power hovers between 90 and 100 Watts.

Cheers,

Kai.

Kailee71 · October 16, 2020, 10:34

Ok now with ESXi 6.5, same hardware as above (DL380p, 2x E5-2630v1, 16x 2Rx4, 1333MHz), vm is identical to bare-metal setup above.

Code:

SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	158	2522	0.110
2	166	1635	0.0747
4	209	936	0.054
6	230	646	0.041
8	239	535	0.036
12	273	430	0.033

Sim
Cores	Pwr(W)	Time(s)	kWh
1	169	1285	0.060
2	189	670	0.035
4	257	302	0.022
6	288	217	0.017
8	317	182	0.016
12	357	154	0.015

Very interesting. After my experience with the dual X5670 machine earlier this year I wasn't hopeful, but wow, this is usable, especially if all cores are used. Very pleased with this!

Kai.

Kailee71 · October 16, 2020, 17:46

Now with bhyve, under freenas 11.3;

Code:

SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	160	2728	0.12
2	180	1719	0.086
4	207	1028	0.059
6	232	717	0.046
8	249	604	0.042
12	262	924	0.067

Sim
Cores	Pwr(W)	Time(s)	kWh
1	179	1617	0.080
2	210	756	0.044
4	245	427	0.029
6	266	339	0.025
8	285	317	0.025
12	280	556	0.043

This one threw me. Initially I had the exact same setup as the previous two sets, but got results that were way worse (like, twice the runtime). Looking at the processor usage in Freenas I noticed that indeed, with 12 threads, the machine was running at 50% capacity. So I tried turning *off* hyperthreading so only 12 threads were exposed to free. This did improve things, but only up to the 8-thread test; with 12 threads it was still much worse than bare metal or ESXi/Ubuntu.

If anyone has any information on this please let me know - it would be very interesting for me to run this under Freenas directly, rather than having to revert to running ESXi, then Freenas as one VM, and Openfoam in another.

Any help much appreciated.

Kai.

wildemam · November 16, 2020, 16:46

For OpenFOAM 8 (Foundation version), users will have to:

1. Comment out the function objects (streamlines and "wallBoundedStreamLines") in the controlDict.

2. Change the include directive to #includeEtc "caseDicts/mesh/generation/meshQualityDict" in the meshQualityDict.

3. Copy the 'surfaceFeatureDict' from the tutorial case, and change the surfaceFeatureExtract call in Allmesh in the base case to "runApplication surfaceFeatures" (line 9).

Then it works. Let's see how my server stands out.

wildemam · November 17, 2020, 13:16

4 x Intel(R) Xeon(R) CPU E5-4657L v2 @ 2.40GHz

128 GB DDR3 1600 MHz
openFoam 8
Ubuntu 20.

# cores Wall time (s):
------------------------
48 77.45
44 77.66
40 77.43
36 77.34
32 77.59
28 78.45
24 79.93
16 89.9
8 133.07
4 245.4
2 652.24
1 27.39

Meshing:
48 real 4m19.655s
44 real 3m43.624s
40 real 3m54.778s
36 real 3m51.182s
32 real 3m48.851s
28 real 3m54.084s
24 real 4m19.289s
16 real 5m46.104s
8 real 7m19.078s
4 real 12m8.124s
2 real 23m45.691s
1 real 0m3.501s

Hitting some ceiling there. I verified that I have 32GB per NUMA node. Any ideas for checking the reason for the bottleneck beyond 24 cores?

flotus1 · November 17, 2020, 14:43

How is the memory populated? 16*16GB?
# dmidecode -t 17
In case you need to find out.
Htop provides a quick and easy way to check which cores are utilized.

wildemam · November 18, 2020, 09:09

Quote:

Originally Posted by flotus1

How is the memory populated? 16*16GB?
# dmidecode -t 17
In case you need to find out.
Htop provides a quick and easy way to check which cores are utilized.

Thanks for your reply Flotus1.

There are 8 x 16 GB x 1600 MHz.

attached at banks:

0
1
12
13
24
25
36
37

I guess I will need to get more RAM.

flotus1 · November 18, 2020, 09:12

Yeah, my math didn't check out. I meant 16x8GB.
Anyway, you would need 16 identical DIMMs to get peak performance with this system. The scaling behavior you got is pretty typical for not having all memory channels populated.

Novel · November 19, 2020, 16:04

We just bought a new workstation for our department. Thanks to this thread we were able to find a good configuration.

The following setup was done:
OpenFOAM was compiled with the flag "-march=znver1". SMT was also switched off, and all processors were set to performance mode using "cpupower frequency-set -g performance", following the HPC Tuning Guide provided by AMD (http://developer.amd.com/wp-content/resources/56420.pdf).

CPU:

2x AMD EPYC 7532 (Zen2-Rome) 32-Core CPU, 200W, 2.4GHz, 256MB L3 Cache, DDR4-3200
RAM:
256GB (16x 16GB) DDR4-3200 DIMM, REG, ECC, 2R

OpenFOAM v7

cores time (s) speedup
1 677,34 1,00
2 363,04 1,87
4 161,42 4,20
6 101,82 6,65
8 77,16 8,78
12 52,28 12,96
16 39,4 17,19
20 32,01 21,16
24 27,31 24,80
28 24,15 28,05
32 21,53 31,46
36 21,32 31,77
40 20,46 33,11
44 18,99 35,67
48 18,12 37,38
52 17,45 38,82
56 17,06 39,70
60 16,5 41,05
64 15,91 42,57

Up to 32 cores the scaling is nearly perfect; beyond that it starts to drop off... Is this caused just by memory bandwidth, or can there be other things causing this drop?
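To put a number on where the scaling drops, here is a small sketch that recomputes speedup (single-core runtime divided by N-core runtime) and parallel efficiency (speedup divided by core count) from a few rows of the table above. The helper names are illustrative, and the comma-decimal parsing is only there because the table uses that notation.

```python
def parse_decimal(text):
    """Parse a number written with a decimal comma, e.g. '677,34'."""
    return float(text.replace(",", "."))


def speedup(t_single, t_parallel):
    """Speedup relative to the single-core run."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, cores):
    """Parallel efficiency: speedup per core (1.0 = ideal scaling)."""
    return speedup(t_single, t_parallel) / cores


# cores -> runtime string, selected rows from the 2x EPYC 7532 table
rows = {1: "677,34", 16: "39,4", 32: "21,53", 48: "18,12", 64: "15,91"}
times = {cores: parse_decimal(text) for cores, text in rows.items()}

for cores, t in times.items():
    print(
        f"{cores:>2} cores: speedup {speedup(times[1], t):6.2f}, "
        f"efficiency {efficiency(times[1], t, cores):.2f}"
    )
```

Efficiency stays close to 1 up to 32 cores and falls to roughly two thirds at 64 cores, which matches the shape described in the post.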

flotus1 · November 19, 2020, 17:12

Any particular reason for the use of znver1 instead of znver2?
Bandwidth will be part of the reason why scaling tapers off. Lower CPU frequency with more busy cores might be another contribution.
But overall, performance looks pretty impressive.

wildemam · November 19, 2020, 21:29

Quote:

Originally Posted by wildemam

4 x Intel(R) Xeon(R) CPU E5-4657L v2 @ 2.40GHz

128 GB DDR3 1600 MHz
openFoam 8
Ubuntu 20.

# cores Wall time (s):
------------------------
48 77.45
44 77.66
40 77.43
36 77.34
32 77.59
28 78.45
24 79.93
16 89.9
8 133.07
4 245.4
2 652.24
1 27.39

Meshing:
48 real 4m19.655s
44 real 3m43.624s
40 real 3m54.778s
36 real 3m51.182s
32 real 3m48.851s
28 real 3m54.084s
24 real 4m19.289s
16 real 5m46.104s
8 real 7m19.078s
4 real 12m8.124s
2 real 23m45.691s
1 real 0m3.501s

Hitting some ceiling there. I verified that I have 32GB per NUMA node. Any ideas for checking the reason for the bottleneck beyond 24 cores?

Just populated the system with 8 more 16GB DDR3 1600MHz RAM modules.

# cores Wall time (s):
------------------------
48 45.04
44 45.62
40 46.08
36 47.52
32 49
28 52.01
24 56.36
16 73.13
8 127.29
4 239.67
2 602.69

So the added RAM made it faster and more scalable. Results are similar to other Xeon processors.

Any recommendations or hints for best practices if I run several simulations on the same machine?

Novel · November 20, 2020, 03:14

Quote:

Originally Posted by flotus1

Any particular reason for the use of znver1 instead of znver2?
Bandwidth will be part of the reason why scaling tapers off. Lower CPU frequency with more busy cores might be another contribution.
But overall, performance looks pretty impressive.

Oops, sorry, we actually did compile it using znver2.

First post (#321) last edited by meshingpumpkins; October 7, 2020 at 08:48. Reason: correction of data

