Best practices on hybrid architecture CPUs (like i9-13900K)
August 29, 2024, 14:55
Best practices on hybrid architecture CPUs (like i9-13900K)
#1
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
Hi all,
I'm looking for ways to speed up my simulations with the hardware I have (i9-13900K, 2x32GB DDR5 6000MHz). This processor has 8 P-cores and 16 E-cores. I've read here that running a job on all 24 cores is nearly the same as running it on just the 8 P-cores (without hyperthreading). I'd like to know what you suggest to optimize my core usage, both for a single job and for multiple concurrent jobs.
Some extra questions:
- How can I pin the MPI job to use only the P or E cores? Better yet, how can I set up a default configuration so that every MPI job I submit uses the 8 P-cores first, and falls back to the E-cores if I submit another one?
- When using the "--cpu-set" flag in mpirun, how do I know the core indices for the P and E cores?
- Noob question: "lscpu" says I have 16 cores and 2 threads per core (32 threads total). I do have 32 threads, but in a different arrangement... Is it blind to the hybrid architecture of P and E cores, or am I missing some setup?
- Another noob question: my memory seems to be running at 4800MHz, but it's 6000MHz memory. Can I simply increase the frequency? Should I be concerned about anything when doing this?
- For long simulations (10+ days running continuously), should I take any special precautions, like not using full RAM speed, fewer cores, etc.?
Thank you all!
August 30, 2024, 14:49
#2
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
I made some progress, but still haven't found all the answers!
Using flags like "--bind-to core" and "--cpu-list" can pin the job to specific cores, and the "--report-bindings" flag confirms that the processes have been properly bound. The P-core IDs are 0-7 and the E-core IDs are 8-23 (with HT off).
However, for some reason the solver runs at full speed (as if the P-cores were used) even when I bind it to the E-cores. I don't know why, but it seems that WSL doesn't have the "privileges" to dictate processor affinity... If I set the affinity manually in Task Manager it works, but when it's set through mpirun flags, Windows seems to take over and decide the affinities itself.
About the lscpu question: it seems WSL doesn't see all cores in the right topology. Turning off HT solved this, and lscpu now sees 24 cores.
About the memory question: it seems to be a matter of overclocking (XMP) the memory. Haven't tried this yet.
Once I manage to force the E-cores to be used, I'll start experimenting with processorWeights to optimize the load on each core type.
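For reference, this is roughly what the working invocation looks like (a sketch assuming Open MPI syntax and an OpenFOAM solver; flag spellings vary between MPI implementations, and the sysfs paths need a reasonably recent kernel and may not be populated under WSL):

    # Which logical CPU IDs are P-cores vs E-cores (hybrid Intel):
    cat /sys/devices/cpu_core/cpus    # e.g. 0-7 with HT off
    cat /sys/devices/cpu_atom/cpus    # e.g. 8-23

    # Cross-check: P-cores report a higher max frequency
    lscpu --extended=CPU,CORE,MAXMHZ

    # Pin an 8-rank job to the P-cores and print the resulting bindings
    # (simpleFoam is just a placeholder solver name):
    mpirun -np 8 --bind-to core --cpu-list 0,1,2,3,4,5,6,7 --report-bindings simpleFoam -parallel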
August 30, 2024, 19:52
#3
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14
You might have a look in this thread:
Intel i9 13900K with 8 channel were are Game Changer for CFD
September 2, 2024, 10:51
#4
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
Yes, I've read it! Lots of useful info there indeed.
So my conclusions are:
1) There is no benefit to using more than 8 cores, as you hit memory channel saturation and/or a bottleneck from the E-cores. I ran a test case and got essentially the same results with 8, 12, 16, and 24 cores (HT on/off).
2) HT on/off doesn't seem to change anything either, maybe because the system is managing things behind the curtain.
3) Load balancing (mesh decomposition biased toward the P-cores) actually worsened the results.
I still have some doubts:
1) Why doesn't decomposing the domain with a bias (say, 2x more mesh elements on the P-cores) work? I'd expect there to be a point where adding the slower E-cores helps at least a little by slightly unloading the P-cores.
2) Simultaneous 4-8 core simulations seem to run at the same speed, even with binding on. I'd expect the ones bound to cores 0-7 to run faster... Is this a limitation of running through WSL?
3) Should I overclock my memory? It's advertised as 6000 but is running at only 4800. Would it cause system instabilities, or physical damage to any component?
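For anyone wanting to try the same biasing: the weights I'm referring to go in system/decomposeParDict. A sketch for the scotch method, giving each of 8 P-core ranks roughly twice the cells of each of 4 E-core ranks (weights are relative; the exact dictionary layout depends on the OpenFOAM version):

    numberOfSubdomains  12;
    method              scotch;

    scotchCoeffs
    {
        // first 8 ranks (P-cores) get ~2x the cells of the last 4 (E-cores)
        processorWeights
        (
            2 2 2 2 2 2 2 2
            1 1 1 1
        );
    }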
September 3, 2024, 01:22
#5
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14
1) It is in theory possible to give the slower cores a smaller domain so that every core is loaded 100%. However, no improvement can occur if the memory is the bottleneck.
2) I do believe that WSL load balances. Maybe do a dual-boot install with Linux?
3) You should definitely overclock your memory. Just up the multiplier and you are probably fine. I have seen experts overclock to 7200.
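If you want to check what the DIMMs are actually running at before and after enabling XMP (this needs bare-metal Linux; WSL generally doesn't expose DMI data):

    # "Speed:" is the rated SPD value; "Configured Memory Speed:" is
    # what the memory controller is actually using right now.
    sudo dmidecode -t memory | grep -i speed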
September 3, 2024, 09:58
#6
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
I increased the memory to 6000 and indeed got almost a 20% improvement. Overclocking it further, beyond the nominal spec, sounds risky to me...
To upgrade my station, is there anything worth doing with this setup, or is it better to save for a completely new one? Something with uniform cores, more memory channels, etc.
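That roughly matches the bandwidth math: 6000/4800 = 1.25, so a memory-bandwidth-bound solver can gain at most ~25% from the XMP profile, and the ~20% I measured is close to that ceiling.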
September 3, 2024, 16:38
#7
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14
September 3, 2024, 17:47
#8
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
I might upgrade this PC in the near future.
Do you think it's better to abandon the hybrid i9-13900K completely and buy a fresh workstation with a more modular setup and processors better suited for CFD? Or would reusing this processor and, say, adding a second one (moving to a two-socket motherboard) be a good choice? In short: can I make this PC better for CFD, or would I be better off getting a whole new workstation?
September 4, 2024, 01:04
#9
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14
Have you actually had slow runtimes?
September 4, 2024, 10:21
#10
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 109
Rep Power: 9
Thank you for your suggestion. Actually it's a PC dedicated to CFD, which I purchased considering only the processor's clock. At the time I was blown away by the 5.6 GHz of the 13900, and I thought I could make use of all 24 cores / 32 threads available, even if with non-linear scaling. Only being able to effectively use 8 cores really fell short of my expectations.
I don't think I have especially slow runtimes. The cavity3D case with 1M cells runs to a simulated time of 0.015 s in:
- 8.14 s on 8 cores (HT on)
- 7.71 s on 16 cores (HT on)
- 7.29 s on 32 cores (HT on)
From your comments in the other post, that seems a good result for 8 cores. For the other decompositions the gains are marginal, maybe within tolerance; I no longer expect any real reduction in runtime from using more than 8 cores on the 13900. But I want to further increase my processing capacity so I can take on more complex projects. Being able to run multiple simulations at once is also appealing.
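(For scale: 8.14/7.29 ≈ 1.12, so quadrupling the thread count bought only ~12%, which again points to memory bandwidth as the limit.)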
Tags |
13900, hyperthreading, mpirun |