Zen 2 Memory Controller

Old   August 9, 2019, 02:21
Default
  #21
Member
 
Yan
Join Date: Dec 2013
Location: Milano
Posts: 42
Rep Power: 12
aparangement is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
On paper, the most obvious changes to the architecture seem like they would be detrimental to performance of NUMA-aware software. I.e. trading in "fast" close memory access for more homogeneous, but overall slower memory access times, with the introduction of a separate I/O die.
But under the hood there is quite a lot that changed for the better. L3 cache sizes were doubled, which should reduce LLC misses and thus negate some of the higher memory latency. And prefetching was improved a lot, to the same effect.
And I was pleasantly surprised to read that Rome can be configured into a "sub-NUMA clustering" (NPS4) where each core only accesses the 2 memory channels closest to it. Leading to a similar NUMA topology as in Naples with 4 nodes per CPU.
https://www.anandtech.com/show/14694...e-epyc-2nd-gen
This will decrease memory latency quite a bit, leading to better performance in NUMA-aware software.
And let's not forget DDR4-3200 versus DDR4-2666. Memory bandwidth is still important, despite all the fuss about memory latency. I wonder when these will become readily available to consumers.
And the prices are pretty spectacular. 1350$ for 24 cores and 2025$ for 32 cores respectively. I am more convinced than ever about an upgrade of my home workstation with 2xEpyc 7301.
Actually I am confused about the "sub-NUMA clustering" (NPS4) setting: unlike Naples, Rome uses a unified I/O die to communicate with memory, so my understanding is that all memory channels should be equally distant from any CCX.

Besides, it is reported that memory WRITE bandwidth is halved on Zen 2 parts with a single CCD (3700X, 3600, etc.). At first I thought this might compromise CFD performance substantially, but Simbelmynė's test shows quite the opposite... So is CFD sensitive only to memory read bandwidth?
aparangement is offline   Reply With Quote

Old   August 9, 2019, 03:25
Default
  #22
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Looking at the schematics of the CPU, you will see that the distance between each memory controller and CCX is not the same. At the very least this leads to different routing and trace lengths. The latency difference is not huge, but it is measurable, and apparently enough for AMD to justify developing this feature.

Quote:
Besides, it is reported that memory WRITE bandwidth is halved on Zen 2 parts with a single CCD (3700X, 3600, etc.)
This is the first time I have read about this. But tbh I have not dived too deep into the nuances of Zen 2 yet. Where did you read that?

Edit: never mind, I just found it.
aparangement likes this.

Last edited by flotus1; August 9, 2019 at 05:54.
flotus1 is offline   Reply With Quote

Old   August 9, 2019, 04:02
Default
  #23
Member
 
Yan
Join Date: Dec 2013
Location: Milano
Posts: 42
Rep Power: 12
aparangement is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Looking at the schematics of the CPU, you will see that the distance between each memory controller and CCX is not the same. At the very least this leads to different routing and trace lengths. The latency difference is not huge, but it is measurable, and apparently enough for AMD to justify developing this feature.


This is the first time I have read about this. But tbh I have not dived too deep into the nuances of Zen 2 yet. Where did you read that?
Thanks for the explanation of NPS4. I can't wait to see real-world CFD tests on the new EPYCs; they look very promising on paper.

About the halved write bandwidth, it is discussed in this thread:

https://forums.anandtech.com/threads...ndrum.2567215/

which links two tests:

https://www.overclock3d.net/reviews/..._x570_review/9
https://www.tweaktown.com/reviews/90...ew/index3.html

Also from techreport.com:

https://techreport.com/review/34672/...us-reviewed/3/

Last edited by aparangement; August 9, 2019 at 05:41.
aparangement is offline   Reply With Quote

Old   August 9, 2019, 13:39
Default
  #24
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Interesting... I guess it remains to be seen how the cores in Epyc Rome CPUs are distributed among the CCDs. For every CPU with less than 256MB of L3 it would theoretically be possible to have CCDs without any active cores on them.

Quote:
So is CFD sensitive only to memory read bandwidth?
In a very simplified way: read is more important than write.
Imagine a CFD code requests a value from memory, e.g. in order to update a fluid cell it requests its neighbor's value. It will get this value, but along with it at least a whole cache line. So more data is read than the single value that was actually needed, i.e. high read bandwidth usage.
Now of course optimized codes make use of this and try to use the other values too. But for general unstructured CFD, using all of them is not possible.
Writing the value back to memory after it was updated requires none of this prefetching, just caching.
Or to put it differently: unstructured CFD codes spend a lot of time reading data from memory that gets evicted from the caches before it is used.
Please don't quote me on this though; what happens inside a CPU when reading from and writing to RAM is way more complicated than in my simplified explanation. And I am by no means a computer scientist.
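To make the access pattern a bit more concrete, here is a toy sketch of a Jacobi-style sweep over an unstructured mesh (purely my own illustration, not code from any real solver):

Code:
// One smoothing sweep over an unstructured mesh stored in CSR form.
// 'neighbors' holds, for each cell, the indices of its face neighbors.
#include <vector>
#include <cstddef>

void smooth(const std::vector<double>& phi_old,        // values from the last iteration
            std::vector<double>& phi_new,              // updated values
            const std::vector<std::size_t>& offsets,   // CSR offsets, size nCells+1
            const std::vector<std::size_t>& neighbors) // concatenated neighbor lists
{
    const std::size_t nCells = phi_old.size();
    for (std::size_t c = 0; c < nCells; ++c) {
        double sum = 0.0;
        const std::size_t nNb = offsets[c + 1] - offsets[c];
        for (std::size_t k = offsets[c]; k < offsets[c + 1]; ++k) {
            // Indirect read: neighbors[k] can point anywhere in phi_old.
            // The CPU pulls in a whole 64-byte cache line (8 doubles), but the
            // code may only use this one value before the line gets evicted.
            sum += phi_old[neighbors[k]];
        }
        // One write per cell, sequential in c: easy to buffer and stream out,
        // so write bandwidth is stressed far less than read bandwidth.
        phi_new[c] = (nNb > 0) ? sum / static_cast<double>(nNb) : phi_old[c];
    }
}
The scattered reads through neighbors[] are what inflate the read traffic; the write-back is a single contiguous stream.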

Edit: not highly surprising, but this confirms that running a Ryzen 3000 CPU without a 1:1 ratio of memory and Infinity Fabric clocks is a waste of time:
https://www.youtube.com/watch?time_c...&v=nugwAOvijHQ
aparangement and Noco like this.
flotus1 is offline   Reply With Quote

Old   August 9, 2019, 17:20
Default
  #25
Member
 
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9
Noco is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Interesting... I guess it remains to be seen how the cores in Epyc Rome CPUs are distributed among the CCDs. For every CPU with less than 256MB of L3 it would theoretically be possible to have CCDs without any active cores on them.


In a very simplified way: read is more important than write.
Imagine a CFD code requests a value from memory, e.g. in order to update a fluid cell it requests its neighbor's value. It will get this value, but along with it at least a whole cache line. So more data is read than the single value that was actually needed, i.e. high read bandwidth usage.
Now of course optimized codes make use of this and try to use the other values too. But for general unstructured CFD, using all of them is not possible.
Writing the value back to memory after it was updated requires none of this prefetching, just caching.
Or to put it differently: unstructured CFD codes spend a lot of time reading data from memory that gets evicted from the caches before it is used.
Please don't quote me on this though; what happens inside a CPU when reading from and writing to RAM is way more complicated than in my simplified explanation. And I am by no means a computer scientist.

Edit: not highly surprising, but this confirms that running a Ryzen 3000 CPU without a 1:1 ratio of memory and Infinity Fabric clocks is a waste of time:
https://www.youtube.com/watch?time_c...&v=nugwAOvijHQ
Very interesting!

And what do you think about AVX-512 in Zen 2?

https://www.reddit.com/r/Amd/comment...rmed_by_cpuid/

https://www.kitguru.net/components/c...-cores-for-7k/

Will it actually help a lot in CFD?

I have played with the AVX settings in the BIOS a few times, but I never got a clear picture of whether they actually reduce the calculation time or not.
Noco is offline   Reply With Quote

Old   August 9, 2019, 17:38
Default
  #26
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
As far as I am aware, Zen 2 does not have AVX-512. They improved their AVX2 implementation, which is now on par with Intel's.
Lack of AVX-512 is no big deal in my opinion, especially not with a focus on CFD. It can help with some compute-heavy problems, but the benefit for typical CFD applications is negligible. In the past you may have stumbled upon publications by Intel, in partnership with Ansys, that made it look like AVX-512 was responsible for xx% of the performance uplift over the previous CPU generation. But that is mostly marketing: those CPUs also got over 50% more memory bandwidth, which accounted for the majority of the generational improvement.
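To put rough numbers on it, a STREAM-triad-like loop is about as memory-bound as the typical inner loops of a CFD solver (just my own back-of-the-envelope sketch, not taken from any of those publications):

Code:
#include <vector>
#include <cstddef>

// a[i] = b[i] + s*c[i]: 2 flops per element versus at least 24 bytes of memory
// traffic (two reads plus one write), i.e. well under 0.1 flop per byte.
// The loop saturates memory bandwidth long before it saturates the SIMD units,
// so doubling the vector width (AVX2 -> AVX-512) barely moves the needle.
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double s)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[i] + s * c[i];
}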
Noco and Wobbler like this.
flotus1 is offline   Reply With Quote

Old   August 9, 2019, 18:39
Default
  #27
Member
 
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9
Noco is on a distinguished road
We rebuild our 3D model after each iteration (Ansys CFX + CF Turbo). For about 30% of our total calculation time (around 100 hours), the computer uses only 2-4 cores to rebuild the 3D model with new angles etc.; for the remaining 70% it uses all 2x16 cores (we built the system on dual 7301, based on your positive experience).

In this case, could this floating-point unit in the CPU actually help reduce the time spent rebuilding the 3D model?

https://techreport.com/news/34242/am...d-128-threads/

AMD also addressed a major competitive shortcoming of the Zen architecture for high-performance computing applications. The first Zen cores used 128-bit-wide registers to execute SIMD instructions, and in the case of executing 256-bit-wide AVX2 instructions, each Zen floating-point unit had to shoulder half of the workload. Compared to Intel's Skylake CPUs (for just one example), which have two 256-bit-wide SIMD execution units capable of independent operation, Ryzen CPUs offered half the throughput for floating-point and integer SIMD instructions.

Zen 2 addresses this shortcoming by doubling each core's SIMD register width to 256 bits. The floating-point side of the Zen 2 core has two 256-bit floating-point add units and two floating-point multiply units that can presumably be yoked together to perform two fused multiply-add operations simultaneously.

That capability would bring the Zen 2 core on par with the Skylake microarchitecture for SIMD throughput (albeit not the Skylake Server core, which boasts even wider data paths and 512-bit-wide SIMD units to support AVX-512 instructions.) To feed those 256-bit-wide execution engines, AMD also widened the load-store unit, load data path, and floating-point register file to support 256-bit chunks of data.
Noco is offline   Reply With Quote

Old   August 10, 2019, 06:55
Default
  #28
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
Quote:
In this case this floating point controller in CPU can actually help to reduce the time on rebuilding of 3D model?
Most likely not. If this part of the code is poorly parallelized, it is safe to assume that its vectorization is not great either.
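The wider FMA units mainly pay off in dense, cache-resident kernels, roughly like this toy example (my own sketch, nothing to do with how CFX works internally):

Code:
#include <array>
#include <cstddef>

constexpr std::size_t N = 64; // small enough that all three matrices fit in L2

// C += A*B as a plain triple loop; compilers turn the inner loop into 256-bit
// FMAs. Roughly 2*N^3 flops over ~3*N^2*8 bytes of data, i.e. several flops
// per byte, so execution width rather than memory bandwidth is the limit here.
void matmul_acc(const std::array<double, N * N>& A,
                const std::array<double, N * N>& B,
                std::array<double, N * N>& C)
{
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const double a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j]; // fused multiply-add
        }
}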
Meshing and remeshing, in contrast, do a lot of memory (re-)allocation and access memory in a non-predictable way. For these tasks, machines with fewer NUMA domains and lower memory latency, along with high single-threaded performance, are better suited.
I can observe this with my own, mostly single-threaded grid generator. It runs comparably to Intel CPUs as long as the total memory consumption does not exceed the size of a single NUMA node. Beyond that, performance drops significantly on my Epyc 7301 workstation. It is no big issue for me because I don't need remeshing during a simulation.
For your application, an upgrade to Epyc 7002, with sub-NUMA clustering disabled, could pay off.
Noco likes this.
flotus1 is offline   Reply With Quote

Old   August 10, 2019, 07:27
Default
  #29
Member
 
Ivan
Join Date: Oct 2017
Location: 3rd planet
Posts: 34
Rep Power: 9
Noco is on a distinguished road
It is true: in some tasks a single 7980XE or TR 1950X is just 50% slower than the dual 7301, simply because of the big difference in 3D model remeshing time per iteration (the 7980XE does it much faster).

Anyway, we feel that reducing this time is one of the main problems for us now. We are trying to evaluate what the actual time difference will be for our tasks between dual 7301 and dual 7302, with all the new controllers and architecture of the 7002 CPUs. The dependence of run time on the number of cores, clock speed, memory channels, etc. is more or less understood.
Noco is offline   Reply With Quote

Old   August 12, 2019, 05:33
Thumbs up
  #30
New Member
 
Select One
Join Date: Aug 2019
Posts: 2
Rep Power: 0
Wobbler is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
I am more convinced than ever about an upgrade of my home workstation with 2xEpyc 7301.
I anxiously await your benchmarks on Rome!!!
Wobbler is offline   Reply With Quote

Old   August 12, 2019, 05:51
Default
  #31
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49
flotus1 has a spectacular aura about
It will take some time. Availability of CPUs and boards in the retail market seems to be a similar issue as it was with Naples. And DDR4-3200 RDIMMs are not that easy to source either, at least in Europe.
flotus1 is offline   Reply With Quote
