How many mesh elements per second per CPU core ?

zx-81 · February 5, 2017, 09:23

Hi,

in order to compare the performance of the many different existing CFD solver software in an approximately standardized/objective way it would be helpful to know how many mesh elements (grid points) per second per CPU core a particular software is able to achive for a single time step of a minimal standard reference problem (i.e. Taylor green vortex, single phase incompressible flow)

(For unstructured or adaptively refined meshes, "number of mesh elements" should of course refer to the number of actually computed elements, not the number of nodes of an equivalent regular non-adaptive grid. (This way, a good comparison of the 'efficiency-of-adaptivity' of AMR(Adaptive Mesh refinement)-based codes would be possible.))

Does anyone know if/where this figure is available for the commonly used software packages ?
Any comparison papers that compare software with respect to this criterion ?

(Background for the question: The optimized Fluid solver I'm currently developping solves about 1e5 velocity grid cells per second per CPU core and I'm interested how this compares to the industry state of the art software)

mprinkey · February 6, 2017, 00:19

I understand your goal, but I have to tell you that I don't think you can boil performance down to a single number like this. If the CFD codes you are comparing use all the same algorithms, you may be able to compare "cell updates" for say, an explicit RK updates. If the CFD codes are not all density based or not all using the same Riemann solvers/linear equation solvers, this comparison is going to be hard to make sense of. The costs of a W-cycle MG sweep vs. one iteration of SOR are not at all comparable either in CPU time or in solution improvement.

Even for the same algorithm, there is going to be large variability in performance based on problem size (does it all or a significant working set fit in L2/L3 cache, what happens as you move from 1 to many cores). As a result, most CFD benchmarking tends to be holistic--pick a geometry and fixed mesh and solve to some precision or advance to some final time.

Simple geometric problems (lid driven cavity, backward/forward facing step, isotropic turbulence decay, Rayleigh-Taylor instability, vortex shedding behind a cube/cylinder) have been explored for years and you can probably find performance/scalability data from them to compare to your results. Beyond that, I think you will find difficulty. Once geometries get complicated, you need to find/handle the exact mesh that they used. Or you need to get their code and make it work with your mesh. That is a lot of headaches. And in the end, I don't think that I would find "cell updates per second" as a compelling indicator of relative performance. But show me relative timings of Fluent or Star or Euler3D or OpenFOAM results vs your code for a few relevant problems and sizes for a range of core count--then you have my attention.

sbaffini · February 7, 2017, 13:22

To summarize what Michael wrote: it is of low relevance how much fast you advance your solution with respect to the direction in which you advance and by how much you advance in that specific direction.

3 examples:

1) the fastest update of a variable is var_new = var_old + 0.0 at each iteration. you're just going nowhere with your solution, but you're certainly doing it very fast.

2) with explicit methods you tipically are very very fast in updating but... how fast are you moving toward the steady state solution at CFL = O(1)? not much i guess

3) if you have a bug of some sort, no matter what, implicit or explicit, your iterations are probably going to diverge. So even if very fast, you are going in the wrong direction.

Besides this, absolute numbers are also affected by the specific hardware, including the number of processors. Not to mention compiler aspects.

FMDenaro · February 7, 2017, 13:46

I agree with the above scripts... I want to add a suggestion. If you are solving the compressible form then the computational cost in the time update has meaning. Conversely, when solving the incompressible form you should take into account that the computational cost to be evaluated largely depends on the pressure equation, that is on the algebric solver of the discrete Poisson equation.

zx-81 · February 7, 2017, 18:54

Thanks for the answers so far that all contain valid points.

I still think it could make sense to have a "number of variables solved per second per CPU" as a very basic, rough reference number to compare different solves/codes for exactly defined benchmark problems.
Consider the follwing example:

"Compute 256^3 points of the 3D velocity field of the standard Taylor-Green-Vortex at Re=1600 and at nondimensionalized point in time t=10"

Then divide the wall-clock time necessary to solve for this well defined 3*256^3 unknowns by the number of CPU cores used.

Now we would have a very basic metric to be able to compare solver performance. Of course this would be not perfectly objective for many reasons but still "better than nothing" to compare the performance & scalability potential of the many different existing schemes and codes against.

Unfortunately, the existing literature I searched so far on the web does not seem to contain much information about standardized performance metrics/comparisons of CFD solvers which seems strange given the vast number of competing schemes & codes.

sbaffini · February 7, 2017, 21:02

I tested my code and it also gave me 1.2e5 variable updates per core per second. Does this information tells you anything?

mprinkey · February 7, 2017, 22:28

Quote:

Originally Posted by sbaffini

I tested my code and it also gave me 1.2e5 variable updates per core per second. Does this information tells you anything?

Yes. That one of you copied the other's code.

mprinkey · February 10, 2017, 11:52

If I may, I'd like to add something constructive to this discussion instead of just being a dismissive jerk. The spirit of the OP's question is very close to something that I've spent a lot of time working on. Two of you have codes that are performing O(1e5) cell updates per core per second. Modern CPUs have AVX2 with fused-multiply-add operations. At 2-3 GHz, a Haswell/Broadwell running AVX2 code can perform 16 DP FLOP per clock cycle. So, call it O(1e10). The better question, in my mind, is this...how many actual FLOPS does a cell update really take and how does that compare to the theoretical peak? For an explict RK stage, I suspect this is O(1e3). So, with perfect memory access, we should be able to do more than O(1e7) updates per second on a single Haswell core.

I bring this up because I spent a lot of the 2000s trying to make things work fast with MPI across various networks. That was hard, but the on-node performance was pretty easily attainable. Now, that situation is completely reversed, and attaining single-core and multi-core performance on a single node is a genuine challenge, and it is only getting worse as core counts rise. Are any of you also in the weeds with me trying to optimize at this level? Obviously, we are up against memory and cache throughput limitations, and in my opinion, we have to look at significant changes both to our storage order/access scheduling (like wavefront algorithms, blocking, etc) but also at our basic solver/CFD approximation algorithms.

Here are a few examples: Using DG is a good strategy to make computations more accurate and more computationally intensive (so, you can get high-order results for roughly the same CPU time as low order FV). I'm looking at cell-sorting based on the number of neighbors to better load up four-wide AVX2 instructions. I then sort those based on locality (using space-filling curves) to improve cache behavior. Unstructured CFD is just a really hard problem to fit on a multi-core highly vectorized CPU with comparably slow memory interfaces.

What else are you doing or investigating to try to make per-core and per-node performance better?

sbaffini · February 10, 2017, 13:13

Dear Michael,

as you might have noticed from some of my latest posts, I am currently working on a CFD code which I'm developing from scratch. You might also remember that I am working on the preconditioned, density-based formulation by Weiss & Smith. I generate my own anisotropic cartesian grids (like snappyhexmesh but... without the snapping) and use the immersed boundary method but, nonetheless, the code is actually unstructured and I use such special information only at preprocessing (where i can be very fast as i do not have to read mesh nodes).

The aspect i am working on right now is the whole code restructuring in blocks. E.g., I previously had a Green-Gauss cell based gradient routine working on scalars, now it works for variables organized in blocks, such as vars(nvar,ncells). The same, of course, for all the other aspects as well (limiting, bc, fluxes, matrix solver, monitors, etc.). This comes out very naturally from a density based solver and has no particular overhead for scalars (at least in Fortran). The only overhead at the moment (with respect to scalars) is for output (e.g., for visualization), which I still write variable by variable at the moment (however, I expect to be able to reduce this by using some trick with the indexing in MPI_IO).

Another thing i'm working on right now is the separation between sends and recvs with non blocking calls. However, while it is straighforward to substitute the blocking calls with the non-blocking ones, finding room for computations between them in a way which is general enough is not simple for me at the moment.

In the past I also worked on cell reordering via Hilbert SFC (you actually explained me what they are), which is now my default cell ordering as it costs me nothing during the grid generation.

However, especially because of the blocking of variables, I'm now starting exploring the face ordering. As most of my algorithms proceed by faces, and i never carefully looked at their ordering, I suspect that jumping from a face to another which are not spatially close might still make me incurr in cache misses between two consecutive faces. Basically, at the moment, I have no idea what's going on with my face ordering

, I just know that the two adjacent cells will be pretty close (within the bounds determined by the HSFC), but not where will be the next two adjacent cells for the next face.

Edit: the 1.2e5 figure was not produced under any control. A more correct one would be: 1.35e5 variables update per second on a single core (of this proc https://ark.intel.com/products/78939...up-to-3_90-GHz), with a preconditioned density based algorithm (the one above) working on an unstructured grid in explicit RK mode (each update actually is 5 stages, so multiply the previous figure by 5) with non uniform time steps and all the equation terms in place (2nd order, iterative green-gauss cell based with 3 iterations and venkatakrishnan directional limiting). The code was compiled with gfortran 4.8.4 with a simple -O2 compilation flag. The OS is Ubuntu 14.04 running under a VM hosted by Windows 7. The machine is a Dell M4800 with 32 GB RAM. Some immersed boundary overhead might be present (for regular boundaries the adjacent cell center and its variables are just there, for the immersed boundary cell faces I have just a point where I first need to interpolate the variables using a cloud of nearby cell centers, around 30 in 3D). The code was run under a python GUI visualizing its log and residual curves for the 5 main equations.

2nd Edit: the variable updates per second go up to 1.75e5 if I exclude the gradient limiting, 2e5 if I also reduce the gradient iterations from 3 to just 1.

mprinkey · February 10, 2017, 14:16

Interesting stuff.

I'd look into ordering faces just as you do cells, using the face centroid location -> SPC index and sorting on the index. SPC is never perfect, but it is usually pretty good 99.(99?)% of the time. I think you are going to be in good shape if you number your faces based on location. Certainly better than lexical ordering or whatever random way your input mesh may have them enumerated. Another approach could just be greedy ordering based on the SPC ordered cells. Just loop through the faces on each cell in order and add them to the face list. Mark each face as you add them. Then only add unmarked faces. That will get your faces close to cells and the cells are already close to each other. My guess is that neither approach will be a clear winner, but both should be good. I'm sorting everything with SFC because I am too lazy to try new things. 8)

Also, I think that the normal loopOverFaces() concept may be limiting. I've seen in everywhere...Fluent, OpenFOAM. It always looks like:

forInteriorFaces(face)
{
flux = someFluxComputation(...);
cellData[cellPlusSide[face]] += flux;
cellData[cellMinusSide[face]] -= flux;
}

So, this is a loop over all faces between active cell pairs. It computes a flux using data from the face and maybe the cells on the plus/minus sides of the face. Then it updates the cell values on each side using that flux. That would be for an explicit update, but an implicit matrix formation would += and -= the center coefficient and rhs in the same way. Those indirect writes to memory associated with the cellData updates are (in my experience) a big problem. Compilers can't be sure how to vectorize that code because they can't be sure when the cellPlusSide and cellMinusSide are going to show up nearby each other and clobber each other. If you've done face/cell ordering well, they will likely be close-by pretty often. And vectorized operations (by definition) need to be independent so the code will fall back to non-vector ops.

For example, if face 59 and it has cellPlus = 23, cellMinus = 24. And then face 60 has cellPlus = 24, cellMinus = 34, the +=/-= operations trying to update data in cell 24 will clobber each other if you try to do them at the same time, so the operations can't be vectorized.

To avoid these problems, I've been looking at schemes that follow some basic rules:

If you are updating cell values, do it in cell loops with an =, not +=,...
If you are updating face values, do it in face loops with an =, not a +=,...
Only reference non-cell values in cell loops as reads.
Only reference non-face values in face loops as reads.
Avoid overwriting any field that you just read from.

Compilers/CPUs seem to do a good job of vectorizing if you follow this process..you can even have the compile auto-flush the writes when you follow this protocol opening up room for read caches because you want to NOT have that data reused. You only want read-in data to be reused. I should note that this can increase the number of loops you have in total, but it does fit in nicely with RK or timestepping scheme where you are reading from old-time/stage fields and writing to new-time/stage fields. I can't swear that this is the best way, but I as feel my way along, these guidelines seems to help.

One final point on MPI nonblocking calls. On a given PE, I think off three types of cells: Interior, Exported, and Ghost. Ghost cells are cells that belong to neighbor PEs and my PE doesn't update them except with RECVs. Interior cells are cells that do not have any Ghost cells as neighbors. Exported cells are cells that have one or more Ghost cell neighbors and hence have to be exported to one or more neighbor PEs.. Our PE updates values in interior and exported cells. So, if you loop over the exported cells first and update their values, then issue the nonblocking SEND/RECV sending updated exported data and recving updated ghost data, and then loop over the interior cells, and finalize the SEND/RECVs afterward, you can get a fair amount of work done during communication. Of course, it is best if you have a few 100k cells in the interior cell list to give the network more time, but I think that is the best that you can do. That is certainly the best I know how to do.

Edit: Paolo, it looks like you are doing some nice work there. Congrats.

sbaffini · February 10, 2017, 14:58

Quote:

Originally Posted by mprinkey

So, if you loop over the exported cells first and update their values, then issue the nonblocking SEND/RECV sending updated exported data and recving updated ghost data, and then loop over the interior cells, and finalize the SEND/RECVs afterward, you can get a fair amount of work done during communication.

I was thinking about doing the exact opposite actually:

1) Issue the IRECV for the ghosts
2) Issue the ISEND for the exported
3) Update the interiors
4) Finalize ISEND/IRECV
5) Update the exported

would this still work? It's just that it either requires renumbering the cells (again!!!), or carrying over cell lists (not really an option here). Two things I am not yet in the mood of doing (also because of some idiosyncrasies in the way I read my grid in parallel and determine the ghosts).

For what concerns the read/write issue, I am afraid that, at the moment, i am packing too much stuff together in structs (to the point that some routines have potential aliasing) and the Fortran compiler is somehow suffering from this. Otherwise, the INTENT of Fortran variables should be there also for this.

Edit: in my case, the face ordering suffers from the the way the grid is built, as they keep the numbering obtained during the construction (which, typically, tends to separate faces originally close while augmenting the grid), while the cells are ordered after the construction.

mprinkey · February 10, 2017, 15:33

Quote:

Originally Posted by sbaffini

I was thinking about doing the exact opposite actually:

1) Issue the IRECV for the ghosts
2) Issue the ISEND for the exported
3) Update the interiors
4) Finalize ISEND/IRECV
5) Update the exported

would this still work? It's just that it either requires renumbering the cells (again!!!), or carrying over cell lists (not really an option here). Two things I am not yet in the mood of doing (also because of some idiosyncrasies in the way I read my grid in parallel and determine the ghosts).

For what concerns the read/write issue, I am afraid that, at the moment, i am packing too much stuff together in structs (to the point that some routines have potential aliasing) and the Fortran compiler is somehow suffering from this. Otherwise, the INTENT of Fortran variables should be there also for this.

Yeah, your way works too. In my version, you can actually delay the Finalize until you next need to access the ghost layer data for that updated field. My philosophy is that at the end of a code section, I want to know that all of the data that I updated is consistent across all of the PEs, at least when coding initially. Then I can optimize a bit by sliding that finalize further down through the code. Maybe I just updated density and now I am finding/limiting the gradient of velocity...that finalize for density can go way below some of that code too. But, I find it is better to have a touchstone code with everything consistent that I can revert back to if I think there is a problem. Some of this is style, but it is style informed by experience. 8)

And, yes, unfortunately, renumbering comes back again. I worked on Fluent for a long time, so I am used to thinking about cell threads and face threads. For me, there is not just one cell list or one face list. I have groups of lists of each. So, I have exportedSixNeighbor, exportedFiveNeighbor, exportedFourNeighbor lists and similar three for interior cells. Ordering by neighbor count seems to be important. Ordering interior vs exported I know is important for MPI latency hiding. The good thing to remember that at the SPC ordering still "persists" if you just group the SPC-sorted list into subsets. Each sub group you build will still exhibit about the best locality you can muster. As with the next topic, I think much of the performance is found in the bookkeeping.

And if you haven't already, you should really read up on AofS versus SofA. (Arrays of Structs/Structs of Arrays) This is a huge topic touching on exactly your concerns about shoving all of your fields into structs. How those data are grouped and accessed can dramatically impact your code performance. I am targeting Intel MIC processors like Knights Landing (and maybe GPUs) and AofS/SofA and tweaking to encourage vectorization are absolutely critical considerations. I don't have it figured out, but I know that performance on near-term and future platforms is going to be driven by these considerations.

zx-81 · February 10, 2017, 17:45

Quote:

Originally Posted by mprinkey

Unstructured CFD is just a really hard problem to fit on a multi-core highly vectorized CPU with comparably slow memory interfaces.

Yes that's exactly the point. When studying most publications about efficiency of schemes, especially for adaptive ones, we usually get an idea only for the algorithmic complexity (in big-O() notation) that abstracts away any constant factor. But in practice it is often precisely the constant factor of "implementation efficiency" that makes the difference.

An example: The popular (excellent) Gerris flow solver (http://gfs.sourceforge.net/wiki/index.php/Main_Page) uses an octree-based adaptive mesh refinement that can (for geometrically sparse problems) reduce the number of actually computed cells easily by a factor of five when compared to an equivalent number of cells of a regular structured grid.
So we have (for a 1024^3 problem domain): A non-adaptive scheme needes to solve for all 1024^3=1e9 variables whereas the adaptive scheme needs only a fraction, say (1/5)*1024^3=5e8 vriables to compute for solving the same problem.

Unfortunately, the book keeping overhead of adaptive data structures is both large and not modern-CPU (SIMD,AVX etc.) friendly.
In my programming experience it would not be unusual to find a performance penalty of factor five for a fully adaptive octree code when compared to an optimized regular grid code that taks fully advantage of all SIMD/AVX lanes and spatial coherence of regular grid memory access patterns.

So in this example, the implementation overhead of the adaptive scheme would outweigh the theoretical advantge offered by the adaptivity of the algorithm.

The metric of "actually commputed variables per second per core" would adress both the algorithmic performance and implementation efficiency.

In the case of adaptive solvers it could be used to define a measure of "break-even sparseness" i.e. the degree of geometrically sparseness that a given problem should have in order for a particular adaptive code to become more efficient than a regular non-adaptive code.

arjun · February 10, 2017, 21:20

As far as my experience is concerned, in incompressible flow case, all this discussion is meaningless as the most dominant part is solving linear system for pressure and its efficiency goes down very much as problem size increases. As you go above 100 million sizes then this solver time could be as much as 90% of the simulation time and at that point all this time savings in parrallel exchange here and there is pretty small.

Quote:

Originally Posted by zx-81

So we have (for a 1024^3 problem domain): A non-adaptive scheme needes to solve for all 1024^3=1e9 variables whereas the adaptive scheme needs only a fraction, say (1/5)*1024^3=5e8 vriables to compute for solving the same problem.
.

If I have to tell you that in incompressible case your cartesian 1024^3 could run much more efficiently than 5e8 octree based case would you believe. (I mean non uniform cartesian mesh with SiMPLE method). In fact i have patent about it

http://patents.justia.com/patent/20120296616

In this solver the linear system converges to 1E-4 tolerance in 3 iterations and that 3 iterations is fixed. We used 576 processors to run cases between 1billion to 3 billion cell counts.
That time I estimated that i was 50 or so times more efficient than AMG based unstructured solvers.

FMDenaro · February 11, 2017, 04:54

Yes, for incompressible flow the framework is totally different...

I admit that I don't have much passion for this information technology aspect of the CFD and I was frustrated working on HPC for solving incompressible flows.
I worked with collegues who are specialized in the topic of algebric system solvers and HPC and I tried to solve the Poisson equation using many modern solvers. Despite that, I always found that my old, simple, black-red coloured SOR with OpenMP directives was the best in terms of CPU time for my LES code.

FMDenaro · February 11, 2017, 04:56

Quote:

Originally Posted by arjun

In this solver the linear system converges to 1E-4 tolerance in 3 iterations and that 3 iterations is fixed.

What if you use a lower threshold, at least 10^-7?

zx-81 · February 11, 2017, 06:19

Quote:

Originally Posted by arjun

As far as my experience is concerned, in incompressible flow case, all this discussion is meaningless as the most dominant part is solving linear system for pressure

Not necessarily, if multigrid methods are used for the pressure solve.

Here is the top-level breakdown of CPU time consumption for a complete time step of a one billlion grid-point (1024^3) simulation of the solver introduced here: Newly devloped hi-res fluid solver: need advice for validation:

solvePressure() 45%
applyPressure() 5%
solveAdvection() 40%
others 10%

This solver also uses an adaptive mesh but not based on octree but on a more modern-CPU-friendly AMR algorithm. In my opinon the key for good overall solver performance is to implement the main mesh/grid data structure in a way such that it can serve the two goals of adaptive refinement *and* serve as a multigrid hierarchy for the linear pressure solve at the same time.

arjun · February 11, 2017, 07:36

Quote:

Originally Posted by zx-81

Not necessarily, if multigrid methods are used for the pressure solve.

Here is the top-level breakdown of CPU time consumption for a complete time step of a one billlion grid-point (1024^3) simulation of the solver introduced here: Newly devloped hi-res fluid solver: need advice for validation:

solvePressure() 45%
applyPressure() 5%
solveAdvection() 40%
others 10%

I am talking about multigrid. This is one setting where you have cartesian type meshes and skew and aspect ratio are pretty limited. (45% is already pressure solve here). The text book efficiency of multigrid as grid sizes increase only remain with good meshes. Even 1 cell out of billion with screwed up aspect ratio and skew can kill this efficiency.

The reason why I said this discussion is meaningless is because what you learn from one situation can not extrapolate to another situation and settings.
You can measure all the performance etc you want it it has no use other than what you solve for yourself (that means the moment you take tetrahedral meshes and apply the same solver your measurements all go out of window).

arjun · February 11, 2017, 07:40

Quote:

Originally Posted by FMDenaro

What if you use a lower threshold, at least 10^-7?

We can , we will need 1 or 2 more iterations.

In this solver the we use FFT based poisson solver for pressure. Which you get from explicit type solvers (this you know very well).
But since explicit solvers are courant limited, we used SIMPLE but somehow made use of FFT based direct solvers in background. (Avoided AMG). This is this patent about.

We compared the efficiency with and without AMG actually for this patent. Figures shall be part of patent (if i remember correctly).

zx-81 · February 11, 2017, 08:29

Quote:

Originally Posted by arjun

The text book efficiency of multigrid as grid sizes increase only remain with good meshes. Even 1 cell out of billion with screwed up aspect ratio and skew can kill this efficiency.

Yes. that's why we use adaptive refinement of a cartesian mesh. That gives us at the same time both the regularity & efficiency of cartesian cell layout and the support of arbitrarily skewed/complex model geometry through the adative resolution refinement.

Simple example: think of a very 'anisotropic' cell with dimensions 8x1x1 centimeters at resolution level zero.
After three steps of adaptive refinement, the same 'irregular' cell is represented as a row of eight perfectly regular 1x1x1 cells.
This way, complex geometries and topologies can be modelled as hierarchies of simple regular cartesian grids.

The remainig challenge is to

1. Take care that the same multiscale structure that is used for adaptive refinement can also be used as the multigrid hierarchy of the linear pressure solver.

2. Implement the multi-scale grid in a modern-CPU/SIMD-friendly way with extremely low bookkeeping overhead in order not to lose too much of the efficiency compared to a completely non-adaptive cartesian grid

February 5, 2017, 09:23	How many mesh elements per second per CPU core ?	#1
zx-81 Member Bernd Join Date: Jul 2012 Posts: 42 Rep Power: 14	Hi, in order to compare the performance of the many different existing CFD solver software in an approximately standardized/objective way it would be helpful to know how many mesh elements (grid points) per second per CPU core a particular software is able to achive for a single time step of a minimal standard reference problem (i.e. Taylor green vortex, single phase incompressible flow) (For unstructured or adaptively refined meshes, "number of mesh elements" should of course refer to the number of actually computed elements, not the number of nodes of an equivalent regular non-adaptive grid. (This way, a good comparison of the 'efficiency-of-adaptivity' of AMR(Adaptive Mesh refinement)-based codes would be possible.)) Does anyone know if/where this figure is available for the commonly used software packages ? Any comparison papers that compare software with respect to this criterion ? (Background for the question: The optimized Fluid solver I'm currently developping solves about 1e5 velocity grid cells per second per CPU core and I'm interested how this compares to the industry state of the art software)

February 6, 2017, 00:19		#2
mprinkey Senior Member Michael Prinkey Join Date: Mar 2009 Location: Pittsburgh PA Posts: 363 Rep Power: 25	I understand your goal, but I have to tell you that I don't think you can boil performance down to a single number like this. If the CFD codes you are comparing use all the same algorithms, you may be able to compare "cell updates" for say, an explicit RK updates. If the CFD codes are not all density based or not all using the same Riemann solvers/linear equation solvers, this comparison is going to be hard to make sense of. The costs of a W-cycle MG sweep vs. one iteration of SOR are not at all comparable either in CPU time or in solution improvement. Even for the same algorithm, there is going to be large variability in performance based on problem size (does it all or a significant working set fit in L2/L3 cache, what happens as you move from 1 to many cores). As a result, most CFD benchmarking tends to be holistic--pick a geometry and fixed mesh and solve to some precision or advance to some final time. Simple geometric problems (lid driven cavity, backward/forward facing step, isotropic turbulence decay, Rayleigh-Taylor instability, vortex shedding behind a cube/cylinder) have been explored for years and you can probably find performance/scalability data from them to compare to your results. Beyond that, I think you will find difficulty. Once geometries get complicated, you need to find/handle the exact mesh that they used. Or you need to get their code and make it work with your mesh. That is a lot of headaches. And in the end, I don't think that I would find "cell updates per second" as a compelling indicator of relative performance. But show me relative timings of Fluent or Star or Euler3D or OpenFOAM results vs your code for a few relevant problems and sizes for a range of core count--then you have my attention. sbaffini, FMDenaro and flotus1 like this.

February 10, 2017, 11:52		#8
mprinkey Senior Member Michael Prinkey Join Date: Mar 2009 Location: Pittsburgh PA Posts: 363 Rep Power: 25	If I may, I'd like to add something constructive to this discussion instead of just being a dismissive jerk. The spirit of the OP's question is very close to something that I've spent a lot of time working on. Two of you have codes that are performing O(1e5) cell updates per core per second. Modern CPUs have AVX2 with fused-multiply-add operations. At 2-3 GHz, a Haswell/Broadwell running AVX2 code can perform 16 DP FLOP per clock cycle. So, call it O(1e10). The better question, in my mind, is this...how many actual FLOPS does a cell update really take and how does that compare to the theoretical peak? For an explict RK stage, I suspect this is O(1e3). So, with perfect memory access, we should be able to do more than O(1e7) updates per second on a single Haswell core. I bring this up because I spent a lot of the 2000s trying to make things work fast with MPI across various networks. That was hard, but the on-node performance was pretty easily attainable. Now, that situation is completely reversed, and attaining single-core and multi-core performance on a single node is a genuine challenge, and it is only getting worse as core counts rise. Are any of you also in the weeds with me trying to optimize at this level? Obviously, we are up against memory and cache throughput limitations, and in my opinion, we have to look at significant changes both to our storage order/access scheduling (like wavefront algorithms, blocking, etc) but also at our basic solver/CFD approximation algorithms. Here are a few examples: Using DG is a good strategy to make computations more accurate and more computationally intensive (so, you can get high-order results for roughly the same CPU time as low order FV). I'm looking at cell-sorting based on the number of neighbors to better load up four-wide AVX2 instructions. I then sort those based on locality (using space-filling curves) to improve cache behavior. Unstructured CFD is just a really hard problem to fit on a multi-core highly vectorized CPU with comparably slow memory interfaces. What else are you doing or investigating to try to make per-core and per-node performance better? sbaffini likes this.

February 10, 2017, 13:13		#9
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,191 Blog Entries: 29 Rep Power: 39	Dear Michael, as you might have noticed from some of my latest posts, I am currently working on a CFD code which I'm developing from scratch. You might also remember that I am working on the preconditioned, density-based formulation by Weiss & Smith. I generate my own anisotropic cartesian grids (like snappyhexmesh but... without the snapping) and use the immersed boundary method but, nonetheless, the code is actually unstructured and I use such special information only at preprocessing (where i can be very fast as i do not have to read mesh nodes). The aspect i am working on right now is the whole code restructuring in blocks. E.g., I previously had a Green-Gauss cell based gradient routine working on scalars, now it works for variables organized in blocks, such as vars(nvar,ncells). The same, of course, for all the other aspects as well (limiting, bc, fluxes, matrix solver, monitors, etc.). This comes out very naturally from a density based solver and has no particular overhead for scalars (at least in Fortran). The only overhead at the moment (with respect to scalars) is for output (e.g., for visualization), which I still write variable by variable at the moment (however, I expect to be able to reduce this by using some trick with the indexing in MPI_IO). Another thing i'm working on right now is the separation between sends and recvs with non blocking calls. However, while it is straighforward to substitute the blocking calls with the non-blocking ones, finding room for computations between them in a way which is general enough is not simple for me at the moment. In the past I also worked on cell reordering via Hilbert SFC (you actually explained me what they are), which is now my default cell ordering as it costs me nothing during the grid generation. However, especially because of the blocking of variables, I'm now starting exploring the face ordering. As most of my algorithms proceed by faces, and i never carefully looked at their ordering, I suspect that jumping from a face to another which are not spatially close might still make me incurr in cache misses between two consecutive faces. Basically, at the moment, I have no idea what's going on with my face ordering , I just know that the two adjacent cells will be pretty close (within the bounds determined by the HSFC), but not where will be the next two adjacent cells for the next face. Edit: the 1.2e5 figure was not produced under any control. A more correct one would be: 1.35e5 variables update per second on a single core (of this proc https://ark.intel.com/products/78939...up-to-3_90-GHz), with a preconditioned density based algorithm (the one above) working on an unstructured grid in explicit RK mode (each update actually is 5 stages, so multiply the previous figure by 5) with non uniform time steps and all the equation terms in place (2nd order, iterative green-gauss cell based with 3 iterations and venkatakrishnan directional limiting). The code was compiled with gfortran 4.8.4 with a simple -O2 compilation flag. The OS is Ubuntu 14.04 running under a VM hosted by Windows 7. The machine is a Dell M4800 with 32 GB RAM. Some immersed boundary overhead might be present (for regular boundaries the adjacent cell center and its variables are just there, for the immersed boundary cell faces I have just a point where I first need to interpolate the variables using a cloud of nearby cell centers, around 30 in 3D). The code was run under a python GUI visualizing its log and residual curves for the 5 main equations. 2nd Edit: the variable updates per second go up to 1.75e5 if I exclude the gradient limiting, 2e5 if I also reduce the gradient iterations from 3 to just 1. mprinkey likes this. Last edited by sbaffini; February 10, 2017 at 14:15.

February 10, 2017, 14:16		#10
mprinkey Senior Member Michael Prinkey Join Date: Mar 2009 Location: Pittsburgh PA Posts: 363 Rep Power: 25	Interesting stuff. I'd look into ordering faces just as you do cells, using the face centroid location -> SPC index and sorting on the index. SPC is never perfect, but it is usually pretty good 99.(99?)% of the time. I think you are going to be in good shape if you number your faces based on location. Certainly better than lexical ordering or whatever random way your input mesh may have them enumerated. Another approach could just be greedy ordering based on the SPC ordered cells. Just loop through the faces on each cell in order and add them to the face list. Mark each face as you add them. Then only add unmarked faces. That will get your faces close to cells and the cells are already close to each other. My guess is that neither approach will be a clear winner, but both should be good. I'm sorting everything with SFC because I am too lazy to try new things. 8) Also, I think that the normal loopOverFaces() concept may be limiting. I've seen in everywhere...Fluent, OpenFOAM. It always looks like: forInteriorFaces(face) { flux = someFluxComputation(...); cellData[cellPlusSide[face]] += flux; cellData[cellMinusSide[face]] -= flux; } So, this is a loop over all faces between active cell pairs. It computes a flux using data from the face and maybe the cells on the plus/minus sides of the face. Then it updates the cell values on each side using that flux. That would be for an explicit update, but an implicit matrix formation would += and -= the center coefficient and rhs in the same way. Those indirect writes to memory associated with the cellData updates are (in my experience) a big problem. Compilers can't be sure how to vectorize that code because they can't be sure when the cellPlusSide and cellMinusSide are going to show up nearby each other and clobber each other. If you've done face/cell ordering well, they will likely be close-by pretty often. And vectorized operations (by definition) need to be independent so the code will fall back to non-vector ops. For example, if face 59 and it has cellPlus = 23, cellMinus = 24. And then face 60 has cellPlus = 24, cellMinus = 34, the +=/-= operations trying to update data in cell 24 will clobber each other if you try to do them at the same time, so the operations can't be vectorized. To avoid these problems, I've been looking at schemes that follow some basic rules: If you are updating cell values, do it in cell loops with an =, not +=,... If you are updating face values, do it in face loops with an =, not a +=,... Only reference non-cell values in cell loops as reads. Only reference non-face values in face loops as reads. Avoid overwriting any field that you just read from. Compilers/CPUs seem to do a good job of vectorizing if you follow this process..you can even have the compile auto-flush the writes when you follow this protocol opening up room for read caches because you want to NOT have that data reused. You only want read-in data to be reused. I should note that this can increase the number of loops you have in total, but it does fit in nicely with RK or timestepping scheme where you are reading from old-time/stage fields and writing to new-time/stage fields. I can't swear that this is the best way, but I as feel my way along, these guidelines seems to help. One final point on MPI nonblocking calls. On a given PE, I think off three types of cells: Interior, Exported, and Ghost. Ghost cells are cells that belong to neighbor PEs and my PE doesn't update them except with RECVs. Interior cells are cells that do not have any Ghost cells as neighbors. Exported cells are cells that have one or more Ghost cell neighbors and hence have to be exported to one or more neighbor PEs.. Our PE updates values in interior and exported cells. So, if you loop over the exported cells first and update their values, then issue the nonblocking SEND/RECV sending updated exported data and recving updated ghost data, and then loop over the interior cells, and finalize the SEND/RECVs afterward, you can get a fair amount of work done during communication. Of course, it is best if you have a few 100k cells in the interior cell list to give the network more time, but I think that is the best that you can do. That is certainly the best I know how to do. Edit: Paolo, it looks like you are doing some nice work there. Congrats. sbaffini likes this.

February 7, 2017, 13:22		#3
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,191 Blog Entries: 29 Rep Power: 39	To summarize what Michael wrote: it is of low relevance how much fast you advance your solution with respect to the direction in which you advance and by how much you advance in that specific direction. 3 examples: 1) the fastest update of a variable is var_new = var_old + 0.0 at each iteration. you're just going nowhere with your solution, but you're certainly doing it very fast. 2) with explicit methods you tipically are very very fast in updating but... how fast are you moving toward the steady state solution at CFL = O(1)? not much i guess 3) if you have a bug of some sort, no matter what, implicit or explicit, your iterations are probably going to diverge. So even if very fast, you are going in the wrong direction. Besides this, absolute numbers are also affected by the specific hardware, including the number of processors. Not to mention compiler aspects.

February 7, 2017, 13:46		#4
FMDenaro Senior Member Filippo Maria Denaro Join Date: Jul 2010 Posts: 6,855 Rep Power: 73	I agree with the above scripts... I want to add a suggestion. If you are solving the compressible form then the computational cost in the time update has meaning. Conversely, when solving the incompressible form you should take into account that the computational cost to be evaluated largely depends on the pressure equation, that is on the algebric solver of the discrete Poisson equation.

February 7, 2017, 18:54		#5
zx-81 Member Bernd Join Date: Jul 2012 Posts: 42 Rep Power: 14	Thanks for the answers so far that all contain valid points. I still think it could make sense to have a "number of variables solved per second per CPU" as a very basic, rough reference number to compare different solves/codes *for exactly defined* benchmark problems. Consider the follwing example: "Compute 256^3 points of the 3D velocity field of the standard Taylor-Green-Vortex at Re=1600 and at nondimensionalized point in time t=10" Then divide the wall-clock time necessary to solve for this well defined 3*256^3 unknowns by the number of CPU cores used. Now we would have a very basic metric to be able to compare solver performance. Of course this would be not perfectly objective for many reasons but still "better than nothing" to compare the performance & scalability potential of the many different existing schemes and codes against. Unfortunately, the existing literature I searched so far on the web does not seem to contain much information about standardized performance metrics/comparisons of CFD solvers which seems strange given the vast number of competing schemes & codes.

February 7, 2017, 21:02		#6
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,191 Blog Entries: 29 Rep Power: 39	I tested my code and it also gave me 1.2e5 variable updates per core per second. Does this information tells you anything?

February 11, 2017, 04:54		#15
FMDenaro Senior Member Filippo Maria Denaro Join Date: Jul 2010 Posts: 6,855 Rep Power: 73	Yes, for incompressible flow the framework is totally different... I admit that I don't have much passion for this information technology aspect of the CFD and I was frustrated working on HPC for solving incompressible flows. I worked with collegues who are specialized in the topic of algebric system solvers and HPC and I tried to solve the Poisson equation using many modern solvers. Despite that, I always found that my old, simple, black-red coloured SOR with OpenMP directives was the best in terms of CPU time for my LES code.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
stop when I run in parallel	Nolwenn	OpenFOAM	36	March 21, 2021 05:56
[snappyHexMesh] No layers in a small gap	bobburnquist	OpenFOAM Meshing & Mesh Conversion	6	August 26, 2015 10:38
Superlinear speedup in OpenFOAM 13	msrinath80	OpenFOAM Running, Solving & CFD	18	March 3, 2015 06:36
Moving mesh	Niklas Wikstrom (Wikstrom)	OpenFOAM Running, Solving & CFD	122	June 15, 2014 07:20
[snappyHexMesh] snappyHexMesh won't work - zeros everywhere!	sc298	OpenFOAM Meshing & Mesh Conversion	2	March 27, 2011 22:11