single i7 MUCH faster than dual xeon E5-2650 v3 !!!

acasas · November 24, 2014, 13:49

hi there!

I'm posting this info here so no one throws away their money as I've done.

I ran an FSI analysis (with Fluent) in Ansys 15.0.7 under Windows 7 Pro 64-bit with SP1. I ran the same analysis on a SINGLE i7 3820 and on a DUAL Xeon E5-2650 v3, and the SINGLE i7 is 2 to 3 times faster!!!!!
So guys, be aware: a 1200 Euro computer is 2 to 3 times faster than a 6000 Euro workstation.
I really should have checked the CFD Online forums before I bought this computer. I know, I'm guilty, but how the hell was I supposed to imagine this would happen!!! I have tried many BIOS settings on the new dual Xeon: hyperthreading on and off, overclocking/turbo on and off, NUMA on and off, QPI auto or fixed, power management on and off, etc., and they make only a small difference in performance. I've been thinking that I should use an 8x8 GB RAM configuration instead of 4x16 GB, so that each processor could run at its full 4 channels, but some people said it won't make a big difference.

So please guys, tell me, have I wasted my money?? Any idea what is going on? Is it the fault of the hardware company, the OS, or the CFD software? Are they trying to sell products that are not worth the price AT ALL compared to others?? Or is it me and my poor knowledge of the subject? I really hope it's my fault; otherwise, guys, be aware and don't spend your money on those products.

Thanks

mehulkumar · November 24, 2014, 14:27

Can you share some basic details of the job you want to compare on the two configurations?
- total mesh count
- complexity of the flow physics / models used
- both hardware configurations in detail

acasas · November 24, 2014, 15:00

I don't want to compare, I have compared.... And it's really bad news for anyone who has spent their money on a Xeon E5-2650 v3.

It's a 2-way FSI (3d, dp, pbns, dynamesh, vof, skw, transient): a body falling from a height of 1 m into an open water channel. So far, it has 20k elements for the fluid and 500 for the body. I use just 20 cores, because if I use more (hyperthreading), the affinity is not set (I don't know if I should laugh or cry.... after the money I have spent, it's more the second). Both computers have 64 GB of RAM: 8x8 GB DIMMs in the i7 and 4x16 GB in the dual Xeon. The i7 has DDR3 at 1605 MHz and the Xeons have DDR4 at 2133 MHz. The analysis takes about 30 min on the single i7 and more than an hour on the dual Xeon E5-2650 v3. I used SSDs on both computers. If you need more info, let me know. I would also appreciate any comment or suggestion to help turn this situation around.

thanks

kyle · November 24, 2014, 19:14

First, you should probably calm down. You didn't waste money. The Xeon machine isn't going to be twice as fast as the i7 machine, but it should be at least a little faster.

Assuming you are doing everything correctly, it could be that your processes are hopping around to different cores. There are huge inefficiencies when a process hops to a core on a different socket. You could check if this is the case by disabling one of the processors and running your benchmark again (note that this will cut your memory in half).

wyldckat · November 24, 2014, 19:26

Greetings to all!

@acasas:

Quote:

Originally Posted by acasas

I've been thinking that I should use 8x8 GB RAM instead of 4x16 GB RAM configuration, so I could have the processor at full 4 channel ,but some people said it won't make a big difference.

It makes such a big difference that you should sue whoever told you that.
The explanation is simple: you have 2 Xeon processors and only 4 RAM modules, which means you only have 2 RAM modules per processor. Since each processor has 4 memory channels, half of the channels on each processor are empty, and that creates a massive bottleneck on your system.

If you look at the specs for the 2 processors and notice the maximum memory bandwidth:

E5-2650 v3: http://ark.intel.com/products/81705/...Cache-2_30-GHz - Max 68 GB/s
i7 3820: http://ark.intel.com/products/63698/...up-to-3_80-GHz - Max 51.2 GB/s

Now, remember that I mentioned that you created a massive bottleneck by using only 2 RAM modules for each E5-2650 v3 processor? This means that instead of having 68 GB/s for each processor, you only have 34 GB/s, which is only about two-thirds of the bandwidth the i7 3820 has.
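
The bandwidth figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes peak bandwidth scales linearly with the number of populated channels, which is an idealization, and the function name is made up for illustration.

```python
def effective_bandwidth(max_gb_s, channels_total, channels_populated):
    """Idealized peak bandwidth when only some memory channels hold a DIMM."""
    return max_gb_s * channels_populated / channels_total


# Figures quoted in the post: 68 GB/s (E5-2650 v3) and 51.2 GB/s (i7 3820).
xeon = effective_bandwidth(68.0, channels_total=4, channels_populated=2)
i7 = effective_bandwidth(51.2, channels_total=4, channels_populated=4)

print(f"E5-2650 v3, 2 of 4 channels populated: {xeon:.1f} GB/s per processor")
print(f"i7 3820, 4 of 4 channels populated: {i7:.1f} GB/s")
print(f"Xeon / i7 ratio: {xeon / i7:.2f}")
```

Real-world bandwidth will be lower than these peak numbers on both machines; the point is only the relative comparison.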

Now, the other bottleneck:

The E5-2650 v3 processor has 10 cores, which will likely operate at around 2.5GHz when all cores are running at full throttle. This means that it has an equivalent potential of 25 GHz of processing power per processor, which is more than the 3.6*4 = 14.4 GHz of the i7 3820.
But when we add the constriction imposed by using only 2 RAM modules, you have 10 cores fighting for access to 2 RAM modules over 2 memory channels. This means that, with luck, each E5-2650 v3 processor is only being used at 50% capacity, i.e. potentially at 12.5GHz.
But since 10 cores are fighting over the existing RAM, the capacity likely drops to 30-40% because of the contention.
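
The "aggregate GHz" estimate above can be written out as a tiny script. This is a rough sketch using the crude cores-times-clock metric from the post; the utilisation factors are the guesses stated there, not measurements, and the function name is hypothetical.

```python
def aggregate_ghz(cores, ghz_per_core, utilisation=1.0):
    """Crude compute estimate: cores x clock, scaled by a guessed utilisation."""
    return cores * ghz_per_core * utilisation


print(f"i7 3820: {aggregate_ghz(4, 3.6):.1f} GHz")
print(f"E5-2650 v3, unconstrained: {aggregate_ghz(10, 2.5):.1f} GHz")

# Guessed utilisation when 10 cores share 2 populated memory channels.
for utilisation in (0.5, 0.4, 0.3):
    ghz = aggregate_ghz(10, 2.5, utilisation)
    print(f"E5-2650 v3 at {utilisation:.0%} utilisation: {ghz:.1f} GHz")
```

With the 30-40% guess, a single starved Xeon lands below the i7's 14.4 GHz on this metric.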

Even worse would be if the 4 RAM modules were all allocated only to the first E5-2650 v3 processor, and the second processor were nonetheless (somehow) turned on and asking the first processor to share its memory so that it can do its own share of the calculation as well. This shouldn't be possible, though.

On the other side, the i7 3820 only has 4 cores and also has 4 memory channels, which means that it pretty much gives optimum access to memory for each core, at full bandwidth.

Now, this essentially equates to you having bought a 500000$ super-car and then shooting off 2 of its tyres and expecting that it can still travel at 300 km/h... when it can barely run at 50 km/h, while burning out the rims of the wheels.

The solution: take the expensive machine back to the shop where you bought it and ask them to replace the 4*16GB modules of RAM with 8*8GB modules.

Want proof, before heading back to the shop? It's simple:

Turn off HyperThreading. It's useless for CFD. Use pure cores only.
Use only 2 cores on each processor for running your case. This equates to roughly 4*2.6GHz = 10.4GHz. And use "cpu affinity binding", do it manually if you have to, via Windows' Task Manager.
This means that the simulation should take roughly 14.4/10.4 = 1.38 times longer to run, i.e., it will take 30 * 1.38 = 41 to 45 minutes to run.
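
The estimate for this 4-core test can be reproduced as follows. It is a minimal sketch that assumes run time scales inversely with the crude cores-times-clock figure used above; the function name is made up for illustration.

```python
def expected_minutes(reference_minutes, reference_ghz, test_ghz):
    """Scale a reference run time by the ratio of aggregate clock speeds."""
    return reference_minutes * reference_ghz / test_ghz


i7_ghz = 4 * 3.6    # i7 3820, 4 cores
xeon_ghz = 4 * 2.6  # 2 cores on each Xeon, assumed to run at about 2.6 GHz

print(f"slowdown factor: {i7_ghz / xeon_ghz:.2f}")
print(f"expected run time: {expected_minutes(30, i7_ghz, xeon_ghz):.1f} min")
```

If the 4-core Xeon run lands near this figure, the memory configuration rather than the processors is the likely culprit for the slow 20-core run.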

Best regards,
Bruno

acasas · November 24, 2014, 19:27

Hi Kyle, thanks for your answer, and also for trying to calm me down. You are right, I've been like that for a week. The computer is new, and I don't want to say yet which company I bought it from, in case I'm doing something wrong. Intel and Windows are so big that hopefully they won't be upset with me.
Anyway, since the computer is new, I may wait a little before unplugging one processor. Do I really need to do it physically? I don't have much experience with this; it doesn't seem difficult, but it is very delicate.
On the other hand, I ran the 64-bit Intel Processor Diagnostic Tool and it shows a big red fault for the QPI link, so I guess that may be very relevant.

Also, I know it's not the same case, but please check this out: http://www.cfd-online.com/Forums/har...-3930k-x2.html

thanks

acasas · November 24, 2014, 19:34

Bruno!!!! If you are right, you have made my day, and I'm sooooo happy you can't imagine.
Thanks a lot. If you were a woman I would kiss you. But, hey, I will first go to the shop and try. I must say that the company is a very important international one, specialized in superservers, mainly for data, so they should know that, shouldn't they? I have told them about the RAM many times, and they keep insisting that it won't make a difference. So I really hope you are right as a saint and they are the EVIL...

Thanks a lot. I will post the result in here once I change the RAM configuration.

acasas · November 24, 2014, 20:00

Quote:

Originally Posted by wyldckat

Greetings to all!

@acasas:

It makes so much of a big difference, that you should sue whomever told you that.
The explanation is simple: you have 2 Xeon processors and only 4 RAM modules. This means that technically you only have 2 RAM modules per processor. Technically, what this means, is that you have created a massive bottleneck on your system.

If you look at the specs for the 2 processors and notice the maximum memory bandwidth:

E5-2650 v3: http://ark.intel.com/products/81705/...Cache-2_30-GHz - Max 68 GB/s
i7 3820: http://ark.intel.com/products/63698/...up-to-3_80-GHz - Max 51.2 GB/s

Now, remember that I mentioned that you created a massive bottleneck by using only 2 RAM modules for each E5-2650 v3 processor? This means that instead of having 68 GB/s for each processor, you only have 34 GB/s, which is only about two-thirds of the bandwidth the i7 3820 has.

Now, the other bottleneck:

The E5-2650 v3 processor has 10 cores, which will likely operate at around 2.5GHz when all cores are running at full throttle. This means that it has an equivalent potential of 25 GHz processing power per processor, which is more than 3.6*4 = 14.4 GHz of the i7 3820.
But when we overlap the constriction imposed by using only 2 RAM modules, this means that you have 10 cores fighting for access to 2 RAM modules, over 2 memory channels. This means that, with luck, each processor E5-2650 v3 is only being used at 50% capacity, i.e. potentially at 12.5GHz.
But since 10 cores are fighting over the existing RAM, the capacity likely drops to 30-40% because of the contention.

Even worse would be if the 4 RAM modules were all allocated only to the first E5-2650 v3 processor, and the second processor were nonetheless (somehow) turned on and asking the first processor to share its memory so that it can do its own share of the calculation as well. This shouldn't be possible, though.

On the other side, the i7 3820 only has 4 cores and also has 4 memory channels, which means that it pretty much gives optimum access to memory for each core, at full bandwidth.

Now, this essentially equates to you having bought a 500000$ super-car and then shooting off 2 of its tyres and expecting that it can still travel at 300 km/h... when it can barely run at 50 km/h, while burning out the rims of the wheels

The solution: take the expensive machine back to the shop where you bought it and ask for replacing the 4*16GB modules of RAM for 8*8GB modules.

Want proof, before heading back to the shop? It's simple:

Turn off HyperThreading. It's useless for CFD. Use pure cores only.
Use only 2 cores on each processor for running your case. This equates to roughly 4*2.6GHz = 10.4GHz. And use "cpu affinity binding", do it manually if you have to, via Windows' Task Manager.
This means that the simulation should take roughly 14.4/10.4 = 1.38 times longer to run, i.e., it will take 30 * 1.38 = 41 to 45 minutes to run.

Best regards,
Bruno

Can I activate CPU affinity binding through the BIOS settings? What about clock spread spectrum? Should I enable or disable it?

thanks

wyldckat · November 24, 2014, 20:21

Quote:

Originally Posted by acasas

Can I activate cpu affinity binding through the bios settings?

I specifically wrote:

Quote:

And use "cpu affinity binding", do it manually if you have to, via Windows' Task Manager.

My guess is that if you had at least Googled for:

Code:

windows task manager affinity

you should have found this explanation: http://superuser.com/questions/18157...oes-it-provide

Quote:

Originally Posted by acasas

What about clock spread spectrum?

The advantage of high-priced motherboards for Xeon processors is that they try to keep users from shooting themselves in the foot.
In other words: Do not mess with it!! Most of the settings should not be touched if they came properly pre-configured. And even if they didn't, the motherboard should be able to automatically determine the correct settings when loading the defaults.

acasas · November 24, 2014, 20:39

Quote:

Originally Posted by wyldckat

I specifically wrote:
My guess is that if you had at least Googled for:

Code:

windows task manager affinity

you should have found this explanation: http://superuser.com/questions/18157...oes-it-provide

Thanks, I will go through it very carefully. My OS is in Spanish so I was not sure where to look... plus I was very anxious to check your proposal. And you know what??? It DID WORK!!!! Without setting the affinity binding, but with hyperthreading turned off and Fluent told to run on 4 cores, it took the same time as the i7. I'm very happy, but now there is one question: what will I do with my other 20 cores? I mean, if I disable hyperthreading, are you sure it does not make a difference in CFD? And if I run a 2-way FSI, should I turn hyperthreading on or off?
Anyway, for you it may be nothing, since you must be a genius, but for me it means a lot. You are my saviour!!! I'll build you a monument in my town!

Quote:

Originally Posted by wyldckat

The advantage of high-priced motherboards for Xeon processors is that they try to keep users from shooting themselves in the foot.
In other words: Do not mess with it!! Most of the settings should not be touched if they came properly pre-configured. And even if they didn't, the motherboard should be able to automatically determine the correct settings when loading the defaults.

Hey..... turning off hyperthreading was your suggestion, and it was activated by default...

But again, thanks a lot really.

evcelica · November 26, 2014, 11:50

EASY FIX:

These machines must have their memory configured correctly to have good performance. I'm assuming you are using ECC, so pick up 4 more identical DIMMs and populate bank 1 of all 4 channels for each CPU.

I've seen a lot of these dual/quad CPU workstations that had horrible performance with an unbalanced memory configuration. (I saw a $20K quad-CPU Xeon machine that was 1/3 the speed of an i7 because they were using 6 DIMMs per CPU.)

For best performance, you have to have a balanced memory configuration:
** All 4 channels of each CPU populated evenly. That means you should be using DIMMs in identical sets of 8 (4 per CPU) with your dual CPU machine. If you need more RAM, you have to fill the second bank completely. NEVER have one CPU or channel different from any other.

Don't worry about any other settings, just fill all 4 channels evenly. (Your motherboard or computer manufacturer should be able to tell you which slots to fill for Bank 1; usually A1, A2, A3, A4 for CPU 0 and B1, B2, B3, B4 for CPU 1.)
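
The balance rule described above can be expressed as a small check. This is a hedged sketch with a hypothetical function name: it only looks at DIMM counts per socket and assumes 4 channels per CPU, so it cannot tell whether the DIMMs are identical or seated in the right slots.

```python
def is_balanced(dimms_per_socket, channels_per_socket=4):
    """True if every socket has the same DIMM count and fills all channels evenly."""
    if not dimms_per_socket:
        return False
    first = dimms_per_socket[0]
    same_on_every_socket = all(count == first for count in dimms_per_socket)
    fills_channels = first > 0 and first % channels_per_socket == 0
    return same_on_every_socket and fills_channels


# DIMM counts per CPU for the configurations mentioned in this thread.
configs = {
    "4 x 16 GB, dual CPU": [2, 2],
    "8 x 8 GB, dual CPU": [4, 4],
    "6 DIMMs per CPU": [6, 6],
    "all 16 slots, dual CPU": [8, 8],
}
for label, layout in configs.items():
    verdict = "balanced" if is_balanced(layout) else "UNBALANCED"
    print(f"{label}: {verdict}")
```

The motherboard manual remains the authority on which physical slots make up Bank 1.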

I had to explain this over and over to our computing department; then they added the DIMMs to balance the channels and voila, performance skyrocketed.
Here are some links on balancing memory if you need them to convince your IT:

https://roianalyst.alinean.com/dell/AutoLogin.do?d=240493329964944458

Page 34:
http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/12g-memory-performance-guide.pdf

acasas · November 26, 2014, 12:12

hi Erik,

Thanks a lot for your answer. That is exactly the reason why it was performing so poorly. And yet, can you believe the company kept claiming I was wrong? It is a very important server and workstation manufacturer from the USA, but I won't tell their name. They claim that for their applications (big data storage, labs, etc.) this is not an important issue. Are they right??

Anyway, I just wanted to ask 2 more things. This motherboard has 16 memory slots. Do I need to fill ALL of them (16), or will 8 (4 per processor) be enough? Of course I'll populate them as in the motherboard specification, and yes, they are ECC DDR4. The 2nd question is about disks and storage layout. I have 3 SSDs of 250 GB each: one for the system and software, and the other 2 in RAID 0 for the working and scratch folders. Is this a good configuration for best performance?

Thanks a lot Erik, and Bruno too, and all the people who contribute to these forums, and the creators of CFD Online too.

Chris Lee · November 26, 2014, 19:46

Quote:

Originally Posted by wyldckat

Greetings to all!

@acasas:

It makes so much of a big difference, that you should sue whomever told you that.
The explanation is simple: you have 2 Xeon processors and only 4 RAM modules. This means that technically you only have 2 RAM modules per processor. Technically, what this means, is that you have created a massive bottleneck on your system.

If you look at the specs for the 2 processors and notice the maximum memory bandwidth:

E5-2650 v3: http://ark.intel.com/products/81705/...Cache-2_30-GHz - Max 68 GB/s
i7 3820: http://ark.intel.com/products/63698/...up-to-3_80-GHz - Max 51.2 GB/s

Now, remember that I mentioned that you created a massive bottleneck by using only 2 RAM modules for each E5-2650 v3 processor? This means that instead of having 68 GB/s for each processor, you only have 34 GB/s, which is only about two-thirds of the bandwidth the i7 3820 has.

Wyldckat,
I'm hoping you can help me understand if I'm about to make a similar mistake. I am getting ready to buy a mobile "workstation", to run a couple of different CFD codes, along with doing some CAD work on Autodesk or Solidworks. My main focus is running the CFD as fast as possible. (I may be getting well into the 10's of millions of grid points, which translates into close to 30GB of memory requirement, and I don't want to be sitting around for hours or days waiting for each solution).
The system I am shopping has options to go up to 12 cores. There are, for example, options on the configuration which include
E5-1660 V2 , and
i7-4960X, and
E5-2680 V2 .

For RAM, being a laptop, this system uses DDR3, with the best option being
32 GB (4 x 8G) 204-pin "quad channel" memory.

Now I don't understand well the architecture of how the RAM channels and CPU communicate, but I think you need to have at least 4 DIMM slots filled to get 4-channel functionality out of the RAM.

The question is, am I not spending my $ efficiently if I go a number of cores greater than the number of channels in the memory? (If so, why would anyone ever go with more than 4 cores?)

I'm guessing that as long as you have 4 DIMM slots filled (for any of these single physical CPUs) there is no bottleneck being made as in the example above with two physical CPUs. Is that right?

I was going to get a 10 core system (or 12 core, if I can find the budget for it) but I want to make sure I'm not throwing money away if I get more than 4 cores.

Note, I'm assuming the E5-2680 v2 is a "single CPU" with 10 cores, and so I would still have 4 channels of RAM available to all 10 cores, or in terms similar to yours above, I would still have the full 59.7 GB/s max memory bandwidth.

(As a side question, with regard to the limiting factor in time to solution, I guess what I don't really know is how much time in the solution is spent with the cpu cores cranking away on the equations, vs updating the information in the RAM, . . . but I'll suppose for the time being that my CFD problem will be memory bandwidth limited. If you've got some rules of thumb on how to figure where the overall bottleneck is, i'd be most grateful.)

Any light you can shed on this is most appreciated.

Cheers,

acasas · November 26, 2014, 19:59

Hey Chris! You see? It was not bad hijacking your thread even by mistake. Now you can ask interesting things in mine and I dont mind ;-)

Micael · November 27, 2014, 18:42

Quote:

Originally Posted by acasas

its a 20k elements for the fluid and 500 for the body.
thanks

A model of roughly 20,000 cells will not scale well to 20 cores. You may have hit a communication bottleneck. Try running your case with different core counts on both workstations to see how they perform. In particular, compare both machines while running them with the same number of cores.

I would expect your E5 to beat the i7 for a case with, say, 1M cells.
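
To see why the small case scales poorly, it helps to look at how little work each core gets. The snippet below is only an illustrative sketch: it counts cells per core and says nothing about where exactly the communication overhead starts to dominate, which depends on the solver and the hardware.

```python
def cells_per_core(total_cells, cores):
    """Average number of mesh cells each core is responsible for."""
    return total_cells / cores


cases = {
    "FSI case from this thread (20k fluid + 500 body)": 20_500,
    "hypothetical 1M-cell case": 1_000_000,
}
for label, cells in cases.items():
    for cores in (4, 20):
        share = cells_per_core(cells, cores)
        print(f"{label}, {cores} cores: {share:,.0f} cells per core")
```

With only about a thousand cells per core at 20 cores, each partition has very little to compute between exchanges with its neighbours.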

Good luck

HyperNova · December 1, 2014, 16:35

Hi everyone, good discussion, I like it.
I agree with Micael: you should try a huge mesh, for example 10M cells, and then maybe the E5 will show what it can do. For a small mesh, data transfer between cores is the dominant process. I experienced this when solving a problem with 8000 cells: I tried 1 to 8 cores, and got the shortest solution time at 3 cores.

For Chris Lee: try a GPU accelerator like the K80. It costs 5000$, but it is 10 times faster than the strongest CPUs right now, like the 5960X.

huey1080 · December 12, 2014, 11:34

I agree with that: splitting a small mesh over 2 sockets and a large number of cores is just going to be slowed down by the interconnect. i7s are way faster than Xeons for sure, but they are meant for non-intensive use: they use more power and are usually coupled with faster non-ECC RAM, which makes them good for quick, non-24/7 use. For more intensive use, where a computational server is constantly loaded, I would not try to compete with the stability of Xeons.
And correctly scaling the number of cores is always crucial, on i7 or Xeon.

wyldckat · December 14, 2014, 07:24

Greetings to all!

I'll try to answer in this post the questions posed by acasas and by Chris Lee:

@acasas:

Quote:

Originally Posted by acasas

But yet, can you believe, the company kept claiming I was wrong? It is a very important server producer and workstation company from USA, but I won't tell their name. They claim that for their applications, big data storage, labs, etc, this is not an important issue. Are they right??

For the average range of applications, they are somewhat correct. The performance difference is in the range of 1-10%, depending on the application. The problem is that CFD requires a very optimized (or at least very good) system, not an average system.

In which case, memory access is critical and 2 vs 4 channels can mean something in the range of 10 to 30% performance, depending on the cases.

Quote:

Originally Posted by acasas

Any way, I just wanted to ask one 2 more things. This motherboard have 16 memory modules. Do I need to fill ALL of them (16) OR 8 (4 per each processor) will be enough? Of course I´ll populate them as in the motherboard specification, and yes, they are ECC DDR4.

Each processor/socket has 4 memory channels, which implies that a minimum of 4 modules/slots should be occupied; above that, it should be a multiple of 4.

Quote:

Originally Posted by acasas

The 2nd question is related with discs and storage disposal. I do have 3 SSD 250 GB each. One for the system and software and the other 2 in RAID 0 mode for the working and scratching folders. Is it a good configuration for best performance?

Seems OK. It depends on how frequently you need to write the data to disk and how big your cases are for each time/iteration snapshot. It might make more sense to have smaller SSDs for RAID 0 and to have a 2-4TB hard-drive for off-loading data after writing to SSD is complete. But again, it strongly depends on your work-flow and file frequency+sizes.

-----------------------
@Chris Lee:

Quote:

Originally Posted by Chris Lee

The system I am shopping has options to go up to 12 cores. There are, for example, options on the configuration which include
E5-1660 V2 , and
i7-4960X, and
E5-2680 V2 .

We might not yet have the technology to make a car compact itself into a briefcase, but at least computers are getting there

Perhaps the cartoons "The Jetsons" were actually referring to teleworking...

Quote:

Originally Posted by Chris Lee

For RAM, being a laptop, this system uses DDR3, with the best option being 32 GB (4 x 8G) 204-pin "quad channel" memory.

Mmm... if you might go up to 30 GB for a case, you might eventually see a need to go even further to 64GB of RAM... but I guess that if you ever need that, you will use a cluster or a server to do the mesh and calculations.

Quote:

Originally Posted by Chris Lee

Now I don't understand well the architecture of how the RAM channels and CPU communicate, but I think you need to have at least 4 DIMM slots filled to get 4-channel functionality out of the RAM.

Yes.

Quote:

Originally Posted by Chris Lee

The question is, am I not spending my $ efficiently if I go a number of cores greater than the number of channels in the memory? (If so, why would anyone ever go with more than 4 cores?)

I did a bit of lengthy mathematics on this topic yesterday: http://www.cfd-online.com/Forums/har...tml#post523825 - post #10

The essential concept is that more cores will each run slower, but each will also be responsible for less RAM to be crunched. Then you have to take into account the total available memory bandwidth. Beyond that, it starts depending on the complexity of your case... which is to say that in some crazy situations, overscheduling a 12 core machine with 18-36 processes might provide results slightly faster, because of an alignment in memory accesses.

Quote:

Originally Posted by Chris Lee

I'm guessing that as long as you have 4 DIMM slots filled (for any of these single physical CPUs) there is no bottleneck being made as in the example above with two physical CPUs. Is that right?

The idea is that each socket should use 4 DIMMs for itself. In your case, you only have 1 socket.

Quote:

Originally Posted by Chris Lee

I was going to get a 10 core system (or 12 core, if I can find the budget for it) but I want to make sure I'm not throwing money away if I get more than 4 cores.

As I mentioned a bit above about the mathematics I did yesterday, it really depends. For example, if you search online for:

Code:

OpenFOAM xeon benchmark

I guess it's quicker to give the link I'm thinking of: http://www.anandtech.com/show/8423/i...l-ep-cores-/19
there you might find that a system with 12 cores @ 2.5GHz that costs roughly 1000 USD gives a better bang-for-your-buck than 8 cores @ 3.9 GHz that cost 2000 USD (not sure of the exact values). The 8 core system gives the optimum combination of RAM bandwidth and core efficiency, but the 12 core system costs a lot less and uses a lot less electrical power, while running at only 76% of the CPU compute performance of the 8 core system.
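
The bang-for-your-buck comparison can be made explicit with the rough numbers above. Keep in mind that the prices and the 76% figure are the approximate values from this post (flagged there as "not sure of the exact values"), so the result is illustrative only; the function name is made up.

```python
def performance_per_kusd(relative_performance, price_usd):
    """Relative compute performance obtained per 1000 USD spent."""
    return relative_performance / (price_usd / 1000.0)


# Approximate values from the post; the 8 core system is the 100% reference.
twelve_core = performance_per_kusd(0.76, 1000)
eight_core = performance_per_kusd(1.00, 2000)

print(f"12 cores @ 2.5 GHz: {twelve_core:.2f} per 1000 USD")
print(f"8 cores @ 3.9 GHz: {eight_core:.2f} per 1000 USD")
print(f"value advantage of the 12 core system: {twelve_core / eight_core:.2f}x")
```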

In such a case, you might want to weigh in an additional and very important factor: how fast do you want your meshes to be generated, if they can only be generated in serial mode, not in parallel?

Quote:

Originally Posted by Chris Lee

Note, I'm assuming the E5-2680 v2 is a "single CPU" with 10 cores, and so I would still have 4 channels of RAM available to all 10 cores, or in terms similar to yours above, I would still have the full 59.7 GB/s max memory bandwidth.

Yes, and at 32 GB of total RAM, would equate to 3.2 GB per core at roughly 5.97 GB/s access speed.

For comparison, the i7-4960X with 6 cores would be using 32 GB, with 5.33 GB per core at roughly 9.95 GB/s.
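
These per-core figures come from simply dividing the totals by the core count. The sketch below reproduces them, assuming memory and bandwidth are shared equally among cores, which is a simplification; the function name is hypothetical.

```python
def per_core(total_ram_gb, bandwidth_gb_s, cores):
    """Return (RAM per core in GB, bandwidth per core in GB/s) for an even split."""
    return total_ram_gb / cores, bandwidth_gb_s / cores


# 32 GB of RAM and the 59.7 GB/s quoted for both CPUs.
for name, cores in (("E5-2680 v2", 10), ("i7-4960X", 6)):
    ram, bandwidth = per_core(32, 59.7, cores)
    print(f"{name}: {ram:.2f} GB per core at {bandwidth:.2f} GB/s per core")
```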

Now that I look more closely at the 3 CPUs you proposed for comparison, the only major difference is:

How much maximum RAM do you really want to use.
Are you willing to pay the extra cost for ECC memory? This can give you greater peace of mind when running CFD cases, but it will make a bigger hole in the wallet as well.

For 32GB of RAM, from these 3, I would vote for the i7-4960X, which could potentially be overclocked in situations where you need a little bit more performance and are willing to spend more electricity to achieve it... although on a laptop this isn't easily achieved, and OC is a bit risky (namely, it takes some time to master). Either way, it gives you roughly the same performance as the other 2 CPUs and you save a lot of money. Just make sure you keep your workplace clean and have your laptop's fans and heat-sinks cleaned once a year, to ensure that it's always being cooled properly.

Quote:

Originally Posted by Chris Lee

As a side question, with regard to the limiting factor in time to solution, I guess what I don't really know is how much time in the solution is spent with the cpu cores cranking away on the equations, vs updating the information in the RAM, . . . but I'll suppose for the time being that my CFD problem will be memory bandwidth limited. If you've got some rules of thumb on how to figure where the overall bottleneck is, i'd be most grateful.

Already mentioned in this post. Nonetheless, the primary rule of thumb is that it can strongly depend on the kind of simulations you need to perform. Some cases are easily parallelised, others aren't.
And don't forget about the time it takes to generate the mesh when using a CPU that has more cores but a lower top speed when running on a single core.

-------------------
@acasas:

Quote:

Originally Posted by acasas

Hey Chris! You see? It was not bad hijacking your thread even by mistake. Now you can ask interesting things in mine and I dont mind ;-)

You might not mind, but others might and probably will. It's considerably hard to discuss two or more different topics on the same thread without losing track of whom the questions and answers are addressed to. The only reason why I (as a moderator) haven't moved the posts in question is that it didn't seem to be a complete hijack and the details were still somewhat related.

Best regards,
Bruno

acasas · January 13, 2015, 13:58

guys, check out Erik´s benchmark thread

http://www.cfd-online.com/Forums/har...quad-xeon.html

acasas · January 15, 2015, 05:58

Hi guys, I have some results for what from now on I would like to call "Erik's Benchmark", which you can find at http://www.cfd-online.com/Forums/har...quad-xeon.html

Model:
Geometry: 1m x 1m x 5m long duct
Mesh: 100 x 100 x 500 "cubes" all 1x1x1cm (5M cells)
Flow: Default Water enters @ 10m/s at 300K, goes out other side at 0Pa. Walls are 400K.
High Resolution Turbulence and advection
Everything else default.
Double Precision: ON
20 iterations (you must reduce your convergence criteria or it will converge in less iterations.)

I ran "Erik's Benchmark" on a single i7 3820 and on a dual Xeon E5-2650 v3, both under Windows 7 Pro 64-bit.
On the i7 3820 @ 3.6 GHz with DDR3 SDRAM PC3-12800 @ 800 MHz, with 4 real cores and 8 threads, and with affinity fully set, it took 1598 sec wall time.
On the dual Xeon E5-2650 v3, with 20 real cores, no hyperthreading, overclocking on, and DDR4-2133 RAM (1066 MHz), it took 533 sec wall time.
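
For reference, the speedup implied by these two wall times can be computed directly. This is just the ratio of the two reported numbers; the function name is made up for illustration.

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


i7_wall_time = 1598    # seconds, i7 3820
xeon_wall_time = 533   # seconds, dual Xeon E5-2650 v3

ratio = speedup(i7_wall_time, xeon_wall_time)
print(f"dual Xeon vs i7 3820: {ratio:.2f}x faster")
```

So on this 5M-cell benchmark the dual Xeon is about three times faster than the i7.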

On the dual Xeon, for other core counts the affinity was not set automatically, so the run times wouldn't be useful for this benchmark comparison. In some cases the computer made almost no progress until I manually set the affinity of every single "solver-pcmpi.exe" task in the Task Manager.
If any of you would like me to run "Erik's Benchmark" on my dual Xeon with a core count other than 20 and post the results here, could you please explain how to set the affinity in advance, before running the test? Is there any way to program or define the affinity for solver-pcmpi.exe in advance?

thank´s a lot
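
One possible starting point for the affinity question above: Windows' `start` command accepts an `/affinity` option that takes a hexadecimal CPU mask, and the mask itself is easy to compute. The sketch below only builds the mask. The core numbering (logical CPUs 0-9 on the first socket and 10-19 on the second, with hyperthreading off) is an assumption that should be checked on the actual machine, and whether the MPI-launched solver processes keep the affinity is untested here.

```python
def affinity_mask(cpu_indices):
    """Return an integer bitmask with one bit set per logical CPU index."""
    mask = 0
    for index in cpu_indices:
        if index < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << index
    return mask


# Assumed layout (not verified): CPUs 0-9 on socket 0, CPUs 10-19 on socket 1.
two_cores_per_socket = [0, 1, 10, 11]
mask = affinity_mask(two_cores_per_socket)

# "start /affinity" expects the mask in hexadecimal.
print(f"start /affinity {mask:X} <your solver command>")
```

The placeholder `<your solver command>` has to be replaced with whatever command line launches the run; the Task Manager method described earlier in the thread remains the fallback.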

kyle · November 24, 2014, 19:14

First, you should probably calm down. You didn't waste money. The Xeon machine isn't going to be twice as fast as the i7 machine, but it should be at least a little faster.

Assuming you are doing everything correctly, it could be that your processes are hopping around to different cores. There are huge inefficiencies when a process hops to a core on a different socket. You could check if this is the case by disabling one of the processors and running your benchmark again (note that this will cut your memory in half).

evcelica · November 26, 2014, 11:50

EASY FIX:
These machines must have their memory configured correctly to perform well. I'm assuming you are using ECC, so pick up 4 more identical DIMMs and populate bank 1 of all 4 channels for each CPU.

I've seen a lot of these dual/quad CPU workstations that had horrible performance with an unbalanced memory configuration. (I saw a $20K quad-CPU Xeon machine that was 1/3 the speed of an i7 because it was using 6 DIMMs per CPU.)

For best performance, you have to have a balanced memory configuration:
** All 4 channels of each CPU populated evenly.
That means you should be using DIMMs in identical sets of 8 (4 per CPU) with your dual CPU machine. If you need more RAM, you have to fill the second bank completely. NEVER have one CPU or channel different from any other.

Don't worry about any other settings, just fill all 4 channels evenly. Your motherboard or computer manufacturer should be able to tell you which slots to fill for bank 1, usually A1, A2, A3, A4 for CPU 0 and B1, B2, B3, B4 for CPU 1.

I had to explain this over and over to our computing department, then they added the DIMMs to balance the channels and voilà, performance skyrocketed.

Here are some links on balancing memory if you need them to convince your IT:
https://roianalyst.alinean.com/dell/AutoLogin.do?d=240493329964944458
Page 34: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/12g-memory-performance-guide.pdf
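The balancing rule above can be written down as a quick check. This is only a sketch: `is_balanced` and `peak_bandwidth_gbs` are hypothetical helpers, the 4x16 GB layout is assumed to be 2 DIMMs per CPU (the thread does not say which slots were used), and the bandwidth figure is the theoretical peak of 8 bytes per transfer per channel, not a measured value.

```python
def is_balanced(layout):
    """layout maps a CPU name to a list of DIMM counts, one entry per memory channel.

    Balanced means every channel of every CPU holds the same, non-zero number of DIMMs.
    """
    counts = {n for channels in layout.values() for n in channels}
    return len(counts) == 1 and 0 not in counts


def peak_bandwidth_gbs(layout, mt_per_s, bytes_per_transfer=8):
    """Theoretical peak in GB/s: populated channels x transfer rate x bytes per transfer."""
    populated = sum(1 for channels in layout.values() for n in channels if n > 0)
    return populated * mt_per_s * bytes_per_transfer / 1000.0


# Assumed layouts for a dual-CPU board with 4 channels per CPU
four_by_16 = {"CPU0": [1, 1, 0, 0], "CPU1": [1, 1, 0, 0]}   # 4 DIMMs in total
eight_by_8 = {"CPU0": [1, 1, 1, 1], "CPU1": [1, 1, 1, 1]}   # 8 DIMMs in total
six_per_cpu = {"CPU0": [2, 2, 1, 1], "CPU1": [2, 2, 1, 1]}  # the 6-DIMMs-per-CPU case

if __name__ == "__main__":
    for name, layout in [("4x16", four_by_16), ("8x8", eight_by_8), ("6 per CPU", six_per_cpu)]:
        print(name, "balanced:", is_balanced(layout),
              "peak GB/s at DDR4-2133:", peak_bandwidth_gbs(layout, 2133))
```

Under these assumptions the 4x16 GB layout leaves half of the channels empty, so its theoretical peak bandwidth is half that of the 8x8 GB layout at the same DDR4-2133 speed.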

acasas · November 26, 2014, 12:12

Hi Erik, thanks a lot for your answer. That is exactly the reason why it was performing so poorly. But can you believe the company kept claiming I was wrong? It is a very important server and workstation manufacturer from the USA, but I won't tell their name. They claim that for their applications (big data storage, labs, etc.) this is not an important issue. Are they right?

Anyway, I just wanted to ask two more things. This motherboard has 16 memory slots. Do I need to fill ALL 16 of them, or will 8 (4 per processor) be enough? Of course I'll populate them as in the motherboard specification, and yes, they are ECC DDR4.

The second question is about disks and storage layout. I have 3 SSDs of 250 GB each: one for the system and software, and the other 2 in RAID 0 for the working and scratch folders. Is that a good configuration for best performance?

Thanks a lot Erik, and Bruno too, and all the people who contribute to these forums, and the creators of CFD Online too.

acasas · November 24, 2014, 19:27

Hi Kyle, thanks for your answer, and for trying to calm me down. You are right, I've been like that for a week. The computer is new, and I don't want to say yet which company I bought it from, in case I'm doing something wrong. Intel and Windows are so big that hopefully they won't be upset with me.

Anyway, since the computer is new, I may wait a little before unplugging one processor. Do I really need to do it physically? I don't have much experience with this; it doesn't seem difficult, but it is very delicate.

On the other hand, I ran the Intel Processor Diagnostic Tool (64-bit) and it's showing a big red fault for the QPI link, so I guess that may be very relevant.

Also, I know it's not the same case, but please check this out: http://www.cfd-online.com/Forums/har...-3930k-x2.html

Thanks

acasas · November 24, 2014, 19:34

Bruno!!!! If you are right, you made my day, and I'm sooo happy you can't imagine. Thanks a lot. If you were a woman I would kiss you. But hey, I will first go to the shop and try. I must say that the company is a very important international one, specialized in superservers, mainly for data, so maybe they should know that, shouldn't they? I have been telling them about the RAM many times, and they keep insisting that it won't make a difference. So I really hope you are right as a saint and they are the EVIL... Thanks a lot. I will post the result here once I change the RAM configuration.

acasas · November 26, 2014, 19:59

Hey Chris! You see? It was not so bad hijacking your thread, even by mistake. Now you can ask interesting things in mine and I don't mind ;-)

HyperNova · December 1, 2014, 16:35

Hi everyone, good discussion, I like it. I agree with Micael: you should try a huge mesh, for example 10M cells, then maybe the E5 will show its strength. For a small mesh, data transfer between cores is the dominant process. I experienced this when solving a problem with 8000 cells: I tried 1 to 8 cores, and got the shortest solution time at 3 cores.

For Chris Lee: try a GPU accelerator like the K80. It costs $5000, but it is 10 times faster than the strongest CPUs right now, like the 5960X.

huey1080 · December 12, 2014, 11:34

I agree with that: splitting a small mesh over 2 sockets and a large number of cores is just going to be slowed down by the interconnect. i7s are way faster than Xeons for sure, but they are meant for non-intensive use: they use a larger amount of power and are usually coupled with faster non-ECC RAM, which makes them good for quick, non-24/7 use. For more intensive use, where a computational server is constantly loaded, I would not try to compete with the stability of Xeons. And correctly scaling the number of cores is always crucial, on i7 or Xeon.
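The point about small meshes can be made concrete by looking at how many cells each solver process actually gets. The sketch below is plain arithmetic on the mesh sizes and core counts quoted in this thread; `cells_per_process` is a hypothetical helper, and the 10M-cell case assumes the same 20 cores. It says nothing about where the break-even point lies for a given solver or interconnect.

```python
def cells_per_process(n_cells, n_procs):
    """Average number of mesh cells handled by each solver process."""
    return n_cells / n_procs


# Mesh sizes and core counts as quoted in the thread
cases = [
    ("FSI case, 20k cells on 20 cores", 20_000, 20),
    ("8000 cells on 3 cores", 8_000, 3),
    ("8000 cells on 8 cores", 8_000, 8),
    ("Erik's benchmark, 5M cells on 20 cores", 5_000_000, 20),
    ("Suggested 10M cells on 20 cores (assumed core count)", 10_000_000, 20),
]

if __name__ == "__main__":
    for label, cells, procs in cases:
        print(f"{label}: {cells_per_process(cells, procs):,.0f} cells per process")
```

The FSI case leaves only about 1,000 cells per process, against 250,000 for the 5M-cell benchmark, which fits the observation that the dual Xeon only pulled ahead on the larger mesh.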

acasas · January 13, 2015, 13:58

Guys, check out Erik's benchmark thread: http://www.cfd-online.com/Forums/har...quad-xeon.html

