ECC vs. non ECC ram: My opinion

Old   September 27, 2013, 11:00
Default ECC vs. non ECC ram: My opinion
  #1
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
Hi CFD users!
I would like to share my opinion about ECC vs. non-ECC RAM.

I recently bought a new workstation:

- dual Intel Xeon E5-2630
- Asus Z9PE-D8 WS
- Nvidia Quadro 600
- 64 GB RAM (I got both ECC and non-ECC to test them)

Non-ECC RAM: Corsair ValueSelect 8x8 GB (cmv8gx3m1a1333c9)
ECC RAM: Samsung 8x8 GB (M393B1K70CH0-CH9)

Both types are DDR3 and run at 1333 MHz (PC3-10600).

I read on this forum that non-ECC RAM works well for CFD and that ECC is not a must.

On the internet I read about where ECC is useful, and about cosmic rays, so my first feeling was that ECC would not add much compared to non-ECC.

But in my opinion, and from my tests, ECC RAM is a must:
with my system and the latest ANSYS 14.7, running in parallel on all 12 physical cores with a mesh of about 1.5 million cells, Fluent crashed every 2-3 hours; the errors in the log file were very generic.
However, a couple of hours of memtest86+ on the non-ECC RAM showed no errors.

Then I switched to the ECC RAM: same mesh and same cores; no errors at all after 3 continuous days.

So, in my opinion, if you buy a new workstation: go for ECC RAM!!!

Daniele
ghost82 is offline   Reply With Quote

Old   September 27, 2013, 12:07
Default
  #2
HMN
New Member
 
Join Date: Apr 2012
Posts: 27
Rep Power: 14
HMN is on a distinguished road
How can you be sure that the problem comes from the non-ECC memory modules?
ECC memory modules need extra storage for parity bits that check the integrity of the data and can correct some errors.
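
To make the parity-bit idea concrete, here is a toy sketch of a Hamming(7,4) code, which stores 3 parity bits alongside 4 data bits and can locate and fix any single flipped bit. This is only an illustration of the principle: the function names are made up for this example, and real ECC modules commonly use a wider code (64 data bits plus 8 check bits per word) implemented in the memory controller hardware.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Layout by position 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when no error is seen; otherwise it is the
    1-based position of the single bit that was flipped and repaired.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single bit flip at position 5
print(hamming74_decode(word))
```

With a non-ECC module the flipped bit would simply be read back as wrong data; with the extra parity bits it can be detected and repaired on the fly.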

Is it really necessary? I use ANSYS 14.5.7 on a computer without ECC memory and get no errors.


By the way, you cannot have ANSYS 14.7. I think you mean 14.5.7.
HMN is offline   Reply With Quote

Old   September 27, 2013, 12:24
Default
  #3
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
Quote:
Originally Posted by HMN View Post
How can you be sure that the problem comes from the non-ECC memory modules?
ECC memory modules need extra storage for parity bits that check the integrity of the data and can correct some errors.

Is it really necessary? I use ANSYS 14.5.7 on a computer without ECC memory and get no errors.


By the way, you cannot have ANSYS 14.7. I think you mean 14.5.7.
Yes, ANSYS 14.5.7.
I'm sure because I ran the same case on the same hardware several times, changing only the memory modules.
I noticed that in serial mode I don't get any errors with the non-ECC modules; the problems begin with parallel calculation.
For that particular case ECC is a must for me, as I cannot restart the simulation every 2-3 hours.

Daniele
ghost82 is offline   Reply With Quote

Old   September 27, 2013, 22:25
Default
  #4
Senior Member
 
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18
kyle is on a distinguished road
I run a cluster with 15 quad-core i7 CPUs, and about one crash a week seems to be of the "random" variety: a crash that doesn't happen again when you restart the run. The cluster runs at about 50% utilization.

Even if all of those crashes are due to non-ECC memory, it still isn't enough to justify the additional cost and slower speed of ECC memory.
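
For a rough sense of scale, the crash rate above can be compared with the one reported in the first post. This is a back-of-envelope sketch only: it assumes 2.5 hours as the midpoint of "every 2-3 hours" and treats the dual-socket workstation as a single machine.

```python
HOURS_PER_WEEK = 7 * 24


def busy_hours_per_crash(machines, utilization, crashes_per_week):
    """Average busy machine-hours between crashes."""
    return machines * HOURS_PER_WEEK * utilization / crashes_per_week


# Cluster: 15 machines, about 50% utilization, about 1 random crash per week.
cluster = busy_hours_per_crash(machines=15, utilization=0.5, crashes_per_week=1)

# Workstation from the first post: assumed midpoint of "every 2-3 hours".
workstation = 2.5

print(cluster)                # 1260.0
print(cluster / workstation)  # 504.0
```

Under these assumptions the workstation crashed roughly 500 times more often per busy machine-hour than the cluster, which may hint at something more specific than ordinary random memory errors.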
kyle is offline   Reply With Quote

Old   September 28, 2013, 11:12
Default
  #5
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Greetings to all!

Quote:
Originally Posted by kyle View Post
Even if all of those crashes are due to non-ECC memory, it still isn't enough to justify the additional cost and slower speed of ECC memory.
It's just a matter of weighing the costs against the benefits. The experience in our office is that the results are always needed with the utmost urgency, so a crash overnight or over the weekend is simply unacceptable.

And it's bad enough when machines can crash on their own for some hardware reason or other (example: http://whatif.xkcd.com/63/, section "10 Exabytes"). Having non-ECC RAM cause additional frequent crashes might not be acceptable in some situations.

But hey, few people know that the quality of the electricity supply can play a very important role in cluster environments.

As for the original post: the problem might have been something that wasn't properly configured in the BIOS, or perhaps the RAM modules simply were not compatible with the motherboard (yes, that can happen!).
And memtest86+ is no longer an accurate way to assess whether RAM is OK or not. This is why Google has made available the stressapptest utility: http://code.google.com/p/stressapptest/
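
As an illustration, a stressapptest run could be wrapped in a few lines of Python like this. It is a sketch that assumes a Linux system with the `stressapptest` binary on the PATH; the helper names and the default sizes (one hour, 48 GB of a 64 GB machine, 12 copy threads) are illustrative choices, not recommendations from the tool's authors.

```python
import shutil
import subprocess


def build_stressapptest_cmd(seconds, megabytes, copy_threads):
    """Build the argument list for a stressapptest run.

    -s: run time in seconds
    -M: megabytes of RAM to test
    -m: number of memory copy threads
    """
    return [
        "stressapptest",
        "-s", str(seconds),
        "-M", str(megabytes),
        "-m", str(copy_threads),
    ]


def run_stressapptest(seconds=3600, megabytes=49152, copy_threads=12):
    """Run stressapptest and return its output, or None if it is not installed."""
    if shutil.which("stressapptest") is None:
        return None
    cmd = build_stressapptest_cmd(seconds, megabytes, copy_threads)
    result = subprocess.run(cmd, capture_output=True, text=True)
    # The tool prints its own summary; read it rather than trusting this wrapper.
    return result.stdout
```

Calling `print(run_stressapptest(seconds=60, megabytes=1024, copy_threads=4))` gives a quick smoke test; a longer run over most of the installed memory is more likely to expose a marginal module.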

Best regards,
Bruno
ghost82 and Anna Tian like this.
wyldckat is offline   Reply With Quote

Old   September 28, 2013, 15:20
Default
  #6
Senior Member
 
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18
kyle is on a distinguished road
You could just have an extremely simple script to restart from the last save file. If it crashes on the same iteration as before, then give up.

If your runs are urgent then that is all the more reason not to buy ECC memory and the incredibly expensive CPUs and motherboards you need to use it. For any given hardware budget you can, conservatively, get at least double the speed if you do not purchase enterprise class hardware.

This starts to break down once you get to a massive system where data is hopping across multiple switches, but unless you are Boeing or Lockheed, you probably aren't working at that scale. Below 400 cores, I'd stick with i7s and overclocked low-latency non-ECC memory.
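
A minimal sketch of such a restart loop, written in Python for illustration. The callable `run_from` is hypothetical: you would have to write it yourself so that it launches your solver from a saved iteration and reports back whether the run finished, at which iteration it crashed, and which iteration was last saved.

```python
def run_with_restarts(run_from, max_restarts=20):
    """Restart a crashed run from its last save; give up on a repeated crash.

    run_from(start_iteration) is a user-supplied (hypothetical) callable that
    returns a tuple (finished, crashed_at, last_saved):
      finished   - True if the run completed
      crashed_at - iteration at which it crashed (ignored when finished)
      last_saved - iteration of the most recent save file
    Returns True if the run eventually finished, False otherwise.
    """
    start = 0
    previous_crash = None
    for _ in range(max_restarts + 1):
        finished, crashed_at, last_saved = run_from(start)
        if finished:
            return True
        if crashed_at == previous_crash:
            # Same iteration twice in a row: probably a real problem
            # with the case, not a random crash, so stop retrying.
            return False
        previous_crash = crashed_at
        start = last_saved
    return False
```

The `max_restarts` cap is an extra safety net, so that a machine crashing at a different iteration every time does not loop forever.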
wyldckat, HMN and Anna Tian like this.
kyle is offline   Reply With Quote

Old   September 30, 2013, 12:19
Default
  #7
HMN
New Member
 
Join Date: Apr 2012
Posts: 27
Rep Power: 14
HMN is on a distinguished road
Quote:
Originally Posted by kyle View Post
You could just have an extremely simple script to restart from the last save file. If it crashes on the same iteration as before, then give up.

If your runs are urgent then that is all the more reason not to buy ECC memory and the incredibly expensive CPUs and motherboards you need to use it. For any given hardware budget you can, conservatively, get at least double the speed if you do not purchase enterprise class hardware.

This starts to break down once you get to a massive system where data is hopping across multiple switches, but unless you are Boeing or Lockheed, you probably aren't working at that scale. Below 400 cores, I'd stick with i7s and overclocked low-latency non-ECC memory.
Sorry for the newbie question, but what should the script look like?
Is it something that you can set up for every project automatically? I am still a newbie and don't use scripts.

Can this code be called from my Visual Basic/Excel application?

Thanks
HMN is offline   Reply With Quote

Old   September 30, 2013, 17:34
Default
  #8
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 528
Rep Power: 20
JBeilke is on a distinguished road
Quote:
Originally Posted by ghost82 View Post
Yes, ANSYS 14.5.7.
I'm sure because I ran the same case on the same hardware several times, changing only the memory modules.
I noticed that in serial mode I don't get any errors with the non-ECC modules; the problems begin with parallel calculation.
For that particular case ECC is a must for me, as I cannot restart the simulation every 2-3 hours.

Daniele
There might just be a problem with one of your non ECC modules. I had similar problems some time ago. After running stressapptest and replacing the broken module I had no more crashes.
JBeilke is offline   Reply With Quote

Old   September 30, 2013, 23:28
Default
  #9
New Member
 
CFD
Join Date: Jan 2013
Posts: 23
Rep Power: 13
siefdi is on a distinguished road
Quote:
I recently bought a new workstation:

- dual Intel Xeon E5-2630
- Asus Z9PE-D8 WS
- Nvidia Quadro 600
- 64 GB RAM (I got both ECC and non-ECC to test them)

Non-ECC RAM: Corsair ValueSelect 8x8 GB (cmv8gx3m1a1333c9)
ECC RAM: Samsung 8x8 GB (M393B1K70CH0-CH9)

Both types are DDR3 and run at 1333 MHz (PC3-10600).

Well, if you have this board (Z9PE-D8 WS) and Samsung ECC DDR3 1333 MHz RAM, I would recommend overclocking the memory to 1600 MHz through the BIOS settings (it runs stably in my system, which has almost the same configuration as yours, and I get about a 30% performance increase in my OpenFOAM calculations). Strangely enough (at least to me), I could not do the same with the non-ECC modules, even though they are originally rated for up to 1866 MHz.

+1 for ECC

Regards,
siefdi
ghost82 likes this.
siefdi is offline   Reply With Quote

Old   October 1, 2013, 10:25
Default
  #10
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,186
Rep Power: 23
evcelica is on a distinguished road
Quote:
Originally Posted by JBeilke View Post
There might just be a problem with one of your non ECC modules. I had similar problems some time ago. After running stressapptest and replacing the broken module I had no more crashes.
Correct, this may be a problem with the memory modules themselves rather than with ECC vs. non-ECC in general.
Crucial does make some ECC memory rated for 1866 MHz, with CL13 timings.
evcelica is offline   Reply With Quote

Old   October 1, 2013, 17:13
Default
  #11
New Member
 
Benj FitzPatrick
Join Date: Apr 2012
Posts: 4
Rep Power: 14
wazoo42 is on a distinguished road
You should have settings in the BIOS to turn several ECC options off. Then you could run the tests again with the ECC RAM and see if it crashes.
wazoo42 is offline   Reply With Quote

Old   October 3, 2013, 07:49
Default
  #12
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,186
Rep Power: 23
evcelica is on a distinguished road
Quote:
Originally Posted by wazoo42 View Post
You should have settings in the BIOS to turn several ECC options off. Then you could run the tests again with the ECC RAM and see if it crashes.
That's actually an excellent idea. It would give a real ECC vs. non-ECC comparison with the same memory sticks.
evcelica is offline   Reply With Quote

Old   October 3, 2013, 10:10
Default
  #13
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
Quote:
Originally Posted by wazoo42 View Post
You should have settings in the BIOS to turn several ECC options off. Then you could run the tests again with the ECC RAM and see if it crashes.
Unfortunately, in the BIOS I can see "ECC Enabled", but I cannot modify it.
ghost82 is offline   Reply With Quote

Old   October 7, 2013, 11:52
Default
  #14
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
Quote:
Originally Posted by siefdi View Post
Well, if you have this board (Z9PE-D8 WS) and Samsung ECC DDR3 1333 MHz RAM, I would recommend overclocking the memory to 1600 MHz through the BIOS settings (it runs stably in my system, which has almost the same configuration as yours, and I get about a 30% performance increase in my OpenFOAM calculations). Strangely enough (at least to me), I could not do the same with the non-ECC modules, even though they are originally rated for up to 1866 MHz.

+1 for ECC

Regards,
siefdi
But my processors support only 1333 MHz, so I think it would not be useful.
What is/are your CPU(s)?
ghost82 is offline   Reply With Quote

Old   October 7, 2013, 21:07
Default
  #15
New Member
 
CFD
Join Date: Jan 2013
Posts: 23
Rep Power: 13
siefdi is on a distinguished road
Quote:
But my processors support only 1333 MHz, so I think it would not be useful.
What is/are your CPU(s)?
Ah, my bad. Sorry, I didn't check your CPU's specs before I wrote my previous comment. I am working with the E5-2660, which supports 1600 MHz.

Regards,
siefdi
siefdi is offline   Reply With Quote

Old   October 11, 2013, 12:43
Default
  #16
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
I noticed that I have some errors in the cortexerror.log file:

Code:
Error [cortex] [time 10/7/13 0:29:23] 
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 0:32:33] 
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 0:52:45] ‡flØ

Error [cortex] [time 10/7/13 1:34:47] ‡flØ

Error [cortex] [time 10/7/13 1:46:29] ‡flØ

Error [cortex] [time 10/7/13 1:56:52] ‡flØ

Error [cortex] [time 10/7/13 2:8:49] ‡flØ

Error [cortex] [time 10/7/13 2:11:16] 
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 19:51:13] ‡flû

Error [cortex] [time 10/7/13 19:57:50] ‡flû

Error [cortex] [time 10/8/13 23:55:1] ‡flû

Error [cortex] [time 10/9/13 7:55:24] 
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.
This type of error

C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

sometimes appears when I'm exiting: the window closes and all seems OK, but this error is written to the log file.

The second type of error

Error [cortex] [time 10/7/13 19:57:50] ‡flû

appears at random.

When I had the non-ECC RAM, Fluent crashed with this type of error; now, with ECC, the simulation continues without problems and the error is only logged to the file.
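
To see how often each kind of entry occurs, the log can be tallied with a short script. This is a sketch that assumes the entry format shown in the excerpt above and a month/day/year date order; the function name is made up for this example.

```python
import re
from datetime import datetime

# Matches header lines such as: Error [cortex] [time 10/7/13 0:29:23]
HEADER = re.compile(r"^Error \[cortex\] \[time (\d+/\d+/\d+ \d+:\d+:\d+)\]")


def parse_cortex_errors(text):
    """Return a list of (timestamp, kind) tuples for each logged error.

    kind is "fatal signal" when the line after the header reports one,
    and "other" for everything else (for example the unreadable entries).
    """
    entries = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = HEADER.match(line)
        if not match:
            continue
        # Assumed date order: month/day/two-digit year.
        when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        following = lines[i + 1] if i + 1 < len(lines) else ""
        kind = "fatal signal" if "received fatal signal" in following else "other"
        entries.append((when, kind))
    return entries
```

Reading the file with `open(path, errors="replace").read()` avoids decoding failures on the unreadable bytes, and sorting the resulting entries by kind and time shows whether the two error types cluster around particular runs.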

Am I being hit by cosmic rays??

Daniele
ghost82 is offline   Reply With Quote

Old   October 11, 2013, 14:37
Default
  #17
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 528
Rep Power: 20
JBeilke is on a distinguished road
Did you check your non-ECC RAM modules using stressapptest? It is meaningless to compare broken non-ECC RAM modules to anything else.
JBeilke is offline   Reply With Quote

Old   October 23, 2013, 07:49
Default
  #18
New Member
 
John McEntee
Join Date: Jun 2013
Posts: 8
Rep Power: 13
jmcentee is on a distinguished road
I think the Intel Xeon only supports ECC RAM.
jmcentee is offline   Reply With Quote

Old   October 26, 2013, 06:40
Default
  #19
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Quote:
Originally Posted by jmcentee View Post
I think the Intel Xeon only supports ECC RAM.
I guess that it's best to quote the manufacturer on this one. Here's an example: http://www.intel.com/cd/channel/rese...eon/440799.htm - "DDR3 Memory for the Intel® Xeon® Processor 5600 Series"
Quote:
Multiple DDR3 DIMM types are supported:
  • Registered DIMM (RDIMM)
  • Unbuffered DIMM (UDIMM) Error-Correcting Code (ECC)
  • Unbuffered DIMM (UDIMM) Non Error-Correcting Code (Non-ECC)
More specifically, for the CPU reported by the original poster, the specs page for E5-2630 is this: http://ark.intel.com/products/64593/...-QPI?q=e5-2630 - it indicates that ECC is supported and that it will only work if both the CPU and the chipset support it.

The chipset is embedded into the motherboard, so the limitation might actually come from said motherboard, in either direction, i.e. ECC only or non-ECC only.
Another limitation in some cases is that certain memory modules are not compatible with the motherboard. This is why motherboard vendors usually publish a list of compatible memory modules for each motherboard.

Let me see if I can find a motherboard that specifically says that only ECC is supported... mmm... apparently there shouldn't exist such a motherboard/chipset, as indicated here: http://www.intel.com/support/motherb.../cs-009023.htm


I did a bit more research and found out that the RAM that the original poster used is meant for dual- and triple-channel motherboards: http://www.corsair.com/en/memory-by-...m1a1333c9.html
Quote:
Designed for use with all DDR3 motherboards with two or three memory channels
While the CPU is quad-channel: http://ark.intel.com/products/64593
Quote:
# of Memory Channels 4
So perhaps this is the real reason why it doesn't work on his box. The RAM simply wasn't designed for quad-channel.
Anna Tian likes this.
wyldckat is offline   Reply With Quote

Old   February 13, 2014, 10:32
Default
  #20
Senior Member
 
ghost82's Avatar
 
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27
ghost82 will become famous soon enough
Updates on this topic:
Since I upgraded my workstation to 2x Xeon E5-2687W, I have read some useful information about my motherboard, the Asus Z9PE-D8 WS: several users around the internet report problems with non-ECC RAM on this board, even though Asus claims that it is compatible with non-ECC memory.
So my problem could be related to my motherboard/BIOS version and not to ECC vs. non-ECC RAM.
Anyway, the non-ECC RAM was sold and the buyers are still happy with it.

Daniele
flotus1 likes this.
ghost82 is offline   Reply With Quote
