September 27, 2013, 11:00 |
ECC vs. non ECC ram: My opinion
|
#1 |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
Hi cfd users!
I would like to share my opinion about ECC vs. non-ECC RAM. I recently bought a new workstation:
- dual Intel Xeon E5-2630
- Asus Z9PE-D8 WS
- Nvidia Quadro 600
- 64 GB RAM (I got both ECC and non-ECC to test them)
Non-ECC RAM: Corsair ValueSelect 8x8 GB (cmv8gx3m1a1333c9)
ECC RAM: Samsung 8x8 GB (M393B1K70CH0-CH9)
Both types are DDR3 and run at 1333 MHz (PC3-10600).
I read in this forum that non-ECC RAM works well for CFD and that ECC is not a must. On the internet I read about where ECC is useful, and I read about cosmic rays, so my first feeling was that ECC is not much more useful than non-ECC.
But in my opinion, and from my tests, ECC RAM is a must: with my system and the latest ansys 14.7, working in parallel on all 12 real cores with a mesh of about 1.5 million cells, Fluent crashes every 2-3 hours, and the errors in the log file were very generic. However, a couple of hours of testing the non-ECC RAM with memtest86+ showed no errors.
Then I changed to ECC RAM: same mesh and same cores, and no errors at all after 3 continuous days.
So, in my opinion, if you buy a new workstation: go for ECC RAM!!! Daniele |
|
September 27, 2013, 12:07 |
|
#2 | |
New Member
Join Date: Apr 2012
Posts: 27
Rep Power: 14 |
Quote:
ECC memory modules need extra storage for parity bits that check the integrity of the data and can correct some errors... Is it really necessary? I use ANSYS 14.5.7 on a computer without ECC memory and get no errors. By the way, you cannot have ANSYS 14.7. I think you mean 14.5.7. |
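To make the "parity bits that can correct some errors" idea concrete, here is a toy Hamming(7,4) sketch in Python. It is only an illustration of single-bit correction: ECC DIMMs typically protect 64 data bits with 8 extra check bits using a wider code of the same family, and nothing here is claimed to be what the Samsung modules actually implement. The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single bit flip in memory
recovered, flipped_at = hamming74_decode(stored)
```

After the simulated flip, `recovered` equals the original `word` and `flipped_at` points at the damaged position. Without the three parity bits the flip would go unnoticed, which is the situation non-ECC memory is in.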
||
September 27, 2013, 12:24 |
|
#3 | |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
Quote:
I'm sure because I ran the same case on the same hardware several times, changing only the memory modules. I noticed that in serial mode I don't get any errors with the non-ECC modules; the problems begin with parallel calculation. For that particular case ECC is a must for me, as I cannot restart the simulation every 2-3 hours. Daniele |
||
September 27, 2013, 22:25 |
|
#4 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
I run a cluster with 15 quad core i7 CPUs, and it seems like 1 crash a week is of the "random" variety. These are crashes that don't happen again when you restart the run. I have about 50% utilization.
Even if all of those crashes are due to non-ECC memory, it still isn't enough to justify the additional cost and slower speed of ECC memory. |
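As a rough back-of-the-envelope check of the figures above (15 nodes, about 50% utilization, about 1 random crash per week), the crash rate can be expressed per busy node-hour. This is only a sketch: it assumes random crashes scale linearly with busy node-hours, and the function name is invented for the example.

```python
def busy_node_hours_per_crash(nodes, utilization, crashes_per_week):
    """Average busy node-hours between random crashes (assumes linear scaling)."""
    hours_per_week = 7 * 24
    busy_node_hours = nodes * hours_per_week * utilization
    return busy_node_hours / crashes_per_week


cluster = busy_node_hours_per_crash(nodes=15, utilization=0.5, crashes_per_week=1)
days_for_one_busy_node = cluster / 24
```

With these numbers `cluster` is 1260 busy node-hours per crash, i.e. about 52.5 days of continuous running on a single node. That is far from a crash every 2-3 hours on one workstation, so the two situations are hard to compare directly.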
|
September 28, 2013, 11:12 |
|
#5 | |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
Quote:
And it's bad enough that machines can crash on their own for one hardware reason or another (example: http://whatif.xkcd.com/63/, section "10 Exabytes"). If non-ECC RAM causes additional frequent crashes, that might not be acceptable in some situations. But hey, few people know that the quality of the electricity supply can play a very important role in cluster environments. As for the original post: the problem might have been something that wasn't properly configured in the BIOS, or perhaps the RAM modules simply were not compatible with the motherboard (yes, that can happen!). And memtest86+ is no longer an accurate way to assess whether RAM is OK or not. This is why Google has made available the stressapptest utility: http://code.google.com/p/stressapptest/ Best regards, Bruno |
||
September 28, 2013, 15:20 |
|
#6 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
You could just have an extremely simple script to restart from the last save file. If it crashes on the same iteration as before, then give up.
If your runs are urgent then that is all the more reason not to buy ECC memory and the incredibly expensive CPUs and motherboards you need to use it. For any given hardware budget you can, conservatively, get at least double the speed if you do not purchase enterprise class hardware. This starts to break down once you get to a massive system where data is hopping across multiple switches, but unless you are Boeing or Lockheed, you probably aren't working at that scale. <400 cores, I'd stick with i7's and overclocked low-latency non-ECC memory. |
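The restart idea from the top of this post can be sketched in a few lines of Python. This is only an outline under assumptions: `run_once` is a hypothetical hook that you would write yourself to launch the solver from a given save point and report how far it got; it is not a Fluent or OpenFOAM API. The fake solver below exists only so the sketch can run on its own.

```python
def run_with_restarts(run_once, max_restarts=20):
    """Restart after crashes; give up if the run dies twice at the same iteration.

    run_once(start_iter) is a user-supplied (hypothetical) callable returning
    (finished, stopped_at_iter, last_saved_iter).
    Returns (finished, iteration_reached).
    """
    start_iter = 0
    previous_crash = None
    for _ in range(max_restarts + 1):
        finished, stopped_at, last_save = run_once(start_iter)
        if finished:
            return True, stopped_at
        if stopped_at == previous_crash:
            # Same iteration as last time: not a random crash, so stop trying.
            return False, stopped_at
        previous_crash = stopped_at
        start_iter = last_save  # resume from the last save file
    return False, previous_crash


def make_fake_solver(crash_iters, total_iters=1000, save_every=100):
    """Stand-in for a real solver launch: crashes at the listed iterations."""
    pending = list(crash_iters)

    def run_once(start_iter):
        if pending:
            crash = pending.pop(0)
            return False, crash, (crash // save_every) * save_every
        return True, total_iters, total_iters

    return run_once


random_crashes = run_with_restarts(make_fake_solver([250, 730]))
repeatable_crash = run_with_restarts(make_fake_solver([250, 250]))
```

In the first case the run survives two unrelated crashes and finishes; in the second it stops after crashing twice at iteration 250, which points at the case setup rather than at the hardware.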
|
September 30, 2013, 12:19 |
|
#7 | |
New Member
Join Date: Apr 2012
Posts: 27
Rep Power: 14 |
Quote:
Is it something that you can set up for every project automatically? I am still a newbie and don't use scripts. Can this code be called from my Visual Basic/Excel application? Thanks |
||
September 30, 2013, 17:34 |
|
#8 | |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Rep Power: 20 |
Quote:
|
||
September 30, 2013, 23:28 |
|
#9 | |
New Member
CFD
Join Date: Jan 2013
Posts: 23
Rep Power: 13 |
Quote:
Well, if you have this board (Z9PE-D8 WS) and Samsung ECC DDR3 1333 MHz RAM, I would recommend overclocking the memory to run at 1600 MHz through the BIOS settings (it runs stable on my system, which has almost the same configuration as yours, and I get about a 30% performance increase in my OpenFOAM calculations). Strangely enough (at least to me), I could not do the same with the non-ECC modules, even though they are originally rated for up to 1866 MHz. +1 for ECC Regards, siefdi |
||
October 1, 2013, 10:25 |
|
#10 | |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23 |
Quote:
Crucial does make some ECC memory rated to 1866 MHz, CL timings are 13. |
||
October 1, 2013, 17:13 |
|
#11 |
New Member
Benj FitzPatrick
Join Date: Apr 2012
Posts: 4
Rep Power: 14 |
You should have options to turn several ecc options off in the bios. Then you could run the tests again with the ECC ram and see if it crashes.
|
|
October 3, 2013, 07:49 |
|
#12 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,188
Rep Power: 23 |
||
October 3, 2013, 10:10 |
|
#13 |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
||
October 7, 2013, 11:52 |
|
#14 | |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
Quote:
What is/are your cpu(s)? |
||
October 7, 2013, 21:07 |
|
#15 | |
New Member
CFD
Join Date: Jan 2013
Posts: 23
Rep Power: 13 |
Quote:
Regards, siefdi |
||
October 11, 2013, 12:43 |
|
#16 |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
I noticed that I have some errors in the cortexerror.log file:
Code:
Error [cortex] [time 10/7/13 0:29:23]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.
Error [cortex] [time 10/7/13 0:32:33]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.
Error [cortex] [time 10/7/13 0:52:45] ‡flØ
Error [cortex] [time 10/7/13 1:34:47] ‡flØ
Error [cortex] [time 10/7/13 1:46:29] ‡flØ
Error [cortex] [time 10/7/13 1:56:52] ‡flØ
Error [cortex] [time 10/7/13 2:8:49] ‡flØ
Error [cortex] [time 10/7/13 2:11:16]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.
Error [cortex] [time 10/7/13 19:51:13] ‡flû
Error [cortex] [time 10/7/13 19:57:50] ‡flû
Error [cortex] [time 10/8/13 23:55:1] ‡flû
Error [cortex] [time 10/9/13 7:55:24]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

The first type of error:
Code:
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.
comes sometimes when I'm exiting: the window closes and all seems OK, but this error is written to the log file.
The second type of error:
Code:
Error [cortex] [time 10/7/13 19:57:50] ‡flû
comes randomly.
When I had the non-ECC RAM, Fluent crashed with this type of error; now, with ECC, the simulation continues without problems and the error is only logged in the file. Am I being hit by cosmic rays?? Daniele |
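To see how often each kind of entry shows up, the log can be tallied with a short Python sketch. It assumes every entry starts with `Error [cortex] [time ...]` as in the excerpt above; the function name and the two category labels are invented for this example, and the `??` in the sample is just a stand-in for the unreadable bytes.

```python
import re
from collections import Counter

# One entry runs from its "Error [cortex] [time ...]" header to the next header.
ENTRY = re.compile(
    r"Error \[cortex\] \[time ([^\]]+)\]\s*(.*?)(?=Error \[cortex\] \[time |\Z)",
    re.DOTALL,
)


def summarize_cortex_log(log_text):
    """Count log entries by kind: 'fatal signal' or 'unreadable'."""
    counts = Counter()
    for _timestamp, message in ENTRY.findall(log_text):
        if "received fatal signal" in message:
            counts["fatal signal"] += 1
        else:
            counts["unreadable"] += 1
    return counts


sample = (
    "Error [cortex] [time 10/7/13 0:29:23] fl1457s.exe received fatal signal ()\n"
    "1. Note exact events leading to error.\n"
    "Error [cortex] [time 10/7/13 0:52:45] ??\n"
    "Error [cortex] [time 10/7/13 1:34:47] ??\n"
)
summary = summarize_cortex_log(sample)
```

For a real file, read it with something like `open("cortexerror.log", errors="replace").read()` so the unreadable bytes do not stop the script. Comparing the counts before and after a hardware change gives a more objective picture than reading the log by eye.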
|
October 11, 2013, 14:37 |
|
#17 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Rep Power: 20 |
Did you check your non-ECC RAM using stressapptest? It is meaningless to compare broken non-ECC RAM modules to anything else.
|
|
October 23, 2013, 07:49 |
|
#18 |
New Member
John McEntee
Join Date: Jun 2013
Posts: 8
Rep Power: 13 |
I think the intel xeon only supports ecc ram.
|
|
October 26, 2013, 06:40 |
|
#19 | |||
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128 |
I guess that it's best to quote the manufacturer on this one. Here's an example: http://www.intel.com/cd/channel/rese...eon/440799.htm - "DDR3 Memory for the Intel® Xeon® Processor 5600 Series"
Quote:
The chipset is embedded into the motherboard, so the limitation might actually come from said motherboard, in either direction, i.e. ECC only or non-ECC only. Another limitation in some cases is that certain memory modules are not compatible with the motherboard. This is why motherboard vendors usually publish a list of compatible memory modules for each motherboard. Let me see if I can find a motherboard that specifically says that only ECC is supported... mmm... apparently such a motherboard/chipset shouldn't exist, as indicated here: http://www.intel.com/support/motherb.../cs-009023.htm I did a bit more research and found out that the RAM that the original poster used is meant for dual and triple-channel motherboards: http://www.corsair.com/en/memory-by-...m1a1333c9.html Quote:
Quote:
|
||||
February 13, 2014, 10:32 |
|
#20 |
Senior Member
Rick
Join Date: Oct 2010
Posts: 1,016
Rep Power: 27 |
Updates on this topic:
Since I upgraded my workstation to 2x Xeon E5-2687W, I have read some useful info about my motherboard, the Asus Z9PE-D8 WS: several users around the internet report problems with non-ECC RAM on this mobo, even though Asus claims that it is compatible with non-ECC memory. So my problem could be related to my mobo/BIOS version and not to ECC vs. non-ECC RAM. Anyway, the non-ECC RAM was sold and the buyers are still happy with it. Daniele |
|
|