Inconsistent SIGSEGV: memory access exception |
April 18, 2012, 10:41 |
|
#1 |
New Member
James Lo
Join Date: Feb 2012
Posts: 8
Rep Power: 14 |
Hi guys,
I have read through another thread about SIGSEGV: memory access exception and have checked my mesh and cell sizes; I believe this has a different cause than the ones others encountered. I have 2 machines that run an identical mesh with different BCs for a parametric analysis, and this error only occurs on one of them, and only sometimes. That is, one run will fail with this error, but if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't get it at all for 3-4 runs. It's driving me nuts... Is it possible we just have a bad stick of RAM in there, and the process crashes whenever it accesses the bad region? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs. But I don't want to go out and buy RAM just on a hunch here. Any suggestions would help, thanks! - James |
|
April 18, 2012, 11:31 |
|
#2 |
Member
Ryan Coe
Join Date: Jun 2010
Location: Albuquerque, NM
Posts: 98
Rep Power: 16 |
I have also had this error occur seemingly at random. I'm curious whether anyone else out there has a solution.
__________________
Ryan |
|
April 18, 2012, 12:43 |
|
#3 |
Senior Member
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
Rep Power: 19 |
This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.
Examples: One of my simulations was throwing this only yesterday, and I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report, which was set to "initial surface" rather than "volume mesh". Another one was in a situation similar to yours: if I submitted it across, say, X cores, it would not run, but with X/2 cores it would. I had no explanation for this. Yet another one would not run at all on my cluster. I opened it locally, did a single iteration, then submitted it to the cluster. It ran fine after that. |
|
April 18, 2012, 20:03 |
|
#4 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Rep Power: 34 |
Quote:
If this happens, then there is a good chance that it is a bug in the program. You should report it to your local support.
Based on my experience with programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete memory that was never allocated. Typically, most compilers initialize variables to 0; some do not. So if you try to delete an array whose pointer was not initialized to 0, it might cause a problem. It would work something like this:
real *Array; /// should be initialized to 0
delete [] Array;
This works fine if Array was initialized to 0, and gives the error you mentioned if it was not.
If you are wondering why anyone would write it like this: the first time, the array is only declared as
real *Array;
but in the remaining iterations it might be:
delete [] Array;
Array = new real [ size ];
All of this is system and compiler dependent. |
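As a minimal sketch of how the pattern described in this post can be made safe (not STAR-CCM+ source code, which is not shown in this thread): in standard C++ an uninitialized pointer holds an indeterminate value, while `delete[]` on a null pointer is guaranteed to do nothing. Giving the pointer an explicit null initializer therefore makes the "delete, then reallocate" step valid on the first iteration too. The names `WorkBuffer` and `resize` are hypothetical, and `real` is assumed to be a floating-point typedef.

```cpp
#include <cassert>
#include <cstddef>

using real = double;  // assumption: 'real' is a floating-point typedef

// Hypothetical per-iteration work array, illustrating the pattern from the post.
struct WorkBuffer {
    real*       data = nullptr;  // explicit null: delete[] on nullptr is a no-op
    std::size_t size = 0;

    WorkBuffer() = default;
    // Copying is disabled so two objects can never free the same array.
    WorkBuffer(const WorkBuffer&) = delete;
    WorkBuffer& operator=(const WorkBuffer&) = delete;

    // Called every iteration: free the old storage, then allocate anew.
    void resize(std::size_t n) {
        delete[] data;  // safe on the very first call because data == nullptr
        data = (n > 0) ? new real[n]() : nullptr;  // () zero-initializes the elements
        size = n;
    }

    ~WorkBuffer() { delete[] data; }
};
```

Calling `resize` on a freshly constructed `WorkBuffer` is well defined here, whereas the same call on a pointer that was only declared (`real *Array;`) would pass an indeterminate address to `delete[]`.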
||
April 28, 2012, 08:03 |
|
#5 | |
Senior Member
Join Date: Oct 2009
Location: Germany
Posts: 636
Rep Power: 22 |
Quote:
I usually get this error when a lot changes, e.g. an interface needs updating due to moving meshes, so the face count or the vertex count for a cell or boundary changes. I totally agree with Arjun that it is most likely a bug which should be reported. |
||
August 13, 2012, 14:42 |
|
#6 |
New Member
John Anastos
Join Date: Aug 2012
Posts: 1
Rep Power: 0 |
But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both; on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.
Any thoughts? |
|
August 13, 2012, 15:40 |
|
#7 | |
Senior Member
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
Rep Power: 19 |
Quote:
|
||
August 15, 2012, 08:36 |
|
#8 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Rep Power: 34 |
Quote:
It should be compiler dependent, not machine dependent. |
||
August 15, 2012, 10:31 |
|
#9 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128 |
Greetings to all!
We've had these kinds of problems in our office machines and have realized that memtest86+ isn't enough to catch most of these memory errors on the latest hardware (since 2008-2009). We've only been able to detect these issues by using stressapptest: http://code.google.com/p/stressapptest/ - it's already available in most of the latest Linux distros. Example commands for properly testing RAM:
Code:
stressapptest -W --cc_test
stressapptest -W --cc_test -M 5000
Best regards, Bruno
__________________
|
|
August 17, 2012, 05:07 |
|
#10 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Rep Power: 20 |
Hi Bruno,
Thanks very much for the hint. I saw very strange prostar behaviour after upgrading from 16GB to 32GB of memory. Prostar crashed while running a long postprocessing script, without any error message. The crashes occurred randomly and on just one machine. stressapptest reported some problems. Did you see differences between normal RAM and ECC RAM? |
|
August 17, 2012, 08:31 |
|
#11 |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Rep Power: 34 |
Can you guys provide any test case that I could run to reproduce the issue?
|
|
August 17, 2012, 10:27 |
|
#12 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Rep Power: 20 |
What testcase do you want? The output of a failed stressapptest run?
Code:
> stressapptest -W --cc_test -M 20000
...
Report Error: miscompare : DIMM Unknown : 1 : 6s
Hardware Error: miscompare on CPU 2(0x2) at 0x7f6cae2b6798(0x0:DIMM Unknown): read:0xe9e9e9e8e9e9e9e9, reread:0xe9e9e9e8e9e9e9e9 expected:0xe9e9e9e9e9e9e9e9
Log: Seconds remaining: 10
Stats: CC Thread(0): Time=20033474 us, Increments=1396261000, Increments/sec = 69696399.136765
Stats: CC Thread(1): Time=20034022 us, Increments=1167366000, Increments/sec = 58269178.300793
Stats: CC Thread(2): Time=20033434 us, Increments=935507000, Increments/sec = 46697286.146748
Stats: CC Thread(3): Time=19973276 us, Increments=1269766000, Increments/sec = 63573246.572070
Log: Thread 3 found 715 hardware incidents
Stats: Found 715 hardware incidents
Stats: Completed: 11814.00M in 21.00s 562.53MB/s, with 715 hardware incidents, 0 errors
Stats: Memory Copy: 11814.00M at 590.10MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s
Status: FAIL - test discovered HW problems |
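To read the miscompare line in the log above, it can help to XOR the value that was read back with the expected pattern: every 1 bit in the result is a bit that came back wrong. The helper below is a small illustrative sketch (the name `flipped_bits` is made up here and is not part of stressapptest) that lists those bit positions for a 64-bit word.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the positions (0 = least significant bit) at which the value read
// back from memory differs from the value the test expected to find.
std::vector<int> flipped_bits(std::uint64_t read, std::uint64_t expected) {
    const std::bitset<64> diff(read ^ expected);  // 1 wherever the words disagree
    std::vector<int> positions;
    for (std::size_t bit = 0; bit < diff.size(); ++bit) {
        if (diff[bit]) {
            positions.push_back(static_cast<int>(bit));
        }
    }
    return positions;
}
```

For the values in the log, `flipped_bits(0xe9e9e9e8e9e9e9e9ULL, 0xe9e9e9e9e9e9e9e9ULL)` yields a single position, bit 32. The reread returned the same wrong value, so the difference is one bit in one 64-bit word, which fits the test's own verdict of a hardware problem.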
|
August 17, 2012, 10:35 |
|
#13 | |
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Rep Power: 34 |
Quote:
Any sim file that I could run on my machine and that shows the problem. I can run STAR-CCM+ but so far have not encountered this problem, and if the problem does not show up it is really hard to fix. Note: I now work at CD-adapco, so it kind of interests me more now. |
||
August 17, 2012, 10:41 |
|
#14 | |
Senior Member
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
Rep Power: 19 |
Quote:
From his post above, it looks to be a hardware issue, not a simulation issue. That's why it was only occurring on one machine. |
||
August 17, 2012, 17:55 |
|
#15 |
Senior Member
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Rep Power: 20 |
It can also be a software problem if it only occurs on one machine (unless you have an identical configuration).
Thanks to Bruno's hint about stressapptest, I was able to find out that it is a hardware problem. In the meantime, other people have also come across the same prostar crashes. It will be interesting to see what results we get from the test on those machines. |
|
August 18, 2012, 05:29 |
|
#16 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Rep Power: 128 |
So far the experience I've had is that:
__________________
|
|
February 6, 2015, 09:11 |
|
#17 | |
New Member
Paul D
Join Date: Nov 2011
Posts: 4
Rep Power: 15 |
Quote:
I am getting exactly the same memory access error as you did, so I am wondering whether you have managed to solve the problem. The funny thing is that I only get the error occasionally, and only when trying to load a simulation on multiple cores, not on a single core. I don't think this is a hardware matter, as my computer has 2 processors, 24 cores at 2.3GHz, and 64GB RAM. Many thanks, Pavlos |
||