Inconsistent SIGSEGV: memory access exception

Old   April 18, 2012, 10:41
Default Inconsistent SIGSEGV: memory access exception
  #1
New Member
 
James Lo
Join Date: Feb 2012
Posts: 8
Hi guys,

I have read through another thread about SIGSEGV: memory access exception and have checked my mesh and cell sizes. I believe it's a different cause than others have encountered.

I have 2 machines which run an identical mesh but different BCs for parametric analysis, and this error only occurs on one machine, and only sometimes. I mean that one run will fail with this error, then if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't get it at all for 3-4 runs. It's driving me nuts...

Is it possible we just have a bad stick of RAM in there, and whenever the process accesses the bad region it crashes? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs.

But I don't want to go out and buy RAM just on a hunch here.

Any suggestions would help.

Thanks!

- James

Old   April 18, 2012, 11:31
Default
  #2
Member
 
Ryan Coe
Join Date: Jun 2010
Location: Albuquerque, NM
Posts: 98
I have also had this error occur seemingly randomly at times. I'm curious if anyone else out there has a solution.
__________________
Ryan

Old   April 18, 2012, 12:43
Default
  #3
Senior Member
 
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.

Examples:

One of my simulations was having this only yesterday; I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report which was set to "initial surface" rather than "volume mesh".

Another one had a situation similar to yours: if I submitted it across, say, X cores, it would not run. But with X/2 cores, it would. I had no explanation for this.

Yet another one would not run at all on my cluster. I opened it locally, ran a single iteration locally, then submitted it to the cluster. It ran fine after that.

Old   April 18, 2012, 20:03
Default
  #4
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Quote:
Originally Posted by hiddenbunny View Post
Hi guys,

I have read through another thread about SIGSEGV: memory access exception, and have checked my mesh and cell sizes, I believe its an different cause than others had encountered.

I have 2 machines which run identical mesh but different bc for parametric analysis, and this error will only occur in one machine sometimes. I meant in 1 run it will fail and get this error, then if I reload the case and run it again, it might work. Sometimes I get 3-4 fails in the roll, all at different iteration/time steps, sometimes I won't get it at all for 3-4 runs. Its driving me nuts...

Is it possible we just have a bad stick in there and whenever it accessed that bad sector the process crashes? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh just different BC.

But I don't want to go out and buy ram just on a haunch here,

any suggestion would help

thanks!

- James

If this happens then there is a good chance that this is a bug in the program. You should report this to your local support.

Based on my experience with programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete memory through a pointer that was never allocated. Some compilers and build settings happen to leave such pointers initialized to 0; others do not.
So if you try to delete an array whose pointer was not initialized to 0, it might create a problem.

It would work something like this:

real *Array; // should be initialized to 0

delete [] Array;
(works fine if Array was initialized to 0, may give the error you mentioned if it was not)


If you are wondering why anyone would do this: the first time, the array is only declared, like real *Array; but in the rest of the iterations it might be:

delete [] Array;
Array = new real [ size ];

All of this is system and compiler dependent.
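
To make the pattern above concrete, here is a minimal sketch of the safe variant, where the pointer is explicitly set to null before the first delete. It assumes "real" is just a floating-point typedef; WorkBuffer and resize_twice are made-up names for illustration and have nothing to do with the actual STAR-CCM+ source.

```cpp
#include <cassert>
#include <cstddef>

using real = double;  // assumption: "real" stands for a floating-point type

// Hypothetical work buffer, for illustration only (not STAR-CCM+ code).
struct WorkBuffer {
    real* data = nullptr;  // explicit null instead of an indeterminate value
    std::size_t size = 0;

    WorkBuffer() = default;
    WorkBuffer(const WorkBuffer&) = delete;             // no accidental double delete
    WorkBuffer& operator=(const WorkBuffer&) = delete;
    ~WorkBuffer() { delete[] data; }

    // Same delete/new pattern as above, run once per "iteration".
    void resize(std::size_t n) {
        delete[] data;         // no-op on the first call, because data is null
        data = new real[n]();  // the () zero-initializes the elements
        size = n;
    }
};

// Resizes the buffer twice and returns the final element count.
std::size_t resize_twice(std::size_t first, std::size_t second) {
    WorkBuffer buf;
    assert(buf.data == nullptr);  // the invariant the pattern relies on
    buf.resize(first);
    buf.resize(second);
    return buf.size;
}
```

In C++, delete[] on a null pointer is defined to do nothing, which is why the explicit initialization matters. In new code, a std::vector<real> would give the same behaviour without any manual delete.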

Old   April 28, 2012, 08:03
Default
  #5
Senior Member
 
Join Date: Oct 2009
Location: Germany
Posts: 636
Quote:
Originally Posted by arjun View Post
Based on my experience with programming CFD codes, my guess is that this happens when starccm tries to delete some memory which is not already allocated.
I think it happens not only when deleting memory, but also when trying to access memory which hasn't been allocated yet. E.g. one thread allocates a huge array, and another thread tries to write or read a value at an address which has not yet been completely initialised by the first thread.

I usually get this error when a lot changes, e.g. when an interface needs an update due to moving meshes, so the face count or the vertex count of a cell or boundary changes.

I totally agree with arjun that it is most likely a bug which should be reported.
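
As a rough illustration of the two-thread scenario, the sketch below shows one way to make sure the reading thread cannot touch the array before the allocating thread has finished with it. It is only a toy example under my own assumptions: sum_after_init is a hypothetical name, "real" is assumed to be double, and nothing here reflects how STAR-CCM+ actually organises its threads.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <utility>
#include <vector>

using real = double;  // assumption: "real" stands for a floating-point type

// The producer thread allocates and fills the array, then hands it over
// through a promise. The consumer blocks in get() until that has happened,
// so it never sees a half-initialised array.
real sum_after_init(std::size_t n) {
    std::promise<std::vector<real>> ready;
    std::future<std::vector<real>> result = ready.get_future();

    std::thread producer([&ready, n] {
        std::vector<real> field(n, 1.0);    // allocate and initialise fully
        ready.set_value(std::move(field));  // publish only when complete
    });

    std::vector<real> field = result.get();  // waits for set_value
    producer.join();

    assert(field.size() == n);
    real sum = 0.0;
    for (real v : field) sum += v;
    return sum;
}
```

Without a synchronisation point like this, whether the reader sees valid memory depends on thread timing, which would fit with crashes that show up at a different iteration every run.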

Old   August 13, 2012, 14:42
Default
  #6
New Member
 
John Anastos
Join Date: Aug 2012
Posts: 1
But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We launched the exact same run on both; on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks, and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

Old   August 13, 2012, 15:40
Default
  #7
Senior Member
 
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
Quote:
Originally Posted by Loothin View Post
But if this is a program bug, why would it fail on the one machine and not the other. I am currently experiencing the same issue, and we have two machines that are the exact same. We just launched the same exact run on both and on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the one machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?
Bad memory, perhaps? Try running memtest86+.

Old   August 15, 2012, 08:36
Default
  #8
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Quote:
Originally Posted by Loothin View Post
But if this is a program bug, why would it fail on the one machine and not the other. I am currently experiencing the same issue, and we have two machines that are the exact same. We just launched the same exact run on both and on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the one machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

It should be compiler dependent, not machine dependent.

Old   August 15, 2012, 10:31
Default
  #9
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Greetings to all!

We've had these kinds of problems in our office machines and have realized that memtest86+ isn't enough to catch most of these memory errors on the latest hardware (since 2008-2009).
We've only been able to detect these issues when using stressapptest: http://code.google.com/p/stressapptest/ - it's already available in most of the latest Linux distros.

Example of commands for properly testing RAM:
Code:
stressapptest -W --cc_test
stressapptest -W --cc_test -M 5000
The second one uses only 5GB of RAM.

Best regards,
Bruno

Old   August 17, 2012, 05:07
Default
  #10
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
Hi Bruno,

Thanks very much for the hint. I had very strange prostar behaviour after upgrading from 16GB to 32GB of memory. Prostar crashed without any error message while running a long postprocessing script. The crashes occurred randomly, and only on one machine. stressapptest reported some problems.

Did you see differences between normal RAM and ECC RAM?

Old   August 17, 2012, 08:31
Default
  #11
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Can you guys provide a test case that I could run to reproduce the issue?

Old   August 17, 2012, 10:27
Default
  #12
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
What testcase do you want? The output of a failed stressapptest run?
Code:
> stressapptest -W --cc_test -M 20000
...
Report Error: miscompare : DIMM Unknown : 1 : 6s
Hardware Error: miscompare on CPU 2(0x2) at 0x7f6cae2b6798(0x0:DIMM Unknown): read:0xe9e9e9e8e9e9e9e9, reread:0xe9e9e9e8e9e9e9e9 expected:0xe9e9e9e9e9e9e9e9
Log: Seconds remaining: 10
Stats: CC Thread(0): Time=20033474 us, Increments=1396261000, Increments/sec = 69696399.136765
Stats: CC Thread(1): Time=20034022 us, Increments=1167366000, Increments/sec = 58269178.300793
Stats: CC Thread(2): Time=20033434 us, Increments=935507000, Increments/sec = 46697286.146748
Stats: CC Thread(3): Time=19973276 us, Increments=1269766000, Increments/sec = 63573246.572070
Log: Thread 3 found 715 hardware incidents
Stats: Found 715 hardware incidents
Stats: Completed: 11814.00M in 21.00s 562.53MB/s, with 715 hardware incidents, 0 errors
Stats: Memory Copy: 11814.00M at 590.10MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s

Status: FAIL - test discovered HW problems
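
For what it's worth, the miscompare line above can be decoded by hand: XOR the value read back with the expected value, and every set bit in the result is a flipped bit. A minimal C++ sketch of that arithmetic follows; the function names are made up for illustration and are not part of stressapptest.

```cpp
#include <cassert>
#include <cstdint>

// XOR of the value read back and the value expected;
// each set bit in the result marks one flipped bit.
std::uint64_t flipped_bits(std::uint64_t read, std::uint64_t expected) {
    return read ^ expected;
}

// Counts the set bits in v (portable, no compiler builtins).
int count_bits(std::uint64_t v) {
    int n = 0;
    while (v != 0) {
        v &= v - 1;  // clears the lowest set bit
        ++n;
    }
    return n;
}

// Index (0 = least significant) of the lowest set bit.
// Precondition: v must be nonzero.
int lowest_set_bit(std::uint64_t v) {
    assert(v != 0);
    int i = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        ++i;
    }
    return i;
}
```

For the values in the log (read 0xe9e9e9e8e9e9e9e9, expected 0xe9e9e9e9e9e9e9e9) the XOR is 0x0000000100000000, i.e. exactly one bit (bit 32) differs between what was written and what was read back.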

Old   August 17, 2012, 10:35
Default
  #13
Senior Member
 
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 1,286
Quote:
Originally Posted by JBeilke View Post
What testcase do you want? The output of a failed stressapptest run?

Any sim file that I could run on my machine and that shows the problem. I can run STAR-CCM+, but so far I have not encountered this problem, and if the problem does not show up it is really hard to fix.

Note: I now work with CD-adapco, so it kinda interests me more now.

Old   August 17, 2012, 10:41
Default
  #14
Senior Member
 
Ryne Whitehill
Join Date: Aug 2009
Posts: 312
Quote:
Originally Posted by arjun View Post
any sim file that i could run on my machine and shows the problem. I can run starccm+ but so far have not encountered this problem so if problem does not show up it is really hard to fix.

Note: I now work with cd adapco, so it kinda interests me more now.

From his post above, it looks to be a hardware issue, not a simulation issue. That's why it was only occurring on one machine.

Old   August 17, 2012, 17:55
Default
  #15
Senior Member
 
Joern Beilke
Join Date: Mar 2009
Location: Dresden
Posts: 533
It can also be a software problem if it only occurs on one machine (unless you have an identical configuration).

Thanks to Bruno's hint about stressapptest, I was able to find out that it is a hardware problem.

In the meantime, other people have also come across the same prostar crashes. It will be interesting to see what results we get from the test on these machines.

Old   August 18, 2012, 05:29
Default
  #16
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,982
Blog Entries: 45
Quote:
Originally Posted by JBeilke View Post
Did you see differences between normal RAM and ECC RAM?
So far the experience I've had is that:
  • Normal RAM is very susceptible to electrical quality. Overclocking and weak/cheap power supply can lead to the occasional damaged module
  • Even when electrical quality isn't the issue, normal RAM has seemed so far to be more prone to the occasional hiccup, i.e., has to go back to the store.
  • ECC is considered better because it can do hardware based self-correction, while normal RAM sometimes uses software based correction. For more on ECC: http://en.wikipedia.org/wiki/ECC_memory
    For more on software based correction... sorry, don't have a reference on this one; it's sort-of a gut feeling from past experiences, but I don't have technical evidence.
  • Either way, it's good to buy RAM in a single combo package. For example, for normal RAM, it's best to fill all slots with modules from a single package purchase, because those modules have been tested to perform well as a team.
  • When it comes to multi-socket motherboards (more than one CPU), it seems that you can progressively fill memory slots, but the combo package criterion still stands. You can buy 1 package of combos per CPU socket and install one combo (2, 3, 4 or 6 modules for each combo) at a time. Asymmetry here is always very bad.
  • After all of this, always check that the voltage, CL and other specs of the RAM modules (such as rank) are all the same, otherwise there might be incompatibilities between them that the motherboard cannot resolve automagically.

Old   February 6, 2015, 09:11
Default
  #17
New Member
 
Paul D
Join Date: Nov 2011
Posts: 4
Quote:
Originally Posted by rwryne View Post
This error is probably the most frustrating one to me, because it doesnt provide any hints as to what is going wrong....and to top it off there are lots of things that can cause this.

Examples:

One of my simulations was having this only yesterday, I spent a whole week trying to figure out what was wrong. In the end, the culprit was...a mass flow report, which was set to "inital surface" rather than "volume mesh".

Another one was having a similar situation to yours: if I submitted it across say X cores, it would not run. But with X/2 cores, it would. I had no explanation for this.

Yet another one would not run at all on my cluster. I opened it locally, and did a single iteration locally then submitted it to cluster. Ran fine after that.
Hi rwryne,

I am getting exactly the same memory access error as you did, so I am wondering whether you managed to solve the problem.
The funny thing is that I only get the error occasionally, and only when trying to load a simulation on multiple cores, not on a single core.
I don't think this is a hardware matter, as my computer has 2 processors (24 cores, 2.3GHz) and 64GB RAM.

Many thanks,
Pavlos