Segmentation fault in interFoam run through openMPI |
September 22, 2011, 04:28 |
|
#1 |
Member
Luca Giannelli
Join Date: Jun 2010
Location: Kobe, Japan
Posts: 58
Rep Power: 16 |
Hello everybody...
here I go with a hard question. I hope somebody out there is willing to help me figure out a solution. Long (loooong) story short: I have 2 boxes with OpenFOAM 1.6 installed and I want them to run in parallel:
kumori ---> i386
PS3 ---> PPC64
It took me a long time to compile OpenFOAM on the PS3, and now it is working like a charm. I'm still trying to run the damBreak tutorial; however, when I launch the mpirun command it spits out an error like this: Code:
mpirun -np 1 -host kumori interFoam -parallel : -np 1 -host ps3 interFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.6-53b7f692aa41
Exec : interFoam -parallel
Date : Sep 22 2011
Time : 16:13:07
Host : kumori
PID : 32267
[PS3:29504] *** Process received signal ***
[PS3:29504] Signal: Segmentation fault (11)
[PS3:29504] Signal code: Address not mapped (1)
[PS3:29504] Failing at address: 0xa28c2c7a
[PS3:29504] [ 0] [0xfff82960418]
[PS3:29504] [ 1] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam8IPstream4readERNS_5tokenE-0x3a21ec) [0xfff80e947d4]
[PS3:29504] [ 2] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam5tokenC1ERNS_7IstreamE-0x3b7e4c) [0xfff80e7d2e4]
[PS3:29504] [ 3] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4FoamrsERNS_7IstreamERNS_6stringE-0x3d24d0) [0xfff80e613d0]
[PS3:29504] [ 4] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4FoamrsINS_6stringEEERNS_7IstreamES3_RNS_4ListIT_EE-0x3e1908) [0xfff80e50ae0]
[PS3:29504] [ 5] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbb-0x3eb320) [0xfff80e471b8]
[PS3:29504] [ 6] interFoam() [0x1001f7fc]
[PS3:29504] [ 7] /lib64/libc.so.6(+0x4f5e8) [0xfff808875e8]
[PS3:29504] [ 8] /lib64/libc.so.6(__libc_start_main-0x1534f8) [0xfff80887800]
[PS3:29504] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 29504 on node ps3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Final HINTS... maybe the problem is here:
1) Both machines' openmpi installations had to be recompiled with the option "--enable-heterogeneous" because of the difference between the architectures. Maybe the library linked into libOpenFOAM.so during the previous compile is outdated and it gets a SIGSEGV?
2) If I remove the "-parallel" switch from the above command, the program runs flawlessly, but only on the remote PC.
Please help me get out of this trap.
Thank you! Luca
Last edited by voingiappone; September 22, 2011 at 04:37. Reason: typo in the title |
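Regarding hint 1 in the post above: a quick way to double-check whether an Open MPI build really carries heterogeneous support is to look at what `ompi_info` reports on each machine. The snippet below is only a minimal sketch. It assumes that `ompi_info` prints a line of the form "Heterogeneous support: yes" (worth verifying on the installed version), and the function name `heterogeneous_support` is made up for illustration.

```shell
#!/bin/sh
# Reads `ompi_info` output on stdin and prints the value found on the
# "Heterogeneous support" line, or "unknown" if no such line exists.
# ASSUMPTION: the line looks like "  Heterogeneous support: yes".
heterogeneous_support() {
    awk -F': *' '
        /Heterogeneous support/ { print $2; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Intended use on each node (not executed here):
#   ompi_info | heterogeneous_support
```

Running it on both kumori and ps3 and comparing the answers would at least rule out a build that silently lacks the option.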
|
September 24, 2011, 08:32 |
|
#2 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,981
Blog Entries: 45
Rep Power: 128 |
Greetings Luca,
Nice! PS3 for running OpenFOAM... too bad it doesn't have much memory by default... Sadly I don't have experience (yet) with running OpenFOAM in parallel on a hybrid platform, but here are a few links that may help you:
If there is something you don't understand about the content of some of the links that I've provided here, feel free to ask! Best regards, Bruno
|
|
September 26, 2011, 00:34 |
|
#3 |
Member
Luca Giannelli
Howdy Bruno
and thanks for your message. I read the posts you mentioned and followed all the suggestions there. What I did:
- compiled the parallelTest app on both PCs (kumori and ps3)
- ran it separately on each ---> it works flawlessly, with the output that you pointed out in those threads.
- ran it through foamJob, including in the folder the original "machines" file I was passing before, and the output is *EXACTLY* the same error as above.
To be more exact, thanks to this test I found a problem that is probably linked to the different arch: it was looking for the "orted" file in the wrong directory, so I made a symlink to the right one and that part went fine. Even though it is using the right file now, it still spits out the very same error.
I just came to the lab and I'm planning my "next move".... I suppose (based entirely on my vastly unpredictable imagination) that the problem comes from the lack of the proper libraries during compilation. Why do I think so?
1) Years of Linux compilation against wrong libs (my fault of course)
2) The i386 version is the precompiled binary distribution
3) The ps3 version is compiled using the bundled openMPI ----> To include the heterogeneous arch support I had to recompile openMPI, so the libraries may have changed (maybe rendering the previously built binaries useless for parallel runs).
I cannot, however, understand why it works on the standalone machines.... maybe because there are no messages to exchange between different archs? Who knows....
I don't really want to recompile everything from scratch but I think I need to.... What a pain! I'll report back if I get it working this way, but meanwhile, if you have any idea (or you believe that what I said is stupid), please drop a line.
Thanks
Luca
P.S. Yes... the PS3 has low memory and is quite slow. It is a real pity that we cannot access the CELL... |
|
September 26, 2011, 02:45 |
|
#4 |
Member
Luca Giannelli
I ran the compilation again.... but without results. The error is still the same. BTW, the libs were up to date, so there's no need to recompile when changing heterogeneous support in openMPI.
I'm stuck. |
|
September 26, 2011, 03:50 |
|
#5 |
Member
Luca Giannelli
Wow... 3rd post in a row! I should think better before writing.
BTW I actually noticed something *REALLY* weird.... I'll describe everything I did to make it easy to understand:
1) I ran the interFoam app on both machines separately, with the time command to check the result.
2) I ran the mpirun command but forgot to write -parallel on both nodes and....
== MAGIC == __IT IS RUNNING!__ ==END OF MAGIC==
The awful thing is that it is actually launching the interFoam command on both machines without executing the calculations for the decomposed case, but for the complete one! Even more interesting.... without any further request by the user, it is automatically selecting all the cores from all the CPUs to do the calculations (I watched them with my xfce4 taskmanager). So, I practically ran a case without breaking up the mesh on a cluster made of 3 cores on 2 PCs.... without knowing either why or how. Quite exciting/sad. Btw, it still saves 30% of the time (roughly) over the standalone execution, so I can get some advantage from the parallelization; however, I would like to know why I see this behavior. |
|
September 26, 2011, 04:42 |
|
#6 |
Retired Super Moderator
Bruno Santos
Hi Luca,
It should only run in parallel if you use the "-parallel" argument, along with mpirun. If you use: Code:
foamJob -s -p interFoam
Both OpenFOAM installations should be recompiled with the dedicated options. Otherwise, it's unlikely it'll work.
Post #21 on the thread http://www.cfd-online.com/Forums/ope...tml#post292700 shows how to avoid using certain network connections. It might be useful for your case.
Another thing that's worrying me: the fact that the PC is 32bit and the PS3/PPC is 64bit; that's just adding to the confusion of architectures. If the PC was 64bit...
Additionally, how is the case folder being shared between the machines? Is it placed in the same exact path?
Best regards and good luck! Bruno
Last edited by wyldckat; September 26, 2011 at 04:42. Reason: typo |
|
September 26, 2011, 23:37 |
|
#7 |
Member
Luca Giannelli
Hello Bruno,
thanks for your precious help... I made a "small" step forward thanks to the info in the thread that you posted. I removed the wireless interface and the segfault error disappeared (exactly as the original poster suggested). The command was: Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 "executable file" -parallel
Code:
piota@ps3's password:
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build : 1.6-53b7f692aa41
Exec : parallelTest -parallel
Date : Sep 27 2011
Time : 10:49:38
Host : kumori
PID : 2366
Another thing I was thinking about.... kumori is single CPU, single core, but the PS3 is dual core.... the case is decomposed in two parts and I am asking MPI to run it split across the two PCs.... maybe the fact that only one core is working on the PS3 is an issue? I saw a lot of threads with multicore CPUs giving problems when executed without using all the cores.... I will try to decompose it in 3 parts with scotch.... I hope it will succeed! Luca |
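The interface exclusion used in the post above can be wrapped in a small helper so that the same options are reused for every test. This is only a sketch under a few assumptions: the helper name `mpirun_exclude_cmd` is hypothetical, it prints the command instead of running it, and it adds the loopback interface "lo" to the exclusion list because, as far as the Open MPI FAQ describes it, setting `btl_tcp_if_exclude` replaces the default list that normally already contains the loopback.

```shell
#!/bin/sh
# Prints (does not run) an mpirun command line that keeps Open MPI's TCP
# transport off the listed interfaces.
# $1: comma-separated interfaces to exclude, $2: hostfile, $3: process count,
# remaining arguments: the application and its options.
mpirun_exclude_cmd() {
    exclude=$1
    hostfile=$2
    nprocs=$3
    shift 3
    # ASSUMPTION: "lo" is prepended because overriding btl_tcp_if_exclude
    # drops the default value, which normally excludes the loopback.
    printf 'mpirun --mca btl_tcp_if_exclude lo,%s -hostfile %s -np %s %s\n' \
        "$exclude" "$hostfile" "$nprocs" "$*"
}

# Example with the names used in this thread:
mpirun_exclude_cmd eth1 machines 2 parallelTest -parallel
```

Printing the command first makes it easy to compare what is launched from kumori with what is launched from ps3 before actually starting a run.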
|
October 1, 2011, 12:23 |
|
#8 |
Retired Super Moderator
Bruno Santos
Hi Luca,
Sorry for the late reply, but here goes, in a somewhat random order of replies:
1. I've never tested the "--enable-heterogeneous" option, so I don't know what limitations there are for it. This is why I think that both machines should be 64bit, but hopefully that won't be necessary.
2. Quote:
The other detail was about environment variables, which I'll write about later on in this post.
3. When parallelTest and other OpenFOAM applications lock up during parallel execution, it's likely related to a problem with the backwards lookup for the master machine. In other words: kumori has an IP, ps3 has another IP; but while kumori might know the IP associated with ps3, ps3 might not know the IP assigned to kumori! Basically, check the file "/etc/hosts" on each machine, where both machines should have something like this: Code:
10.11.12.1 kumori
10.11.12.2 ps3
4. Since you have a multi-architecture setup, a certain detail is very important: the environment shell variables must be properly set for each remote process. A way to check this is by running something like this: Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; export > $HOME/log.\$HOSTNAME"
Another test, a bit more simple and exact, is this: Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; which icoFoam"
Yet another test, for checking if all machines have the necessary case files, run this from within the case folder: Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; ls -l \$PWD"
Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; ls -l $PWD"
5. The paths for running in parallel must be the same, but for the simulation case itself. Sorry about not making myself clearer in my previous post. Well, they could be in different places, but that would complicate the tests that are being made right now. OK, what happens with running OpenFOAM applications in parallel is this (by default, but can be changed in "decomposeParDict"):
6. When problems arise, a divide-and-conquer method is usually the best way to go. This is what I've been writing about so far. So, when the details above have been checked and/or solved, I would first test running two parallel processes on each independent machine. This would isolate the problem to being either a problem with the different architectures or a problem with the general setup. To test this, modify the "machines" file you've been using to run mpirun with, for the following scenarios:
_____________________________
OK, I'm not sure, but I think I've answered most of the problems that were described... Now it's up to you to do the tests.
Best regards and good luck! Bruno
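The "/etc/hosts" check from point 3 of the post above is easy to get wrong by eye, so here is a minimal sketch of how it could be automated on each machine. The function name `hosts_file_knows` is hypothetical, the match is deliberately case sensitive, and the sketch only inspects the file; it does not test whether name resolution actually works over the network.

```shell
#!/bin/sh
# Succeeds if every given host name has an entry in the hosts file.
# $1: path to a hosts file, remaining arguments: host names to look for.
hosts_file_knows() {
    file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for the name in any column after the IP.
        awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" || { echo "missing: $name"; return 1; }
    done
    echo "all hosts listed"
}

# Intended use on each machine (not executed here):
#   hosts_file_knows /etc/hosts kumori ps3
```

Running the same call on kumori and on ps3 shows whether each side knows the other by exactly the same spelling.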
Last edited by wyldckat; October 1, 2011 at 12:26. Reason: typos... |
|
October 2, 2011, 23:29 |
|
#9 |
Member
Luca Giannelli
Hello Bruno,
thank you very much for the complete and detailed explanation. That's a lot of testing that I have to do, so I will try and then report back.... I hope that one of these suggestions will do the trick! |
|
October 9, 2011, 08:33 |
|
#10 |
Retired Super Moderator
Bruno Santos
Hi Luca,
Apparently there was a minor detail I forgot/didn't know before: one must be very careful about where mpirun is located! For example, when you're running on a single machine, the path is the same for all processes, so the following way would be a quick fix: Code:
`which mpirun` -np 2 `which foamExec` icoFoam -parallel
But since you are using machines with different builds, then a few possible rules apply:
Best regards, Bruno
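To see whether mpirun really lives in the same place on every node, the paths can be collected and compared in one go. The following is a hedged sketch: the function name `same_path_everywhere` is made up, and the commented usage assumes password-less ssh and that a non-interactive ssh session loads the OpenFOAM environment, which is not guaranteed.

```shell
#!/bin/sh
# Reads "host path" pairs on stdin, one per line, echoes them back and
# reports whether every host resolved the command to the same path.
same_path_everywhere() {
    awk '
        NR == 1 { first = $2 }
        $2 != first { differs = 1 }
        { print $1 ": " $2 }
        END {
            print (differs ? "paths differ" : "paths match")
            exit (differs ? 1 : 0)
        }
    '
}

# Intended use from one of the nodes (not executed here):
#   for h in kumori ps3; do
#       echo "$h $(ssh "$h" 'which mpirun')"
#   done | same_path_everywhere
```

If the last line says "paths differ", the quick fix with `which mpirun` shown above cannot be expected to work across both machines.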
|
|
October 11, 2011, 00:30 |
|
#11 |
Member
Luca Giannelli
Hello Bruno,
first of all let me thank you for the effort you are making in helping me out with this issue. I was a bit late in answering as I was trying to tame some Python scripts for Paraview that didn't want to do what I wanted..... I won. I have executed all the tests that you suggested and the results are:
3) Hosts file
I have modified it in all the ways I could... I found that on one machine (kumori) the name of the other host was lowercase... I changed this to UPPERCASE on both machines; you never know... It didn't help though.
4) Environment variables
I ran the command you specified and it did generate the log files where the variables are listed. I don't see anything wrong there, but something may be hiding somewhere (or even missing), which makes the problem quite tough to identify. Of course all the needed files are there and they show up with the command that you pointed out.
Cross running on the hosts
If I run the parallel process (I am using the damBreak case) from one machine on the other, it completes without problems:
kumori ----> ps3 (executes all the threads on the remote PS3 and succeeds)
PS3 ----> kumori (executes all the threads on the remote kumori and succeeds)
Of course "on machine" parallelization works flawlessly... Still, multi-arch runs are not working. I suppose that the problem is related to the OpenFOAM executable (interFoam in this case). I say this as I see that the program hangs after being launched: Code:
piota@kumori:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ `which mpirun` --mca btl_tcp_if_exclude eth1 -hostfile machines -np 3 `which foamExec` interFoam -parallel
piota@ps3's password:
/*------------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*------------------------------------------------------------------------------*/
Build : 1.6-53b7f692aa41
Exec : interFoam -parallel
Date : Oct 11 2011
Time : 12:19:29
Host : kumori
PID : 26361
Probably the problem comes from the mixed arch, which *maybe* is not supported by OpenFOAM |
|
October 11, 2011, 17:34 |
|
#12 |
Retired Super Moderator
Bruno Santos
Hi Luca,
Another thing, what about parallelTest? What happens when you use parallelTest instead of interFoam? Another test that should be made would be a small and simple program built to work with Open-MPI only, i.e. without any links to OpenFOAM. The problem is that I still don't know of any good test app for Open-MPI. I know they have some examples in Open-MPI's source code, but I haven't tested any of them. Best regards, Bruno
|
|
October 31, 2011, 04:17 |
|
#13 |
Member
Luca Giannelli
OMG.... I forgot to reply after executing the tests!
The result is exactly the same as above, with the program stuck where it prints the PID. I can, however, tell you that you cannot add the "prefix" in the hosts file, as it complains about an unknown option; you have to specify it manually on the command line.
Getting back to the PS3, I decided to launch "which mpirun" directly on the ps3 itself (not remotely) and, obviously, I found out that it is in a different location than on the "kumori" node. I was naive, as I thought that simply using a link to a dir with the same name on both machines could solve the problem, which probably is indeed this one. However, I understand that the PATH to Open MPI is set automatically by the bashrc script from OF and I don't know how to change it. That can be the last try before giving up on the whole thing.... Any suggestions? |
|
October 31, 2011, 08:11 |
|
#14 |
Retired Super Moderator
Bruno Santos
Hi Luca,
If you examine the file "OpenFOAM-1.6/etc/settings.sh" and search for "OPENMPI", you'll find out how the path for it is set. My advice is to copy the build you have on each node to the same system-wide folder on that node, such as "/usr/local/OpenMPI". In other words, on each node, move/copy the build of OpenMPI you've gotten somewhere in "ThirdParty-1.6/platforms" on that node onto "/usr/local/OpenMPI". Let me know if you can figure out how to follow these instructions.
I'm also getting very curious about how to make this work in general... I'm going to have to set up two virtual machines - one with 32bit and the other with 64bit - build the hybrid OpenMPI on each and see for myself how to make things happen. I'll write about it after I make the tests...
Best regards, Bruno
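The search in "settings.sh" mentioned above can be done with a one-line helper that also shows line numbers, which makes it easier to find the block to edit. This is just a sketch; the function name `show_openmpi_settings` is hypothetical and the path in the usage comment is the one quoted in the post, relative to wherever OpenFOAM is installed.

```shell
#!/bin/sh
# Prints the numbered lines of a settings file that mention OPENMPI.
# $1: path to the file to search.
show_openmpi_settings() {
    grep -n 'OPENMPI' "$1"
}

# Intended use (not executed here), from the folder that holds OpenFOAM-1.6:
#   show_openmpi_settings OpenFOAM-1.6/etc/settings.sh
```

The lines around each hit are where the Open MPI location is derived, so that is the place to look before pointing it at a folder such as "/usr/local/OpenMPI".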
|
|
November 1, 2011, 09:16 |
|
#15 |
Retired Super Moderator
Bruno Santos
Hi Luca,
Well... I've done some tests with the latest OpenFOAM 2.0.x and reached the conclusion that, although I have two builds in Double Precision, the simple fact remains that having one in 32bit and another in 64bit will lead to a messy issue. I haven't looked deep enough to figure out the problem, but my guess is that the integers are to blame here. They don't have the same number of bytes and therefore won't jump well from one side to the other. Anyway, since you are using two 64bit builds, it might work as intended (edit: I didn't remember when I wrote this part).
I've unpacked the old 1.6 packages to check the details of how folders were organized back then and the following trick should work to allow you to have different Open-MPI builds look like they are on the same path. Go to the "ThirdParty-1.6" folder on each machine and run something like this: Code:
for a in */platforms; do ln -s "$WM_OPTIONS" "$a/linuxOtherDPOpt"; done
for a in */platforms; do ls -l "$a"; done
The second command will list the folders where the links were created, so you can confirm if it's all OK. So, just in case my explanation wasn't very clear, here's what I think you should run:
Any chance you can build a "PPC32" version to run on the PS3? Or get a 64bit PC with a 64bit Linux?
Best regards, Bruno
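Since the guess in the post above is that integer sizes differ between the 32bit and 64bit builds, a first sanity check is to compare the word size of the userland on each machine. The sketch below assumes `getconf LONG_BIT` is available (it is on typical glibc-based Linux systems); the function names `local_bits` and `compare_bits` are hypothetical. Note that this reports the userland word size, not how OpenFOAM itself was compiled.

```shell
#!/bin/sh
# Prints the word size (32 or 64) of the userland on the current machine.
# ASSUMPTION: getconf supports the LONG_BIT variable on this system.
local_bits() {
    getconf LONG_BIT
}

# Compares two word sizes and says whether a mixed run would cross 32/64 bit.
compare_bits() {
    if [ "$1" = "$2" ]; then
        echo "same word size ($1 bit)"
    else
        echo "mixed word sizes ($1 bit vs $2 bit)"
    fi
}

# Intended use from kumori (not executed here):
#   compare_bits "$(local_bits)" "$(ssh ps3 getconf LONG_BIT)"
```

For the setup described in this thread, one would expect the "mixed word sizes" answer, which fits the suspicion about integers not matching between the two sides.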
|
|
November 2, 2011, 03:17 |
|
#16 |
Member
Luca Giannelli
Thank you Bruno,
The linking part is something I had already done when orted was complaining about that, and it did not work out as it should. BTW, I have actually decided to do as you suggest, moving from the 32 bit PC to a 64 bit linux PC.... It makes damn good sense that the different length of the numbers can screw everything up even with the multiarch feature enabled. I just wanted to make a test with a simple case (the damBreak) and opted for the most inconvenient set-up ever seen.... Shame on me....
Now I am trying to compile OF 2.0.x on the new 64bit PC and it is refusing to compile due to an unknown error.... worse: it compiles 90% of the bins but not the ones I want... but this is another story.
So, what should we do now? Close the thread stating that it is not possible to mix 32/64 bit and intel/ppc architectures? |
|
November 2, 2011, 07:49 |
|
#17 |
Retired Super Moderator
Bruno Santos
Hi Luca,
I think that since you're probably still going to try and work with the PS3, we can keep using this thread. As for the building problems with 2.0.x, tell me a few things:
Bruno
|
|
Tags |
openfoam, openmpi, parallel, segmentation fault |