parallel fluent runs being killed at partitioning
September 23, 2005, 15:51 |
parallel fluent runs being killed at partitioning
|
#1 |
Guest
Posts: n/a
|
We have suddenly started seeing parallel Fluent runs on our cluster die very early, generally during or right after partitioning.
We are running Red Hat EL3 on a 64-bit Opteron cluster. We use a beefy head node to host our runs and farm the gmpi processes out to compute nodes, with PBSPro as our scheduler. I have a ticket open with Fluent; they suspect the OS is causing this issue but haven't been too specific as to why. PBS's vendor thinks the kernel on the head node may be running out of memory and killing these jobs to preserve itself. We've been running Fluent jobs this way for several months with no problems. The issue cropped up Tuesday and intermittently kills jobs; there seems to be no rhyme or reason to what can run and what can't. Once a job starts iterating it seems to be OK (unless it starts to partition again). Below is the output we get on stdout when these processes are killed. It looks pretty much identical to what happens when someone kills one or more of the MPI processes from the command line while a job is running. Has anyone here run into this issue, or does anybody have any possible culprits?
Thanks, r/ben
--------------------------------------------------
Parallel variables...
Building...
     grid,
        auto partitioning mesh by Principal Axes,
        distributing mesh
           parts..,
           faces..,
           nodes..,
           cells..,
     materials,
     interface,
     domains,
        mixture
        liquid-phase
        vapor-phase
        interaction
     zones,
        fluid (liquid-phase)
        outlet (liquid-phase)
        inlet (liquid-phase)
        internal.5 (liquid-phase)
        symm2 (liquid-phase)
        symm1 (liquid-phase)
        wall (liquid-phase)
        default-interior (liquid-phase)
        fluid (vapor-phase)
        outlet (vapor-phase)
        inlet (vapor-phase)
        internal.5 (vapor-phase)
        symm2 (vapor-phase)
        symm1 (vapor-phase)
        wall (vapor-phase)
        default-interior (vapor-phase)
        default-interior
        wall
        symm1
        symm2
        internal.5
        inlet
        outlet
        fluid
     parallel,
     shell conduction zones,
Done.
>
  iter  continuity  x-velocity  y-velocity  z-velocity  k  epsilon  vf-vapor-p  time/iter
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
...
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
999999 (mpsystem.c@1228): mpt_read: failed: errno = 11
999999: mpt_read: error: read failed trying to read 8 bytes: Resource temporarily unavailable
/apps/Fluent/Fluent.Inc/bin/fluent: line 3875: 6678 Killed $NO_RUN $EXE_CMD $MPI_ENABLED_OPTIONS
[bt] Execution path:
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Process_Stackframe+0x17) [0x9f6e97]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_error+0x109) [0x9e50e9]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_read+0xc6) [0x9e88b6]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_raw+0x28) [0x9ea408]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_all+0x28) [0x9ec948]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(MPT_crecv_double+0x112) [0x9d6ee2]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x5e81d8]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Models_Send_update_solve+0xbe) [0x56769e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Flow_Iterate+0x19e) [0x4e143e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x546788]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x773) [0xa27403]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x860) [0xa274f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x460) [0xa270f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x49a) [0xa2712a]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0xa2873c]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval_errprotect+0x32) [0xa280d2]
The fluent process could not be started.
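Not part of the original post, but for anyone checking the PBS vendor's theory: if the kernel's OOM killer terminated the host process, it usually leaves a trace in the kernel log (on RHEL-era systems typically /var/log/messages; dmesg only shows the ring buffer since boot). A minimal sketch, assuming that log location; the sample log line is fabricated for the demo:

```shell
# Sketch: look for OOM-killer activity in a kernel log file.
check_oom() {
    # $1: path to a kernel log file, e.g. /var/log/messages
    if grep -i -E 'out of memory|oom-killer|killed process' "$1" 2>/dev/null; then
        echo "possible OOM kills found in $1"
    else
        echo "no OOM-killer messages in $1"
    fi
}

# Demo against a synthetic log line; in real use: check_oom /var/log/messages
printf 'Sep 23 15:51:02 head kernel: Out of Memory: Killed process 6678 (fluent)\n' > sample.log
check_oom sample.log
```

If such lines show up around the time a job died, the "head node ran out of memory" explanation becomes much more likely than a Fluent or network bug.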
|
September 23, 2005, 18:11 |
Re: parallel fluent runs being killed at partitioning
|
#2 |
Guest
Posts: n/a
|
Hi
I am getting the same kind of error; it writes mpt_connect_error. It is related to hardware, that is for sure. I use METIS and a socket connection on P-IV nodes. Check your hardware connections; it may start working again once you make sure all the connections are physically sound. Vinod Dhiman
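To rule out the physical-connection problem Vinod describes, a quick reachability sweep of the compute nodes from the head node is a cheap first test. A minimal sketch; the node names below are hypothetical (inside a PBS job you could read the real ones from $PBS_NODEFILE):

```shell
# Sketch: check that each compute node answers on the network at all.
check_node() {
    # $1: hostname of a compute node
    if ping -c 1 -w 2 "$1" >/dev/null 2>&1; then
        echo "$1: reachable"
    else
        echo "$1: UNREACHABLE (check cable, switch port, /etc/hosts entry)"
    fi
}

# Hypothetical node names; substitute your own.
for h in node01 node02 node03; do
    check_node "$h"
done
```

A node that drops in and out of reachability here would fit the intermittent mpt_read/mpt_connect failures seen in the thread.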
|
September 23, 2005, 21:37 |
Re: parallel fluent runs being killed at partitioning
|
#3 |
Guest
Posts: n/a
|
In my admittedly limited experience running parallel jobs on Opterons, I have come across such an error, but only when re-reading several cases one after another without restarting the parallel session, or when re-partitioning the same case several times.
Anyway, my advice is this: since you have Opterons and a 64-bit OS, surely you have at least one workstation capable of reading your cases by itself in serial mode (4 GB of memory for one process should be enough to at least read the mesh). So read the mesh in the serial solver, partition it using the best method you can find (which is NOT always Metis!), and then write a case file. Re-read that case, this time in parallel mode, and make the rest of the settings you need. I have found this to be the best method for running parallel cases, because parallel solvers do not partition the same way serial solvers do! I have observed big differences between results obtained with the two; parallel partitioning with the SAME algorithm usually gives a higher number of interface faces. Best wishes, Razvan
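Razvan's serial-partition-then-write workflow can be scripted as a Fluent journal file so it is repeatable. The sketch below generates such a journal; the TUI command paths are assumptions based on Fluent 6.x menus and may differ in your version (check them interactively with ? at the TUI prompt), and the case names and partition count are placeholders:

```shell
# Sketch: write a Fluent journal for Razvan's serial-partition workflow.
# TUI paths below are assumptions from Fluent 6.x; verify in your version.
cat > partition.jou <<'EOF'
; read the case in the SERIAL solver
/file/read-case mycase.cas
; partition into 8 parts; try methods other than metis as well
/parallel/partition/method metis 8
; write the pre-partitioned case for the parallel session to read
/file/write-case mycase-part8.cas
/exit yes
EOF

# Then run the serial solver in batch mode, e.g.:
#   fluent 3ddp -g -i partition.jou
echo "wrote partition.jou"
```

The parallel session then only has to read the already-partitioned case, which sidesteps the partitioning step where the original poster's jobs were dying.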
|
June 8, 2012, 11:40 |
Suspected solution to this problem
|
#4 |
New Member
TCH
Join Date: Jul 2010
Location: Beijing City
Posts: 15
Rep Power: 16 |
Try increasing your pagefile, as follows:
My Computer > Properties > Advanced > Performance > Configuration > Advanced > Change
Set the maximum pagefile size to 1.0-2.0 times the size of your physical memory.
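That tip is for Windows; on a Linux head node like the original poster's, the analogous check is whether enough swap is configured relative to physical RAM. A rough sketch reading /proc/meminfo; the 1.5x factor mirrors the pagefile advice above and is a rule of thumb, not a hard requirement:

```shell
# Sketch: compare configured swap against ~1.5x physical RAM on Linux.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
want_kb=$((mem_kb * 3 / 2))   # 1.5x RAM, per the rule of thumb above

echo "RAM: ${mem_kb} kB  swap: ${swap_kb} kB  suggested swap: ${want_kb} kB"
if [ "${swap_kb}" -lt "${want_kb}" ]; then
    echo "swap is below the 1.5x-RAM rule of thumb; consider adding swap"
else
    echo "swap meets the 1.5x-RAM rule of thumb"
fi
```

More swap will not fix a genuine memory shortfall during partitioning, but it gives the kernel headroom before the OOM killer starts terminating host processes.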
|
|
|
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Parallel fluent not using all processors specified | Paul | FLUENT | 18 | October 26, 2023 04:54 |
Parallel Error in ANSYS FLUENT 12 | zeusxx | FLUENT | 25 | July 17, 2015 05:40 |
Urgent; parallel processing in fluent 12 | Mansureh | FLUENT | 4 | September 25, 2012 12:12 |
Parallel fluent 4 nodes machine (Quad 6600 SUSE) | Rafa | FLUENT | 4 | June 7, 2011 07:33 |
error parallel fluent session | Diet | FLUENT | 2 | January 27, 2005 13:31 |