parallel fluent runs being killed at partitioning
September 23, 2005, 15:51 |
parallel fluent runs being killed at partitioning
|
#1 |
Guest
Posts: n/a
|
We have suddenly started seeing parallel Fluent runs on our cluster die very early, generally during or right after partitioning.
We are running Red Hat EL3 on a 64-bit Opteron cluster. We use a beefy head node to host our runs and farm the gmpi processes out to compute nodes, with PBSPro as our scheduler. I have a ticket open with Fluent; they suspect the OS is causing this issue but haven't been too specific as to why. PBS's vendor thinks the kernel on the head node may be running out of memory and killing these jobs to preserve itself. We've been running Fluent jobs this way for several months with no problems. The issue cropped up Tuesday and intermittently kills jobs; there seems to be no rhyme or reason to what can run and what can't. Once a job starts iterating it seems to be OK (unless it starts to partition again). Below is the output we get on stdout when these processes are killed. It looks pretty much identical to what happens when someone kills one or more of the MPI processes from the command line while a job is running. Has anyone here run into this issue, or does anybody have any possible culprits?
Thanks, r/ben
--------------------------------------------------
Parallel variables...
Building...
     grid,
        auto partitioning mesh by Principal Axes,
        distributing mesh
           parts..,
           faces..,
           nodes..,
           cells..,
     materials,
     interface,
     domains,
        mixture
        liquid-phase
        vapor-phase
        interaction
     zones,
        fluid (liquid-phase)
        outlet (liquid-phase)
        inlet (liquid-phase)
        internal.5 (liquid-phase)
        symm2 (liquid-phase)
        symm1 (liquid-phase)
        wall (liquid-phase)
        default-interior (liquid-phase)
        fluid (vapor-phase)
        outlet (vapor-phase)
        inlet (vapor-phase)
        internal.5 (vapor-phase)
        symm2 (vapor-phase)
        symm1 (vapor-phase)
        wall (vapor-phase)
        default-interior (vapor-phase)
        default-interior
        wall
        symm1
        symm2
        internal.5
        inlet
        outlet
        fluid
     parallel,
     shell conduction zones,
Done.
>
  iter  continuity  x-velocity  y-velocity  z-velocity  k  epsilon  vf-vapor-p  time/iter
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
...
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
999999 (mpsystem.c@1228): mpt_read: failed: errno = 11
999999: mpt_read: error: read failed trying to read 8 bytes: Resource temporarily unavailable
/apps/Fluent/Fluent.Inc/bin/fluent: line 3875: 6678 Killed $NO_RUN $EXE_CMD $MPI_ENABLED_OPTIONS
[bt] Execution path:
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Process_Stackframe+0x17) [0x9f6e97]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_error+0x109) [0x9e50e9]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_read+0xc6) [0x9e88b6]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_raw+0x28) [0x9ea408]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_all+0x28) [0x9ec948]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(MPT_crecv_double+0x112) [0x9d6ee2]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x5e81d8]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Models_Send_update_solve+0xbe) [0x56769e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Flow_Iterate+0x19e) [0x4e143e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x546788]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x773) [0xa27403]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x860) [0xa274f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x460) [0xa270f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x49a) [0xa2712a]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0xa2873c]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval_errprotect+0x32) [0xa280d2]
The fluent process could not be started.
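Not part of the original post, but for anyone checking the PBS vendor's theory: if the kernel's OOM killer terminated the host process, it usually leaves a trace in the kernel log (on RHEL-era systems typically /var/log/messages; dmesg only shows the ring buffer since boot). A minimal sketch, assuming that log location; the sample log line is fabricated for the demo:

```shell
# Sketch: look for OOM-killer activity in a kernel log file.
check_oom() {
    # $1: path to a kernel log file, e.g. /var/log/messages
    if grep -i -E 'out of memory|oom-killer|killed process' "$1" 2>/dev/null; then
        echo "possible OOM kills found in $1"
    else
        echo "no OOM-killer messages in $1"
    fi
}

# Demo against a synthetic log line; in real use: check_oom /var/log/messages
printf 'Sep 23 15:51:02 head kernel: Out of Memory: Killed process 6678 (fluent)\n' > sample.log
check_oom sample.log
```

If such lines show up around the time a job died, the "head node ran out of memory" explanation becomes much more likely than a Fluent or network bug.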
|
September 23, 2005, 18:11 |
Re: parallel fluent runs being killed at partitioning
|
#2 |
Guest
Posts: n/a
|
Hi
I am getting the same kind of error; it writes mpt_connect_error. It is related to hardware, that is for sure. I use METIS and a socket connection on P-IV nodes. Check your hardware connections; it may start working again once you make sure all the connections are physically sound. Vinod Dhiman
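To rule out the physical-connection problem Vinod describes, a quick reachability sweep of the compute nodes from the head node is a cheap first test. A minimal sketch; the node names below are hypothetical (inside a PBS job you could read the real ones from $PBS_NODEFILE):

```shell
# Sketch: check that each compute node answers on the network at all.
check_node() {
    # $1: hostname of a compute node
    if ping -c 1 -w 2 "$1" >/dev/null 2>&1; then
        echo "$1: reachable"
    else
        echo "$1: UNREACHABLE (check cable, switch port, /etc/hosts entry)"
    fi
}

# Hypothetical node names; substitute your own.
for h in node01 node02 node03; do
    check_node "$h"
done
```

A node that drops in and out of reachability here would fit the intermittent mpt_read/mpt_connect failures seen in the thread.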
|
September 23, 2005, 21:37 |
Re: parallel fluent runs being killed at partitioning
|
#3 |
Guest
Posts: n/a
|
In my admittedly limited experience running parallel jobs on Opterons, I have come across such an error, but only when re-reading several cases one after another without restarting the parallel session, or when re-partitioning the same case several times.
Anyway, my advice is this: since you have Opterons and a 64-bit OS, surely you have at least one workstation capable of reading your cases by itself in serial mode (4 GB of memory for one process should be enough to at least read the mesh). So read the mesh in the serial solver, partition it using the best method you can find (which is NOT always Metis!), and then write a case file. Re-read that case, this time in parallel mode, and make the rest of the settings you need. I have found this to be the best method for running parallel cases, because parallel solvers do not partition the same way serial solvers do! I have observed big differences between results obtained with the two; parallel partitioning with the SAME algorithm usually gives a higher number of interface faces. Best wishes, Razvan
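Razvan's serial-partition-then-write workflow can be scripted as a Fluent journal file so it is repeatable. The sketch below generates such a journal; the TUI command paths are assumptions based on Fluent 6.x menus and may differ in your version (check them interactively with ? at the TUI prompt), and the case names and partition count are placeholders:

```shell
# Sketch: write a Fluent journal for Razvan's serial-partition workflow.
# TUI paths below are assumptions from Fluent 6.x; verify in your version.
cat > partition.jou <<'EOF'
; read the case in the SERIAL solver
/file/read-case mycase.cas
; partition into 8 parts; try methods other than metis as well
/parallel/partition/method metis 8
; write the pre-partitioned case for the parallel session to read
/file/write-case mycase-part8.cas
/exit yes
EOF

# Then run the serial solver in batch mode, e.g.:
#   fluent 3ddp -g -i partition.jou
echo "wrote partition.jou"
```

The parallel session then only has to read the already-partitioned case, which sidesteps the partitioning step where the original poster's jobs were dying.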
|
June 8, 2012, 11:40 |
Suspected solution to this problem
|
#4 |
New Member
TCH
Join Date: Jul 2010
Location: Beijing City
Posts: 15
Rep Power: 16 |
Try increasing your pagefile, as follows:
My Computer > Properties > Advanced > Performance > Configuration > Advanced > Change
Set the maximum pagefile size to 1.0-2.0 times the size of your physical memory.
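That tip is for Windows; on a Linux head node like the original poster's, the analogous check is whether enough swap is configured relative to physical RAM. A rough sketch reading /proc/meminfo; the 1.5x factor mirrors the pagefile advice above and is a rule of thumb, not a hard requirement:

```shell
# Sketch: compare configured swap against ~1.5x physical RAM on Linux.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
want_kb=$((mem_kb * 3 / 2))   # 1.5x RAM, per the rule of thumb above

echo "RAM: ${mem_kb} kB  swap: ${swap_kb} kB  suggested swap: ${want_kb} kB"
if [ "${swap_kb}" -lt "${want_kb}" ]; then
    echo "swap is below the 1.5x-RAM rule of thumb; consider adding swap"
else
    echo "swap meets the 1.5x-RAM rule of thumb"
fi
```

More swap will not fix a genuine memory shortfall during partitioning, but it gives the kernel headroom before the OOM killer starts terminating host processes.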
|
|
|
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Parallel fluent not using all processors specified | Paul | FLUENT | 18 | October 26, 2023 04:54 |
Parallel Error in ANSYS FLUENT 12 | zeusxx | FLUENT | 25 | July 17, 2015 05:40 |
Urgent; parallel processing in fluent 12 | Mansureh | FLUENT | 4 | September 25, 2012 12:12 |
Parallel fluent 4 nodes machine (Quad 6600 SUSE) | Rafa | FLUENT | 4 | June 7, 2011 07:33 |
error parallel fluent session | Diet | FLUENT | 2 | January 27, 2005 13:31 |