massive parallel run messed up lustre file system

May 24, 2011, 12:21  #1
massive parallel run messed up lustre file system
Matthias Walter (matthias) - Member
Join Date: Mar 2009, Location: Rostock, Germany, Posts: 63
Hello folks,

I have performed a parallel run with 2048 cores (in the future more than 2048) on an HPC system with a Lustre file system. The simulation itself ran fine and the results were good, but a problem occurred that I had not expected.
After a few runs of my case, combined with the heavy IO traffic produced by the other users on the cluster, the storage system (Lustre) was messed up. A detailed analysis by SGI revealed that massive simultaneous parallel access to the storage system was responsible for the damage.
For this reason, the cluster admins have introduced some rules for using the HPC system. From now on, no more than ~600-800 processes (or tasks/threads/files) should read or write simultaneously (and in particular, no more than 6000 files should be written simultaneously by all users combined).

The admins asked me whether it would be possible to serialize OpenFOAM's read/write access when using Lustre file systems and more than (let's say) 1500 cores.
They suggested reading/writing a first block of 128 or 256 processes/files/threads, then the next one, and so on until all data is loaded or written within a time step.
Time steps without IO traffic would not be affected by this restriction.
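
To illustrate the kind of blocking the admins have in mind, here is a rough sketch in plain MPI (this is not actual OpenFOAM code; staggeredWrite, writeMyFields and blockSize are just placeholder names, and a real solver would call its own output routine instead):

Code:
#include <mpi.h>

// Placeholder for whatever per-processor output the solver performs
// at a write time step (fields, lagrangian data, ...).
void writeMyFields()
{
    // the solver's own per-processor output would go here
}

// Let at most blockSize ranks touch the file system at the same time:
// ranks are grouped into consecutive blocks, and block b only starts
// writing after block b-1 has finished.
void staggeredWrite(MPI_Comm comm, int blockSize)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const int nBlocks = (size + blockSize - 1) / blockSize;

    for (int b = 0; b < nBlocks; ++b)
    {
        if (rank / blockSize == b)
        {
            writeMyFields();      // only the ranks of block b do IO now
        }
        MPI_Barrier(comm);        // everyone waits until block b is done
    }
}

With blockSize = 256 and 2048 cores this would give 8 IO waves per write time step instead of one burst of 2048 simultaneous writers.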

I would therefore like to forward the question to the experts.

Best regards

Matthias

May 28, 2011, 17:18  #2
massively parallel and lustre
Cliff white (cliffw) - New Member
Join Date: May 2011, Posts: 1
600-800 threads is actually kinda small for Lustre; large sites routinely run >100k threads (see http://www.nccs.gov/jaguar/ for an example).

If your backend storage cannot keep up with the volume of Lustre IO requests, there are various ways to tune the Lustre clients to reduce the IO load.
You can reduce the number of RPCs in flight, reduce the amount of dirty memory cached per client, etc. (see the example below). Client tuning is quite easy - certainly simpler than forcing serialized IO.
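
For example, two of the client-side knobs I mean look roughly like this (the values here are just placeholders; check the manual for what suits your hardware):

Code:
# limit the number of concurrent RPCs each client keeps in flight per OST
lctl set_param osc.*.max_rpcs_in_flight=4

# limit the amount of dirty data each client may cache per OST (in MB)
lctl set_param osc.*.max_dirty_mb=16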

See lustre.org or whamcloud.com for the Lustre manual, which has the tuning information. Also see the lustre-discuss mailing list.

(Note: I work for Whamcloud)

