Would Apache Hadoop be useful for CFD?

peter.pan · March 14, 2014, 06:11

Hi,

I stumbled across some open source software called Apache Hadoop. Wanted to know if any member here has experience with that thing. Apparently it is a 'software for reliable, scalable, distributed computing'.

Is it worth giving it a try?

Thanks,

wyldckat · March 16, 2014, 10:04

Greetings Peetak,

AFAIK, Hadoop was developed for a very different kind of distributed computing than CFD. It was developed for maintaining web-based platforms, such as social websites and financial stock dealing platforms, and for handling other highly complex, inter-related metadata.

Using it for CFD would likely only be useful as a highly distributed job scheduling system (http://en.wikipedia.org/wiki/Job_scheduler), interconnecting millions of clusters around the world to solve independent problems on each cluster, cataloguing each simulation performed and then gathering the inputs, outputs and post-processing into a gigantic library of results, including the relationships between those simulations. Sort of like a very big fingerprint database.

If you only have access to one or two clusters, using Hadoop is really massive overkill, unless you want to build a platform for a university or some other teaching facility, where the platform can point out to students that a particular simulation will never work, because students in previous years had already attempted it and failed.
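To make the "fingerprint database" idea a bit more concrete, here is a minimal sketch in Python. It is not Hadoop code and not how any existing platform does it: the `fingerprint` function, the `Catalogue` class and the setup keys (`solver`, `mesh_cells`, `courant`) are all hypothetical names made up for illustration, and the catalogue is just an in-memory dictionary standing in for whatever distributed storage would really be used.

```python
import hashlib
import json


def fingerprint(setup: dict) -> str:
    """Hash a simulation setup so that identical setups share one key."""
    # sort_keys makes the JSON text independent of dict insertion order
    canonical = json.dumps(setup, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Catalogue:
    """In-memory stand-in for the shared library of past simulations."""

    def __init__(self):
        self._runs = {}  # fingerprint -> list of recorded outcomes

    def record(self, setup: dict, outcome: str) -> None:
        self._runs.setdefault(fingerprint(setup), []).append(outcome)

    def previous_outcomes(self, setup: dict) -> list:
        return list(self._runs.get(fingerprint(setup), []))


catalogue = Catalogue()

# Hypothetical entry left behind by a student in a previous year
catalogue.record(
    {"solver": "example_solver", "mesh_cells": 2000000, "courant": 5.0},
    "diverged",
)

# A new student submits the same setup, with the keys in a different order
new_setup = {"courant": 5.0, "mesh_cells": 2000000, "solver": "example_solver"}
history = catalogue.previous_outcomes(new_setup)
if history and all(outcome == "diverged" for outcome in history):
    print("Warning: this exact setup has been tried before and diverged.")
```

The only design choice worth noting is the canonical JSON step: without sorting the keys, two identical setups written in a different order would get different fingerprints and the lookup would miss them.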

Best regards,
Bruno

peter.pan · March 18, 2014, 07:37

Dear Bruno,

Thanks for your reply.

Well, if I understand it correctly, does that mean I can actually use it to access remote HPC clusters? My access rights to some of the HPC clusters that I previously used have expired, so I am finding it hard to run simulations in parallel.

So if Hadoop could actually allow me to access and use multiple clusters, or even one cluster, that would be immensely beneficial to my research, I guess.

Thanks,
Peetak

clived · April 5, 2015, 18:39

I came across this forum quite by chance as a result of your Hadoop questions. I am a Hadoop newbie and am looking for a Hadoop-related forum here.

Any suggestions would be appreciated.

Clive

wyldckat · April 11, 2015, 13:51

Greetings to all!

So I found out recently, thanks to Lorena Barba re-tweeting about this, that MPI is apparently getting too old and that Hadoop/Spark is one of a few technologies that are likely to replace MPI sometime in the future:

In addition, those blog posts refer to Chapel, a programming language that has already found its way onto this forum: http://www.cfd-online.com/Forums/mai...languages.html

Therefore, I'm coming back to the original post on this thread:

Quote:

Originally Posted by peter.pan

Apparently it is a 'software for reliable, scalable, distributed computing'.

Is it worth giving it a try?

I still stand by my post #2, namely that Hadoop would only be worthwhile for helping to manage the execution of simulations, using already existing applications. But with this new information, I can write a bit more on this topic.

Essentially, Hadoop/Spark is pretty much a platform in its own right. Hadoop is mostly written in Java (Spark mostly in Scala, which also runs on the JVM), and Java is a language that (AFAIK) is rarely used for programming CFD software, simply because it runs on a virtual machine (bytecode that is compiled just-in-time) and usually won't be as run-time efficient as C/C++/FORTRAN. But as the blog post argues, with today's CPUs and how things have evolved, this language overhead might not be what's holding us back any more; what matters now is how long things take to code. In fact, there are already optimization strategies embedded into these languages that we are unlikely to be able to reproduce with C/C++/FORTRAN without considerable effort (or at least without searching for the right library).

Then there is the other detail: at least in theory, to make the most of the Hadoop platform, it's best to write the source code for the CFD software directly in Java and link it directly to Hadoop's libraries, which in most cases implies re-writing the whole code.

Using C++ and other languages to connect to Hadoop is also possible, but after a quick search, it seems that it requires some investigation into what should really be used as the base library for making the connection; MapReduce-MPI, Hadoop Pipes and MR4C (Google's implementation) are just a few of the few dozen that already exist.
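As a rough illustration of the "manage the execution of already existing applications" route, here is a minimal sketch in Python. It assumes Hadoop Streaming, which is not mentioned above: a utility shipped with Hadoop that lets any executable act as a map or reduce step by reading lines from standard input and writing tab-separated key/value lines to standard output. Everything else is made up for the example: `my_cfd_solver` and its `-case` flag are placeholders for whatever existing solver would be wrapped, and the case directory names are hypothetical.

```python
import subprocess


def run_case(case_dir, solver_cmd=("my_cfd_solver", "-case")):
    """Run an existing solver on one case directory; return 'ok' or 'failed'.

    'my_cfd_solver' and '-case' are placeholders, not a real tool or flag.
    """
    try:
        result = subprocess.run([*solver_cmd, case_dir], capture_output=True)
    except OSError:
        # Solver binary missing or not executable on this node
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def mapper(lines, runner=run_case):
    """Map step: one input line names one case; emit 'status<TAB>case_dir'."""
    for line in lines:
        case_dir = line.strip()
        if case_dir:
            yield f"{runner(case_dir)}\t{case_dir}"


def reducer(lines):
    """Reduce step: count how many cases ended with each status."""
    counts = {}
    for line in lines:
        status, _, _case_dir = line.rstrip("\n").partition("\t")
        counts[status] = counts.get(status, 0) + 1
    return counts


# Local dry run with a fake runner, so neither a solver nor Hadoop is needed
def fake_runner(case_dir):
    return "failed" if "bad" in case_dir else "ok"


records = list(mapper(["cases/run_001\n", "cases/bad_mesh\n", "\n"], runner=fake_runner))
summary = reducer(records)
print(summary)
```

In an actual Streaming job, the map and reduce steps would be separate scripts that iterate over `sys.stdin` and `print` their records, with Hadoop handling the distribution and the sort between the two steps. Note that this only farms out independent cases; it does nothing to parallelize a single simulation the way MPI does.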

Then there are also complete alternatives to any of the above, such as:

UPC: http://en.wikipedia.org/wiki/Unified_Parallel_C
MPI-RMA, which is allegedly (one of) the next generation(s) of MPI.

All of this just to say that using Hadoop as a building block for creating CFD applications is something that might perhaps happen 3-5 years from now, or it might be used in the back office of cloud services that provide CFD software as an online service, without us even knowing about it.

Best regards,
Bruno
