Re: apache hadoop

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Wed, 19 Dec 2018 13:52:58 -0500
To: CODE4LIB_at_LISTS.CLIR.ORG

Thank you for the replies. I am now aware of three different tools for parallel/distributed computing:

  1. GNU Parallel (https://www.gnu.org/software/parallel/) - 'Works great for me. Easy to install. Easy to use. Given well-written programs, can be used to take advantage of all the cores on your computer.

  2. Apache Hadoop (https://hadoop.apache.org) - 'Seems to be optimized for map/reduce computing, but can take advantage of many computers with many cores.

  3. Apache Spark (http://spark.apache.org) - 'Seems to have learned lessons from Hadoop. 'Seems a bit easier to install than Hadoop, can also take advantage of many computers with many cores, and comes with a few programming interfaces such as Python & Java.

If I had one computer with 132 cores, then I'd probably stick with Parallel. If I had multiple computers with multiple cores, then I'd probably look more closely at Spark. Both Hadoop & Spark are "heavy" frameworks compared to Parallel, which is a single Perl script.
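
For what it's worth, here is a minimal sketch of what a job run through Spark's Python interface might look like; the word-count task and the input path (corpus/*.txt) are hypothetical, and it assumes Spark and the pyspark package are already installed:

  # minimal PySpark sketch; the corpus path and task are made up for illustration
  from pyspark.sql import SparkSession

  # start a session; "local[*]" uses all the cores on one machine
  spark = SparkSession.builder.master("local[*]").appName("word-count").getOrCreate()

  # read the corpus, split lines into words, and tally each word
  lines  = spark.sparkContext.textFile("corpus/*.txt")
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

  # print the ten most frequent words
  for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
      print(word, n)

  spark.stop()

The same script could, in principle, be pointed at a cluster by changing the master URL, which is where the "many computers with many cores" part comes in.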

--
Eric Morgan
Received on Wed Dec 19 2018 - 13:55:31 EST