March 3, 2017

Apache projects around Hadoop in a nutshell

You might be a little surprised to learn that the Apache Software Foundation hosts over 100 top-level software projects, and, if you haven’t come across them yet, you will certainly be surprised that at least a dozen of them are Hadoop-related. Today, as a follow-up to my Hadoop primer, I will briefly cover four projects in the Hadoop ecosystem.

Apache Whirr

Configuring Hadoop can be tricky and complicated. Apache Whirr helps you set up a Hadoop cluster from scratch from a client terminal, but its use is not limited to Hadoop. Perhaps inspired by Chef, it uses configuration files, or “whirr recipes”, to run different services in a cloud-neutral way. For instance, you can use it to launch a Hadoop cluster on the Amazon cloud. In a Whirr recipe for Hadoop, you can specify how many nodes you want, which Amazon Machine Image (AMI) to use, which version of Hadoop to install on the nodes, and so on.
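Here is a minimal sketch of what such a recipe might look like, based on the Whirr documentation of the era; the image and hardware IDs below are placeholders, not recommendations:

    # hadoop-ec2.properties -- an illustrative Whirr recipe (IDs are placeholders)
    whirr.cluster-name=myhadoopcluster
    # one master (namenode + jobtracker) and three workers (datanode + tasktracker)
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.hardware-id=m1.large
    whirr.image-id=us-east-1/ami-xxxxxxxx

With the recipe saved, “whirr launch-cluster --config hadoop-ec2.properties” brings the cluster up, and “whirr destroy-cluster --config hadoop-ec2.properties” tears it down again.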

Apache Hive

Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Simply put, it offers a relational, SQL-style view of data stored in Hadoop, which makes it ideal for structured data analysis. Hive reads and uses the Hadoop configuration properties, but you can override Hadoop settings in your Hive setup.
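As a quick illustration of that last point (a hedged sketch; mapred.reduce.tasks is just an example property), a Hadoop setting can be overridden for a single Hive session from the command line or from within the CLI:

    hive --hiveconf mapred.reduce.tasks=8

    -- or, inside the Hive CLI:
    SET mapred.reduce.tasks=8;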

Internally, all HiveQL statements are translated into a set of MapReduce jobs that are submitted to Hadoop. The “hive” command starts a MySQL-like terminal, which supports many standard SQL statements such as SHOW TABLES, DESCRIBE, and SELECT. Apache also provides a simple web-based user interface for submitting Hive queries, which is included in the Hive source tarball.
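To give a flavor of a session, here are a few statements one might type at the hive prompt (the “genes” table and its columns are made up for illustration); the SELECT below is the kind of statement that gets compiled into MapReduce jobs behind the scenes:

    SHOW TABLES;
    DESCRIBE genes;
    SELECT chromosome, COUNT(*) AS gene_count
    FROM genes
    GROUP BY chromosome;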

Hive is not Oracle! The downside of Hive is the substantial overhead of job submission and scheduling imposed by the underlying Hadoop/HDFS machinery. As a result, latency for Hive queries is generally high, even when the data sets involved are very small. Hive is not designed for online transaction processing, and it does not offer real-time queries.

Apache Pig

Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The corresponding programming language is called Pig Latin and, for those who don’t know, it is a humorous reference to the English word game of the same name. Pig Latin scripts can be extended using user-defined functions (UDFs) written in Java or Python. Pig also supports making Perl-like “back-ticked” system calls, whose output can be consumed through Pig’s STREAM operator. Pig was originally developed at Yahoo Research in 2006 and later moved to Apache. Its prominent users include Yahoo, Twitter, LinkedIn, AOL, and Nokia, among many others.
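As a hedged sketch of that STREAM feature (the file name, schema, and command are made up), a relation can be piped through an external program like this:

    -- pass each tuple through an external command; keep the second field
    raw = LOAD 'data.tsv' AS (id:chararray, seq:chararray);
    piped = STREAM raw THROUGH `cut -f 2`;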

Advanced MapReduce normally requires writing Java; Pig is a strong alternative. Roughly 200 lines of Java code for submitting a Hadoop job can be written as 10 lines of Pig Latin! Pig scripts can be submitted directly, as in “pig myScript.pig”, or Pig statements can be run in a Pig shell that is called “grunt”. So, as you see, if you ever want to contribute to the Apache Pig project, having at least some sense of humor is a must. :)
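To back up that line-count claim, here is the canonical word-count example in Pig Latin (input and output paths are placeholders):

    -- wordcount.pig: count word occurrences across an input file
    lines   = LOAD 'input.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group, COUNT(words);
    STORE counts INTO 'output';

The equivalent hand-written Java MapReduce program needs a mapper class, a reducer class, and a driver, easily an order of magnitude more code.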

Pig first parses and syntax-checks each statement for you; the statement is then optimized behind the scenes and transparently converted into mapper and reducer programs before being automatically submitted to Hadoop. There is an Eclipse plug-in for Pig, PigPen, and also a Pig debugger, Penny.
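If you are curious about what Pig generates from your script, the grunt shell can show you. Continuing the word-count sketch above, EXPLAIN prints the logical, physical, and MapReduce plans for a relation:

    grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
    grunt> EXPLAIN counts;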

Apache HBase

HBase is an open-source, non-relational, distributed database modeled after Google’s BigTable and written in Java. Facebook’s messaging platform uses HBase. As explained on the Apache HBase web site, it is suitable for hosting very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware.

As I mentioned earlier, one of the shortcomings of Apache Hive is its lack of real-time data access. HBase can be very useful when you need random, real-time read/write access to your Big Data in a distributed environment. Nevertheless, HBase is not the answer if you need a distributed relational database. HBase is a type of “NoSQL” database; in fact, it is more accurately described as a “data store” than a “database”.

Having covered both Hive and HBase, one might wonder how HBase, a Hadoop-based technology, can provide faster data access than Hive, which also runs on top of Hadoop. The answer is two-fold: first, for high-speed lookups HBase internally keeps your data in indexed files called “StoreFiles” on HDFS; second, HBase dispenses with typed columns, secondary indexes, and support for online transactions.
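To get a feel for that random-access model, here are a few commands in the HBase shell (the table, column family, and row key are made up for illustration); note that each get is a low-latency keyed lookup, not a scheduled MapReduce job:

    create 'genes', 'info'
    put 'genes', 'BRCA2', 'info:chromosome', '13'
    get 'genes', 'BRCA2'
    scan 'genes'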

Topics: Apache, Bioinformatics, Cloud, Hadoop, HBase, Hive, Pig