Hadoop is an open source Apache project, hadoop.apache.org (downloads are available through mirror sites listed on the project's home page).

The current tool of choice for Big Data analysis is Hadoop, a software framework that distributes computational work across many nodes in a computer cluster. Hadoop's design makes it particularly effective at processing very large data volumes, but it can be used for data sets of almost any size. A Hadoop test cluster can be built with as few as four machines, and companies that don't wish to invest in their own cluster can run Hadoop on cloud services such as Amazon Web Services or Windows Azure.

The core components of a Hadoop system are the Hadoop Distributed File System (HDFS), which manages the placement of data throughout the cluster, and MapReduce, which provides a simple parallel programming model for data analysis on a cluster of computers. Importing data into HDFS involves copying it, and perhaps reformatting it, from wherever it is currently stored into the file system.
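As a rough sketch of that copy step, the following Java program uses Hadoop's FileSystem API to push a local file into HDFS (the equivalent shell command is hdfs dfs -put). The file and directory names are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIntoHdfs {
  public static void main(String[] args) throws Exception {
    // Assumes the cluster configuration (core-site.xml, hdfs-site.xml) is on the
    // classpath, so FileSystem.get() resolves to the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths: a file on the local disk and a target directory in HDFS.
    Path localFile = new Path("/data/exports/sales.csv");
    Path hdfsDir = new Path("/user/analyst/input/");

    fs.copyFromLocalFile(localFile, hdfsDir);
    fs.close();
  }
}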

HDFS takes care of distributing data across the cluster, replicating it to protect the dataset's integrity, and interacting with the job management system so programs have efficient access to the data. The data should be "unstructured" in the sense that, once it is split into chunks (blocks), no chunk depends on any other: the program analyzing one chunk never needs to communicate with, or wait for information from, the programs working on other chunks.
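Replication is controlled through ordinary Hadoop configuration. The short sketch below, with a hypothetical file path, shows the two usual knobs: the dfs.replication property for files a client writes, and FileSystem.setReplication() for files already in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Replication factor for files this client writes; 3 is the usual default.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Raise the replication of an existing (hypothetical) file so more nodes
    // hold local copies of its blocks.
    fs.setReplication(new Path("/user/analyst/input/sales.csv"), (short) 5);
    fs.close();
  }
}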

MapReduce can then run the analysis programs, Mappers and Reducers, on the distributed data. The system is designed so that many Mappers work on the full dataset in parallel, with each Mapper processing data located as close as possible to the CPU it runs on, so the time spent reading and writing data is minimized. This results in a large speedup in analysis time compared with a single computer attached to a large disk system.
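To make the Mapper/Reducer division concrete, here is the classic word-count job written against Hadoop's Java MapReduce API; class names and input/output paths are illustrative. Each Mapper tokenizes the lines in one block of input and emits (word, 1) pairs, and each Reducer sums the counts for a given word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on each block of input, close to where the data lives,
  // and emits a (word, 1) pair for every word it sees.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives every count emitted for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // combine locally to cut network traffic
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}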

There is an ecosystem of related projects built on top of HDFS and MapReduce. These projects include:

  • Pig – a high-level data-flow programming language and execution framework for expressing data-intensive analysis programs.
  • Hive – a data warehouse infrastructure built on top of Hadoop that offers an SQL-like language on top of MapReduce for data summarization, query, and analysis; initially developed by Facebook (see the query sketch after this list).
  • Sqoop – transfers data between Hadoop and structured datastores such as relational databases.
  • Impala – Cloudera Impala is a way to query the data directly using SQL.
  • Flume – ingests data as it’s generated by external sources and puts it into HDFS.
  • Hue – a graphical front end to Hadoop.
  • Oozie – a workflow management tool.
  • Mahout – a machine learning library.
  • ZooKeeper – a coordination service that keeps the packaging mess above in order.
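As an example of how the SQL-flavored tools in this list are used, the following sketch queries Hive from Java over JDBC. The connection URL, credentials, and the weblogs table are all hypothetical, and Impala can be queried in a similar way.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; adjust host, port, and database for your cluster.
    String url = "jdbc:hive2://localhost:10000/default";
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement();
         // Hypothetical table: count page hits in a web log stored in Hive.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}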