Sunday, August 7, 2016

BigData: Hadoop and MapReduce

This post gives a general idea about Big Data and the Hadoop architecture, along with MapReduce.

Data Sources
According to IBM: "Every day, 2.5 billion gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered in sensors and medical devices, videos and transaction records"

Definition of Big Data 
Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software (International Journal of Internet Science, 2012, 7(1), 1–5).

The 3 Vs - Volume, Variety, Velocity
Volume - Huge volumes of data need to be stored, which calls for cheap and reliable storage solutions.
Variety - Refers to storing all types of data, often in their raw format.
Velocity - Refers to the ability to process data as it arrives, which can be at a very high rate when the amount of data is huge.


The 3 Vs were first defined in a 2001 research report by Douglas Laney titled "3D Data Management: Controlling Data Volume, Velocity and Variety".
In 2012 he updated the definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization".

Doug Cutting, Creator of Hadoop
He used the papers Google published about its distributed file system (GFS) and its processing framework, MapReduce, to create an open-source version of Google's system. After he joined Yahoo, the Hadoop project was created as a more scalable and stable version of that earlier system.

The papers Google published about their distributed file system (GFS) and their processing framework, MapReduce, are available online.
"Hadoop" is name of elephant toy of Cutting's son.

Core Hadoop
It consists of a distributed cluster of computers that stores data in HDFS (Hadoop Distributed File System) and processes it with MapReduce.

Hadoop Ecosystem 

HDFS is used to store data.
MR (MapReduce) is used to process the data in HDFS, but writing MR jobs requires a programming language such as Java or Python.

Alternatives are to use:
Hive, which converts ordinary SQL queries into MR jobs; these are typically used to run batch processes.
Pig, which converts simple script commands into MR jobs.

But these can be time consuming when the amount of data is very large.

To overcome this, other projects were developed:
Impala, which takes SQL queries and runs them directly against the data in HDFS. It is highly optimised and much faster than Hive or plain MapReduce.
HBase, which is a real-time database built on top of HDFS.

Sqoop, which takes data from relational (SQL) databases and puts it into HDFS so it can be processed alongside other data.
Flume, which ingests data as it is generated by external systems and puts it into the cluster.

Hue, which is a graphical front end to the cluster.
Oozie, which is a workflow management tool.
Mahout, which is a machine learning library.

All of these come packaged in CDH (Cloudera's Distribution including Apache Hadoop). CDH is free and open source.

Understanding HDFS
 
In HDFS, files are divided into blocks of 64 MB. Each block is stored as a file named with the prefix blk_ followed by a block ID.
Each block is stored on a separate node in the cluster, called a data node. Each cluster has a name node, which keeps track of which blocks make up which file; this information on the name node is the metadata.


A 64 MB block size helps compared to the 16 KB block sizes used by other file systems:
1. It keeps management by the name node easy; otherwise there would be too many blocks to track.
2. One mapper is needed for each block we want to process; with tiny blocks there would be a huge number of mappers, each processing a tiny piece of data, which isn't efficient.

To guard against network and disk failures, each block is stored on 3 different data nodes. This is data redundancy. If one of the nodes fails, the name node detects that some blocks are no longer replicated 3 times and starts making new copies of those blocks, so that each block is again replicated 3 times in the cluster.
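
As a rough illustration of the block and replication arithmetic, here is a minimal Python sketch; the 1 GB file size is just an assumed example, not a value from the post:

import math

file_size_mb = 1024   # assumed example: a 1 GB file
block_size_mb = 64    # HDFS block size discussed above
replication = 3       # default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)   # 16 blocks to track and map over
block_copies = num_blocks * replication                # 48 block files spread across data nodes
raw_storage_mb = file_size_mb * replication            # about 3072 MB of physical disk used

print(num_blocks, block_copies, raw_storage_mb)

So a 1 GB file costs roughly 3 GB of raw disk across the cluster; that overhead is the price of surviving disk and node failures.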

To protect against name node failure, it is backed up with a standby name node, which is activated when the primary does not respond. The name node's metadata can also be backed up on a remote disk using NFS (Network File System).
Hadoop Commands
Hadoop commands start with "hadoop fs" and are similar to UNIX commands.
eg.- hadoop fs -ls : list files in a directory
       hadoop fs -put : copy a local file into a particular HDFS folder

Map Reduce
MapReduce breaks the complete data set into small chunks and processes them in parallel: mappers work on the chunks in parallel and emit key/value pairs, and reducers then aggregate the values for each key.
Hadoop takes care of the Shuffle and Sort phase. You do not have to sort the keys in your reducer code; you receive them already in sorted order.
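
As a minimal sketch of what mapper and reducer code can look like, here is a word count written in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read stdin and write stdout; the file names mapper.py and reducer.py are just illustrative:

# mapper.py - emits one (word, 1) pair per word, tab-separated
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{0}\t1".format(word))

# reducer.py - sums the counts for each word
# The shuffle/sort phase guarantees that all lines for the same key
# arrive together, so we only need to detect when the key changes.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{0}\t{1}".format(current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("{0}\t{1}".format(current_word, current_count))

Before submitting the job to the cluster, the pair can be tested locally with a UNIX pipeline such as cat input.txt | python mapper.py | sort | python reducer.py, where the sort command stands in for Hadoop's shuffle-and-sort step.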

 
The Job Tracker is the MR daemon that keeps track of all pending jobs in a cluster, and each data node runs its own Task Tracker to track the tasks assigned to it. This makes it simple to process the data stored on each individual data node, and processing the data locally with mappers speeds things up. Once the mappers have finished, the data is sorted and passed to the reducers, which may run on different nodes in the same cluster.

Resource Web Links:
Udacity Course: https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617
Cloudera (CDH): https://www.cloudera.com/products/apache-hadoop/key-cdh-components.html