Tuesday 18 April 2017

Hadoop Introduction

Below are the notes I took while doing the "Hadoop 101" course at https://bigdatauniversity.com. I'm keeping them on the blog for my own reference, and they may be helpful to others.

M1: Intro to Hadoop


What is Hadoop?

* Framework developed in Java for processing structured, semi-structured, and unstructured data on commodity computer nodes
* Not suitable for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP)
* Not a replacement for an RDBMS
* Facebook processes 600 GB of data
* Twitter processes 7 TB every day
* 80% of data is unstructured

Hadoop related opensource projects:

* Eclipse
* Apache Lucene - full-text search engine written in Java
* HBase - the Hadoop database
* Hive - SQL-like language (HiveQL) for querying data in Hadoop
* Pig - high-level data-flow language
* ZooKeeper - coordination service, e.g. for managing the NameNodes
* Spark - Execution engine
* Apache Ambari - Web UI for managing & monitoring the Hadoop cluster
* Apache Avro - for data serialization
* Apache UIMA - Unstructured Information Management Applications

LabWork
-------

M2: Hadoop Architecture & HDFS


Terminologies:

* Each node is a commodity computer
* A rack is a collection of 30 to 40 nodes
* A Hadoop cluster is a collection of racks
* Bandwidth is higher between nodes within the same rack

Pre 2.2 Arch:

Two components
1. Distributed File System or HDFS - for storing the data
>NameNode and DataNode
>HDFS runs on top of existing FS
>No random access
>Default block size is 128 MB   
>Replication of blocks across cluster

2. MapReduce Engine - a framework for processing 
>JobTracker and TaskTracker
>A single JobTracker per cluster
>MapReduce is based on Google's MapReduce paper


Hadoop 2.2

* Provides YARN (Yet Another Resource Negotiator)
>Referred to as MapReduce v2
>ResourceManager and Scheduler are introduced
>NameNode and DataNode still exist
>No JobTracker and TaskTracker
>The ApplicationMaster does the resource negotiation
>The NameNode was a single point of failure in Hadoop 1.x

HDFS Replication:
-----------------
> Eg: with replication factor 3:
The first copy of a block is placed on a node in Rack 1. The second copy is placed on a DataNode in a different rack, e.g. Rack 2. The third copy is placed on another node in the same rack as the second copy, in this case Rack 2.
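
A quick way to check where the replicas of a file actually landed (a minimal sketch; the paths are placeholders):

# list each block of the file and the DataNodes holding its replicas
hdfs fsck /user/me/input/data.txt -files -blocks -locations
# overall replication health of the whole file system
hdfs fsck /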

HDFS COMMAND LINE
-----------------
hadoop fs <args>
hdfs dfs <args>
eg: hdfs dfs -ls

copyFromLocal / put
copyToLocal / get
getmerge
setrep - for setting the replication factor

hadoop fs -help
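
Eg, a small sketch of the commands above (paths and file names are placeholders):

hdfs dfs -mkdir -p /user/me/input
hdfs dfs -copyFromLocal data.txt /user/me/input/         # same as -put
hdfs dfs -copyToLocal /user/me/input/data.txt copy.txt   # same as -get
hdfs dfs -getmerge /user/me/input merged.txt             # merge a dir's files into one local file
hdfs dfs -setrep -w 2 /user/me/input/data.txt            # change the replication factor to 2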

M3: Hadoop Administration


1. Adding / removing nodes
> from the Ambari console
> services can be added or removed from a particular node
2. Verifying health
> hdfs dfsadmin -report
3. Start and stopping component
> Services like Pig, Hive, Sqoop, etc. can be started and stopped
4. Configuration
> multiple config files, such as:
a. hadoop-env.sh - environment settings, e.g. where Java is (JAVA_HOME)
b. core-site.xml - Hadoop core settings, such as I/O settings common to HDFS and MapReduce
c. hdfs-site.xml - configuration of the NameNode directories, secondary NameNode, DataNodes, block size

Before changing, we need to stop the service.
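
For reference, effective configuration values can also be read from the command line without opening the XML files (a sketch; the property keys are the standard Hadoop ones):

hdfs getconf -confKey dfs.blocksize     # current block size
hdfs getconf -confKey dfs.replication   # current default replication factor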

M4: Hadoop Components


1. MapReduce
Data enters in an unstructured format; the Map phase converts it into key/value pairs
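
As a rough illustration of the key/value idea, a word count can be run with Hadoop Streaming using plain shell commands as mapper and reducer (the jar path and HDFS paths are assumptions for this sketch):

# mapper: emit one word per line (each word becomes a key)
# reducer: input arrives sorted by key, so uniq -c counts each distinct word
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/me/input \
  -output /user/me/wordcount-out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'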

2. Pig and Hive
Both convert a high-level language into MapReduce programs.
Pig:
> Two execution environments: local (interpreter) and distributed
> We can use the Grunt prompt
> We can use a script
Hive:
> SQL-like queries (HiveQL) compiled into MapReduce jobs
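
A sketch of launching each (the script name and query are placeholders):

pig -x local                     # start the Grunt prompt in local (interpreter) mode
pig -x mapreduce wordcount.pig   # run a Pig Latin script on the cluster
hive -e 'SELECT COUNT(*) FROM web_logs;'   # run a HiveQL query from the shell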

3. Flume:
By Cloudera
eg: collecting logs from all nodes and moving them to persistent storage
tail(access.log) --> HDFS

AGENT   | COLLECTOR | HDFS
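
An agent is typically started from the command line like this (the config file and agent name are made up for the example):

# start the Flume agent named "agent1" defined in tail-to-hdfs.conf
flume-ng agent --conf ./conf --conf-file ./tail-to-hdfs.conf --name agent1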

4. Sqoop:
Transfers data between Hadoop and relational databases
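
Eg, importing one relational table into HDFS (connection string, table, and paths are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --username dbuser -P \
  --target-dir /user/me/orders \
  -m 4
# the reverse direction (HDFS -> DB) is "sqoop export"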

5. Oozie:
- Manages workflows of Hadoop jobs (workflow definitions are stored in HDFS)
- Used to control Hadoop jobs
- Workflows are defined in hPDL (an XML Process Definition Language)
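
A workflow is submitted with a job.properties file pointing to the workflow.xml in HDFS; a sketch (the URL and file names are assumptions):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>   # check status; the id is printed by -run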
