Tuesday 18 April 2017

Hadoop Introduction

Below are the notes I took while doing the "Hadoop 101" course at https://bigdatauniversity.com. I'm keeping them on the blog for my own reference, and they may be helpful to others.

M1: Intro to Hadoop


What is Hadoop?

* Framework developed in Java for processing structured, semi-structured, and unstructured data on commodity computer nodes
* Not suitable for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP)
* Not a replacement for an RDBMS
* Facebook processes 600 GB of data
* Twitter processes 7 TB every day
* 80% of data is unstructured

Hadoop related opensource projects:

* Eclipse
* Apache Lucene - full-text search engine written in Java
* HBase - the Hadoop database
* Hive - SQL-like language (HiveQL) for querying data in Hadoop
* Pig - high-level data-flow language
* ZooKeeper - coordination service, e.g. for managing the NameNodes
* Spark - Execution engine
* Apache Ambari - Web UI for managing & monitoring the Hadoop cluster
* Apache Avro - for data serialization
* Apache UIMA - Unstructured Information Management Applications

LabWork
-------

M2: Hadoop Architecture & HDFS


Terminologies:

* Each node is a commodity computer
* A rack is a collection of 30 to 40 nodes
* A Hadoop cluster is a collection of racks
* Bandwidth is higher between nodes within the same rack

Pre 2.2 Arch:

Two components
1. Distributed File System or HDFS - for storing the data
>NameNode and DataNode
>HDFS runs on top of existing FS
>No random access
>Default block size is 128 MB   
>Replication of blocks across cluster

2. MapReduce Engine - a framework for processing 
>JobTracker and TaskTracker
>A single JobTracker per cluster
>MapReduce is based on Google's MapReduce paper


Hadoop 2.2

* Provides YARN (Yet Another Resource Negotiator)
>Referred to as MapReduce v2
>ResourceManager and Scheduler are introduced
>NameNode and DataNode still exist
>No JobTracker and TaskTracker
>The ApplicationMaster does the resource negotiation
>The NameNode was a single point of failure in Hadoop 1.x

HDFS Replication:
-----------------
> Eg: with replication factor 3:
The first copy of a block is placed on a node in Rack 1. The second copy is placed on a DataNode in a different rack, e.g. Rack 2. The third copy is placed on another node in the same rack as the second copy, in this case Rack 2.
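
A quick way to check where the replicas of a file actually landed (a minimal sketch; the paths are placeholders):

# list each block of the file and the DataNodes holding its replicas
hdfs fsck /user/me/input/data.txt -files -blocks -locations
# overall replication health of the whole file system
hdfs fsck /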

HDFS COMMAND LINE
-----------------
hadoop fs <args>
hdfs dfs <args>
eg: hdfs dfs -ls

copyFromLocal / put
copyToLocal / get
getmerge
setrep - for setting the replication factor

hadoop fs -help
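
Eg, a small sketch of the commands above (paths and file names are placeholders):

hdfs dfs -mkdir -p /user/me/input
hdfs dfs -copyFromLocal data.txt /user/me/input/         # same as -put
hdfs dfs -copyToLocal /user/me/input/data.txt copy.txt   # same as -get
hdfs dfs -getmerge /user/me/input merged.txt             # merge a dir's files into one local file
hdfs dfs -setrep -w 2 /user/me/input/data.txt            # change the replication factor to 2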

M3: Hadoop Administration


1. Adding / removing nodes
> from the Ambari console
> services can be added or removed from a particular node
2. Verifying health
> hdfs dfsadmin -report
3. Start and stopping component
> Services like Pig, Hive, Sqoop, etc. can be started and stopped
4. Configuration
> multiple config files, such as:
a. hadoop-env.sh - environment settings, e.g. where Java is (JAVA_HOME)
b. core-site.xml - Hadoop core settings, such as I/O settings common to HDFS and MapReduce
c. hdfs-site.xml - configuration of the NameNode directories, secondary NameNode, DataNodes, block size

Before changing, we need to stop the service.
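
For reference, effective configuration values can also be read from the command line without opening the XML files (a sketch; the property keys are the standard Hadoop ones):

hdfs getconf -confKey dfs.blocksize     # current block size
hdfs getconf -confKey dfs.replication   # current default replication factor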

M4: Hadoop Components


1. MapReduce
Data enters in an unstructured format; the Map phase converts it into key/value pairs
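
As a rough illustration of the key/value idea, a word count can be run with Hadoop Streaming using plain shell commands as mapper and reducer (the jar path and HDFS paths are assumptions for this sketch):

# mapper: emit one word per line (each word becomes a key)
# reducer: input arrives sorted by key, so uniq -c counts each distinct word
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/me/input \
  -output /user/me/wordcount-out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'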

2. Pig and Hive
Both convert a high-level language into MapReduce programs.
Pig:
> Two execution environments: local (interpreter) and distributed
> We can use the Grunt prompt
> We can use a script
Hive:
> SQL-like queries (HiveQL) compiled into MapReduce jobs
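
A sketch of launching each (the script name and query are placeholders):

pig -x local                     # start the Grunt prompt in local (interpreter) mode
pig -x mapreduce wordcount.pig   # run a Pig Latin script on the cluster
hive -e 'SELECT COUNT(*) FROM web_logs;'   # run a HiveQL query from the shell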

3. Flume:
By Cloudera
eg: collecting logs from all nodes and moving them to persistent storage
tail(access.log) --> HDFS

AGENT   | COLLECTOR | HDFS
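
An agent is typically started from the command line like this (the config file and agent name are made up for the example):

# start the Flume agent named "agent1" defined in tail-to-hdfs.conf
flume-ng agent --conf ./conf --conf-file ./tail-to-hdfs.conf --name agent1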

4. Sqoop:
Transfers data between Hadoop and relational databases
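
Eg, importing one relational table into HDFS (connection string, table, and paths are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --username dbuser -P \
  --target-dir /user/me/orders \
  -m 4
# the reverse direction (HDFS -> DB) is "sqoop export"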

5. Oozie:
- Manages workflows of Hadoop jobs (workflow definitions are stored in HDFS)
- Used to control Hadoop jobs
- Workflows are defined in hPDL (an XML Process Definition Language)
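
A workflow is submitted with a job.properties file pointing to the workflow.xml in HDFS; a sketch (the URL and file names are assumptions):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>   # check status; the id is printed by -run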
