Below are the notes I took during the "Hadoop 101" course at https://bigdatauniversity.com. I am keeping them on the blog for my own reference, and they may be helpful to others.
M1: Intro to Hadoop
What is Hadoop?
* Framework written in Java for processing structured, semi-structured and unstructured data on clusters of commodity computers
* Not suitable for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP)
* Not a replacement for an RDBMS
* Facebook - processes 600 GB of data
* Twitter - processes 7 TB of data every day
* 80% of data is unstructured
Hadoop-related open source projects:
* Eclipse
* Apache Lucene - Full-text search engine written in Java
* HBase - The Hadoop database
* Hive - SQL-like language (HiveQL) for querying data stored in Hadoop
* Pig - High-level language (Pig Latin)
* ZooKeeper - Coordination service, e.g. used for NameNode failover
* Spark - Execution engine
* Apache Ambari - Web UI for managing & monitoring the Hadoop cluster
* Apache Avro - for data serialization
* Apache UIMA - Unstructured Information Management Applications
LabWork
-------
M2: Hadoop Architecture & HDFS
Terminologies:
* A node is a single commodity computer
* A rack is a collection of 30 to 40 nodes
* A Hadoop cluster is a collection of racks
* Bandwidth between nodes within a rack is higher than bandwidth between racks
Pre-2.2 architecture:
Two components
1. Distributed File System or HDFS - For storing the data
>NameNode and DataNode
>HDFS runs on top of existing FS
>No random access
>Default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onwards)
>Replication of blocks across cluster
2. MapReduce Engine - a framework for processing
>Job tracker and Task Tracker
>Single JobTracker per cluster
>MapReduce is based on a Google paper
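To make the MapReduce model a bit more concrete for myself, here is a minimal sketch in plain Python. It is not the Hadoop API; it only imitates the three steps (map, shuffle, reduce) in memory with a word count, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are my own, chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one line of raw text into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the map and reduce steps run as separate tasks on different nodes, and the framework performs the shuffle between them.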
Hadoop 2.2
* Provides YARN (Yet Another Resource Negotiator)
>Referred to as MapReduce v2
>Resource Manager and Scheduler are introduced
>NameNode and DataNode still exist
>No JobTracker and TaskTracker
>The Application Master does the resource negotiation
>NameNode was a single point of failure in Hadoop 1.1
HDFS Replication:
-----------------
> Eg: Replication factor 3.
The first replica of a block is placed on a data node in Rack 1. The second replica is placed on a data node in a different rack, eg: Rack 2. The third replica is placed on another data node in the same rack as the second replica, in this case Rack 2.
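The placement rule above, together with splitting a file into blocks, can be sketched roughly as below. This is a simplified model and not HDFS code: it assumes a 128 MB block size, always picks the first nodes in each rack (real HDFS chooses nodes for you), and assumes the remote rack has at least two nodes. The names `block_count` and `place_replicas` and the rack/node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; the real value is configurable


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(racks, first_rack):
    """Pick (rack, node) locations for one block with replication factor 3.

    racks maps a rack name to the list of node names in that rack.
    """
    # Any rack other than the first one serves as the remote rack.
    remote_rack = next(rack for rack in racks if rack != first_rack)
    first = (first_rack, racks[first_rack][0])
    second = (remote_rack, racks[remote_rack][0])
    # Third replica: same rack as the second, but a different node.
    third = (remote_rack, racks[remote_rack][1])
    return [first, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
    print(block_count(300))
    print(place_replicas(racks, "rack1"))
```

With this layout, losing all of Rack 1 or all of Rack 2 still leaves at least one copy of the block, while two of the three copies sit in the same rack and so share the faster in-rack bandwidth.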
HDFS COMMAND LINE
-----------------
hadoop fs <args>
hdfs dfs <args>
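eg: hdfs dfs -ls
-copyFromLocal / -put - copy files from the local file system to HDFS
-copyToLocal / -get - copy files from HDFS to the local file system
-getmerge - merge the files of an HDFS directory into one local file
-setrep - for setting the replication factor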
hadoop fs -help
M3: Hadoop Administration
1. Adding /removing Nodes
> from Ambari console
> services can be added or removed from a particular node
2. Verifying health
> hdfs dfsadmin -report (hadoop dfsadmin -report on older releases)
3. Starting and stopping components
> Services like Pig, Hive, Sqoop etc can be started and stopped
4. Configuration
> multiple config files like
a. hadoop-env.sh - environment settings, eg: where Java is installed (JAVA_HOME)
b. core-site.xml - Hadoop core settings, such as I/O settings common to HDFS and MapReduce
c. hdfs-site.xml - HDFS settings: NameNode directory, secondary NameNode, DataNode, block size
Stop the affected service before changing its configuration.
M4: Hadoop Components
1. MapReduce
Data enters in an unstructured format; the Map step converts it into key and value pairs
2. Pig and Hive
Both convert a high-level language into MapReduce programs
Pig:
> 2 execution environments
Local (interpreter) & Distributed
Can be used interactively from the Grunt prompt
Can be run from a script
Hive:
3. Flume:
Originally developed by Cloudera
eg: collecting logs from all nodes and moving them to persistent storage
tail(access.log) --> HDFS
AGENT | COLLECTOR | HDFS
4. Sqoop:
Transfers data between Hadoop and relational databases
5. Oozie:
- Manages workflows of Hadoop jobs
- Used to control Hadoop jobs
- Workflows are defined in hPDL (an XML Process Definition Language)