Tuesday 18 April 2017

Spark Fundamentals

Below are the notes I took while taking the course "Spark Fundamentals I" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference; they may be helpful to others too.

M1: Introduction to Spark


Spark is an execution engine

MapReduce does the job but takes more time
MapReduce has a steep learning curve
MapReduce writes intermediate results to disk, which slows it down

Speed:
In-memory computation
Generality:
Interactive algorithms
Ease of Use:
APIs for Scala, Java and Python


Spark can run on top of 
* HDFS
* Apache Mesos (same principles as the Linux kernel, only at a different level of abstraction)
* Cloud

Why Spark?
Provides parallel distributed processing and fault tolerance on commodity hardware


Spark unified stack


+------------+-----------------+------------------+---------+
| Spark SQL  | Spark Streaming | MLlib            | GraphX  |
|            |                 | Machine Learning |         |
+------------+-----------------+------------------+---------+
|                        Spark Core                         |
+--------------------+----------------+---------------------+
| Standalone         |      YARN      |        Mesos        |
| Scheduler          |                |                     |
+--------------------+----------------+---------------------+

* Spark core provides scheduling, distributing and monitoring of applications
* Spark SQL is designed to work with Spark via SQL and HiveQL
* Spark streaming provides processing of live streams of data



Resilient Distributed Dataset:

* Spark's primary abstraction
* Two types of RDD operations
  1. Transformations - create the DAG (Directed Acyclic Graph), are lazily evaluated, and return a new RDD rather than a computed value
  2. Actions - nothing is executed until an action operation comes; once the action is encountered, all the pending transformations are performed (see the sketch after this list)

* The fault-tolerance aspect of an RDD allows it to reconstruct the transformations used (its lineage)
* Caching is provided in memory; if there is not enough memory, it will spill to disk
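
A minimal sketch of this lazy-evaluation flow in Scala (assuming the Spark shell, where sc is the SparkContext; the HDFS path is made up):

  // Transformations only build up the DAG; no data is read yet
  val lines  = sc.textFile("hdfs:///data/app.log")           // hypothetical input path
  val errors = lines.filter(line => line.contains("ERROR"))  // transformation, lazy

  // The action triggers execution of the whole chain
  val numErrors = errors.count()                             // action, runs the job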

> Spark jobs can be written in Scala, Python and Java, and also run interactively on the Spark shell
> Spark's native language is Scala
> Functions are objects in Scala; even integers and booleans are objects. Everything is an object in Scala

M2: Resilient Distributed Dataset


* Spark's primary abstraction
* Fault tolerant and can be operated on in parallel
* They are immutable
* 3 methods of creating an RDD (sketched after this list):
  1. Parallelizing an existing collection
  2. Referencing a dataset in external storage such as HDFS, Cassandra, HBase etc.
  3. Transforming an existing RDD
* Supported file types:
  Text files, SequenceFiles & any other Hadoop InputFormat
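
The three creation methods, sketched in Scala against the shell's sc (the file path is made up):

  // 1. Parallelizing an existing collection
  val nums = sc.parallelize(1 to 1000)

  // 2. Referencing an external dataset (a text file on HDFS here)
  val logs = sc.textFile("hdfs:///data/access.log")

  // 3. Transforming an existing RDD
  val evens = nums.filter(_ % 2 == 0)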

RDD persistence
Two APIs (sketched below)
1. Persist
> Options to persist in memory only, on disk only, serialized, memory and disk, etc.
2. Cache
> Shorthand for persisting with the default storage level (memory only)
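
A sketch of the two APIs in Scala (the path and filter are made up; MEMORY_AND_DISK is one of the storage-level options mentioned above):

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs:///data/access.log")

  // cache() is shorthand for persist with the default memory-only level
  val cached = logs.cache()

  // persist() takes an explicit storage level, e.g. spill to disk when memory is short
  val events = logs.filter(_.contains("EVENT")).persist(StorageLevel.MEMORY_AND_DISK)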

Shared variables:
Two types (sketched below)
1. Broadcast variable
  Read-only copy cached on each machine
2. Accumulator
  Used for counts and sums
  Only the driver can read the accumulator value
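
A sketch of both shared-variable types, using the Spark 1.x accumulator API that matches these notes (the lookup table and input path are made up):

  // Broadcast: read-only lookup table shipped once to each executor
  val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

  // Accumulator: executors only add to it; only the driver reads the value
  val badRecords = sc.accumulator(0)

  sc.textFile("hdfs:///data/records.csv").foreach { line =>
    if (!countryCodes.value.contains(line.split(",")(0))) badRecords += 1
  }
  println("Bad records: " + badRecords.value)   // read back on the driver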


M3: Spark Application Programming


* SparkContext is the main entry point (see the sketch after this list)
* Used to create RDDs, accumulators and broadcast variables
* Spark 1.1.1 is built against Scala 2.10
* Spark 1.1.1 works with Python 2.6 or higher (but not Python 3)
* Spark 1.1.1 works with Java
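
A minimal sketch of creating the entry point in Scala (the app name and master are placeholder values):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MySparkApp")    // placeholder application name
    .setMaster("local[2]")       // local mode with 2 threads, for illustration only
  val sc = new SparkContext(conf)

  // sc can now create RDDs, accumulators and broadcast variables
  val data = sc.parallelize(Seq(1, 2, 3))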

Passing functions to Spark (sketched below):
1. Anonymous functions -> useful for one-time use
2. Static methods
3. Passing by reference
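
A sketch of the first two styles in Scala (MyFunctions is a hypothetical helper object):

  val rdd = sc.parallelize(1 to 10)

  // 1. Anonymous function, handy for one-off use
  val doubled = rdd.map(x => x * 2)

  // 2. Static method in a global singleton object
  object MyFunctions {
    def triple(x: Int): Int = x * 3
  }
  val tripled = rdd.map(MyFunctions.triple)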

Package the code into a .jar (Scala/Java) or a .py/.zip file (Python)

Use the spark-submit command to run the application


M4: Introduction to Spark libraries


* Spark SQL - SchemaRDD
* Spark Streaming
* MLlib
* GraphX
All these libraries are extensions of Spark Core

Spark SQL:
* The SQLContext is created from the SparkContext
* SchemaRDD
* Two methods to infer the schema of an RDD (the first is sketched below)
1. Using reflection, useful when we know the schema beforehand
2. A programmatic interface to construct a schema, useful when we know the schema only at runtime
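
A sketch of the reflection-based approach against the Spark 1.x SchemaRDD API these notes describe (the Person case class and people.txt file are made up):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD        // implicit RDD -> SchemaRDD conversion

  // Reflection: the schema is known beforehand, encoded in a case class
  case class Person(name: String, age: Int)

  val people = sc.textFile("people.txt")   // hypothetical "name,age" lines
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

  people.registerTempTable("people")
  val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
  adults.collect().foreach(println)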

Spark Streaming
* To process live streaming data in small batches
* It can receive data from sources including Kafka, Flume, HDFS, Kinesis or Twitter (a socket-based sketch follows)
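
A minimal sketch of micro-batch processing in Scala, using a socket text stream as a stand-in for the sources listed above (host, port and batch interval are placeholders):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Process the live stream in 10-second micro-batches
  val ssc = new StreamingContext(sc, Seconds(10))

  val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
  lines.filter(_.contains("ERROR")).print()             // runs on every batch

  ssc.start()
  ssc.awaitTermination()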

Spark Libraries:
* MLlib
Algorithms and utilities for 
1. Classification
2. Regression
3. Clustering
4. Collaborative Filtering
5. Dimensionality reduction

* GraphX:
Graph processing for social networking and language modelling


M5: Spark Configuration, monitoring, and tuning


Components:
1. Driver - where the SparkContext lives
2. Cluster manager - can be the standalone manager, Apache Mesos or Hadoop YARN
3. Executors - run on the worker nodes

The driver sends the application (JAR or Python file) and tasks to each executor.
Applications cannot share data with each other; to share data it has to be written to external storage.

* Three configuration locations
1. Spark properties - where application parameters can be set
2. Environment variables - used for per-machine settings such as IP addresses
3. Logging properties

SPARK_HOME/conf - default conf directory

We can set Spark properties either statically or dynamically (see the sketch after this list)
The SparkConf is passed in when creating the SparkContext object

Priority of Spark configuration
1. Properties set directly on the SparkConf
2. Flags passed to spark-submit or spark-shell
3. Finally, values from spark-defaults.conf
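
A sketch of the static route in Scala; the dynamic route passes the same keys as flags to spark-submit (the property values are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Static: properties set directly on SparkConf (highest priority)
  val conf = new SparkConf()
    .setAppName("ConfiguredApp")               // placeholder name
    .set("spark.executor.memory", "2g")        // example property
  val sc = new SparkContext(conf)

  // Dynamic, for comparison: spark-submit --conf spark.executor.memory=2g ...
  // Anything set in neither place falls back to spark-defaults.conf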

Monitoring:
1. From Web UI
Application Web UI
http://<driver>:4040
2. Metrics
Based on the Coda Hale Metrics library
Configured in conf/metrics.properties
3. External instrumentations

Tuning:
1. Data serialization
  - Crucial for network performance
  - Java serialization (the default)
  - Kryo serialization -> faster than Java serialization but does not support all types (see the sketch after this list)
2. Memory tuning
  - Cost of accessing the objects
  - Overhead of GC
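
A minimal sketch of switching the serializer to Kryo via a Spark property (the app name is a placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("KryoTuning")   // placeholder name
    // Replace the default Java serialization with Kryo
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")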

Improving performance
1. Level of parallelism
2. Memory usage of reduce tasks
3. Broadcasting large variables
4. Serialized RDD storage

Hadoop Introduction

Below are the notes I took while taking the course "Hadoop 101" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference; they may be helpful to others too.

M1: Intro to Hadoop


What is Hadoop?

* A framework developed in Java for processing structured, semi-structured and unstructured data on commodity compute nodes.
* Not suitable for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP)
* Not a replacement for an RDBMS
* Facebook processes 600 GB of data
* Twitter processes 7 TB every day
* 80% of data is unstructured

Hadoop related opensource projects:

* Eclipse
* Apache Lucene - full-text search engine written in Java
* HBase - the Hadoop database
* Hive - SQL-like language for querying data in Hadoop
* Pig - high-level language
* ZooKeeper - coordination service, e.g. for managing the NameNodes
* Spark - execution engine
* Apache Ambari - web UI for managing & monitoring the Hadoop cluster
* Apache Avro - for data serialization
* Apache UIMA - Unstructured Information Management Applications

LabWork
-------

M2: Hadoop Architecture & HDFS


Terminologies:

* Each node is a commodity computer
* A rack is a collection of 30 to 40 nodes
* A Hadoop cluster is a collection of racks
* Bandwidth within a rack is higher than between racks

Pre-2.2 architecture:

Two components
1. Distributed File System or HDFS - for storing the data
>NameNode and DataNodes
>HDFS runs on top of the existing file system
>No random access
>Default block size is 128 MB
>Blocks are replicated across the cluster

2. MapReduce Engine - a framework for processing the data
>JobTracker and TaskTrackers
>Single JobTracker
>MapReduce is based on Google's MapReduce paper


Hadoop 2.2

* Provides YARN (Yet Another Resource Negotiator)
>Referred to as MapReduce v2
>ResourceManager and Scheduler are introduced
>NameNode and DataNodes still exist
>No JobTracker and TaskTracker
>The ApplicationMaster does the resource negotiation
>The NameNode was a single point of failure in Hadoop 1.1

HDFS Replication:
-----------------
> E.g. with a replication factor of 3:
The first copy of a block is placed on a node in Rack 1, the second replica is placed on a data node in a different rack, e.g. Rack 2, and the third replica is placed on another node in the same rack as the second replica, in this case Rack 2.

HDFS COMMAND LINE
-----------------
hadoop fs <args>
hdfs dfs <args>
eg: hdfs dfs -ls

copyFromLocal / put
copyToLocal / get
getmerge
setrep - for setting the replication factor

hadoop fs -help

M3: Hadoop Administration


1. Adding / removing nodes
> from the Ambari console
> services can be added or removed from a particular node
2. Verifying health
> hdfs dfsadmin -report
3. Starting and stopping components
> Services like Pig, Hive, Sqoop etc. can be started and stopped
4. Configuration
> multiple config files, e.g.:
a. hadoop-env.sh - specifies where Java is installed (JAVA_HOME)
b. core-site.xml - Hadoop core settings such as I/O settings common to HDFS and MapReduce
c. hdfs-site.xml - configuration of the NameNode directory, secondary NameNode, DataNodes and block size

Before changing, we need to stop the service.

M4: Hadoop Components


1. MapReduce
Data enters in an unstructured format; the Map phase converts it into key-value pairs (e.g. each word in a document becomes (word, 1))

2. Pig and Hive
Both convert a high-level language into MapReduce programs
Pig:
> 2 execution environments:
Local (interpreter) & distributed
We can use the Grunt prompt (interactive) or scripts
Hive:
> HiveQL, a SQL-like language, is compiled into MapReduce jobs

3. Flume:
By Cloudera
e.g. collecting logs from all nodes and moving them to persistent storage
tail(access.log) --> HDFS

AGENT   | COLLECTOR | HDFS

4. Sqoop:
Transfers data between Hadoop and relational databases

5. Oozie:
- Manages workflows (stored in HDFS)
- Used to control Hadoop jobs
- Workflows are defined in hPDL (an XML Process Definition Language)

Tuesday 11 April 2017

Big Data Introduction

Below are the notes I took while taking the course "Big Data Foundations - Level 1" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference; they may be helpful to others too.


Big Data Adoption

State:
Data coming from:
transactions or log data
Clinical data, mobile cameras - 70% unstructured
Educate > Explore > Engage > Execute

Source of Big Data:
A good data scientist:
picks the right problem for the organization rather than just solving a given problem


Data is the new oil: a natural resource that keeps growing.

Sensors are an important contributor to data:
4 aircraft engines generate 1 PB of data on a flight from LON to SIN
CERN generates 1 GB per second
A radio telescope generates 20,000 PB/day


Big data with examples:
-----------------------------
Holistic approach:
What do we want to achieve?
Collect traffic data to avoid congestion and to find the best time and method to travel.

Intelligent: handles data streams
Instrumented: gathers data from different sources
Interconnected: can handle both structured and unstructured data

Case 1: 
Political debate:
Predict public sentiment.
Process > filter > analyze
Connected to the Twitter firehose (100% of tweets) vs the public API (1%)
Analyze each tweet and aggregate

NLP: 
the Penn Treebank project, Stanford,
Carnegie Mellon University

public opinion vs political analysts



Characteristics of BigData:


What and How?
new insight from untouched data
New tools to do more analysis

Four Vs
Volume, Velocity, Variety, Veracity

BigData is not only Hadoop.

Governance:
Can we trust the source?

NoSQL - Not Only SQL
NoHadoop - Not Only Hadoop

Big Data is a platform and not a software.

Data Warehouse > provided OLAP (Online Analytic Processing)
Stream Computing > Real-time analytic processing (RTAP)

Hadoop - software for structured and unstructured data

Accelerators are software libraries above the data warehouse, Hadoop and stream layers


The BigData platform


Key aspect of BigData platform:
----------------------------------------
1 Integration: it is more than just one technology
2 Analytics
3 Visualization: tools
4 Development tools for the engines
5 Optimization
6 Security and Governance

RTAP (Real-time analytic processing) & HDFS (long-term storage) store the results in the warehouse zone

Governance for BigData
-------------------------------
Growing variety and volume make the data difficult to manage

InfoSphere: IBM product

Analytics must be driven by trusted data

Different data requires different types of governance

True insight requires confidence in data.

The fourth V, Veracity, means truthfulness and confidence


High-value Big Data Use Cases


Sweet spots
1. Big data Exploration
2. Enhanced CRM - for cross-sell and up-sell
3. Security/Intelligence extension
   * improve intelligence & law enforcement
   * find patterns
4. Operations analysis
   * analyse large volumes of multi-structured data in motion; integrating it with existing enterprise data requires a large amount of analysis
5. Data warehouse augmentation
   * builds on top of the existing data warehouse to leverage big data
   * Hadoop as a source for the data warehouse

Examples:
1. Airbus:
Data Explorer is the starting point

Multiple Hypothesis tracking (MHT)

2. Terra Echos
Streams and Hadoop are the starting point

3. Cisco
Intelligence infrastructure monitoring
log analytics
Energy bill forecasting


Technical Details of Big Data Component


Sentiment from social media, machine logs, call center logs, email, financial services etc.

AQL - Annotation Query Language (an AQL program tells what needs to be done rather than how it needs to be done)
-->
Text Analytics Optimizer (compiles the AQL, optimizes it and generates an execution plan)
-->
Text Analytics Runtime (analyzes the stream or document)


Tuples go to operators for execution
Filter -> Transform -> Annotate
Correlate (join from multiple sources) -> Classify (using training data)

Stream softwares:
----------------------
IBM Infosphere Streams
Storm - Twitter
S4 - Yahoo
Apache Spark
Samza - Linked In
Kinesis - Amazon

IBM's Hadoop distribution - BigInsights

SQL for Hadoop (ANSI 92 support)
* Hive
* Impala (Cloudera)
* Big SQL (IBM)
* Stinger (Hortonworks)
* Drill (MapR)
* HAWQ (Pivotal)
* SQL-H (Teradata)

Improvements:
In Multimedia Analysis

Monday 10 April 2017

Modifying the Namespace prefix while marshaling the JAXB object

There are times when we want meaningful names for the namespace prefixes generated by the JAXB marshalling API. By default, for any QName that does not have a prefix attribute, JAXB generates prefixes of the form 'ns<number>', e.g. ns0, ns1 etc.

To achieve this, there is a property called MarshallerProperties.NAMESPACE_PREFIX_MAPPER ("eclipselink.namespace-prefix-mapper"); this applies to the EclipseLink MOXy JAXB implementation, and the property name varies for other JAXB providers. The value for this key must be an instance of the org.eclipse.persistence.oxm.NamespacePrefixMapper class.


  import java.util.HashMap;
  import java.util.Map;
  import javax.xml.bind.JAXBContext;
  import javax.xml.bind.Marshaller;
  import org.eclipse.persistence.jaxb.MarshallerProperties;
  import org.eclipse.persistence.oxm.NamespacePrefixMapper;

  JAXBContext jc = JAXBContext.newInstance("com.mypackage.jaxb");
  Marshaller m = jc.createMarshaller();
  m.setProperty(MarshallerProperties.NAMESPACE_PREFIX_MAPPER,
                new NamespacePrefixMapper() {
                    // Map to remember the prefix assigned to each namespace URI
                    Map<String, String> namespacePrefixMap = new HashMap<>();

                    @Override
                    public String getPreferredPrefix(String namespaceUri, String suggestion, boolean requirePrefix) {
                        String prefix = namespacePrefixMap.get(namespaceUri);
                        if (null == prefix) {
                            // Build a custom prefix instead of the default ns0, ns1, ...
                            prefix = "myprefix" + suggestion;
                            namespacePrefixMap.put(namespaceUri, prefix);
                        }
                        return prefix;
                    }
                });