Below are the notes I took while taking the course "Spark Fundamentals I" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference; they may be helpful to others as well.
M1: Introduction to Spark
Spark is an execution engine.
MapReduce gets the job done, but takes more time.
MapReduce has a high learning curve.
MapReduce relies on disk I/O between stages.
Speed:
In-memory computation
Generality:
Iterative and interactive algorithms
Ease of Use:
APIs for Scala, Java and Python
Spark can run on top of
* HDFS
* Apache Mesos (Same principles as Linux kernel only at a different level of abstraction)
* Cloud
Why Spark?
Provides parallel distributed processing and fault tolerance on commodity hardware
Spark unified stack
-----------------------------------------------------------------
| Spark SQL | Spark Streaming | MLlib (Machine Learning) | GraphX |
-----------------------------------------------------------------
|                          Spark Core                            |
-----------------------------------------------------------------
|      Standalone Scheduler      |      YARN      |     Mesos    |
-----------------------------------------------------------------
* Spark Core provides scheduling, distributing and monitoring of applications across a cluster
* Spark SQL is designed to work with Spark via SQL and HiveQL
* Spark Streaming provides processing of live streams of data
Resilient Distributed Dataset:
* Spark's primary abstraction
* Two types of RDD operations (see the sketch below)
1. Transformations - build up a DAG (Directed Acyclic Graph); lazily evaluated; return a new RDD, but nothing is computed yet
2. Actions - nothing is executed until an action is encountered; once it is, all the preceding transformations are performed
* The fault-tolerance aspect of an RDD allows Spark to reconstruct it by replaying the transformations used to build it
* Caching is in memory by default; if there is not enough memory, Spark spills to disk
> Spark jobs can be written in Scala, Python and Java, and can also be run interactively in the Spark shell
> Spark's native language is Scala
> Functions are objects in Scala; even integers and booleans are objects - everything is an object in Scala
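A minimal sketch of the transformation/action behaviour described above, written for the Spark shell (where a SparkContext named sc already exists); the numbers are made up for illustration:

    // Transformations only build the DAG - nothing runs yet
    val numbers = sc.parallelize(1 to 1000)     // create an RDD
    val doubled = numbers.map(_ * 2)            // transformation (lazy)
    val evens   = doubled.filter(_ % 4 == 0)    // transformation (lazy)

    // An action triggers execution of the whole lineage
    val howMany = evens.count()                 // action: runs the job
    val sample  = evens.take(5)                 // another action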
M2: Resilient Distributed Dataset
* Spark's primary abstraction
* RDDs are fault tolerant and can be operated on in parallel
* They are immutable
* Three methods of creating an RDD (example below)
1. Parallelizing an existing collection
2. Referencing a dataset in HDFS, Cassandra, HBase, etc.
3. Transforming an existing RDD
* Supported file types
Text files, SequenceFiles & any Hadoop InputFormat
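A sketch of the three creation methods, assuming the Spark shell and a hypothetical HDFS path:

    // 1. Parallelizing an existing collection
    val fromCollection = sc.parallelize(List("a", "b", "c"))

    // 2. Referencing an external dataset (the path is a placeholder)
    val fromFile = sc.textFile("hdfs:///data/input.txt")

    // 3. Transforming an existing RDD into a new one
    val fromTransformation = fromFile.filter(_.nonEmpty)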
RDD persistence
Two APIs
1. persist()
> options to persist in memory only, on disk, serialized, memory and disk, etc.
2. cache() - shorthand for persisting in memory only (sketch below)
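A sketch of the two persistence APIs; the StorageLevel values come from the standard Spark API and the file path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input.txt")

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    lines.cache()

    // persist() lets us choose the storage level explicitly, e.g. spill to
    // disk when memory is not enough. An RDD's storage level can only be
    // set once, so this would be used instead of cache(), not after it:
    // lines.persist(StorageLevel.MEMORY_AND_DISK)

    lines.count()       // the first action materialises the cache
    lines.unpersist()   // release the storage when done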
Shared variable:
Two types (example below)
1. Broadcast variable
A read-only copy cached on each machine
2. Accumulator
Used for counts and sums; workers can only add to it
Only the driver can read the accumulator value
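A sketch of both shared-variable types, using the Spark 1.x accumulator API; the data and lookup table are made up:

    // Broadcast variable: a read-only copy shipped once to each machine
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Accumulator: workers can only add to it; only the driver reads it
    val misses = sc.accumulator(0)

    val keys = sc.parallelize(Seq("a", "b", "c", "a"))
    keys.foreach { k =>
      if (!lookup.value.contains(k)) misses += 1
    }
    println("keys not found: " + misses.value)   // read on the driver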
M3: Spark Application Programming
* SparkContext is the main entry point (sketch below)
* Used to create RDDs, accumulators and broadcast variables
* Spark 1.1.1 depends on Scala 2.10
* Spark 1.1.1 works with Python 2.6 or higher (but not Python 3)
* Spark 1.1.1 works with Java 6 or higher
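A sketch of creating the SparkContext in a standalone application (in the Spark shell one already exists as sc); the application name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)

    // The SparkContext is the entry point for RDDs,
    // accumulators and broadcast variables
    val rdd = sc.parallelize(1 to 10)
    val acc = sc.accumulator(0)
    val bc  = sc.broadcast("shared value")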
Passing functions to Spark (sketch below):
1. Anonymous functions -> useful for one-time use
2. Static methods in a global singleton object
3. Passing a function by reference
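A sketch of these styles, assuming an existing RDD named rdd; the MyFunctions object is hypothetical and follows the pattern in the Spark programming guide:

    // 1. Anonymous function, defined inline for one-time use
    rdd.map(x => x * 2)

    // 2. Static method: a function defined in a global singleton object
    object MyFunctions {
      def double(x: Int): Int = x * 2
    }
    rdd.map(MyFunctions.double)

    // 3. Passing a reference to a method of a class instance also works,
    //    but ships the whole enclosing object to the cluster.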
Package the code into a .jar (Scala/Java) or a .py/.zip file (Python)
Use the spark-submit command to run the applications (example below)
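Hedged examples of launching a packaged application with spark-submit; the class name, file names and master URLs are placeholders:

    # Scala/Java application packaged as a JAR
    spark-submit --class com.example.MyApp --master spark://host:7077 my-app.jar

    # Python application
    spark-submit --master local[4] my_app.py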
M4: Introduction to Spark libraries
* Spark SQL - SchemaRDD
* Spark Streaming
* MLlib
* GraphX
All these libraries are extensions of Spark Core
Spark SQL:
* The SQLContext is created from a SparkContext
* SchemaRDD
* Two methods to infer the schema for an RDD (sketch below)
1. Using reflection - useful when we know the schema beforehand
2. Programmatic interface to construct a schema - useful when the schema is known only at runtime
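A sketch of the reflection-based approach, using the Spark 1.x SQLContext/SchemaRDD API; the file name and columns are made up:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

    // The case class defines the schema via reflection
    case class Person(name: String, age: Int)

    val people = sc.textFile("people.txt")        // lines of "name,age"
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerTempTable("people")
    val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teens.collect().foreach(println)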
Spark Streaming
* Processes live streaming data in small batches
* It can receive data from sources including Kafka, Flume, HDFS, Kinesis or Twitter (sketch below)
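A minimal sketch of micro-batch processing with the Spark 1.x streaming API; the socket host and port are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

    // 1-second micro-batches, reusing the existing SparkContext sc
    val ssc = new StreamingContext(sc, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()              // start receiving and processing
    ssc.awaitTermination()   // wait for the streaming computation to stop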
Spark Libraries:
* MLlib
Algorithms and utilities for (example after this list)
1. Classification
2. Regression
3. Clustering
4. Collaborative filtering
5. Dimensionality reduction
* GraphX
Graph processing, e.g. for social networks and language modelling
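As an example of the clustering algorithms listed above, a sketch of MLlib's KMeans; the input path, k and iteration count are made up:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each input line is assumed to hold space-separated numeric features
    val features = sc.textFile("hdfs:///data/features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val model = KMeans.train(features, 2, 20)   // k = 2 clusters, 20 iterations
    val cost  = model.computeCost(features)     // within-set sum of squared errors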
M5: Spark Configuration, monitoring, and tuning
Components:
1. Driver - where the SparkContext lives
2. Cluster manager (master) - can be Spark standalone, Apache Mesos or Hadoop YARN
3. Executors - run on the worker nodes
The driver sends the application (JAR or Python file) and tasks to each executor.
Applications cannot share data with each other; to share data, it has to be externalized (e.g. written to external storage).
* Three locations for configuration
1. Spark properties - where application parameters can be set
2. Environment variables - used for per-machine setup such as IP addresses
3. Logging properties
SPARK_HOME/conf - default conf directory
Spark properties can be set either statically or dynamically
A SparkConf object is passed in when creating the SparkContext
Precedence of Spark configuration (highest first, sketch below):
1. Properties set directly on the SparkConf
2. Flags passed to spark-submit or spark-shell
3. Finally, values from spark-defaults.conf
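A sketch of the three precedence levels, with hypothetical values:

    import org.apache.spark.{SparkConf, SparkContext}

    // 1. Highest priority: properties set directly on the SparkConf in code
    val conf = new SparkConf()
      .setAppName("ConfiguredApp")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    // 2. Next: flags passed on the command line, e.g.
    //      spark-submit --executor-memory 2g --conf spark.ui.port=4041 ...
    // 3. Lowest: defaults read from SPARK_HOME/conf/spark-defaults.conf, e.g.
    //      spark.executor.memory  2g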
Monitoring:
1. From Web UI
Application Web UI
http://<driver>:4040
2. Metrics
Based on the Coda Hale Metrics library
Configured in conf/metrics.properties
3. External instrumentation
Tuning:
1. Data Serialization
- crucial for network performance
- Java serialization (the default)
- Kryo serialization -> faster and more compact than Java serialization, but does not support all types (sketch at the end of this section)
2. Memory Tuning
- Cost of accessing objects
- Overhead of garbage collection (GC)
Improving performance
1. Level of parallelism
2. Memory usage of reduce tasks
3. Broadcasting large variables
4. Serialization of each task
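A sketch of switching to Kryo serialization, as suggested under Data Serialization above; the application name and registrator class are hypothetical:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("TunedApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // Registering application classes with Kryo is optional but helps
    // performance; in Spark 1.x this can be done via a custom registrator:
    // conf.set("spark.kryo.registrator", "com.example.MyKryoRegistrator")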