Tuesday 18 April 2017

Spark Fundamentals

Below are the notes I took while following the course "Spark Fundamentals I" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference, and they may be helpful to others.

M1: Introduction to Spark


Spark is an execution engine

MapReduce gets the job done but takes more time
MapReduce has a steep learning curve
MapReduce relies on disk between stages

Speed:
In-memory computation
Generality:
Interactive algorithms
Ease of Use:
APIs for Scala, Java, Python


Spark can run on top of 
* HDFS
* Apache Mesos (same principles as the Linux kernel, only at a different level of abstraction)
* Cloud

Why Spark?
Provides parallel distributed processing and fault tolerance on commodity hardware


Spark unified stack


--------------------------------------------------------------------
| Spark SQL | Spark Streaming | MLlib (Machine Learning) |  GraphX |
--------------------------------------------------------------------
|                            Spark Core                            |
--------------------------------------------------------------------
|   Standalone Scheduler   |       YARN        |       Mesos       |
--------------------------------------------------------------------

* Spark Core provides scheduling, distributing and monitoring of applications
* Spark SQL is designed to work with Spark via SQL and HiveQL
* Spark Streaming provides processing of live streams of data



Resilient Distributed Dataset:

* Spark's primary abstraction
* Two types of RDD operations (see the sketch after this list)
  1. Transformations - create a DAG (Directed Acyclic Graph), are lazily evaluated, and return a new RDD rather than a value
  2. Actions - nothing is executed until an action operation comes; once the action is encountered, all the preceding transformations are performed

* The fault-tolerance aspect of RDDs allows them to be rebuilt by reconstructing the transformations used
* Caching is provided in memory; if there is not enough memory, disk is used
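
A minimal sketch of the transformation/action distinction, assuming a live SparkContext named sc (as in the Spark shell); the file path and the length threshold are made up for illustration:

  // Transformations only describe the computation; nothing runs yet
  val lines    = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
  val lengths  = lines.map(line => line.length)         // transformation (lazy)
  val longOnes = lengths.filter(len => len > 80)        // transformation (lazy)

  // The action below triggers execution of the whole DAG built above
  val count = longOnes.count()                          // action
  println("Lines longer than 80 chars: " + count)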

> Spark jobs can be written in Scala, Python and Java, and can also be run interactively on the Spark shell
> Spark's native language is Scala
> Functions are objects in Scala; even integers and booleans are objects, everything in Scala is an object

M2: Resilient Distributed Dataset


* Spark's primary abstraction
* Is fault tolerant and can be operated on in parallel
* They are immutable
* 3 methods of creating an RDD (see the sketch below)
  1. Parallelizing an existing collection
  2. Referencing a dataset in HDFS, Cassandra, HBase, etc.
  3. Transformation from an existing RDD
* Supported files
  Text files, SequenceFiles & any Hadoop InputFormat
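
A sketch of the three creation methods, again assuming a SparkContext sc; the collection contents and the HDFS path are only illustrative:

  // 1. Parallelizing an existing collection
  val numbers = sc.parallelize(List(1, 2, 3, 4, 5))

  // 2. Referencing an external dataset (here a text file on HDFS; path is made up)
  val logs = sc.textFile("hdfs:///data/logs.txt")

  // 3. Transformation from an existing RDD
  val errors = logs.filter(line => line.contains("ERROR"))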

RDD persistence
Two APIs
1. persist()
> Option to persist in memory, on disk, serialized, memory and disk, etc. (chosen via a storage level)
2. cache()
> Shorthand for persist() with the default storage level (memory only)
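
A small sketch of the two APIs, reusing the RDDs from the previous example; StorageLevel is Spark's enumeration of persistence options:

  import org.apache.spark.storage.StorageLevel

  val logs   = sc.textFile("hdfs:///data/logs.txt")   // path for illustration
  val errors = logs.filter(line => line.contains("ERROR"))

  // persist() lets us choose the storage level explicitly
  errors.persist(StorageLevel.MEMORY_AND_DISK)

  // cache() is shorthand for persist() with the default level (MEMORY_ONLY)
  logs.cache()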

Shared variables:
Two types
1. Broadcast variables
  Read-only copy on each machine
2. Accumulators
  Used for counters and sums
  Only the driver can read the accumulator value; tasks can only add to it
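
A sketch of both shared-variable types with the Spark 1.x API; the lookup map and the record format are invented for the example:

  // Broadcast variable: a read-only lookup table shipped once to every executor
  val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

  // Accumulator: tasks can only add to it; only the driver reads the result
  val badRecords = sc.accumulator(0)

  val records = sc.parallelize(Seq("IN,100", "XX,7", "US,42"))
  records.foreach { rec =>
    val code = rec.split(",")(0)
    if (!countryNames.value.contains(code)) badRecords += 1
  }

  println("Bad records: " + badRecords.value)   // readable on the driver only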


M3: Spark Application Programming


* SparkContext is the main entry point
* Used to create RDDs, accumulators and broadcast variables
* Spark 1.1.1 depends on Scala 2.10
* Spark 1.1.1 works with Python 2.6 or higher, but not Python 3
* Spark 1.1.1 works with Java 6 or higher
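
As a sketch of the entry point in a standalone Scala application; the application name and the local master URL are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  object SimpleApp {
    def main(args: Array[String]): Unit = {
      // The configuration is handed to the SparkContext at construction time
      val conf = new SparkConf()
        .setAppName("SimpleApp")   // placeholder name
        .setMaster("local[2]")     // local mode with 2 threads, for illustration

      val sc = new SparkContext(conf)

      val data = sc.parallelize(1 to 100)
      println("Sum: " + data.reduce(_ + _))

      sc.stop()
    }
  }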

Passing functions to Spark:
1. Anonymous functions -> useful for one-time use
2. Static methods (e.g. in a global singleton object)
3. Passing by reference
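
Sketches of the first two styles; the RDD, the object and the helper method are invented for illustration:

  val lines = sc.textFile("hdfs:///data/logs.txt")   // path for illustration

  // 1. Anonymous function, defined inline for one-time use
  val upper = lines.map(line => line.toUpperCase)

  // 2. Static method in a global singleton object
  object TextUtils {
    def clean(line: String): String = line.trim.toLowerCase
  }
  val cleaned = lines.map(TextUtils.clean)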

Package the code into a .jar (Scala/Java) or into .py/.zip files (Python)

We should use the spark-submit command to run the applications


M4: Introduction to Spark libraries


* Spark SQL - SchemaRDD
* Spark Streaming
* MLlib
* GraphX
All these libraries are extensions of Spark Core

Spark SQL:
* The SQLContext is created from the SparkContext
* SchemaRDD
* Two methods to infer the schema for an RDD
1. Using reflection - useful when we know the schema beforehand
2. Programmatic interface to construct a schema - useful when we only know the schema at runtime
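
A sketch of the reflection-based method with the Spark 1.x API; the case class, the input file and the query are made up:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD   // implicit conversion RDD -> SchemaRDD

  // The case class defines the schema via reflection
  case class Person(name: String, age: Int)

  val people = sc.textFile("people.txt")            // e.g. "Alice,29" per line
    .map(line => line.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

  people.registerTempTable("people")

  val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
  adults.map(row => row(0)).collect().foreach(println)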

Spark Streaming
* To process live streams of data in small batches
* It can receive data from sources including Kafka, Flume, HDFS, Kinesis or Twitter
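
A sketch of a streaming word count over a TCP socket; the host, port and batch interval are placeholders (text can be fed with nc -lk 9999):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

  // Each micro-batch is processed as a small RDD
  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)
  counts.print()

  ssc.start()
  ssc.awaitTermination()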

Spark Libraries:
* MLlib
Algorithms and utilities for
1. Classification
2. Regression
3. Clustering
4. Collaborative Filtering
5. Dimensionality reduction
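
As one clustering example, a K-means sketch with the 1.x MLlib API; the toy points, the number of clusters and the iteration count are made up:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Toy 2-D points, purely for illustration
  val points = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
  ))

  val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations

  model.clusterCenters.foreach(println)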

* GraphX:
Graph processing, for use cases such as social networks and language modelling


M5: Spark Configuration, monitoring, and tuning


Components:
1. Driver - where the SparkContext is
2. Cluster Manager - can be the standalone cluster manager, Apache Mesos, or Hadoop YARN
3. Executor - runs on a worker node

The driver sends the application code (a JAR or Python files) and tasks to each executor.
Applications cannot share data with each other; if we want to share data, we need to externalize it.

* Three locations for configuration
1. Spark properties - where application parameters can be set
2. Environment variables - used for per-machine settings such as IP addresses
3. Logging properties

SPARK_HOME/conf is the default configuration directory

We can set the Spark properties either statically or dynamically
A SparkConf is passed in while creating the SparkContext object

Priority of Spark configuration (highest first):
1. Properties set directly on the SparkConf
2. Flags passed to spark-submit or spark-shell
3. Finally, values from spark-defaults.conf

Monitoring:
1. From Web UI
Application Web UI
http://<driver>:4040
2. Metrics
Based on Coda Hale
Configured in conf/metrics.properties
3. External instrumentation

Tuning:
1. Data Serialization
  - crucial for network performance
  - Java serialization
  - Kryo serialization -> faster and more compact than Java serialization, but does not support all types
2. Memory Tuning
  - Cost of accessing the object
  - Overhead of GC
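
Switching to Kryo is just a property on the SparkConf; a sketch, with the application name as a placeholder:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("TunedApp")   // placeholder
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  val sc = new SparkContext(conf)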

Improving performance
1. Level of parallelism
2. Memory usage of reduce tasks
3. Broadcasting large variables
4. Serialized size of each task
