Below are the notes I have taken while taking the course "Spark Fundamentals I" in https://bigdatauniversity.com. Keeping it on the blog for my own reference and it may be helpful to others.
Spark is an execution engine
MapReduce does the job but takes more time
MapReduce needs high learning curve
MapReduce uses disk
Speed:
In Memory computation
Generality:
Interactive algorithms
Ease of Use:
APIs for Scala, Java, python
Spark can run on top of
* HDFS
* Apache Mesos (Same principles as Linux kernel only at a different level of abstraction)
* Cloud
Why Spark?
Provides parallel distributed processing, fault-tolerance on commodity hardware
-------------------------------------------------------------
SPARK | Spark | MLib | GraphX |
SQL | Streaming | Machine Learning | |
--------------------------------------------------------------
| |
| SPARK Core |
-------------------------------------------------------------
| Standalone | YARN | Mesos |
| Scheduler | | |
--------------------------------------------------------------
* Spark core provides scheduling, distributing and monitoring of applications
* Spark SQL is designed to work with Spark via SQL and HiveQL
* Spark streaming provides processing of live streams of data
* Spark's primary abstraction
* Two types of RDD operations
1. Transformation - Creates DAG (Directed Ascyclic Graph), Lazy evaluation, No return value
2. Actions - Until the action operaration comes, nothing will be executed, once the action is encountered, all the transformations will be performed
* Fault tolerance aspect of RDD allow it to reconstruct the transformation used
* Caching is provided in-memory, if not enough, it will use disk
> Spark jobs can be written in Scala, python and java. ALso on Spark shell
> Spark's native language is Scala
> FUnctions are object in scala, even integer, boolean are also objects, everything is object in scala
* Spark's primary abstraction
* Is a fault tolerant and can be operated in parallel
* They are immutable
* 3 methods of creating RDD
1. Parallelizing existing collection
2. Referencing a dataset from HDFS, Cassandra, HBase etc
3. Transformation from existing RDD
* Supported file
Text file, sequence file & Hadoop inputFormat
RDD persistence
Two APIs
1. Persist
> Option to persist in memory, disk, serialize, memory and disk etc
2. Cache
Shared variable:
Two types
1. Broadcast variable
Readonly copy on each machine
2. Accumulator
used for count and sums
only the driver can read the accumulator value
* SparkContext is main entry point
* Used to create RDDs, accumulators and broadcast
* Spark 1.1.1 depends on scala 2.0
* Spark 1.1.1 with python 2.6 or higher < 3
* spark 1.1.1 with java
Passing functions to Spark:
1. Anonymous function -> useful for one time use
2. Static methods
3. Passing by reference
package the code into .jar (scala/java) or .py or .zip (python)
we should use spark-submit command to run the applications
* Spark SQL - Schema RDD
* Spark Streaming
* MLip
* GraphX
all these libraries are extenstion of Spark core
Spark SQL:
* The SQL context is created from SparkContext
* Schema RDD
* Two methods to infer the schema for RDD
1. Using Reflection, useful when we know the schema before hand
2. Programmatic interface to construct a schema, useful when we know the schema only at runtime
Spark Streaming
* To process live streaming data in small batches
* It receives data from including Kafka, Flume, HDFS, Kinesis or Twitter
Spark Libraries:
* MLib
Algo and utilities for
1. Classification
2. Regression
3. Clustering
4. Colaborative Filtering
5. Dimensionality reduction
*Graph X:
Graph processing for social networking and language modelling
Components:
1. Driver - where the Spark Context is
2. Cluster Master - Can be a standalone, Apache Mesos, Hadoop YARN
3. Executor - Work Node
Driver sends the application or python file to each executor and a task.
Applications cannot share data, if we want to share, we need to externalize it.
*Three Locations
1. Spark properties - Where application parameters can be set
2. Environment properties - used to set per machine setup like IP addresses
3. Loggine property
SPARK_HOME/conf - default conf directory
We can set the Spark properties either statically or dynamically
SparkConf is sent while creating SparkContext object
Priority of Spark conf
1. Setting prop directly on SparkConf
2. Flags passed during spark-submit or spark-shell
3. Finally from spark-defaults.conf
Monitoring:
1. From Web UI
Application Web UI
http://<driver>:4040
2. Metrics
Based on Coda Hale
Configured in conf/metrics.property
3. External instrumentations
Tuning:
1. Data Serialization
- crusial for network performance
- Java serializaion
- Kyro serialization -> best that Java but not support al type
2. Memory Tuning
- Cost of accessing the object
- Overhead of GC
Imporoving performance
1. Level of parallelism
2. Memory usage of reduce tasks
3. Broadcasting large variables
4. Serialized or each tasks
M1: Introduction to Spark
Spark is an execution engine
MapReduce does the job but takes more time
MapReduce needs high learning curve
MapReduce uses disk
Speed:
In Memory computation
Generality:
Interactive algorithms
Ease of Use:
APIs for Scala, Java, python
Spark can run on top of
* HDFS
* Apache Mesos (Same principles as Linux kernel only at a different level of abstraction)
* Cloud
Why Spark?
Provides parallel distributed processing, fault-tolerance on commodity hardware
Spark unified stack
-------------------------------------------------------------
SPARK | Spark | MLib | GraphX |
SQL | Streaming | Machine Learning | |
--------------------------------------------------------------
| |
| SPARK Core |
-------------------------------------------------------------
| Standalone | YARN | Mesos |
| Scheduler | | |
--------------------------------------------------------------
* Spark core provides scheduling, distributing and monitoring of applications
* Spark SQL is designed to work with Spark via SQL and HiveQL
* Spark streaming provides processing of live streams of data
Resilient Distributed Dataset:
* Spark's primary abstraction* Two types of RDD operations
1. Transformation - Creates DAG (Directed Ascyclic Graph), Lazy evaluation, No return value
2. Actions - Until the action operaration comes, nothing will be executed, once the action is encountered, all the transformations will be performed
* Fault tolerance aspect of RDD allow it to reconstruct the transformation used
* Caching is provided in-memory, if not enough, it will use disk
> Spark jobs can be written in Scala, python and java. ALso on Spark shell
> Spark's native language is Scala
> FUnctions are object in scala, even integer, boolean are also objects, everything is object in scala
M2: Resilient Distributed Dataset
* Spark's primary abstraction
* Is a fault tolerant and can be operated in parallel
* They are immutable
* 3 methods of creating RDD
1. Parallelizing existing collection
2. Referencing a dataset from HDFS, Cassandra, HBase etc
3. Transformation from existing RDD
* Supported file
Text file, sequence file & Hadoop inputFormat
RDD persistence
Two APIs
1. Persist
> Option to persist in memory, disk, serialize, memory and disk etc
2. Cache
Shared variable:
Two types
1. Broadcast variable
Readonly copy on each machine
2. Accumulator
used for count and sums
only the driver can read the accumulator value
M3: Spark Application Programming
* SparkContext is main entry point
* Used to create RDDs, accumulators and broadcast
* Spark 1.1.1 depends on scala 2.0
* Spark 1.1.1 with python 2.6 or higher < 3
* spark 1.1.1 with java
Passing functions to Spark:
1. Anonymous function -> useful for one time use
2. Static methods
3. Passing by reference
package the code into .jar (scala/java) or .py or .zip (python)
we should use spark-submit command to run the applications
M4: Introduction to Spark libraries
* Spark SQL - Schema RDD
* Spark Streaming
* MLip
* GraphX
all these libraries are extenstion of Spark core
Spark SQL:
* The SQL context is created from SparkContext
* Schema RDD
* Two methods to infer the schema for RDD
1. Using Reflection, useful when we know the schema before hand
2. Programmatic interface to construct a schema, useful when we know the schema only at runtime
Spark Streaming
* To process live streaming data in small batches
* It receives data from including Kafka, Flume, HDFS, Kinesis or Twitter
Spark Libraries:
* MLib
Algo and utilities for
1. Classification
2. Regression
3. Clustering
4. Colaborative Filtering
5. Dimensionality reduction
*Graph X:
Graph processing for social networking and language modelling
M5: Spark Configuration, monitoring, and tuning
Components:
1. Driver - where the Spark Context is
2. Cluster Master - Can be a standalone, Apache Mesos, Hadoop YARN
3. Executor - Work Node
Driver sends the application or python file to each executor and a task.
Applications cannot share data, if we want to share, we need to externalize it.
*Three Locations
1. Spark properties - Where application parameters can be set
2. Environment properties - used to set per machine setup like IP addresses
3. Loggine property
SPARK_HOME/conf - default conf directory
We can set the Spark properties either statically or dynamically
SparkConf is sent while creating SparkContext object
Priority of Spark conf
1. Setting prop directly on SparkConf
2. Flags passed during spark-submit or spark-shell
3. Finally from spark-defaults.conf
Monitoring:
1. From Web UI
Application Web UI
http://<driver>:4040
2. Metrics
Based on Coda Hale
Configured in conf/metrics.property
3. External instrumentations
Tuning:
1. Data Serialization
- crusial for network performance
- Java serializaion
- Kyro serialization -> best that Java but not support al type
2. Memory Tuning
- Cost of accessing the object
- Overhead of GC
Imporoving performance
1. Level of parallelism
2. Memory usage of reduce tasks
3. Broadcasting large variables
4. Serialized or each tasks