As part of this topic we will understand the different modules, the Spark architecture, and how it maps to different execution modes such as YARN, Mesos etc.
Spark is nothing but a distributed computing framework. To leverage the framework we need to learn the APIs, which are categorized into different modules, and build applications using the supported programming languages (such as Scala, Python, Java etc.).
Spark Framework and Execution Modes
Quick Review Of APIs
Spark Modules
In earlier versions of Spark we have the core API at the bottom and all the higher level modules work with the core API. Examples of core APIs are map, reduce, join, groupByKey etc. With Spark 2, Data Frames and Spark SQL are becoming the core module. The modules are listed below, followed by a short example contrasting the two styles.
Core – Transformations and Actions – APIs such as map, reduce, join, filter etc. They typically work on RDDs
Spark SQL and Data Frames – APIs and the Spark SQL interface for batch processing on top of Data Frames or Data Sets (the Data Set API is not available in Python)
Structured Streaming – APIs and the Spark SQL interface for stream data processing on top of Data Frames
Machine Learning Pipelines – Machine Learning data pipelines to apply Machine Learning algorithms on top of Data Frames
GraphX – APIs for graph processing
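To make the distinction concrete, here is a minimal sketch contrasting the core API with Spark SQL and Data Frames. It assumes the pyspark shell on Spark 2, where `sc` (SparkContext) and `spark` (SparkSession) are already available; the sample data and view name are just illustrations.

# Core API: transformations and actions (map, filter, reduceByKey etc.) on an RDD
rdd = sc.parallelize([("2013-07-25", 100.0), ("2013-07-25", 50.0), ("2013-07-26", 25.0)])
rdd.reduceByKey(lambda x, y: x + y).collect()

# Spark SQL and Data Frames: the same aggregation with structure on top
df = spark.createDataFrame(rdd, ["order_date", "revenue"])
df.groupBy("order_date").sum("revenue").show()

# Or register a temporary view and use the Spark SQL interface
df.createOrReplaceTempView("daily_revenue")
spark.sql("SELECT order_date, sum(revenue) AS revenue FROM daily_revenue GROUP BY order_date").show()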
Spark Data Structures
We need to deal with two types of data structures in Spark – RDDs and Data Frames. We will see the data structures in detail as part of the next topic.
RDD has been around for quite some time; it is the low level data structure which Spark uses to distribute data between tasks while the data is being processed
An RDD is divided into partitions while the data is being processed; each partition is processed by one task
A Data Frame is nothing but an RDD with structure (named columns)
Typically we read data from file systems such as HDFS, S3, Azure Blob Storage, the local file system etc.
Based on the file format we need to use different APIs available in Spark to read data into an RDD or a Data Frame (see the sketch after this list)
Spark uses HDFS APIs to read and/or write data from the underlying file system
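Here is a minimal sketch of reading data into an RDD versus a Data Frame from the pyspark shell, assuming `sc` and `spark` are available; the commented paths are hypothetical and only illustrate the format specific readers.

# Low level: read a text file into an RDD of lines
rdd = sc.textFile("/public/randomtextwriter/part-m-00000")

# Higher level: spark.read exposes format specific readers that return Data Frames
text_df = spark.read.text("/public/randomtextwriter/part-m-00000")
# json_df = spark.read.json("/path/to/json/files")   # hypothetical path
# orc_df = spark.read.orc("/path/to/orc/files")      # hypothetical path

# Either way, Spark talks to the underlying file system through HDFS APIs
text_df.printSchema()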
Simple Application
Let us start with a simple application to understand the details related to the architecture using pyspark.
As we have multiple versions of Spark on our lab and we are exploring Spark 2, we need to export SPARK_MAJOR_VERSION as 2
Launch pyspark using YARN with num-executors 2 (use spark.ui.port as well to specify a unique port)
Develop a simple word count program by reading data from /public/randomtextwriter/part-m-00000 (see the sketch after these steps)
Save the output to /user/training/bootcamp/pyspark/wordcount
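A minimal sketch of the exercise above, assuming the lab environment described in these steps; the launch commands are shown as comments and the spark.ui.port value is only an example.

# Launch pyspark on YARN first (the port below is just an example):
#   export SPARK_MAJOR_VERSION=2
#   pyspark --master yarn --num-executors 2 --conf spark.ui.port=12901

# Inside the pyspark shell, sc (the SparkContext) is already created for us
lines = sc.textFile("/public/randomtextwriter/part-m-00000")

# Split lines into words, emit (word, 1) pairs and add up the counts per word
word_counts = lines. \
    flatMap(lambda line: line.split(" ")). \
    map(lambda word: (word, 1)). \
    reduceByKey(lambda x, y: x + y)

# saveAsTextFile is an action, so this is where the job actually runs
word_counts.saveAsTextFile("/user/training/bootcamp/pyspark/wordcount")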
Using this example I will walk you through the Spark Framework.
Spark Framework
Let us understand the different components of the Spark Framework listed below, as well as the different execution modes. A short example after the list shows where some of these components are visible from the pyspark shell.
https://youtu.be/S5kffm6aM2Q
Driver Program
Spark Context
Executors
Executor Cache
Executor Tasks
Job
Stage
Task (Executor Tasks)
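To relate these terms to the word count above, here is a small sketch assuming the same pyspark shell; the exact partition counts and ids depend on the cluster.

# The driver program here is the pyspark shell itself; sc is its Spark Context
print(sc.master)         # which cluster manager / execution mode we are running against
print(sc.applicationId)  # id under which executors were allocated (e.g. on YARN)

rdd = sc.textFile("/public/randomtextwriter/part-m-00000")
print(rdd.getNumPartitions())  # one task per partition is scheduled on the executors

# count() is an action, so it submits a job; the shuffle introduced by
# reduceByKey splits that job into two stages
rdd.flatMap(lambda line: line.split(" ")). \
    map(lambda word: (word, 1)). \
    reduceByKey(lambda x, y: x + y). \
    count()
# Jobs, stages and tasks can be reviewed in the Spark UI (spark.ui.port set at launch)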
Following are the different execution modes supported by Spark