In this topic we will look at Spark's modules, its architecture, and how that architecture maps to different execution modes such as YARN and Mesos.
Spark is a distributed computing framework. To use it, we need to learn its APIs, which are organized into modules, and build applications in one of the supported programming languages (such as Scala, Python, or Java).
- Spark Official Documentation
- Spark Modules
  - Core – Transformations and Actions
  - Spark SQL and Data Frames
  - Structured Streaming
  - Machine Learning Pipelines
  - GraphX (graph processing)
  - and more
- Spark Data Structures
  - Resilient Distributed Datasets (a distributed collection, split into partitions across the cluster)
  - Data Frame (a wrapper on top of RDD that adds structure)
- Spark Framework and Execution Modes
Quick Review Of APIs
In earlier versions of Spark, the core API sits at the bottom and all the higher-level modules are built on top of it. Examples of core API operations are map, reduce, join, and groupByKey. With Spark 2, however, Data Frames and Spark SQL are becoming the core module.
- Core – Transformations and Actions – APIs such as map, reduce, join, and filter. They typically work on RDDs.
- Spark SQL and Data Frames – APIs and a Spark SQL interface for batch processing on top of Data Frames or Data Sets (Data Sets are not available in Python).
- Structured Streaming – APIs and a Spark SQL interface for stream processing on top of Data Frames.
- Machine Learning Pipelines – data pipelines that apply Machine Learning algorithms on top of Data Frames.
- GraphX – APIs for graph processing.
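The difference between transformations and actions can be illustrated without a cluster. In Spark, transformations such as map and filter are lazy, and an action such as reduce triggers the actual computation. The following is only a plain-Python analogy, not Spark code: the built-in lazy `map` and `filter` stand in for transformations, and `functools.reduce` stands in for an action. The names `orders`, `amount_of`, and `evaluated` are hypothetical and exist only for this sketch.

```python
from functools import reduce

# Hypothetical input: (order_id, amount) pairs standing in for records in an RDD.
orders = [(1, 120.0), (2, 35.5), (3, 310.0), (4, 80.0)]

evaluated = []  # records which order ids were actually processed


def amount_of(order):
    """Extract the amount and note that this record was touched."""
    evaluated.append(order[0])
    return order[1]


# "Transformations": building this chain does no work yet.
amounts = map(amount_of, orders)
large = filter(lambda amount: amount >= 100.0, amounts)
touched_before_action = len(evaluated)

# "Action": reducing forces the whole chain to run.
total_large = reduce(lambda a, b: a + b, large)
```

Here `touched_before_action` stays at zero because nothing is evaluated until the reduce step pulls records through the chain.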
Spark Data Structures
We need to deal with two types of data structures in Spark – RDDs and Data Frames. We will see Data Structures in detail as part of the next topic.
- RDD has been around for quite some time. It is the low-level data structure that Spark uses to distribute data between tasks while the data is being processed.
- An RDD is divided into partitions while the data is being processed. Each partition is processed by one task.
- A Data Frame is essentially an RDD with structure (named columns with data types).
- Typically we read data from file systems such as HDFS, S3, Azure Blob, or the local file system.
- Depending on the file format, we need to use different Spark APIs to read data into an RDD or a Data Frame.
- Spark uses the HDFS (Hadoop file system) APIs to read data from and write data to the underlying file system.
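The idea that each partition is processed by one task can be sketched in plain Python. This is a simplified illustration, not how Spark is implemented: the round-robin split and the thread pool are assumptions standing in for Spark's partitioning and its executor tasks, and the helper names `split_into_partitions` and `run_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin into num_partitions lists (illustrative only)."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def run_task(partition):
    """One task processes exactly one partition; here it counts words."""
    return sum(len(line.split()) for line in partition)


lines = [
    "spark uses rdd",
    "data frame has structure",
    "one task per partition",
    "hello",
]
partitions = split_into_partitions(lines, 2)

# Two worker threads stand in for two tasks running in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_task_counts = list(pool.map(run_task, partitions))

total_words = sum(per_task_counts)
```

Each entry in `per_task_counts` is the result of one task, and the final sum plays the role of combining per-partition results.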
Let us start with a simple application, written using pyspark, to understand the details of the architecture.
- As we have multiple versions of Spark in our lab and we are exploring Spark 2, we need to export SPARK_MAJOR_VERSION set to 2
- Launch pyspark on YARN with num-executors set to 2 (also use spark.ui.port to specify a unique port)
- Develop a simple word count program that reads data from /public/randomtextwriter/part-m-00000
- Save the output to /user/training/bootcamp/pyspark/wordcount
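The steps above could look roughly like the following. This is a minimal sketch rather than the exact program used in the walkthrough. It assumes the input can be read as plain text with `sc.textFile` and that it runs inside the pyspark shell, where the SparkContext is already available as `sc`. The helper names `tokenize`, `format_count`, and `word_count` are hypothetical, and the port in the launch comment is a placeholder.

```python
# Assumed launch from the shell (the port is a placeholder to fill in):
#   export SPARK_MAJOR_VERSION=2
#   pyspark --master yarn --num-executors 2 --conf spark.ui.port=<unique port>

INPUT_PATH = "/public/randomtextwriter/part-m-00000"
OUTPUT_PATH = "/user/training/bootcamp/pyspark/wordcount"


def tokenize(line):
    """Split a line into words on whitespace."""
    return line.split()


def format_count(pair):
    """Render a (word, count) pair as a tab-separated line."""
    word, count = pair
    return word + "\t" + str(count)


def word_count(sc, input_path=INPUT_PATH, output_path=OUTPUT_PATH):
    """Count word occurrences in input_path and save them to output_path."""
    lines = sc.textFile(input_path)              # RDD of lines
    counts = (lines
              .flatMap(tokenize)                 # one record per word
              .map(lambda word: (word, 1))       # pair each word with 1
              .reduceByKey(lambda a, b: a + b))  # sum the counts per word
    counts.map(format_count).saveAsTextFile(output_path)  # action: runs the job

# In the pyspark shell: word_count(sc)
```

Note that `saveAsTextFile` fails if the output directory already exists, so the directory has to be removed before rerunning.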
Using this application, I will walk you through the Spark Framework.
Let us look at the different components of the Spark Framework, and then at the different execution modes.
- Driver Program
- Spark Context
- Executor Cache
- Executor Tasks
- Task (Executor Tasks)
Spark supports the following execution modes:
- Local (for development)
- Standalone (for development)