Spark 2.3.x Overview

Let us go through an overview of Spark 2.3.x, covering documentation, modules, execution modes, and more.

  • Quick Recap of Scala
  • Develop first Spark application
  • Understanding SparkConf and SparkContext
  • Overview of Spark Shell
  • Resilient Distributed Datasets – Overview
  • Overview of Core APIs
  • Spark Modules

Quick Recap of Scala

Let us recap collections and the map reduce APIs from Scala.

  • Functions and Anonymous Functions
  • Collections  – List, Set, and Map
  • Map Reduce APIs – filter, map, reduce, etc. (a short example follows this list)
  • Typical Development Lifecycle
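
To jog the memory, here is a minimal sketch of the collection APIs mentioned above – filter, map, and reduce on a Scala List. The numbers used are arbitrary.

    // Scala collections recap – filter, map, and reduce on a List
    val numbers = List(10, 23, 35, 42, 57)

    // filter – retain only the even numbers
    val evens = numbers.filter(n => n % 2 == 0)     // List(10, 42)

    // map – row-level transformation, double each element
    val doubled = numbers.map(n => n * 2)           // List(20, 46, 70, 84, 114)

    // reduce – aggregate all elements into a single value
    val total = numbers.reduce((a, b) => a + b)     // 167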

Develop first Spark Application

Let us develop our first Spark Application using the Core APIs in IntelliJ.

  • Define the necessary dependencies in build.sbt.
  • Create a resources directory and then application.properties in it.
  • Create a new object with the main function and add logic to load externalized properties.
  • Create a SparkConf object and then the SparkContext object.
  • Read data using the SparkContext object and perform basic operations (a minimal sketch follows below).
[gist]62c6c5c1770f4a9240fd15511ba41df6[/gist]
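
Below is a minimal sketch of what such an application could look like. The Spark and Scala versions, the property names (dev.execution.mode, dev.input.path) and the input path are assumptions for illustration; adjust them to your environment.

    // build.sbt
    name := "spark-demo"
    version := "0.1"
    scalaVersion := "2.11.12"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"

    # src/main/resources/application.properties
    dev.execution.mode=local
    dev.input.path=/public/retail_db/orders

    // src/main/scala/SparkDemo.scala
    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkDemo {
      def main(args: Array[String]): Unit = {
        // Load the externalized properties from the classpath
        val props = new Properties()
        props.load(getClass.getResourceAsStream("/application.properties"))

        // Set all properties on SparkConf first ...
        val conf = new SparkConf()
          .setAppName("Spark Demo")
          .setMaster(props.getProperty("dev.execution.mode"))

        // ... then create the SparkContext
        val sc = new SparkContext(conf)

        // Read data using the SparkContext and perform a basic operation
        val orders = sc.textFile(props.getProperty("dev.input.path"))
        println("Order count: " + orders.count())

        sc.stop()
      }
    }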

Understanding SparkConf and SparkContext

Let us understand more about SparkConf and SparkContext.

  • Every application will have a runtime configuration; for Spark, it is SparkConf.
  • SparkConf has several APIs to set and get the properties. We can overwrite any property using set.
  • There are some additional APIs such as setAppName, setMaster, etc. setAppName is a handy function which overwrites the spark.app.name property, while setMaster overwrites the spark.master property.
  • Once the SparkConf object is created, it can be passed to the constructor of SparkContext to create a context object.
    • It will get the resources from the underlying framework (YARN, Mesos, Standalone or local).
    • It will start a web service on port number 4040 (we can override this by overwriting the value for spark.ui.port).
  • The SparkContext object exposes several APIs, including APIs that can be used to read data from files (e.g. textFile, sequenceFile).
  • We have to set all the properties first using the SparkConf object and then create the SparkContext object, as shown in the sketch below.
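
Here is a minimal sketch of the SparkConf and SparkContext relationship described above; the application name, master URL, and port number are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}

    // Set all the properties on SparkConf first ...
    val conf = new SparkConf()
      .setAppName("Config Demo")       // overwrites spark.app.name
      .setMaster("local[2]")           // overwrites spark.master
      .set("spark.ui.port", "14040")   // override the default 4040 for the web UI

    // ... then create the SparkContext, which gets the resources
    // from the underlying framework and starts the web service
    val sc = new SparkContext(conf)

    println(sc.getConf.get("spark.app.name"))  // Config Demo
    println(sc.getConf.get("spark.ui.port"))   // 14040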

Overview of Spark Shell

Let us go through some of the details related to Spark Shell.

  • spark-shell is the command to launch the Scala REPL with Spark binaries.
  • On top of launching the REPL, it also creates two variables, sc and spark (in 2.x).
    • sc is of type SparkContext (for core APIs)
    • spark is of type SparkSession (for Data Frames and other higher level modules)
  • As the SparkContext object is created, Spark Shell will get the resources from the underlying framework and start the web service on 4040 or the specified port.
  • We can start developing code using the APIs on sc or spark.
  • We can get all the configuration properties using sc.getConf.getAll (see the sample session below).
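
Here is a sample spark-shell session along these lines; the master URL, the port, and the abridged output shown are illustrative.

    $ spark-shell --master local[2] --conf spark.ui.port=14562

    scala> sc
    res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...

    scala> spark
    res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...

    scala> sc.getConf.getAll.foreach(println)
    (spark.master,local[2])
    (spark.app.name,Spark shell)
    (spark.ui.port,14562)
    ...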

Resilient Distributed Datasets – Overview

A Resilient Distributed Dataset (RDD) is the core data structure of Spark.

  • It is similar to a Scala List, but the data is distributed across the executors.
  • An RDD will be loaded on demand by the Spark framework into the memory of executor tasks. Each task gets a chunk of the RDD, which is called an RDD partition.
  • The RDD partition size is determined by the underlying file system. If the underlying file system is HDFS, the default partition size is 128 MB. However, we can change the number of partitions at runtime (see the sketch after this list).
  • An RDD is also resilient, which means that if a task processing an RDD partition fails, that partition will be reprocessed by another executor task.
  • There are several APIs which can be used on RDDs as part of the data processing
    • Row-level transformations – map and flatMap
    • Filtering data – filter
    • Aggregations – reduceByKey, aggregateByKey
    • Joins – join, leftOuterJoin, rightOuterJoin
    • Sorting – sortByKey
    • Ranking can be done with groupByKey along with anonymous functions
    • Set Operations – union, intersection
    • and more
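
A minimal sketch from the Spark Shell (where sc is predefined) illustrating partitions and a couple of the APIs listed above; the input path and the column layout of the data are assumptions.

    // Read a file into an RDD; the number of partitions is driven by the
    // underlying file system (e.g. the HDFS block size)
    val orders = sc.textFile("/public/retail_db/orders")
    println(orders.getNumPartitions)

    // Ask for at least 8 partitions at load time
    val orders8 = sc.textFile("/public/retail_db/orders", 8)
    println(orders8.getNumPartitions)

    // Row-level transformation followed by filtering
    val closedOrders = orders
      .map(line => line.split(","))
      .filter(cols => cols(3) == "CLOSED")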

Overview of Core APIs

Let us check out some of the Spark Core APIs such as filter, map, join, reduceByKey, and sortByKey.

APIs on top of RDDs are categorized into Transformations and Actions.

  • Transformations take an RDD as input and return another RDD as output.
    • Row-level transformations – map and flatMap
    • Filtering data – filter
    • Aggregations – reduceByKey, aggregateByKey
    • Joins – join, leftOuterJoin, rightOuterJoin
    • Sorting – sortByKey
    • Ranking can be done with groupByKey along with anonymous functions
    • Set Operations – union, intersection
  • When transformations are used, the Spark job will not run immediately; Spark only updates the DAG.
  • Actions take an RDD as input and perform the specified action, such as converting the RDD into a local collection in the driver program or saving it to the file system.
  • Actions trigger the execution of the Spark job. If there are multiple actions in the Spark application, each of those actions triggers a new job.
  • DAG stands for Directed Acyclic Graph, and the process of deferring execution until an action is performed is called lazy evaluation (illustrated in the sketch below).
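
A minimal word count sketch from the Spark Shell illustrating lazy evaluation; the input and output paths are assumptions.

    // Transformations only update the DAG – no job runs yet
    val lines = sc.textFile("/public/randomtextwriter/part-m-00000")
    val counts = lines
      .flatMap(line => line.split(" "))   // row-level transformation
      .map(word => (word, 1))             // row-level transformation
      .reduceByKey((a, b) => a + b)       // aggregation

    // Each action triggers a separate job
    counts.take(10).foreach(println)                    // action 1
    counts.saveAsTextFile("/user/training/wordcount")   // action 2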

Spark Modules

Core Spark provides the RDD as the data structure and Transformations and Actions as the APIs. However, these are considered low level or core, and there are high-level modules on top of Core Spark.

  • Data Frame or Dataset as the data structure and Spark SQL or Data Frame operations as the API to process data. This is considered the core module starting from Spark 2.x (a short sketch appears at the end of this section).
  • On top of Data Frame Operations or Spark SQL, we have other high-level modules.
  • Spark Structured Streaming – to build streaming pipelines (micro batch as well as continuous)
  • MLlib Pipelines
  • GraphX Pipelines

We will be covering the first two very extensively as part of the rest of our course.
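
As a quick taste of the Data Frame and Spark SQL module mentioned above, here is a minimal sketch using SparkSession; the input path and the column name used in the query are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Data Frame Demo")
      .master("local[2]")
      .getOrCreate()

    // Data Frame as the data structure, Spark SQL as the API
    val orders = spark.read.json("/public/retail_db_json/orders")
    orders.printSchema()

    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT order_status, count(1) FROM orders GROUP BY order_status").show()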