Overview of Spark using Scala and Python

This section gives a high-level overview of Spark and of the spark-shell, pyspark, and spark-submit commands.

  • Spark Architecture – Overview
  • Spark Modules – Overview
  • Spark Execution Modes
  • spark-shell – Overview
  • pyspark – Overview

Spark Architecture – Overview

Spark is a distributed computing framework with rich APIs to process data at scale.

  • It can work with any file system that is accessible through the Hadoop (HDFS-compatible) file system APIs
    • HDFS
    • AWS S3
    • Azure Blob Storage
    • and more
  • For distributed computing, it relies on cluster managers such as YARN and Mesos
  • Let us go through the official documentation and understand the high-level architecture of a Spark cluster running under YARN or Mesos.
    • Driver Program with Spark Context
    • Executors on Worker Nodes (Node Managers in YARN)
    • Cluster Manager (Resource Manager in YARN)
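The pieces above come together in the driver program: the master setting names the cluster manager, and the URI scheme of a path selects the file system. The following is a minimal PySpark sketch, assuming PySpark is installed. The names `STORAGE_BY_SCHEME`, `describe_path`, and `build_session`, as well as the example paths, are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

# Illustrative (not exhaustive) mapping from URI scheme to storage system.
# s3a and wasbs are the Hadoop connector schemes for S3 and Azure Blob Storage.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3a": "AWS S3",
    "wasbs": "Azure Blob Storage",
    "file": "local file system",
}


def describe_path(path):
    """Return the storage system implied by the URI scheme of `path`."""
    scheme = urlparse(path).scheme
    # A path without a scheme is resolved against the cluster's default file system.
    return STORAGE_BY_SCHEME.get(scheme, "cluster default file system")


def build_session(app_name, master="local[*]"):
    """Create the driver-side entry point (a SparkSession wrapping a SparkContext).

    `master` selects the cluster manager: "yarn" for YARN,
    "local[*]" to run everything inside a single local process.
    """
    # Imported here so that describe_path can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).master(master).getOrCreate()
```

For example, `describe_path("hdfs://namenode:8020/data/orders")` returns `"HDFS"`. Calling `build_session("demo", master="yarn")` would start a driver that asks the YARN Resource Manager for executors; the context is then available as `spark.sparkContext`.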

Spark Modules – Overview

Spark provides rich APIs to process data at scale. These APIs are grouped into modules, each serving a different purpose.

  • Core APIs – RDDs as well as Transformations and Actions
  • Spark SQL and DataFrames
  • MLlib – Machine Learning Pipelines
  • GraphX – Graph Processing
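To make the first two modules concrete, here is a small PySpark sketch that computes the same result twice: once with the Core API (RDD transformations followed by an action) and once with the DataFrame API. The function names `even_squares_total` and `even_squares_total_df` are hypothetical; the sketch assumes an existing SparkContext `sc` and SparkSession `spark`, such as the ones the pyspark shell provides.

```python
def even_squares_total(sc, numbers):
    """Core API: sum the squares of the even numbers using an RDD."""
    rdd = sc.parallelize(numbers)
    evens = rdd.filter(lambda n: n % 2 == 0)    # transformation (lazy)
    squares = evens.map(lambda n: n * n)        # transformation (lazy)
    # reduce is an action: it triggers the actual computation on the executors.
    return squares.reduce(lambda a, b: a + b)


def even_squares_total_df(spark, numbers):
    """Spark SQL / DataFrames: the same computation with column expressions."""
    from pyspark.sql import functions as F

    df = spark.createDataFrame([(n,) for n in numbers], ["n"])
    row = (
        df.filter(F.col("n") % 2 == 0)
        .select(F.sum(F.col("n") * F.col("n")).alias("total"))
        .first()
    )
    return row["total"]
```

With the input `[1, 2, 3, 4]`, both functions keep 2 and 4 and add 4 + 16. The RDD version describes how to compute the result step by step, while the DataFrame version states what is wanted and leaves the execution plan to Spark SQL.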

spark-shell – Overview

pyspark – Overview