In this section, we will look at high-level details of Spark, spark-shell, and pyspark, as well as spark-submit.
- Spark Architecture – Overview
- Spark Modules – Overview
- Spark Execution Modes
- spark-shell – Overview
- pyspark – Overview
Spark Architecture – Overview
Spark is a distributed computing framework with rich APIs to process data at scale.
- It can read from and write to any file system that is accessible through the HDFS APIs, for example:
  - HDFS
  - AWS S3
  - Azure Blob
  - and more
- For distributed computing, it relies on cluster managers such as YARN and Mesos to allocate resources across the cluster.
- Let us go through the official documentation to understand the high-level architecture of a Spark cluster running under YARN or Mesos:
  - Driver Program with Spark Context
  - Executors on Worker Nodes (Node Managers in YARN)
  - Cluster Manager (Resource Manager in YARN)
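The roles listed above come together when an application is submitted: the `--master` value tells the driver which cluster manager to ask for executors. As a rough sketch, the helper below assembles a `spark-submit` command line for a few common master URLs. The function name `build_spark_submit`, the script name `app.py`, and the Mesos host name are hypothetical. `--master`, `--deploy-mode`, `--num-executors`, and `--executor-memory` are standard `spark-submit` options, and this sketch passes `--num-executors` only when the master is YARN.

```python
import shlex


def build_spark_submit(app, master="local[*]", deploy_mode="client",
                       num_executors=None, executor_memory=None):
    """Return the argv list for a spark-submit call (illustrative helper, not a Spark API)."""
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # In this sketch --num-executors is only emitted for YARN.
    if num_executors is not None and master == "yarn":
        argv += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        argv += ["--executor-memory", executor_memory]
    argv.append(app)
    return argv


# Master URLs for the cluster managers discussed above (host name is a placeholder).
MASTERS = {
    "local": "local[*]",                 # everything runs in one local JVM, no cluster manager
    "yarn": "yarn",                      # Resource Manager is located via the Hadoop configuration
    "mesos": "mesos://mesos-host:5050",  # hypothetical Mesos master address
}

if __name__ == "__main__":
    cmd = build_spark_submit("app.py", master=MASTERS["yarn"],
                             deploy_mode="cluster", num_executors=4,
                             executor_memory="2g")
    # Print the command as a single shell-quoted string.
    print(shlex.join(cmd))
```

With `--deploy-mode cluster` the driver program itself runs inside the cluster, while with `client` it runs on the machine that issued the command. In both cases the executors run on the worker nodes.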
Spark Modules – Overview
Spark provides rich APIs to process data at scale. These APIs are grouped into modules that serve different purposes.
- Core APIs – RDDs as well as Transformations and Actions
- Spark SQL and DataFrames
- Machine Learning Pipelines
- GraphX for graph processing
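The split between transformations and actions in the Core APIs is easier to see in a small model. The class below is a toy written in plain Python, not Spark's implementation: `ToyRDD` and its internals are hypothetical names chosen for illustration. It only mimics the behaviour where transformations such as `map` and `filter` are recorded lazily, and an action such as `collect` or `count` triggers the actual computation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # pending transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work on the data.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps into one lazy iterator over the data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: evaluate the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


if __name__ == "__main__":
    calls = []

    def square(x):
        calls.append(x)  # record every invocation to show when work happens
        return x * x

    rdd = ToyRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
    print(len(calls))     # 0, because no action has been called yet
    print(rdd.collect())  # [1, 9, 25]
```

In PySpark the analogous calls on a real RDD are `map`, `filter`, `collect`, and `count`. The difference is that Spark distributes the recorded steps across the executors instead of running them in a single process.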
spark-shell – Overview
pyspark – Overview