Spark Overview and Installation

In this section, we will see how to set up the Spark components while exploring some of the key concepts of this important service.

There are two versions of Spark to cover; we will see how to set up both Spark 1.6 and Spark 2.3. However, to set up Spark 2.3 we need to use CDS (the Cloudera Distribution of Apache Spark 2), which is only available as Parcels, so we will see how to convert our cluster from Packages to Parcels before setting up Spark 2.3. A quick validation sketch follows the topic list below.

  • Setup and Validate Spark 1.6.x
  • Review Important Properties
  • Spark Execution Life Cycle
  • Convert Cluster to Parcels
  • Setup Spark 2.3.x
  • Run Spark Jobs – Spark 2.3.x
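
To illustrate the validation step, here is a minimal sketch of the kind of job we can run from spark-shell once Spark 1.6 is up. The input path is a hypothetical example; any small text file in HDFS will do.

```scala
// Run from spark-shell on the gateway node (bigdataserver-1);
// the shell provides the SparkContext as `sc`.

// Hypothetical input path -- replace with any small text file in HDFS.
val lines = sc.textFile("/user/cloudera/sample.txt")

// Classic word count: exercises HDFS reads and YARN execution end to end.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Trigger the job and print a few results.
counts.take(10).foreach(println)
```

If the installation is healthy, the job shows up in the Resource Manager and Spark History Server UIs as well.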

Cluster Topology

We are setting up the cluster on 7+1 nodes: we start with 7 nodes and add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1
  • Masters
    • bigdataserver-2
      • Zookeeper
      • Active/Standby Namenode
    • bigdataserver-3
      • Zookeeper
      • Active/Standby Namenode
      • Active/Standby Resource Manager
    • bigdataserver-4
      • Zookeeper
      • Active/Standby Resource Manager
      • Job History Server
      • Spark History Server
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager
    • bigdataserver-6 – Datanode, Node Manager
    • bigdataserver-7 – Datanode, Node Manager

Learning Process

We will follow the same standard process to learn while adding any software-based service.

  • Downloading and Installing – Even though we have already set up the software using Packages, that is not sufficient for Spark 2. With the Cloudera Distribution, Spark 2 must be installed using Parcels. Hence we will see how to migrate the cluster from Packages to Parcels.
  • Configuration – We need to understand the architecture and plan the configuration accordingly.
    • Architecture – Spark uses HDFS for the file system and YARN for resource management.
    • Components – Spark History Server
    • Configuration Files
      • Spark 1.6.x: /etc/spark/conf
      • Spark 2.3.x: /etc/spark2/conf
    • With Cloudera Manager the actual location is a bit different; we will see it after setting up the service.
  • Service Logs – /var/log/spark
  • Service Data – Spark is a distributed computing framework, and it can use any file system accessible through the HDFS APIs (such as HDFS, AWS S3, etc.); see the configuration sketch after this list.
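
To make the architecture concrete, here is a minimal sketch of a Spark application that relies on YARN for resource management and writes event logs for the Spark History Server. The property names are standard Spark properties, but the event-log directory is a hypothetical example; in a Cloudera cluster these values normally come from the generated configuration under /etc/spark/conf rather than being hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ClusterSmokeTest")
      // YARN handles resource management; "yarn-client" is the
      // Spark 1.6 style (use "yarn" in Spark 2.x).
      .setMaster("yarn-client")
      // Event logs are what the Spark History Server reads.
      // The directory below is a hypothetical example.
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")

    val sc = new SparkContext(conf)
    try {
      // Any HDFS-API file system works for input and output:
      // hdfs://, s3a://, file://, and so on.
      val data = sc.parallelize(1 to 1000)
      println(s"Sum computed on the cluster: ${data.sum()}")
    } finally {
      sc.stop()
    }
  }
}
```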
