Yet Another Resource Negotiator (YARN)

Let us get into some of the important details with respect to YARN.

  • YARN Architecture
  • Running Map Reduce Jobs using YARN
  • Running Spark Jobs using YARN
  • YARN Application Life Cycle
  • Spark Job Execution Life Cycle
  • YARN Schedulers – Overview

YARN Architecture

YARN also follows Master-Slave Architecture. It is primarily used for Resource Management and Scheduling of Jobs.

  • YARN Components
    • Resource Manager is the master in YARN
    • Node Managers on worker nodes are slaves in YARN
    • Node Managers collect usage information from respective nodes and send the details to the Resource Manager as part of the heartbeat
    • Resource Manager keeps track of the usage of the cluster. You can review this information using the Resource Manager UI.
    • Application Timeline Server keeps track of YARN applications submitted on the cluster
  • We can run Map Reduce Jobs as well as Spark Jobs using YARN.
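
To cross-check what the Resource Manager tracks, we can list the registered Node Managers from the command line (a quick sketch; output columns vary slightly across Hadoop versions):

  yarn node -list        # Node Managers in RUNNING state along with running container counts
  yarn node -list -all   # include nodes in all states (for example LOST or DECOMMISSIONED)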

Running Map Reduce Jobs using YARN

Let us run a simple Map Reduce Job and see what happens. We will be using the Hadoop examples that come as part of the setup itself.

  • We can use hadoop jar or yarn jar to submit a Map Reduce Job as a YARN application. Let us run an application called randomtextwriter which generates 10 GB of data per node by default.
  • This job will take some time to run.
[gist]f2852840916b1e79f4fb6830d93c8b22[/gist]
  • Typically data will be processed using map tasks and reduce tasks.
    • Map Tasks read the data and perform row-level transformations.
    • Reduce Tasks read the output of Map Tasks and perform transformations such as joins, aggregations etc.
    • The Shuffling Process between Map Tasks and Reduce Tasks takes care of grouping and partitioning of data based on keys.
    • We do not have to get into too many details at this time as an administrator.
    • This particular application randomtextwriter is a map-only job which creates 10 GB of data per data node. In our case, with three data nodes, we will see about 30 GB of data.
Exercise: Run relevant hadoop fs commands to get the size of the data created by randomtextwriter (a couple of sample commands are shown after this list).
  • Map tasks and Reduce tasks will run as part of YARN containers on Node Managers.
  • The life cycle of the job is managed by per job application master.
  • Typically Map Reduce jobs read data from HDFS, process it and save it back to HDFS.
  • This example job does not read any data from HDFS; it just randomly generates text and writes it to HDFS.
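
For the exercise above, commands along these lines should do the job (the output directory name here is only an assumption; use whatever path was passed to randomtextwriter in your run):

  hdfs dfs -du -s -h /user/${USER}/randomtextwriter   # total size of the generated data
  hdfs dfs -ls -h /user/${USER}/randomtextwriter      # sizes of the individual output files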

We can keep track of running jobs as well as troubleshoot completed jobs using the Resource Manager UI.
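
The same details are also available from the yarn command line, which can be handy when the UI is not reachable (the application id below is a placeholder; the logs command assumes log aggregation is enabled):

  yarn application -list                                     # applications that are currently running
  yarn application -list -appStates FINISHED,FAILED,KILLED   # completed applications
  yarn logs -applicationId <application id>                  # aggregated logs for a completed application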

Running Spark Jobs using YARN

Spark provides APIs as well as a framework for distributed processing.

  • Developers take care of developing Spark-based applications using Scala, Python, or Java.
  • When the code is released, it is the responsibility of Developers to provide a run guide for their applications.
  • As part of the Spark setup, we get examples and they can be submitted using the spark-submit command. Let us review some of the arguments we can pass to spark-submit to control the runtime behavior of a Spark Application (a sample invocation is shown after this list).
  • We can also launch the Scala REPL with Spark dependencies using spark-shell and the Python CLI with Spark dependencies using pyspark.
[gist]65128e88405c9b80e8bc34d3e878c6c3[/gist]
  • After running the jobs, let us also review the UI to monitor either running or completed jobs.
  • Here Spark is integrated with YARN and hence a Spark Job or Application is nothing but a YARN Application.
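
Here is a sketch of a spark-submit invocation for the bundled SparkPi example; the jar location is an assumption and varies by distribution and version:

  spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --num-executors 2 \
    --executor-memory 1g \
    --executor-cores 1 \
    /usr/hdp/current/spark-client/lib/spark-examples*.jar 100

The last argument (100) is simply the number of tasks SparkPi uses to estimate Pi; --num-executors, --executor-memory and --executor-cores are the typical knobs for controlling the runtime behavior mentioned above.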

Run using Spark 2 – HDP

We can have multiple versions of Spark on the clusters set up using HDP.

  • Let us assume that we have Spark 1.6.x and Spark 2.3.x
  • By default it might pick up Spark 1.6; you can validate this by running the spark-shell command.
  • If it launches the Spark Shell using Spark 1.6 and you want to use Spark 2.x instead, then you have to first run export SPARK_MAJOR_VERSION=2 at the Linux prompt and then launch spark-shell.
  • If you always want to run using Spark 2, you can add export SPARK_MAJOR_VERSION=2 to .bash_profile or .profile (whichever is relevant to your environment).
  • Once the entry is added, you either have to relaunch the terminal or source the profile script. You can validate whether the environment variable is set by running echo $SPARK_MAJOR_VERSION.
  • Now you can run spark-submit, spark-shell or pyspark. All of them use Spark 2 instead of Spark 1.6.
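
In practice the sequence looks like this (a sketch; the exact Spark versions on your cluster may differ):

  export SPARK_MAJOR_VERSION=2   # switch the wrapper scripts to Spark 2 for this session
  echo $SPARK_MAJOR_VERSION      # validate that the variable is set
  spark-shell                    # now launches the Scala REPL with Spark 2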

Run using Spark 2 – CDH

We can have multiple versions of Spark on the clusters set up using CDH.

  • Let us assume that we have Spark 1.6.x and Spark 2.3.x
  • By default it might pick up Spark 1.6, you can validate by running the spark-shell command.
  • To use Spark 2, you just need to switch to spark2-shell to launch the Spark Shell with Scala, pyspark2 to launch the Spark Shell with Python, or spark2-submit to submit jobs using Spark 2.
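
For example (a sketch; the examples jar path is an assumption and varies by CDH parcel version):

  spark2-shell        # Scala REPL with Spark 2
  pyspark2            # Python CLI with Spark 2
  spark2-submit --class org.apache.spark.examples.SparkPi --master yarn \
    /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 100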

YARN Application Life Cycle

Now let us talk about YARN Application Life Cycle. YARN is the resource management framework.

  • We can use distributed data processing frameworks such as Map Reduce, Spark etc., by plugging into YARN.
  • A YARN application can be Map Reduce Job or Spark Application.
  • From YARN perspective data is being processed by containers.
  • Let us understand the life cycle of YARN Application.
    • We use the client to submit the YARN Application (e.g., a Map Reduce Job)
    • The request goes to the Resource Manager. The Resource Manager has up-to-date usage information about all the servers, reported by the registered Node Managers as part of their heartbeats.
    • The Resource Manager decides on a node on which a container should run to manage the job or application, using criteria such as server usage.
    • This container is called the Application Master. It will be up and running until the application is either completed or killed.
    • Now the Application Master will talk to the Node Managers directly and decide on which nodes containers should run to process the data. It uses data locality and server usage as criteria before creating containers.
    • These containers will process the data and might get garbage collected depending upon the underlying data processing framework.
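
The same life cycle can be observed from the command line (the ids below are placeholders; take them from the output of yarn application -list):

  yarn application -status <application id>        # overall state: ACCEPTED, RUNNING, FINISHED, ...
  yarn applicationattempt -list <application id>   # the attempt backing the Application Master
  yarn container -list <application attempt id>    # containers currently running for that attempt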

Spark Job Execution Life Cycle

Let us understand the Execution Life Cycle of Spark. You can review this using Spark Official Documentation.

  • We submit the job from the client. The client JVM typically acts as the Driver Program.
  • It will talk to the Resource Manager and create the Application Master.
  • The Application Master will talk to the worker nodes on which Node Managers are running and provision resources based on allocation settings. Allocation can be either static or dynamic (see the sketch below).
  • These resources are nothing but Executors. From the YARN perspective, they are Containers.
  • An Executor is nothing but a JVM which can run multiple concurrent threads until the Job is complete.
[gist]65128e88405c9b80e8bc34d3e878c6c3[/gist]
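
To make the static versus dynamic allocation point concrete, here is a sketch of the relevant spark-submit options (the property names are standard Spark configuration; the values are only illustrative):

  # Static allocation: a fixed number of executors for the life of the application
  spark-submit --master yarn --num-executors 4 --executor-memory 2g --executor-cores 2 ...

  # Dynamic allocation: executors are requested and released based on the workload
  spark-submit --master yarn \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=10 \
    --conf spark.shuffle.service.enabled=true ...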

YARN Schedulers – Overview

Let us see details about Schedulers in YARN.

  • YARN primarily takes care of Resource Management and Job Scheduling.
  • There are three types of Schedulers supported by YARN. A scheduler determines how jobs and their associated tasks are queued up and given resources as they execute.
    • FIFO Scheduler
    • Fair Scheduler
    • Capacity Scheduler
  • FIFO Scheduler is default in Plain Vanilla Hadoop.
  • In Cloudera Distribution, Fair Scheduler is default.
  • With FIFO Scheduler, jobs are given priority in the order of submission.
  • With Fair Scheduler, as resources are freed up, all the queued job tasks will get equal priority.
  • It is very easy to switch the schedulers using Cloudera Manager.
  • Fair Scheduler and Capacity Scheduler have pools and queues which need to be configured so that resources are distributed between different categories of applications.
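
To check which scheduler a given cluster is actually using and how a queue looks, a couple of commands like these can help (the configuration path is typical for HDP/CDH installations but is an assumption):

  grep -A 1 'yarn.resourcemanager.scheduler.class' /etc/hadoop/conf/yarn-site.xml   # configured scheduler implementation
  yarn queue -status default                                                        # state, capacity and current usage of the default queue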