Even though Cloudera supports both Spark 1.6.x and Spark 2.3.x, only Spark 1.6.x can be set up using packages.
- Go to the Cloudera Manager Dashboard
- Click on Add Service in the cluster's drop-down menu
- Choose Spark 1.6.x (do not choose Standalone)
- We will be using bigdataserver-4 as the Spark Job History Server.
- Review properties and complete the setup process.
Spark is a distributed computing engine that uses file systems supported by HDFS APIs for storage, and YARN or Mesos for resource management. With distributions such as Cloudera, we can only configure Spark with YARN.
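To make YARN the default resource manager, the master can be set in Spark's defaults file. This is a minimal sketch; the file location shown is typical for a package-based install and may differ on your cluster, and Cloudera Manager normally manages this file for you:

```
# /etc/spark/conf/spark-defaults.conf (typical location; verify on your cluster)
spark.master            yarn
spark.eventLog.enabled  true
```

With `spark.master` set here, `spark-submit`, `spark-shell`, and `pyspark` will run against YARN without passing `--master yarn` on every invocation.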
Run Spark Jobs – Spark 1.6.x
Spark provides APIs as well as Framework for distributed processing.
- Developers build Spark-based applications using Scala, Python, or Java.
- When code is released, developers are responsible for providing a run guide for their applications.
- As part of the Spark setup we get example applications, which can be submitted using the spark-submit command. Let us review some of the arguments we can pass to spark-submit to control the runtime behavior of a Spark application.
- We can also launch a Scala REPL with Spark dependencies using spark-shell, and a Python CLI with Spark dependencies using pyspark.
https://gist.github.com/dgadiraju/65128e88405c9b80e8bc34d3e878c6c3
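The commands above can be sketched as follows. This is only an illustration: the examples jar path shown is typical for a package-based CDH install and may differ on your cluster, so verify it before running.

```shell
# Submit the bundled SparkPi example to YARN.
# Resource arguments control runtime behavior:
#   --num-executors    number of executor containers requested from YARN
#   --executor-memory  memory per executor
#   --executor-cores   CPU cores per executor
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 2 \
  --executor-memory 1g \
  --executor-cores 1 \
  /usr/lib/spark/lib/spark-examples.jar 100   # path assumed; check your install

# Interactive sessions against the same cluster:
spark-shell --master yarn   # Scala REPL with Spark dependencies
pyspark --master yarn       # Python CLI with Spark dependencies
```

In `client` deploy mode the driver runs on the machine where spark-submit is invoked, which is convenient for interactive review of output; `--deploy-mode cluster` runs the driver inside a YARN container instead.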
- After running the jobs, let us also review the UI to monitor both running and completed jobs.
- Here Spark is integrated with YARN, and hence a Spark job or application is nothing but a YARN application.
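Because a Spark application on YARN is just a YARN application, it can also be monitored from the YARN CLI. A minimal sketch, assuming a configured YARN client on the host (`<application_id>` is a placeholder for an id from the list output):

```shell
# List Spark applications currently running on YARN
yarn application -list -appStates RUNNING

# Show status details for one application
yarn application -status <application_id>

# Fetch aggregated logs after the application completes
yarn logs -applicationId <application_id>
```

The same applications also appear in the YARN ResourceManager web UI, and completed Spark jobs in the Spark Job History Server UI on bigdataserver-4.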