Durga Gadiraju

Setup and Validate Spark 1.6.x

Even though Cloudera supports both Spark 1.6.x and Spark 2.3.x, we can only set up Spark 1.6.x with Packages.

Go to the Cloudera Manager Dashboard
Click on Add Service in the drop-down menu of the cluster.
Choose Spark 1.6.x (don’t choose Standalone).
We will be using bigdataserver-4 as the Spark Job History Server.
Review properties and complete the setup process.

Spark is a distributed computing engine that uses file systems supported by HDFS APIs for storage, and YARN or Mesos for resource management. With distributions like Cloudera, we can only configure it with YARN.

Run Spark Jobs – Spark 1.6.x

Spark provides APIs as well as a framework for distributed processing.

Developers build Spark-based applications using Scala, Python, or Java.
When code is released, it is the responsibility of the developers to provide a run guide for their applications.
As part of the Spark setup we get examples, which can be submitted using the spark-submit command. Let us review some of the arguments we can pass to spark-submit to control the run-time behavior of a Spark application.
We can also launch the Scala REPL with Spark dependencies using spark-shell, and the Python CLI with Spark dependencies using pyspark.

https://gist.github.com/dgadiraju/65128e88405c9b80e8bc34d3e878c6c3
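
To make the spark-submit arguments concrete, here is a minimal Python sketch that assembles a spark-submit command line for the bundled SparkPi example. The helper name build_spark_submit, the resource values, the port number, and the examples jar path are assumptions for illustration only; check the jar location on your own cluster. The flags themselves (--master, --deploy-mode, --class, --num-executors, --executor-memory, --executor-cores, --conf) are standard spark-submit options.

```python
import shlex


def build_spark_submit(app_jar, main_class, app_args=(), master="yarn",
                       deploy_mode="client", num_executors=2,
                       executor_memory="1g", executor_cores=1, conf=None):
    """Return a spark-submit command as a list of arguments.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = [
        "spark-submit",
        "--master", master,                # YARN is the resource manager here
        "--deploy-mode", deploy_mode,      # "client" or "cluster"
        "--class", main_class,             # entry point inside the jar
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ]
    # Extra Spark properties are passed as repeated --conf key=value pairs.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    # The application jar comes after the options, then the app's own arguments.
    cmd.append(app_jar)
    cmd.extend(str(arg) for arg in app_args)
    return cmd


# Assumed location of the examples jar; it varies by installation.
EXAMPLES_JAR = "/usr/lib/spark/lib/spark-examples.jar"

command = build_spark_submit(
    EXAMPLES_JAR,
    "org.apache.spark.examples.SparkPi",
    app_args=[10],                         # number of partitions for SparkPi
    conf={"spark.ui.port": 12345},         # arbitrary port chosen for the example
)

# Print a shell-safe version that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in command))
```

The printed line can be run on a gateway node where spark-submit is on the PATH. Changing deploy_mode to "cluster" would run the driver inside YARN instead of on the machine that submits the job.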

After running the jobs, let us also review the UI to monitor running or completed jobs.
Here Spark is integrated with YARN, and hence a Spark job or application is nothing but a YARN application.
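
Because each Spark application shows up as a YARN application, it can also be looked up through the YARN ResourceManager REST API in addition to the UI. The sketch below is a hedged illustration: the host name rm-host.example.com, the helper names, and the sample payload are hypothetical, and 8088 is only the default ResourceManager web port. It builds the query URL for running Spark applications and summarizes a response of the shape that the /ws/v1/cluster/apps endpoint returns.

```python
import json


def running_spark_apps_url(rm_host, rm_port=8088):
    """Build the ResourceManager REST URL for running Spark applications."""
    return ("http://{}:{}/ws/v1/cluster/apps"
            "?states=RUNNING&applicationTypes=SPARK").format(rm_host, rm_port)


def summarize_apps(payload):
    """Return (id, name, state) tuples from a parsed cluster/apps response."""
    # "apps" can be null when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app", [])
    return [(app["id"], app["name"], app["state"]) for app in apps]


# Hypothetical host name; replace with your ResourceManager host.
url = running_spark_apps_url("rm-host.example.com")

# Illustrative payload only, not output captured from a real cluster.
sample_response = json.loads("""
{
  "apps": {
    "app": [
      {"id": "application_0000000000000_0001",
       "name": "Spark Pi",
       "state": "RUNNING",
       "applicationType": "SPARK"}
    ]
  }
}
""")

summary = summarize_apps(sample_response)
print(url)
print(summary)
```

On a real cluster the JSON would be fetched from the URL (for example with curl) rather than taken from an inline sample.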
