Run Spark Jobs – Spark 2.3.x

Spark provides APIs as well as a framework for distributed processing.

  • Even though there are differences in developing applications for Spark 1.6.x versus Spark 2.3.x, deployment does not change much.
  • Instead of spark-submit, we need to use spark2-submit. The same applies to spark-shell and pyspark: we have to use spark2-shell and pyspark2 (see the sketch after this list).
  • We can run the same commands we have seen earlier by simply replacing spark-submit with spark2-submit, spark-shell with spark2-shell and pyspark with pyspark2.
  • The execution life cycle does not change much; it is essentially the same between Spark 1.6.x and Spark 2.3.x.
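
As a minimal sketch, here is how the renamed commands might be invoked on a CDH cluster with the SPARK2 parcel. The SparkPi example class ships with Spark itself, but the parcel path to the examples jar is an assumption and may differ on your installation.

    # Launch the Scala and Python shells against YARN (standard shell options)
    spark2-shell --master yarn --deploy-mode client
    pyspark2 --master yarn --deploy-mode client

    # Submit the bundled SparkPi example; the jar path assumes a typical SPARK2 parcel layout
    spark2-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode client \
      /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_*.jar 10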

https://gist.github.com/dgadiraju/65128e88405c9b80e8bc34d3e878c6c3

By this time you should have your cluster running from Parcels with Zookeeper, HDFS and YARN (including High Availability), Spark 1.6.0 and Spark 2.3.0. You should also be familiar with the relevant Web UIs for these services and with some of the commands, especially from the administration perspective.
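
As a quick sanity check before submitting jobs, a few stock HDFS, YARN and Spark CLI calls, run from any gateway node, will confirm the services are up:

    # Confirm HDFS is healthy and the DataNodes have reported in
    hdfs dfsadmin -report

    # Confirm the NodeManagers are registered with YARN
    yarn node -list

    # Confirm both Spark versions respond
    spark-submit --version
    spark2-submit --version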

Make sure to stop the services in Cloudera Manager and also shut down the servers provisioned from GCP or AWS to conserve credits and control costs.
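
For example, the cluster VMs can be stopped from the command line; the instance names, zone and instance ID below are placeholders for your own environment:

    # GCP: stop (not delete) the VMs so they stop accruing compute charges
    gcloud compute instances stop bigdata-node-1 bigdata-node-2 --zone=us-central1-a

    # AWS equivalent, with a placeholder instance ID
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0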
