Now let us understand the development life cycle of Spark applications using PyCharm, as well as the deployment life cycle. As part of the deployment life cycle, we will see how to control runtime behavior.
Development Life Cycle
Let us walk through the development life cycle of Spark applications using PyCharm, with word count and daily revenue as examples.
- Create a new project
- Make sure PyCharm is configured with a Python interpreter that has PySpark
- Externalize properties using ConfigParser
- Create the Spark configuration (SparkConf) and Spark context (SparkContext) objects
- Develop the logic to read the data, process it and save the output back
- Externalize the execution mode, input base directory and output path
- Validate locally using PyCharm and, optionally, spark-submit in local mode
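The externalization and context-creation steps above can be sketched as follows. This is a minimal illustration, not the session's code: the property names, paths and `Daily Revenue` app name are assumed placeholders, and the properties are inlined as a string here so the sketch is self-contained (in a real project they would live in a separate `application.properties` file read with `parser.read()`).

```python
import configparser
import sys

# Externalized properties, one section per environment (dev and prod).
# Inlined here for a self-contained sketch; normally a standalone file.
PROPERTIES = """
[dev]
executionMode = local[*]
input.base.dir = /data/retail_db
output.dir = /tmp/daily_revenue

[prod]
executionMode = yarn
input.base.dir = /public/retail_db
output.dir = /user/training/daily_revenue
"""


def load_conf(env):
    """Return the properties section for the requested environment."""
    parser = configparser.ConfigParser()
    parser.read_string(PROPERTIES)
    return parser[env]


def get_spark_context(env):
    """Build SparkConf/SparkContext from the externalized properties.

    The pyspark import is deferred so the config logic can be tested
    without Spark; this function needs pyspark on the path (e.g. when
    launched via spark-submit).
    """
    from pyspark import SparkConf, SparkContext
    props = load_conf(env)
    conf = SparkConf() \
        .setMaster(props["executionMode"]) \
        .setAppName("Daily Revenue")
    return SparkContext(conf=conf)


def word_count(sc, input_path, output_path):
    """Read the data, process it and save the output back."""
    counts = sc.textFile(input_path) \
        .flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(output_path)


if __name__ == "__main__":
    # Execution mode is passed as a runtime argument (dev or prod).
    env = sys.argv[1] if len(sys.argv) > 1 else "dev"
    props = load_conf(env)
    print(props["executionMode"], props["input.base.dir"])
```

Because the execution mode and paths come from the properties file and a runtime argument, the same code runs unchanged in PyCharm (dev) and on the cluster (prod).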
Deployment Life Cycle
Once the code is developed, we can deploy it on the gateway node of the cluster.
- Ship the folder which contains all the Python files to the gateway node
- Run using the spark-submit command
- Review the capacity of the cluster
- Node Manager capacity
- Default YARN Container -> Spark Executor configuration
- Default Spark Executor configuration – 1 GB, 1 Core
- Here are the different modes in which we will run the application, to understand how each impacts the execution (the demo is done using word count).
- Disable dynamic allocation and run with the defaults (num-executors, executor-memory and executor-cores)
- Understanding executor memory overhead and its impact on executor-memory
- Increasing num-executors, executor-memory and executor-cores
- Dynamic Allocation
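The modes above can be sketched as spark-submit invocations. The script path and resource values below are illustrative, not from the session; the flags themselves (`--num-executors`, `--executor-memory`, `--executor-cores`) and the `spark.dynamicAllocation.*` / `spark.shuffle.service.enabled` properties are standard Spark on YARN options. `spark.executor.memoryOverhead` is the Spark 2.3+ property name (`spark.yarn.executor.memoryOverhead` in earlier releases).

```shell
# Mode 1: disable dynamic allocation and run with the defaults
# (1 core and 1 GB per executor unless overridden).
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  src/main/python/WordCount.py prod

# Modes 2 and 3: override num-executors, executor-memory and
# executor-cores. Note the executor overhead: YARN adds
# spark.executor.memoryOverhead (default max(384 MB, 10% of
# executor memory)) on top of --executor-memory, so each container
# request is larger than the value given here.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  src/main/python/WordCount.py prod

# Mode 4: dynamic allocation, which lets Spark grow and shrink the
# executor count with the workload; on YARN it typically requires the
# external shuffle service to be enabled on the node managers.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  src/main/python/WordCount.py prod
```

Comparing these runs against the node manager capacity reviewed above shows how many containers the cluster can actually grant for each configuration.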
The GitHub repository for the code will be shared after the session is done.
Development and Deployment Life Cycle – Daily Revenue – Externalize Properties
Development and Deployment Life Cycle – Daily Revenue – Validate Locally
Development and Deployment Life Cycle – Daily Revenue – Spark Configuration
Development and Deployment Life Cycle – Word Count – Spark Configuration