Now let us understand the development life cycle of Spark applications using PyCharm, as well as the deployment life cycle. As part of the deployment life cycle, we will see how to control runtime behavior.
Development Life Cycle
Let us walk through the development life cycle of Spark applications using PyCharm, with word count and daily revenue as examples.
- Create a new project
- Make sure PyCharm is configured with a Python interpreter that has PySpark
- Externalize properties using ConfigParser
- Create the Spark configuration (SparkConf) and Spark context (SparkContext) objects
- Develop the logic to read the data, process it and save the output back
- Externalize the execution mode, input base directory and output path
- Validate locally using PyCharm and, optionally, spark-submit in local mode
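The externalization and context-creation steps above can be sketched as follows. This is a minimal illustration, not the session's code: the property names, paths and `Daily Revenue` app name are assumed placeholders, and the properties are inlined as a string here so the sketch is self-contained (in a real project they would live in a separate `application.properties` file read with `parser.read()`).

```python
import configparser
import sys

# Externalized properties, one section per environment (dev and prod).
# Inlined here for a self-contained sketch; normally a standalone file.
PROPERTIES = """
[dev]
executionMode = local[*]
input.base.dir = /data/retail_db
output.dir = /tmp/daily_revenue

[prod]
executionMode = yarn
input.base.dir = /public/retail_db
output.dir = /user/training/daily_revenue
"""


def load_conf(env):
    """Return the properties section for the requested environment."""
    parser = configparser.ConfigParser()
    parser.read_string(PROPERTIES)
    return parser[env]


def get_spark_context(env):
    """Build SparkConf/SparkContext from the externalized properties.

    The pyspark import is deferred so the config logic can be tested
    without Spark; this function needs pyspark on the path (e.g. when
    launched via spark-submit).
    """
    from pyspark import SparkConf, SparkContext
    props = load_conf(env)
    conf = SparkConf() \
        .setMaster(props["executionMode"]) \
        .setAppName("Daily Revenue")
    return SparkContext(conf=conf)


def word_count(sc, input_path, output_path):
    """Read the data, process it and save the output back."""
    counts = sc.textFile(input_path) \
        .flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(output_path)


if __name__ == "__main__":
    # Execution mode is passed as a runtime argument (dev or prod).
    env = sys.argv[1] if len(sys.argv) > 1 else "dev"
    props = load_conf(env)
    print(props["executionMode"], props["input.base.dir"])
```

Because the execution mode and paths come from the properties file and a runtime argument, the same code runs unchanged in PyCharm (dev) and on the cluster (prod).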
Deployment Life Cycle
Once the code is developed, we can deploy it on the gateway node of the cluster.
- Ship the folder which contains all the Python files to the gateway node
- Run using the spark-submit command
- Review the capacity of the cluster
- Node Manager capacity
- Default YARN Container -> Spark Executor configuration
- Default Spark Executor configuration – 1 GB, 1 Core
- Here are the different modes in which we will run the application, to understand how each impacts the execution (the demo is done using word count).
- Disable dynamic allocation and run with the defaults (num-executors, executor-memory and executor-cores)
- Understanding executor memory overhead and its impact on executor-memory
- Increasing num-executors, executor-memory and executor-cores
- Dynamic Allocation
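The modes above can be sketched as spark-submit invocations. The script path and resource values below are illustrative, not from the session; the flags themselves (`--num-executors`, `--executor-memory`, `--executor-cores`) and the `spark.dynamicAllocation.*` / `spark.shuffle.service.enabled` properties are standard Spark on YARN options. `spark.executor.memoryOverhead` is the Spark 2.3+ property name (`spark.yarn.executor.memoryOverhead` in earlier releases).

```shell
# Mode 1: disable dynamic allocation and run with the defaults
# (1 core and 1 GB per executor unless overridden).
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  src/main/python/WordCount.py prod

# Modes 2 and 3: override num-executors, executor-memory and
# executor-cores. Note the executor overhead: YARN adds
# spark.executor.memoryOverhead (default max(384 MB, 10% of
# executor memory)) on top of --executor-memory, so each container
# request is larger than the value given here.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  src/main/python/WordCount.py prod

# Mode 4: dynamic allocation, which lets Spark grow and shrink the
# executor count with the workload; on YARN it typically requires the
# external shuffle service to be enabled on the node managers.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  src/main/python/WordCount.py prod
```

Comparing these runs against the node manager capacity reviewed above shows how many containers the cluster can actually grant for each configuration.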
The GitHub repository for the code will be shared after the session is done.
Development and Deployment Life Cycle – Daily Revenue – Externalize Properties
Development and Deployment Life Cycle – Daily Revenue – Validate Locally
Development and Deployment Life Cycle – Daily Revenue – Spark Configuration
Development and Deployment Life Cycle – Word Count – Spark Configuration