Introduction – Setup Python, PyCharm and Spark on Windows
In this blog post we will go through detailed instructions for setting up a development environment for Spark and Python on Windows using the PyCharm IDE.
- We have used the 64-bit version of Windows 10 for this demo
- Setup development environment on Windows
- For each section we will see
- Why do we need to perform the step?
- How do we perform the step?
- How can we validate whether it is working as expected?
- We will also develop a few programs to validate whether our setup is progressing as expected
- In case you run into any issues, please log those in our forums
- Click here for the coupons for our content. Our training approach is certification-oriented.
- Click here to go to our state-of-the-art lab to practice Spark hands-on for a more realistic experience
Steps required to run Spark based applications using Python:
- Setup Python
- Setup PyCharm IDE
- Setup Spark
Once the above steps are done we will see how to use PyCharm to develop Spark based applications using Python.
Understanding Pre-requisites
Before setting up the environment, let us have an understanding of the prerequisites.
- A minimum of 4 GB of RAM is required. If the memory is less than 4 GB, it is not recommended to set up the environment as it will lead to memory-related issues
- 8 GB of RAM is highly desirable
- Operating system version – 32 bit or 64 bit. It should be 64 bit for our environment
- Open File Explorer and right click on ‘This PC’ to get the following details
- Installed RAM: 4.00 GB (or more)
- System type: 64-bit Operating System
- Once the above details are confirmed, we can go further
- The Google Chrome browser is recommended to follow the process in detail. Set up Google Chrome if you do not have it.
- In case of any issues while setting up the environment:
- Launch the url – http://discuss.itversity.com
- Sign up for an account by clicking on the Sign Up button at the top left of the window
- You can sign up using Google or Facebook, or by using the local sign-up form
- Once account creation is done, you can log in.
- You can use the New Topic button to create a new topic and choose the appropriate category to post the topic under.
- For example, if you want to troubleshoot an issue related to Python, choose the ‘Programming Languages | Python’ category and create the topic with a meaningful title
- Enter a detailed description of the issue.
- This way you can get support if any issues occur while setting up the environment
- We will respond and try to resolve the issue
Setup Python on Windows 10
Steps to setup Python on Windows 10:
- How to Install?
- Launch Google Chrome and search for Python
- From www.python.org you can find the Downloads link.
- Click on the link to launch the download page
- As we are going to work with Spark, we need to choose the compatible version for Spark
- For our environment, the Spark version we are using is 1.6.3
- Spark version 1.6.3 and Spark 2.x are compatible with Python 2.7
- Make sure you choose Python 2.7.14 for download and click on the link
- A .msi (Microsoft Installer) file will be downloaded
- Double click on the file and proceed with the installation steps
- You can choose ‘Install for all users’, then click on Next
- The installation path can be changed if you want. Click on Next
- Click on Next and click on ‘Yes’ at the User Account Control window
- Setup will progress automatically until it is done
- Click on the Finish button
- Restart the system once it is finished
- After restarting the system, search for Python and click on it
- It will launch the Python console
- How to Validate?
- Type the command
print("Hello World")
and press Enter. It will print Hello World at the console output
- We can confirm that the Python setup is done without any issues
Configure environment variables for Python on Windows 10
Let us see how to set up environment variables for Python in Windows
- Why set up environment variables?
- To run the binary files of the setup from any location
- Environment variables help programs know which directory to install files in, where to store temporary files, where to find user profile settings, and so on
- How to setup?
- Go to the folder where Python was installed
- Find the binary file named python.exe (e.g., C:\Python27)
- Navigate to Environment Variables
- Edit the ‘PATH’ variable under System variables
- Add the Python folder to the list of paths
- How to validate?
- Launch the command prompt, type python, and see that Python launches successfully
- As the command prompt now checks the PATH for the Python binaries, it works without any issues
- Validate it by typing print("Hello World") and make sure that it prints the output Hello World
- Now the Python environment variable setup is done
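As an additional check (a minimal sketch; the install location shown is an assumption based on the default path used earlier), you can confirm from within Python which interpreter the PATH resolves to:

import sys
# should report 2.7.x if the PATH points to the new installation
print(sys.version)
# should point to python.exe under C:\Python27 (assuming the default location)
print(sys.executable)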
Setup PyCharm on Windows 10
- Now install PyCharm
- There are two editions of PyCharm: the Community edition and the Enterprise edition
- The Community edition is free, and at times you need to install additional plugins
- The Enterprise edition is paid and supported; most of the important plugins come pre-installed and bundled with it
- Unless you have a corporate license, for now consider installing the Community edition.
- Why PyCharm?
- PyCharm is created by JetBrains, which is very popular for building IDEs that boost productivity in team development
- Commonly used tools such as Git come out of the box for versioning the code while teams develop applications.
- How to Install?
- Go to the downloads page and make sure the right version is chosen.
- Once downloaded, just double click on the installable and follow the typical installation process
- How to validate?
- We will develop a program as part of next section to validate.
Develop Python program using PyCharm
We will see how to create our first program using Python.
- Create New project
- Give a name to the project -> gettingstarted
- It will take some time to create the project
Once done you will see:
- You will find the ‘gettingstarted’ folder under the project
- Right click on the ‘gettingstarted’ folder
- Choose new Python file and name it HelloWorld
- Type print("Hello World") in the file
- Right click and run the program
- You should see Hello World in the console
Let us create a program that uses sys.argv
- Copy the below code and replace the previous code with it
import sys
print("Hello " + sys.argv[1] + " from " + sys.argv[0])
- To pass the arguments, navigate to Run in the main menu and select ‘Edit Configurations’
- Pass the parameter in the Parameters input box: world
- Save it and return to the code window
- Right click and run the program
- You should see Hello world from followed by the program name in the console
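Note that if you run the program without configuring any parameters, sys.argv[1] will raise an IndexError. A small defensive variant (a sketch, not part of the original program) falls back to a default name:

import sys

# fall back to a default when no argument is passed
name = sys.argv[1] if len(sys.argv) > 1 else "World"
print("Hello " + name + " from " + sys.argv[0])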
Make sure the PyCharm setup with Python is done and validated by running the Hello World program. In case of any issues, please log them in our forums.
Download Spark – compressed tar ball
Now let us see the details about setting up Spark on Windows
- Why to setup Spark?
- Before deploying on the cluster, it is good practice to test the script using spark-submit.
- To run using spark-submit locally, it is nice to set up Spark on Windows
- Which version of Spark?
- We will be using Spark version 1.6.3 which is the stable version as of today
- Search for spark 1.6.3 and find the link from downloads
- Choose Spark Release 1.6.3
- Download Spark with .tgz file extension
- The same instructions will work with any Spark version (even Spark 2.3.x)
We will see how to setup Spark and validate in the next section
Install 7z to uncompress and untar on Windows 10
- Install 7z from here so that we can unzip and untar the Spark tar ball
- Use the 7z software to unzip and untar the downloaded file to complete the Spark setup
Setup Spark on Windows 10 using compressed tar ball
Let us see how to untar the compressed tar ball for the Spark setup
- Make sure to untar the file to a folder in the location where you want to install Spark
- Now open the command prompt and go to the Spark directory -> bin directory
- Run the pyspark command to launch pyspark (see the sanity check below)
- A warning message may appear if Java is not installed
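Once pyspark launches successfully, a quick sanity check from within the shell (a minimal sketch, assuming the shell exposes the SparkContext as sc, which it does by default):

# inside the pyspark shell: build a small RDD and count it
sc.parallelize(range(100)).count()
# should return 100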
Let us see further steps in the next section
Setup JDK 1.8 on Windows 10 and configure environment variables
Let us see how to set up Java and the JDK on a Windows PC
- Before getting started, check whether Java and the JDK are installed or not
- Launch the command prompt – go to the search bar on your Windows laptop, type cmd and hit Enter
- Type
java -version
If it returns a version, check whether it is 1.8. It is better to have version 1.8; if you have another version, consider uninstalling it and installing 1.8 (search for installed programs and uninstall Java)
- Type
javac -version
and perform the same check
- If you need other versions as well, make sure the environment variables point to 1.8
- If you do not have Java at all, make sure to follow the instructions and install the 1.8 version of the JRE and JDK.
- Why do we need to install Java and the JDK? Spark and many other technologies require Java and the JDK to develop and build applications.
- How to install Java and JDK?
- Go to the official Oracle page where the downloads are available
- Accept the terms and download the 64-bit version
- How to validate?
- Use the
java -version
and
javac -version
commands in the command prompt and check whether they return 1.8
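If you prefer to script this check, here is a minimal sketch in Python that shells out to both commands (it assumes java and javac are already on the PATH; note that both print their version to stderr):

import subprocess

# java and javac write their version to stderr, so redirect it to stdout
for cmd in ["java", "javac"]:
    out = subprocess.check_output([cmd, "-version"], stderr=subprocess.STDOUT)
    # the first line should mention version 1.8
    print(out.splitlines()[0])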
Configure environment variables for Spark
- Why set up environment variables? To run spark-submit and pyspark from anywhere on the PC.
- How to configure Environment Variables?
- Let us assume that Spark is setup under C:\spark-1.6.3-bin-hadoop2.7
- Setup new environment variable SPARK_HOME
- Search for Environment Variables on Windows search bar
- Click on Add Environment Variables
- There will be 2 categories of environment variables
- User Variables on top
- System Variables on bottom
- Make sure to click on Add for System Variables
- Name: SPARK_HOME
- Value: C:\spark-1.6.3-bin-hadoop2.7 (don’t include bin)
- Also choose Path and click on Edit
- Click on Add
- Add new entry %SPARK_HOME%\bin
- How to validate?
- Go to any directory and run pyspark
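Once pyspark comes up, you can also confirm the version from within the shell (a quick check, assuming the SparkContext sc is available):

# inside the pyspark shell
sc.version
# should print 1.6.3 for this setup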
Setup winutils to integrate HDFS APIs with Windows
- Why to install winutils?
- In the process of building data processing applications using Spark, we need to read data from files
- Spark uses the HDFS API to read files from several file systems like HDFS, S3, local, etc.
- For HDFS APIs to work on Windows, we need to have WinUtils
- How to install winutils?
- Click here to download the 64-bit winutils.exe
- Create a directory structure like this
C:\hadoop\bin
- Setup new environment variable HADOOP_HOME
- Search for Environment Variables on Windows search bar
- Click on Add Environment Variables
- There will be 2 categories of environment variables
- User Variables on top
- System Variables on bottom
- Make sure to click on Add for System Variables
- Name: HADOOP_HOME
- Value: C:\hadoop (don’t include bin)
- Also choose Path and click on Edit
- Click on Add
- Add new entry %HADOOP_HOME%\bin
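To double-check the winutils setup from Python (a minimal sketch; it assumes HADOOP_HOME was set as described above and that you launch Python from a fresh command prompt so the new variable is picked up):

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
print(hadoop_home)  # expect C:\hadoop
# winutils.exe should exist under the bin directory
print(os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")))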
Develop pyspark program using PyCharm on Windows 10
We will see the steps to execute a pyspark program in PyCharm
- How to set up Spark for PyCharm?
- Launch the PyCharm IDE
- Select the project ‘gettingstarted’
- Go to the main menu and select Settings under File
- Go to Project: gettingstarted
- Expand it and select Project Interpreter
- Make sure that the Python version is 2.7
- Navigate to Project Structure -> Click on ‘Add Content Root’ -> Go to folder where Spark is setup -> Select python folder
- Again click on Add Content Root -> Go to Spark Folder -> expand python -> expand lib -> select py4j-0.9-src.zip and apply the changes and wait for the indexing to be done
- Return to Project window
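Adding the content roots is what lets PyCharm find Spark’s Python sources. Equivalently, a script can put the same two locations on sys.path itself; this is only a sketch, and the spark_home value is an assumption based on the install location used earlier (adjust it to your setup):

import os
import sys

# assumption: Spark was extracted to this folder
spark_home = "C:\\spark-1.6.3-bin-hadoop2.7"
sys.path.append(os.path.join(spark_home, "python"))
sys.path.append(os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"))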
- How to develop?
- Select the project and create a new Python file named sparkdemo.py
- Copy the below code and place it in the file
- Execute the code by clicking on Run
from pyspark import SparkConf, SparkContext

sc = SparkContext(master="local", appName="Spark Demo")
print(sc.textFile("C:\\deckofcards.txt").first())

The output will be: "BLACK|SPADE|2"
Setup Ubuntu using Windows subsystem for Linux
Now let us see how we can setup Ubuntu on Windows 10
- Why to setup Ubuntu?
- Windows is not completely foolproof for running Spark jobs.
- Ubuntu is a better alternative and you will run into fewer issues
- Using Windows subsystem for Linux we can quickly set up Ubuntu virtual machine
- How to setup Ubuntu using Windows subsystem for Linux?
- Follow this link to setup Ubuntu using Windows subsystem for Linux
- Complete the setup process by giving username for the Ubuntu virtual machine
Accessing C Drive using Ubuntu built using Windows subsystem for Linux
- It is good to understand how we can access the C drive from Ubuntu built using the Windows subsystem for Linux
- It lets us access files on the C drive
- In Linux, the root file system starts with / and there are no drive letters such as C:
- The C drive is mounted at
/mnt/c
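You can verify the mount from Python inside Ubuntu (a quick sanity check):

import os
# should list the top-level folders of the Windows C drive
print(os.listdir("/mnt/c"))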
Setup Java and JDK on Ubuntu
- Before getting started, check whether Java and the JDK are installed or not
- Launch the Ubuntu terminal
- Type
java -version
If it returns a version, check whether it is 1.8. It is better to have version 1.8; if you have another version, consider uninstalling it and installing 1.8
- Type
javac -version
and perform the same check
- If you need other versions as well, make sure the environment variables point to 1.8
- If you do not have Java at all, make sure to follow the instructions and install the 1.8 version of the JRE and JDK.
- Why do we need to install Java and the JDK? Scala, Spark and many other technologies require Java and the JDK to develop and build applications. Scala is a JVM-based programming language.
- How to install Java and JDK on Ubuntu?
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
- How to validate?
- Use the
java -version
and
javac -version
commands in the terminal and check whether they return 1.8
Download and Untar Spark
Now let us see the details about setting up Spark on Ubuntu or any Linux flavor or Mac.
- Why to setup Spark?
- Before deploying on the cluster, it is good practice to test the script using spark-submit.
- To run using spark-submit locally, it is nice to set up Spark on the local machine
- How to setup Spark?
- In this video we will be using Spark 2.3.0 as an example. You can follow the same steps with earlier versions as well.
- Download the Spark 2.3 tar ball by going here. We can use wget to download the tar ball.
- Choose Spark Release: 2.3.0
- Choose a package type: Pre-built for Hadoop 2.7 or later
- It gives the appropriate link pointing to mirror
- Click on it to go to the mirror, then click the link there to download
- Use the tar xzf command to untar and unzip the tar ball –
tar xzf spark-2.3.0-bin-hadoop2.7.tgz
- We need to configure environment variables to run Spark from anywhere
Setup Environment Variables – Mac or Linux
Let us see how we can configure the environment variables for Spark
- Why set up environment variables? To run spark-submit and spark-shell from anywhere on the PC.
- How to configure Environment Variables?
- Let us assume that Spark is setup under
- /Users/itversity/spark-2.3.0-bin-hadoop2.7 on Mac
- /mnt/c/spark-2.3.0-bin-hadoop2.7 on Ubuntu built using Windows subsystem
- Setup new environment variable SPARK_HOME and update PATH
- Make sure to restart terminal (no need to reboot the machine)
# On Mac - .bash_profile
export SPARK_HOME=/Users/itversity/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# On Ubuntu built using Windows subsystem for Linux - .profile
export SPARK_HOME=/mnt/c/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
- How to validate?
- The example in the video uses spark-shell and Scala-based code. Instead of the code demonstrated in the video, try the below code to make sure pyspark is working as expected.
- Go to any directory and run
pyspark
# adjust the path to wherever the retail_db data set is available
orderItems = sc.textFile("/mnt/c/data/retail_db/order_items")
revenuePerOrder = orderItems. \
    map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4]))). \
    reduceByKey(lambda curr, next: curr + next). \
    map(lambda oi: str(oi[0]) + "," + str(oi[1]))

for i in revenuePerOrder.take(10):
    print(i)
The below video is demonstrated using spark-shell with Scala-based code. Instead, you can use the above code after launching pyspark to validate the installation.