
Setup Spark Development Environment – PyCharm and Python

Introduction – Setup Python, PyCharm and Spark on Windows

As part of this blog post, we will see detailed instructions on setting up a development environment for Spark and Python using the PyCharm IDE on Windows.

  • We have used the 64-bit version of Windows 10 for this demo
  • Setup development environment on Windows
  • For each of the sections we will see
    • Why do we need to perform the step?
    • How do we perform the step?
    • How can we validate whether it is working as expected?
  • We will also develop a few programs to validate whether our setup is progressing as expected or not
  • In case you run into any issues, please log those in our forums

Steps required to run Spark based applications using Python:

  • Setup Python
  • Setup PyCharm IDE
  • Setup Spark

Once the above steps are done, we will see how to use PyCharm to develop Spark based applications using Python.

Understanding Pre-requisites

Before setting up the environment, let us understand the prerequisites.

  • A minimum of 4 GB of RAM is required. If the memory is less than 4 GB, it is not recommended to set up the environment, as it will lead to memory related issues
  • 8 GB of RAM is highly desirable
  • Operating System version – 32 bit or 64 bit. It should be 64 bit for our environment
    • Open File Explorer and right click on ‘This PC’ to get the following details
    • The details of RAM: 4.00 GB
    • System type: 64-bit Operating System
  • Once the above details are confirmed, we can go further
  • The Google Chrome browser is recommended to follow the process in detail. Consider setting up Google Chrome if you do not have it already.
  • In case of any issues while setting up the environment:
    • Launch the URL – http://discuss.itversity.com
    • Sign up for an account by clicking on the Sign Up button at the top left of the window
    • You can sign up using Google or Facebook, or by using the local sign up form
    • Once account creation is done, you can log in.
    • You can use the New Topic button to create a new topic and choose the appropriate Category to post the topic under.
    • For example, if you want to troubleshoot issues related to Python, choose the ‘Programming Languages | Python’ Category and create the topic with a meaningful title
    • Enter a detailed description of the issue.
    • This way you can get support for any issues that occur while setting up the environment
    • We will respond and try to resolve the issue

Setup Python on Windows 10

Steps to setup Python on Windows 10:

  • How to Install?
    • Launch Google Chrome and search for Python
    • From www.python.org you can find the Downloads link.
    • Click on the link to launch the download page
  • As we are going to work with Spark, we need to choose the compatible version for Spark
  • For our environment, the Spark version we are using is 1.6.3
  • Spark version 1.6.3 and Spark 2.x are compatible with Python 2.7
  • Make sure you choose Python 2.7.14 for download and click on the link
  • An .msi (Microsoft Installer) file will be downloaded
  • Double click on the file and proceed with the installation steps
    • You can choose ‘Install for all users’, click on Next
    • The installation path can be changed if needed, click on Next
    • Click on Next and click on ‘Yes’ in the User Account Control window.
    • Setup will progress automatically and complete.
    • Click on the Finish button
    • Restart the system once it is finished.
  • After restarting the system, search for Python and click on it
  • It will launch the Python console.
  • How to Validate?
    • Type the command print("Hello World") and hit Enter
    • It will print Hello World to the console output
    • We can confirm that the Python setup is done without any issues
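
For reference, a validation session in the Python console will look something like this (the version banner is abbreviated here and will vary with your build):

Python 2.7.14 ... on win32
>>> print("Hello World")
Hello World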

Configure environment variables for Python on Windows 10

Let us see how to set up environment variables for Python on Windows

  • Why to set up Environment variables?
    • To run the binary files of the setup from any location
    • Environment variables help programs know what directory to install files in, where to store temporary files, where to find user profile settings, and other things
  • How to setup?
    • Go to the folder where Python got installed
    • Find the binary file named ‘python.exe’ (e.g., C:\Python27)
    • Navigate to Environment Variables
    • Edit the ‘PATH’ variable under System variables
    • Add the Python installation folder (C:\Python27) to the list of paths
  • How to validate?
    • Launch the command prompt, type ‘python’ and see that Python is successfully launched
    • As the command prompt now checks the PATH for the Python binaries, it works without any issues
    • Validate it by typing print("Hello World") and make sure that it prints the output Hello World
    • Now the Python environment variable setup is done
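
As a quick one-line check from any directory in the command prompt (the C:\> prompt shown is illustrative):

C:\> python -c "print('Hello World')"
Hello World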

Setup Pycharm on Windows 10

  • Now install PyCharm
  • There are 2 editions of PyCharm – Community Edition and Professional Edition
  • The Community Edition is free, and at times you need to install additional plugins
  • The Professional Edition is paid and supported, and comes with most of the important plugins pre-installed. Also, sets of plugins are bundled together as part of the Professional Edition
  • Unless you have a corporate license, for now consider installing the Community Edition.
  • Why PyCharm?
    • PyCharm is created by JetBrains, a company well known for building IDEs that boost productivity in team development
    • Commonly used tools such as git come out of the box for versioning the code when applications are developed by teams.
  • How to Install?
    • Go to the downloads page and make sure the right version is chosen.
    • Once downloaded, just double click on the installer and follow the typical installation process
  • How to validate?
    • We will develop a program as part of next section to validate.

Develop Python program using PyCharm

We will see how to create our first program using Python.

  • Create a new project
  • Give a name to the project -> gettingstarted
  • It will take some time to create the project

Once done you will see:

  • You will find the ‘gettingstarted’ folder under the project
  • Right click on the ‘gettingstarted’ folder
  • Choose new Python file and name it HelloWorld
  • Type print("Hello World") in the file
  • Right click and run the program
  • You should see Hello World in the console

Let us create a program that uses sys.argv

  • Copy the below code and replace the previous code with it
    import sys
    # sys.argv[0] is the script path; sys.argv[1] is the first argument passed to the program
    print("Hello " + sys.argv[1] + " from " + sys.argv[0])
  • To pass the arguments, navigate to Run in the main menu and select ‘Edit Configurations’
  • Pass the argument in the Parameters input box: world
  • Save it and return to the code window
  • Right click and run the program
  • You should see Hello world from followed by the program path in the console
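
The same program can also be run from the command prompt, which makes the role of sys.argv clearer; this assumes you run it from the directory containing HelloWorld.py:

C:\> python HelloWorld.py world
Hello world from HelloWorld.py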

Make sure the PyCharm setup with Python is done and validated by running the Hello World program. In case of any issues, please log them in our forums.

Download Spark – compressed tar ball

Now let us see the details about setting up Spark on Windows

  • Why to setup Spark?
    • Before deploying on the cluster, it is good practice to test the script using spark-submit.
    • To run using spark-submit locally, it is convenient to have Spark set up on Windows
  • Which version of Spark?
    • We will be using Spark version 1.6.3, which is the stable version as of today
    • Search for Spark 1.6.3 and find the link on the downloads page
    • Choose Spark Release 1.6.3
    • Download Spark with the .tgz file extension
    • The same instructions will work with any Spark version (even Spark 2.3.x)

We will see how to setup Spark and validate in the next section

Install 7z to uncompress and untar on Windows 10

  • Install 7z from here so that we can unzip and untar the Spark tar ball
  • Use the 7z software to unzip and untar the tar ball to complete the Spark setup

Setup Spark on Windows 10 using compressed tar ball

Let us see how to untar the compressed tar ball for the Spark setup

  • Make sure to untar the file to a folder in the location where you want Spark installed
  • Now open the command prompt and go to the Spark directory, then the bin directory, as shown below
  • Run the pyspark command to launch pyspark
  • A warning message may appear if Java is not installed
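
For example, assuming Spark was untarred under C:\spark-1.6.3-bin-hadoop2.7 (the same location assumed in the environment variables section below), the commands would be:

cd C:\spark-1.6.3-bin-hadoop2.7\bin
pyspark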

Let us see further steps in the next section

Setup JDK 1.8 on Windows 10 and configure environment variables

Let us see how to setup Java and the JDK on a Windows PC

  • Before getting started, check whether Java and the JDK are installed or not
    • Launch the command prompt – go to the search bar on your Windows laptop, type cmd and hit Enter
    • Type java -version. If it returns a version, check whether it is 1.8 or not. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8 (search for installed programs and uninstall Java)
    • Type javac -version. If it returns a version, check whether it is 1.8 or not. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8 (search for installed programs and uninstall Java)
    • If you need other versions, make sure the environment variables point to 1.8
    • If you do not have Java at all, make sure to follow the instructions and install version 1.8 of the JRE and JDK.
  • Why do we need to install Java and JDK? Spark and many other technologies require Java and the JDK to develop and build applications.
  • How to install Java and JDK? Download the JDK 1.8 installer for Windows from Oracle’s website, run it, and follow the typical installation process.
  • How to validate?
    • Use the java -version and javac -version commands in the command prompt and see whether they return 1.8 or not
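
The output will look something like the below (the exact update number shown is illustrative; only the major version 1.8 matters):

C:\> java -version
java version "1.8.0_151"
C:\> javac -version
javac 1.8.0_151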

Configure environment variables for Spark

  • Why to setup Environment Variables? To be able to run spark-submit and pyspark from anywhere on the PC.
  • How to configure Environment Variables?
    • Let us assume that Spark is setup under C:\spark-1.6.3-bin-hadoop2.7
    • Setup new environment variable SPARK_HOME
      • Search for Environment Variables on Windows search bar
      • Click on Add Environment Variables
      • There will be 2 categories of environment variables
        • User Variables on top
        • System Variables on bottom
        • Make sure to click on Add for System Variables
        • Name: SPARK_HOME
        • Value: C:\spark-1.6.3-bin-hadoop2.7 (don’t include bin)
      • Also choose Path and click on Edit
        • Click on Add
        • Add new entry %SPARK_HOME%\bin
  • How to validate?
    • Go to any directory and run pyspark
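
Once pyspark launches, you can also double check the version from the SparkContext (sc) that the shell creates for you; under Python 2 the string may print with a u prefix:

>>> sc.version
u'1.6.3'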

Setup winutils to integrate HDFS APIs with Windows

  • Why to install winutils?
    • In the process of building data processing applications using Spark, we need to read data from files
    • Spark uses the HDFS APIs to read files from several file systems like HDFS, S3, local etc.
    • For the HDFS APIs to work on Windows, we need to have winutils
  • How to install winutils?
    • Click here to download the 64 bit winutils.exe
    • Create a directory structure like C:\hadoop\bin and place winutils.exe in it
    • Setup new environment variable HADOOP_HOME
      • Search for Environment Variables on Windows search bar
      • Click on Add Environment Variables
      • There will be 2 categories of environment variables
        • User Variables on top
        • System Variables on bottom
        • Make sure to click on Add for System Variables
        • Name: HADOOP_HOME
        • Value: C:\hadoop (don’t include bin)
      • Also choose Path and click on Edit
        • Click on Add
        • Add new entry %HADOOP_HOME%\bin
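  • How to validate?
    • Relaunch pyspark from a new command prompt and read a local file through the HDFS API; if winutils is not set up correctly, you will typically see an error saying that winutils.exe could not be located. The file path below is illustrative – use any text file you have:

>>> sc.textFile("C:\\deckofcards.txt").count()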

Develop pyspark program using Pycharm on Windows 10

We will see the steps to execute a pyspark program in PyCharm

  • How to set up Spark for PyCharm?
    • Launch the PyCharm IDE
    • Select the project ‘gettingstarted’
    • Go to the main menu and select Settings from File
    • Go to project: gettingstarted
    • Expand the link and select Project Interpreter
    • Make sure that the Python version is 2.7
    • Navigate to Project Structure -> Click on ‘Add Content Root’ -> Go to the folder where Spark is set up -> Select the python folder
    • Again click on Add Content Root -> Go to the Spark folder -> expand python -> expand lib -> select py4j-0.9-src.zip, apply the changes and wait for the indexing to be done
    • Return to the project window
  • How to develop?
    • Select the project, create a new Python file and name it -> sparkdemo.py
    • Copy the below code and place it in the file
    • Execute the code by clicking on Run
from pyspark import SparkConf, SparkContext

# create a SparkContext that runs locally
sc = SparkContext(master="local", appName="Spark Demo")

# read the file using the HDFS APIs (this is where winutils comes in) and print the first line
print(sc.textFile("C:\\deckofcards.txt").first())

The output will be:
"BLACK|SPADE|2"

Setup Ubuntu using Windows subsystem for Linux

Now let us see how we can setup Ubuntu on Windows 10

  • Why to setup Ubuntu?
    • Windows is not completely foolproof for running Spark jobs.
    • Ubuntu is a better alternative and you will run into fewer issues
    • Using Windows Subsystem for Linux, we can quickly set up an Ubuntu virtual machine
  • How to setup Ubuntu using Windows Subsystem for Linux?
    • Follow this link to setup Ubuntu using Windows Subsystem for Linux
    • Complete the setup process by giving a username for the Ubuntu virtual machine

Accessing C Drive using Ubuntu built using Windows subsystem for Linux

  • It is good to understand how we can access the C drive from Ubuntu built using Windows Subsystem for Linux
  • It will enable us to access files on the C drive
  • In Linux, the root file system starts with / and there are no drive letters like C
  • The C drive is mounted at /mnt/c
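
For example, to list the contents of the C drive and to go to a Spark setup that lives there (the Spark path matches the one assumed in the environment variables section below):

ls /mnt/c
cd /mnt/c/spark-2.3.0-bin-hadoop2.7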

Setup Java and JDK on Ubuntu

  • Before getting started, check whether Java and the JDK are installed or not
    • Launch the Ubuntu terminal
    • Type java -version. If it returns a version, check whether it is 1.8 or not. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8
    • Type javac -version. If it returns a version, check whether it is 1.8 or not. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8
    • If you need other versions, make sure the environment variables point to 1.8
    • If you do not have Java at all, make sure to follow the instructions and install version 1.8 of the JRE and JDK.
  • Why do we need to install Java and JDK? Scala, Spark and many other technologies require Java and the JDK to develop and build applications. Scala is a JVM based programming language.
  • How to install Java and JDK on Ubuntu?
# add the webupd8team PPA, which provides the Oracle Java 8 installer package
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
  • How to validate?
    • Use the java -version and javac -version commands in the terminal and see whether they return 1.8 or not

Download and Untar Spark

Now let us see the details about setting up Spark on Ubuntu, any other Linux flavor, or Mac.

  • Why to setup Spark?
    • Before deploying on the cluster, it is good practice to test the script using spark-submit.
    • To run using spark-submit locally, it is convenient to have Spark set up locally
  • How to setup Spark?
    • In this video we will be using Spark 2.3.0 as an example. You can follow the same steps with earlier versions as well.
    • Download the Spark 2.3 tar ball by going here. We can use wget to download the tar ball (see the example after this list).
      • Choose Spark Release: 2.3.0
      • Choose a package type: Pre-built for Hadoop 2.7 or later
      • It gives the appropriate link pointing to a mirror
      • Click on it to go to the mirror, then click again to download
      • Use the tar xzf command to untar and unzip the tar ball – tar xzf spark-2.3.0-bin-hadoop2.7.tgz
  • We need to configure environment variables to be able to run Spark from anywhere
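
For example, a download and untar session using the archive.apache.org mirror would look like the below (the URL was valid at the time of writing; the mirror suggested on the downloads page may differ):

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar xzf spark-2.3.0-bin-hadoop2.7.tgz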

Setup Environment Variables – Mac or Linux

Let us see how we can configure the environment variables for Spark

  • Why to setup Environment Variables? To be able to run spark-submit and spark-shell from anywhere on the PC.
  • How to configure Environment Variables?
    • Let us assume that Spark is setup under
      • /Users/itversity/spark-2.3.0-bin-hadoop2.7 on Mac
      • /mnt/c/spark-2.3.0-bin-hadoop2.7 on Ubuntu built using Windows Subsystem for Linux
    • Setup a new environment variable SPARK_HOME and update PATH
    • Make sure to restart the terminal (no need to reboot the machine)
# On Mac - add to .bash_profile
export SPARK_HOME=/Users/itversity/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# On Ubuntu built using Windows Subsystem for Linux - add to .profile
export SPARK_HOME=/mnt/c/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
  • How to validate?
    • The example in the video uses spark-shell and Scala based code. Instead of the code demonstrated in the video, try the code below to make sure pyspark is working as expected.
    • Go to any directory and run pyspark
# on Ubuntu built using Windows Subsystem for Linux, the C drive is mounted under /mnt/c
orderItems = sc.textFile("/mnt/c/data/retail_db/order_items")
# extract (order_id, order_item_subtotal) pairs, add up the subtotals per order,
# then format each record as "order_id,revenue"
revenuePerOrder = orderItems. \
    map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4]))). \
    reduceByKey(lambda curr, next: curr + next). \
    map(lambda oi: str(oi[0]) + "," + str(oi[1]))
for i in revenuePerOrder.take(10): print(i)

The below video is demonstrated using spark-shell with Scala based code. Instead of that, you can use the above code after launching pyspark to validate the installation.
