As part of this blog post, we will go through detailed instructions for setting up a development environment for Spark and Hadoop application development on Windows.
We have used the 64-bit version of Windows 10 for this demo.
For each of the sections, we will see:
Why do we need to perform the step?
How to perform the step?
How can we validate whether it is working as expected?
We will also develop a few programs to validate whether our setup is progressing as expected.
In case you run into any issues, please log those in our forums
Setup Development environment on Windows
We are considering a fresh Windows laptop. We will start with Java/JDK on the Windows laptop and go through step-by-step instructions to set up Scala, sbt, WinUtils, etc.
We will use IntelliJ for integrated development:
Typically, programming will be done with IDEs such as IntelliJ.
IDEs are typically integrated with other tools such as git, which is a code versioning tool. Tools like git facilitate team development.
sbt is the build tool for Scala. Once applications are developed using the IDE, they are typically built using tools like sbt.
WinUtils is required for HDFS APIs to work on a Windows laptop.
Unless Java is set up and validated successfully, do not go further. If you need our support, please log the issues in our forums.
Setup Java and JDK
Before getting started, check whether Java and JDK are installed:
Launch the command prompt – go to the search bar on the Windows laptop, type cmd, and hit enter.
Type java -version. If it returns a version, check whether it is 1.8. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8 (search for installed programs and uninstall Java).
Type javac -version. If it returns a version, check whether it is 1.8. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8 (search for installed programs and uninstall Java).
If you need other versions as well, make sure the environment variables point to 1.8.
If you do not have Java at all, make sure to follow the instructions and install version 1.8 of the JRE and JDK.
Why do we need to install Java and JDK? Scala, Spark, and many other technologies require Java and the JDK to develop and build applications. Scala is a JVM-based programming language.
Use the java -version and javac -version commands in the command prompt and check whether they return 1.8.
Setup Scala with IntelliJ
Now install IntelliJ
There are two editions of IntelliJ: the Community Edition and the Ultimate (enterprise) Edition.
The Community Edition is free, and at times you need to install additional plugins.
The Ultimate Edition is paid and supported, and comes with most of the important plugins pre-installed and bundled together.
Unless you have a corporate license, for now, consider installing the Community Edition.
Why IntelliJ?
IntelliJ is created by JetBrains, and it is a very popular IDE that boosts productivity in team development.
Scala and sbt can be added as plugins in IntelliJ.
Commonly used tools such as git come out of the box for versioning code during team application development.
How to Install?
Go to the downloads page and make sure the right version is chosen.
Once downloaded, just double-click on the installer and follow the typical installation process.
How to validate?
We will develop a program as part of the next section to validate.
Develop Hello World Program
We will see how to create our first Scala program as an sbt project.
Click on New Project
The first time, Java is selected by default. Make sure to choose Scala and then sbt.
Give a name to the project -> spark2demo
Choose right version of Scala -> 2.11.12
Choose right version of sbt -> 0.13
It will take some time to set up the project.
Once done, you will see:
the src directory with the structure src/main/scala
src/main/scala is the base directory for Scala code
build.sbt under the project root, which contains:
name – name of the project
version – project version (0.1)
scalaVersion – Scala version (2.11.12)
name := "spark2demo"
version := "0.1"
scalaVersion := "2.11.12"
Steps to develop HelloWorld program
Right click on src/main/scala
Choose Scala Class
Give the name as HelloWorld and change the kind to object
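The body of the object can be a minimal Hello World program like the following (a simple sketch; any object that prints Hello World will do):
object HelloWorld {
  def main(args: Array[String]): Unit = {
    // print the message so we can confirm the Scala setup works
    println("Hello World")
  }
}
Right-click inside the editor and choose Run 'HelloWorld' – you should see Hello World printed in the console.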
Make sure the IntelliJ setup with Scala is done and validated by running the HelloWorld program. In case of any issues, please log them in our forums.
Setup sbt and run application
Once the application is developed, we need to build a jar file and migrate it to higher environments. sbt is the build tool typically used for Scala-based projects.
Why sbt?
To build Scala-based applications into a jar file
To validate the jar file and make sure the program runs fine
How to setup sbt?
Set up sbt by downloading the relevant installer from this link
For Windows, use the Microsoft Installer (msi)
How to validate sbt?
Copy the path by right-clicking the project in IntelliJ
Go to the command prompt and cd to that path
Check the directory structure; you should see:
src directory
build.sbt
Run sbt package
It will build the jar file, and you will see its path
Run the program by using the sbt run command
You should see Hello World printed on the console
Add Spark dependencies to the application
Now that we are done validating IntelliJ, Scala, and sbt by developing and running a program, we are ready to integrate Spark and start developing Scala-based applications using Spark APIs.
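For reference, a complete build.sbt with Spark dependencies could look like the following (a sketch assuming Spark 2.3.0 built for Scala 2.11; adjust the version to match the Spark release you download):
name := "spark2demo"
version := "0.1"
scalaVersion := "2.11.12"
// Spark core and SQL modules, resolved from Maven Central by sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
After updating build.sbt, refresh the sbt project in IntelliJ (or run sbt compile) so that the dependencies are downloaded. To get spark-shell and spark-submit working locally, we also need to download the Spark binaries: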
Download Spark by choosing Spark Release 2.3.0 and the package type: Pre-built for Hadoop 2.7 or later
It gives the appropriate link pointing to a mirror
Click on it to go to the mirror, and click on the link there to download
Use 7z software to unzip and untar the downloaded file to complete the setup of Spark
We need to configure environment variables to run Spark from anywhere
Keep in mind that Spark is not very well supported on Windows; later we will see how to set it up on Ubuntu using the Windows Subsystem for Linux.
Configure environment variables for Spark
Let us see how we can configure the environment variables for Spark.
Why set up environment variables? To run spark-submit and spark-shell from anywhere on the PC using the jar file.
How to configure Environment Variables?
Let us assume that Spark is set up under C:\spark-2.3.0-bin-hadoop2.7
Setup new environment variable SPARK_HOME
Search for Environment Variables in the Windows search bar
Click on Add Environment Variables
There will be 2 categories of environment variables
User Variables on top
System Variables on bottom
Make sure to click on Add for System Variables
Name: SPARK_HOME
Value: C:\spark-2.3.0-bin-hadoop2.7 (don't include bin)
Also choose Path and click on Edit
Click on Add
Add new entry %SPARK_HOME%\bin
How to validate?
Go to any directory and run spark-shell
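Once spark-shell comes up, you can run a quick sanity check such as the following (a minimal example, not from the original post):
// inside spark-shell: spark (SparkSession) and sc (SparkContext) are pre-created
spark.range(100).count()        // should return 100
sc.parallelize(1 to 100).sum()  // should return 5050.0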
Setup Ubuntu using Windows subsystem for Linux
Now let us see how we can set up Ubuntu on Windows 10.
Why to setup Ubuntu?
Windows is not completely foolproof for running Spark jobs.
Ubuntu is a better alternative, and you will run into fewer issues.
Using the Windows Subsystem for Linux, we can quickly set up an Ubuntu virtual machine.
How to setup Ubuntu using Windows subsystem for Linux?
Follow this link to set up Ubuntu using the Windows Subsystem for Linux
Complete the setup process by giving a username for the Ubuntu virtual machine
Accessing C Drive using Ubuntu built using Windows subsystem for Linux
It is good to understand how we can access the C drive in Ubuntu built using the Windows Subsystem for Linux.
It will make it easier to access files on the C drive.
In Linux, the root file system starts with / and does not have drive letters like C.
The location of the C drive is /mnt/c.
Setup Java and JDK on Ubuntu
Before getting started, check whether Java and JDK are installed on Ubuntu:
Launch the Ubuntu terminal – go to the search bar on the Windows laptop, type Ubuntu, and hit enter.
Type java -version. If it returns a version, check whether it is 1.8. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8.
Type javac -version. If it returns a version, check whether it is 1.8. It is better to have version 1.8. If you have another version, consider uninstalling it and installing 1.8.
If you need other versions as well, make sure the environment variables point to 1.8.
If you do not have Java at all, make sure to follow the instructions and install version 1.8 of the JRE and JDK (on Ubuntu this is typically the openjdk-8-jdk package).
Why do we need to install Java and JDK? Scala, Spark, and many other technologies require Java and the JDK to develop and build applications. Scala is a JVM-based programming language.
Use the java -version and javac -version commands in the Ubuntu terminal and check whether they return 1.8.
Download and Untar Spark
Now let us see the details of setting up Spark on Ubuntu, any other Linux flavor, or Mac.
Why to setup Spark?
Before deploying on the cluster, it is good practice to test the script using spark-submit.
To run using spark-submit locally, it is nice to set up Spark on the local machine.
How to setup Spark?
Download the Spark 2.3 tarball by going here. We can use wget to download the tarball.
Choose Spark Release: 2.3.0
Choose a package type: Pre-built for Hadoop 2.7 or later
It gives the appropriate link pointing to a mirror
Click on it to go to the mirror and click on the link there to download
Use the tar xzf command to untar and unzip the tarball – tar xzf spark-2.3.0-bin-hadoop2.7.tgz
We need to configure environment variables to run Spark from anywhere
Setup Environment Variables – Mac or Linux
Let us see how we can configure the environment variables for Spark.
Why set up environment variables? To run spark-submit and spark-shell from anywhere on the PC using the jar file.
How to configure Environment Variables?
Let us assume that Spark is set up under
/Users/itversity/spark-2.3.0-bin-hadoop2.7 on Mac
/mnt/c/spark-2.3.0-bin-hadoop2.7 on Ubuntu built using the Windows Subsystem for Linux
Set up a new environment variable SPARK_HOME and update PATH
Make sure to restart the terminal (no need to reboot the machine)
# On Mac - .bash_profile
export SPARK_HOME=/Users/itversity/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
# On Ubuntu built using Windows subsystem for Linux - .profile
export SPARK_HOME=/mnt/c/spark-2.3.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
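To validate the setup on Mac or Ubuntu, run spark-shell from any directory. You can also build a small Spark application with sbt and test it with spark-submit. A minimal sketch of such an application is shown below (the object name SparkDemo and the local[*] master are illustrative assumptions; it relies on the spark-sql dependency added to build.sbt earlier):
import org.apache.spark.sql.SparkSession

object SparkDemo {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark locally using all available cores
    val spark = SparkSession.builder()
      .appName("Spark 2 Demo")
      .master("local[*]")
      .getOrCreate()

    // simple computation to confirm Spark is working
    println(spark.range(1, 101).count())  // should print 100

    spark.stop()
  }
}
After running sbt package, the resulting jar can be tested with spark-submit --class SparkDemo --master 'local[*]' followed by the path to the jar.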