As part of this topic, let us quickly review the basic concepts of Scala before jumping into the Spark APIs. Scala is a programming language, and the Spark APIs are available in Scala (along with Python, Java, etc.). It is imperative to master at least one of these programming languages to build applications using Spark.
Let us review the following concepts before jumping into programming using Spark's base APIs:
Overview of REPL and Getting Started
Functions – pre-defined, user-defined and lambda functions
Basic File I/O
Collections, Tuples, and Map Reduce APIs
Development Life Cycle using IDE
Overview of REPL and Getting Started
Let us see details about REPL and how we can get started with Scala.
We can launch the Scala REPL using the scala command.
REPL stands for Read, Evaluate, Print, and Loop. It is an interactive way of building logic.
You can perform arithmetic operations, invoke functions, or even develop functions.
Each line of code is compiled into Java bytecode and evaluated separately. Hence, each line can often be treated in isolation; because of that, we can redefine a variable with the same name on a later line.
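For example, a short REPL session might look like this (a minimal sketch; the values and result names are illustrative):

```scala
scala> 2 + 3
res0: Int = 5

scala> Math.pow(2, 10)
res1: Double = 1024.0

scala> val i = 10
i: Int = 10

scala> val i = "ten"   // redefining i is allowed, as each line is compiled in isolation
i: String = ten
```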
Defining Variables
We have to specify whether a variable is mutable or immutable while defining it.
val is used for immutable variables
var is used for mutable variables
Once an immutable variable is created using val, we will not be able to reassign it.
It is optional to specify the data type while defining a variable, but it is mandatory to assign a value.
Based on the value assigned, the variable's data type is inferred.
Even though we do not have to specify types explicitly, Scala variables are still statically typed.
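Here is a minimal sketch illustrating val, var, and type inference (the variable names and values are made up for illustration):

```scala
val orderId = 2              // immutable; type Int is inferred from the value
var revenue = 0.0            // mutable; type Double is inferred
revenue = revenue + 299.98   // fine: a var can be reassigned
// orderId = 3               // compile-time error: reassignment to val
val label: String = "total"  // the type can still be specified explicitly if desired
```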
Getting Help
We can get usage-related information about functions and classes from within the REPL. We will give a demo of the following:
Listing Functions
Getting a list of overloaded functions with arguments
Getting a list of all functions, along with a few details, for a given class.
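As a sketch, this is how such a session looks in a Scala 2.11 REPL (described here in comments, since it relies on interactive keystrokes):

```scala
// Type an object followed by a dot and press Tab to list the functions defined on it:
//   scala> "hello".<Tab>        -> lists all functions available on String
// Type a function name and press Tab to see its overloaded signatures with arguments:
//   scala> "hello".split<Tab>   -> shows the overloaded versions of split
// :help lists the commands supported by the REPL itself (:type, :paste, :quit, etc.)
```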
Functions Overview
There are a few things you need to understand about functions and arguments:
User-defined Functions (sum of integers using loops)
Functions with a Varying number of Arguments
Lambda or Anonymous Functions.
[gist]87e9b7600f4c160d5fe4606c8bcab3f1[/gist]
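Along the lines of the gist above, here is a minimal sketch of the three styles (function names and sample values are illustrative):

```scala
// User-defined function: sum of integers in a range using a loop
def sumOfIntegers(lower: Int, upper: Int): Int = {
  var total = 0
  for (i <- lower to upper) total += i
  total
}

// Function with a varying number of arguments (varargs)
def sumAll(values: Int*): Int = values.sum

// Lambda or anonymous function assigned to a value
val square: Int => Int = x => x * x

sumOfIntegers(1, 10)   // 55
sumAll(1, 2, 3, 4)     // 10
square(5)              // 25
```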
Basic File I/O
Let us see how we can perform basic file I/O using Scala.
The scala.io package has a few simple APIs which read data from a source into memory.
To read data from a file, we can use scala.io.Source.fromFile("PATH_TO_FILE")
It will create an object of type BufferedSource
We can apply getLines on the BufferedSource to create an Iterator where each element is of type String (since we passed the path of a text file earlier)
Once we have the iterator, we can convert it into any collection type based upon our requirements.
[gist]a95169d40441d270c7576d7774cbc73a[/gist]
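As a quick reference alongside the gist, here is a minimal sketch (the path is just an example):

```scala
import scala.io.Source

// fromFile returns a BufferedSource; getLines gives an Iterator[String]
val orderItems = Source.fromFile("/mnt/c/data/retail_db/order_items/part-00000")
  .getLines
  .toList            // convert the iterator into a concrete collection
println(orderItems.size)
```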
Collections, Tuples, and Map Reduce APIs
Scala has a rich collections API, but it is enough to understand the most basic types.
List – a collection of homogeneous elements
Set – a collection of unique homogeneous elements
Map – a collection of key-value pairs, where keys are unique
Tuple – a collection of heterogeneous elements
A List, Set, or Map is like a table or spreadsheet where we have any number of elements of the same type, whereas a Tuple is like the group of fields in an individual row.
We typically create a list, set, or map of tuples depending upon our requirements.
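A minimal sketch of the four types (the sample values are made up for illustration):

```scala
val prices = List(299.98, 199.99, 299.98)          // homogeneous, duplicates allowed
val productIds = Set(957, 1073, 957)               // unique elements -> Set(957, 1073)
val statuses = Map(1 -> "CLOSED", 2 -> "PENDING")  // unique keys mapped to values
val order = (1, "2013-07-25 00:00:00.0", 11599, "CLOSED")  // heterogeneous fields of one row

order._1                    // tuple fields are accessed by position: 1
val orders = List(order)    // typically we build a list, set, or map of tuples
```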
Collection APIs
There are many functions as part of collections such as List, Set, and Map. Let us explore some of them.
Most of these functions are inherited from the root trait of collections, Traversable.
Here are some of the commonly used functions.
foreach
fold or reduce
map
filter
size or length
exists
Type conversion functions
and more
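Here is a minimal sketch of some of these functions on a List (the values are illustrative):

```scala
val amounts = List(199.99, 250.0, 129.99, 49.98)

amounts.foreach(println)        // iterate and apply a side effect to each element
amounts.size                    // 4 (length behaves the same way for a List)
amounts.exists(_ > 200)         // true
amounts.filter(_ < 150)         // List(129.99, 49.98)
amounts.map(a => a * 0.9)       // transform each element
amounts.foldLeft(0.0)(_ + _)    // aggregate into a single value (≈ 629.96)
amounts.toSet                   // type conversion: List -> Set
```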
Map Reduce APIs
We have a bunch of APIs from the Map Reduce paradigm.
All these APIs take a lambda or anonymous function as an argument.
You have to pass the business logic based upon the usage of the function.
map – row level transformations
filter – filtering out data that is not required
reduce – perform aggregations
We have similar functions as part of Spark as well, but they operate on RDDs.
[gist]a95169d40441d270c7576d7774cbc73a[/gist]
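In addition to the gist, here is a minimal sketch of a small map/filter/reduce pipeline (the sample records and field positions are assumptions, mimicking retail_db order_items):

```scala
// Hypothetical records: item_id, order_id, product_id, quantity, subtotal, product_price
val orderItems = List(
  "1,1,957,1,299.98,299.98",
  "2,2,1073,1,199.99,199.99",
  "3,2,502,5,250.0,50.0"
)

val revenueForOrder2 = orderItems
  .filter(rec => rec.split(",")(1).toInt == 2)    // filter: drop records of other orders
  .map(rec => rec.split(",")(4).toFloat)          // map: row-level transformation to the subtotal
  .reduce((total, subtotal) => total + subtotal)  // reduce: aggregate the subtotals (≈ 449.99)
```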
Development Life Cycle using IDE
Now let us understand the development life cycle using an IDE.
We will be using IntelliJ to walk through the development life cycle.
We will set up the project
Develop code to read and then filter the data
Enhance the code to extract the necessary information and then aggregate using map and reduce
We will also see how to compile and run it as an application using sbt.
Setup Project
Let us start setting up the project.
Create a project with Scala 2.11.8 using sbt
Make sure to select JDK 1.8
Give a name to the project; it will take some time for the IDE to set up and index the project
We can develop the logic using the IDE – let us compute the revenue for a given order
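A minimal build.sbt for such a project could look like this (the project name and version are illustrative):

```scala
// build.sbt

name := "OrderRevenue"

version := "0.1"

scalaVersion := "2.11.8"
```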
Read and Filter Data
Let us see how we can read and filter data.
We can read the data using scala.io.Source
It will create a BufferedSource object, which can be converted into a collection using getLines.
Once the collection is created, we can use filter to apply the necessary logic and retain only the required data.
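A minimal sketch of this step (assuming the records are comma-separated with the order id as the second field; the path and order id are just examples):

```scala
import scala.io.Source

val orderId = 2
val orderItems = Source.fromFile("/mnt/c/data/retail_db/order_items/part-00000")
  .getLines
  .toList
val orderItemsFiltered = orderItems.filter(rec => rec.split(",")(1).toInt == orderId)
```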
Enhance and Run Program
Now let us enhance the code to extract order_item_subtotal and then aggregate.
We can apply map to extract order_item_subtotal for each record in the collection.
Once we get order_item_subtotal for each item, we can add those up to get the total order revenue.
We can validate using the IDE by passing runtime arguments.
Once this is done, we can use sbt to build a jar file.
Go to the project working directory
Run sbt package
You will see the jar file under target/scala-2.11
Once the jar file is created, we can run the application either with sbt or with the java command – e.g., sbt "run-main OrderRevenue /mnt/c/data/retail_db/order_items/part-00000"
[gist]a95169d40441d270c7576d7774cbc73a[/gist]
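Putting the above together, a minimal sketch of such an OrderRevenue application might look like this (the field positions, the hardcoded order id, and the argument handling are assumptions for illustration):

```scala
import scala.io.Source

// Minimal sketch: compute the revenue for one order from an order_items file.
// Assumes args(0) is the path and that the subtotal is the fifth comma-separated field.
object OrderRevenue {
  def main(args: Array[String]): Unit = {
    val path = args(0)
    val orderId = 2                     // hardcoded here purely for illustration
    val orderItems = Source.fromFile(path).getLines.toList
    val revenue = orderItems
      .filter(rec => rec.split(",")(1).toInt == orderId)
      .map(rec => rec.split(",")(4).toFloat)
      .reduce(_ + _)
    println(s"Revenue for order $orderId is $revenue")
  }
}
```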
Integrating 3rd Party Plugins
As we have developed a simple program using the IDE leveraging core modules of Scala, let us also review how we can integrate 3rd party plugins.
One of the common requirements of applications is to externalize runtime properties.
For the previous problem, if we have to deploy to any other environment, the code is bound to fail as the path is hardcoded.
One of the approaches is to externalize the properties and load them at runtime leveraging 3rd party plugins.
Here we will be using typesafe config.
First, we need to update build.sbt with the group id, artifact id, and version of the package (see the sketch below).
We can bundle the external properties as part of the application. For that, we need to create a resources directory under src/main and then add a file with one of the typesafe config standard names – application.properties.
Now we can have properties defined in the form of key-value pairs.
We can use typesafe config's ConfigFactory.load method to load this standard file from resources, and all the properties will be available at runtime.
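A minimal sketch of both steps follows (the config version and the property key are assumptions for illustration):

```scala
// build.sbt – add the typesafe config dependency (group id % artifact id % version)

libraryDependencies += "com.typesafe" % "config" % "1.3.1"
```

```scala
// src/main/resources/application.properties (hypothetical content):
//   input.base.dir = /mnt/c/data/retail_db

import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.load()                       // picks up application.properties from the classpath
val inputBaseDir = conf.getString("input.base.dir")   // the property key is an assumption
```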
Once development is done, we can use sbt to build a jar file.
Go to the project working directory
Run sbt package
You will see the jar file under target/scala-2.11
Once the jar file is created, we can run the application either with sbt or with the java command – e.g., sbt "run-main OrderRevenue /mnt/c/data/retail_db/order_items/part-00000"