Quick Recap of Scala

As part of this topic, let us quickly review the basic concepts of Scala before jumping into the Spark APIs. Scala is a programming language, and the Spark APIs are available in Scala (along with Python, Java, etc.). It is imperative to master at least one of these programming languages to build applications using Spark.

Let us revise the concepts below before jumping into programming using Spark's base APIs.

  • Overview of REPL and Getting Started
  • Functions – pre-defined, user-defined and lambda functions
  • Basic File I/O
  • Collections, Tuple, and Map Reduce APIs
  • Development Life Cycle using IDE

Overview of REPL and Getting Started

Let us see details about REPL and how we can get started with Scala.

  • We can launch the Scala REPL using the scala command
  • REPL stands for Read, Evaluate, Print, and Loop. It is an interactive way of building logic.
  • You can perform arithmetic operations, invoke functions, or even develop functions.
  • Each line of code is compiled into Java bytecode and evaluated separately. Hence each line can often be treated in isolation, which is why the REPL lets us redefine variables with the same name, as shown in the session below.
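
For illustration, here is a minimal REPL session (the prompt and output shown are indicative):

```scala
scala> val i = 10
i: Int = 10

scala> val i = "ten"   // redefining i with the same name is allowed in the REPL
i: String = ten
```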

Defining Variables

We have to specify whether a variable is mutable or immutable while defining it.

  • val is for immutable variables
  • var is for mutable variables
  • Once an immutable variable is created using val, we will not be able to reassign it.
  • It is optional to specify the data type while defining a variable, but mandatory to assign a value.
  • Based on the value assigned, the data type of the variable will be inferred.
  • Even though we do not have to declare types explicitly, Scala variables are still statically typed, as illustrated below.
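
Here is a quick sketch of this behaviour in the REPL (output shown is indicative):

```scala
scala> val a = 10          // immutable; type Int is inferred from the assigned value
a: Int = 10

scala> a = 20              // reassignment to a val fails
<console>: error: reassignment to val

scala> var b = 10          // mutable; can be reassigned to a value of the same type
b: Int = 10

scala> b = 20
b: Int = 20
```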

Getting Help

We can get usage related information about functions and classes as part of the REPL. We will demonstrate the following.

  • Listing Functions
  • Getting a list of overloaded functions with arguments
  • Getting a list of all functions, along with a few details, for a given class, as shown below.
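
For example, we can use tab completion in the REPL to explore functions (an indicative session; press the Tab key where noted):

```scala
scala> val s = "Hello World"
s: String = Hello World

// Type s. and press Tab to list all the functions available on String
scala> s.

// Type a function name such as s.split and press Tab to see its overloaded
// variants along with their arguments
scala> s.split
```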

Functions Overview

There are a few things you need to understand about functions and arguments.

  • Pre-defined Functions (e.g., string manipulation functions)
  • User-defined Functions (e.g., sum of integers using loops)
  • Functions with a Varying number of Arguments
  • Lambda or Anonymous Functions
[gist]87e9b7600f4c160d5fe4606c8bcab3f1[/gist]
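
In addition to the gist above, here is a minimal sketch covering each of these categories (the function names and logic are illustrative):

```scala
// Pre-defined function – string manipulation
"Hello World".toUpperCase

// User-defined function – sum of integers in a range using a loop
def sumOfIntegers(lb: Int, ub: Int): Int = {
  var total = 0
  for (i <- lb to ub) total += i
  total
}

// Function with a varying number of arguments
def sum(nums: Int*): Int = nums.sum

// Lambda or anonymous function assigned to a variable
val square = (i: Int) => i * i

sumOfIntegers(1, 10)   // 55
sum(1, 2, 3, 4)        // 10
square(5)              // 25
```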

Basic File I/O

Let us see how we can perform basic file I/O using Scala.

  • The scala.io package has a few simple APIs which read data from a source into memory.
  • To read data from a file, we can use the API scala.io.Source.fromFile("PATH_TO_FILE")
  • It will create an object of type BufferedSource
  • We can apply getLines on the BufferedSource to create an Iterator where each element is of type String (since we passed the path of a text file earlier)
  • Once we have the iterator, we can convert it into any collection type based on our requirements.
[gist]a95169d40441d270c7576d7774cbc73a[/gist]
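
In addition to the gist above, here is a minimal sketch of these steps (the file path is just an example):

```scala
import scala.io.Source

// Read the file into a BufferedSource (path is illustrative)
val orderItems = Source.fromFile("/mnt/c/data/retail_db/order_items/part-00000")

// getLines returns an Iterator[String]; convert it to a collection as required
val lines: List[String] = orderItems.getLines.toList

lines.take(5).foreach(println)
```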

Collections, Tuples, and Map Reduce APIs

Scala has a rich collection API, but it is enough to understand the most basic types.

  • List – a collection of homogeneous elements
  • Set – a collection of unique homogeneous elements
  • Map – a collection of key-value pairs, where keys are unique
  • Tuple – a collection of heterogeneous elements

A List, Set, or Map is like a table or spreadsheet where we have any number of elements of the same type, whereas a Tuple is like the group of fields in an individual row.

We typically create a List, Set, or Map of Tuples depending upon our requirements, as shown below.
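
Here is a quick sketch of these types (the values are illustrative):

```scala
val l = List(1, 2, 3, 3)            // List  – homogeneous elements, duplicates allowed
val s = Set(1, 2, 3, 3)             // Set   – unique elements: Set(1, 2, 3)
val m = Map("a" -> 1, "b" -> 2)     // Map   – unique keys mapped to values
val t = (1, "2013-07-25", 11599)    // Tuple – heterogeneous fields, like a row

// Typically we build a collection of tuples, e.g. a list of (order_id, revenue)
val orderRevenues = List((1, 299.98), (2, 579.98), (4, 699.85))
```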

Collection APIs

There are many functions as part of collections such as List, Set, and Map. Let us explore some of them.

  • Most of these functions are inherited from the root trait of collections – Traversable
  • Here are some of the commonly used functions.
    • foreach
    • fold or reduce
    • map
    • filter
    • size or length
    • exists
    • Type conversion functions
    • and more
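
A few of these functions in action (a minimal sketch on a simple list):

```scala
val l = List(1, 2, 3, 4, 5)

l.foreach(println)           // iterate and perform an action on each element
l.fold(0)(_ + _)             // fold – aggregate into a single value: 15
l.map(i => i * i)            // map – row level transformation
l.filter(i => i % 2 == 0)    // filter – retain only the matching elements
l.size                       // size or length: 5
l.exists(_ > 4)              // exists: true
l.toSet                      // type conversion to another collection
```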

Map Reduce APIs

We have a bunch of APIs from the Map Reduce paradigm.

  • All these APIs take a lambda or anonymous function as an argument
  • You have to pass the business logic based upon the purpose of the function
  • map – row level transformations
  • filter – filtering out data that is not required
  • reduce – perform aggregations

We have similar functions as part of Spark as well, but they operate on RDDs.

[gist]a95169d40441d270c7576d7774cbc73a[/gist]
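
As a sketch of how these fit together on order item records (the sample records below are illustrative and follow the retail_db order_items layout used later: order_item_id, order_id, product_id, quantity, subtotal, product_price):

```scala
val orderItems = List(
  "1,1,957,1,299.98,299.98",
  "2,2,1073,1,199.99,199.99",
  "3,2,502,5,250.0,50.0"
)

val orderRevenue = orderItems.
  filter(rec => rec.split(",")(1).toInt == 2).    // filter – keep records for order_id 2
  map(rec => rec.split(",")(4).toFloat).          // map – extract the subtotal from each record
  reduce((total, value) => total + value)         // reduce – aggregate the subtotals (~449.99)
```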

Development Life Cycle using IDE

Now let us understand the Development Life Cycle using an IDE.

  • We will be using IntelliJ to understand the Development Life Cycle.
  • We will set up the project
  • Develop code to read data and then filter it
  • Enhance the code to extract the necessary information and then aggregate using map and reduce.
  • We will also see how to compile and run as an application using sbt.

Setup Project

Let us start setting up the project.

  • Create a project with Scala 2.11.8 using sbt
  • Make sure to select JDK 1.8
  • Give a name to the project – it will take some time for the IDE to set up the project and download the dependencies
  • We can then develop the logic using the IDE – let us compute the revenue for a given order
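
After the project is created, build.sbt would look roughly like this (the name and version are placeholders):

```scala
// build.sbt (illustrative)
name := "OrderRevenue"

version := "0.1"

scalaVersion := "2.11.8"
```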

Read and Filter Data

Let us see how we can read and filter data.

  • We can read the data using scala.io.Source
  • It will create a BufferedSource object, which can be converted into a collection using getLines.
  • Once the collection is created, we can use filter to apply the necessary logic to filter the data, as sketched below.
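
A minimal sketch of the read and filter step (the path and the order id are illustrative):

```scala
import scala.io.Source

// Read the order items into a collection
val orderItems = Source.
  fromFile("/mnt/c/data/retail_db/order_items/part-00000").
  getLines.
  toList

// order_id is the second field in each comma separated record
val orderItemsFiltered = orderItems.filter(rec => rec.split(",")(1).toInt == 2)
```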

Enhance and Run Program

Now let us actually enhance the code to extract order_item_subtotal and then aggregate.

  • We can apply map to extract order_item_subtotal from each record in the collection.
  • Once we get order_item_subtotal for each item, we can add those up to get the total order revenue.
  • We can validate using the IDE by passing run time arguments
  • Once this is done, we can use sbt to build a jar file out of it
    • Go to the project working directory
    • Run sbt package
    • You will see the jar file under target/scala-2.11
  • Once the jar file is created, we can run the application using either sbt or java – for example, sbt "run-main OrderRevenue /mnt/c/data/retail_db/order_items/part-00000"
[gist]a95169d40441d270c7576d7774cbc73a[/gist]
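
Putting these steps together, the application might look roughly like this (a sketch; the gist above has the working version, and the hardcoded order id is illustrative):

```scala
import scala.io.Source

object OrderRevenue {
  def main(args: Array[String]): Unit = {
    // Path is passed as a run time argument, e.g. /mnt/c/data/retail_db/order_items/part-00000
    val orderItems = Source.fromFile(args(0)).getLines.toList

    val orderRevenue = orderItems.
      filter(rec => rec.split(",")(1).toInt == 2).     // keep items of a given order
      map(rec => rec.split(",")(4).toFloat).           // extract order_item_subtotal
      reduce((total, value) => total + value)          // aggregate to get order revenue

    println(orderRevenue)
  }
}
```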

Integrating 3rd Party Plugins

As we have developed a simple program using the IDE leveraging core modules of Scala, let us also review how we can integrate 3rd party plugins.

  • One of the common requirements of applications is to externalize runtime properties
  • For the previous problem, if we have to deploy to any other environment, the code is bound to fail as the path is hardcoded.
  • One of the approaches is to externalize the properties and load them at runtime leveraging 3rd party plugins.
  • Here we will be using Typesafe Config.
  • First we need to update build.sbt with the group id, artifact id, and version of the package.
  • We can bundle the external properties as part of the application. For this, we need to create a resources directory under src/main and then add a file with one of the Typesafe Config standard names – application.properties
  • Now we can have properties defined in the form of key-value pairs.
  • We can use Typesafe Config's ConfigFactory.load method to load this standard file from resources so that all the properties are available at run time.
  • Once development is done, we can use sbt to build a jar file out of it
    • Go to the project working directory
    • Run sbt package
    • You will see the jar file under target/scala-2.11
  • Once the jar file is created, we can run the application using either sbt or java – for example, sbt "run-main OrderRevenue /mnt/c/data/retail_db/order_items/part-00000"
[gist]2a743032d0bad32558d3627295d87392[/gist]
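
A sketch of the pieces involved (the dependency version and the property name are illustrative; the gist above has the working version):

```scala
// In build.sbt, add the Typesafe Config dependency (group id % artifact id % version):
//   libraryDependencies += "com.typesafe" % "config" % "1.3.2"
//
// In src/main/resources/application.properties, define key value pairs (illustrative):
//   dev.input.base.dir = /mnt/c/data/retail_db

import com.typesafe.config.ConfigFactory

object OrderRevenueWithConfig {
  def main(args: Array[String]): Unit = {
    val conf = ConfigFactory.load()   // loads application.properties from resources
    val inputBaseDir = conf.getString("dev.input.base.dir")
    println(inputBaseDir)
  }
}
```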