As part of this session, we will see how we can build streaming applications using Spark Streaming (legacy). We will get data from a log file populated by gen_logs and process it using Spark Streaming.
Spark Streaming Overview
Overview of DStreams and APIs
Typical development process
Streaming Department Traffic – using socketTextStream
Streaming Department Traffic – using Kafka
Windowing Functions – Overview
Spark Streaming Overview
Spark Streaming is used to process data in a streaming fashion.
It requires an entry point called StreamingContext
Unlike SparkContext, a StreamingContext runs perpetually, processing data at regular intervals
We cannot have multiple contexts running at the same time, hence if there is a running SparkContext we need to stop it before we launch a StreamingContext
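As an illustration, here is a minimal sketch of launching a StreamingContext from spark-shell, assuming a 30-second batch interval (the master and app name below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Only one context can be active at a time, so stop the SparkContext
// created by spark-shell before launching the StreamingContext
sc.stop()

// Placeholder master and app name; batch interval of 30 seconds
val conf = new SparkConf().setMaster("yarn-client").setAppName("Spark Streaming Demo")
val ssc = new StreamingContext(conf, Seconds(30))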
Overview of DStreams and APIs
Let us get an overview of the core components of Spark Streaming.
RDD is the main data structure in Spark
A DStream is nothing but a series of RDDs which are created at regular intervals
We have transformations and output operations to process data in DStreams
Most of the transformations are similar to RDD transformations, except that they take DStreams as input and return DStreams as output
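As a quick illustration, here is a hedged sketch of word-count style transformations on a DStream (lines is assumed to be a DStream[String], for example from socketTextStream):

// Transformations look just like their RDD counterparts,
// but are applied to every batch (RDD) in the stream
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

// print is an output operation that displays a few records from each batch
wordCounts.print()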
Typical Development Process
Let us look at the typical development process for building streaming applications using Spark Streaming.
Read data from the source. We have plugins for different types of sources.
Netcat – API as part of StreamingContext
Flume
Kafka
and many more
Process data using Transformations
Write data to target. We can write data to different targets
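As an illustration of writing to targets, a couple of the built-in output operations are sketched below (the DStream name and the HDFS path are placeholders):

// Save each batch of the departmentTraffic DStream as text files under an HDFS prefix;
// Spark appends the batch timestamp to the prefix for every interval
departmentTraffic.saveAsTextFiles("/user/training/streaming/department_traffic")

// For targets without a built-in writer, foreachRDD exposes each batch as an RDD
departmentTraffic.foreachRDD { rdd =>
  rdd.take(10).foreach(println)  // e.g. inspect a few records per batch
}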
Streaming Department Count – socketTextStream
Let us get into the details of building an application to compute the streaming department count using socketTextStream.
Setup netcat
Develop Application
Build jar file
Ship and run jar file using spark-submit
Setting up Netcat
Netcat (nc), a simple networking utility, can be used to get started with reading the data over a socket.
Start the netcat server using the host name and port number
We can start publishing messages to this socket
Sample nc command to redirect log messages from gen_logs to the socket – tail_logs.sh | nc -lk 9999
Develop Application
Let us develop the program to perform streaming department count using socketTextStream (used for exploratory purposes).
Project Name:
Add dependencies for Spark Streaming to build.sbt (see the build.sbt sketch after this list)
import org.apache.spark.streaming._ to import all the APIs
Create a SparkConf object with the master and the app name
Pass the SparkConf object and the batch interval (number of seconds) to the StreamingContext constructor. The StreamingContext will queue up the data for the duration of the batch interval and then apply the processing logic to each batch.
Develop the necessary logic to compute the department count for each batch (a condensed sketch follows the gist below)
At the end, call start and awaitTermination so that the Spark StreamingContext runs perpetually
Build the jar file using sbt package command
Ship the jar file and run it on the lab
[gist]1c21c9d10c49cfe026543189dfdcf66d[/gist]
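The gist above has the complete program; the following is a condensed sketch of the core logic, assuming gen_logs emits web-log lines whose request path looks like /department/&lt;name&gt;/... (the host name, port and log format here are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDepartmentCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Streaming Department Count").setMaster("yarn-client")
    // Batch interval of 30 seconds: data is queued for 30 seconds and then processed
    val ssc = new StreamingContext(conf, Seconds(30))

    // Read lines from the netcat socket started earlier (host and port are placeholders)
    val lines = ssc.socketTextStream("gateway.example.com", 9999)

    // Keep only department page hits and count them per department for each batch
    // (assumes the request path is the 7th space-separated field of the log line)
    val departmentCounts = lines
      .filter(_.contains("GET /department/"))
      .map(line => (line.split(" ")(6).split("/")(2), 1))
      .reduceByKey(_ + _)

    departmentCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}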
Streaming Department Count – using Kafka
Now let us understand how to consume data from a Kafka topic and perform streaming analytics on top of it at regular intervals. We can take the previous example and replace socketTextStream with the Kafka related APIs.
[gist]e1c121d7ddfa503489972b34746da6c7[/gist]
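The gist above has the full Kafka version; below is a hedged sketch of how the source side changes, using the spark-streaming-kafka (0.8 style) direct API. The broker list, topic name and the ssc variable are assumptions carried over from the previous sketch:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Kafka connection details are placeholders; adjust them for your cluster
val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker01:9092,broker02:9092")
val topics = Set("retail_logs")

// createDirectStream returns a DStream of (key, value) pairs; we only need the message value
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
val lines = messages.map(_._2)

// The rest of the department count logic stays the same as in the socketTextStream version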
Windowing Functions – Overview
On top of regular transformations, Spark Streaming provides windowing functions.
Windowing Functions are primarily meant for use cases like moving averages
All these functions have ByWindow in their names (for example, reduceByKeyAndWindow and countByWindow)
These functions take the window duration as well as the slide interval as arguments
If the window duration is 90 seconds and the slide interval is 30 seconds, then we will be able to see 90-second aggregations refreshed every 30 seconds.
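For example, here is a hedged sketch using reduceByKeyAndWindow on the department hit counts (departmentHits is assumed to be a DStream[(String, Int)] of (department, 1) pairs from the earlier sketch):

import org.apache.spark.streaming.Seconds

// Window of 90 seconds, sliding every 30 seconds: every 30 seconds we get
// counts aggregated over the last 90 seconds worth of batches
val windowedCounts = departmentHits.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Seconds(90),
  Seconds(30))

windowedCounts.print()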