Spark Streaming (legacy) – Overview

As part of this session, we will see how we can build streaming applications using Spark Streaming (legacy). We will get data from a log file populated by gen_logs and process it using Spark Streaming.

  • Spark Streaming Overview
  • Overview of DStreams and APIs
  • Typical development process
  • Streaming Department Count – using socketTextStream
  • Streaming Department Count – using Kafka
  • Windowing Functions – Overview

Spark Streaming Overview

Spark Streaming is used to process data in streaming fashion.

  • It requires a special context called StreamingContext, which is the entry point for streaming functionality
  • Unlike SparkContext, a StreamingContext runs perpetually, processing data at regular intervals (the batch interval)
  • We cannot have multiple contexts running at the same time, hence if there is a running SparkContext we need to stop it before we launch a StreamingContext (see the sketch below)
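
Here is a minimal sketch of creating a StreamingContext, assuming Spark 2.x, local mode, and a 30-second batch interval (all illustrative choices, not mandated by the original example).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Spark configuration with master and application name
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("Spark Streaming Overview")

// new StreamingContext(conf, ...) creates its own SparkContext, so any
// SparkContext that is already running (e.g. sc in spark-shell) has to be
// stopped first: sc.stop()
val ssc = new StreamingContext(conf, Seconds(30))   // 30-second batch interval
```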

Overview of DStreams and APIs

Let us get an overview of the core components of Spark Streaming.

  • RDD is the main data structure in Spark
  • A DStream is nothing but a series of RDDs created at regular intervals
  • We have transformations and output operations to process data in DStreams
  • Most of the transformations are similar to RDD transformations, except that they take DStreams as input and return DStreams as output (see the sketch below)
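
As a sketch of how DStream transformations mirror the RDD API, assume a DStream[String] named lines already exists (for example, from socketTextStream):

```scala
// Each call below is applied to every batch (RDD) that arrives in the stream
val words  = lines.flatMap(_.split(" "))       // DStream[String]
val pairs  = words.map(word => (word, 1))      // DStream[(String, Int)]
val counts = pairs.reduceByKey(_ + _)          // DStream[(String, Int)]

// print() is an output operation: it triggers the computation and prints
// a few records of every batch on the driver console
counts.print()
```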

Typical Development Process

Let us see the typical development process to build streaming applications using Spark Streaming.

  • Read data from a source. We have plugins for different types of sources.
    • Netcat – socketTextStream API as part of StreamingContext
    • Flume
    • Kafka
    • and many more
  • Process the data using transformations
  • Write the data to a target. We can write to different types of targets (see the sketch below).
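
A minimal sketch of this read–process–write flow, assuming an existing StreamingContext named ssc and a netcat source on a placeholder host and port:

```scala
// 1. Read data from the source (netcat in this case)
val lines = ssc.socketTextStream("localhost", 9999)

// 2. Process data using transformations (here, just count records per batch)
val recordCounts = lines.count()

// 3. Write the results to a target (the console, via the print output operation)
recordCounts.print()
```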

Streaming Department Count – using socketTextStream

Let us get into the details of building an application to get streaming department counts using socketTextStream.

  • Set up netcat
  • Develop Application
  • Build jar file
  • Ship and run jar file using spark-submit

Setting up Netcat

Netcat, a simple networking utility, can be used as a data source to get started.

  • Start the netcat service using a host name and port number
  • We can then publish messages to this service
  • Sample command to redirect log messages from gen_logs to the service – tail_logs.sh | nc -lk 9999

Develop Application

Let us develop the program to perform the streaming department count using socketTextStream (which is typically used for exploratory purposes).

  • Project Name:
  • Add dependencies for Spark Streaming to build.sbt
  • import org.apache.spark.streaming._ to import all the APIs
  • Create a Spark Configuration object with master and app name
  • Pass the Spark Configuration object and the batch interval in seconds to the StreamingContext constructor. The StreamingContext will queue up incoming data for that many seconds and then apply the processing logic to each batch.
  • Develop the necessary logic to perform the streaming department count
  • At the end, call start and awaitTermination so the Streaming Context runs perpetually
  • Build the jar file using the sbt package command
  • Ship the jar file and run it on the lab using spark-submit (see the gist and the sketch below)
[gist]1c21c9d10c49cfe026543189dfdcf66d[/gist]
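
In case the gist does not render, here is a rough sketch of how the pieces fit together, assuming Spark 2.x; the object name, master, host, port, output path, and dependency version below are hypothetical placeholders.

```scala
// build.sbt would need a dependency along these lines (version per your cluster):
//   libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.0" % "provided"
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDepartmentCount {
  def main(args: Array[String]): Unit = {
    // Spark configuration object with master and application name
    val conf = new SparkConf()
      .setMaster(args(0))                 // e.g. local[2] or yarn-client
      .setAppName("Streaming Department Count")

    // Batch interval of 30 seconds: data is queued up for 30 seconds and
    // the processing logic is applied to each such batch
    val ssc = new StreamingContext(conf, Seconds(30))

    // Read the log messages published by gen_logs through netcat
    val lines = ssc.socketTextStream(args(1), args(2).toInt)

    // Count visits per department from log lines such as "GET /department/footwear/..."
    val departmentCounts = lines
      .filter(_.contains("GET /department/"))
      .map(line => (line.split("/department/")(1).split("/")(0), 1))
      .reduceByKey(_ + _)

    // Write each batch to HDFS (one directory per batch interval)
    departmentCounts.saveAsTextFiles("/user/training/streaming_dept_count/cnt")

    ssc.start()              // start the streaming job
    ssc.awaitTermination()   // keep it running perpetually until killed
  }
}
```

The jar built with sbt package can then be shipped to the lab and run with spark-submit, passing the master, host, and port as arguments.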

Streaming Department Count – using Kafka

Now let us understand how to consume data from a Kafka topic and perform streaming analytics on top of it at regular intervals. We can take the previous example and replace socketTextStream with Kafka-related APIs.

[gist]e1c121d7ddfa503489972b34746da6c7[/gist]
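
In case the gist does not render, here is a hedged sketch of the Kafka-based source, assuming the spark-streaming-kafka-0-10 integration and hypothetical broker, topic, and consumer-group names; the rest of the processing logic stays the same as in the socketTextStream version.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Consumer configuration (broker, topic and group id below are placeholders)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming_department_count",
  "auto.offset.reset" -> "latest"
)
val topics = Array("retail_logs")

// Create a direct stream from Kafka instead of socketTextStream
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

// The value of each Kafka record is the log line; downstream logic is unchanged
val lines = messages.map(record => record.value())
```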

Windowing Functions – Overview

On top of regular transformations, Spark Streaming provides windowing functions.

  • Windowing functions are primarily meant for use cases like moving averages
  • All these functions have ByWindow as part of their names (e.g., reduceByKeyAndWindow, countByWindow)
  • These functions take the window length as well as the sliding interval as arguments
  • If the window length is 90 seconds and the sliding interval is 30 seconds, then we will see moving 90-second aggregations every 30 seconds (see the sketch below).
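
A minimal sketch of reduceByKeyAndWindow along those lines, assuming a DStream[(String, Int)] named departmentCounts (hypothetical) and a batch interval that evenly divides both durations:

```scala
import org.apache.spark.streaming.Seconds

// Every 30 seconds, aggregate the counts seen over the most recent 90 seconds
val windowedCounts = departmentCounts.reduceByKeyAndWindow(
  (agg: Int, value: Int) => agg + value,   // reduce function applied within the window
  Seconds(90),                             // window length
  Seconds(30)                              // sliding interval
)

windowedCounts.print()
```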