Spark Structured Streaming – Overview

As part of this session we will see an overview of the technologies used in building streaming data pipelines. We will also take a deeper look into Spark Structured Streaming by developing a solution for a simple problem.

  • Overview of Streaming Technologies
  • Spark Structured Streaming
  • Development Life Cycle
  • Kafka and Spark Structured Streaming – Integration

Ingestion

There are many technologies used to ingest data in real time.

  • Logstash
    • Can read data from different sources
    • Can apply simple row level transformations such as converting date formats, masking some of the attribute values in each message etc.
    • Can work with other technologies in building streaming pipelines such as Kafka
  • Flume
    • Runs as agent
    • Each agent has a source, a channel and a sink
    • Supports many sources and sinks
    • Can work with other technologies such as Kafka in building streaming pipelines
    • Can push data to technologies like Storm, Flink, Spark Streaming etc. to run real-time streaming analytics
  • Kafka Connect and Kafka topics
    • Kafka topics are a fault-tolerant and highly reliable intermediate data streaming mechanism
    • Kafka Connect reads data from different sources and pushes messages to Kafka topics, and can also consume messages from Kafka topics and push them to supported targets
    • Kafka Connect and topics make it easy to get data from different types of sources to different types of sinks
    • Can push data to technologies like Kafka Streams, Storm, Flink, Spark Streaming etc. to run real-time streaming analytics
  • Kinesis Firehose and Kinesis Data Streams
    • Kinesis is an AWS service which is very similar to Kafka
    • Kinesis Firehose is similar to Kafka Connect and Kinesis Data Streams are similar to Kafka topics
    • No need for a dedicated cluster; you are charged only for the usage
  • and more

Real time processing

As the data comes in through tools like Logstash, Flume, Kafka etc., we might want to perform standard transformations such as data cleansing, standardization, lookups, joins, aggregations, sorting, ranking etc. While some of the data ingestion tools are capable of some of these transformations, they do not come with all the features. Also, pushing heavy transformations into the ingestion layer might delay the ingestion and make the flow unstable. Hence we need to use tools which are built for performing transformations as the data is streamed. Here are some of the prominent tools.

  • Spark Streaming or Spark Structured Streaming (a module built as part of Spark)
  • Flink
  • Storm
  • Kafka Streams
  • and more

Databases

Once the data is processed, we have to persist it in databases and build a visualization layer on top of the processed data. We can use

  • RDBMS – such as Oracle, MySQL, Postgres etc
  • Data Warehouses – such as Teradata, Redshift etc
  • NoSQL databases – such as HBase, Cassandra, MongoDB, DynamoDB etc
  • Search-based databases – such as Elasticsearch

Visualization

Visualization is typically done as part of the application development using standard frameworks.

  • d3js
  • Kibana
  • Standard reporting tools such as Tableau
  • and more

Frameworks

Now that we have discussed the different moving parts in building streaming pipelines, let us get into frameworks. Most of these frameworks do not have visualization included.

  • Kafka
    • Kafka Connect
    • Kafka Topic
    • Kafka Streams
  • ELK
    • Elasticsearch (Database)
    • Logstash (streaming and processing logs)
    • Kibana (Visualization)
  • HDF – Hortonworks DataFlow, streaming services built around NiFi
  • MapR Streams – Streaming services running on a MapR cluster
  • AWS Services
    • DynamoDB (Database)
    • S3 (persistent storage in flat file formats)
    • Kinesis (streaming and processing logs)

We have highlighted some of the popular frameworks. Almost all the top vendors such as Cloudera, Google, Microsoft Azure etc. have the necessary services to build streaming pipelines.

Spark Structured Streaming

Apache Spark is a proven distributed computing framework with modules for different purposes

  • Core APIs – Transformations and Actions
  • Spark SQL and Data Frames
  • Spark Streaming (legacy) and Spark Structured Streaming
  • Spark MLlib
  • Spark GraphX
  • and more

We can use Spark Structured Streaming to apply complex business rules either by using Data Frame operations or Spark SQL. Let us see a demo of getting started with Spark Structured Streaming (the main demo in this session is done with legacy streaming).
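
As a quick illustration of the Spark SQL route, here is a minimal sketch (not the session's demo code) that registers a streaming Data Frame from a test socket source as a temporary view and aggregates it with spark.sql. The host, port and view name are only placeholders.

import org.apache.spark.sql.SparkSession

object SqlOnStreamDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("SQL on streaming Data Frame")
      .getOrCreate()

    // Streaming Data Frame from a socket source (for testing only)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Register the streaming Data Frame as a temporary view and aggregate it with Spark SQL
    lines.createOrReplaceTempView("lines")
    val counts = spark.sql("SELECT value, count(1) AS cnt FROM lines GROUP BY value")

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}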

Important Concepts

  • Sources
    • File
    • Kafka
    • Socket (for testing)
  • Basic Operations
  • Window Operations on Event Time
    • Handling late data and watermarking
  • Output Modes
    • Append Mode
    • Update Mode
    • Complete Mode
  • Sinks/Targets
    • Console
    • File
    • Memory
    • Kafka
    • foreach (can be used to write to Database)
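
To make the window and watermark concepts above concrete, here is a small sketch (not part of the demo). It assumes a test socket source where each line carries an event timestamp and a word separated by a comma; the input format, host and port are made up for illustration.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WindowedCountsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("Windowed Counts")
      .getOrCreate()
    import spark.implicits._

    // Socket source for testing; each line is assumed to look like "2018-01-01 10:00:00,word"
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .map { line =>
        val Array(ts, word) = line.split(",")
        (Timestamp.valueOf(ts), word)
      }
      .toDF("eventTime", "word")

    // Event-time window of 10 minutes, sliding every 5 minutes;
    // the watermark allows data up to 15 minutes late to update results
    val counts = events
      .withWatermark("eventTime", "15 minutes")
      .groupBy(window($"eventTime", "10 minutes", "5 minutes"), $"word")
      .count()

    // Update mode emits only the windows that changed in each trigger
    counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
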

Development Life Cycle

Here are the steps involved in getting started with Spark Structured Streaming

  • Make sure gen_logs is set up and data is being streamed
  • Create new project StreamingDemo using IntelliJ
    • Choose Scala 2.11
    • Choose sbt 0.13.x
    • Make sure JDK is chosen
  • Update build.sbt. See below
  • Define application properties
  • Create GetStreamingDepartmentTraffic object
  • Add logic to process data using Spark Structured Streaming
  • Build jar file
  • Ship to cluster and deploy

Dependencies (build.sbt)

Spark Structured Streaming requires Spark SQL dependencies.

  • Add the typesafe config dependency so that we can externalize properties
  • Add spark-core and spark-sql dependencies
  • Replace build.sbt with the below lines of code
[gist]3a57c19f7fc669a772fcec4b28e39ba4[/gist]
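
The exact build.sbt used in the demo is in the gist above; a minimal version would look roughly like the following, assuming Scala 2.11 and Spark 2.3.x (adjust the versions to match your cluster).

name := "StreamingDemo"
version := "1.0"
scalaVersion := "2.11.12"

// Typesafe Config for externalizing properties
libraryDependencies += "com.typesafe" % "config" % "1.3.2"

// Spark core and Spark SQL (Structured Streaming is part of Spark SQL)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"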

Externalize Properties

We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.

  • Make sure build.sbt has the dependency related to typesafe config
  • Create a new directory named resources under src/main
  • Add a file called application.properties and add the below entries
[gist]9413955f1f298747927629138240c5d6[/gist]
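
The exact entries are in the gist above. At run time they are read with Typesafe Config, which picks up application.properties from the classpath (src/main/resources). Below is a small sketch of that pattern; keys such as dev.execution.mode and prod.execution.mode are assumed examples and may differ from the demo.

import com.typesafe.config.ConfigFactory

object LoadProps {
  def main(args: Array[String]): Unit = {
    // ConfigFactory.load() reads application.properties from the classpath
    val conf = ConfigFactory.load()

    // The environment (e.g. dev or prod) is passed as the first program argument;
    // dev.execution.mode / prod.execution.mode are hypothetical keys
    val envProps = conf.getConfig(args(0))
    println(envProps.getString("execution.mode"))
  }
}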

Create GetStreamingDepartmentTraffic Program

  • Create a Scala program by choosing Scala Class and then the kind Object
  • Make sure the program is named GetStreamingDepartmentTraffic
  • First we need to import the necessary APIs
  • Develop the necessary logic
    • Get the properties from application.properties
    • Create a Spark session object named spark
    • Create a stream using spark.readStream
    • Process data using Data Frame operations
    • Write the output to the console (in actual applications we write the output to a database)
[gist]f4a85d5c7a065c06a5ce8e11ad30aa37[/gist]
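
The actual program is in the gist above. A rough sketch of the same flow is shown below for reference; it assumes web-log style lines where the request path is the 7th whitespace-separated field, and property keys such as execution.mode, socket.host and socket.port, all of which may differ from the demo code.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Load externalized properties; the key names below are assumptions
    val conf = ConfigFactory.load().getConfig(args(0))

    val spark = SparkSession.builder
      .master(conf.getString("execution.mode"))
      .appName("Get Streaming Department Traffic")
      .getOrCreate()
    import spark.implicits._

    // Socket source fed by: tail_logs.sh | nc -lk <host> 9999
    val lines = spark.readStream
      .format("socket")
      .option("host", conf.getString("socket.host"))
      .option("port", conf.getString("socket.port"))
      .load()

    // Log lines are assumed to have the request path as the 7th field,
    // e.g. "... GET /department/footwear/products HTTP/1.1 ..."
    val departmentTraffic = lines
      .as[String]
      .filter(_.contains("/department/"))
      .map(_.split(" ")(6).split("/")(2))
      .toDF("department")
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}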

Build, Deploy and Run

  • Right click on the project and copy the path
  • Go to the terminal and run the cd command with the copied path
  • Run sbt package
  • It will generate a jar file for our application
  • Copy the jar file to the server where you want to deploy
  • Start streaming tail_logs to a network socket using netcat – tail_logs.sh|nc -lk gw02.itversity.com 9999
  • Run the below command in another session on the server
[gist]a11b68fedcc24d3a526f5037d3af5e9f[/gist]

Kafka and Spark Structured Streaming – Integration

Let us see how we can get data from a Kafka topic and process it using Spark Structured Streaming.

Development Life Cycle

Here are the steps involved in integrating Kafka with Spark Structured Streaming

  • Make sure gen_logs is set up and data is being streamed
  • Create new project StreamingDemo using IntelliJ
    • Choose Scala 2.11
    • Choose sbt 0.13.x
    • Make sure JDK is chosen
  • Update build.sbt. See below
  • Define application properties
  • Create GetStreamingDepartmentTraffic object
  • Add logic to process data using Spark Structured Streaming
  • Build jar file
  • Ship to cluster and deploy

Dependencies (build.sbt)

Spark Structured Streaming requires Spark SQL dependencies.

  • Add the typesafe config dependency so that we can externalize properties
  • Add spark-sql dependencies
  • Replace build.sbt with the below lines of code
[gist]b5f9f2d1c79a510bcabe896178f8f644[/gist]
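
The gist above has the exact dependencies. Roughly speaking, besides typesafe config and spark-sql, the Kafka source for Structured Streaming needs the spark-sql-kafka-0-10 connector matching the Spark version; here is a sketch assuming Scala 2.11 and Spark 2.3.0.

name := "StreamingDemo"
version := "1.0"
scalaVersion := "2.11.12"

libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

// Kafka source for Spark Structured Streaming
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"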

Externalize Properties

We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.

  • Make sure build.sbt has the dependency related to typesafe config
  • Create a new directory named resources under src/main
  • Add a file called application.properties and add the below entries
  • It should include information related to the Kafka broker
[gist]a6f973217a677119bea858b218aa62f9[/gist]

Create GetStreamingDepartmentTraffic Program

  • Create a Scala program by choosing Scala Class and then the kind Object
  • Make sure the program is named GetStreamingDepartmentTraffic
  • First we need to import the necessary APIs
  • Develop the necessary logic
    • Get the properties from application.properties
    • Create a Spark session object named spark
    • Create a stream using spark.readStream
    • Pass the broker and topic information as options
    • Process data using Data Frame operations
    • Write the output to the console (in actual applications we write the output to a database); a rough sketch is shown after this list
[gist]2c98869b55401ed35471b0a349fa6a63[/gist]
  • Validate locally
    • Make sure Zookeeper and the Kafka broker are running
    • Start streaming tail_logs to the Kafka broker – tail_logs.sh|kafka-console-producer.sh --broker-list localhost:9092 --topic retail
    • Run the program from the IDE to make sure we see output in the console
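
As referenced above, here is a rough sketch of the Kafka-based version of the program. It only approximates the gist: the property keys (execution.mode, bootstrap.servers, topic) and the log parsing logic are assumptions. Note that the Kafka source delivers the message value as binary, so it is cast to a string before processing.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Load externalized properties; the key names below are assumptions
    val conf = ConfigFactory.load().getConfig(args(0))

    val spark = SparkSession.builder
      .master(conf.getString("execution.mode"))
      .appName("Get Streaming Department Traffic")
      .getOrCreate()
    import spark.implicits._

    // Kafka source; broker list and topic name come from application.properties
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
      .option("subscribe", conf.getString("topic"))
      .load()
      .selectExpr("CAST(value AS STRING) AS log_line")

    // Same parsing logic as the socket-based sketch (request path assumed as 7th field)
    val departmentTraffic = lines
      .as[String]
      .filter(_.contains("/department/"))
      .map(_.split(" ")(6).split("/")(2))
      .toDF("department")
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}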

Build, Deploy and Run

  • Right click on the project and copy the path
  • Go to the terminal and run the cd command with the copied path
  • Run sbt package
  • It will generate a jar file for our application
  • Copy the jar file to the server where you want to deploy
  • Make sure Zookeeper and the Kafka broker are running
  • Start streaming tail_logs to the Kafka broker – tail_logs.sh|kafka-console-producer.sh --broker-list wn01.itversity.com:6667,wn02.itversity.com:6667 --topic retail
  • Run the below command in another session on the server
[gist]1d16388fc3ff2e3c5103a9caa39429a7[/gist]