As part of this session we will see an overview of the technologies used in building streaming data pipelines. We will also take a deeper look into Spark Structured Streaming by developing a solution for a simple problem.
Overview of Streaming Technologies
Spark Structured Streaming
Development Life Cycle
Kafka and Spark Structured Streaming – Integration
Ingestion
There are many technologies used to ingest data in real time.
Logstash
Can read data from different sources
Can apply simple row level transformations such as converting date formats, masking some of the attribute values in each message etc.
Can work with other technologies in building streaming pipelines such as Kafka
Flume
Runs as an agent
Each agent has a source, a channel and a sink
Supports many sources and sinks
Can work with other technologies in building streaming pipelines such as Kafka
Can push data to technologies like Storm, Flink, Spark Streaming etc to run real time streaming analytics.
Kafka Connect and Kafka topics
A Kafka topic is a fault tolerant and highly reliable intermediate data streaming mechanism
Kafka Connect is used to read data from different sources and push messages to Kafka topics, and also to consume messages from Kafka topics and push them to supported targets.
Kafka Connect and topics make it possible for us to get data from different types of sources to different types of sinks.
Can push data to technologies like Kafka Streams, Storm, Flink, Spark Streaming etc to run real time streaming analytics.
Kinesis Firehose and Kinesis Data Streams
Kinesis is an AWS service which is very similar to Kafka
Kinesis Firehose is similar to Kafka Connect and Kinesis Data Streams is similar to a Kafka topic
No need for a dedicated cluster; you are only charged for usage.
and more
Real time processing
As the data comes through tools like Logstash, Flume, Kafka etc., we might want to perform standard transformations such as data cleansing, standardization, lookups, joins, aggregations, sorting, ranking etc. While some of the data ingestion tools are capable of some of these transformations, they do not provide all the features. Also, performing them at ingestion time might delay the ingestion and make the flow unstable. Hence we need to use tools which are built for performing transformations as the data is streamed. Here are some of the prominent tools.
Spark Streaming or Spark Structured Streaming (a module built as part of Spark)
Flink
Storm
Kafka Streams
and more
Databases
Once the data is processed, we have to store it in databases to persist it and build a visualization layer on top of the processed data. We can use
RDBMS – such as Oracle, MySQL, Postgres etc
Data Warehouses – such as Teradata, Redshift etc
NoSQL databases – such as HBase, Cassandra, MongoDB, DynamoDB etc
Search based databases – such as Elastic Search
Visualization
Visualization is typically done as part of the application development using standard frameworks.
d3js
Kibana
Standard reporting tools such as Tableau
and more
Frameworks
Now that we have discussed the different moving parts in building streaming pipelines, let us get into frameworks. Most of these frameworks do not have visualization included.
Kafka
Kafka Connect
Kafka Topic
Kafka Streams
ELK
Elastic Search (Database)
Logstash (streaming and processing logs)
Kibana (Visualization)
HDF – Streaming services running behind NiFi
MapR Streams – Streaming services running on MapR cluster
AWS Services
DynamoDB (Database)
S3 (persistent storage in flat file formats)
Kinesis (streaming and processing logs)
We have highlighted some of the popular frameworks. Almost all the top vendors, such as Cloudera, Google, Microsoft Azure etc., have the necessary services to build streaming pipelines.
Spark Structured Streaming
Apache Spark is a proven distributed computing framework with modules for different purposes
Core APIs – Transformations and Actions
Spark SQL and Data Frames
Spark Streaming (legacy) and Spark Structured Streaming
Spark MLLib
Spark GraphX
and more
We can use Spark Structured Streaming to apply complex business rules either by using Data Frame operations or Spark SQL. Let us see a demo about getting started with Spark Structured Streaming. The main demo is done with legacy streaming.
Important Concepts
Sources
File
Kafka
Socket (for testing)
Basic Operations
Window Operations on Event Time
Handling late data and watermarking
Output Modes
Append Mode
Update Mode
Complete Mode
Sinks/Targets
Console
File
Memory
Kafka
foreach (can be used to write to a database)
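Putting these concepts together, here is a minimal sketch (not part of the original demo) that reads from a socket source, applies a windowed aggregation on event time with a watermark, and writes to the console sink in update mode. The host, port, window size and watermark threshold are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredStreamingConcepts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredStreamingConcepts")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Source: socket (for testing only); includeTimestamp adds an event-time column
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    // Window on event time with a watermark to bound how long late data is accepted
    val counts = lines
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"value")
      .count()

    // Sink: console; update mode emits only the rows that changed in each trigger
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", false)
      .start()

    query.awaitTermination()
  }
}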
Development Life Cycle
Here are the steps involved in getting started with Spark Structured Streaming
Make sure gen_logs is set up and data is being streamed
Create new project StreamingDemo using IntelliJ
Choose scala 2.11
Choose sbt 0.13.x
Make sure JDK is chosen
Update build.sbt. See below
Define application properties
Create GetStreamingDepartmentTraffic object
Add logic to process data using Spark Structured Streaming
Add typesafe config dependency so that we can externalize properties
Add spark-core and spark-sql dependencies
Replace build.sbt with the below lines of code
[gist]3a57c19f7fc669a772fcec4b28e39ba4[/gist]
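The authoritative build.sbt is in the gist above. As a rough sketch, assuming Scala 2.11 and a Spark 2.x release (versions below are illustrative, not prescriptive), it would look something like this:

name := "StreamingDemo"

version := "1.0"

scalaVersion := "2.11.12"

// typesafe config to externalize properties, plus the Spark modules we use
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"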
Externalize Properties
We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.
Make sure build.sbt has the dependency related to typesafe config
Create a new directory under src/main by the name resources
Add a file called application.properties and add the below entries
[gist]9413955f1f298747927629138240c5d6[/gist]
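With the file under src/main/resources, typesafe config can load it from the classpath. The key name below is hypothetical and only for illustration; the actual keys are the ones defined in the gist above.

import com.typesafe.config.ConfigFactory

// ConfigFactory.load() picks up application.conf/application.json/application.properties
// from the classpath (src/main/resources in this project)
val applicationConf = ConfigFactory.load()

// "execution.mode" is a hypothetical key used for illustration
val executionMode = applicationConf.getString("execution.mode")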
Create GetStreamingDepartmentTraffic Program
Create a Scala program by choosing Scala Class and then type Object
Make sure the program is named GetStreamingDepartmentTraffic
First we need to import necessary APIs
Develop necessary logic
Get the properties from application.properties
Create a Spark session object by the name spark
Create stream using spark.readStream
Process data using Data Frame Operations
Write the output to the console (in actual applications we write the output to a database)
[gist]f4a85d5c7a065c06a5ce8e11ad30aa37[/gist]
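The complete program is in the gist above. For orientation only, a sketch of such a program might look like the following; the property keys, the assumption that the 7th space-separated field of each log message is the request path, and the department extraction logic are illustrative and not necessarily what the gist contains.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Hypothetical property keys; the actual keys come from application.properties
    val conf = ConfigFactory.load()

    val spark = SparkSession.builder()
      .appName("GetStreamingDepartmentTraffic")
      .master(conf.getString("execution.mode"))
      .getOrCreate()

    import spark.implicits._

    // Read the stream of log messages from the socket started with nc
    val lines = spark.readStream
      .format("socket")
      .option("host", conf.getString("socket.host"))
      .option("port", conf.getString("socket.port"))
      .load()

    // Keep only department pages and extract the department name from the request path
    val departmentTraffic = lines
      .filter(split($"value", " ")(6) contains "/department/")
      .withColumn("department", split(split($"value", " ")(6), "/")(2))
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    // Write the running counts to the console; real applications write to a database
    val query = departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}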
Build, Deploy and Run
Right click on the project and copy the path
Go to the terminal and run the cd command with the copied path
Run sbt package
It will generate a jar file for our application
Copy it to the server where you want to deploy
Start streaming tail_logs to the netcat socket – tail_logs.sh | nc -lk gw02.itversity.com 9999
Run the below command in another session on the server
[gist]a11b68fedcc24d3a526f5037d3af5e9f[/gist]
Kafka and Spark Structured Streaming – Integration
Let us see how we can get data from Kafka topic and process using Spark Structured Streaming.
Development Life Cycle
Here are the steps involved in getting started with Kafka and Spark Structured Streaming integration
Make sure gen_logs is set up and data is being streamed
Create new project StreamingDemo using IntelliJ
Choose scala 2.11
Choose sbt 0.13.x
Make sure JDK is chosen
Update build.sbt. See below
Define application properties
Create GetStreamingDepartmentTraffic object
Add logic to process data using Spark Structured Streaming
Add typesafe config dependency so that we can externalize properties
Add spark-sql dependencies
Replace build.sbt with the below lines of code
[gist]b5f9f2d1c79a510bcabe896178f8f644[/gist]
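As before, the actual build.sbt is in the gist above. A rough sketch, assuming Scala 2.11 and Spark 2.x with the Kafka source for Structured Streaming added (versions illustrative):

name := "StreamingDemo"

version := "1.0"

scalaVersion := "2.11.12"

// typesafe config plus Spark SQL and the Kafka source for Structured Streaming
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"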
Externalize Properties
We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.
Make sure build.sbt has the dependency related to typesafe config
Create a new directory under src/main by the name resources
Add a file called application.properties and add the below entries
It should include information related to the Kafka broker
[gist]a6f973217a677119bea858b218aa62f9[/gist]
Create GetStreamingDepartmentTraffic Program
Create a Scala program by choosing Scala Class and then type Object
Make sure the program is named GetStreamingDepartmentTraffic
First we need to import necessary APIs
Develop necessary logic
Get the properties from application.properties
Create a Spark session object by the name spark
Create stream using spark.readStream
Pass the broker and topic information as options
Process data using Data Frame Operations
Write the output to the console (in actual applications we write the output to a database)
Run the program using the IDE to make sure we see output in the console
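As a rough sketch of such a program, assuming hypothetical property keys for the execution mode, broker list and topic name, and the same web-log format as in the socket version, it might look like the following.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Hypothetical property keys; the actual keys come from application.properties
    val conf = ConfigFactory.load()

    val spark = SparkSession.builder()
      .appName("GetStreamingDepartmentTraffic")
      .master(conf.getString("execution.mode"))
      .getOrCreate()

    import spark.implicits._

    // Subscribe to the Kafka topic; the value column is binary and needs to be cast
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
      .option("subscribe", conf.getString("kafka.topic"))
      .load()
      .selectExpr("CAST(value AS STRING) AS message")

    // Same department traffic aggregation as the socket version
    val departmentTraffic = lines
      .filter(split($"message", " ")(6) contains "/department/")
      .withColumn("department", split(split($"message", " ")(6), "/")(2))
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    // Write the running counts to the console; real applications write to a database
    val query = departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}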
Build, Deploy and Run
Right click on the project and copy the path
Go to the terminal and run the cd command with the copied path
Run sbt package
It will generate a jar file for our application
Copy it to the server where you want to deploy
Make sure Zookeeper and the Kafka broker are running
Start streaming tail_logs to the Kafka topic – tail_logs.sh | kafka-console-producer.sh --broker-list wn01.itversity.com:6667,wn02.itversity.com:6667 --topic retail
Run the below command in another session on the server