As part of this session we will see an overview of the technologies used in building streaming data pipelines. We will also take a deeper look into Spark Structured Streaming by developing a solution for a simple problem.
Overview of Streaming Technologies
Spark Structured Streaming
Development Life Cycle
Kafka and Spark Structured Streaming – Integration
Ingestion
There are many technologies used to ingest data in real time.
Logstash
Can read data from different sources
Can apply simple row level transformations such as converting date formats, masking some of the attribute values in each message etc.
Can work with other technologies in building streaming pipelines such as Kafka
Flume
Runs as an agent
Each agent has a source, a channel and a sink
Supports many sources and sinks
Can work with other technologies in building streaming pipelines such as Kafka
Can push data to technologies like Storm, Flink, Spark Streaming etc to run real time streaming analytics.
Kafka Connect and Kafka topics
A Kafka topic is a fault tolerant and highly reliable intermediate data streaming mechanism
Kafka Connect is used to read data from different sources and push messages to Kafka topics, and also to consume messages from Kafka topics and push them to supported targets.
Kafka Connect and topics make it possible for us to get data from different types of sources to different types of sinks.
Can push data to technologies like Kafka Streams, Storm, Flink, Spark Streaming etc to run real time streaming analytics.
Kinesis Firehose and Kinesis Data Streams
Kinesis is an AWS service which is very similar to Kafka
Kinesis Firehose is similar to Kafka Connect and Kinesis Data Streams is similar to a Kafka topic
No need for a dedicated cluster; you are only charged for usage.
and more
Real time processing
As the data comes through tools like Logstash, Flume, Kafka etc., we might want to perform standard transformations such as data cleansing, standardization, lookups, joins, aggregations, sorting, ranking etc. While some of the data ingestion tools are capable of some of these transformations, they do not provide all the features. Also, performing them at ingestion time might delay the ingestion and make the flow unstable. Hence we need to use tools which are built for performing transformations as the data is streamed. Here are some of the prominent tools.
Spark Streaming or Spark Structured Streaming (a module built as part of Spark)
Flink
Storm
Kafka Streams
and more
Databases
Once the data is processed, we have to store it in databases to persist it and build a visualization layer on top of the processed data. We can use
RDBMS – such as Oracle, MySQL, Postgres etc
Data Warehouses – such as Teradata, Redshift etc
NoSQL databases – such as HBase, Cassandra, MongoDB, DynamoDB etc
Search based databases – such as Elastic Search
Visualization
Visualization is typically done as part of the application development using standard frameworks.
d3js
Kibana
Standard reporting tools such as Tableau
and more
Frameworks
Now that we have discussed the different moving parts in building streaming pipelines, let us get into frameworks. Most of these frameworks do not have visualization included.
Kafka
Kafka Connect
Kafka Topic
Kafka Streams
ELK
Elastic Search (Database)
Logstash (streaming and processing logs)
Kibana (Visualization)
HDF – Streaming services running behind NiFi
MapR Streams – Streaming services running on MapR cluster
AWS Services
DynamoDB (Database)
S3 (persistent storage in flat file formats)
Kinesis (streaming and processing logs)
We have highlighted some of the popular frameworks. Almost all the top vendors, such as Cloudera, Google, Microsoft Azure etc., have the necessary services to build streaming pipelines.
Spark Structured Streaming
Apache Spark is a proven distributed computing framework with modules for different purposes
Core APIs – Transformations and Actions
Spark SQL and Data Frames
Spark Streaming (legacy) and Spark Structured Streaming
Spark MLLib
Spark GraphX
and more
We can use Spark Structured Streaming to apply complex business rules either by using Data Frame operations or Spark SQL. Let us see a demo about getting started with Spark Structured Streaming. The main demo is done with legacy streaming.
Important Concepts
Sources
File
Kafka
Socket (for testing)
Basic Operations
Window Operations on Event Time
Handling late data and watermarking
Output Modes
Append Mode
Update Mode
Complete Mode
Sinks/Targets
Console
File
Memory
Kafka
foreach (can be used to write to a database)
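Putting these concepts together, here is a minimal sketch (not part of the original demo) that reads from a socket source, applies a windowed aggregation on event time with a watermark, and writes to the console sink in update mode. The host, port, window size and watermark threshold are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StructuredStreamingConcepts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredStreamingConcepts")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Source: socket (for testing only); includeTimestamp adds an event-time column
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    // Window on event time with a watermark to bound how long late data is accepted
    val counts = lines
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"value")
      .count()

    // Sink: console; update mode emits only the rows that changed in each trigger
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", false)
      .start()

    query.awaitTermination()
  }
}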
Development Life Cycle
Here are the steps involved in getting started with Spark Structured Streaming
Make sure gen_logs is set up and data is being streamed
Create new project StreamingDemo using IntelliJ
Choose scala 2.11
Choose sbt 0.13.x
Make sure JDK is chosen
Update build.sbt. See below
Define application properties
Create GetStreamingDepartmentTraffic object
Add logic to process data using Spark Structured Streaming
Add typesafe config dependency so that we can externalize properties
Add spark-core and spark-sql dependencies
Replace build.sbt with the below lines of code
[gist]3a57c19f7fc669a772fcec4b28e39ba4[/gist]
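The authoritative build.sbt is in the gist above. As a rough sketch, assuming Scala 2.11 and a Spark 2.x release (versions below are illustrative, not prescriptive), it would look something like this:

name := "StreamingDemo"

version := "1.0"

scalaVersion := "2.11.12"

// typesafe config to externalize properties, plus the Spark modules we use
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"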
Externalize Properties
We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.
Make sure build.sbt has the dependency related to typesafe config
Create a new directory under src/main by the name resources
Add a file called application.properties and add the below entries
[gist]9413955f1f298747927629138240c5d6[/gist]
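With the file under src/main/resources, typesafe config can load it from the classpath. The key name below is hypothetical and only for illustration; the actual keys are the ones defined in the gist above.

import com.typesafe.config.ConfigFactory

// ConfigFactory.load() picks up application.conf/application.json/application.properties
// from the classpath (src/main/resources in this project)
val applicationConf = ConfigFactory.load()

// "execution.mode" is a hypothetical key used for illustration
val executionMode = applicationConf.getString("execution.mode")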
Create GetStreamingDepartmentTraffic Program
Create a Scala program by choosing Scala Class and then type Object
Make sure the program is named GetStreamingDepartmentTraffic
First we need to import necessary APIs
Develop necessary logic
Get the properties from application.properties
Create a Spark session object by the name spark
Create stream using spark.readStream
Process data using Data Frame Operations
Write the output to the console (in actual applications we write the output to a database)
[gist]f4a85d5c7a065c06a5ce8e11ad30aa37[/gist]
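The complete program is in the gist above. For orientation only, a sketch of such a program might look like the following; the property keys, the assumption that the 7th space-separated field of each log message is the request path, and the department extraction logic are illustrative and not necessarily what the gist contains.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Hypothetical property keys; the actual keys come from application.properties
    val conf = ConfigFactory.load()

    val spark = SparkSession.builder()
      .appName("GetStreamingDepartmentTraffic")
      .master(conf.getString("execution.mode"))
      .getOrCreate()

    import spark.implicits._

    // Read the stream of log messages from the socket started with nc
    val lines = spark.readStream
      .format("socket")
      .option("host", conf.getString("socket.host"))
      .option("port", conf.getString("socket.port"))
      .load()

    // Keep only department pages and extract the department name from the request path
    val departmentTraffic = lines
      .filter(split($"value", " ")(6) contains "/department/")
      .withColumn("department", split(split($"value", " ")(6), "/")(2))
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    // Write the running counts to the console; real applications write to a database
    val query = departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}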
Build, Deploy and Run
Right click on the project and copy the path
Go to the terminal and run the cd command with the copied path
Run sbt package
It will generate a jar file for our application
Copy it to the server where you want to deploy
Start streaming tail_logs to the netcat socket – tail_logs.sh | nc -lk gw02.itversity.com 9999
Run the below command in another session on the server
[gist]a11b68fedcc24d3a526f5037d3af5e9f[/gist]
Kafka and Spark Structured Streaming – Integration
Let us see how we can get data from Kafka topic and process using Spark Structured Streaming.
Development Life Cycle
Here are the steps involved in getting started with Kafka and Spark Structured Streaming integration
Make sure gen_logs is set up and data is being streamed
Create new project StreamingDemo using IntelliJ
Choose scala 2.11
Choose sbt 0.13.x
Make sure JDK is chosen
Update build.sbt. See below
Define application properties
Create GetStreamingDepartmentTraffic object
Add logic to process data using Spark Structured Streaming
Add typesafe config dependency so that we can externalize properties
Add spark-sql dependencies
Replace build.sbt with the below lines of code
[gist]b5f9f2d1c79a510bcabe896178f8f644[/gist]
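As before, the actual build.sbt is in the gist above. A rough sketch, assuming Scala 2.11 and Spark 2.x with the Kafka source for Structured Streaming added (versions illustrative):

name := "StreamingDemo"

version := "1.0"

scalaVersion := "2.11.12"

// typesafe config plus Spark SQL and the Kafka source for Structured Streaming
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"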
Externalize Properties
We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.
Make sure build.sbt has the dependency related to typesafe config
Create a new directory under src/main by the name resources
Add a file called application.properties and add the below entries
It should include information related to the Kafka broker
[gist]a6f973217a677119bea858b218aa62f9[/gist]
Create GetStreamingDepartmentTraffic Program
Create a Scala program by choosing Scala Class and then type Object
Make sure the program is named GetStreamingDepartmentTraffic
First we need to import necessary APIs
Develop necessary logic
Get the properties from application.properties
Create a Spark session object by the name spark
Create stream using spark.readStream
Pass the broker and topic information as options
Process data using Data Frame Operations
Write the output to the console (in actual applications we write the output to a database)
Run the program using the IDE to make sure we see output in the console
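As a rough sketch of such a program, assuming hypothetical property keys for the execution mode, broker list and topic name, and the same web-log format as in the socket version, it might look like the following.

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // Hypothetical property keys; the actual keys come from application.properties
    val conf = ConfigFactory.load()

    val spark = SparkSession.builder()
      .appName("GetStreamingDepartmentTraffic")
      .master(conf.getString("execution.mode"))
      .getOrCreate()

    import spark.implicits._

    // Subscribe to the Kafka topic; the value column is binary and needs to be cast
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
      .option("subscribe", conf.getString("kafka.topic"))
      .load()
      .selectExpr("CAST(value AS STRING) AS message")

    // Same department traffic aggregation as the socket version
    val departmentTraffic = lines
      .filter(split($"message", " ")(6) contains "/department/")
      .withColumn("department", split(split($"message", " ")(6), "/")(2))
      .groupBy("department")
      .agg(count(lit(1)).alias("department_count"))

    // Write the running counts to the console; real applications write to a database
    val query = departmentTraffic.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}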
Build, Deploy and Run
Right click on the project and copy the path
Go to the terminal and run the cd command with the copied path
Run sbt package
It will generate a jar file for our application
Copy it to the server where you want to deploy
Make sure Zookeeper and the Kafka broker are running
Start streaming tail_logs to the Kafka topic – tail_logs.sh | kafka-console-producer.sh --broker-list wn01.itversity.com:6667,wn02.itversity.com:6667 --topic retail
Run the below command in another session on the server