Durga Gadiraju

Building Streaming Data Pipelines – Using Kafka and Spark

May 3, 2018, 10:29 am

As part of this workshop we will explore Kafka in detail while working through one of the most common use cases of Kafka and Spark – building streaming data pipelines. The following technologies are used in this workshop:

IDE – IntelliJ
Programming Language – Scala
Get messages from web server log files – Kafka Connect
Channelize data – Kafka (it will be covered extensively)
Consume, process and save – Spark Streaming, with Scala as the programming language
Data store for processed data – HBase
Big Data Cluster – 7-node simulated Hadoop and Spark cluster (you can also use our existing 10-node Hortonworks cluster with all related services)
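
The consume step in the list above can be sketched as follows. This is a minimal sketch, not the workshop's actual code: it assumes the `spark-streaming-kafka-0-10` integration, and the broker address `localhost:9092`, the topic name `weblogs`, the consumer group `pipeline-sketch` and the 10 second batch interval are all hypothetical values chosen for illustration. It only counts the messages in each batch; the processing logic and the save to HBase are left out.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object StreamingPipelineSketch {

  // Consumer settings for the Kafka direct stream.
  // Both arguments are supplied by the caller; nothing here comes from the workshop.
  def kafkaParams(bootstrapServers: String, groupId: String): Map[String, Object] =
    Map[String, Object](
      "bootstrap.servers" -> bootstrapServers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside the IDE; on a cluster the master is set at submit time.
    val conf = new SparkConf().setAppName("StreamingPipelineSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval

    // "weblogs" is a hypothetical topic name.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("weblogs"), kafkaParams("localhost:9092", "pipeline-sketch"))
    )

    // Take the message payload of each record and print how many arrived in the batch.
    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Running this requires a reachable Kafka broker with the topic already created. In a full pipeline the `count().print()` line would be replaced by the actual processing and a write to HBase.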

Here is the flow of the course:

Set up the development environment to build streaming applications
Set up everything on a single node (Logstash, HDFS, Spark, Kafka, etc.)
Overview of Kafka
Multi-broker/multi-server setup of Kafka
Overview of streaming technologies and Spark Streaming
Overview of NoSQL Databases and HBase
Development life cycle of an HBase application
Case Study: Kafka at LinkedIn
Final Demo: Streaming Data Pipelines

Let us start by setting up the development environment.

Setup Development Environment

Make sure you have IntelliJ, Scala, sbt, Spark, etc. installed so that you can build the application from the IDE (IntelliJ).

Setup Spark Development Environment – IntelliJ and Scala
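
For an sbt project opened in IntelliJ, the dependencies might be declared along the following lines. This is a minimal sketch under assumptions: the project name is hypothetical, Spark 2.3.1 is taken from the cluster used for the demo, Scala 2.11 is assumed because Spark 2.3.x is built against it, and the Kafka integration module is assumed to be `spark-streaming-kafka-0-10`. An HBase client dependency matching the cluster's HBase version would also be needed and is not shown.

```scala
// build.sbt (sketch; project name and versions are assumptions)
name := "streaming-pipelines-demo"

version := "0.1"

// Spark 2.3.x artifacts are published for Scala 2.11
scalaVersion := "2.11.12"

val sparkVersion = "2.3.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)
```

After IntelliJ imports the build, the Spark and Kafka classes should resolve in the editor.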

Understanding Big Data Cluster

Let us go through the details of the cluster on which the demo is run.

A 7-node plain vanilla cluster with the following services:
- HDFS
- Spark 2.3.1 with Mesos
- HBase
Cloudera Quickstart VM’s gen_logs to simulate web server logs
Logstash
A 3-node Kafka cluster