Building Streaming Data Pipelines – Using Kafka and Spark


As part of this workshop we will explore Kafka in detail while working through one of the most common use cases of Kafka and Spark: building streaming data pipelines. The following are the technologies we will be using as part of this workshop.

  • IDE – IntelliJ
  • Programming Language – Scala
  • Get messages from web server log files – Kafka Connect
  • Channelize data – Kafka (it will be covered extensively)
  • Consume, process and save – Spark Streaming (using Scala as the programming language)
  • Data store for processed data – HBase
  • Big Data Cluster – 7 node simulated Hadoop and Spark cluster (you can also use our existing 10 node Hortonworks cluster with all related services)

Here is the flow of the course:

  • Setup Development Environment to build streaming applications
  • Setup everything on a single node (Logstash, HDFS, Spark, Kafka, etc.)
  • Overview of Kafka
  • Multi-broker/multi-server setup of Kafka
  • Overview of streaming technologies and Spark Streaming
  • Overview of NoSQL Databases and HBase
  • Development life cycle of HBase application
  • Case Study: Kafka at LinkedIn
  • Final Demo: Streaming Data Pipelines
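To make the final demo more concrete, here is a minimal sketch of the kind of per-batch processing logic the Spark Streaming stage might apply to messages read from Kafka. The log format and the `/department/` URL pattern are illustrative assumptions based on typical gen_logs-style retail web server logs; in the actual pipeline this logic would run inside DStream transformations (e.g. `map` followed by `reduceByKey`) before the results are written to HBase.

```scala
object LogPipelineSketch {
  // Hypothetical log line format produced by gen_logs, e.g.:
  // 192.168.0.1 - - [20/Oct/2018:10:00:00 -0800] "GET /department/footwear/products HTTP/1.1" 200 1234
  // Extract the department name from the request path, if present.
  def extractDepartment(line: String): Option[String] = {
    val marker = "/department/"
    val idx = line.indexOf(marker)
    if (idx < 0) None
    else {
      val rest = line.substring(idx + marker.length)
      // Department name ends at the next '/', space, or closing quote
      val dept = rest.takeWhile(c => c != '/' && c != ' ' && c != '"')
      if (dept.nonEmpty) Some(dept) else None
    }
  }

  // One micro-batch worth of processing: count visits per department.
  // In Spark Streaming this corresponds to something like
  //   stream.map(_.value()).flatMap(extractDepartment).map((_, 1)).reduceByKey(_ + _)
  def countByDepartment(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(extractDepartment).groupBy(identity).map { case (k, v) => k -> v.size }
}
```

Keeping the parsing and aggregation logic in plain functions like these makes it easy to unit test the core of the pipeline without spinning up Kafka or a Spark cluster.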

Let us start by setting up the development environment.

Setup Development Environment

Make sure you have IntelliJ, Scala, sbt, and Spark set up so that you can build applications using the IDE.
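As a starting point, here is a minimal `build.sbt` sketch for a Spark Streaming project with Kafka integration. The project name is hypothetical, and the versions are assumptions chosen to match the cluster's Spark 2.3.1; adjust them to your environment.

```scala
// build.sbt -- minimal sketch; versions are assumptions matching Spark 2.3.1
name := "streaming-pipelines"
version := "0.1"
scalaVersion := "2.11.12"  // Spark 2.3.x is built against Scala 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-streaming" % "2.3.1",
  // Kafka integration for Spark Streaming (0.10+ consumer API)
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1"
)
```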

Setup Spark Development Environment – IntelliJ and Scala

Understanding Big Data Cluster

Let us go through the details of the cluster on which the demo is run.

  • 7-node plain-vanilla cluster with the following services
    • HDFS
    • Spark 2.3.1 with Mesos
    • HBase
  • Cloudera Quickstart VM’s gen_logs to simulate web server logs
  • Logstash
  • A 3 node Kafka Cluster
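With the 3-node Kafka cluster in place, a topic for the web server logs can be created with a replication factor that uses all three brokers. The commands below are a sketch; the ZooKeeper hostname and topic name are illustrative assumptions (for Kafka versions from 2.2 onward, `--bootstrap-server` replaces `--zookeeper`).

```shell
# Create a topic for the web server logs on the 3-node Kafka cluster.
kafka-topics.sh --create \
  --zookeeper zk01.example.com:2181 \
  --topic web_server_logs \
  --partitions 3 \
  --replication-factor 3

# Verify the topic and its partition/replica assignment.
kafka-topics.sh --describe \
  --zookeeper zk01.example.com:2181 \
  --topic web_server_logs
```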
