Kafka Overview  

Let us look at an overview of Apache Kafka before getting into the details of configuring and validating it.

  • Apache Kafka is an open-source stream-processing platform written in Scala and Java. It was initially developed as an internal product at LinkedIn, later open-sourced, and then adopted by the Apache Software Foundation. It is named after the author Franz Kafka.
  • Salient Features:
    • Highly Scalable (partitioning)
    • Fault Tolerant (replication factor)
    • Low Latency
    • High Throughput

Kafka Ecosystem

The heart of Kafka is the topic, a distributed and fault-tolerant log file. Over time, however, Kafka has evolved into an ecosystem of tools:

  • Kafka Connect
  • Kafka Streams and Kafka SQL (see the Kafka Streams sketch after this list)
  • Producer and Consumer APIs
  • Third-party plugins to integrate with Flume, Logstash, Spark Streaming, Storm, Flink, etc.
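
To give a feel for Kafka Streams, here is a minimal word-count sketch in Java. The application id, the broker address (localhost:9092), and the topic names (text-input, word-counts) are illustrative assumptions, not fixed values.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;
    import java.util.Arrays;
    import java.util.Properties;

    public class WordCountApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");    // assumed app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");       // assumed input topic
            KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)   // re-key each record by the word itself
                .count();                       // running count per word
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }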

Kafka Use Cases

As microservices have evolved, Kafka has become a popular way to integrate data between different microservices, whether asynchronously, in real time, or in batch.

  • Activity Tracking: Kafka was originally developed to track user activity on LinkedIn.
  • Messaging: Kafka is also used for messaging, where applications need to send notifications (such as emails) to users.
  • Metrics and logging: Applications publish metrics on a regular basis to a Kafka topic, and those metrics can be consumed by systems for monitoring and alerting.
  • Commit log: Database changes can be published to Kafka, and applications can monitor this stream to receive live updates as they happen. This changelog stream can also be used to replicate database updates to a remote system.
  • Stream processing: Kafka integrates with stream-processing frameworks such as Spark Streaming, Flink, Storm, etc. Users can write applications that operate on Kafka messages, performing tasks such as counting metrics, transforming data, etc.

Glossary

Topic: A topic represents a group of files and directories. When we create a topic, Kafka creates one directory per partition, named with the topic name and partition index. These directories hold the files that actually store the messages being produced.
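
As a quick sketch, a topic can be created programmatically with the Java AdminClient; the broker address and topic name below are assumptions for illustration:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 2: the brokers create directories
                // demo-topic-0, demo-topic-1, demo-topic-2 under log.dirs,
                // each holding the log segment files for that partition
                NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }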

Publisher or Producer: Publishers or producers are processes that publish data (push messages) to the log files associated with a Kafka topic.
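
A minimal Java producer sketch, assuming a broker at localhost:9092 and the demo-topic created above:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // messages with the same key always land in the same partition
                producer.send(new ProducerRecord<>("demo-topic", "user-1", "page_view"));
            }
        }
    }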

Subscriber or Consumer: Subscribers or consumers are processes that read from the log files associated with a Kafka topic.
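
And the matching consumer sketch, with the same assumed broker and topic; the group id is illustrative:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ConsumerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("group.id", "demo-group");              // illustrative group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                while (true) {
                    // poll pulls any new messages appended since the last read
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.value());
                    }
                }
            }
        }
    }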

Kafka Pub-Sub Model

Partition: Kafka topics are divided into a number of partitions, each of which contains messages in an immutable sequence. This allows multiple consumers to read from a topic in parallel.

Leader: When we create a Kafka topic with partitions and a replication factor, each partition has a leader. Messages are first written to the partition on the broker designated as the leader and are then copied to the rest of the followers.

Replication Factor: Each partition can be replicated into multiple copies using the replication factor, which provides fault tolerance. With a replication factor of n on an m-node cluster (where n <= m), the cluster can survive the failure of up to n-1 nodes at any point in time. For example, with a replication factor of 3, a partition's data survives even if 2 of the brokers holding its replicas fail.

Broker: A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. Producers query the metadata of each topic and connect to the leader of each partition to produce messages into the topic. Consumers do the same when consuming messages from the topic.

Offset: The records in a partition are each assigned a sequential id number called the offset, which uniquely identifies each record within the partition.
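
To make offset tracking concrete, here is a variation of the consumer sketch above that disables auto-commit and commits offsets manually; the broker address, group id, and topic name remain illustrative assumptions:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ManualCommitDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("group.id", "audit-group");             // illustrative group id
            props.put("enable.auto.commit", "false");         // we track offsets ourselves
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic"));
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                        // the (partition, offset) pair uniquely identifies this record
                        System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset());
                    }
                    consumer.commitSync(); // persist consumed offsets back to Kafka
                }
            }
        }
    }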

Learning Process

We will follow our standard learning process when adding any software-based service.

  • Downloading and Installing – The required binaries are downloaded as part of the initial setup; we will just add the Kafka service to Cloudera Manager.
  • Configuration – We need to understand the architecture and plan the configuration.
    • Architecture – Producer and Consumer Architecture
      • Producers connect to one or more brokers and push messages to topics via the partition leaders.
      • Consumers pull messages from a topic by polling it at regular intervals. Each time a consumer reads messages, it needs to keep track of the offset (which can be done in multiple ways; see the manual-commit sketch under Offset above).
    • Components
      • Zookeeper Ensemble – Already set up as part of the Zookeeper configuration.
      • Kafka brokers
    • Configuration Files
      • /etc/kafka/conf/server.properties (see the sample sketch after this list)
  • Service Logs – /var/log/kafka
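
For reference, a minimal server.properties sketch; all values below are illustrative and should be adjusted to the cluster (in a Cloudera Manager deployment, most of these are managed through the UI rather than edited by hand):

    # /etc/kafka/conf/server.properties (illustrative values)
    broker.id=0                        # unique id per broker in the cluster
    listeners=PLAINTEXT://:9092        # port the broker listens on
    log.dirs=/var/local/kafka/data     # where partition directories live
    num.partitions=1                   # default partition count for new topics
    zookeeper.connect=localhost:2181   # Zookeeper ensemble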
