Saltar al contenido principal
InicioTutorialesIngeniería de datos

Apache Kafka for Beginners: A Comprehensive Guide

Explore Apache Kafka with our beginner's guide. Learn the basics, get started, and uncover advanced features and real-world applications of this powerful event-streaming platform.
mar 2024  · 8 min leer

The modern application landscape has been completely transformed by microservices and event-based application architectures. The days of batch-based processing are long gone. To meet the demands of speed and availability, developers aim for near-real timeliness and fully decoupled application architecture.

An event management system is usually used by decoupled applications to help with integration across different services. Another application for an event store is in event sourcing-based architectures, where all application logic is expressed in terms of events. This is where Apache Kafka comes into the discussion.

Apache Kafka is an open-source distributed event-streaming framework developed at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. The tool was optimized to ingest and process streaming data in real time; thus, it can be used to implement high-performance data pipelines, streaming analytics applications, and data integration services.

Essentially, Kafka is a framework for distributed event stores and stream processing that can be used as a message broker, event store, queue management system, etc. You can read our comparison of Kafka vs Amazon SQS in a separate guide and check out how Kafka Partitions work in our tutorial. 

Why Kafka?

After storing the events, the event streaming platform directs them to the relevant services. Developers can scale and maintain a system like this without relying on other services. Since Kafka is an all-in-one package that can be used as a message broker, event store, or stream processing framework, it works excellently for such a requirement.

With high throughput and durability, hundreds of data sources can simultaneously provide huge inflows of continuous data streams that can be processed sequentially and progressively by Apache Kafka's message broker system.

Moreover, the advantages of Apache Kafka include:

High throughput

Kafka’s well-designed architecture, which includes data partitioning, batch processing, zero-copy techniques, and append-only logs, enables it to reach high throughput and handle millions of messages per second, catering to high-velocity and high-volume data scenarios. Thus, the lightweight communication protocol that facilitates effective client-broker interaction makes real-time data streaming feasible.

Scalability

Apache Kafka provides load balancing across servers by partitioning a topic into multiple partitions. This enables users to distribute production clusters among geographical areas or availability zones and scale them up or down to suit their needs. In other words, Apache Kafka can be easily scaled up to handle trillions of messages per day across a large number of partitions.

Low latency

Apache Kafka uses a cluster of servers with low latency (as low as 2 milliseconds) to efficiently deliver messages at network-limited throughput by decoupling data streams.

Durability and reliability

Apache Kafka enhances data fault-tolerances and durability in two key ways:

  • It distributes data stream storage in a fault-tolerant cluster to guard against server failure
  • It saves messages to disc and offers intra-cluster replication.

High availability

Due to its cluster-based brokers, Kafka remains operational during a server outage. This works because when one of the servers experiences issues, Kafka is smart enough to send queries to different brokers.

Become a Data Engineer

Become a data engineer through advanced Python learning
Start Learning for Free

Core Components of Kafka

Apache Kafka enables users to analyze data in real time, store records in the order they were created, and publish and subscribe to them.

The core components of Apache Kafka

The core components of Apache Kafka | Source: Start your real-time pipeline with Apache Kafka by Alex Kam

Apache Kafka is a horizontally scalable cluster of commodity servers that processes real-time data from multiple "producer" systems and applications (e.g., logging, monitoring, sensors, and Internet of Things applications) and makes it available to multiple "consumer" systems and applications (e.g., real-time analytics) at very low latency – as shown in the diagram above.

Note that applications that depend on real-time data processing and analytics systems can both be considered consumers. For example, an application for location-based micromarketing or logistics.

Here are some key terms to know to understand the core components of Kafka better:

  • Brokers are servers in the cluster that store data and serve clients.
  • Topics are categories or feeds to which records are published. Note that there are two types of topics: compacted and regular. Records stored in compacted topics are not limited by time or space. Older topic messages with the same key are updated by newer messages, and Apache Kafka does not remove the most recent message unless the user does so. In contrast, records stored in regular topics can be configured to expire, thus enabling outdated data to be deleted, which frees up storage space.
  • Producers are client applications that publish (write) events to Kafka topics.
  • Consumers are applications that read and process events from Kafka topics. Java programs that retrieve data from Topics and return results to Apache Kafka can be written thanks to the Streams API for Apache Kafka. External stream processing systems like Apache Spark, Apache Apex, Apache Flink, Apache NiFi, and Apache Storm can also process these message streams.
  • Kafka uses a client library called “Streams” to create microservices and applications where data is stored in clusters for input and output.

Getting Started with Kafka

It is often recommended that Apache Kafka is started with Zookeeper for optimum compatibility. Also, Kafka may run into several problems when installed on Windows because it’s not natively designed for use with this operating system. Consequently, it is advised to use the following to launch Apache Kafka on Windows:

  • WSL2 if you're running Windows 10 or later, or Docker
  • Docker If you're running Windows 8 or earlier

It is not advised to use the JVM to run Kafka on Windows since it does not have some of the POSIX characteristics that are unique to Linux. You will eventually encounter difficulties if you attempt to run Kafka on Windows without WSL2.

That said, the first step to install Apache Kafka on Windows is to install WSL2.

Step 1: Install WSL2

WSL2, or Windows Subsystem for Linux 2, gives your Windows PC access to a Linux environment without needing a virtual machine.

Most Linux commands are compatible with WSL2, bringing the Kafka installation process closer to the instructions offered for Mac and Linux.

Tip: Ensure you’re running Windows 10 version 2004 or higher (Build 19041 and higher) before installing WSL2. Press the Windows logo key + R, type “winver,” then click OK to see your Windows version.

The easiest way to install Windows Subsystem for Linux (WSL) is by running the following command in an administrator PowerShell or Windows Command Prompt and then restarting your computer:

wsl --install

Note you’ll be prompted to create a user account and password for your newly installed Linux distribution.

Follow the steps on the Microsoft Docs website if you get stuck.

Step 2: Install Java

If Java is not installed on your machine, then you’ll need to download the latest version.

Step 3: Install Apache Kafka

At the time of writing this article, the latest stable version of Apache Kafka is 3.7.0, released on February 27th, 2024. This can change at any time. To ensure you’re using the most updated, stable version of Kafka, check out the download page.

Download the latest version from the Binary downloads.

Once the download is complete, navigate to the folder where Kafka was downloaded and extract all the files from the zipped folder. Note we named our new folder “Kafka.”

Step 3: Start Zookeeper

Zookeeper is required for cluster management in Apache Kafka. Therefore, Zookeeper must be launched before Kafka. Installing Zookeeper separately is not necessary because it is part of Apache Kafka.

Open the command prompt and navigate to the root Kafka directory. Once there, run the following command to start Zookeeper:

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

Start 4: Start the Kafka server

Open another command prompt and run the following command from the root of Apache Kafka to start Apache Kafka:

.\bin\windows\kafka-server-start.bat .\config\server.properties

Step 5: Create a topic

To create a topic, start a new command prompt from the root Kafka directory and run the following command:

.\bin\windows\kafka-topics.bat --create --topic MyFirstTopic --bootstrap-server localhost:9092

This will create a new Kafka topic called “MyFirstTopic.”

Note: to confirm it’s created correctly, executing the command will return “Create topic <name of topic>.”

Step 6: Start Kafka Producer

A producer must be started to put a message in the Kafka topic. Execute the following command to do so:

.\bin\windows\kafka-console-producer.bat --topic MyFirstTopic --bootstrap-server localhost:9092

Step 7: Start Kafka consumer

Open another command prompt from the root Kafka directory and run the following command to send to stat the Kafka consumer:

.\bin\windows\kafka-console-consumer.bat --topic MyFirstTopic --from-beginning --bootstrap-server localhost:9092

Now, when a message is produced in the Producer it is read by the Consumer in real time.

A message produced by the Producer and read by the Kafka Consumer in real time

A GIF demonstrates a message produced by the Producer and read by the Kafka Consumer in real time.

There you have it… You’ve just created your first Kafka topic.

Advanced Kafka Features

Here’s a few advanced features of Kafka…

Kafka Streams

Using Apache Kafka, developers may create robust stream-processing applications with the help of Kafka Streams. It offers APIs and a high-level Domain-Specific Language (DSL) for handling, converting, and evaluating continuous streams of data.

A few key features include:

Stream processing

Real-time processing of record streams is made possible by Kafka Streams. It enables you to take in data from Kafka topics, process and transform the data, and then return the processed information back to Kafka topics. Stream processing allows for near real-time analytics, monitoring, and data enrichment. It can be applied to individual recordings or to windowed aggregations.

Event-time processing

Using the timestamps attached to the records, you can handle out-of-order records with Kafka Streams' support for event-time processing. It provides event-time semantics windowing operations, allowing for window joining, sessionization, and time-based aggregations.

Time-based operations and windowing

Kafka Streams provides a number of windowing operations that enable users to perform computations over session windows, tumbling windows, sliding windows, and fixed time windows.

Event-based triggers, time-sensitive analysis, and time-based aggregations are possible via windowed procedures.

Stateful processing

During stream processing, Kafka Streams enables users to keep and update the state. Built-in support for state stores is provided. Note a state store is a key-value store that can be updated and queried within a processing topology.

Advanced stream processing functions like joins, aggregations, and anomaly detection are made possible by stateful operations.

Exactly-once processing

Kafka Streams provides end-to-end semantics for exactly-once processing, guaranteeing that every record is handled precisely once—even in the event of a failure. This is accomplished by utilizing Kafka's robust durability guarantees and transactional features.

Kafka Connect

The declarative, pluggable data integration framework for Kafka is called Kafka Connect. It is an open-source, free component of Apache Kafka that acts as a centralized data hub for simple data integration between file systems, databases, search indexes, and key-value stores.

The Kafka distribution includes Kafka Connect by default. To install it, you just have to start a worker process.

Use the following command from the root Kafka directory to launch the Kafka Connect worker process:

.\bin\windows\connect-distributed.bat 
.\config\connect-distributed.properties

This will launch the Kafka Connect worker in distributed mode, enabling the high availability and scalability of running numerous workers in a cluster.

Note: The .\config\connect-distributed.properties file specifies the Kafka broker information and other configuration properties for Kafka Connect.

Connectors are used by Kafka Connect to transfer data between external systems and Kafka topics. Connectors can be installed and configured to suit your unique requirements for data integration.

Downloading and adding the connector JAR file to the plugin.path directory mentioned in the .\config\connect-distributed.properties file is all that is required to install a connector.

A configuration file for the connector must be created, in which the connector class and any other characteristics must be specified. The Kafka Connect REST API and command-line tools can be used to construct and maintain connectors.

Next, in order to combine data from other systems, you must configure the Kafka Connector. Kafka Connect offers various connectors for integrating data from various systems, such as file systems, message queues, and databases. Select a connector based on the needs you have for integration - see the documentation for the list of connectors.

Real-World Applications of Kafka

Here are a few common use cases of Apache Kafka.

Activity tracking

Kafka can be used by an online e-commerce platform to track user activity in real time. Every user action, including viewing products, adding items to carts, making purchases, leaving reviews, doing searches, and so on, may be published as an event to particular Kafka topics.

Other microservices can store or use these events for real-time fraud detection, reporting, personalized offers, and recommendations.

Messaging

With its improved throughput, built-in partitioning, replication, fault-tolerance, and scaling capabilities, Kafka is a good substitute for conventional message brokers.

A microservices-based ride-hailing app can use it to facilitate message exchanges across various services.

For instance, the ride-booking business might use Kafka to communicate with the driver-matching service when a rider makes a reservation. The driver-matching service may then, in almost real time, locate a driver in the area and reply with a message.

Log aggregation

Typically, this entails obtaining physical log files from servers and storing them in a central location for processing, such as a file server or data lake. Kafka abstracts the data as a stream of messages and filters the file-specific information. This makes it possible to process data with reduced latency and to accommodate distributed data consumption and different data sources more easily.

After the logs are published to Kafka, they can be used for troubleshooting, security monitoring, and compliance reporting by a log analysis tool or a security information and event management (SIEM) system.

Stream processing

Several Kafka users process data in multi-stage processing pipelines. Raw input data is taken from Kafka topics, aggregated, enriched, or otherwise transformed into new topics for additional consumption or processing.

For example, a bank may use Kafka to process real-time transactions. Each transaction a customer starts is then posted to a Kafka topic as an event. Subsequently, an application may take in these events, verify and handle the transactions, stop dubious ones, and instantly update customer balances.

Metrics

Kafka could be used by a cloud service provider to aggregate statistics from distributed applications to generate real-time centralized streams of operational data. Metrics from hundreds of servers, such as CPU and memory consumption, request counts, error rates, etc., might be reported to Kafka. Monitoring applications might then use these metrics for anomaly identification, alerting, and real-time visualization.

Conclusion

Kafka is a framework for stream processing that enables applications to publish, consume, and process large amounts of record streams quickly and reliably. Considering the increasing prevalence of real-time data streaming, it has become a vital tool for data application developers.

In this article, we covered why one may decide to use Kafka, core components, how to get started with it, advanced features, and real-world applications. To continue your learning, check out:

Get certified in your dream Data Engineer role

Our certification programs help you stand out and prove your skills are job-ready to potential employers.

Get Your Certification
Timeline mobile.png

Photo of Kurtis Pykes
Author
Kurtis Pykes
LinkedIn
Temas

Start Your Data Engineer Journey Today!

Track

Data Engineer

57hrs hr
Gain in-demand skills to efficiently ingest, clean, manage data, and schedule and monitor pipelines, setting you apart in the data engineering field.
See DetailsRight Arrow
Start Course
Ver másRight Arrow
Relacionado

blog

The Kafka Certification Guide for Data Professionals

Learn how to advance your career with the Confluent Certified Developer (CCDAK) and Administrator (CCAAK) certifications, gaining the expertise and recognition needed to excel in data streaming and management.
Adejumo Ridwan Suleiman's photo

Adejumo Ridwan Suleiman

13 min

blog

Kafka vs SQS: Event Streaming Tools In-Depth Comparison

Compare Apache Kafka and Amazon SQS for real-time data processing and analysis. Understand their strengths and weaknesses for data projects.
Zahara Miriam's photo

Zahara Miriam

18 min

blog

What is A Graph Database? A Beginner's Guide

Explore the intricate world of graph databases with our beginner's guide. Understand data relationships, dive deep into the comparison between graph and relational databases, and explore practical use cases.
Kurtis Pykes 's photo

Kurtis Pykes

11 min

tutorial

Getting Started with Apache Airflow

Learn the basics of bringing your data pipelines to production, with Apache Airflow. Install and configure Airflow, then write your first DAG with this interactive tutorial.
Jake Roach's photo

Jake Roach

10 min

tutorial

Kafka Partitions: Essential Concepts for Scalability and Performance

Partitions are components within Kafka's distributed architecture that enable Kafka to scale horizontally, allowing for efficient parallel data processing.
Kurtis Pykes 's photo

Kurtis Pykes

11 min

tutorial

A Beginner's Guide to BigQuery

Learn what BigQuery is, how it works, its differences from traditional data warehouses, and how to use the BigQuery console to query public datasets provided by Google.
Eduardo Oliveira's photo

Eduardo Oliveira

9 min

See MoreSee More