
Kafka Streams Tutorial: Introduction to Real-Time Data Processing

Learn how to build real-time data processing applications with Kafka Streams. This guide covers core concepts, Java & Python implementations, and step-by-step examples for building scalable streaming applications.
Apr 9, 2025

Today, organizations need to process data as it comes in rather than waiting for scheduled batches. This stream processing can provide real-time insight by analyzing data in motion, resulting in better decision-making and enhanced user experiences. This shift has resulted in increased demand for tools that can process large-scale streaming data efficiently.

One of the most popular stream processing platforms is Apache Kafka, a distributed, fault-tolerant, and high-throughput technology. Originally developed at LinkedIn to power its data pipelines, Kafka now sits at the heart of an ecosystem that includes key components like Kafka Connect and Schema Registry. Kafka Streams, a lightweight library within this ecosystem, lets you build real-time applications directly on Kafka, and this article will show you how to use it effectively.

Note: This article assumes familiarity with fundamental concepts in stream processing and Apache Kafka. You can check out DataCamp’s Streaming Concepts Course and Intro to Apache Kafka tutorial if you need a refresher.

What Is Kafka Streams?


Kafka Streams is a client library that allows developers to build real-time stream processing applications directly on top of Apache Kafka. 

Unlike traditional systems that rely on separate clusters, Kafka Streams runs as a standard Java library within your application, making deployment and scaling simpler. It provides a high-level API for defining operations like filtering, transformations, joins, aggregations, and windowing.

Built to handle continuous data streams with minimal operational overhead, Kafka Streams leverages Kafka’s core strengths, such as partitioning, fault tolerance, and scalability. It adds stream processing semantics without the need to manage additional infrastructure. Since its release in Kafka 0.10 (2016), it has become a reliable, production-grade solution used across industries for critical streaming applications.

Kafka Streams Core Concepts and Architecture

Kafka Streams is built around two core abstractions: streams and tables.

A stream is an unbounded sequence of records, each with a key, value, and timestamp, while a table represents the current state of data, mapping each key to its latest value. Developers can convert between the two using operations like aggregations (stream-to-table) and change capture (table-to-stream), enabling powerful patterns for real-time processing.

Kafka Streams applications are defined by a processing topology—a directed graph of processor nodes connected by streams. Each node performs specific tasks like filtering, mapping, or joining, and the entire topology is automatically partitioned based on Kafka’s topic partitions. 

This design allows parallel processing across multiple application instances, offering horizontal scalability without the need to manage distributed infrastructure manually.

Key concepts in Kafka Streams:

  • Stream: A continuous flow of records (key, value, timestamp).
  • Table: A snapshot of the latest value for each key.
  • Stream-table duality: Allows transformation between streams and tables.
  • Processing topology: A graph of stream processors that defines the data flow.
  • Parallelism: Achieved through Kafka topic partitioning for scalable processing.

For an in-depth exploration of how Kafka’s partitioning works and its impact on scalability, refer to DataCamp’s tutorial on Kafka Partitions: Essential Concepts for Scalability and Performance.
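
To make stream-table duality concrete, here is a minimal sketch in Java (the user-locations topic name is hypothetical, and it assumes the String serdes and StreamsBuilder setup used in the word-count example later in this tutorial). It turns a stream of location updates into a table of each user's latest location, then back into a stream of change events:

// Assumes a StreamsBuilder named "builder", as in the word-count example later on.
KStream<String, String> locationUpdates = builder.stream("user-locations");

// Stream -> table: keep only the latest location per user key (an aggregation)
KTable<String, String> latestLocation = locationUpdates
        .groupByKey()
        .reduce((previous, current) -> current);

// Table -> stream: every update to the table becomes a change event
latestLocation.toStream().to("latest-user-locations");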

Key features and capabilities

Kafka Streams provides exactly-once processing guarantees, ensuring each record is processed once, even during failures or restarts. It supports stateful operations like joins and aggregations using local state stores, often backed by RocksDB, removing the need for external databases. 

These features make it suitable for tasks that require accuracy, such as financial processing and analytics.

Key features include:

  • Exactly-once processing: Maintains consistency across failures.
  • Local state stores: Support in-memory and disk-based state.
  • Flexible deployment: Runs on servers, containers, or Kubernetes.
  • Integration: Works with Schema Registry and Kafka Connect.
  • Interactive queries: Provide access to application state in real time.

These capabilities support a range of applications, from basic data processing to more complex event-driven systems.
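
As an illustration of interactive queries, here is a minimal sketch that reads a key directly from a local state store. It assumes a running KafkaStreams instance named streams and a key-value store named "word-counts-store"; both names are assumptions you control when building the topology:

// Query a local state store from a running application (store name is hypothetical).
// Requires org.apache.kafka.streams.StoreQueryParameters,
// org.apache.kafka.streams.state.QueryableStoreTypes and ReadOnlyKeyValueStore.
ReadOnlyKeyValueStore<String, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType(
                "word-counts-store",
                QueryableStoreTypes.<String, Long>keyValueStore()));

Long count = store.get("kafka");  // current count for "kafka", or null if the key is absent
System.out.println("kafka -> " + count);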

Kafka Streams vs Apache Kafka

Now that we’ve explored what Kafka Streams is and its core architecture, it’s important to clarify its relationship with Apache Kafka itself. 

Newcomers to the Kafka ecosystem often confuse Kafka Streams with Apache Kafka or assume the two are interchangeable technologies. 

Understanding the distinction between these two components is crucial for architects and developers, as it affects how you design your data pipeline architecture and which tools you select for specific data processing requirements.

Apache Kafka characteristics

  • Functions as a distributed event streaming platform and messaging system
  • Consists of servers (brokers) that store streams of records in topics
  • Uses a publish-subscribe model with producers and consumers
  • Excels at reliably storing and transmitting data
  • Provides no built-in data transformation capabilities
  • Supports client libraries in multiple programming languages

Kafka Streams, by contrast, is a client library that builds upon Apache Kafka’s foundation. While Kafka itself provides the infrastructure for storing and moving data, Kafka Streams adds computational capabilities for transforming that data. 

Think of Apache Kafka as the highway system that moves data around, while Kafka Streams provides vehicles with specific functions to process the data as it travels. 

The traditional approach to data processing with Apache Kafka involves writing custom consumer applications, which becomes complex when implementing stateful operations like aggregations or joins.
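
To make that contrast concrete, here is a hedged sketch of the plain-consumer approach using the standard Java client and a hypothetical purchases topic. Note that the aggregation state (the per-product counts) lives in application memory and must be persisted and recovered by your own code, which is exactly what Kafka Streams automates:

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainConsumerCounter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "purchase-counter");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        // Hand-rolled state: lost on a crash unless you persist and restore it yourself
        Map<String, Long> purchaseCounts = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("purchases")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    purchaseCounts.merge(record.key(), 1L, Long::sum);
                }
            }
        }
    }
}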

Kafka Streams characteristics

  • Client library specifically for Java/Scala applications
  • Provides high-level APIs for stream processing operations
  • Manages state automatically for aggregations and joins
  • Offers exactly-once processing guarantees
  • Operates as standard applications without additional infrastructure
  • Excels at continuous, real-time data transformation
  • Simplifies operations in microservice architectures

The choice between using Apache Kafka alone or incorporating Kafka Streams depends on your specific use case. 

For simple data movement between systems with minimal transformation, standard Kafka producers and consumers provide sufficient functionality. These scenarios include collecting logs, forwarding events to analytics platforms, or implementing basic publish-subscribe patterns. 

However, when applications require continuous processing involving transformations, aggregations, joins, or windowed operations, Kafka Streams offers significant advantages. 

Its built-in state management, exactly-once processing guarantees, and operational simplicity make it particularly valuable for real-time data processing within existing application infrastructure.

| Feature | Apache Kafka | Kafka Streams |
|---|---|---|
| Primary Purpose | Distributed messaging system and event store | Stream processing library |
| Architecture | Cluster of broker servers | Client-side library |
| Infrastructure Requirements | Dedicated server cluster | None (runs within application) |
| Programming Languages | Client libraries in multiple languages | Java and Scala only |
| Processing Capability | Basic message routing | Complex stream processing (filtering, aggregation, joins, windowing) |
| State Management | None (stateless) | Built-in state stores for stateful operations |
| Processing Guarantees | At-least-once delivery | Exactly-once processing |
| Deployment Model | Cluster deployment | Standard application deployment |
| Use Cases | Event storage, message queuing, log aggregation | Real-time analytics, data transformation, event-driven applications |
| Horizontal Scaling | Adding brokers to cluster | Running multiple application instances |
| Fault Tolerance | Replication across brokers | Automatic state recovery |
| Role in System | Central data backbone | Processing layer on top of Kafka |

Comprehensive Instructions on Installing and Setting Up Kafka Streams

Having understood what Kafka Streams is and how it differs from Apache Kafka, we can now proceed with setting up a development environment for building Kafka Streams applications. 

This section provides detailed instructions for installing and configuring all necessary components on both macOS and Windows systems. 

Following these steps will prepare your environment for developing, testing, and running Kafka Streams applications.

Prerequisites

Before diving into Kafka Streams development, you need to set up several foundational components: Java Development Kit (JDK), Apache Kafka, and build tools. These components form the infrastructure upon which your Kafka Streams applications will run.

Java Development Kit installation

For macOS:

1. Install Homebrew if you haven’t already by running the following in Terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Install the JDK using Homebrew:

brew install openjdk@17

3. Set up the JAVA_HOME environment variable by adding these lines to your ~/.zshrc or ~/.bash_profile:

export JAVA_HOME=$(/usr/libexec/java_home -v 17)
export PATH=$JAVA_HOME/bin:$PATH

4. Reload your profile:

source ~/.zshrc # or source ~/.bash_profile

5. Verify the installation:

java -version

For Windows:

1. Download the Java SE Development Kit 17 installer from the Oracle website, or get an OpenJDK build from Adoptium.

2. Run the installer and follow the installation wizard instructions.

3. Set the JAVA_HOME environment variable:

  • Right-click on “This PC” or “My Computer” and select “Properties”
  • Click on “Advanced system settings”
  • Click the “Environment Variables” button
  • Under “System variables”, click “New”
  • Enter “JAVA_HOME” as the variable name
  • Enter the JDK installation path (e.g., C:\Program Files\Java\jdk-17) as the variable value

4. Add Java to the PATH variable:

  • Under “System variables”, find the “Path” variable and click “Edit”
  • Click “New” and add “%JAVA_HOME%\bin”
  • Click “OK” to close all dialogs

5. Verify the installation by opening Command Prompt and typing:

java -version

Apache Kafka Setup

For macOS:

1. Download and install Kafka using Homebrew:

brew install kafka

2. Start Zookeeper (Kafka’s coordination service):

brew services start zookeeper

3. Start Kafka:

brew services start kafka

4. Verify the installation by creating a test topic:

kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

If you run into any issues when starting the services or creating a test topic, try running the commands with sudo.

For Windows:

  1. Download the latest Kafka release from the Apache Kafka downloads page.
  2. Extract the downloaded archive to a directory of your choice (e.g., C:\kafka).
  3. Open PowerShell or Command Prompt as administrator and navigate to the Kafka directory:

cd C:\kafka

4. Start Zookeeper:

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

5. Open another PowerShell or Command Prompt window and start Kafka:

.\bin\windows\kafka-server-start.bat .\config\server.properties

6. Create a test topic (in a third PowerShell or Command Prompt window):

.\bin\windows\kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

IDE setup and configuration

Modern IDEs significantly enhance the development experience for Kafka Streams applications by providing code completion, debugging capabilities, and integration with build tools. Below, we cover setup instructions for the popular IntelliJ IDEA.

IntelliJ IDEA setup

For macOS:

1. Download IntelliJ IDEA from the JetBrains website.

  • The Community Edition is free and sufficient for Kafka Streams development.
  • The Ultimate Edition offers additional features like Spring support and database tools.

2. Open the downloaded DMG file and drag the IntelliJ IDEA icon to the Applications folder.

3. Launch IntelliJ IDEA from the Applications folder.

4. During first-time setup, configure your preferences and install any recommended plugins.

For Windows:

  1. Download IntelliJ IDEA from the JetBrains website.
  2. Run the installer and follow the installation wizard.
  3. Launch IntelliJ IDEA after installation.
  4. During first-time setup, configure your preferences and install any recommended plugins.

Project configuration

With your development environment set up, you can now create and configure a new Kafka Streams project using either Maven or Gradle, two popular build automation tools for Java applications.

Maven and Gradle are essential tools that handle dependency management, build processes, and project organization for Java applications:

Maven is a mature and widely used build tool that uses XML configuration files (pom.xml) to define project settings and dependencies. It follows a convention-over-configuration approach with a predefined project structure, making it straightforward for beginners. 

Maven automatically downloads required libraries from central repositories and manages the build lifecycle.

Gradle is a more modern build tool that uses Groovy or Kotlin for its build scripts instead of XML. It offers greater flexibility and faster build times compared to Maven. 

While it has a steeper learning curve, Gradle provides more powerful customization options and is increasingly becoming the standard for new Java projects.

Both tools will help you manage the Kafka Streams libraries and other dependencies needed for your streaming applications, allowing you to focus on writing code rather than manually handling JAR files.

Creating a Maven project

In IntelliJ IDEA:

1. Click “File” > “New” > “Project…”

2. Select “Maven” from the left panel

3. Check “Create from archetype” and select “maven-archetype-quickstart”

4. Click “Next” and provide:

  • GroupId: e.g., “com.example”
  • ArtifactId: e.g., “kafka-streams-demo”
  • Version: e.g., “1.0-SNAPSHOT”

5. Click “Finish”

Adding Kafka Streams dependencies

For Maven projects, add the following to your pom.xml file:

<dependencies>
   <dependency>
       <groupId>org.apache.kafka</groupId>
       <artifactId>kafka-streams</artifactId>
       <version>3.4.0</version>
   </dependency>
   <dependency>
       <groupId>org.apache.kafka</groupId>
       <artifactId>kafka-clients</artifactId>
       <version>3.4.0</version>
   </dependency>
   <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>slf4j-api</artifactId>
       <version>2.0.5</version>
   </dependency>
   <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>slf4j-simple</artifactId>
       <version>2.0.5</version>
   </dependency>
</dependencies>

After adding the dependencies, reload the Maven project by right-clicking in the editor and choosing “Maven > Reload project”.

For Gradle projects, add the following to your build.gradle file:

dependencies {
   implementation 'org.apache.kafka:kafka-streams:3.4.0'
   implementation 'org.apache.kafka:kafka-clients:3.4.0'
   implementation 'org.slf4j:slf4j-api:2.0.5'
   implementation 'org.slf4j:slf4j-simple:2.0.5'
}

After adding these dependencies, refresh your project to ensure they’re properly downloaded and configured. 

With these steps completed, your development environment is now ready for building Kafka Streams applications. Next, we’ll explore the core concepts of Kafka Streams to build a solid foundation before implementing our first application.

Core Kafka Streams Concepts

Before diving into code examples, it’s important to understand the fundamental concepts and terminology used in Kafka Streams. This section provides a beginner-friendly introduction to the key ideas you’ll encounter when building stream processing applications with Kafka Streams.

Basic Kafka terminology

Let’s explore some of the key terms that you will encounter when using Kafka. 

Topics: The foundation of data organization

Think of a topic as a category or feed name to which records are published. In the physical world, this might be similar to a file folder where related documents are stored together. For example, a retail application might have separate topics for sales, inventory-updates, and customer-activity.

Topics in Kafka have these important characteristics:

  • They can be partitioned for parallel processing
  • They store data in an append-only log (new data is always added to the end)
  • They retain data according to configurable retention policies

Understanding how topics are partitioned is crucial for building scalable stream processing applications. For a complete guide to how partitioning impacts your Kafka applications, see DataCamp’s tutorial on Kafka Partitions: Essential Concepts for Scalability and Performance.
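
If you prefer creating topics from code instead of the CLI, here is a hedged sketch using the Java Admin client; the topic name, partition count, and retention value are purely illustrative:

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 3 partitions for parallelism, 7-day retention (both illustrative values)
            NewTopic salesTopic = new NewTopic("sales", 3, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
            admin.createTopics(Set.of(salesTopic)).all().get(); // blocks until the topic exists
        }
    }
}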

Records: The data units

A record (sometimes called a message) is the basic unit of data in Kafka. Each record consists of:

  • A key (optional): Determines which partition the record goes to
  • A value: The actual data payload
  • A timestamp: When the record was created or processed

For example, in a weather monitoring system, a record might look like:

Key: "NYC"
Value: {"temperature": 72, "humidity": 65, "wind": 5}
Timestamp: 2025-03-13T14:25:30Z

Producers and consumers: The data movers

Producers are applications that publish (write) records to Kafka topics. For example, a temperature sensor might produce readings to a ‘temperature-readings’ topic.

Consumers are applications that subscribe to (read) topics and process the records. For instance, a dashboard application might consume from the temperature-readings topic to display current conditions.
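
As a concrete example, here is a hedged sketch of a producer that writes the weather reading shown above to a hypothetical temperature-readings topic using the standard Java client:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WeatherProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key decides the partition; the value is the payload (JSON kept as a string)
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "temperature-readings", "NYC",
                    "{\"temperature\": 72, \"humidity\": 65, \"wind\": 5}");
            producer.send(record);
            producer.flush(); // make sure the record leaves the client before exiting
        }
    }
}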

Kafka Streams specific concepts

Here are some concepts you need to know that are specific to Kafka Streams. 

Streams and tables: Two views of your data

Kafka Streams provides two fundamental abstractions for working with data:

KStream represents an unbounded, continuously updating sequence of records. It’s like watching a flowing river of data where you see each event as it passes by.

Example use case: Processing each payment transaction as it occurs.

KTable represents the current state of the world, where each key maps to its latest value. It’s like looking at a database table that gets updated whenever new information arrives.

Example use case: Maintaining a current inventory count for each product.

This diagram illustrates the difference:

KStream (event stream):
 t1: ("product-1", "purchased")
 t2: ("product-2", "purchased")
 t3: ("product-1", "returned")
 t4: ("product-3", "purchased")

KTable (current state):
 After t4:
   "product-1": "returned"
   "product-2": "purchased"
   "product-3": "purchased"

Processing topology: The flow of data

A processing topology defines how data moves and transforms through your application. It’s like a flowchart for your data, consisting of:

  • Source processors: Read records from topics
  • Stream processors: Perform operations on the data
  • Sink processors: Write results to topics

Here’s a simple visual representation:

Source → Filter → Transform → Aggregate → Sink
(read)   (clean)  (modify)    (combine)   (write)
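
In the Kafka Streams DSL, that flowchart maps almost one-to-one onto chained operations. A minimal sketch, assuming a StreamsBuilder named builder and the default String serdes from the word-count example; the orders and order-counts topic names are hypothetical:

builder.<String, String>stream("orders")                              // source (read)
        .filter((key, value) -> value != null && !value.isBlank())    // clean
        .mapValues(value -> value.toUpperCase())                      // modify
        .groupByKey()
        .count()                                                      // combine (stateful)
        .toStream()
        .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));  // sink (write)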

Stateless vs. stateful operations

Operations in Kafka Streams fall into two categories:

Stateless operations process each record independently, without remembering anything about previous records. Examples include:

  • filter(): Keep only records that match a condition
  • map(): Transform each record into a new one
  • flatMap(): Transform each record into zero or more new records

Stateful operations maintain information across multiple records. Examples include:

  • count(): Count occurrences of keys
  • aggregate(): Combine data for the same key
  • join(): Combine records from multiple streams

For example, calculating the average temperature by city requires remembering previous temperature readings for each city — that’s stateful processing.
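
Here is a hedged sketch of that per-city average, assuming String keys and values (the temperature arrives as a plain number in the message value, e.g. "72") and the same StreamsBuilder setup as the word-count example. The running "count,sum" pair is the state Kafka Streams keeps for you:

// Stateful: Kafka Streams remembers a "count,sum" pair per city in a state store.
KTable<String, String> averageTemperature = builder
        .<String, String>stream("temperature-readings")   // key = city, value = e.g. "72"
        .groupByKey()
        .aggregate(
                () -> "0,0",                               // initial state for a new city
                (city, temperature, aggregate) -> {
                    String[] parts = aggregate.split(",");
                    long count = Long.parseLong(parts[0]) + 1;
                    double sum = Double.parseDouble(parts[1]) + Double.parseDouble(temperature);
                    return count + "," + sum;
                })
        .mapValues(aggregate -> {
            String[] parts = aggregate.split(",");
            return String.valueOf(Double.parseDouble(parts[1]) / Long.parseLong(parts[0]));
        });

averageTemperature.toStream().to("average-temperature-by-city");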

State stores: Where data lives

State stores are the managed databases that Kafka Streams uses to store state for stateful operations. They can be:

  • In-memory for faster access
  • Disk-based for larger datasets
  • Backed by Kafka topics for fault tolerance

If your application crashes, Kafka Streams can restore state stores from their backing topics, ensuring durability.
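
When you need control over a store, you can name it and make it explicitly disk-based. A minimal sketch (the page-views topic and store name are assumptions):

// An explicitly named, persistent (RocksDB) store, backed by a changelog topic for recovery.
// Requires org.apache.kafka.streams.kstream.Materialized and org.apache.kafka.streams.state.Stores.
KTable<String, Long> pageViewCounts = builder
        .<String, String>stream("page-views")
        .groupByKey()
        .count(Materialized.<String, Long>as(
                        Stores.persistentKeyValueStore("page-view-counts-store"))
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.Long()));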

Consumer groups and application IDs

Every Kafka Streams application needs an application ID, which:

  • Serves as a namespace for the application’s state stores
  • Acts as the consumer group ID for internal topic consumption
  • Allows multiple instances of your application to work together

When you scale your application by running multiple instances, they’ll automatically coordinate work by:

  • Splitting the partitions among the available instances
  • Handling failover if an instance goes down

Practical example: Word count conceptual workflow

To tie these concepts together, let’s examine a word-counting application conceptually before looking at any code:

1. Input: Records flow into a topic called sentences where each value contains a sentence like "Hello Kafka Streams"

2. Stream Creation: A source processor reads from the sentences topic, creating a stream of sentence records

3. Transformation (stateless):

  • Convert each sentence to lowercase
  • Split each sentence into individual words
  • Filter out empty strings
  • Change the record key to be the word itself

4. Aggregation (stateful):

  • Group records by word (key)
  • Count occurrences of each word
  • Store current counts in a state store

5. Output: Write the current count for each word to an output topic called word-counts

The resulting topic would contain records like:

"hello": 2
"kafka": 3
"streams": 3

This doesn’t require understanding any code yet — it’s simply about the flow of data and the transformations applied at each step.

Key takeaways before coding

Before we start writing code, remember these important points about Kafka Streams:

  1. Data flow: Data always flows from Kafka topics, through your processing logic, and back to Kafka topics
  2. Declarative API: You describe what transformations to apply, not how to implement them
  3. Scalability: Applications can scale by adding more instances that work together
  4. Fault tolerance: If your application crashes, it can resume processing where it left off
  5. Exactly-once processing: Results are accurate even if failures occur during processing

In the next section, we’ll translate these concepts into code by building a simple Kafka Streams application.

Building Your First Kafka Streams Application

Now that you understand the core concepts of Kafka Streams, let’s put that knowledge into practice by building a simple stream processing application. 

We’ll create the classic “word count” example — a streaming version of the “Hello World” program that demonstrates fundamental Kafka Streams capabilities.

I’m assuming you’re continuing with a blank IntelliJ IDEA project with Maven and have already configured the Kafka Streams dependencies in your pom.xml file. 

If you're new to Java's object-oriented programming approach, which is central to Kafka Streams development, consider reviewing DataCamp's tutorial on OOP in Java: Classes, Objects, Encapsulation, Inheritance, and Abstraction for a solid foundation.

Creating the Kafka topics

First, let’s create the Kafka topics our application will use. Open a terminal window (you can use IntelliJ’s built-in terminal by clicking “Terminal” at the bottom of the IDE) and run these commands:

# Create the input topic with 1 partition
kafka-topics --create --bootstrap-server localhost:9092 \
            --replication-factor 1 \
            --partitions 1 \
            --topic text-input

# Create the output topic with 1 partition
kafka-topics --create --bootstrap-server localhost:9092 \
            --replication-factor 1 \
            --partitions 1 \
            --topic word-count-output

Creating the WordCountApp class

Now, let’s create the Java class that will contain our Kafka Streams application. Understanding Java classes and objects is essential for working with the Kafka Streams API effectively. 

For a quick reference on Java’s class-based structure, check out DataCamp’s documentation on Java Classes and Objects.

  1. In the Project panel, navigate to your project’s source directory (typically src/main/java)
  2. Right-click on your package (e.g., com.example) and select New > Java Class
  3. Name the class WordCountApp and click OK
  4. Add the following code to the file:
package com.example; // Change this to match your package name

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Arrays;
import java.util.Properties;

public class WordCountApp {

   public static void main(String[] args) {
       // 1. Create Kafka Streams configuration
       Properties config = new Properties();
       config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
       config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
       config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
       config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

       // 2. Create a StreamsBuilder - the factory for all stream operations
       StreamsBuilder builder = new StreamsBuilder();

       // 3. Read from the source topic: "text-input"
       KStream<String, String> textLines = builder.stream("text-input");

       // 4. Implement word count logic
       KTable<String, Long> wordCounts = textLines
           // Split each text line into words
           .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
           // Ensure the words are not empty
           .filter((key, value) -> !value.isEmpty())
           // We want to count words, so we use the word as the key
           .groupBy((key, value) -> value)
           // Count occurrences of each word
           .count();

       // 5. Write the results to the output topic
       wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

        // 6. Build the topology, then create and start the Kafka Streams application
        Topology topology = builder.build();
        KafkaStreams streams = new KafkaStreams(topology, config);
      
       // Add shutdown hook to gracefully close the streams application
       Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
      
       // Start the application
       streams.start();
      
        // Print the topology description
        System.out.println(topology.describe());
   }
}

Make sure to adjust the package name at the top of the file to match your project’s package structure.

Running the application

To run the application in IntelliJ:

  1. Right-click on the WordCountApp class in the Project panel or in the editor
  2. Select Run ‘WordCountApp.main()’
  3. You’ll see the application start in the Run tool window at the bottom of the IDE
  4. The application will print the topology description and then continue running

Note: The Run tool window might print many red log messages as the application starts up, but you can safely ignore them as long as a message containing ...State transition from REBALANCING to RUNNING appears in the last few lines.

The application is now running and ready to process data. It will keep running until you stop it, waiting for messages to arrive in the text-input topic.

Testing your application

To test the application, you’ll need to produce messages to the input topic and consume messages from the output topic. Open two separate terminal windows:

1. In the first terminal, start a Kafka console producer to send messages to the input topic:

kafka-console-producer --bootstrap-server localhost:9092 --topic text-input

Then type some messages, pressing Enter after each one:

Hello Kafka Streams
All streams lead to Kafka
Join Kafka Summit

2. In the second terminal, start a Kafka console consumer to see the results from the output topic:

kafka-console-consumer --bootstrap-server localhost:9092 \
                   --topic word-count-output \
                   --from-beginning \
                   --formatter kafka.tools.DefaultMessageFormatter \
                   --property print.key=true \
                   --property print.value=true \
                   --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
                   --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

You should see output similar to:

hello       1
kafka       1
streams     1
all         1
streams     2
lead        1
to          1
kafka       2
join        1
kafka       3
summit      1

Notice how the count for words like “kafka” and “streams” increases as you type more messages containing those words. This demonstrates the stateful nature of the application — it remembers previous occurrences of each word.

Understanding the code

Let’s examine the key parts of our application:

1. Configuration (step 1 in the code): We set basic properties for our Kafka Streams application:

  • The application.id identifies this application instance
  • bootstrap.servers points to our Kafka cluster
  • The default Serdes specify how keys and values are serialized/deserialized

2. StreamsBuilder (step 2): This is the entry point for building a Kafka Streams topology.

3. Source processor (step 3): builder.stream("text-input") creates a KStream that reads from the input topic.

4. Stream processing (step 4): A series of operations that:

  • Split each line into words (flatMapValues)
  • Remove empty words (filter)
  • Group by the word itself (groupBy)
  • Count occurrences of each word (count)

5. Sink processor (step 5): to("word-count-output") writes the results to the output topic.

6. Application lifecycle (step 6): Builds the topology, creates the KafkaStreams instance, sets up a shutdown hook, and starts the application.

Stopping the application

When you’re done testing:

  1. Click the red square “Stop” button in the Run tool window in IntelliJ
  2. Or press Ctrl+C in the terminal where you started the application (if you ran it from the command line)

The shutdown hook we added ensures the application closes gracefully.

Next steps

Congratulations! You’ve built your first Kafka Streams application. This initial example demonstrates several key concepts:

  • Reading from Kafka topics with source processors
  • Transforming data with operations like flatMapValues, filter, and groupBy
  • Maintaining state with the count operation
  • Writing results back to Kafka topics with sink processors

Don’t worry if this feels complex at first — everyone starts somewhere! The beauty of Kafka Streams is that once you understand these fundamental building blocks, you can create increasingly sophisticated data processing pipelines. 

Take your time to experiment with this example, maybe try changing some of the operations or adding new ones.
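
For example, one possible modification is to count words per one-minute window instead of over all time. A hedged sketch of how the middle of WordCountApp could change (the window size and output topic name are illustrative):

// Replace the groupBy(...).count() section of WordCountApp with a windowed count.
// Requires: import java.time.Duration; import org.apache.kafka.streams.KeyValue;
// import org.apache.kafka.streams.kstream.TimeWindows; import org.apache.kafka.streams.kstream.Windowed;
KTable<Windowed<String>, Long> windowedCounts = textLines
        .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
        .filter((key, value) -> !value.isEmpty())
        .groupBy((key, value) -> value)
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
        .count();

// Flatten the windowed key into "word@windowStartMillis" before writing to the output topic
windowedCounts.toStream()
        .map((windowedKey, count) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().start(), count))
        .to("word-count-windowed-output", Produced.with(Serdes.String(), Serdes.Long()));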

Kafka Streams with Python

After building our first Kafka Streams application in Java, you might be wondering if similar functionality is available for Python developers. 

While Apache Kafka Streams is specifically designed as a Java library, Python developers can still implement stream processing applications with Kafka using alternative frameworks. 

In this section, we’ll explore how to implement stream processing in Python and build a word count application similar to the one we just created in Java.

Understanding Python options for Kafka Stream processing

Kafka Streams is a JVM-based library that doesn’t have an official Python implementation. However, Python developers have several options to implement stream processing applications with Kafka:

  1. Faust: A Python library inspired by Kafka Streams that provides similar stream processing capabilities
  2. Structured Streaming with PySpark: Apache Spark’s Python API for stream processing that can read from and write to Kafka
  3. Kafka Python Client + Custom Logic: Combining the standard Kafka Python client with your custom processing code
  4. Streamz: A library for building streaming data pipelines in Python

For our example, we’ll use Faust, which provides an elegant API similar to Kafka Streams while maintaining Python’s simplicity and readability. Faust was developed at Robinhood and is designed to bring Kafka Streams-like functionality to Python developers.

Setting up a Python Kafka Streams project with Faust

Let’s set up a new Python project for our word count application using Faust:

  1. First, make sure you have Python 3.6+ installed on your system
  2. Create a new project directory and set up a virtual environment:
# Create a project directory
mkdir kafka-streams-python
cd kafka-streams-python

# Create and activate a virtual environment
python -m venv venv

# On macOS/Linux
source venv/bin/activate

# On Windows
venv\Scripts\activate

3. Install Faust using pip:

pip install faust-streaming

Building a word count application with Faust

Now, let’s create a Python version of our word count application. Create a file named word_count.py with the following code:

import faust
import re

# Define the Faust app
app = faust.App(
   'wordcount-python-application',
   broker='kafka://localhost:9092',
   value_serializer='raw',
)

# Define the input topic
text_topic = app.topic('text-input', value_type=bytes)

# Define the output topic
# Note: For demonstration purposes, we're using a different output topic than the Java example
counts_topic = app.topic('word-count-output-python', value_type=bytes)

# Define a table to store word counts
word_counts = app.Table('word_counts', default=int, partitions=1)

# Define a processor that counts words
@app.agent(text_topic)
async def process(stream):
   async for message in stream:
       # Decode the message to text
       text = message.decode('utf-8')
      
       # Split the text into words and convert to lowercase
       words = re.split(r'\W+', text.lower())
      
       # Count each word
       for word in words:
           if word:  # Skip empty strings
               word_counts[word] += 1
              
               # Send the current count to the output topic
               await counts_topic.send(
                   key=word.encode(),
                   value=str(word_counts[word]).encode(),
               )
              
               # Print for debugging
               print(f"{word}: {word_counts[word]}")

if __name__ == '__main__':
   app.main()

Detailed code explanation

Let’s break down the Python code in more detail:

1. Imports

import faust
import re
  • faust: The main Faust library for stream processing
  • re: Python's regular expression module, used for splitting text into words

2. Application definition

app = faust.App(
   'wordcount-python-application',
   broker='kafka://localhost:9092',
   value_serializer='raw',
)
  • faust.App: Creates a new Faust application instance
  • First parameter ('wordcount-python-application'): The application ID, similar to the application.id property in Kafka Streams
  • broker: Specifies the Kafka broker address
  • value_serializer='raw': Tells Faust to use raw bytes serialization, similar to the Serdes in Kafka Streams

3. Topic definition

text_topic = app.topic('text-input', value_type=bytes)
counts_topic = app.topic('word-count-output-python', value_type=bytes)
  • app.topic(): Defines a Kafka topic to consume from or produce to
  • First parameter: The name of the Kafka topic
  • value_type=bytes: Specifies that values in this topic are byte arrays
  • This is analogous to creating input and output topics in Kafka Streams

4. State management with tables

word_counts = app.Table('word_counts', default=int, partitions=1)
  • app.Table(): Creates a state table to store word counts
  • First parameter ('word_counts'): The name of the table
  • default=int: Sets the default value for a new key to 0 (int() returns 0)
  • This is similar to KTable in Kafka Streams, providing persistent state

5. Stream processing with agents

@app.agent(text_topic)
async def process(stream):
  • @app.agent(): A decorator that defines a stream processor (agent)
  • text_topic: The input topic to consume from
  • async def process(stream): An asynchronous function that processes the stream
  • This is similar to the StreamsBuilder.stream() call in Kafka Streams

6. Message processing loop

async for message in stream:
   # Processing logic here
  • async for: Asynchronously iterates through messages in the stream
  • This is equivalent to the streaming operations in Kafka Streams, but uses Python’s async/await syntax

7. Text processing and word splitting

text = message.decode('utf-8')
words = re.split(r'\W+', text.lower())
  • message.decode('utf-8'): Converts bytes to a UTF-8 string
  • re.split(r'\W+', text.lower()): Splits text into words using a regular expression
  • r'\W+': Matches one or more non-word characters (spaces, punctuation, etc.)
  • text.lower(): Converts text to lowercase for case-insensitive matching
  • This is similar to the flatMapValues() operation in our Java example

8. Word counting and state updates

for word in words:
   if word:  # Skip empty strings
       word_counts[word] += 1
  • word_counts[word] += 1: Increments the count for each word in the table
  • if word:: Skips empty strings (as the regex might create some)
  • This is equivalent to the groupByKey() and count() operations in Kafka Streams

9. Output message production

await counts_topic.send(
   key=word.encode(),
   value=str(word_counts[word]).encode(),
)
  • await counts_topic.send(): Asynchronously sends a message to the output topic
  • key=word.encode(): Encodes the word as bytes for the message key
  • value=str(word_counts[word]).encode(): Converts the count to a string and then to bytes
  • This is similar to the toStream() and to() operations in Kafka Streams

10. Main application entry point

if __name__ == '__main__':
   app.main()
  • Standard Python pattern to ensure the code runs only when executed directly
  • app.main(): Starts the Faust worker, similar to KafkaStreams.start() in Java

The Faust API uses Python’s asynchronous features (async/await) to handle concurrency, which is different from Java's approach but achieves similar goals. The key difference is that Faust uses Python's event loop for asynchronous processing, while Kafka Streams uses Java's threading model.

Running the Python Word Count Application

Before running our Python application, make sure Kafka is still running from the previous section. If not, start it using the instructions from the setup section.

1. Create the output topic (only needed if it doesn’t exist yet):

kafka-topics --create --bootstrap-server localhost:9092 \
           --replication-factor 1 \
           --partitions 1 \
           --topic word-count-output-python

2. Run the Faust application:

python word_count.py worker

The worker argument tells Faust to start a worker process that will consume from the input topic and process messages.

3. Test the application by producing messages to the input topic:

kafka-console-producer --bootstrap-server localhost:9092 --topic text-input

Then type some messages:

Hello Python Kafka
Stream processing with Python

4. Consume the output from the Python output topic:

kafka-console-consumer --bootstrap-server localhost:9092 \
                    --topic word-count-output-python \
                    --from-beginning \
                    --property print.key=true \
                    --property print.value=true

You should see the word counts appearing in the consumer window as you type messages in the producer.

Conclusion And Next Steps

This article covered Apache Kafka Streams from its core concepts to Java implementation and Python alternatives with Faust. We explored how stream processing facilitates real-time data transformation and analysis, enabling immediate insights from data. 

Whether using Java for performance guarantees or Python for a simpler development experience, Kafka stream processing offers flexible solutions for building event-driven systems.

To deepen your Kafka knowledge, check out DataCamp's resources: start with How to Learn Apache Kafka in 2025 for a structured learning path, or explore Getting Started with Java for Data for foundational skills. For cloud-based deployment, refer to AWS MSK for Beginners Guide, and compare Kafka with other systems in Kafka vs RabbitMQ. Finally, apply your skills with Java Projects for All Levels to gain hands-on experience.

Kafka Streams FAQs

What is the difference between Apache Kafka and Kafka Streams?

Apache Kafka is a distributed messaging system and event store that functions as the data backbone for organizations, while Kafka Streams is a client library built on top of Kafka that enables stream processing. While Kafka requires a dedicated cluster of broker servers, Kafka Streams runs as part of your application with no additional infrastructure. Kafka primarily handles message routing, while Kafka Streams provides complex processing capabilities like filtering, aggregation, joins, and windowing operations with built-in state management.

Do I need to know Java to use Kafka Streams?

While the official Kafka Streams library is only available for Java and Scala, you can implement similar functionality in Python using alternatives like Faust. The Java implementation offers better performance, deeper integration with the Kafka ecosystem, and full exactly-once processing semantics. However, as shown in our article, Python developers can build effective stream processing applications using libraries inspired by Kafka Streams, which might be preferable if your team has stronger Python expertise or needs to integrate with Python-based data science tools.

How does Kafka Streams handle application scaling and fault tolerance?

Kafka Streams inherits its scaling and fault tolerance capabilities from Kafka's architecture. Applications can scale horizontally by running multiple instances that automatically partition the work. Each instance processes specific partitions of the input topics, and Kafka Streams handles coordination between instances. For fault tolerance, Kafka Streams maintains state in local stores backed by Kafka topics, allowing it to recover state after failures and provide exactly-once processing guarantees. If an instance fails, its partitions are automatically reassigned to other running instances.

What types of real-time processing can I implement with Kafka Streams?

Kafka Streams supports a wide range of processing operations including filtering messages based on conditions, transforming data formats, enriching streams with additional information, aggregating data over time (like counts, sums, averages), joining multiple streams together, windowing operations (processing data within time boundaries), and stateful processing that remembers previous events. Common use cases include real-time analytics dashboards, fraud detection systems, event-driven microservices, recommendation engines, and IoT data processing pipelines.

How does Kafka Streams compare to other stream processing frameworks like Apache Spark or Flink?

Compared to frameworks like Apache Spark Streaming or Apache Flink, Kafka Streams offers simplicity and lightweight deployment as its main advantages. While Spark and Flink require separate clusters and resource managers, Kafka Streams applications run as standard applications alongside your existing services. Kafka Streams also has tighter integration with Kafka but less built-in functionality for machine learning or complex analytics. Spark and Flink generally offer more advanced features and may handle higher-throughput workloads better, but with increased operational complexity. Choose Kafka Streams when you want a simpler solution that's directly integrated with Kafka.
