Kafka Streams Tutorial: Introduction to Real-Time Data Processing
Today, organizations need to process data as it comes in rather than waiting for scheduled batches. Stream processing delivers real-time insight by analyzing data in motion, leading to better decision-making and enhanced user experiences. This shift has increased demand for tools that can process large-scale streaming data efficiently.
One of the most popular stream processing platforms is Apache Kafka, a distributed, fault-tolerant, and high-throughput technology. Initially developed at LinkedIn as part of its data pipelines, it has become the heart of a broader ecosystem that includes key players like Kafka Connect and Schema Registry. Kafka Streams, a lightweight library within this ecosystem, helps develop real-time applications directly on Kafka, and this article will show you how to use it effectively.
Note: This article assumes familiarity with fundamental concepts in stream processing and Apache Kafka. You can check out DataCamp’s Streaming Concepts Course and Intro to Apache Kafka tutorial if you need a refresher.
What Is Kafka Streams?
Kafka Streams is a client library that allows developers to build real-time stream processing applications directly on top of Apache Kafka.
Unlike traditional systems that rely on separate clusters, Kafka Streams runs as a standard Java library within your application, making deployment and scaling simpler. It provides a high-level API for defining operations like filtering, transformations, joins, aggregations, and windowing.
Built to handle continuous data streams with minimal operational overhead, Kafka Streams leverages Kafka’s core strengths, such as partitioning, fault tolerance, and scalability. It adds stream processing semantics without the need to manage additional infrastructure. Since its release in Kafka 0.10 (2016), it has become a reliable, production-grade solution used across industries for critical streaming applications.
Kafka Streams Core Concepts and Architecture
Kafka Streams is built around two core abstractions: streams and tables.
A stream is an unbounded sequence of records, each with a key, value, and timestamp, while a table represents the current state of data, mapping each key to its latest value. Developers can convert between the two using operations like aggregations (stream-to-table) and change capture (table-to-stream), enabling powerful patterns for real-time processing.
Kafka Streams applications are defined by a processing topology—a directed graph of processor nodes connected by streams. Each node performs specific tasks like filtering, mapping, or joining, and the entire topology is automatically partitioned based on Kafka’s topic partitions.
This design allows parallel processing across multiple application instances, offering horizontal scalability without the need to manage distributed infrastructure manually.
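To make stream-table duality concrete, here is a minimal sketch (not the tutorial's main example) that aggregates a stream into a table and converts the table back into a stream of change events; the topic names page-views and views-per-user are hypothetical, and configuration and startup are omitted for brevity:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class DualitySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Stream: every page-view event as it arrives (hypothetical topic)
        KStream<String, String> pageViews = builder.stream("page-views");

        // Stream -> table: aggregate the events into the latest count per user
        KTable<String, Long> viewsPerUser = pageViews
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count();

        // Table -> stream: every update to the table becomes a new change event
        viewsPerUser.toStream()
                .to("views-per-user", Produced.with(Serdes.String(), Serdes.Long()));
    }
}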
Key concepts in Kafka Streams:
- Stream: A continuous flow of records (key, value, timestamp).
- Table: A snapshot of the latest value for each key.
- Stream-table duality: Allows transformation between streams and tables.
- Processing topology: A graph of stream processors that defines the data flow.
- Parallelism: Achieved through Kafka topic partitioning for scalable processing.
For an in-depth exploration of how Kafka’s partitioning works and its impact on scalability, refer to DataCamp’s tutorial on Kafka Partitions: Essential Concepts for Scalability and Performance.
Key features and capabilities
Kafka Streams provides exactly-once processing guarantees, ensuring each record is processed once, even during failures or restarts. It supports stateful operations like joins and aggregations using local state stores, often backed by RocksDB, removing the need for external databases.
These features make it suitable for tasks that require accuracy, such as financial processing and analytics.
Key features include:
- Exactly-once processing: Maintains consistency across failures.
- Local state stores: Support in-memory and disk-based state.
- Flexible deployment: Runs on servers, containers, or Kubernetes.
- Integration: Works with Schema Registry and Kafka Connect.
- Interactive queries: Provide access to application state in real time.
These capabilities support a range of applications, from basic data processing to more complex event-driven systems.
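For example, exactly-once processing is enabled purely through configuration. The snippet below is a minimal sketch; the application ID, broker address, and state directory are placeholder values:
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfigSketch {
    public static Properties build() {
        Properties config = new Properties();
        // Placeholder application ID and broker address
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-processor");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Enable exactly-once semantics (EXACTLY_ONCE_V2 requires brokers 2.5 or newer)
        config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        // Directory where local state stores (RocksDB) keep their data
        config.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams-state");
        return config;
    }
}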
Kafka Streams vs Apache Kafka
Now that we’ve explored what Kafka Streams is and its core architecture, it’s important to clarify its relationship with Apache Kafka itself.
Newcomers to the Kafka ecosystem often confuse Kafka Streams with Apache Kafka or assume they are interchangeable technologies.
Understanding the distinction between these two components is crucial for architects and developers, as it affects how you design your data pipeline architecture and which tools you select for specific data processing requirements.
Apache Kafka characteristics
- Functions as a distributed event streaming platform and messaging system
- Consists of servers (brokers) that store streams of records in topics
- Uses a publish-subscribe model with producers and consumers
- Excels at reliably storing and transmitting data
- Provides no built-in data transformation capabilities
- Supports client libraries in multiple programming languages
Kafka Streams, by contrast, is a client library that builds upon Apache Kafka’s foundation. While Kafka itself provides the infrastructure for storing and moving data, Kafka Streams adds computational capabilities for transforming that data.
Think of Apache Kafka as the highway system that moves data around, while Kafka Streams provides vehicles with specific functions to process the data as it travels.
The traditional approach to data processing with Apache Kafka involves writing custom consumer applications, which becomes complex when implementing stateful operations like aggregations or joins.
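To make the contrast concrete, the sketch below shows roughly what that plain-consumer approach looks like; the topic name orders and the group ID are illustrative, and any stateful logic (counts, joins, windows) would still have to be built and managed by hand:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "plain-consumer-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Stateless work is easy here; anything stateful means hand-rolling
                    // storage, fault tolerance, and rebalancing logic yourself.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}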
Kafka Streams characteristics
- Client library specifically for Java/Scala applications
- Provides high-level APIs for stream processing operations
- Manages state automatically for aggregations and joins
- Offers exactly-once processing guarantees
- Operates as standard applications without additional infrastructure
- Excels at continuous, real-time data transformation
- Simplifies operations in microservice architectures
The choice between using Apache Kafka alone or incorporating Kafka Streams depends on your specific use case.
For simple data movement between systems with minimal transformation, standard Kafka producers and consumers provide sufficient functionality. These scenarios include collecting logs, forwarding events to analytics platforms, or implementing basic publish-subscribe patterns.
However, when applications require continuous processing involving transformations, aggregations, joins, or windowed operations, Kafka Streams offers significant advantages.
Its built-in state management, exactly-once processing guarantees, and operational simplicity make it particularly valuable for real-time data processing within existing application infrastructure.
| Feature | Apache Kafka | Kafka Streams |
| --- | --- | --- |
| Primary Purpose | Distributed messaging system and event store | Stream processing library |
| Architecture | Cluster of broker servers | Client-side library |
| Infrastructure Requirements | Dedicated server cluster | None (runs within application) |
| Programming Languages | Client libraries in multiple languages | Java and Scala only |
| Processing Capability | Basic message routing | Complex stream processing (filtering, aggregation, joins, windowing) |
| State Management | None (stateless) | Built-in state stores for stateful operations |
| Processing Guarantees | At-least-once delivery | Exactly-once processing |
| Deployment Model | Cluster deployment | Standard application deployment |
| Use Cases | Event storage, message queuing, log aggregation | Real-time analytics, data transformation, event-driven applications |
| Horizontal Scaling | Adding brokers to cluster | Running multiple application instances |
| Fault Tolerance | Replication across brokers | Automatic state recovery |
| Role in System | Central data backbone | Processing layer on top of Kafka |
Comprehensive Instructions on Installing and Setting Up Kafka Streams
Having understood what Kafka Streams is and how it differs from Apache Kafka, we can now proceed with setting up a development environment for building Kafka Streams applications.
This section provides detailed instructions for installing and configuring all necessary components on both macOS and Windows systems.
Following these steps will prepare your environment for developing, testing, and running Kafka Streams applications.
Prerequisites
Before diving into Kafka Streams development, you need to set up several foundational components: Java Development Kit (JDK), Apache Kafka, and build tools. These components form the infrastructure upon which your Kafka Streams applications will run.
Java Development Kit installation
For macOS:
1. Install Homebrew if you haven’t already by running the following in Terminal:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
2. Install the JDK using Homebrew:
brew install openjdk@17
3. Set up the JAVA_HOME environment variable by adding these lines to your ~/.zshrc or ~/.bash_profile:
export JAVA_HOME=$(/usr/libexec/java_home -v 17)
export PATH=$JAVA_HOME/bin:$PATH
4. Reload your profile:
source ~/.zshrc # or source ~/.bash_profile
5. Verify the installation:
java -version
For Windows:
1. Download the Java SE Development Kit 17 installer from the Oracle website, or use an OpenJDK build from Adoptium.
2. Run the installer and follow the installation wizard instructions.
3. Set the JAVA_HOME environment variable:
- Right-click on “This PC” or “My Computer” and select “Properties”
- Click on “Advanced system settings”
- Click the “Environment Variables” button
- Under “System variables”, click “New”
- Enter “JAVA_HOME” as the variable name
- Enter the JDK installation path (e.g., C:\Program Files\Java\jdk-17) as the variable value
4. Add Java to the PATH variable:
- Under “System variables”, find the “Path” variable and click “Edit”
- Click “New” and add “%JAVA_HOME%\bin”
- Click “OK” to close all dialogs
5. Verify the installation by opening Command Prompt and typing:
java -version
Apache Kafka Setup
For macOS:
1. Download and install Kafka using Homebrew:
brew install kafka
2. Start Zookeeper (Kafka’s coordination service):
brew services start zookeeper
3. Start Kafka:
brew services start kafka
4. Verify the installation by creating a test topic:
kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
If you run into any issues when starting the services or creating a test topic, try running the commands with sudo.
For Windows:
1. Download the latest Kafka release from the Apache Kafka downloads page.
2. Extract the downloaded archive to a directory of your choice (e.g., C:\kafka).
3. Open PowerShell or Command Prompt as administrator and navigate to the Kafka directory:
cd C:\kafka
4. Start Zookeeper:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
5. Open another PowerShell or Command Prompt window and start Kafka:
.\bin\windows\kafka-server-start.bat .\config\server.properties
6. Create a test topic (in a third PowerShell or Command Prompt window):
.\bin\windows\kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
IDE setup and configuration
Modern IDEs significantly enhance the development experience for Kafka Streams applications by providing code completion, debugging capabilities, and integration with build tools. Below, we cover setup instructions for the popular IntelliJ IDEA.
IntelliJ IDEA setup
For macOS:
1. Download IntelliJ IDEA from the JetBrains website.
- The Community Edition is free and sufficient for Kafka Streams development.
- The Ultimate Edition offers additional features like Spring support and database tools.
2. Open the downloaded DMG file and drag the IntelliJ IDEA icon to the Applications folder.
3. Launch IntelliJ IDEA from the Applications folder.
4. During first-time setup, configure your preferences and install any recommended plugins.
For Windows:
1. Download IntelliJ IDEA from the JetBrains website.
2. Run the installer and follow the installation wizard.
3. Launch IntelliJ IDEA after installation.
4. During first-time setup, configure your preferences and install any recommended plugins.
Project configuration
With your development environment set up, you can now create and configure a new Kafka Streams project using either Maven or Gradle, two popular build automation tools for Java applications.
Maven and Gradle are essential tools that handle dependency management, build processes, and project organization for Java applications:
Maven is an older and widely used build tool that uses XML configuration files (pom.xml) to define project settings and dependencies. It follows a convention-over-configuration approach with a predefined project structure, making it straightforward for beginners.
Maven automatically downloads required libraries from central repositories and manages the build lifecycle.
Gradle is a more modern build tool that uses Groovy or Kotlin for its build scripts instead of XML. It offers greater flexibility and faster build times compared to Maven.
While it has a steeper learning curve, Gradle provides more powerful customization options and is increasingly becoming the standard for new Java projects.
Both tools will help you manage the Kafka Streams libraries and other dependencies needed for your streaming applications, allowing you to focus on writing code rather than manually handling JAR files.
Creating a Maven project
In IntelliJ IDEA:
1. Click “File” > “New” > “Project…”
2. Select “Maven” from the left panel
3. Check “Create from archetype” and select “maven-archetype-quickstart”
4. Click “Next” and provide:
- GroupId: e.g., “com.example”
- ArtifactId: e.g., “kafka-streams-demo”
- Version: e.g., “1.0-SNAPSHOT”
5. Click “Finish”
Adding Kafka Streams dependencies
For Maven projects, add the following to your pom.xml file:
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>3.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>3.4.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>2.0.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>2.0.5</version>
</dependency>
</dependencies>
After adding the dependencies in Maven, reload the project by right-clicking in the file editor and selecting "Maven > Reload project".
For Gradle projects, add the following to your build.gradle file:
dependencies {
implementation 'org.apache.kafka:kafka-streams:3.4.0'
implementation 'org.apache.kafka:kafka-clients:3.4.0'
implementation 'org.slf4j:slf4j-api:2.0.5'
implementation 'org.slf4j:slf4j-simple:2.0.5'
}
After adding these dependencies, refresh your project to ensure they’re properly downloaded and configured.
With these steps completed, your development environment is now ready for building Kafka Streams applications. Next, we’ll explore the core concepts of Kafka Streams to build a solid foundation before implementing our first application.
Core Kafka Streams Concepts
Before diving into code examples, it’s important to understand the fundamental concepts and terminology used in Kafka Streams. This section provides a beginner-friendly introduction to the key ideas you’ll encounter when building stream processing applications with Kafka Streams.
Basic Kafka terminology
Let’s explore some of the key terms that you will encounter when using Kafka.
Topics: The foundation of data organization
Think of a topic as a category or feed name to which records are published. In the physical world, this might be similar to a file folder where related documents are stored together. For example, a retail application might have separate topics for sales, inventory-updates, and customer-activity.
Topics in Kafka have these important characteristics:
- They can be partitioned for parallel processing
- They store data in an append-only log (new data is always added to the end)
- They retain data according to configurable retention policies
Understanding how topics are partitioned is crucial for building scalable stream processing applications. For a complete guide to how partitioning impacts your Kafka applications, see DataCamp’s tutorial on Kafka Partitions: Essential Concepts for Scalability and Performance.
Records: The data units
A record (sometimes called a message) is the basic unit of data in Kafka. Each record consists of:
- A key (optional): Determines which partition the record goes to
- A value: The actual data payload
- A timestamp: When the record was created or processed
For example, in a weather monitoring system, a record might look like:
Key: "NYC"
Value: {"temperature": 72, "humidity": 65, "wind": 5}
Timestamp: 2025-03-13T14:25:30Z
Producers and consumers: The data movers
Producers are applications that publish (write) records to Kafka topics. For example, a temperature sensor might produce readings to a ‘temperature-readings’ topic.
Consumers are applications that subscribe to (read) topics and process the records. For instance, a dashboard application might consume from the temperature-readings topic to display current conditions.
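As a side note, keyed records like the weather example above can also be produced from the command line. The following is one way to do it with the standard console producer, assuming the temperature-readings topic already exists (or topic auto-creation is enabled):
kafka-console-producer --bootstrap-server localhost:9092 \
  --topic temperature-readings \
  --property parse.key=true \
  --property key.separator=:
Each line you type is then treated as key:value, for example:
NYC:{"temperature": 72, "humidity": 65, "wind": 5}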
Kafka Streams specific concepts
Here are some concepts you need to know that are specific to Kafka Streams.
Streams and tables: Two views of your data
Kafka Streams provides two fundamental abstractions for working with data:
KStream represents an unbounded, continuously updating sequence of records. It’s like watching a flowing river of data where you see each event as it passes by.
Example use case: Processing each payment transaction as it occurs.
KTable represents the current state of the world, where each key maps to its latest value. It’s like looking at a database table that gets updated whenever new information arrives.
Example use case: Maintaining a current inventory count for each product.
This diagram illustrates the difference:
KStream (event stream):
t1: ("product-1", "purchased")
t2: ("product-2", "purchased")
t3: ("product-1", "returned")
t4: ("product-3", "purchased")
KTable (current state):
After t4:
"product-1": "returned"
"product-2": "purchased"
"product-3": "purchased"
Processing topology: The flow of data
A processing topology defines how data moves and transforms through your application. It’s like a flowchart for your data, consisting of:
- Source processors: Read records from topics
- Stream processors: Perform operations on the data
- Sink processors: Write results to topics
Here’s a simple visual representation:
Source → Filter → Transform → Aggregate → Sink
(read) (clean) (modify) (combine) (write)
Stateless vs. stateful operations
Operations in Kafka Streams fall into two categories:
Stateless operations process each record independently, without remembering anything about previous records. Examples include:
- filter(): Keep only records that match a condition
- map(): Transform each record into a new one
- flatMap(): Transform each record into zero or more new records
Stateful operations maintain information across multiple records. Examples include:
- count(): Count occurrences of keys
- aggregate(): Combine data for the same key
- join(): Combine records from multiple streams
For example, calculating the average temperature by city requires remembering previous temperature readings for each city — that’s stateful processing.
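The sketch below illustrates the difference on a hypothetical temperature-readings stream (city as key, temperature as a plain string value); the stateless steps handle one record at a time, while the count needs a state store behind the scenes. Configuration and startup are omitted:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StatelessVsStatefulSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Key = city, value = temperature as a string (hypothetical topic and format)
        KStream<String, String> readings = builder.stream("temperature-readings");

        // Stateless: each record is processed on its own, with no memory of the past
        KStream<String, Integer> hotReadings = readings
                .mapValues(Integer::parseInt)          // transform each value
                .filter((city, temp) -> temp > 90);    // keep only hot readings

        // Stateful: counting requires remembering what was seen before
        KTable<String, Long> readingsPerCity = readings
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count();
    }
}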
State stores: Where data lives
State stores are the managed databases that Kafka Streams uses to store state for stateful operations. They can be:
- In-memory for faster access
- Disk-based for larger datasets
- Backed by Kafka topics for fault tolerance
If your application crashes, Kafka Streams can restore state stores from their backing topics, ensuring durability.
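As a sketch of how this looks in code (the store and topic names are made up for illustration), a count can be materialized into a named store and later read back through an interactive query on the running KafkaStreams instance:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateStoreSketch {
    static KTable<String, Long> buildCounts(StreamsBuilder builder) {
        // Name the backing state store so it can be queried interactively later
        return builder.<String, String>stream("temperature-readings")
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("readings-count-store"));
    }

    static Long lookup(KafkaStreams streams, String city) {
        // Interactive query: read the current count directly from the running application
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("readings-count-store",
                        QueryableStoreTypes.keyValueStore()));
        return store.get(city);
    }
}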
Consumer groups and application IDs
Every Kafka Streams application needs an application ID, which:
- Serves as a namespace for the application’s state stores
- Acts as the consumer group ID for internal topic consumption
- Allows multiple instances of your application to work together
When you scale your application by running multiple instances, they’ll automatically coordinate work by:
- Splitting the partitions among the available instances
- Handling failover if an instance goes down
Practical example: Word count conceptual workflow
To tie these concepts together, let’s examine a word-counting application conceptually before looking at any code:
1. Input: Records flow into a topic called sentences, where each value contains a sentence like "Hello Kafka Streams"
2. Stream Creation: A source processor reads from the sentences topic, creating a stream of sentence records
3. Transformation (stateless):
- Convert each sentence to lowercase
- Split each sentence into individual words
- Filter out empty strings
- Change the record key to be the word itself
4. Aggregation (stateful):
- Group records by word (key)
- Count occurrences of each word
- Store current counts in a state store
5. Output: Write the current count for each word to an output topic called word-counts
The resulting topic would contain records like:
"hello": 2
"kafka": 3
"streams": 3
This doesn’t require understanding any code yet — it’s simply about the flow of data and the transformations applied at each step.
Key takeaways before coding
Before we start writing code, remember these important points about Kafka Streams:
- Data flow: Data always flows from Kafka topics, through your processing logic, and back to Kafka topics
- Declarative API: You describe what transformations to apply, not how to implement them
- Scalability: Applications can scale by adding more instances that work together
- Fault tolerance: If your application crashes, it can resume processing where it left off
- Exactly-once processing: Results are accurate even if failures occur during processing
In the next section, we’ll translate these concepts into code by building a simple Kafka Streams application.
Building Your First Kafka Streams Application
Now that you understand the core concepts of Kafka Streams, let’s put that knowledge into practice by building a simple stream processing application.
We’ll create the classic “word count” example — a streaming version of the “Hello World” program that demonstrates fundamental Kafka Streams capabilities.
I’m assuming you’re continuing with a blank IntelliJ IDEA project with Maven and have already configured the Kafka Streams dependencies in your pom.xml file.
If you're new to Java's object-oriented programming approach, which is central to Kafka Streams development, consider reviewing DataCamp's tutorial on OOP in Java: Classes, Objects, Encapsulation, Inheritance, and Abstraction for a solid foundation.
Creating the Kafka topics
First, let’s create the Kafka topics our application will use. Open a terminal window (you can use IntelliJ’s built-in terminal by clicking “Terminal” at the bottom of the IDE) and run these commands:
# Create the input topic with 1 partition
kafka-topics --create --bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--topic text-input
# Create the output topic with 1 partition
kafka-topics --create --bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--topic word-count-output
Creating the WordCountApp class
Now, let’s create the Java class that will contain our Kafka Streams application. Understanding Java classes and objects is essential for working with the Kafka Streams API effectively.
For a quick reference on Java’s class-based structure, check out DataCamp’s documentation on Java Classes and Objects.
1. In the Project panel, navigate to your project's source directory (typically src/main/java)
2. Right-click on your package (e.g., com.example) and select New > Java Class
3. Name the class WordCountApp and click OK
4. Add the following code to the file:
package com.example; // Change this to match your package name
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Arrays;
import java.util.Properties;
public class WordCountApp {
public static void main(String[] args) {
// 1. Create Kafka Streams configuration
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
// 2. Create a StreamsBuilder - the factory for all stream operations
StreamsBuilder builder = new StreamsBuilder();
// 3. Read from the source topic: "text-input"
KStream<String, String> textLines = builder.stream("text-input");
// 4. Implement word count logic
KTable<String, Long> wordCounts = textLines
// Split each text line into words
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
// Ensure the words are not empty
.filter((key, value) -> !value.isEmpty())
// We want to count words, so we use the word as the key
.groupBy((key, value) -> value)
// Count occurrences of each word
.count();
// 5. Write the results to the output topic
wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));
// 6. Create and start the Kafka Streams application
KafkaStreams streams = new KafkaStreams(builder.build(), config);
// Add shutdown hook to gracefully close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
// Start the application
streams.start();
// Print the topology description
System.out.println(streams.toString());
}
}
Make sure to adjust the package name at the top of the file to match your project’s package structure.
Running the application
To run the application in IntelliJ:
1. Right-click on the WordCountApp class in the Project panel or in the editor
2. Select Run 'WordCountApp.main()'
3. You'll see the application start in the Run tool window at the bottom of the IDE
4. The application will print the topology description and then continue running
Note: The Run tool window might print many red log messages as the application starts up, but you can safely ignore them as long as the last few lines include a message saying ...State transition from REBALANCING to RUNNING.
The application is now running and ready to process data. It will keep running until you stop it, waiting for messages to arrive in the text-input topic.
Testing your application
To test the application, you’ll need to produce messages to the input topic and consume messages from the output topic. Open two separate terminal windows:
1. In the first terminal, start a Kafka console producer to send messages to the input topic:
kafka-console-producer --bootstrap-server localhost:9092 --topic text-input
Then type some messages, pressing Enter after each one:
Hello Kafka Streams
All streams lead to Kafka
Join Kafka Summit
2. In the second terminal, start a Kafka console consumer to see the results from the output topic:
kafka-console-consumer --bootstrap-server localhost:9092 \
--topic word-count-output \
--from-beginning \
--formatter kafka.tools.DefaultMessageFormatter \
--property print.key=true \
--property print.value=true \
--property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
You should see output similar to:
hello 1
kafka 1
streams 1
all 1
streams 2
lead 1
to 1
kafka 2
join 1
kafka 3
summit 1
Notice how the count for words like “kafka” and “streams” increases as you type more messages containing those words. This demonstrates the stateful nature of the application — it remembers previous occurrences of each word.
Understanding the code
Let’s examine the key parts of our application:
1. Configuration (step 1 in the code): We set basic properties for our Kafka Streams application:
- The application.id identifies this application instance
- bootstrap.servers points to our Kafka cluster
- The default Serdes specify how keys and values are serialized/deserialized
2. StreamsBuilder (step 2): This is the entry point for building a Kafka Streams topology.
3. Source processor (step 3): builder.stream("text-input") creates a KStream that reads from the input topic.
4. Stream processing (step 4): A series of operations that:
- Split each line into words (flatMapValues)
- Remove empty words (filter)
- Group by the word itself (groupBy)
- Count occurrences of each word (count)
5. Sink processor (step 5): to("word-count-output") writes the results to the output topic.
6. Application lifecycle (step 6): Creates the KafkaStreams instance, sets up a shutdown hook, and starts the application.
Stopping the application
When you’re done testing:
- Click the red square “Stop” button in the Run tool window in IntelliJ
- Or press Ctrl+C in the terminal where you started the application (if you ran it from the command line)
The shutdown hook we added ensures the application closes gracefully.
Next steps
Congratulations! You’ve built your first Kafka Streams application. This initial example demonstrates several key concepts:
- Reading from Kafka topics with source processors
- Transforming data with operations like flatMapValues, filter, and groupBy
- Maintaining state with the count operation
- Writing results back to Kafka topics with sink processors
Don’t worry if this feels complex at first — everyone starts somewhere! The beauty of Kafka Streams is that once you understand these fundamental building blocks, you can create increasingly sophisticated data processing pipelines.
Take your time to experiment with this example, maybe try changing some of the operations or adding new ones.
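As one possible extension (not part of the tutorial's application), the same word splitting can be combined with a one-minute window so that counts reset over time. This sketch reuses the text-input topic and simply prints the windowed results; as in WordCountApp, configuration and startup are omitted here:
import java.time.Duration;
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class WindowedWordCountSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Same splitting logic as WordCountApp, but counts are kept per one-minute window
        KTable<Windowed<String>, Long> windowedCounts = builder.<String, String>stream("text-input")
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .filter((key, word) -> !word.isEmpty())
                .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .count();

        // Each result key now carries the window boundaries alongside the word
        windowedCounts.toStream().foreach((windowedWord, count) ->
                System.out.printf("%s @ %s-%s : %d%n",
                        windowedWord.key(),
                        windowedWord.window().startTime(),
                        windowedWord.window().endTime(),
                        count));
    }
}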
Kafka Streams with Python
After building our first Kafka Streams application in Java, you might be wondering if similar functionality is available for Python developers.
While Apache Kafka Streams is specifically designed as a Java library, Python developers can still implement stream processing applications with Kafka using alternative frameworks.
In this section, we’ll explore how to implement stream processing in Python and build a word count application similar to the one we just created in Java.
Understanding Python options for Kafka Stream processing
Kafka Streams is a JVM-based library that doesn’t have an official Python implementation. However, Python developers have several options to implement stream processing applications with Kafka:
- Faust: A Python library inspired by Kafka Streams that provides similar stream processing capabilities
- Structured Streaming with PySpark: Apache Spark’s Python API for stream processing that can read from and write to Kafka
- Kafka Python Client + Custom Logic: Combining the standard Kafka Python client with your custom processing code
- Streamz: A library for building streaming data pipelines in Python
For our example, we’ll use Faust, which provides an elegant API similar to Kafka Streams while maintaining Python’s simplicity and readability. Faust was developed at Robinhood and is designed to bring Kafka Streams-like functionality to Python developers.
Setting up a Python Kafka Streams project with Faust
Let’s set up a new Python project for our word count application using Faust:
1. First, make sure you have Python 3.6+ installed on your system
2. Create a new project directory and set up a virtual environment:
# Create a project directory
mkdir kafka-streams-python
cd kafka-streams-python
# Create and activate a virtual environment
python -m venv venv
# On macOS/Linux
source venv/bin/activate
# On Windows
venv\Scripts\activate
3. Install Faust using pip:
pip install faust-streaming
Building a word count application with Faust
Now, let's create a Python version of our word count application. Create a file named word_count.py with the following code:
import faust
import re
# Define the Faust app
app = faust.App(
'wordcount-python-application',
broker='kafka://localhost:9092',
value_serializer='raw',
)
# Define the input topic
text_topic = app.topic('text-input', value_type=bytes)
# Define the output topic
# Note: For demonstration purposes, we're using a different output topic than the Java example
counts_topic = app.topic('word-count-output-python', value_type=bytes)
# Define a table to store word counts
word_counts = app.Table('word_counts', default=int, partitions=1)
# Define a processor that counts words
@app.agent(text_topic)
async def process(stream):
async for message in stream:
# Decode the message to text
text = message.decode('utf-8')
# Split the text into words and convert to lowercase
words = re.split(r'\W+', text.lower())
# Count each word
for word in words:
if word: # Skip empty strings
word_counts[word] += 1
# Send the current count to the output topic
await counts_topic.send(
key=word.encode(),
value=str(word_counts[word]).encode(),
)
# Print for debugging
print(f"{word}: {word_counts[word]}")
if __name__ == '__main__':
app.main()
Detailed code explanation
Let’s break down the Python code in more detail:
1. Imports
import faust
import re
- faust: The main Faust library for stream processing
- re: Python's regular expression module, used for splitting text into words
2. Application definition
app = faust.App(
    'wordcount-python-application',
    broker='kafka://localhost:9092',
    value_serializer='raw',
)
- faust.App: Creates a new Faust application instance
- First parameter ('wordcount-python-application'): The application ID, similar to the application.id property in Kafka Streams
- broker: Specifies the Kafka broker address
- value_serializer='raw': Tells Faust to use raw bytes serialization, similar to the Serdes in Kafka Streams
3. Topic definition
text_topic = app.topic('text-input', value_type=bytes)
counts_topic = app.topic('word-count-output-python', value_type=bytes)
- app.topic(): Defines a Kafka topic to consume from or produce to
- First parameter: The name of the Kafka topic
- value_type=bytes: Specifies that values in this topic are byte arrays
- This is analogous to creating input and output topics in Kafka Streams
4. State management with tables
word_counts = app.Table('word_counts', default=int, partitions=1)
- app.Table(): Creates a state table to store word counts
- First parameter ('word_counts'): The name of the table
- default=int: Sets the default value for a new key to 0 (int() returns 0)
- This is similar to KTable in Kafka Streams, providing persistent state
5. Stream processing with agents
@app.agent(text_topic)
async def process(stream):
- @app.agent(): A decorator that defines a stream processor (agent)
- text_topic: The input topic to consume from
- async def process(stream): An asynchronous function that processes the stream
- This is similar to the StreamsBuilder.stream() call in Kafka Streams
6. Message processing loop
async for message in stream:
    # Processing logic here
- async for: Asynchronously iterates through messages in the stream
- This is equivalent to the streaming operations in Kafka Streams, but uses Python's async/await syntax
7. Text processing and word splitting
text = message.decode('utf-8')
words = re.split(r'\W+', text.lower())
- message.decode('utf-8'): Converts bytes to a UTF-8 string
- re.split(r'\W+', text.lower()): Splits text into words using a regular expression
- r'\W+': Matches one or more non-word characters (spaces, punctuation, etc.)
- text.lower(): Converts text to lowercase for case-insensitive matching
- This is similar to the flatMapValues() operation in our Java example
8. Word counting and state updates
for word in words:
    if word:  # Skip empty strings
        word_counts[word] += 1
- word_counts[word] += 1: Increments the count for each word in the table
- if word: Skips empty strings (as the regex might create some)
- This is equivalent to the groupByKey() and count() operations in Kafka Streams
9. Output message production
await counts_topic.send(
    key=word.encode(),
    value=str(word_counts[word]).encode(),
)
- await counts_topic.send(): Asynchronously sends a message to the output topic
- key=word.encode(): Encodes the word as bytes for the message key
- value=str(word_counts[word]).encode(): Converts the count to a string and then to bytes
- This is similar to the toStream() and to() operations in Kafka Streams
10. Main application entry point
if __name__ == '__main__':
    app.main()
- Standard Python pattern to ensure the code runs only when executed directly
- app.main(): Starts the Faust worker, similar to KafkaStreams.start() in Java
The Faust API uses Python's asynchronous features (async/await) to handle concurrency, which is different from Java's approach but achieves similar goals. The key difference is that Faust uses Python's event loop for asynchronous processing, while Kafka Streams uses Java's threading model.
Running the Python Word Count Application
Before running our Python application, make sure Kafka is still running from the previous section. If not, start it using the instructions from the setup section.
1. Create the output topic (only needed if it doesn’t exist yet):
kafka-topics --create --bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--topic word-count-output-python
2. Run the Faust application:
python word_count.py worker
The worker argument tells Faust to start a worker process that will consume from the input topic and process messages.
3. Test the application by producing messages to the input topic:
kafka-console-producer --bootstrap-server localhost:9092 --topic text-input
Then type some messages:
Hello Python Kafka
Stream processing with Python
4. Consume the output from the Python output topic:
kafka-console-consumer --bootstrap-server localhost:9092 \
--topic word-count-output-python \
--from-beginning \
--property print.key=true \
--property print.value=true
You should see the word counts appearing in the consumer window as you type messages in the producer.
Conclusion And Next Steps
This article covered Apache Kafka Streams from its core concepts to Java implementation and Python alternatives with Faust. We explored how stream processing facilitates real-time data transformation and analysis, enabling immediate insights from data.
Whether using Java for performance guarantees or Python for a simpler development experience, Kafka stream processing offers flexible solutions for building event-driven systems.
To deepen your Kafka knowledge, check out DataCamp's resources: start with How to Learn Apache Kafka in 2025 for a structured learning path, or explore Getting Started with Java for Data for foundational skills. For cloud-based deployment, refer to AWS MSK for Beginners Guide, and compare Kafka with other systems in Kafka vs RabbitMQ. Finally, apply your skills with Java Projects for All Levels to gain hands-on experience.
Kafka Streams FAQs
What is the difference between Apache Kafka and Kafka Streams?
Apache Kafka is a distributed messaging system and event store that functions as the data backbone for organizations, while Kafka Streams is a client library built on top of Kafka that enables stream processing. While Kafka requires a dedicated cluster of broker servers, Kafka Streams runs as part of your application with no additional infrastructure. Kafka primarily handles message routing, while Kafka Streams provides complex processing capabilities like filtering, aggregation, joins, and windowing operations with built-in state management.
Do I need to know Java to use Kafka Streams?
While the official Kafka Streams library is only available for Java and Scala, you can implement similar functionality in Python using alternatives like Faust. The Java implementation offers better performance, deeper integration with the Kafka ecosystem, and full exactly-once processing semantics. However, as shown in our article, Python developers can build effective stream processing applications using libraries inspired by Kafka Streams, which might be preferable if your team has stronger Python expertise or needs to integrate with Python-based data science tools.
How does Kafka Streams handle application scaling and fault tolerance?
Kafka Streams inherits its scaling and fault tolerance capabilities from Kafka's architecture. Applications can scale horizontally by running multiple instances that automatically partition the work. Each instance processes specific partitions of the input topics, and Kafka Streams handles coordination between instances. For fault tolerance, Kafka Streams maintains state in local stores backed by Kafka topics, allowing it to recover state after failures and provide exactly-once processing guarantees. If an instance fails, its partitions are automatically reassigned to other running instances.
What types of real-time processing can I implement with Kafka Streams?
Kafka Streams supports a wide range of processing operations including filtering messages based on conditions, transforming data formats, enriching streams with additional information, aggregating data over time (like counts, sums, averages), joining multiple streams together, windowing operations (processing data within time boundaries), and stateful processing that remembers previous events. Common use cases include real-time analytics dashboards, fraud detection systems, event-driven microservices, recommendation engines, and IoT data processing pipelines.
How does Kafka Streams compare to other stream processing frameworks like Apache Spark or Flink?
Compared to frameworks like Apache Spark Streaming or Apache Flink, Kafka Streams offers simplicity and lightweight deployment as its main advantages. While Spark and Flink require separate clusters and resource managers, Kafka Streams applications run as standard applications alongside your existing services. Kafka Streams also has tighter integration with Kafka but less built-in functionality for machine learning or complex analytics. Spark and Flink generally offer more advanced features and may handle higher-throughput workloads better, but with increased operational complexity. Choose Kafka Streams when you want a simpler solution that's directly integrated with Kafka.