When and When Not to Use Apache Kafka as a Database

8 Jul 2024

Is Apache Kafka a database? Well, no. It's a real-time event streaming platform. However, Kafka's ability to retain data in a durable and replicated manner does give it some database-like properties, which can be helpful in specific scenarios.

In this article, I intend to draw a connecting line highlighting similar properties of Kafka and a conventional database. By the end of this article, you'll understand when Kafka might be perfect for database-like use cases and when it should be used purely for its intended purpose as a streaming platform.

Overview: Kafka vs. Database

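At its core, a database is a system designed to store, retrieve, and manage data in an organized manner. Many databases, relational ones in particular, also provide ACID (atomicity, consistency, isolation, durability) guarantees for transactions, although not every database offers all four. Databases allow large volumes of data to be stored and then quickly accessed, queried, and manipulated.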

Apache Kafka, on the other hand, is a distributed event streaming platform. It works by allowing applications to publish (write) and subscribe to (read) streams of events, also known as records or messages.

Kafka stores these streams in a fault-tolerant, durable, and scalable manner across multiple servers. Its primary use cases include building real-time data pipelines, event sourcing, and stream processing applications.

The line between the two blurs when it comes to Kafka's data retention and querying capabilities. Kafka can store data for long periods, making it accessible for future reads, much like a database.

Additionally, tools like Kafka Streams and ksqlDB provide querying and processing capabilities, which can make Kafka feel like a database in some scenarios.

These features enable Kafka to handle use cases that traditionally require databases, such as data warehousing and analytics, albeit with different architectural approaches and strengths.

How Does Data Storage Work in Kafka?

In Kafka, data storage revolves around topics. A topic is a category or feed name to which records are sent. Each topic is split into partitions, which are the basic unit of parallelism and storage in Kafka.

When a producer sends data to a topic, the data is appended to the end of one of its partitions. Each partition is an ordered, immutable sequence of records that Kafka stores and maintains.
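To make this concrete, here is a minimal Python sketch of a topic as a set of append-only partitions. It is a toy in-memory model, not Kafka's client API: the Topic and Partition classes are hypothetical names, and CRC32 is only a stand-in for whatever hash a real partitioner applies to the record key.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only sequence of records; a record's index is its offset."""
    records: list = field(default_factory=list)

    def append(self, key, value):
        self.records.append((key, value))
        return len(self.records) - 1  # offset of the record just written


@dataclass
class Topic:
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def partition_for(self, key):
        # Stand-in hash (CRC32); this is an assumption, not Kafka's partitioner.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def send(self, key, value):
        p = self.partition_for(key)
        offset = self.partitions[p].append(key, value)
        return p, offset


topic = Topic("orders", num_partitions=3)
first = topic.send("customer-42", "created")
second = topic.send("customer-42", "paid")
# Same key -> same partition, and offsets grow in append order.
```

Routing by key is what keeps all records for one key in a single partition, so their relative order is preserved.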

Kafka's durability and fault tolerance come from its replication mechanism. Each partition is replicated across multiple brokers (servers) in the Kafka cluster. One of these replicas is designated as the leader, while the others are followers.

By default, the leader handles all read and write requests, and the followers replicate the data to ensure consistency. If the leader fails, one of the in-sync followers automatically takes over, providing high availability.
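The leader and follower roles can be sketched with a small in-memory model. This is a simplification under loud assumptions: the ReplicatedPartition class and the broker names are hypothetical, followers copy the leader synchronously, and the new leader is simply the next surviving replica, whereas a real cluster runs a proper election.

```python
class ReplicatedPartition:
    """Toy model: one leader replica plus followers, each holding a copy of the log."""

    def __init__(self, broker_ids):
        self.logs = {b: [] for b in broker_ids}  # broker id -> that replica's log
        self.leader = broker_ids[0]

    def write(self, record):
        # Writes go to the leader first...
        self.logs[self.leader].append(record)
        # ...then followers copy the leader's log (synchronously, for simplicity).
        for broker, log in self.logs.items():
            if broker != self.leader:
                log[:] = self.logs[self.leader]

    def read(self):
        return list(self.logs[self.leader])

    def fail_leader(self):
        # Drop the failed broker and promote a surviving replica.
        del self.logs[self.leader]
        if not self.logs:
            raise RuntimeError("no replicas left")
        self.leader = next(iter(self.logs))


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("event-a")
partition.write("event-b")
partition.fail_leader()  # broker-1 goes away; a follower takes over
```

After the failover, reads are served by the promoted follower and return the same records, which is the property the replication mechanism is there to provide.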

Kafka allows users to configure data retention policies, which dictate how long it retains data in a topic. By default, Kafka keeps data for seven days (log.retention.hours=168), but this can be adjusted based on the application's needs.

For instance, you can configure Kafka to retain data indefinitely or to delete records after a specific time or once the log reaches a certain size. This flexibility makes Kafka suitable for both short-term and long-term data storage needs.
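As a rough illustration of time-based and size-based retention, the sketch below filters an in-memory log. The Record class and apply_retention function are hypothetical, and the model drops individual records, whereas Kafka removes whole log segments, so treat it as a picture of the policy rather than of the implementation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    timestamp: float  # seconds
    size_bytes: int
    value: str


def apply_retention(log, now, retention_seconds=None, retention_bytes=None):
    """Return the records that survive; None for a limit means 'no limit'."""
    kept = list(log)
    if retention_seconds is not None:
        kept = [r for r in kept if now - r.timestamp <= retention_seconds]
    if retention_bytes is not None:
        # Drop the oldest records until the total size fits the limit.
        total = sum(r.size_bytes for r in kept)
        while kept and total > retention_bytes:
            total -= kept.pop(0).size_bytes
    return kept


log = [Record(0, 100, "a"), Record(50, 100, "b"), Record(90, 100, "c")]
by_time = apply_retention(log, now=100, retention_seconds=60)   # drops "a"
by_size = apply_retention(log, now=100, retention_bytes=150)    # keeps only "c"
forever = apply_retention(log, now=100)                         # no limits set
```

Leaving both limits unset corresponds to the "retain indefinitely" case described above.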

Internally, Kafka stores each partition as an append-only commit log on disk. The log.dirs setting (or log.dir for a single directory) configures where this data is stored. This log structure means that all records are written to disk in the order they are received, which, together with replication, underpins Kafka's durability. Kafka leverages the filesystem for storing and retrieving data, taking advantage of sequential disk access, which is faster than random access.
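The idea of an append-only log on disk can be sketched in a few lines. This is not Kafka's on-disk format: the FileLog class, the 4-byte length prefix, and the segment.log file name are all assumptions chosen for illustration, and the position index lives only in memory.

```python
import os
import struct
import tempfile


class FileLog:
    """Toy append-only log file: each record is a 4-byte length plus the payload."""

    def __init__(self, path):
        self.path = path
        self.positions = []  # offset -> byte position of that record in the file
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            self.positions.append(f.tell())
            f.write(struct.pack(">I", len(payload)) + payload)
        return len(self.positions) - 1

    def read(self, offset: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(self.positions[offset])
            (length,) = struct.unpack(">I", f.read(4))
            return f.read(length)


log_dir = tempfile.mkdtemp()  # hypothetical stand-in for the configured log directory
log = FileLog(os.path.join(log_dir, "segment.log"))
first = log.append(b"hello")
second = log.append(b"world")
```

Every write lands at the end of the file, so writes are purely sequential, and a reader that knows a record's position can jump straight to it.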

Database-like Properties of Kafka

As touched on above, several properties make Kafka resemble a database. These properties include:

Long Data Retention

In a traditional database, data is stored persistently and can be accessed and queried over long periods. Similarly, Kafka provides long data retention capabilities, allowing it to store data for extended durations.

By default, Kafka retains data for a set period, but this retention period can be customized to meet specific needs. Whether you need to keep data for days, weeks, or even indefinitely, Kafka offers the flexibility to adjust retention policies.

This long-term storage capability ensures that data remains available for future access and processing, similar to how databases retain historical data for queries and analysis.

Splitting Topics into Partitions

Another database-like capability is Kafka's partitioning of topics. In a traditional database, partitioning (or sharding) tables can spread the data across multiple machines for better performance and scalability.

Kafka achieves similar scalability by splitting topics into partitions that are distributed across multiple brokers in a Kafka cluster. This not only allows Kafka to handle large volumes of data but also enables parallel processing, as different consumers can read from different partitions simultaneously.
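One way to picture parallel consumption is as an assignment of partitions to consumers, where each partition has exactly one reader within a group. The round-robin function below is a hypothetical sketch of that idea, not the assignment strategy Kafka itself implements, and the consumer names are made up.

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions across consumers; each partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


assignment = assign_round_robin(
    [0, 1, 2, 3, 4, 5],
    ["consumer-a", "consumer-b", "consumer-c"],
)
# Each consumer can now read its own partitions without coordinating with the others.
```

With six partitions and three consumers, each consumer ends up with two partitions, which is why adding partitions raises the ceiling on how many consumers can usefully work in parallel.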

Because consumers in a Kafka environment only read data and do not alter it, the data integrity and consistency issues that can arise in systems allowing concurrent database access are largely avoided.

The Use of Topics and Not Queues

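When it comes to data access, Kafka uses topics rather than queues. Reading a record does not remove it from the topic, and each consumer tracks its own position (offset) in the log. This means consumers can control which messages they read and when, including rereading older ones, somewhat like how databases allow for fine-tuned queries and data retrieval.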
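The difference from a queue is easiest to see in a small model where records stay in the log and every group of readers keeps its own offset. The TopicLog class and the group names here are hypothetical, and the sketch ignores partitions entirely.

```python
class TopicLog:
    """Toy model: records stay in the log; each reader group tracks its own offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # group name -> next offset to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group):
        start = self.offsets.get(group, 0)
        batch = self.records[start:]
        self.offsets[group] = len(self.records)
        return batch

    def seek(self, group, offset):
        self.offsets[group] = offset


log = TopicLog()
for event in ["signup", "login", "purchase"]:
    log.publish(event)

billing = log.poll("billing")       # reads everything so far
analytics = log.poll("analytics")   # an independent reader sees the same records
log.seek("billing", 1)              # rewind to replay from offset 1
replayed = log.poll("billing")
```

In a classic queue, the first reader would have removed the messages; here both groups see all three events, and one of them can go back and replay.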

Because Kafka retains records for the configured period, sometimes even indefinitely, rather than deleting them once they are consumed, the same data stays available to every consumer. Additionally, Kafka supports log-compacted topics, ensuring that a given key's latest value is retained.

This feature resembles a key-value store in a database, where the most recent value associated with a specific key is always available.

Durability and Log Compaction

The "D" in the ACID properties of databases stands for durability, which means that once data is committed, it remains safe and retrievable even in the event of failures.

Kafka achieves similar durability through its replication mechanism, where each partition is replicated across multiple brokers.

Furthermore, Kafka supports log-compacted topics, which ensure that the latest value for a given key is retained, resembling a key-value store.

This means that the most recent value associated with a specific key is always available, similar to how databases maintain the latest data state.
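The effect of compaction can be sketched as a function over a list of key-value records. This is a simplified model with a hypothetical compact function: it treats a None value as a delete marker and removes it immediately, and it runs in one pass over the whole log, whereas Kafka compacts in the background.

```python
def compact(log):
    """Keep only the most recent record per key, preserving log order.

    A value of None is treated as a delete marker and removes the key.
    """
    latest_offset = {}
    for offset, (key, _) in enumerate(log):
        latest_offset[key] = offset
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


log = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example"),
    ("user-3", "carol@example.com"),
    ("user-3", None),  # delete marker for user-3
]
compacted = compact(log)
```

The older value for user-1 is gone, user-3 has been deleted, and what remains is effectively the current state of a key-value table.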

When & When Not to Use Kafka as a DB

While Kafka has some database-like properties, there are important distinctions to consider when deciding whether to use it as a database.

Queries

Kafka does not support complex queries the way a relational database management system (RDBMS) or most NoSQL databases do. If your application requires complex querying capabilities, you'd typically ingest the data into a system designed for querying, such as a relational database, where such operations are optimized and supported.

Even for simple key-based lookups, you may need to maintain an external store, such as a cache populated by a Kafka consumer. However, this process can be inefficient compared to traditional databases that offer built-in indexing and query optimization.
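Such an external store might look roughly like the sketch below, where a plain Python list stands in for the topic and the LookupCache class is a hypothetical name. A real consumer would poll a Kafka client instead of slicing a list, and would need to persist its offset.

```python
class LookupCache:
    """Toy external store kept up to date by replaying records from a topic."""

    def __init__(self):
        self.by_key = {}
        self.next_offset = 0  # how far into the log this cache has read

    def consume(self, log):
        for key, value in log[self.next_offset:]:
            if value is None:
                self.by_key.pop(key, None)  # treat None as a delete
            else:
                self.by_key[key] = value
        self.next_offset = len(log)

    def get(self, key):
        return self.by_key.get(key)


log = [("sku-1", 10), ("sku-2", 5), ("sku-1", 7)]
cache = LookupCache()
cache.consume(log)
log.append(("sku-2", None))
cache.consume(log)  # picks up only the new record
```

The lookup itself is fast, but note the cost: the cache must first replay the log, and it is only as fresh as its last consume call.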

Retention

Databases typically offer lifelong data retention. While Kafka can be configured to retain data indefinitely, this is not its default behaviour.

Kafka's retention settings need to be explicitly configured to keep data for long periods, making it less straightforward compared to a database where long-term retention is built-in.

Backups

Most databases have built-in backup and recovery mechanisms, ensuring that data can be restored in case of failure. In Kafka, data durability is achieved through replication across multiple brokers.

However, Kafka does not have traditional backup features. To mimic backup capabilities, you would need to replicate data across several clusters, which can be complex and resource-intensive.

Traditional DB Features

Kafka itself lacks many traditional database features, such as secondary indexes and a built-in query language. Secondary indexes allow for fast lookups of data based on non-primary key attributes, which is crucial for many database applications.
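For readers unfamiliar with the term, here is a small sketch of what a secondary index buys you. The orders data and field names are invented for illustration; the point is the contrast between scanning every record and looking up by a non-primary attribute.

```python
from collections import defaultdict

# Primary key -> record, as a database table (or a compacted topic) would hold it.
orders = {
    "order-1": {"customer": "alice", "total": 30},
    "order-2": {"customer": "bob", "total": 12},
    "order-3": {"customer": "alice", "total": 8},
}


def scan_by_customer(customer):
    """Without an index: inspect every record."""
    return [oid for oid, order in orders.items() if order["customer"] == customer]


# Secondary index: customer -> order ids, built once and then reused.
by_customer = defaultdict(list)
for order_id, order in orders.items():
    by_customer[order["customer"]].append(order_id)

alice_orders = by_customer["alice"]  # direct lookup, no scan
```

A database maintains structures like by_customer for you on every write; with Kafka, you would have to build and maintain them in a separate system.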

Query languages like SQL provide powerful tools for data manipulation and retrieval. Kafka's focus is on stream processing and event-driven architecture, not on providing these database-specific features.

In The End...

Although Apache Kafka shares some similarities with databases, it is fundamentally designed for a different purpose. Kafka excels in scenarios where real-time data streaming and event-processing data pipelines are required.

However, it lacks essential database features such as complex querying capabilities, long-term data retention by default, traditional backup solutions, and advanced indexing.

Therefore, whether Kafka can be considered a database depends on your application's specific requirements. Use Kafka where real-time data processing, event-driven architecture, and scalability are paramount.

For tasks requiring complex queries, extensive data retention, and advanced indexing, traditional databases remain the preferred choice.

The bottom line is that Kafka complements rather than replaces databases, offering a powerful toolset for streaming data processing while databases continue to serve as the backbone for structured data management and complex querying needs.