From the course: Kafka Essentials: Quick Start for Building Effective Data Pipelines by Pearson
Understanding Kafka components - Kafka Tutorial
Welcome to Lesson 1.2, Understanding Kafka Components. In this lesson, I'm going to cover the basic components and concepts used in the Kafka messaging system.

The first concept is the Kafka producer: an independent application or program that sends data to the Kafka cluster. A producer publishes messages to one or more Kafka topics, and I'll get to topics in a moment. So a producer is where the data comes from, and it can be any type of application that is generally sending text-based data to the Kafka cluster.

On the other end is the Kafka consumer, which is also a client or independent program, one that consumes messages held by Kafka. The rate of consumption is independent of the rate at which the producer sends messages. So we've decoupled the producer and the consumer, and the messages are buffered in Kafka.

A Kafka topic is a category or feed name to which messages are sent, stored, and published. It is a way to separate out the different streams of messages being sent to Kafka. Producer applications write data to topics, and consumer applications read from topics, so you can have multiple topics running at the same time for multiple streams of data.

Sitting between producer and consumer applications is a Kafka broker, and this is essentially what Kafka is: a broker. A broker is minimally one machine or machine instance that receives data from producers and sends data to consumers. Brokers can manage many different topics, and for scalability, meaning message throughput, multiple brokers can be used. That means you can have multiple brokers serving the same topic and receiving data from the same producer, which is a way to increase message throughput. Topics can be divided into multiple partitions on one or more brokers, and when using multiple brokers, each broker writes to its local topic partitions, providing scalable performance. In the next slides, we'll go over this, and it'll make a bit more sense.
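The producer/consumer decoupling just described can be sketched in plain Python. This is a toy in-memory stand-in for a broker, not real Kafka; the class, topic name, and message format are all made up for illustration:

```python
from collections import deque

class MiniBroker:
    """Toy stand-in for a Kafka broker: buffers messages per topic."""
    def __init__(self):
        self.topics = {}  # topic name -> queue of buffered messages

    def publish(self, topic, message):
        # Producers append messages; the broker holds them until they are read
        self.topics.setdefault(topic, deque()).append(message)

    def poll(self, topic):
        # Consumers read at their own pace, independent of the producer's rate
        queue = self.topics.get(topic, deque())
        return queue.popleft() if queue else None

broker = MiniBroker()

# A fast producer sends five messages...
for i in range(5):
    broker.publish("sensor-data", f"reading-{i}")

# ...while a slower consumer has only read two so far;
# the remaining three stay buffered in the broker.
first = broker.poll("sensor-data")
second = broker.poll("sensor-data")
buffered = len(broker.topics["sensor-data"])
print(first, second, buffered)  # reading-0 reading-1 3
```

The point of the sketch is the buffering: neither side has to know how fast the other one runs.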
For reliability, partitions are replicated across brokers. Let's take a look and see how this actually works.

In this slide, we show a broker, the purple box, and inside the broker is a specific topic log. Within the topic log are three partitions. As mentioned, a broker runs on a single server or cloud instance, and it can be replicated on multiple machines; we'll get to that in a second. The topic log has three partitions in this case: partition 1, 2, and 3. Messages are written to the partitions in a round-robin fashion. That means, in the right-hand column of messages, M1 is going to go to partition 1, M2, the second message, will go to partition 2, and M3 will go to partition 3. Since there are no more partitions, we go round-robin back to the top: M4 goes to partition 1, M5 to partition 2, and so on. So each topic log may have multiple partitions, with messages landing in the partitions in turn as they come in. It's also possible to send specific messages to specific partitions using a key. That way, for instance, all the messages from M1 to M6 could be directed to, say, partition 1. Notice that within each partition, the time sequence runs from old to new, and each partition has indices, or offsets, that are used to keep track of messages as they come in.

The idea behind a Kafka topic log is actually quite simple. A topic log is an append-only log of events or messages. It's similar to any other kind of log you may have on a server, for instance, a web server log. Messages are always written at the end of the log; you never insert messages anywhere else. A topic log can be spread across one to n partitions, as I showed in the previous image. Kafka logs can only be read by seeking to an offset in the log and then moving forward in a sequential fashion.
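The round-robin and key-based placement above can be sketched as a toy model in Python. Real Kafka clients hash the key bytes with their own partitioner; Python's built-in `hash` stands in here, and the class and message names are invented for illustration:

```python
class TopicLog:
    """Toy topic log: append-only partitions with round-robin or key-based placement."""
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = 0  # round-robin pointer

    def append(self, message, key=None):
        if key is None:
            # No key: rotate through partitions, like M1..M6 in the slide
            index = self._next
            self._next = (self._next + 1) % len(self.partitions)
        else:
            # Key: hash to a fixed partition so related messages stay together
            index = hash(key) % len(self.partitions)
        self.partitions[index].append(message)  # always appended at the end
        return index

log = TopicLog(num_partitions=3)
placements = [log.append(m) for m in ["M1", "M2", "M3", "M4", "M5"]]
print(placements)  # [0, 1, 2, 0, 1] -- wraps back around after partition 3

# Keyed messages all land in the same (hash-chosen) partition:
keyed = {log.append(m, key="device-42") for m in ["K1", "K2", "K3"]}
print(len(keyed))  # 1
```

Note that within each partition list, the position of a message is effectively its offset: older messages sit at lower indices, and reads proceed forward from a chosen index.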
The other thing to remember is that, like any log, topic logs are immutable: data is fixed until it expires. Topics can be configured so that data expires after it reaches a certain age or size, and that age can range from seconds to forever. Now, it's not suggested that Kafka be used as permanent storage with the age of your data set to forever; however, it can be, and we'll talk about this a little bit more in the next lesson.

So let's take a closer look at partitioning and redundancy. In this case, I'm showing a producer that's sending messages and a broker, a single server, that has topic A with two partitions. The producer doesn't know which partition a message is going to; it sends its messages to the broker, and the broker figures that out. All the producer needs to worry about is: I'm sending a message to Kafka, to a certain topic.

Now, as you might imagine, a single broker is vulnerable. If this server goes down for whatever reason, the producer, which may be sending messages in real time from anywhere out on the web or in the cloud, now has nowhere to send them, and you're going to lose data. This situation is not very good if you want to make sure you don't lose any messages. So what can be done? Kafka provides a mechanism for partition replication. In this case, we're going to employ two brokers, broker 1 and broker 2, with the same producer sending messages to Kafka. What Kafka is going to do is replicate both partitions, topic A partition 1 and topic A partition 2, on both brokers. However, it's going to assign a leader partition on each broker. So broker 1 is going to say topic A partition 1 is the leader partition, and that's where incoming data will be written. In the same way, broker 2 is going to assign topic A partition 2 as its leader.
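As a hedged example of the expiration settings mentioned above, retention can be set when a topic is created with the stock `kafka-topics.sh` tool. The topic name, broker address, and the specific values here are assumptions for illustration, not recommendations:

```shell
# Create a topic whose messages expire after 7 days (retention.ms)
# or once a partition's log exceeds ~1 GB (retention.bytes), whichever comes first.
kafka-topics.sh --create \
  --topic sensor-data \
  --partitions 2 \
  --replication-factor 2 \
  --bootstrap-server localhost:9092 \
  --config retention.ms=604800000 \
  --config retention.bytes=1073741824

# Setting retention.ms=-1 keeps data forever: possible, but not generally suggested.
```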
Now, what's going to happen is that as messages are received by Kafka, it's going to direct some of them to the leader partition on broker 1, topic A partition 1, and some to the leader on broker 2, topic A partition 2, in a round-robin fashion. And since there is a leader partition, there's also a follower partition, which is replicated from the active leader partition on the other broker. So, for instance, on broker 2, topic A partition 1 is a follower, replicated in real time from the leader topic A partition 1 on broker 1, so you've got a backup copy of the data being delivered to topic A partition 1. In the same way, topic A partition 2 has a follower on broker 1, so data written to the leader topic A partition 2 gets replicated to topic A partition 2 on broker 1.

You can see how this can be helpful, because if broker 1 stops, suppose that server goes down, we already have a copy of topic A partition 1 on broker 2. Messages can be redirected to that partition, and messages going to topic A partition 2 can continue as they did before. This provides a failover mechanism for brokers, and it's one of the reasons why data are broken up into partitions. Most of the time, you would break your data into a number of partitions matching the number of servers you're using, and this provides a replicated failover mechanism, so that even if one of the servers goes down, your producer continues to send messages and you don't lose any data. In addition, this also increases throughput, because the Kafka cluster divides the messages between brokers, so it can handle more messages as each broker receives and writes them.

In a similar way, partition replication ensures that consumers can keep reading data even if a server goes down. In this case, if broker 2 goes down, the consumer can be switched to reading topic A partition 2 from its follower copy on broker 1.
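The leader/follower failover just described can be sketched as a toy model. Real Kafka election is handled by the cluster controller and is far more involved; the class and broker names here are invented for illustration:

```python
class ReplicatedPartition:
    """Toy leader/follower pair: writes go to the leader and are mirrored."""
    def __init__(self, leader_broker, follower_broker):
        self.leader = leader_broker      # broker currently accepting writes
        self.follower = follower_broker  # broker holding the replicated copy
        self.replicas = {leader_broker: [], follower_broker: []}

    def write(self, message):
        # The leader takes the write, and it is replicated to the follower
        self.replicas[self.leader].append(message)
        if self.follower is not None:
            self.replicas[self.follower].append(message)

    def fail_over(self):
        # The leader's broker died: promote the follower and drop the dead copy
        dead = self.leader
        self.leader, self.follower = self.follower, None
        del self.replicas[dead]

p1 = ReplicatedPartition(leader_broker="broker-1", follower_broker="broker-2")
p1.write("M1")
p1.write("M2")

p1.fail_over()   # broker-1 goes down; broker-2's copy takes over as leader
p1.write("M3")   # the producer keeps sending, and nothing is lost

print(p1.leader)                # broker-2
print(p1.replicas["broker-2"])  # ['M1', 'M2', 'M3']
```

Because the follower already held M1 and M2 when broker-1 failed, the producer never sees the outage.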
And we still have all the data available to the consumer, because, as before, we've replicated all the data from the leader to the follower. So it's a built-in failover mechanism that works for both producers and consumers within the Kafka cluster.

Also within the Kafka cluster is something called a partition rebalance. Within a consumer group, Kafka can change the ownership of a partition from one consumer to another, and the process of changing partition ownership across the consumers is called a rebalance. A rebalance generally happens with the following events. First, a new consumer joins a consumer group: the new consumer starts consuming messages from partitions previously consumed by another consumer, so at that point Kafka needs to rebalance which consumer is connected to which partition. Second, an existing consumer shuts down or crashes; say a server goes down. At that point, obviously, you need to rebalance, as shown here: in this case, we use one server instead of two, and we need to rebalance topic A partition 1 and topic A partition 2. The third event is that a topic is modified, for instance, new partitions are added to the topic because you've increased the number of servers and want to add to the throughput your topic can manage. These kinds of things can happen in general with a Kafka cluster.

The reason I'm talking about these is that in this scheme of partitions and replicated servers, it's important to keep track of where your messages are and what messages have been written. To do that, Kafka uses topic log offsets, and there are two types of offsets that Kafka manages. The first is the obvious one, the current offset. It starts at zero, right here at the first message, and each time a consumer asks for a new message, the current offset advances to the next available offset.
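The rebalance idea can be sketched as a toy assignment function. Real Kafka uses pluggable assignment strategies inside the group coordinator; this round-robin spread, and the consumer and partition names, are simplifications for illustration:

```python
def assign_partitions(partitions, consumers):
    """Toy rebalance: spread partitions across consumers round-robin."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = ["topicA-p1", "topicA-p2"]

# Two consumers in the group: one partition each
before = assign_partitions(partitions, ["consumer-1", "consumer-2"])
print(before)  # {'consumer-1': ['topicA-p1'], 'consumer-2': ['topicA-p2']}

# consumer-2 crashes -> rebalance: consumer-1 now owns both partitions
after = assign_partitions(partitions, ["consumer-1"])
print(after)   # {'consumer-1': ['topicA-p1', 'topicA-p2']}
```

Each of the three triggering events (consumer joins, consumer leaves, partitions added) just changes one of the two input lists, and ownership is recomputed.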
So, basically, my consumer application asks for a message, Kafka sends it, and then increments the current offset to the next available location. This continues forward, never backward, as more data are requested. The current offset is used to avoid resending the same records to the same consumer; it's a way to keep track of which messages have been requested.

A committed offset is a position that the consumer has confirmed it has received and processed, much like a database commit. It's basically telling Kafka: yes, I got the message. The current offset gets incremented when a message is requested, but that doesn't confirm the consumer was able to take the message and do something with it. A committed offset is used to avoid resending the same records to a new consumer in the event of a partition rebalance or a consumer change. So this is basically saying: I'm going to make sure that whatever consumer I'm using got a specific message at a specific offset, and I confirm that with Kafka. Should something happen, like a partition rebalance because of one of the events I mentioned, it's a way to make sure everybody's on the same page.

Kafka has two basic methods of doing commits. Auto commit, which is the default, uses a commit interval; the default is five seconds. So Kafka will commit your current offset every five seconds, and you don't have to worry about your current offset or your committed offset; they just sync up every five seconds. It is possible for messages to be read twice if a second consumer takes over before the commit period completes, so you need to be a little aware of this if a rebalance of some sort happens within that five-second window while using auto commit.
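The difference between the two offsets, and the duplicate read that auto commit allows, can be sketched as a toy model (the class and message names are invented; real consumers track this per partition through the broker):

```python
class PartitionReader:
    """Toy model of one consumer's two offsets on a single partition."""
    def __init__(self, messages):
        self.messages = messages
        self.current = 0    # next offset to hand out
        self.committed = 0  # last position confirmed back to Kafka

    def poll(self):
        # Handing out a message advances the current offset only
        message = self.messages[self.current]
        self.current += 1
        return message

    def commit(self):
        # Confirming receipt advances the committed offset (auto commit firing)
        self.committed = self.current

reader = PartitionReader(["M1", "M2", "M3"])
first = reader.poll()   # M1 delivered
reader.commit()         # committed offset -> 1
reader.poll()           # M2 delivered, but the commit interval hasn't elapsed yet

# A rebalance hands the partition to a new consumer, which resumes
# from the committed offset, so M2 is delivered a second time.
takeover = PartitionReader(["M1", "M2", "M3"])
takeover.current = reader.committed
redelivered = takeover.poll()
print(redelivered)  # M2
```

That gap between `current` and `committed` is exactly the five-second window the lesson warns about.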
If you are really concerned about message commits and want to make sure things are written and recorded, there's a synchronous commit, which is a manual commit method: slower, but sure. There's also an asynchronous method: faster, but it can be a little risky. These are manual commits that you, as a user, will have to perform with Kafka as part of your consumer.

A synchronous commit will block until the commit is confirmed, and it will retry if there are recoverable errors. So, again, similar to a database commit, a manual synchronous commit confirms that the message has been received, and if something recoverable goes wrong on the consumer end, it's not going to fix anything; it's just going to continue to retry. A manual asynchronous commit sends the request and then just continues; there are no retries on error, and, to preserve ordering, if it fails, data could be lost. So an asynchronous manual commit just fires off the commit and continues; it's not going to wait to hear back that everything synced up. It works quicker, but you could run into situations where data could be lost. In general, and in these live lessons, we're going to use the auto commit method and not get into working with manual commits.

After all this discussion of brokers, offsets, and partitions, the thing to remember as a user of Kafka, and in most cases it will be a Kafka cluster, is that you're going to see the cluster as one big broker. In other words, you're going to interact with Kafka topics and do not need to know how many brokers or partitions are used per topic. You certainly can have access to that information; however, to get started, you just need to concern yourself with the producers and the consumers.
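As a hedged sketch of how the commit choices look in code, here is a consumer using the kafka-python library (one of several Python clients; install with `pip install kafka-python`). The broker address, topic, and group id are assumptions; `enable_auto_commit`, `auto_commit_interval_ms`, `commit()`, and `commit_async()` are kafka-python names, but treat this as an illustration rather than a definitive recipe:

```python
def make_consumer_settings(manual=True):
    # Pure helper so the choice of commit style is easy to see
    settings = {
        "bootstrap_servers": ["localhost:9092"],  # assumed local broker
        "group_id": "demo-group",                 # hypothetical consumer group
        "enable_auto_commit": not manual,         # manual commits turn auto commit off
    }
    if not manual:
        settings["auto_commit_interval_ms"] = 5000  # the 5-second default, made explicit
    return settings

def consume(topic="events"):
    # Lazy import: the sketch loads without kafka-python or a running broker
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(topic, **make_consumer_settings(manual=True))
    for record in consumer:
        print(record.value)        # stand-in for your processing logic
        consumer.commit()          # synchronous: blocks, retries recoverable errors
        # consumer.commit_async()  # asynchronous: faster, but no retries on failure

if __name__ == "__main__":
    consume()
```

In these lessons we'll stick with `manual=False`, the auto commit path, as mentioned above.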
You can write producers and consumers in just about any language, including Java, C, C++, .NET, Python, and Go, or with frameworks like Spark, or in any other application that has libraries supporting Kafka. We're going to use Python and PySpark to communicate with Kafka as examples of writing producers and consumers.

In addition, you can also use something called Kafka Connect. For example, managed connectors are available for getting data from Kafka into S3 buckets, or for moving data from MongoDB into Kafka. These are pre-written connectors for popular services and storage systems, so that I don't have to write the consumer or producer for these services, which can exist in the cloud or on premises. As I mentioned, Kafka Connect is a component of Apache Kafka that provides pre-written connections to Kafka from many popular systems, such as databases, cloud services, search indexes, et cetera. The important thing here is that, by eliminating the custom producers and consumers that we write as users, Kafka Connect provides a standard and efficient way to connect directly to popular services. In many cases it eliminates the need for us to write producers and consumers and make sure we get them right; the Kafka Connect connectors are written by experienced Kafka programmers and work as you would expect them to, and we'll take a look at some of these in future lessons.

Finally, I wanted to say a few words about Kafka with Hadoop and Spark. Because Kafka is often used with big data or data analytics systems, there are plenty of ways to connect to HDFS, which is the Hadoop Distributed File System, and to Spark. For instance, for HDFS versions 2 and 3, and future versions, I would assume, a sink connector is available that enables export of data from Kafka topics to HDFS files in a variety of formats.
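To give a feel for the producer side in Python, here is a hedged sketch using the kafka-python library. The topic name, broker address, and messages are assumptions for illustration; `KafkaProducer`, `send`, and `flush` are the library's actual API, but this is a sketch, not a production setup:

```python
def text_serializer(value: str) -> bytes:
    # Kafka transports raw bytes, so text-based data must be encoded first
    return value.encode("utf-8")

def produce(topic="events", messages=("hello", "world")):
    # Lazy import: the sketch loads without kafka-python or a running broker
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],  # assumed local broker
        value_serializer=text_serializer,
    )
    for message in messages:
        producer.send(topic, message)  # the broker picks the partition
    producer.flush()  # block until buffered messages are actually delivered

if __name__ == "__main__":
    produce()
```

Notice the producer only names a topic; as discussed earlier, it never needs to know which broker or partition the message ends up on.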
Integration with Hive, a Hadoop-based SQL engine, makes data immediately available for Hive queries. If, for example, you also want to use Spark, there is a Spark streaming API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from Kafka and processed using map, reduce, join, and window operations, all tools that are available in Spark. Finally, the processed data can be pushed out to HDFS, a local file system, a database, or live dashboards. We're going to take a look at a couple of these mechanisms in future lessons. The important point here is that Kafka can be integrated with pretty much any tool you want to use for your big data analysis.
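As a hedged preview of the Spark side, here is a minimal PySpark Structured Streaming sketch that reads from Kafka. The broker address and topic are assumptions, and running it requires `pyspark` plus the Spark-Kafka integration package on the classpath; the `format("kafka")`, `kafka.bootstrap.servers`, and `subscribe` options are Spark's real Kafka-source settings:

```python
def kafka_source_options(brokers="localhost:9092", topic="events"):
    # The options Spark's Kafka source expects, kept in a plain dict
    return {"kafka.bootstrap.servers": brokers, "subscribe": topic}

def build_stream():
    # Lazy import: the sketch loads without pyspark installed
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()
    reader = spark.readStream.format("kafka")
    for key, value in kafka_source_options().items():
        reader = reader.option(key, value)
    df = reader.load()  # columns include key, value, partition, and offset
    # Kafka values arrive as bytes; cast to text for downstream processing
    return df.selectExpr("CAST(value AS STRING) AS message")

if __name__ == "__main__":
    stream = build_stream()
    # Console sink as a stand-in for HDFS, a database, or a live dashboard
    query = stream.writeStream.format("console").start()
    query.awaitTermination()
```

From the returned streaming DataFrame you can apply the usual Spark transformations before pushing results out to one of the sinks mentioned above.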