From the course: Kafka Essentials: Quick Start for Building Effective Data Pipelines by Pearson

Why do I need a message broker?

Welcome to Lesson 1.1, Why Do I Need a Message Broker? In this lesson, I'm going to introduce the Kafka message broker with some usage examples. I'm also going to provide a little background and history, and then finish up with some prerequisites that will be helpful in running the examples I present in these live lessons.

Let's begin by considering a simple web server application. In this case, we've got two web servers and a back-end server collecting the server logs. This is a very simple, easy-to-set-up case, and it works for many web sites. Now let's consider what happens as our web server farm, or web presence, begins to grow. Now we've added a chat server, a database, a database failover server, a shopping cart, and a shopping cart failover. We're still collecting logs, and we're collecting metrics on all the servers that are running. Things have gotten a little more complicated, and you can see all the various data paths connecting the servers in our web presence. We've gone from something very simple to something quite a bit more complex. The difficulty comes when we want to change, update, add, or turn off something: we've got to manage all these connections and make sure everything keeps operating at the same time.

One way to solve this is to introduce a message broker. In many cases, you can place a message broker like Kafka in between all these web servers and monitors and separate the input from the output, so that any one of these servers or services can be taken down, removed, changed, or restarted without interfering with the communications of any other server in the group. It makes connecting things easier, and it also makes sharing data at various times and in various places a lot easier.
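The decoupling idea can be sketched in a few lines of plain Python, with no Kafka installation required. This is only a toy stand-in for a broker, and the class and record names here are illustrative, not part of any Kafka API: producers append records to a shared buffer, and a consumer drains it later at its own pace, so neither side needs the other to be running at the same moment.

```python
from collections import deque

class TinyBroker:
    """A toy in-memory stand-in for a message broker (illustrative only)."""
    def __init__(self):
        self.buffer = deque()

    def publish(self, record):
        # Producers only touch the input side of the broker.
        self.buffer.append(record)

    def poll(self):
        # Consumers only touch the output side; None means "nothing yet".
        return self.buffer.popleft() if self.buffer else None

broker = TinyBroker()

# Web servers publish log lines without knowing who will read them.
broker.publish("web1: GET /cart 200")
broker.publish("web2: GET /login 200")

# A log collector can come along later and drain the buffer at its own pace.
collected = []
record = broker.poll()
while record is not None:
    collected.append(record)
    record = broker.poll()

print(collected)  # → ['web1: GET /cart 200', 'web2: GET /login 200']
```

The key design point is that the producers and the collector never reference each other, only the broker, so either side can be restarted or replaced independently.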
This is the general idea behind a message broker, and it applies to many other situations where you've got information flowing from one source, or multiple sources, to another, and you need to disconnect the input from the output so that they're not operating in lockstep and creating issues for maintenance and upgradability.

What is Kafka, then? More formally, Kafka is a message broker, as we call it, used for real-time streams of data: to collect big data, to do real-time analysis, or both. Kafka is high throughput; it is scalable, meaning we can add more servers as the amount of data increases; and it is reliable due to replication, meaning servers can fail and the system keeps running while recording all kinds of data in real time. Again, we don't need to worry about how fast the data is coming in, because it gets stored in Kafka, and the applications that read data can read at their own speed, not at the speed at which data is hitting the servers. The data can include logs, web app events, messages, manufacturing data, weather, financial streams, or anything else you can think of. Kafka collects data from a variety of sources, or streams as they're often called, and retains it so it is consistently available for processing at a later time. Again, it decouples the input from the output and buffers the messages, acting as a broker between them.

Kafka was developed by Jay Kreps at LinkedIn as a way to address the numerous parallel data pipelines that were developing across all their applications. If you think about a site like LinkedIn, you can understand how they've got lots of messages going in different directions, and it can very quickly become difficult to manage all of these without something like a message broker. Kafka was designed to handle many types of data and provide clean, structured data about user activity and system metrics in real time. Jay Kreps is now the CEO of Confluent, which is a Kafka management company. Kafka is open source; it became an Apache project in 2011.
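The point about readers running at their own speed can be illustrated with a toy append-only log. This is a sketch of the idea, not Kafka's actual implementation, and the names are made up for illustration: records are stored once, each reader keeps its own offset into the log, and a slow reader never holds up a fast producer and never loses data.

```python
# A toy append-only record log, loosely like one Kafka topic partition.
log = []

def produce(record):
    log.append(record)

# A fast producer: ten readings arrive before anyone reads them.
for i in range(10):
    produce(f"reading-{i}")

# Each consumer tracks its own position (offset) in the log.
offsets = {"dashboard": 0, "archiver": 0}

def consume(consumer, max_records):
    """Read up to max_records starting at this consumer's offset."""
    start = offsets[consumer]
    batch = log[start:start + max_records]
    offsets[consumer] = start + len(batch)
    return batch

fast = consume("dashboard", 10)  # keeps up with the whole stream
slow = consume("archiver", 3)    # lags behind, but nothing is lost

print(len(fast), len(slow), offsets)
```

Because reading only advances a reader's own offset and never removes records, the "archiver" can catch up on the remaining seven records whenever it is ready.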
One of the things I run into is that people ask, why the name Kafka? If you know anything about Franz Kafka, he is an author who wrote some bizarre, rather weird stories, and people ask, does this mean Kafka is a bizarre or weird piece of software? Absolutely not. According to Kreps: "I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus, the name sounded cool for an open source project." So there you have it. It's not because the software is mysterious or weird; it's just because Jay Kreps liked Franz Kafka.

Previously, I showed a web example of using Kafka to manage your web servers. Now let's talk about another use case: a data science workflow built around Kafka. Consider three data streams, A, B, and C, coming at us from various locations. They could be coming from the cloud or from some internal source; we don't know. We just know we have three streams of data, all being sent to Kafka, which takes that data and holds onto it in its buffer.

Now let's look at how a data science workflow can use Kafka. The first thing to notice is that Kafka is in the middle of everything, and it can send data to long-term storage. In general, Kafka is not considered a long-term storage system. It's generally used as temporary storage, where "temporary" can mean hours to weeks or even months, but in general it is not long-term storage. Kafka can write data to things like HDFS (the Hadoop Distributed File System), a database, AWS S3 buckets, or some form of data lake you may have, and it can read the data back in should it need to in the future. In addition, Kafka can make information available to, say, the data engineering team, who are using tools like Hive and Spark to do things at scale.
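The workflow here hinges on one property: several independent groups can read the same retained stream at different times without interfering with one another. A small self-contained sketch of that property in plain Python (the group names are illustrative only, and this is not Kafka's API):

```python
# A toy retained stream shared by independent reader groups,
# each with its own offset (illustrative only).
stream = [f"event-{i}" for i in range(6)]   # records retained by the broker
offsets = {"data-engineering": 0, "real-time-analytics": 0}

def read(group, n):
    """Each group reads from its own position; reads never consume records."""
    start = offsets[group]
    batch = stream[start:start + n]
    offsets[group] = start + len(batch)
    return batch

# The real-time analytics group keeps up with the live stream...
rt = read("real-time-analytics", 6)

# ...while data engineering processes the very same records later, in batches.
batch1 = read("data-engineering", 3)
batch2 = read("data-engineering", 3)

print(rt == batch1 + batch2)  # → True: same data, different pace
```

Both groups see identical data, on their own schedules, which is exactly why one broker can feed a real-time engine and a batch pipeline at the same time.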
The data engineering team reads the data that has just come in, or perhaps the data from the last week, and prepares it for the analytics model generation team, who are using tools like Hive and Spark again, along with some other tools, to develop and test models. They may also be grabbing long-term data from long-term storage. Once a model is working, it can be turned over to the real-time analytics team and put into production, where the data is now probably going to come directly from Kafka, because the real-time analytics engine is pulling data from the real-time streams. You can then do your analysis and send the results to your client-facing systems, whatever form your analysis takes; maybe it's a web panel or something like that. At the same time, the real-time analytics engine can be sending data to long-term storage. So basically, you get the idea that there's data flying all around here, used in different ways at different times by different groups. That's why Kafka can be very important in keeping track of this data and allowing multiple pathways at multiple times between the inputs (the data streams) and the outputs (data engineering and real-time analytics). This is one of the advantages of using a message broker like Kafka in a data analytics workflow.

Finally, I'd like to recommend some background prerequisites for the examples we're going to run in these live lessons. Many of the examples I will present are run on a virtual machine. The virtual machine runs Linux and has Hadoop/HDFS (the Hadoop Distributed File System), Spark, MySQL, and Kafka pre-installed. The virtual machine is freely available and runs on Windows, Macintosh, and Linux hosts. Instructions for downloading and starting the VM are provided in a subsequent lesson. Because this is a full virtual machine, there are some prerequisites that I highly recommend being familiar with.
The first is basic Linux command-line and bash usage. We will use some of this when we run examples. You do not have to be an expert, but it helps to understand what we're doing when we start using the Linux command line. Basic knowledge of Python, PySpark, and MySQL is also important. These are generally not an issue, because most people have at least some experience with Python, and PySpark follows naturally, since it is based on Python and allows Spark operations to be part of the Python language. Finally, knowledge of the Hadoop Distributed File System is important, and it's something that can be learned very easily. With this background, you should have no problem working with the examples and running them on your own with Kafka.