Fluent Bit, a Specialized Event Capture and Distribution Tool

Editor’s note: This article is an excerpt from the Manning MEAP (Manning Early Access Program) book, “Effective Platform Engineering.” In MEAP, you read a book chapter by chapter while it’s being written and get the final eBook as soon as it’s finished.
When an issue reaches us, we need to know what happened or, better, be able to engage with it as it is happening. The issue might be a serious system failure, or a question about how something or someone was, or still is, interacting with our systems. To address the issue, we must have this information available and a tool that fits our needs. Fluent Bit is such a tool.
In this excerpt from “Effective Platform Engineering” (the book’s first chapter), we’ll take a moment to understand what Fluent Bit is and answer some important questions about it, such as why it is so important (and worthy of a book) and how it fits into the IT ecosystem.
Why Is Fluent Bit So Important?
Fluent Bit is, at its heart, a specialized event capture and distribution tool. Let’s break that statement down a bit. Why is it specialized? Fluent Bit focuses on log events, metrics, and traces (sometimes called signals):
- Log — Each output message or line in a log file; put another way, a string of text that provides some information about what has happened. The message can range from completely unstructured to fully structured and self-describing.
- Metrics — Measurements, usually numeric values with a descriptive label, generated by our IT hardware and software. Examples are the use of each CPU core on a computer or the number of transactions processed in an application per minute.
- Traces — A trace is a linked set of values recorded at important waypoints in the execution of our software, often aligning with transactions. Traces have a lot in common with log events. The key difference is that trace events have a relationship with each other, and sometimes, a trace is not shared until a transaction ends or an error occurs.
It’s important to note that trace identifiers are carried through the different parts of the application. Traces have become more significant with Kubernetes and the adoption of microservice strategies because, when used properly, they can make following what is happening across distributed solutions far easier.
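To make these signal types concrete, here are some purely illustrative samples; the values, field names and labels are invented for this excerpt rather than taken from any particular tool or schema:

```
# An unstructured log event: free text with a timestamp
2024-05-14 10:02:11 ERROR payment failed for order 1234

# The same event as a structured, self-describing (JSON) log record
{"time":"2024-05-14T10:02:11Z","level":"error","msg":"payment failed","order_id":1234}

# A metric: a numeric measurement with descriptive labels
cpu_core_utilization{host="web-01", core="3"} 0.82

# Two spans from the same trace: related because they share a trace_id
{"trace_id":"a1b2c3","span_id":"001","name":"checkout","duration_ms":120}
{"trace_id":"a1b2c3","span_id":"002","parent_span_id":"001","name":"charge-card","duration_ms":42}
```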
The book “Effective Platform Engineering” explores event data in greater depth. The ability to handle various events within a single tool isn’t unique, but it does distinguish Fluent Bit from technologies it’s sometimes compared with, such as Logstash.
Because Fluent Bit reacts to and processes events, typically in near real time as they’re received or tracked from sources such as a file, it’s described as event-driven.
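As a rough sketch of what that looks like in practice, the following minimal pipeline, written in Fluent Bit’s classic configuration format, tails a log file and flushes each captured event to standard output within about a second of the line being written; the file path and tag are placeholders chosen for this example:

```
[SERVICE]
    # How often (in seconds) buffered events are flushed to the outputs
    Flush        1
    Log_Level    info

[INPUT]
    # Watch matching files and turn each new line into an event as it appears
    Name         tail
    Path         /var/log/myapp/*.log
    Tag          myapp.logs

[OUTPUT]
    # Print captured events; in a real deployment this would be a proper backend
    Name         stdout
    Match        myapp.*
```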
Why do we need Fluent Bit to be event-driven? After all, don’t we only look at the data when something isn’t right? Even if we take the traditional approach of examining logs once someone has declared an issue, people still like to see stats and metrics in something closer to real time.
We should also remember that we can derive meaningful, time-sensitive metrics from log events. In our code, we record events whenever our software has done something of interest, so that we can:
- Confirm that all is well.
- Understand which decision branch was taken.
- Find the result of a calculation applied to some data.
Even when a scheduler triggers the monitored solution, we want the logs, metrics and traces to be provided while they are still meaningful.
Clever words, then, for something mundane? It would be easy to think that. Unfortunately, this thinking can lead us to miss a wealth of possibilities and opportunities that Fluent Bit offers to make our lives a lot easier.
If we consider a log event as just a block of text from our code, for example, we may overlook that we can derive meaning from it and determine whether something else needs to occur there and then.
If the event is a health check indicating everything is fine, we could send the data to the operations dashboards and do no more. But if the event reports the receipt of a large, malformed payload, it could indicate a more serious problem that needs immediate intervention before users start calling to complain.
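One way to express that kind of decision is to route on the content of the event. The sketch below, which continues the illustrative pipeline above, keeps ordinary events flowing to the usual destination but retags anything whose log line mentions a malformed payload and sends it to an alerting endpoint; the record key ($log), the regex and the endpoint details are all assumptions made for this example:

```
[FILTER]
    # Re-tag events whose raw log line matches the regex; the trailing 'false'
    # drops the original copy so the event continues only under its new tag
    Name          rewrite_tag
    Match         myapp.logs
    Rule          $log .*malformed.* alert.myapp false

[OUTPUT]
    # Ordinary events still go to the normal destination
    Name          stdout
    Match         myapp.*

[OUTPUT]
    # Suspicious events go to a (hypothetical) endpoint that can page someone
    Name          http
    Match         alert.*
    Host          alerts.example.internal
    Port          443
    URI           /ingest
    TLS           on
```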
The Value of Event Distribution
Tackling the pain of identifying (and possibly needing to resolve) an issue with a system benefits us all individually, whether we’re:
- Part of a team working within an environment practicing some variation of DevOps.
- Part of a tiered support system on the operational front line.
- Or the developer last in the escalation chain for a testing issue.
The information we need to address an issue could be as simple as the complete log message. Often, we need to understand what happened before, during and after the event of concern to establish cause and effect. (For example, a database may be producing errors because we’ve run out of storage. Did we run out of storage because the housekeeping process failed, or did we overlook the need to monitor our storage capacity?)
We need to capture and aggregate data from many different sources. Logs, metrics and traces are the building blocks of observability, and this monitoring data is generally transient. Using Fluent Bit, and tools like it, enables us to gather data from all our sources and put it somewhere secure. It’s been my experience that when things go seriously wrong, people aren’t worrying about preserving state information, logs and the like. Their concern is returning to an operational status, which can mean that logs and stored metrics in the production environment may easily be trashed.
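As an illustration of that aggregation, a Fluent Bit agent on each node might collect from several local sources and immediately ship everything to a central collector, so the data survives even if the node itself is rebuilt or wiped; the sources and hostname here are invented for the sketch:

```
[INPUT]
    # Application log files on this node
    Name          tail
    Path          /var/log/myapp/*.log
    Tag           myapp.logs

[INPUT]
    # The node's systemd journal (Linux only)
    Name          systemd
    Tag           host.journal

[OUTPUT]
    # Ship everything off the node using the Fluentd/Fluent Bit forward protocol
    Name          forward
    Match         *
    Host          central-aggregator.example.internal
    Port          24224
```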
Aggregating log events doesn’t just mitigate the risk of data loss; it also helps us see the complete picture. COBOL solutions, for example, were usually made up of multiple programs run in sequence. Processing was sequential, but even then some distribution was already possible. As technology advanced, we adopted two- or three-tier solutions running concurrently (application and database servers, usually with separate UIs).
Even if we’re operating monolithic application servers, work can be spread across multiple virtualized, load-balanced servers, and microservices have led to a further explosion of distribution. To make sense of it all, we need to bring together the events spread across all these distribution points to get an accurate picture of what is happening.
Aside from preserving information that can help us diagnose an issue, there is a challenge that’s easy to overlook: The more time we take to get from issue to diagnosis, the more damage can occur and, therefore, the more painful the recovery process becomes.
Whether we’re fixing failed transactions or working out the scale of a security breach, by processing the metrics and logs as they occur, we can automate the evaluation of whether they indicate an issue occurring now or, better still, an imminent problem. Thus, we reduce the pain because we’ve either avoided the issue or kept its effect as small as possible.
The ability to distribute data easily also allows us to adopt different tools for different tasks. If the data is difficult to distribute, we end up with the lowest common denominator or with tools that support the most vocal team using the data rather than ones that address different needs. PagerDuty, for example, is ideal for notifying the right person depending on the identified system and the time and day of the week.
Fluent’s Place in CNCF
The Fluent tools, Fluentd and Fluent Bit, are key players in the Cloud Native Computing Foundation (CNCF) ecosystem, helping us gather, secure, and, ideally, analyze logs and metrics. These solutions allow us to get the observability data (logs, traces and metrics) in a form that another tool can render in an easily digestible format. Fluent Bit is having a greater effect than Fluentd in terms of adoption and support for the latest observability standards and tools, as we’ll see.
Within the CNCF, projects are classified to reflect their process, quality, maturity, support and adoption. Graduated projects such as Fluentd and Fluent Bit need contributors from multiple organizations and must demonstrate good project governance and development processes. Most importantly, these projects need several public adopters, so the wider community can be confident it isn’t adopting something that could be abandoned overnight.