Join Your Data at Ingest Time

OLAP (online analytical processing) databases excel at complex data analytics, making them an ideal fit for use cases like analyzing log data. However, they typically aren’t designed for performant JOIN queries. For users coming from backgrounds in relational databases where JOINs are a crucial feature, this may seem like a significant drawback.

If an OLAP database isn’t designed for JOIN statements, how are you supposed to query and compare data across multiple tables? One typical approach for OLAP databases is denormalization, which has significant downsides.

Hydrolix is a real-time data platform designed for sub-second query latency even on trillion-row datasets, but JOIN statements aren’t a focal point in Hydrolix’s design. Instead, Hydrolix uses an approach that’s unique in the industry: the ability to ingest multiple logically-grouped data sources into one table. In this case, “logically-grouped” data sources might be all your CDNs or it could be a set of microservices that you want to monitor together. With this approach, you don’t need JOIN statements at all, and you can quickly get all the data you need without needing to reference and compare multiple tables.

One use case we see at Hydrolix where this approach is especially valuable is multi-CDN monitoring. Media companies typically need to compare CDN performance and then switch out CDNs if performance doesn’t meet certain thresholds. For this use case, it doesn’t make sense to ingest data from each CDN into separate tables, only to JOIN or denormalize that data later. Instead, you can stream all of your CDN sources into one table in real time.

This post will cover:

How OLAP databases typically handle data comparisons using denormalization
One of Hydrolix’s unique features: ingesting multiple sources into a single table

The Typical OLAP Approach: Denormalization

OLAP databases are designed for analyzing very high volumes of data. With OLAP databases, denormalization is a common approach that can make data analysis more efficient for complex queries. That’s the main benefit—improving query performance. It’s a very significant benefit considering that OLAP databases are specifically designed for complex, efficient analytics.

Denormalization can have significant downsides, though:

Higher storage and compute costs: Denormalization results in duplicated data, leading to higher storage costs. The process of duplication also uses additional compute resources, especially when complex transformations are applied through an ELT (extract—load—transform) process.
Additional maintenance and issues with consistency: Because denormalizing typically leads to duplication, you need to ensure that duplicated data remains consistent across tables or you’ll have issues with accuracy.
More complex data pipelines: Data must be shipped from the source table to denormalized tables, requiring a more complex pipeline (once again, think ELT). And some pipelines may create a replica of the data during the denormalization process.
Less query flexibility: With data denormalization, you need to think carefully about what kinds of analytical queries you need to make ahead of time, and then create denormalized tables that fit your specific use case. What if you realize you want to look at your data in a completely different way? You might need to create a new table for that.

Considering these issues, database experts can make a strong argument that performant JOINs are the answer, not denormalization. But Hydrolix has another approach—think of it as joining your data at ingest time.

Join Your Log Data at Ingest Time

With Hydrolix, you can ingest multiple real-time data sources into a single table. And because tables have flexible schema, each data source can have its own schema. The ability to ingest multiple sources into a single table (in effect, joining log data as soon as it’s ingested) is unique to Hydrolix.

What makes this feature so valuable, and what is required of the Hydrolix datastore in order to ingest multiple sources into a single table?

Let’s look at the value of ingesting multiple sources into a single table for the specific use case that Hydrolix is designed for: log data.

Logically Grouping Log Sources Into a Single Table

It’s often very useful to compare log sources in distributed systems. One common use case for Hydrolix users is multi-CDN monitoring, where it’s important to compare the performance of different CDNs to perform tasks like CDN switching. But there are many others, too—you can compare microservices, end user devices, and so on.

Combining these sources at ingest time gives you a number of major benefits:

No need for compute-intensive, complex JOIN statements later. Your log data is already logically grouped together.
Easier to reason about your data. By grouping log sources in one place, you have fewer tables to manage and query, leading to simpler data structures.
No storage bloat due to denormalized data. You don’t need to denormalize your data into additional tables in order to make comparisons, minimizing data redundancy and bloat. Combined with Hydrolix’s high-density compression (20x-50x), the storage footprint of your data in Hydrolix is extremely compact. Note that there is one exception for denormalization: summary tables, which you can use to efficiently query aggregated and regularly-accessed data. Summary tables tend to be very small compared to raw data tables in Hydrolix.

At this point, you might be wondering about the fine print—how does Hydrolix handle very large tables, especially in terms of query performance? And how does a table handle sources with different schema?

Let’s take a look at how Hydrolix’s architecture supports ingesting multiple data sources into a single table.

Flexible Schema

Each data source has an associated transform file which specifies how all of its fields should be mapped in Hydrolix. You can think of it as a write schema for the data source because it provides a set of rules for how incoming data from a source should be handled. You can change this write schema at any time, making it very flexible. The only exception is the primary key (which must be a timestamp), and which must remain consistent across all data sources for the table. This is because Hydrolix is optimized for (and partitions data by) time.

The table itself does not enforce a schema. You can add additional transforms with different write schemas to a table or change a transform to have different columns. And changes you make will not impact any of the existing data in the table.

Couldn’t that lead to a table with extremely high dimensionality, and wouldn’t that impact performance? And how are you supposed to manage all those different columns so your log data doesn’t turn into a data swamp?

Columnar Storage for High Dimensionality

Structurally, Hydrolix has a number of differences from traditional OLAP databases (it stores data in simple lists, not in multi-dimensional cubes or in a relational database). One major commonality, however, is that Hydrolix is a columnar datastore like typical OLAP databases. With columnar storage, tables can have extremely high dimensionality (think thousands of columns), and each row can have a different number of columns (think flexible schema).

In a row-based database, high dimensionality can significantly impact both storage (more space) and query performance (more data in each row means more data needs to be retrieved).

Columnar databases use compression techniques like run length encoding to minimize the amount of space that columns take in storage. They are especially good at compressing repeated values, which are common in log data. For example, the majority of HTTP status codes in your application will typically be a 200 status code—hopefully, anyway! And Hydrolix uses advanced high-density compression techniques to reduce the storage footprint of log data 20x-50x.

Columnar databases are also designed for low-latency queries even for data with high dimensionality. You specify only the columns you need for queries, and because the data is already highly compressed, queries typically need to pull less data from storage. As one of our core engineers describes it, you get any columns you don’t end up using in queries “for free.”

As a result, it’s not an issue to have many different sources contributing to high dimensionality in a table.

Standardizing and Normalizing Data Through Transforms

Let’s move onto the next issue—the fact that data lakes can quickly turn into data swamps. This is an issue with many data lakes because data is typically unstructured when it’s stored. Unstructured data can get messy fast, and you typically need an ELT (extract—load—transform) process to structure that data later. However, ELT processes add complexity to your data pipeline and can be extremely compute-intensive, leading to higher costs and potential resource contention. To make matters worse, ELT processes often lead to yet more replicas, resulting in storage bloat.

With Hydrolix’s transformation process, you can standardize, enrich, and modify streaming data in real time before storage, a process typically known as ETL, or extract—transform—load. This is extremely important for ingesting multiple sources into a single table. Here are just a few examples that show why it’s critical:

Log data coming from different sources will often have different naming conventions. One example is CDN data, where one CDN might log status codes as statusCode while another logs them as status_code. By standardizing the naming convention before your log data is stored, you can much more easily compare data sources.
Log data often doesn’t include basic information about where it came from. This makes sense when the log file remains in the source, but it becomes a problem as soon as the log is sent elsewhere. Without adding contextual information like its original source, the log data would quickly become useless. With Hydrolix’s transform process, you can enrich your data to add context and make your data easier to reason about and query later.

Hydrolix’s transformation process provides some of the major benefits of data warehouses such as structured data for efficient retrieval—all while giving you the benefits of a data lake such as cost-effective storage and flexible schema.

Complex SQL Transformations

One typical use case for JOIN statements is when you want to apply SQL transformations to your data. Hydrolix’s transform process allows for more complex SQL transformations. For example, you can perform computations, combine several columns into a single column, or split the result of a single column into multiple columns. A common use case at Hydrolix is splitting a URL into its different parts. For example, you might want to have a column that has the entire URL, but you might also want to have additional columns for the path, query component, and any query parameters.

Trillion-Row Tables

Ingesting multiple sources into one table may sound good in theory, but what if all those sources are emitting huge volumes of data? This is often the case for large distributed systems, and combining all these sources can lead to very large tables. OLAP-style databases are designed to handle large tables, and they use advanced query techniques like column pruning to quickly retrieve data.

Multi-CDN monitoring is one use case Hydrolix handles really well, and major broadcasters and stream providers are using Hydrolix to ingest terabytes of CDN log data every day, leading to very large tables.

Because Hydrolix is designed specifically for log data, time-filtered queries of a trillion-row table can have sub-second latency due to Hydrolix’s partitioning and query strategies. Data is partitioned by time, and any partitions that don’t include the time range can immediately be pruned from consideration. Partition pruning is just the first query optimization, though—Hydrolix also uses predicate pushdown, sparse indexing of all columns, and high-density compression to provide performant queries even for trillion-row tables. Read more about how Hydrolix achieves low-latency queries.

Join at Ingest and Save Headaches Later

There are many real-world use cases where you might need to join data after it’s already been ingested. But for log data, we’ve found that the best time to “join” data is ingest time. The entire process of ingesting, transforming, and storing data in Hydrolix typically takes about ten seconds, and then it’s ready for analytics. You can detect issues and compare data quickly for use cases like multi-CDN switching, root cause mitigation, and threat hunting. There’s no need for complex join statements just to get a better understanding of all your services. Instead, the data is all there at your fingertips. Just specify a time filter and the columns you need—no headaches, just results.

Next Steps

If you’re not using Hydrolix yet and would like to learn more, contact Hydrolix about a proof of concept or demo.

Join Your Data at Ingest Time

Table of Contents

The Typical OLAP Approach: Denormalization

Join Your Log Data at Ingest Time