Why Streaming Is the Power Grid for AI-Native Data Platforms

To power AI at scale, organizations must adapt to the evolving role of the modern data warehouse. Here's how.
Jun 2nd, 2025 11:00am

AI is transforming every business, and every organization is trying to figure out how they can benefit from or enable AI. Wherever you currently fall on the road map, here are two truths you should know: AI needs context to be useful, and the state of the art for AI is rapidly evolving.

The data warehouse is the linchpin for the first truth: It’s where organizational data sprawl converges, making it the perfect central source of truth for AI agents. In light of the second truth, we also need that platform to be agile and capable of keeping pace with ongoing innovation.

The Data Flywheel

As NVIDIA CEO Jensen Huang stated at last year’s Snowflake Summit, AI lets you extract more value at each phase of the data flywheel.

Diagram of the data flywheel

At its core, the flywheel concept is that people using your products generate usage data; more usage data gives you better insights into how people benefit from your products; and those insights, in turn, help you improve the product, ultimately driving more consumption (and more data). Let’s break down how AI improves your ability to execute each of these phases:

Smarter Insights

AI can improve the data analysis process in several ways. It can automate the boilerplate of writing dashboards and SQL queries. It can process unstructured data — such as bug reports, user feedback and support questions — to identify common trends and patterns. You can also use embeddings to cluster information at scale, surfacing gaps in user journeys and missed opportunities.
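
As a rough sketch of that last idea, the snippet below clusters support tickets by their embeddings to surface common themes. The embed_texts() helper is a placeholder for whatever embedding model or API your platform provides; the clustering step itself uses plain scikit-learn.

```python
# Minimal sketch: cluster support tickets by embedding to surface common themes.
# embed_texts() is a hypothetical helper; swap in your real embedding model.
import numpy as np
from sklearn.cluster import KMeans


def embed_texts(texts):
    # Placeholder: replace with a real embedding model call.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 384))


tickets = [
    "App crashes when exporting a report",
    "Export to CSV fails with a timeout",
    "How do I reset my password?",
    "Password reset email never arrives",
]

vectors = embed_texts(tickets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in sorted(set(labels)):
    members = [t for t, label in zip(tickets, labels) if label == cluster_id]
    print(f"Cluster {cluster_id}: {members}")
```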

Better Products

AI can be integrated into a product suite in many ways: enabling real-time personalization or recommendations, giving users the ability to automate tasks, or using models to summarize historical activity and trends. The important part when integrating AI into a product is to add guardrails, measure performance, and give you and your users the confidence that the agent’s automations are working as intended. Lastly, product and development teams can integrate AI into their workflows to ship and test new features faster, allowing them to move timelines up instead of back.

More Data

AI-enhanced products offer plenty of opportunities to capture and measure engagement, whether by seeing what prompts users send to chatbots or by using AI to reason about data that would have been too tedious to analyze by hand. Features like AI-based automation also let power users lean harder into your product and derive even more value from it. Models can even clean and extract structure from previously unusable data lake sources, giving you more data to work with.

Building AI-Enhanced Products

One does not simply slap a chatbot onto an existing product and call it AI-enhanced. It takes thought and care to identify where AI can add the most value while avoiding the risks that come with this new and nondeterministic technology.

The key thing to remember is that these models are stateless and have no context about the task you’re asking them to complete. You must provide them with all the information and instructions needed to do it. That context must be both accurate and current: Stale information undermines performance, leads to drift and introduces risk in decision-making.
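
A minimal sketch of what that looks like in practice: before prompting a model, assemble the relevant context yourself and filter out anything stale. The fetch_context() and call_model() functions below are hypothetical stand-ins for your retrieval layer and model client.

```python
# Minimal sketch: a stateless model call only knows what we put in the prompt,
# so we filter retrieved context by freshness before including it.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)


def fetch_context(account_id):
    # Placeholder: query your warehouse / vector store here.
    now = datetime.now(timezone.utc)
    return [
        {"text": "Customer reported slow dashboards", "updated_at": now},
        {"text": "Trialed the legacy v1 connector", "updated_at": now - timedelta(days=400)},
    ]


def call_model(prompt):
    return "(model response)"  # Placeholder for your LLM client.


def answer(account_id, question):
    now = datetime.now(timezone.utc)
    fresh = [c["text"] for c in fetch_context(account_id) if now - c["updated_at"] <= MAX_AGE]
    prompt = "Context:\n" + "\n".join(f"- {c}" for c in fresh) + f"\n\nQuestion: {question}"
    return call_model(prompt)


print(answer("acct-42", "What pain points should I mention in outreach?"))
```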

For example, a sales development representative agent would need to know what new features have been launched or what pain points a prospective customer might be experiencing. An AI working on behalf of your support team needs to be able to identify other recent issues, recommend workarounds for common problems and quickly surface relevant historical information about a customer and their usage of the product.

To enable these agents, you need a data warehouse that has all this information available. You need to be able to add data from anywhere in your enterprise to a centralized data bank. Moreover, the information in the warehouse needs to be up to date — nobody wants to get a recommendation for an old version of the product or misidentify a pain point you can help a prospect with.

With more recent AI advancements, the window between insight and action is shrinking. You want to be able to invoke an agent to act on an event as it happens, not when the next batch job runs. Combined with the rapid pace of AI innovation, this means you need a data platform flexible enough to accommodate new features and requirements.

Streaming Is the Key to an Agile Data Platform

Data streaming architecture: The streaming engine is the intermediary between data and object storage

How you build a data platform varies greatly depending on your technology stack, infrastructure provider and industry. However, all data platforms share common patterns, and one in particular is pivotal to iterating quickly: streaming.

Data streaming is the continuous, incremental flow of data emitted to a message bus or write-ahead log (WAL). The primary advantage of adopting a streaming engine is that it decouples the producers (the applications generating events) from the consumers (the receivers of records in the log). That decoupling makes it easy to add or remove sources dynamically, take advantage of your data in real time, surface the latest information to your applications and trigger agents the moment an event takes place.
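
Here is a minimal sketch of that decoupling using the confluent-kafka Python client; the broker address and the user-events topic are illustrative assumptions. The producer knows nothing about who consumes its events, so new consumers can be attached later without touching it.

```python
# Minimal sketch of producer/consumer decoupling with the confluent-kafka client.
# Broker address and topic name ("user-events") are illustrative assumptions.
import json

from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"
TOPIC = "user-events"

# Producer side: the application emitting events knows nothing about consumers.
producer = Producer({"bootstrap.servers": BROKERS})
producer.produce(TOPIC, key="user-123", value=json.dumps({"action": "login"}))
producer.flush()

# Consumer side: new consumers can be added or removed without touching the producer.
consumer = Consumer(
    {
        "bootstrap.servers": BROKERS,
        "group.id": "activity-dashboard",
        "auto.offset.reset": "earliest",
    }
)
consumer.subscribe([TOPIC])

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), json.loads(msg.value()))
consumer.close()
```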

Take a full-text or vector search engine as an example. In these engines, indexing data forces various on-disk structures to be rebuilt (and vector databases additionally require an embedding model to compute a vector for each piece of text), which makes batching operations from a single source much more effective. The replayability of a long-lived stream is also appealing when testing out different embedding models or chunking techniques in your retrieval-augmented generation (RAG) pipelines.
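
A sketch of that pattern: replay the topic in batches and index embeddings in bulk. The embed_batch() helper and the vector index client below are hypothetical placeholders; switching the consumer group (and the model behind embed_batch()) is all it takes to re-index with a different embedding strategy.

```python
# Minimal sketch: replay a long-lived topic in batches and bulk-index embeddings.
# embed_batch() and the index client are hypothetical stand-ins; the batching
# pattern is the point, since index rebuilds are cheapest in bulk.
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        # A fresh group.id replays the stream from the beginning, making it easy
        # to re-index with a different embedding model or chunking strategy.
        "group.id": "reindex-minilm-v2",
        "auto.offset.reset": "earliest",
    }
)
consumer.subscribe(["documents"])


def embed_batch(texts):  # Placeholder for your embedding model.
    return [[0.0] * 384 for _ in texts]


class FakeIndex:  # Placeholder for your vector search engine's client.
    def upsert(self, vectors, payloads):
        print(f"indexed {len(vectors)} documents")


vector_index = FakeIndex()

while True:
    msgs = consumer.consume(num_messages=500, timeout=5.0)
    if not msgs:
        break
    texts = [m.value().decode("utf-8") for m in msgs if m.error() is None]
    vector_index.upsert(embed_batch(texts), payloads=texts)

consumer.close()
```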

Traditionally, keeping streams around that long wouldn’t be feasible, but modern streaming engines can use tiered storage to offload cold data to object storage, meaning you keep full replayability without needing to plumb another data path. All of these auxiliary systems can then become materialized views of the raw event stream.

Another example is capturing changes from your database systems with change data capture (CDC) and writing them into the streaming engine. This lets consumers stream changes from the database as they happen, enabling reactivity to database events. It also helps ensure that auxiliary data systems (such as your full-text/vector search or your analytical database) have a consistent copy of the relational data, without complicating your applications by making them keep all these systems in sync.

CDC streams are generally quite expensive, as they can prevent WAL cleanup or stress the database in ways that regular traffic does not. But landing a single CDC stream and then having a purpose-built system fan the data out to different consumers keeps the platform simple and reliable. Adding new features or synchronizing two systems no longer requires careful capacity planning each time, and you can quickly add reactivity to your application layer.

For example, you may want to invoke an agent to analyze what happened as soon as a user downgrades their account. Doing that via the CDC stream for the user_plans table means the application layer doesn’t have to be rearchitected to let other applications react to these changes.
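
A minimal sketch of that reaction, assuming a Debezium-style change envelope with "before" and "after" images on a user_plans CDC topic (adapt the topic name and parsing to whatever your CDC tool emits):

```python
# Minimal sketch: react to CDC events from a user_plans table.
# The topic name and the {"before": ..., "after": ...} envelope are assumptions.
import json

from confluent_kafka import Consumer

PLAN_RANK = {"free": 0, "pro": 1, "enterprise": 2}


def invoke_agent(change):
    # Placeholder: call your agent framework or workflow engine here.
    print(f"investigating downgrade for user {change['after']['user_id']}")


consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",
        "group.id": "downgrade-watcher",
        "auto.offset.reset": "earliest",
    }
)
consumer.subscribe(["cdc.public.user_plans"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error() is not None:
        continue
    change = json.loads(msg.value())
    before, after = change.get("before"), change.get("after")
    # A downgrade is an update whose new plan ranks below the old one.
    if before and after and PLAN_RANK[after["plan"]] < PLAN_RANK[before["plan"]]:
        invoke_agent(change)
```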

Open Formats = Freedom and Flexibility

Just as streaming was central to the operational use cases above, these same events can be materialized into your data warehouse, giving you fresh, up-to-date information for your analytical queries. The raw event data can be converted into open formats like Apache Iceberg (some streaming engines, such as Redpanda with its Iceberg Topics, can do this directly) or streamed into proprietary formats (such as Snowpipe Streaming in Snowflake). Alternatively, streams can be joined and processed in real time so that data lands in whatever form maximizes its queryability, without expensive batch jobs that constantly reprocess the entire dataset.

Open formats like Apache Iceberg give you freedom and flexibility to choose from a number of different query engines. For example, say you’re a Google Cloud Platform user but use Snowflake as your data warehouse for business analysts and AI teams.

Leveraging Apache Iceberg means that you can keep Snowflake as your primary data warehouse but also enable BigQuery, and all the integrations available for model serving and training, without having to store your data twice. This happens without compromising functionality in either platform, as Apache Iceberg comes with a full ACID transactional model, well-defined schema evolution policies, time-traveling queries and fine-grained access controls through a catalog like Apache Polaris. In proprietary systems, both the data and the metadata for the data warehouse are managed by the vendor.

However, with Iceberg, you have the option to either keep the metadata in proprietary systems or choose an open standard like the Apache Iceberg REST catalog for metadata management.
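As a rough sketch of that open path with pyiceberg, the snippet below loads a table through a REST catalog and appends a micro-batch of events. The catalog URI, namespace and table name are assumptions, and the Arrow schema must match the Iceberg table’s schema.

```python
# Minimal sketch: land a micro-batch of events into an Iceberg table through a
# REST catalog using pyiceberg. URI, namespace and table name are assumptions.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{"type": "rest", "uri": "http://localhost:8181"},
)

table = catalog.load_table("analytics.user_events")

batch = pa.table(
    {
        "user_id": ["user-123", "user-456"],
        "action": ["login", "export_report"],
    }
)

# Because the metadata lives in the open REST catalog, Snowflake, BigQuery or
# any other Iceberg-aware engine can query this table without copying the data.
table.append(batch)
```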

Streaming also lets you transform and join your data while it’s already in motion, saving you from costly reprocessing in large batch jobs. For example, if you have compliance or masking requirements before data lands in long-term storage in the analytical plane of your data platform, you can apply a small stateless transformation to the data on its way into the warehouse.
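
A minimal sketch of such a stateless transform: consume raw events, hash the PII field, and produce to a clean topic that feeds the warehouse. The topic names are illustrative assumptions.

```python
# Minimal sketch of a stateless masking step applied while data is in motion:
# consume raw signups, hash the email field, and produce to a "clean" topic.
import hashlib
import json

from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"

consumer = Consumer(
    {"bootstrap.servers": BROKERS, "group.id": "pii-masker", "auto.offset.reset": "earliest"}
)
consumer.subscribe(["signups.raw"])
producer = Producer({"bootstrap.servers": BROKERS})


def mask(event):
    # Replace the raw email with a stable hash so joins still work downstream.
    event["email"] = hashlib.sha256(event["email"].encode()).hexdigest()
    return event


while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error() is not None:
        continue
    clean = mask(json.loads(msg.value()))
    producer.produce("signups.clean", key=msg.key(), value=json.dumps(clean))
    producer.poll(0)  # Serve delivery callbacks without blocking.
```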

Best Practices and the Future of AI-Native Data Platforms

Using a streaming engine as the power grid of your data platform unlocks a great deal of flexibility and responsiveness, allowing you to take advantage of your data in real time. When you have a loosely coupled system that is moving fast, a few best practices will help keep your systems robust and reliable.

First of all, it’s important to have a schema registry so that when new applications are set up to read from your data streams, they can handle schema evolution safely. These schemas become the contract between teams, just as HTTP-based services have API contracts. When streaming to the data warehouse, keeping your schemas synced between your schema registry and your query engine’s catalog ensures that batch and streaming systems work with a consistent view of the data. Hooking schema changes and publications into your CI/CD pipelines and infrastructure as code (IaC) also helps your engineering teams catch issues during development rather than in staging or production environments.
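
For example, registering an Avro schema with a Confluent-compatible schema registry might look like the sketch below; the registry URL and subject name are assumptions. Once registered, the registry can reject incompatible changes before they break consumers.

```python
# Minimal sketch: register an Avro schema so every consumer shares the same
# contract. Registry URL and subject name are illustrative assumptions.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

registry = SchemaRegistryClient({"url": "http://localhost:8081"})

user_event_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "UserEvent",
      "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "plan", "type": ["null", "string"], "default": null}
      ]
    }
    """,
    schema_type="AVRO",
)

# Registering under the topic's value subject makes the schema the contract
# between the producing and consuming teams.
schema_id = registry.register_schema("user-events-value", user_event_schema)
print(f"registered schema id {schema_id}")
```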

As your organization grows and teams make changes, having mechanisms for lineage can help you track down issues quickly and understand where data is coming from. Following conventions like the OpenTelemetry tracing standard and propagating trace context through record headers is particularly helpful as organizations adopt OpenTelemetry for all their observability data.
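
A sketch of that propagation with the OpenTelemetry Python SDK and the confluent-kafka client: inject the current trace context into record headers on the producer side and extract it on the consumer side. Topic, span and service names are illustrative assumptions.

```python
# Minimal sketch: propagate OpenTelemetry trace context through Kafka record
# headers so a downstream consumer can continue the same trace.
from confluent_kafka import Producer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-service")
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Producer side: inject the current span context into a carrier dict and
# attach it to the record as headers.
with tracer.start_as_current_span("publish-order-event"):
    carrier = {}
    inject(carrier)  # Adds the W3C traceparent header when tracing is configured.
    producer.produce(
        "orders",
        value=b'{"order_id": "o-1"}',
        headers=list(carrier.items()),
    )
    producer.flush()


# Consumer side (sketch): rebuild the context from the headers of a received
# message `msg` and start a child span in the same trace.
def handle(msg):
    ctx = extract({k: v.decode() for k, v in (msg.headers() or [])})
    with tracer.start_as_current_span("index-order-event", context=ctx):
        ...  # Process the record.
```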

While event-driven architectures unlock adaptable, resilient and responsive platforms, they can be seen as expensive. You can keep costs low by using tiered storage to offload cold data to object storage, compression to reduce storage size and bandwidth, efficient formats like Google Protocol Buffers or Apache Avro, and tuned batching to keep the streaming system fast and performant. As with any system, start with good observability and monitor usage as the platform scales.
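
On the producer side, much of that tuning is just client configuration. The sketch below shows the relevant knobs with the confluent-kafka client; the values are illustrative starting points, not recommendations.

```python
# Minimal sketch of cost/throughput tuning knobs on the producer side.
from confluent_kafka import Producer

producer = Producer(
    {
        "bootstrap.servers": "localhost:9092",
        "compression.type": "zstd",   # Shrink payloads on the wire and on disk.
        "linger.ms": 50,              # Wait briefly to build larger batches.
        "batch.size": 512 * 1024,     # Cap each batch at roughly 512 KiB.
    }
)
```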

While the best practices around security and these new AI applications are evolving all the time, the fundamentals for data platforms — such as role-based access control, fine-grained access control lists (ACLs) and the principle of least privilege — apply to both streaming and batch datasets. Take advantage of standards like OpenID Connect for authentication and audit logging to monitor access in a uniform manner across all systems.
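
For the streaming layer itself, recent librdkafka-based clients can authenticate with OpenID Connect over SASL/OAUTHBEARER. The sketch below shows the general shape; the identity provider endpoint, client ID and secret handling are assumptions you would adapt to your environment.

```python
# Minimal sketch: authenticate a consumer to the streaming layer with OpenID
# Connect via SASL/OAUTHBEARER. Endpoint and client ID are illustrative.
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "broker.internal:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "OAUTHBEARER",
        "sasl.oauthbearer.method": "oidc",
        "sasl.oauthbearer.token.endpoint.url": "https://idp.example.com/oauth2/token",
        "sasl.oauthbearer.client.id": "analytics-pipeline",
        "sasl.oauthbearer.client.secret": "<load-from-your-secret-store>",
        "group.id": "analytics-pipeline",
    }
)
```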

Lastly, real-time streaming data platforms enable emerging trends like AI operations (AIOps), allowing data systems to monitor, optimize and react to changes in real time. Without streaming, you’ll be forced to set up periodic jobs, which lengthens the iteration cycle and reduces the amount of training data available for AI agents to handle operational tasks in the platform.

Summary

As technology advances, the gap between insight and action is shrinking. Automated systems can produce and process data instantaneously, and AI is extending that capability to areas that were previously out of reach.

Streaming as the backbone of a data platform enables all of these real-time use cases, and it will only become more important as open source models democratize AI further and the cost of adopting these powerful models decreases. As organizations give their IT systems more and more autonomy, it’s important to ensure your data platform is designed to react in real time and keep up with the pace of innovation.
